
Day 2: Text Preprocessing for NLP

As part of my #75DaysOfLLM journey, today we dive into text preprocessing. Text preprocessing transforms raw text into clean, structured data for machine analysis. In this post, we’ll explore the steps involved in preprocessing text, from cleaning to tokenization, stopword removal, and more.

Text Cleaning

Text often contains unwanted elements such as HTML tags, punctuation, numbers, and special characters that do not add value. Text cleaning involves removing these elements to reduce noise and focus on meaningful content.

Sample code:

import re

# Sample text
text = "<p>Hello! This is sample text with numbers (1234) and punctuation!!</p>"

# Removing HTML tags and special characters
cleaned_text = re.sub(r'<.*?>', '', text)                # Remove HTML tags
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)  # Remove punctuation/numbers
print(cleaned_text)

Tokenization

Tokenization is the process of breaking text into smaller units, usually words or sentences. This step allows us to work with individual words (tokens) instead of a continuous stream of text. Each token serves as a unit of meaning that the NLP model can understand.

In tokenization, there are two common approaches:

  • Word tokenization: Splits the text into separate words.
  • Sentence tokenization: Splits the text into sentences, which is useful for certain applications such as text summarization (a short sketch follows the word-tokenization example below).

Sample code:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models (one-time download)

# Tokenization
tokens = word_tokenize(cleaned_text)
print(tokens)
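
The list above also mentions sentence tokenization. As a quick illustration, NLTK’s sent_tokenize splits text into sentences (a minimal sketch; the multi-sentence sample string below is my own, not from the post):

from nltk.tokenize import sent_tokenize

# Hypothetical multi-sentence sample for illustration
paragraph = "Tokenization breaks text into units. Word tokenization yields words. Sentence tokenization yields sentences."
sentences = sent_tokenize(paragraph)
print(sentences)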

Stop Word Removal

Stop words are common words (such as “the”, “is”, “and”) that occur frequently but don’t add much meaning. Removing them reduces the size of the dataset and focuses on more important words.

Sample code:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # stop word lists (one-time download)

# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)

Stemming and Lemmatization

Both stemming and lemmatization reduce words to their base form, which helps standardize different forms of the same word (e.g., “running” and “ran”). Stemming is faster but less accurate, while lemmatization returns valid words based on vocabulary.

  • Stemming: Stemming involves truncating word endings to get to the root form, without worrying about whether the resulting word is valid. It is fast, but less accurate.
  • Lemmatization: Lemmatization reduces words to their base form, but uses vocabulary and morphological analysis to return valid words. This makes it more accurate than stemming.

Common algorithms for stemming and lemmatization:

  • Porter Stemmer: One of the most popular stemming algorithms.
  • Lancaster Stemmer: A more aggressive stemmer that may truncate words more drastically (compared with Porter in the sketch after the sample code below).
  • WordNet Lemmatizer: Part of the NLTK library; it uses a dictionary to find the correct lemma of a word.

Sample code:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lemmatizer dictionary (one-time download)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)
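
To see how the two stemmers mentioned above differ, here is a small illustrative comparison (a minimal sketch; the word list is my own, chosen to show Lancaster’s more aggressive truncation):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hypothetical word list for illustration; not from the original post
for word in ["running", "maximum", "friendly", "organization"]:
    print(word, "->", porter.stem(word), "|", lancaster.stem(word))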

Expanding contractions

Contractions are shortened versions of phrases, such as “can’t” for “cannot” or “I’m” for “I am.” Although contractions are natural in everyday language, it is often useful to expand them during preprocessing to keep the text consistent.

Sample code:

import contractions

# Expanding contractions
expanded_text = contractions.fix("I can't do this right now.")
print(expanded_text)

Spell check

Spelling errors in text can affect NLP models. Automatic spell checking tools can detect and correct common spelling errors.

Sample code:

from textblob import TextBlob

# Spell check
text_with_typos = "Ths is an exmple of txt with speling errors."
corrected_text = str(TextBlob(text_with_typos).correct())
print(corrected_text)

Conclusion

Preprocessing is an essential first step in any NLP pipeline. Cleaning, tokenizing, removing stop words, and handling tasks like stemming and spell checking ensure that your text data is ready for analysis. Stay tuned for the next installment of my #75DaysOfLLM challenge as we dive deeper into NLP and language models!
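
As a recap, here is a minimal end-to-end sketch combining the steps from this post (the preprocess helper and its sample input are my own illustration, and it assumes the NLTK punkt, stopwords, and wordnet resources are already downloaded):

import re
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # Illustrative pipeline: expand contractions, clean, tokenize, filter, lemmatize
    text = contractions.fix(text)                         # expand contractions
    text = re.sub(r'<.*?>', '', text)                     # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)               # remove punctuation/numbers
    tokens = word_tokenize(text.lower())                  # tokenize and lowercase
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("I can't believe we're <b>still</b> cleaning text by hand!"))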