I’m in the midst of unsteadily lurching through my first real Kaggle competition, leaving behind the false comfort of MNIST digit recognition and other “Hello World” style machine learning problems. During this process it very quickly became clear to me just how much of a beginner I am, and that, even with my background in philosophy, I’ve just barely scratched the surface when it comes to Natural Language Processing (NLP) in a machine learning context. This was made abundantly clear when, after looking at some of the leading kernels for inspiration, I found they all practiced some form of NLP text preprocessing and analysis prior to feeding their data into the almighty Algorithm (typically BERT, in this scenario).
After doing a little digging (aka, furious Googling) I got a general sense of the why and how behind preprocessing text for NLP; I’ve outlined some of the more basic elements below, with the code to implement them. These methods are by no means unique, and certainly don’t cover all there is to know about preparing and cleaning text, but hopefully they’ll serve as a jumping-off point for someone looking to start.
A word of warning: this post assumes almost complete ignorance of machine learning and Python. It’s a beginner’s guide in the truest sense, a reference that might be used by someone with some ideas and zero know-how. Anyway, on to some NLP text preprocessing!
Create a home for your processed text
One thing worth mentioning off the bat is to plan on keeping your original text as-is, and placing the cleaned and processed version in another variable (or, if using pandas, another DataFrame column). Not only is this generally good practice in a variety of contexts but it also saves you the hassle of having to reload your data after trying and failing to clean it in-place (not that I would know from experience…).
Remove Null Values
Starting simple. This is specific to working with DataFrames, but it’s good to have in the back of your mind. Find and remove any null (empty) values from your body of text. This makes for cleaner data representation and generally falls under the “best practices” category for data science.
As a note going forward, text refers to whatever body of text or DataFrame column you may be working with. Applying methods to DataFrame columns can be a little bit tricky, and I’ve included a straightforward way of doing so at the end of this post.
text.dropna(axis=0, inplace=True)
dropna(): pandas method for removing any row or column with a null or NaN value
axis=0: selects whether to remove the row (0) or the column (1)
inplace=True: applies this directly to the DataFrame object
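To make the effect of dropna() a bit more concrete, here is a minimal sketch on a made-up DataFrame. The column names id and text are assumptions for the example, not taken from any particular dataset. It uses the subset argument so that only nulls in the text column cause a row to be dropped, and it assigns the result to a new variable instead of using inplace=True, which leaves the original data untouched.

```python
import pandas as pd

# Hypothetical toy data: the column names are made up for illustration
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["First tweet", None, "Third tweet"],
})

# Count the missing values in the text column before removing anything
n_missing = df["text"].isna().sum()

# Drop rows (axis=0) whose text is null; the original df is left unchanged
df_clean = df.dropna(axis=0, subset=["text"])

print(n_missing)                 # number of null entries found
print(df_clean["id"].tolist())   # ids of the rows that survived
```

Checking the null count first is a cheap way to see how much data you are about to throw away.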
Making the text lowercase
Pretty wild, I know, but there’s a good reason for doing this. First, simpler statistical text analysis treats differently cased versions of a word as different words; hence, Blue and blue would be interpreted as two separate words, when really you just need to count the number of times the word appears in any form. Second, it makes downstream text cleaning and processing easier. For a plain Python string, this can be done by simply calling text = text.lower() (for a pandas column, the string methods live under .str, so it would be text.str.lower()).
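Here is a small sketch of the Blue versus blue problem, using only the standard library. The sample sentence is made up, and Counter stands in for whatever frequency analysis you end up doing.

```python
from collections import Counter

# Made-up sample sentence for illustration
text = "Blue skies and blue seas make me feel Blue"

# Counting words as-is treats 'Blue' and 'blue' as different words
cased_counts = Counter(text.split())

# Lowercasing first merges them into a single entry
uncased_counts = Counter(text.lower().split())

print(cased_counts["Blue"], cased_counts["blue"])  # 2 1
print(uncased_counts["blue"])                      # 3
```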
Regular Expressions: Your new best friend
Frankly, I (still) immensely dislike regular expressions. Although I’ve warmed up to them a good bit, we’re just on speaking terms, not yet BFFs (there’s something irritatingly obtuse about "^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$", right?). Honestly, I think most of my frustration comes from my inexperience and general poor skill at implementing them off-the-cuff. Thank goodness for sites like regex101.com, which let you test expressions in real time and have saved me countless hours trying to put these things together. They even offer the option to change the programming language, which can make all the difference when it comes to implementing a regular expression in your code.
The library for regular expressions in Python is re, so remember to add import re to the beginning of your code. The punctuation step below also uses string.punctuation, so add import string as well. Each of the substitutions below searches for a pattern within the input text and replaces it with an empty string. The general approach for this is: re.sub(pattern, '', input_text)
import re
import string

# removes text within square brackets
text = re.sub(r'\[.*?\]', '', text)
# removes hyperlinks
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# removes anything within angle brackets (e.g. HTML tags)
text = re.sub(r'<.*?>+', '', text)
# removes punctuation
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
# removes newline characters
text = re.sub(r'\n', '', text)
# removes words with numbers in them
text = re.sub(r'\w*\d\w*', '', text)
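To see what these substitutions actually do, here is a small worked example that wraps the same six patterns in a function and runs them over a made-up string. The function name strip_noise and the sample text are assumptions for illustration only.

```python
import re
import string

def strip_noise(text):
    """Apply the six substitutions from this section, in the same order."""
    text = re.sub(r"\[.*?\]", "", text)                   # [bracketed] text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)     # hyperlinks
    text = re.sub(r"<.*?>+", "", text)                    # <tags>
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", "", text)                        # newline characters
    text = re.sub(r"\w*\d\w*", "", text)                  # words containing digits
    return text

# Made-up sample text for illustration
sample = "Breaking [video]: storm hits <b>Miami</b>! Details at https://example.com/a1 see p2p"
cleaned = strip_noise(sample)

# The cleaned string still has stray spaces where text was removed,
# so split() is used here just to show the surviving words
print(cleaned.split())
# ['Breaking', 'storm', 'hits', 'Miami', 'Details', 'at', 'see']
```

The order matters: links are removed before punctuation, because once the colons, slashes and dots are stripped out, the hyperlink pattern would no longer match.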
So that’s the essential cleaning out of the way; with this finished up we can start to pull out some of the basic characteristics of the text, using some neat statistical and visual methods, like plotting word frequency, or making Word Clouds and N-grams.
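As a taste of what that looks like, here is a minimal sketch of counting word frequencies and bigrams (N-grams with N = 2) using only the standard library. The sample text is made up and assumed to be already cleaned; plotting and Word Clouds are left out.

```python
from collections import Counter

# Made-up, already-cleaned text for illustration
text = "the storm hit the coast and the storm moved north"
tokens = text.split()

# Word frequency: how often each token appears
word_counts = Counter(tokens)

# Bigrams: pairs of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(word_counts.most_common(2))    # [('the', 3), ('storm', 2)]
print(bigram_counts.most_common(1))  # [(('the', 'storm'), 2)]
```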
Preparing text with NLTK
Time to bring in the big guns. import nltk adds the Natural Language Toolkit, a powerful suite for preparing and manipulating your text. From here we can start getting into the more linguistically complex processes, such as stemming, lemmatization, removing stopwords, and tokenizing.
Stemmin’ and Lemmin’
Stemming is, I think, a somewhat controversial technique to apply and maybe should be used with caution. After doing the leg work on it, I decided it was unnecessary for my particular application but I’ve still included it here for my reference and yours. Stemming is the practice of removing the endings of words to produce the root; here’s an example:
program : program
programs : program
programmed : program
programmer : program
programmers : program
A truly thrilling concept. But there are some pragmatic applications to be found here. For example, if you’re working with a corpus of text that is heavily focused around a particular subject (let’s say, oh I don’t know, ‘programming’), it is probably unsurprising, and not at all useful, to find out that the most common words are ‘program’, ‘programmed’, ‘programming’, etc. Stemming gives you the capacity to group these words together and produce a more concise picture of the rest of the corpus.
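from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['program', 'programs', 'programmed', 'programmer', 'programmers']

# stem each word and collect the results
stemmed_words = []
for w in words:
    stemmed_words.append(stemmer.stem(w))

# exact stems depend on the stemmer; some forms may not reduce all the way to 'program'
print(stemmed_words)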
Lemmatization vs Stemming, and why the difference is important
Lemmatization certainly shares some of the same airspace as stemming, but the two differ in crucial methodological ways. While stemming simply reduces a word by amputating its ending, lemmatization offers a more nuanced approach that relies on a contextual analysis of the text in question. One example of how lemmatization can produce better results is with a word like swam; a simple stemming process may not recognize this as the past tense of swim, and might even truncate it into something meaningless. A lemmatizer would be able to connect the two words and properly associate them with each other.
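The difference is easier to see side by side. What follows is a deliberately toy sketch, not how NLTK implements either technique: the suffix list, the lookup table, and the function names toy_stem and toy_lemmatize are all made up for illustration. The stemmer only chops endings, while the lemmatizer consults a dictionary of known forms first.

```python
# Toy illustration only: real stemmers and lemmatizers are far more thorough.
SUFFIXES = ("ing", "ed", "s")   # hypothetical suffix list
LEMMA_TABLE = {                 # hypothetical dictionary of irregular forms
    "swam": "swim",
    "swum": "swim",
    "mice": "mouse",
}

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary first, then fall back to stemming."""
    return LEMMA_TABLE.get(word, toy_stem(word))

for w in ["swims", "swam", "mice", "jumped"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Both functions turn swims into swim, but only the dictionary-backed one connects swam to swim. In NLTK, the real-world counterpart is WordNetLemmatizer from nltk.stem, which needs the WordNet data downloaded before it will run.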
Tokenization
Tokenization is the process of splitting text so that each word becomes its own individual item; rather than taking in a questionably long string and attempting to parse through the data, tokenization itemizes words in a way that can easily be consumed by your language processing model. Python actually has a built-in tokenizer of sorts: the split() string method. If you’re familiar with it, you’ll know that split() takes in a string and spits out a list based on the delimiter you specified.
While there are tokenizers built into language modeling suites like nltk, I think it’s worth bringing our old friend the Regular Expression back into the mix. While there’s certainly more to unpack in the process below (the re module is very well documented), you can use re.split() to generate a tokenized list of your corpus.
# creates a pattern that can be stored in a variable and reused
# (assumption: split on runs of whitespace; swap in whatever delimiter suits your text)
pattern = re.compile(r'\s+')
# generating the tokenized list of strings
tokenized_text = pattern.split(text)
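The choice of delimiter changes the tokens you get back. Here is a small comparison of str.split() and re.split() on a made-up string; the \W+ pattern (runs of non-word characters) is just one reasonable choice, not necessarily the one you’ll want.

```python
import re

# Made-up sample for illustration
text = "storm hits miami,  details   soon"

# str.split() with no argument splits on any run of whitespace
whitespace_tokens = text.split()

# re.split() can treat runs of non-word characters as the delimiter,
# so the comma is dropped instead of sticking to 'miami'
pattern = re.compile(r"\W+")
regex_tokens = pattern.split(text)

print(whitespace_tokens)  # ['storm', 'hits', 'miami,', 'details', 'soon']
print(regex_tokens)       # ['storm', 'hits', 'miami', 'details', 'soon']
```

One thing to watch for: if the text starts or ends with a delimiter, re.split() returns empty strings at the ends of the list, which you may want to filter out.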
Removing Stopwords
What are they? Should you really stop using them? Stop words are by and large a list of commonly used, not very informative words that crop up frequently in language. They are useful for us humans, as they help us stitch the fabric of language and meaning together, but they themselves don’t necessarily contribute all that much to the conversation. If your goal is to mine data from a text body, or perhaps apply some search functionality, stop words just get in the way. Here’s a Word Cloud of stop words taken from nltk.corpus, to give you a better sense of what we’re working with…
The keen of sight might have picked up on a few patterns appearing in the image. There are definitely some trends to unpack and explore here (for the more philosophically minded), and it’s an ongoing conversation in the field of NLP and Machine Learning as a whole.
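Filtering stop words out of a token list comes down to a membership check. The sketch below uses a tiny hand-picked set so it runs without any downloads; the set and the sample sentence are made up for illustration, and NLTK’s English list is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only
stop_words = {"the", "a", "is", "and", "of", "to", "in"}

# Made-up sample, already lowercased and tokenized
tokens = "the eye of the storm is moving to the north".split()

# Keep only the tokens that are not in the stop-word set
filtered = [w for w in tokens if w not in stop_words]

print(filtered)  # ['eye', 'storm', 'moving', 'north']
```

With NLTK, the set would instead come from stopwords.words('english') in nltk.corpus, after fetching the data once with nltk.download('stopwords'). Using a set rather than a list keeps the lookup fast on large corpora.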
Apply NLP Text Preprocessing to a DataFrame
Here’s an easy way to implement all these ideas at once: make a shiny new text_preprocessing() function! Making this a function allows you to easily reuse it in the future, and also provides a very convenient method for applying these processes to a pandas DataFrame. Here’s an example of such a function using the examples in this post:
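import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def text_preprocessing(text):
    # lowercase, then strip bracketed text, links, tags, punctuation, newlines and words with digits
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)

    # tokenize (assumption: split on runs of whitespace) and drop empty tokens
    pattern = re.compile(r'\s+')
    tokenized_text = [w for w in pattern.split(text) if w]

    # stem each token
    stemmer = PorterStemmer()
    stemmed_text = []
    for w in tokenized_text:
        stemmed_text.append(stemmer.stem(w))

    # remove stop words (needs nltk.download('stopwords') once);
    # stemmed forms may no longer match the list, so consider filtering before stemming
    stop_words = set(stopwords.words('english'))
    clean_text = [w for w in stemmed_text if w not in stop_words]
    return clean_text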
Once you have the text cleaning function defined, it’s a simple matter to apply it to your data in Pandas:
df['cleaned_text'] = df['original_text'].apply(lambda x: text_preprocessing(x))
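For a version you can run without NLTK installed, here is a minimal sketch of the same apply pattern. The stand-in function simple_clean and the sample rows are made up for illustration; the column names follow the example line above. It also drops null rows first, since the cleaning function expects a string.

```python
import re
import string

import pandas as pd

def simple_clean(text):
    """Stand-in for the full preprocessing function: lowercase, strip punctuation, tokenize."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return text.split()

# Made-up data for illustration
df = pd.DataFrame({"original_text": ["Storm hits Miami!", None, "All clear, folks."]})

# Drop null rows first; .copy() gives an independent DataFrame to add columns to
df = df.dropna(subset=["original_text"]).copy()

# The original column is kept as-is; the cleaned tokens go into a new one
df["cleaned_text"] = df["original_text"].apply(simple_clean)

print(df["cleaned_text"].tolist())
# [['storm', 'hits', 'miami'], ['all', 'clear', 'folks']]
```

Passing the function directly to apply() does the same job as wrapping it in a lambda, and the untouched original_text column is right there if the cleaning needs a second attempt.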
That’s about it, a beginner’s beginner guide to getting your text preprocessed for natural language processing. I’ll hopefully be able to dive deeper the more I explore the field, and have some more in-depth content coming soon.