I just wrapped up a client project centred on Natural Language Processing. One of the things I enjoy about working at Elastacloud is the diversity of projects from week to week. One week you may be building models for optimising energy usage, and the next you could find yourself building a model for expanding abbreviations. This keeps you on your toes with technology and gives you a new challenge every week. Back to my reason for today’s post.
The Natural Language Toolkit (NLTK) library in Python is a gem! If you are working on any NLP project using Python, the NLTK library should be your anchor. With over 50 corpora and lexicons, 9 stemmers, and dozens of algorithms and modules for tokenization, stemming, building n-grams, naïve Bayes classification, k-means and EM clustering, and so much more, NLTK is definitely a go-to for anyone working on an NLP project. That said, NLTK has a steep learning curve. If you are looking for a quick, easy-to-learn library instead, I would recommend TextBlob. It is built on the shoulders of NLTK and Pattern and offers features such as sentiment analysis, part-of-speech tagging, noun phrase extraction, etc. It’s pretty easy to use!
Say we have two strings:
With TextBlob, we can do simple things like slice the text, convert it all to lower or upper case, or concatenate both sentences.
One can also break down a block of text into individual sentences or words (tokens) and select a particular token.
One of the problems we had to solve on the project involved working with data in context. In NLP problems, this could mean breaking your sentences/phrases into n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. TextBlob makes it easy to do this.
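Before turning to TextBlob, the definition itself can be sketched in a few lines of plain Python. The helper name `ngrams` and the sample inputs are made up for illustration. The same sliding window works whether the items are words or letters.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    if n <= 0:
        return []
    # Slide a window of width n across the sequence, one step at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level 3-grams: the items are whitespace-separated words.
words = "life is fun and short".split()
word_trigrams = ngrams(words, 3)

# Letter-level 2-grams: the items are the characters of a string.
letter_bigrams = ngrams("fun", 2)

print(word_trigrams)
print(letter_bigrams)
```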
Say we want to break string3 into n-grams where n=3. This takes a single call to the blob's ngrams method, and producing 2-grams is just as easy.
To make it more interesting, we can find out the sentiment of our sentences. The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
This implies that the sentence "Life is fun!" with a polarity of 0.375 is a positive statement and more of a factual statement (objective) than a personal opinion (subjective).
TextBlob has many more capabilities (including spelling correction and translation), which you can find in the TextBlob documentation. The last one I would like to show is the spell check. Given a string (string4), TextBlob can suggest the correct spelling for one of the misspelt words.
Feel free to play with the TextBlob library and its 'big brother', the NLTK library. Have fun doing so!