Natural Language Processing(NLP) Python

In this tutorial, we will learn about and implement two of the most popular techniques in Natural Language Processing (NLP) using Python: stemming and lemmatization. Python is a popular choice for NLP because of its simple syntax and easy-to-run code.

NLP – Natural Language Processing

Natural Language Processing focuses on making computers understand and process human languages. Computers are great at processing and learning from large amounts of structured data, such as spreadsheets. The language humans use, however, is unstructured, and computers need data in a structured or organized form to understand it.

There exist multiple techniques for NLP, such as Sentiment Analysis, Named Entity Recognition, Stemming, Lemmatization, Bag of words, Term Frequency-Inverse Document Frequency, and Wordcloud.

Stemming in Natural Language Processing(NLP) Python

Stemming is a technique for eliminating affixes from words to obtain their basic form. It’s the same as pruning branches down to the trunk. The stem of the terms eating, eats, and eaten, for example, is eat. Search engines index words using stemming.
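
To make the idea of stripping affixes concrete, here is a deliberately naive sketch in plain Python. It is not the algorithm used by any NLTK stemmer; the suffix list and the name naive_stem are made up for this illustration only.

```python
# Naive suffix stripping: a toy illustration, not the Porter algorithm.
SUFFIXES = ("ing", "en", "es", "s")  # hypothetical, hand-picked suffix list


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["eating", "eats", "eaten"]:
    print(word, "->", naive_stem(word))  # each line ends in "eat"
```

A fixed list like this breaks quickly (for example, it turns "this" into "thi"), which is why real stemmers apply many more rules and conditions.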

Let’s look at another example: after stemming, the words “friendship” and “friendships” are both reduced to “friendship”.

Two well-known stemmers for the English language are available in NLTK: PorterStemmer and LancasterStemmer. Both are ready to use from Python once NLTK is installed.

PorterStemmer

Let’s implement a basic PorterStemmer example and see how it works.

Step 1: Importing the library from NLTK

#importing the PorterStemmer Library
from nltk.stem import PorterStemmer

Step 2: Create a stemmer object and call its stem() method

#basic implementation
# PorterStemmer
porter = PorterStemmer()
print(porter.stem("friendship"))
Output:
friendship

Here the output is still friendship, although we might expect friend. This is because the PorterStemmer algorithm does not rely on linguistic knowledge; instead, it applies a fixed set of suffix-stripping rules in five phases to generate stems.
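
Running the stemmer on a few related word forms shows this rule-based behaviour more clearly. The sketch below assumes NLTK is installed; the word list is our own choice for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Related word forms; the word list is an arbitrary sample.
words = ["friend", "friends", "friendship", "friendships"]
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:12} -> {stem}")
# friend       -> friend
# friends      -> friend
# friendship   -> friendship
# friendships  -> friendship
```

The plural "s" is removed, but the Porter rules do not remove "ship", so friendship and friend end up with different stems.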

Lancaster Stemmer in NLP

Here, let’s implement the code for Lancaster Stemmer and understand how it executes.

The steps are the same as for PorterStemmer: import the class and create a stemmer object.

Step 1: Import the LancasterStemmer class

#importing the LancasterStemmer Library
from nltk.stem import LancasterStemmer

Step 2: Create a stemmer object and call its stem() method

#LancasterStemmer
lancaster=LancasterStemmer()
print(lancaster.stem("friendship"))
Output:
friend

Given the input friendship, the LancasterStemmer produces the output friend, which is the result we expected here. However, LancasterStemmer applies its rules iteratively and stems aggressively, so over-stemming can occur and the resulting stem may not be a meaningful word.
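
To judge how aggressive each stemmer is on your own vocabulary, the two can be run side by side. This is a minimal sketch assuming NLTK is installed; compare_stemmers is a hypothetical helper name and the word list is an arbitrary sample.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()


def compare_stemmers(words):
    """Return {word: (porter_stem, lancaster_stem)} for each input word."""
    return {w: (porter.stem(w), lancaster.stem(w)) for w in words}


# Arbitrary sample words, chosen only for illustration.
sample = ["friendship", "maximum", "presumably", "crying"]
for word, (p_stem, l_stem) in compare_stemmers(sample).items():
    print(f"{word:12} porter={p_stem:12} lancaster={l_stem}")
```

Shorter Lancaster stems in the right-hand column are a sign of the heavier stemming described above.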


There are several other stemmers you can explore yourself; try SnowballStemmer from nltk.stem.
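
As a starting point, here is a short sketch of SnowballStemmer, assuming NLTK is installed. Unlike the two stemmers above, it takes a language name when it is created; the word list is again our own choice.

```python
from nltk.stem import SnowballStemmer

# SnowballStemmer needs a language name; the supported names are listed here.
print(SnowballStemmer.languages)

snowball = SnowballStemmer("english")
for word in ["friendships", "running", "eats"]:
    print(word, "->", snowball.stem(word))
# friendships -> friendship
# running -> run
# eats -> eat
```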

Lemmatization in Natural Language Processing(NLP) Python

Unlike stemming, lemmatization doesn’t chop off the end of the word. Instead of truncating, it looks the word up in a dictionary. It therefore requires a dictionary for the particular language in order to generate output.

This lookup makes lemmatization slower than stemming, but the results are more accurate. If accuracy matters more than speed, lemmatization is the better option.
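
The lookup idea can be sketched in plain Python. This is a toy model only: LEMMA_TABLE and toy_lemmatize are hypothetical names, and the hand-written table merely stands in for a real dictionary such as WordNet.

```python
# Toy lookup table standing in for a real dictionary such as WordNet.
# The entries are hand-written for this example only.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("was", "v"): "be",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Look up the (word, pos) pair; fall back to the word itself if unknown."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("studies"))           # study
print(toy_lemmatize("was", pos="v"))      # be
print(toy_lemmatize("was"))               # was (no noun entry in the table)
print(toy_lemmatize("better", pos="a"))   # good
```

Note that "better" maps to "good", something no suffix-stripping rule could produce; the price is that every result depends on what the dictionary contains.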

Let’s look at the implementation of Lemmatization.

Step 1: Import libraries

#import the libraries and download the WordNet data (only needed once)
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

Step 2: Create the lemmatizer and apply it to a list of words

lemmatizer = WordNetLemmatizer()
words = ['articles', 'friendship', 'studies', 'phones']
for word in words:
    print(lemmatizer.lemmatize(word))
Output:
article
friendship
study
phone


For the words articles, friendship, studies, and phones, the output is article, friendship, study, and phone. Every result is a valid dictionary word, which is not guaranteed with stemming.

Lemmatization can produce different output for different parts of speech (POS), such as verb (v), noun (n), adjective (a), and adverb (r). The default POS value in the lemmatizer is noun, so the output in the example above treats every word as a noun.
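
In practice the POS often comes from a tagger rather than being typed by hand. The sketch below assumes the tags follow the Penn Treebank convention (for example VBD for a past-tense verb), which is an assumption about your tagger; penn_to_wordnet_pos is a hypothetical helper that converts such a tag to one of the four letters above.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # default to noun, matching the lemmatizer's default


for tag in ["VBD", "JJ", "RBR", "NNS"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can then be passed as the pos argument, for example lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag)).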

Let’s try a different POS value: verb (v).

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['be', 'is', 'are', 'were', 'was']

#Changing the pos value to verb (V)
for word in words:
    print(lemmatizer.lemmatize(word, pos='v'))
Output:
be
be
be
be
be

With the POS value set to v, every form of the verb is reduced to “be”.

Comparison between Stemming and Lemmatization

The table below will help you understand the difference between PorterStemmer, LancasterStemmer, and Lemmatization. It highlights the different results generated by each of the three algorithms.

[Table: Comparison of PorterStemmer, LancasterStemmer, and Lemmatization results for Natural Language Processing]

As the table shows, each algorithm produces different results.
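
Because the results depend on the words you pick, a comparison like this can be regenerated with a short script. This is a sketch under assumptions: NLTK is installed, build_table and format_table are hypothetical helper names, and the try/except covers the case where the WordNet data has not been downloaded yet.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer


def build_table(words, columns):
    """Return one row per word: [word, f1(word), f2(word), ...]."""
    return [[w] + [func(w) for func in columns.values()] for w in words]


def format_table(rows, headers):
    """Left-align every cell to the widest entry in its column."""
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    lines = [
        "  ".join(str(cell).ljust(width) for cell, width in zip(row, widths))
        for row in [headers] + rows
    ]
    return "\n".join(lines)


columns = {
    "Porter": PorterStemmer().stem,
    "Lancaster": LancasterStemmer().stem,
    "Lemma (noun)": WordNetLemmatizer().lemmatize,
}
# Same word list as the lemmatizer example above.
words = ["articles", "friendship", "studies", "phones"]

try:
    rows = build_table(words, columns)
    print(format_table(rows, ["Word"] + list(columns)))
except LookupError:
    # Raised by NLTK when the WordNet data has not been downloaded yet.
    print("WordNet data not found; run nltk.download('wordnet') first.")
```

Swapping in your own word list is the quickest way to see where the three approaches agree and where they diverge.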

Summary

To conclude, this tutorial covered two of the most widely used NLP techniques: stemming and lemmatization. Stemming is fast, but it should not be relied on when accuracy is the priority. Lemmatization is slower than stemming, but it generally produces more accurate results.

In upcoming tutorials, we will explore other, more advanced NLP techniques.


We hope this article helps you understand these NLP techniques.
