Decoding Language: A Deep Dive into Natural Language Processing and Word Embeddings

Nimantha Gayan

In the previous article, we discussed Large Language Models (LLMs) and how they’re transforming industries by enabling machines to understand and generate human-like language. But how do these models work at their core? The answer lies in Natural Language Processing (NLP).

In this article, we’ll explore the foundational concepts of NLP, how to implement them with Python, and how these techniques tie together to prepare data for real-world projects. By the end, you’ll not only understand the theory but also see these methods in action with a mini-project task.

The Natural Language Toolkit (NLTK) is one of the most comprehensive Python libraries for Natural Language Processing. From rudimentary tasks such as text pre-processing to vectorized representations of text, NLTK’s API covers a wide range of needs.

Why Clean Data is Key: The Importance of Text Preprocessing

Have you ever wondered how computers understand human language? It’s all thanks to a process called text preprocessing. Think of it as cleaning up a messy room before you can organize it.

Why is Text Preprocessing So Important?

Noise Reduction: Imagine trying to understand a conversation with background noise. It’s hard, right? Similarly, text data often has “noise” like punctuation, extra spaces, or irrelevant symbols. Preprocessing removes this noise, making the text clearer and easier to analyze.

Standardizing Words: Words can have different forms (like “run,” “running,” and “ran”) but still mean the same thing. Preprocessing techniques like stemming and lemmatization standardize these variations, making it easier for computers to understand the underlying meaning.

Breaking Down Text: To analyze text, we need to break it down into smaller pieces, like words or phrases. This process is called tokenization. It helps computers focus on the important parts of the text.

Removing Unnecessary Words: Some words, like “the,” “and,” and “is,” are very common but don’t add much meaning. These are called stop words. Removing them can make the analysis more efficient.

Extracting Meaningful Features: Preprocessing helps extract important features from text, such as word frequencies or word relationships. These features are the building blocks for machine learning models.

Reducing Complexity: Text data can be very complex, with a huge variety of words and phrases. Preprocessing techniques like TF-IDF or dimensionality reduction can help simplify this complexity, making it easier for models to learn from the data.

So let's implement these text preprocessing steps using the SMS Spam Collection Dataset.

The data has 5572 rows and 2 columns. You can check the shape of the data using the data.shape attribute. Let’s also check the distribution of the dependent variable between spam and ham.
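Here is a minimal sketch of loading the data, assuming the Kaggle CSV named spam.csv with columns v1 (the label) and v2 (the message text):

```python
import pandas as pd

# Load the SMS Spam Collection dataset. The Kaggle CSV is commonly
# named spam.csv and encoded as latin-1; v1 holds the ham/spam label
# and v2 holds the raw message text.
data = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]

print(data.shape)                 # (5572, 2)
print(data['v1'].value_counts())  # class distribution: ham vs. spam
```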

Punctuation Removal

This step in text processing involves removing all punctuation from the text. Python’s built-in string module provides a pre-defined set of punctuation characters, string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

We remove all punctuation from v2 and store the result in the clean_msg column, as sketched below.
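A minimal sketch of this step, reusing the v2 column from the loading step above:

```python
import string

def remove_punctuation(text):
    # str.maketrans with a third argument maps every listed
    # character to None, so translate() deletes all punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

data['clean_msg'] = data['v2'].apply(remove_punctuation)
```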

Lowering the Text

Converting all text into the same case, preferably lowercase, is one of the most common text preprocessing steps. However, it is not necessary for every NLP problem, as lowercasing can lead to a loss of information for some tasks.

For example, when dealing with a person’s emotions in any project, words written in upper case can signify frustration or excitement.

All the text in the clean_msg column is converted to lowercase and stored in the msg_lower column.
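This step is a one-liner with pandas:

```python
# Convert the punctuation-free messages to lowercase.
data['msg_lower'] = data['clean_msg'].str.lower()
```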

Tokenization

Tokenization in Python is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

The tokens can be words, numbers, or punctuation marks. In tokenization, smaller units are created by locating word boundaries. Wait, what are word boundaries?

These are the ending point of one word and the beginning of the next. Tokens are the first step toward stemming and lemmatization, the next stages in text preprocessing, which we will cover below.

Types of Tokenization in Python

Two simple types of tokenization in Python:

  1. Word Tokenization: Splitting a sentence into individual words.
  2. Sentence Tokenization: Breaking a paragraph into separate sentences.
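A quick sketch of both, using NLTK’s built-in tokenizers (the example sentence is my own):

```python
import nltk
nltk.download('punkt')  # tokenizer models, a one-time download
                        # (newer NLTK versions may ask for 'punkt_tab')

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Tokenization splits text into smaller units."
print(sent_tokenize(text))  # ['NLP is fun.', 'Tokenization splits text into smaller units.']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'Tokenization', ...]
```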

Before processing natural language, we need to identify the words that constitute a string of characters. That’s why tokenization is the most basic step in working with text data: the meaning of a text can only be interpreted by analyzing the words it contains.

Let’s continue the text preprocessing task on the SMS Spam Collection Dataset.

Here we perform word tokenization using a regular expression.

Sentences are tokenized into words.
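A sketch of regex-based word tokenization (the msg_tokenized column name is my own choice):

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores,
    # so whitespace and any leftover symbols act as word boundaries.
    return re.findall(r'\w+', text)

data['msg_tokenized'] = data['msg_lower'].apply(tokenize)
```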

Stop Word Removal

We remove commonly used stopwords from the text because they do not add value to the analysis and carry little or no meaning.

The NLTK library ships with a pre-defined list of English stopwords. Some of them are: [i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, couldn, couldn’t, didn, didn’t]

However, the default stopword list should not be applied blindly; stopwords should be chosen based on the project. For example, ‘how’ can be a stopword for one model but important for another problem, such as analyzing customers’ queries. We can create a customized list of stopwords for different problems.

Stopwords from the NLTK list, such as in, until, to, i, and here, are removed from the tokenized text, and the remaining tokens are stored in the no_stopwords column.
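A minimal sketch of this step:

```python
import nltk
nltk.download('stopwords')  # one-time download

from nltk.corpus import stopwords

# Using a set makes membership checks fast.
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Keep only the tokens that are not in the English stopword list.
    return [word for word in tokens if word not in STOPWORDS]

data['no_stopwords'] = data['msg_tokenized'].apply(remove_stopwords)
```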

Stemming

Stemming is the process of removing the last few characters of a given word to obtain a shorter form, even if that form has no meaning on its own. This step, known as text standardization, stems or reduces words to their root or base form. For example, we stem words like ‘programmer,’ ‘programming,’ and ‘program’ to ‘program.’

Advantages of Stemming

  • Improved model performance: Stemming reduces the number of unique words that need to be processed by an algorithm, which can improve its performance. Additionally, it can also make the algorithm run faster and more efficiently.
  • Grouping similar words: Words with a similar meaning can be grouped together, even if they have distinct forms. This can be a useful technique in tasks such as document classification, where it’s important to identify key topics or themes within a document.
  • Easier to analyze and understand: Since stemming typically reduces the size of the vocabulary, it’s much easier to analyze, compare, and understand texts. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.

Disadvantages of Stemming

  • Overstemming / False positives: This is when a stemming algorithm reduces separate inflected words to the same word stem even though they are not related; for example, the Porter Stemmer algorithm stems “universal”, “university”, and “universe” to the same word stem. Though they are etymologically related, their meanings in the modern day are from widely different domains. Treating them as synonyms will reduce relevance in search results.
  • Understemming / False negatives: This is when a stemming algorithm reduces inflected words to different word stems, but they should be the same. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
  • Language challenges: As the target language’s morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design; For example, an Italian stemmer is more complicated than an English stemmer because there is a higher number of verb inflections. A Russian stemmer is even more complex due to more noun declensions.

However, stemming can cause the root form to lose its meaning or fail to reduce to a proper English word. We will see this in the steps below.
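Here is a minimal sketch of the stemming step using NLTK’s PorterStemmer (the msg_stemmed column name is my own choice):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Reduce each token to its (possibly non-dictionary) stem.
    return [stemmer.stem(word) for word in tokens]

data['msg_stemmed'] = data['no_stopwords'].apply(stem_tokens)
```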

In the output, we can see how some words are reduced to their stems.

Now, let’s see how Lemmatization is different from Stemming.

Lemmatization

Lemmatization means reducing the different forms of a word to one single base form, the lemma, while ensuring that the reduced form does not lose its meaning. Lemmatization relies on a pre-defined dictionary that stores the context of words and checks each word against it while reducing.

Advantages of Lemmatization

  • Accuracy: Lemmatization does not merely cut words off as stemming algorithms do. Words are analyzed based on their part of speech (POS), so context is taken into consideration when producing lemmas. Lemmatization also produces real dictionary words.

Disadvantages of Lemmatization

  • Time-consuming: Compared to stemming, lemmatization is a slow and time-consuming process. This is because lemmatization involves performing morphological analysis and deriving the meaning of words from a dictionary.

Let us now compare the output after stemming with the output after lemmatization:
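A small sketch comparing the two on the examples discussed below. Note that NLTK’s WordNetLemmatizer needs a part-of-speech hint (‘a’ for adjectives, ‘v’ for verbs) to lemmatize such words correctly:

```python
import nltk
nltk.download('wordnet')  # lemmatizer dictionary, a one-time download

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [('crazy', 'a'), ('goes', 'v')]:
    print(f"{word}: stem={stemmer.stem(word)}, "
          f"lemma={lemmatizer.lemmatize(word, pos=pos)}")
# crazy: stem=crazi, lemma=crazy
# goes: stem=goe, lemma=go
```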

The difference between stemming and lemmatization can be seen in the output.

In the first row, ‘crazy’ has been changed to ‘crazi’, which has no meaning, but with lemmatization it remains the same, i.e. ‘crazy’.

In the last row, ‘goes’ has been changed to ‘goe’ by stemming, but lemmatization converts it to ‘go’, which is meaningful.

Conclusion

In this article, we have explored foundational NLP techniques, from tokenization to stemming, lemmatization, and preprocessing. These methods form the backbone of modern NLP workflows, especially for LLMs.

Next Steps: In the next article, we’ll dive into word embeddings, the magical representations that allow models to understand the meaning and context of words.

Start experimenting with these techniques and build your skills step by step. The future of NLP is in your hands!
