Text Analysis

Find tools and information about text mining and analysis.

Text analysis methods

Which method to choose?

There are several methods for analyzing textual data. The choice of method should align with the specific research question. Before deciding on a method, consider what insights you hope to gain and how you want to present your results. Many of the methods outlined below can be used together throughout a research project. For instance, natural language processing techniques can identify pairs of words that occur together in a text, and network analysis can then be used to explore how those words relate to other words. The most commonly used text analysis methods are outlined here.

Word frequency: This is the most basic technique for analyzing textual data.  By counting the frequency of each word or phrase, it is possible to identify the most common words, determine the overall distribution of word usage, and identify patterns in the data.
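
As a rough illustration of word frequency, the sketch below counts words with Python's standard library. The tokenizer (lowercase runs of letters and apostrophes) and the sample sentence are assumptions made for this example; real projects usually need more careful tokenization and stop-word handling.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    # Simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)

# The two most common words and their counts.
print(freqs.most_common(2))  # [('the', 3), ('sat', 2)]
```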

Term Frequency - Inverse Document Frequency (TF-IDF): Term frequency is the same as word frequency: how often a word appears in a document. Inverse document frequency (IDF) measures how common or rare the word is across the entire corpus. TF-IDF combines the two to measure the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general.
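
The following sketch shows one common way to compute TF-IDF by hand. It assumes term frequency is the word count divided by document length and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # tf = count / document length; idf = log(N / df)
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)

# "the" appears in every document, so its IDF (and score) is zero,
# while "mat" appears in only one document and scores highest here.
for word in ("the", "cat", "mat"):
    print(word, round(scores[0][word], 3))
```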

Topic Modeling: A method used to find patterns in text by grouping together words that often appear together. These groups represent different topics, helping us see the main themes in a large set of documents. It also makes it possible to search for similar topics in a large body of text.

Network Analysis: A graphical representation of how words or texts are connected within a document or a corpus.
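
One simple way to build such a network is from word co-occurrence. The sketch below assumes that two words are connected whenever they appear in the same sentence and that the edge weight is the number of shared sentences. The function name and the toy sentences are hypothetical. Dedicated graph libraries offer richer analysis and plotting.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(sentences):
    """Build a weighted, undirected word network from tokenized sentences."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Link every pair of distinct words that share a sentence.
        for a, b in combinations(sorted(set(sentence)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


sentences = [
    ["text", "mining", "tools"],
    ["text", "analysis", "tools"],
    ["network", "analysis"],
]
graph = cooccurrence_network(sentences)

# Degree: how many different words each word is connected to.
degree = {word: len(neighbors) for word, neighbors in graph.items()}
print(degree)
```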

Named Entity Recognition: A technique that identifies and classifies named entities, such as a person, place, or thing, in text.

Sentiment Analysis: Determines whether the text expresses positive, negative, or neutral sentiment.

Collocation: Identifies groups of two or more words that appear close together more often than would be expected by chance.
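
A common way to quantify "more often than expected by chance" is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is a minimal version for adjacent word pairs. The `min_count` threshold, the function name, and the toy text are assumptions for illustration, and other association measures are also widely used.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # very rare pairs give unreliable PMI estimates
        p_pair = count / n_pairs
        p_first = word_counts[first] / n_words
        p_second = word_counts[second] / n_words
        # PMI = log2( P(pair) / (P(first) * P(second)) )
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


tokens = ("new york is big . new york is busy . "
          "the city is big . the city is new .").split()
scores = bigram_pmi(tokens)

# Print pairs from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```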

Concordance: A method for listing every occurrence of a word in a text along with the words that come immediately before and after it. It is used to explore how words are used in different contexts within a text.
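
A concordance is often displayed in keyword-in-context (KWIC) form. The sketch below is one minimal way to produce it, assuming a simple word tokenizer and a fixed window of words on each side. The function name, window size, and sample text are illustrative only.

```python
import re


def concordance(text, keyword, window=3):
    """Return (left context, keyword, right context) for each match."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


sample = ("The bank raised interest rates. "
          "We walked along the river bank at dusk.")

# Align the keyword in a column so the contexts are easy to compare.
for left, word, right in concordance(sample, "bank", window=2):
    print(f"{left:>15} [{word}] {right}")
```

Here the two hits show "bank" used in a financial context and in a geographic one, which is the kind of contrast a concordance is meant to surface.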

n-gram: A contiguous sequence of "n" items, typically words. For example, unigrams (1-grams) are single words, such as "cat" or "dog"; bigrams (2-grams) are pairs of consecutive words, such as "natural language."
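
Extracting n-grams amounts to sliding a window of size n over the list of tokens. The short sketch below shows this, assuming the text has already been split into words. The function name and the sample phrase are illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "natural language processing is fun".split()

print(ngrams(tokens, 1))  # unigrams: one tuple per word
print(ngrams(tokens, 2))  # bigrams: pairs of consecutive words
```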

Resources on the University Libraries GitHub page:

  1. Python for text analysis 
  2. R for text analysis