Text Analysis

Find tools and information about text mining and analysis.

What is text analysis?

Text analysis, also known as text mining or natural language processing (NLP), is a subset of data mining in which computational methods are used to study natural language as unstructured data, that is, data that is not organized in a table or database. It allows us to find hidden patterns, trends, and relationships in a large amount of text. It is sometimes called "distant reading," but you should also do a close reading of a portion of the texts so that you are well informed about the content you are researching.

We analyze texts to answer research questions such as:

  • What are these texts about, and how are they connected?
  • Which texts are similar, or which concepts occur together?
  • What emotions are expressed in the text, or how can the text be classified?
  • Which key names in the text represent entities such as a person, place, or organization?
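
As a rough illustration of the second question, the sketch below counts terms, counts which words appear in the same document, and scores how similar two documents are by word overlap (Jaccard similarity). The three sample sentences, the tiny stopword list, and the function names are all invented for this example; real projects usually rely on a dedicated text analysis library and a much larger corpus.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical three-document corpus, invented for illustration.
docs = [
    "The river flooded the town and the town rebuilt.",
    "The town council funded a new river bridge.",
    "Farmers sold grain at the market.",
]

# A deliberately tiny stopword list (assumption); real lists are far longer.
STOPWORDS = {"the", "and", "a", "at"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(documents):
    """Count every non-stopword across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in tokenize(doc) if w not in STOPWORDS)
    return counts


def cooccurrence(documents):
    """Count how many documents contain each pair of words."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(tokenize(doc)) - STOPWORDS)
        pairs.update(combinations(words, 2))
    return pairs


def jaccard(a, b):
    """Shared vocabulary divided by combined vocabulary, from 0.0 to 1.0."""
    sa = set(tokenize(a)) - STOPWORDS
    sb = set(tokenize(b)) - STOPWORDS
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


print(term_counts(docs).most_common(2))
print(cooccurrence(docs)[("river", "town")])
print(jaccard(docs[0], docs[1]))
```

Word overlap is only one simple notion of similarity; it ignores word order and meaning, which is one reason close reading of a sample remains useful.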

Terminology

Content mining is the overall practice of gathering a large corpus of text, data, or images from various sources into one place and running scripts on it to answer a research question.

  • Data Mining: The use of machine learning and statistical models to uncover hidden patterns in a large volume of data. (Wikipedia)
  • Webscraping (also web harvesting or web data extraction): A method of extracting data from websites. The term typically refers to automated processes implemented using a bot or web crawler.
  • Machine learning (ML): Scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. (Wikipedia)
  • Corpus: A collection of written texts, especially a body of writing on a particular subject. The plural is corpora. (Oxford dictionary)
  • Natural Language Processing (NLP): The process of programming computers to process and analyze large amounts of natural language text.
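
To make the webscraping entry more concrete, here is a minimal sketch of one step a scraper performs: turning a page's HTML into plain text. It uses only Python's standard `html.parser` module. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML string are hypothetical names and data made up for this example; downloading the page is left out, and many projects use a dedicated scraping library instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page content standing in for a downloaded web page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Corpus</h1><p>A collection of <b>written</b> texts.</p>"
    "<script>var x = 1;</script></body></html>"
)

print(html_to_text(sample))
```

Before scraping a real site, check its terms of use and whether the data can be requested directly instead.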

Getting Started

  1. What is your research question?
  2. Pull together a support team: librarians, computer programmers, and statisticians are on campus and here to help.
  3. These projects take time to pull together and carry out. Make sure you have enough of it, especially for preparing the data.
  4. Pull together a machine-readable data set. This could be through:
    1. Scanning and OCRing content
    2. Scraping the web
    3. Requesting licensed data from a company or vendor

Storing and Creating Machine Readable Text

Your data, text, or content must be machine readable and gathered in one place for the algorithms on any platform to work. In other words, image files such as PNG or JPG will not work as they are; they must be OCRed or otherwise converted into text. PDFs may work if the text is searchable within the document. Here are a few tools for converting your data into the proper text format. TXT, TSV, and CSV are the best formats for working with data.
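
One way to get text "together in one place" is to collect a folder of TXT files into a single CSV with one row per document. The sketch below does this with Python's standard library only. The function name `build_corpus_csv`, the column names, and the sample files are assumptions made for illustration; it also assumes the files are UTF-8 and that any images have already been OCRed.

```python
import csv
import tempfile
from pathlib import Path


def build_corpus_csv(folder, out_path):
    """Write one row per .txt file in `folder`; return how many files were read."""
    files = sorted(Path(folder).glob("*.txt"))
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])
        for path in files:
            # Undecodable bytes are replaced rather than stopping the run.
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse line breaks and repeated spaces into single spaces.
            writer.writerow([path.name, " ".join(text.split())])
    return len(files)


# Demo in a throwaway directory so nothing on disk is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "letter1.txt").write_text(
        "Dear  Sir,\nthe harvest failed.", encoding="utf-8"
    )
    Path(tmp, "letter2.txt").write_text("Rain at last.", encoding="utf-8")
    Path(tmp, "scan.png").write_bytes(b"\x89PNG")  # skipped: not a text file
    out = Path(tmp, "corpus.csv")
    n_files = build_corpus_csv(tmp, out)
    with open(out, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(n_files, rows)
```

Keeping the filename in its own column makes it possible to trace any result back to the source document later.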

If your computer can no longer handle the size of the data set, contact DoIT Research Computing about their high performance computing (HPC) options.