
Text Analysis

Find tools and information about text mining and analysis.

What is It?

Text analysis is a subset of data mining that uses computational methods to study natural language as unstructured data. It allows you to explore connections between words and topics in large blocks of text that do not fit neatly into a table or database. This research method is increasingly common, but it should not be used in isolation. It is sometimes called distant reading, but you should still do a close reading of a portion of the texts so that you are well informed about the content you are researching.

Terminology

Content Mining is the overall concept of pulling together a large corpus of text, data, or images from various sources into one place and running scripts over it to answer a research question.

  • Text Mining is a research technique using computational analysis to uncover patterns in large text-based data sets. (Wikipedia)
  • Data Mining is the use of machine-learning and statistical models to uncover hidden patterns in large volumes of data. (Wikipedia)
  • Web scraping, also called web harvesting or web data extraction, is data scraping used to extract data from websites. The term typically refers to automated processes implemented with a bot or web crawler.
  • Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. (Wikipedia)
  • Topic Modeling: A form of text mining to identify patterns or topics in a large corpus of text. (Blei, 2012)
  • Natural Language Processing (NLP) is the process of programming computers to process and analyze large amounts of natural language text.

Getting Started

  1. What is your research question?
  2. Pull together a support team: librarians, computer programmers, and statisticians are on campus and here to help.
  3. These projects take time to pull together and perform. Make sure you budget enough of it, especially for preparing the data.
  4. Pull together a machine-readable data set. This could be through:
    1. Scanning and OCRing content
    2. Scraping the web (see the sketch after this list)
    3. Requesting licensed data from a company or vendor.
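
For the web-scraping option above, here is a minimal sketch in Python. It assumes the requests and beautifulsoup4 libraries are installed; the URL and file names are placeholders invented for the example, not a recommended source.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL -- substitute a page you have permission to scrape.
    url = "https://example.com/article"

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop if the page did not load successfully

    # Parse the HTML and keep only the visible paragraph text.
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    # Save the result as plain text (TXT), a machine-readable format.
    with open("article.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(paragraphs))

Before scraping, check the site's terms of service and robots.txt; licensed content usually requires a data request to the company or vendor instead.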

Common Text Analysis Methods

  • Word Frequency: Counting the most common words and themes used throughout the text (see the sketch after this list)
  • Significant Terms or TFIDF (Term Frequency and Inverse Document Frequency): Finding the most significant terms across documents
  • Collocation or Concordances: Finding connections or pairings of words
  • Topic Modeling: Searching for similar topics across many books, a long-running diary, or a whole run of a journal or periodical
  • Sentiment Analysis: Mapping emotions in a body of texts
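
As a rough illustration of the first two methods, here is a short Python sketch. It assumes the scikit-learn package is installed, and the three-sentence "corpus" is invented purely for the example.

    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Three tiny placeholder "documents" standing in for a real corpus.
    documents = [
        "the whale surfaced near the ship",
        "the ship sailed on through the storm",
        "the storm passed and the whale was gone",
    ]

    # Word frequency: count how often each word appears across the corpus.
    word_counts = Counter(word for doc in documents for word in doc.split())
    print(word_counts.most_common(5))

    # TF-IDF: score terms higher when they are frequent in one document
    # but rare across the rest of the corpus.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    print(vectorizer.get_feature_names_out())
    print(tfidf_matrix.toarray().round(2))

Topic modeling and sentiment analysis follow the same basic pattern: build a machine-readable corpus first, then apply the model.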


Storing and Creating Machine Readable Text

Your data, text, or content must be machine readable and gathered in one place for the algorithms in any platform to work. In other words, images such as PNGs or JPGs will not work on their own; they must be OCRed or converted into text. PDFs may work if the text is searchable within the document. Here are a few tools for converting your data into the proper text format. TXT, TSV, or CSV are the best formats for working with data.
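
As a rough sketch of what that conversion can look like in Python: the example below assumes the pytesseract and pypdf packages (plus the Tesseract OCR engine) are installed, and the file names are placeholders for your own scans and PDFs.

    import pytesseract
    from PIL import Image
    from pypdf import PdfReader

    # OCR a scanned page image (PNG or JPG) into plain text.
    ocr_text = pytesseract.image_to_string(Image.open("scan_page1.png"))

    # Extract text from a PDF -- this only works if the PDF has a searchable text layer.
    reader = PdfReader("article.pdf")
    pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Save everything as TXT, one of the best formats for working with text data.
    with open("corpus.txt", "w", encoding="utf-8") as f:
        f.write(ocr_text + "\n" + pdf_text)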

If your computer can no longer handle the size of the data set, contact DoIT Research Computing about their high performance computing (HPC) options.