Text Analysis

Find tools and information about text mining and analysis.

Introductory Tools

These easy-to-use tools are a good way to find out whether you want to learn more about text analysis. We are members of HathiTrust, which gives you access to their tools. Voyant is a free browser-based tool, and AntConc is an application available to download. Constellate is a product from JSTOR for which we have a license.

Constellate (USC login required)

A licensed text and data analysis service from Ithaka's JSTOR for learning and performing text analysis, building datasets, and sharing analytics course materials. You can learn and teach text analysis by working with template Jupyter Notebooks to analyze texts from JSTOR and other corpora. It offers free, live workshops for USC students and faculty.

First, if you are a USC Columbia faculty member or student, create a JSTOR login through JSTOR or through the Constellate Login at the top right of the page. If you are off campus, go through the Libraries' web pages and access JSTOR and Constellate through the Databases A-Z page from the main landing page.

After you sign in to JSTOR, you should be able to access the tutorials and Jupyter lab notebooks. See the Constellate Guide for detailed instructions.

The Tutorials link at the top right of the website offers Beginner Python Lessons and Intermediate Text Analysis. The site will help you:

  • Manage the Constellate Lab system
  • Share Files in Constellate
  • Import Data into Constellate
  • Cite a Constellate Dataset
  • Build a Dataset

HathiTrust Research Center (USC login required)

Supported by Indiana University and the University of Illinois at Urbana-Champaign, the HathiTrust Research Center (HTRC) enables computational analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational use of the collection.

As institutional members of HathiTrust, USC Columbia faculty and students have access to the HathiTrust Research Center. To access this portal, you must first log in to HathiTrust using your USC network ID. Once signed in, you can read about the HathiTrust Research Center here. They have freely available text analytics algorithms that you can use on their data by signing up for a Research Center Analytics account. These include extracted feature sets (which include metadata from the many volumes in HathiTrust), Topic Modeling, a Named Entity Recognizer, and Token Count or Word Clouds. Also, try Bookworm.
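
If you are curious what a token count involves, the short Python sketch below shows the general idea using only the standard library. It is an illustration, not the HathiTrust Research Center's own algorithm; the function name token_counts and the sample sentence are made up for this example, and the tokenizing rule (lowercase letters and apostrophes) is a simplifying assumption.

```python
import re
from collections import Counter


def token_counts(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Hypothetical helper for illustration only; real tools use more
    careful tokenization than this simple regular expression.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up sample text, not taken from any HathiTrust volume.
sample = "The whale surfaced. The crew watched the whale."
counts = token_counts(sample)

# Show the two most frequent tokens and how often each appears.
print(counts.most_common(2))  # [('the', 3), ('whale', 2)]
```

Counts like these are the raw material for a word cloud, where more frequent words are drawn larger.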

This wiki page will help you Get Started with the HathiTrust Research Center.

Advanced Text Analysis Tools

These tools will take some learning, but there are people in Research Computing and in the Libraries who can help you get started.

Data Cleaning Tools