Digital Scholarship

Find tools and tutorials for all of your digital scholarship projects.

Content Mining

Introduction and Terminology

Topics on this page include content and text mining as well as topic modeling.

Content Mining is the practice of gathering a large corpus of text, data, or images from various sources into one place and running scripts on it to answer a research question.

  • Text Mining is a research technique using computational analysis to uncover patterns in large text-based data sets. (Wikipedia)
  • Data Mining is the use of machine-learning and statistical models to uncover hidden patterns in a large volume of data. (Wikipedia)
  • Web scraping (also called web harvesting or web data extraction) is data scraping used to extract data from web sites. The term typically refers to automated processes implemented with a bot or web crawler.
  • Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. (Wikipedia)
  • Topic Modeling: A form of text mining to identify patterns or topics in a large corpus of text. (Blei, 2012)
  • Natural Language Processing (NLP) is the process of programming computers to process and analyze large amounts of natural language text.
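
To make the text mining definition above more concrete, here is a minimal sketch in Python that counts term frequencies across a tiny corpus, one of the simplest ways to surface patterns in text. The corpus, the stop-word list, and the function names (`tokenize`, `term_frequencies`) are all hypothetical and invented for illustration; a real project would load its own documents and would likely use a fuller stop-word list or a dedicated NLP library.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
CORPUS = [
    "Text mining uncovers patterns in large collections of text.",
    "Topic modeling groups words that tend to appear together in text.",
    "Web scraping collects text from web sites for later mining.",
]

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"in", "of", "that", "to", "from", "for", "the", "a"}


def tokenize(document):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", document.lower())


def term_frequencies(corpus):
    """Count how often each non-stop-word token occurs across the corpus."""
    counts = Counter()
    for document in corpus:
        counts.update(t for t in tokenize(document) if t not in STOP_WORDS)
    return counts


if __name__ == "__main__":
    # Print the three most frequent terms and their counts.
    for term, count in term_frequencies(CORPUS).most_common(3):
        print(term, count)
```

Raw counts like these are only a starting point. Topic modeling and other machine-learning methods build on the same tokenized input to find groups of words that tend to occur together.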

Tools for Content Mining

 

Freely Available Corpora for Mining

USC Licensed Content for Mining

Costs are incurred for working with some of this data. Please contact a USC librarian if you are interested.

Tutorials and Information

Example Projects

Further Resources