Digital Scholarship

Find tools and tutorials for all of your digital scholarship projects.

Introduction and Terminology

Text and Data Mining

Topics addressed on this page include content and text mining as well as topic modeling.

Content Mining is the overall practice of gathering a large corpus of text, data, or images from various sources into one place and running scripts on it to answer a research question.

Text Mining is a research technique using computational analysis to uncover patterns in large text-based data sets. (Wikipedia)
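
As a minimal sketch of what such computational analysis can look like, the Python snippet below counts term frequencies in a short passage. The function name `term_frequencies`, the sample sentence, and the stopword list are illustrative assumptions and do not come from any tool listed on this page.

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each word occurs in text, ignoring case and stopwords."""
    # Lowercase the text, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical sample text; a real project would read a whole corpus from files.
sample = "The war ended. The peace began, and the peace held."
counts = term_frequencies(sample, stopwords={"the", "and"})
print(counts.most_common(1))  # [('peace', 2)]
```

On a real corpus the same counts could be compared across documents or time periods to look for patterns.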


Data Mining is the use of machine-learning and statistical models to uncover hidden patterns in a large volume of data. (Wikipedia)

Web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites. The term typically refers to automated processes implemented using a bot or web crawler.
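
To illustrate the extraction step, here is a small Python sketch that pulls every link out of an HTML page using only the standard library. The class name `LinkCollector` and the sample HTML are assumptions made for illustration; a real scraper would first fetch the page over HTTP and would usually need to handle many more cases.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content standing in for a downloaded web page.
page = '<p><a href="/item/1">One</a> and <a href="/item/2">Two</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/item/1', '/item/2']
```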

 

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. (Wikipedia)

Topic Modeling is a form of text mining used to identify patterns or topics in a large corpus of text. (Blei, 2012)

Tools for Content Mining

Freely Available Corpora for Mining

USC Licensed Content for Mining

Costs are incurred for working with some of this data. Please contact a USC librarian if you are interested.

Adam Matthew. The following collections are available through USC Libraries:
  • American History: Settlement, Commerce, Revolution and Reform, 1493-1859
  • American History: Civil War, Reconstruction and the Modern Era, 1860-1945
  • Slavery, Abolition and Social Justice
If you are interested in data mining any of these collections, contact USC Libraries to initiate the process. 
Cambridge University Press. Cost: negotiated per request. Contact USC Libraries to initiate the process.
Gale Primary Resources. Cost: some content is free; downloading large datasets incurs costs. Contact USC Libraries to initiate the process. Gale Artemis: Primary Sources, which searches across 23 of our Gale primary source databases covering 1500-2012, has a Term Frequency search option and a Term Clusters viewer. To download large datasets, USC Libraries must request the data on your behalf from our Gale sales representative. Requests can take up to 3 weeks to process.
IEEE. Cost: negotiated per request. Contact USC Libraries to initiate the process. The library facilitates access on a case-by-case basis through negotiation of the vendor license.
Newsbank. Cost: costs incurred. Contact USC Libraries to initiate the process. Restrictions are in place; TDM research costs between $6,000 and $8,000, and requests can take 6-8 weeks to process.
Oxford University Press. Cost: costs incurred. Contact USC Libraries to initiate the process. Researchers may use resources for non-commercial text mining. OUP also offers consultation services with technical project managers to assist in planning projects, including "avoidance of any technical safeguard triggers OUP has in place to protect stability and security of website."
ProQuest. Cost: negotiated per request; the TDM Studio platform is also available. Contact USC Libraries to initiate the process. ProQuest allows free text mining of the newspapers for which USC Libraries has purchased perpetual access licenses. USC Libraries will have to request this data on your behalf. TDM Studio (2019 platform) offers select researchers paid access to ProQuest resources, including newspapers, for research.
SpringerLink. Cost: free (with subscription). Users can download subscribed and open access content for TDM purposes directly from the SpringerLink platform. Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool such as curl, wget, or Python's urllib, among others. No API key or other authentication is required. TDM researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
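
The HTTP GET approach described for SpringerLink might be sketched in Python with the standard library's urllib, as below. This is a sketch under stated assumptions: the function names `fetch` and `download_all`, the two-second pause, and the example URLs in the closing comment are hypothetical placeholders, not values published by SpringerLink. What counts as a "reasonable rate" is not specified on this page, so the delay should be adjusted to whatever the provider advises.

```python
import time
import urllib.request


def fetch(url):
    """Retrieve one URL with an HTTP GET request and return the raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_all(urls, delay_seconds=2.0, fetcher=fetch, sleep=time.sleep):
    """Download each URL in turn, pausing between consecutive requests.

    fetcher and sleep are parameters so the pacing logic can be tried out
    without touching the network.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first to limit the rate.
            sleep(delay_seconds)
        results[url] = fetcher(url)
    return results


# Example call (hypothetical URLs, left commented out so nothing is fetched):
# documents = download_all([
#     "https://example.org/article-1.pdf",
#     "https://example.org/article-2.pdf",
# ])
```

Downloading sequentially with a fixed pause is the simplest way to stay considerate; the returned dictionary maps each URL to the bytes retrieved, which can then be written to disk.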

Example Projects

Further Resources