Text analysis, text mining, or natural language processing (NLP) is a subset of data mining in which computational methods are used to study natural language as unstructured data, meaning data that is not organized in a table or database. It allows us to find hidden patterns, trends, and relationships in a large amount of text. It is sometimes called "distant reading," but you should also do a close reading of a portion of the texts so that you are well informed about the content you are researching.
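As a small illustration of finding patterns in text, the following Python sketch counts the most frequent words in a passage. It is only a minimal example, not a recommended workflow: the function name word_frequencies, the sample sentence, and the simple tokenizing rule are all assumptions made for illustration, and a real project would usually also remove common words such as "the".

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most common lowercase word tokens in text."""
    # Simple tokenizer (an assumption): runs of letters, with an optional
    # internal apostrophe so that words like "don't" stay together.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top_n)


# Hypothetical sample text, used only to show the call.
sample = "The whale surfaced. The crew watched the whale dive again."
print(word_frequencies(sample, top_n=2))  # [('the', 3), ('whale', 2)]
```

Counts like these are a starting point for distant reading; reading the passages where a frequent word appears is still needed to interpret them.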
We analyze texts to answer research questions, such as:
Content mining is the overall process of gathering a large corpus of text, data, or images from various sources into one place and running scripts on it to answer a research question.
Your data, text, or content must be machine readable and gathered in one place for the algorithms in any platform to work. In other words, images such as PNG or JPG files will not work as they are; they must first be run through optical character recognition (OCR) or otherwise converted into text. PDFs may work if the text is searchable within the document. Here are a few tools for converting your data into the proper text format. TXT, TSV, or CSV are the best formats for working with data.
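Once your texts are machine readable, one way to get them "together in one place" is to combine them into a single CSV file. The sketch below uses only the Python standard library and assumes your documents are already plain TXT files in one folder. The function name build_corpus_csv, the folder name my_texts, and the column names filename and text are hypothetical choices for illustration.

```python
import csv
from pathlib import Path


def build_corpus_csv(source_dir, output_csv):
    """Write one CSV row per .txt file in source_dir; return the row count."""
    rows = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])  # header row (assumed names)
        # Sort the paths so the row order is the same on every run.
        for path in sorted(Path(source_dir).glob("*.txt")):
            # errors="replace" substitutes a placeholder for undecodable
            # bytes instead of stopping on the first badly encoded file.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([path.name, text])
            rows += 1
    return rows


# Example call (hypothetical folder and file names):
# count = build_corpus_csv("my_texts", "corpus.csv")
# print(f"Wrote {count} documents to corpus.csv")
```

Keeping the filename in its own column makes it possible to trace any result back to the original document for close reading.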
If your computer can no longer handle the size of the data set, contact DoIT Research Computing about their high-performance computing (HPC) options.