Sign up for the Digital Research and Collections Insights Listserv and check out the latest on scholarly communication and digital scholarship on our blog.
Text analysis is a subset of data mining that uses computational methods to study natural language as unstructured data. It lets you explore connections between words and topics across large blocks of text that are not organized in a table or database. This research method is increasingly common, but it should not be used on its own. It is sometimes called Distant Reading, and you should pair it with a close reading of a portion of the texts so that you are well informed about the content you are researching.
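As a rough illustration of what "connections between words" can mean in practice, the Python sketch below counts word frequencies and how often two words appear in the same document. The sample sentences, the function names, and the simple tokenizing rule are all assumptions made for this example; a real project would likely use a dedicated text analysis platform or library.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents):
    """Count how many times each word occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def cooccurrences(documents):
    """Count how many documents each unordered pair of words shares."""
    pairs = Counter()
    for doc in documents:
        # Sorting makes each pair appear in one consistent order.
        unique_words = sorted(set(tokenize(doc)))
        pairs.update(combinations(unique_words, 2))
    return pairs


# Hypothetical sample texts, standing in for a real corpus.
sample_documents = [
    "The whale surfaced near the ship.",
    "The ship followed the whale at dawn.",
    "Dawn broke over a calm sea.",
]

freqs = word_frequencies(sample_documents)
pairs = cooccurrences(sample_documents)

print(freqs.most_common(3))
print(pairs[("ship", "whale")])
```

Counts like these only show that words travel together, not why, which is where a close reading of the passages involved comes in.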
Content Mining is the overall concept of gathering a large corpus of text, data, or images from various sources into one place and running scripts on it to answer a research question.
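A minimal sketch of that workflow in Python might look like the following. It assumes the corpus has already been saved as plain .txt files in a single folder; the function names, the folder layout, and the search term are hypothetical choices for illustration, not part of any particular platform.

```python
import re
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file under `folder` into a {relative path: text} dictionary."""
    root = Path(folder)
    corpus = {}
    for path in sorted(root.rglob("*.txt")):
        name = path.relative_to(root).as_posix()
        corpus[name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus


def count_term(corpus, term):
    """Return {file: number of whole-word matches of `term`}, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in corpus.items()}


if __name__ == "__main__":
    # Build a tiny throwaway corpus so the example runs anywhere.
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "chapter1.txt").write_text("The whale. The Whale!", encoding="utf-8")
        Path(folder, "chapter2.txt").write_text("A quiet harbor.", encoding="utf-8")
        corpus = load_corpus(folder)
        print(count_term(corpus, "whale"))
```

Keeping the loading step separate from the counting step means the same corpus can be reused for several research questions without rereading the files.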
Your data, text, or content must be machine readable and gathered in one place for the algorithms in any platform to work. In other words, image files such as PNG or JPG will not work as they are; they must first be converted into text with optical character recognition (OCR). PDFs may work if the text is searchable within the document. Here are a few tools for converting your data into the proper text format. TXT, TSV, and CSV are the best formats for working with data.
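Before converting anything, it can help to take stock of which files are already usable. The Python sketch below sorts file names by extension into rough groups and saves the result as a CSV inventory. The grouping rules, status labels, and function names are assumptions for this example, and the script only looks at extensions: it cannot tell whether a given PDF actually contains searchable text, so those files are flagged for a manual check.

```python
import csv
from pathlib import Path

# Extension groups based on the guidance above (an illustrative, not exhaustive, list).
TEXT_READY = {".txt", ".tsv", ".csv"}
NEEDS_OCR = {".png", ".jpg", ".jpeg"}
CHECK_MANUALLY = {".pdf"}


def triage(paths):
    """Group file names by how much work they need before text analysis."""
    groups = {"ready": [], "needs_ocr": [], "check_searchable": [], "other": []}
    for p in paths:
        suffix = Path(p).suffix.lower()
        if suffix in TEXT_READY:
            groups["ready"].append(str(p))
        elif suffix in NEEDS_OCR:
            groups["needs_ocr"].append(str(p))
        elif suffix in CHECK_MANUALLY:
            groups["check_searchable"].append(str(p))
        else:
            groups["other"].append(str(p))
    return groups


def write_inventory(groups, out_path):
    """Save the triage result as a two-column CSV: file, status."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "status"])
        for status, files in groups.items():
            for name in files:
                writer.writerow([name, status])


if __name__ == "__main__":
    # Hypothetical file names for demonstration.
    example = ["letters.txt", "scan_001.JPG", "report.pdf", "notes.docx"]
    print(triage(example))
```

The inventory gives you a checklist of what still needs OCR or conversion before the corpus is ready.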
If your computer can no longer handle the size of the data set, contact DoIT Research Computing about their high performance computing (HPC) options.