What is text mining?

Text mining is a particular kind of data mining: research that examines large datasets, in this case large collections of text, to identify patterns and relationships.

A text mining project combines two things: a corpus (a dataset of text) and a method (a program, tool, or technique that extracts a specific kind of information about the dataset). At the library, we can assist with both parts of your project.

Common corpora for text mining include newspaper archives, social media posts, and other large collections of "unstructured" text. Common methods include Natural Language Processing (NLP) techniques to identify people, places, topics, and feelings in texts. Advanced text mining is typically conducted by writing small programs in Python or R, but non-coding tools are available too.
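To make the corpus-plus-method idea concrete, here is a minimal sketch in Python that uses only the standard library. It is an illustration rather than a recommended workflow: the three-sentence corpus, the short stopword list, and the function names `tokenize` and `term_frequencies` are all made up for this example. The method shown is simple word counting, one of the most basic text mining techniques. A real project would likely use a much larger corpus and an NLP library.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "The city council approved the new library budget.",
    "Library visitors praised the new reading room.",
    "The council debated the budget for hours.",
]

# A tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "for", "a", "an", "of", "and"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Count how often each non-stopword token appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


frequencies = term_frequencies(corpus)

# Show the three most frequent terms in the corpus.
print(frequencies.most_common(3))
```

Here the list of sentences plays the role of the corpus and `term_frequencies` plays the role of the method. Swapping in a different method, such as one that identifies names or topics, would change what kind of information you extract from the same corpus.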

Library support in a nutshell

We can provide:

  • This research guide with resources, contact information, and other details
  • Workshops, class visits, and training for computational text analysis methods
  • Consultation with library staff on Northeastern-licensed platforms and datasets for mining and analysis
  • Annual payment for membership in HathiTrust
  • Negotiation with vendors to include a general text mining provision in licenses for library resources

Unfortunately, we cannot generally provide:

  • Licensing for individual text mining projects -- generally, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Additional payments for data to enable text mining -- the library is not funded for this activity
  • Secure storage for vendor data (this is sometimes possible, but depends on the case, so we cannot guarantee it)
  • Guarantees about enforcing user behavior or the handling of vendor data

Featured resource: Constellate

Constellate is the text analytics service from the not-for-profit ITHAKA, the same people who brought you JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.

Learn more and begin using Constellate

Featured resource: ProQuest TDM Studio

TDM Studio is the text analytics service from ProQuest, which holds one of the largest digital collections of text, including the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard for carrying out simple analytics without coding and a Workbench Dashboard for more complex analysis with Python or R.

Learn more and begin using ProQuest TDM Studio

Word of advice

Generally, it's a bad idea to use automated means to scrape or download large amounts of data from any database to which the library subscribes. If a database provider allows text mining, they will want to provide the data to you in a secure manner with which they are comfortable.