Research Subject Guides: Text Mining and Analysis: Get Started

What is text mining?

Text mining is a particular kind of data mining, meaning it is research that examines large datasets to identify patterns and relationships.

A text mining project combines two things: a corpus (a dataset of text) and a method (a program, tool, or technique that extracts a specific kind of information about the dataset). At the library, we can assist with both parts of your project.

Common corpora for text mining include newspaper archives, social media posts, and other large collections of "unstructured" text. Common methods include Natural Language Processing (NLP) techniques to identify people, places, topics, and feelings in texts. Advanced text mining is typically conducted by writing small programs in Python or R, but non-coding tools are available too.
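To make the corpus-plus-method idea concrete, here is a minimal sketch in Python using only the standard library. The three-sentence corpus, the stop word list, and the function names are all invented for illustration; a real project would use a much larger corpus and usually a dedicated NLP library. The "method" here is simple word-frequency counting, one of the most basic text mining techniques.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "The city council approved the new library budget.",
    "Library visitors praised the new reading room.",
    "The council will review the budget again next year.",
]

# A tiny hand-picked stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "will", "again", "next"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count how often each non-stop word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in tokenize(doc) if w not in STOP_WORDS)
    return counts


frequencies = word_frequencies(corpus)
# Show the three most frequent words with their counts.
print(frequencies.most_common(3))
```

Even this small example shows the pattern that larger projects follow: load the texts, break them into units (here, words), and summarize those units to reveal patterns.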

Library support in a nutshell

We can provide:

Workshops, class visits, and training for computational text analysis methods
Consultation with library staff on Northeastern-licensed platforms and datasets for mining and analysis
Annual payment for membership in HathiTrust
Negotiation with vendors to include a general text mining provision in licenses for library resources

Unfortunately, we cannot generally provide:

Licensing for individual text mining projects -- generally, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
Additional payments for data to enable text mining -- the library is not funded for this activity
Secure storage for vendor data (this is sometimes possible, but depends on the case, so we cannot guarantee it)
Guarantees on enforcing user behavior and handling of vendor data

Featured Resource: ProQuest TDM Studio

TDM Studio is the text analytics service from ProQuest, which holds one of the largest digital collections of text, including the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard to carry out simple analytics without coding, and a Workbench Dashboard for more complex analysis with Python or R.

Learn more and begin using ProQuest TDM Studio

Featured Resource: Gale Digital Scholar Lab

The Gale Digital Scholar Lab is the text analytics service from Gale, home to many of the largest digital collections of historical materials. Using Gale’s Primary Sources archive, you can build content collections and analyze them. The tools cover document clustering, Ngrams, parts of speech, sentiment analysis, and topic modeling. The Lab also allows users to upload their own document collections for analysis.
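For readers unfamiliar with the term, an Ngram is a sequence of n consecutive words. The short Python sketch below illustrates the general idea by counting bigrams (n = 2) in a sample phrase. It is not how the Gale Digital Scholar Lab implements its tool; the function name and sample text are assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Sample text chosen for illustration; split on whitespace for simplicity.
tokens = "to be or not to be".split()

# Count how often each two-word sequence occurs.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Counting Ngrams rather than single words preserves some word order, which helps surface recurring phrases in a collection.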

Learn more and begin using the Gale Digital Scholar Lab

Note: this resource is not available to London users at this time.

Word of advice

Generally, it's a bad idea to use automated means to scrape or download large amounts of data from any database to which the library subscribes. If a database provider allows text mining, they will want to provide the data to you in a secure manner with which they are comfortable.

Constellate will sunset on July 1, 2025

More information is available in their full announcement. All existing resources will remain available through June 30, including all currently-scheduled classes. (In fact, it's not too late to register for some of these excellent classes!)

Before July 1, 2025, you will want to download your datasets, Constellate lab files, and snapshots of content in your lab. You will still have access to select class and webinar recordings on the Constellate YouTube channel, and notebooks and tutorials in the Constellate GitHub repository.

Contact Information

Librarian