How do platforms for text mining work?

Usually, text mining requires writing custom code to analyze a corpus that the researcher has collected on their own computer. Platforms allow researchers to mine licensed text collections without needing to download them. Often, these platforms are the only way to text mine copyrighted texts or licensed historical materials.

Most platforms offer two ways of interacting with texts: a "no code" option using their pre-built tools, and a specialized coding environment for more customized analysis.
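To illustrate what the custom-code path looks like, here is a minimal sketch in generic Python (not tied to any particular platform's API) of a basic analysis that pre-built tools also offer: counting word frequencies across a small corpus.

```python
from collections import Counter
import re

def word_frequencies(documents):
    """Count word occurrences across a list of document strings."""
    counts = Counter()
    for doc in documents:
        # Lowercase and extract letter runs for a rough tokenization
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

# A toy two-document corpus standing in for a licensed collection
corpus = [
    "The Times reported the news.",
    "News of the day, reported daily.",
]
freqs = word_frequencies(corpus)
```

In a platform workbench, `corpus` would instead be loaded from the licensed collection; the analysis code itself is ordinary Python.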

Featured Resource: Gale Digital Scholar Lab

The Gale Digital Scholar Lab is the text analytics service from Gale, home to many of the largest digital collections of historical materials. Using Gale’s Primary Sources archive, you can build content collections and analyze them. The tools cover document clustering, Ngrams, parts of speech, sentiment analysis, and topic modeling. The Lab also allows users to upload their own document collections for analysis.
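The Ngrams tool, for example, counts sequences of n consecutive words. A short standard-library sketch of the underlying idea (my own illustration, not Gale's implementation):

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-word sequences (n-grams) in a text string."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

bigrams = ngram_counts("to be or not to be", n=2)
```

Plotting how such counts change across a dated corpus is what produces the familiar n-gram trend charts.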

Learn more and begin using the Gale Digital Scholar Lab

Note: this resource is not available to London users at this time.

Featured Resource: ProQuest TDM Studio

TDM Studio is the text analytics service from ProQuest, which hosts some of the largest digital text collections, including the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard for carrying out simple analytics without coding, and a Workbench Dashboard for more complex analysis with Python or R.

Learn more and begin using ProQuest TDM Studio

Comparison of Northeastern's licensed text mining platforms

|  | ProQuest TDM Studio | Gale Digital Scholar Lab | HathiTrust Research Center |
| --- | --- | --- | --- |
| texts available | ProQuest content: news, magazines, journals, dissertations, congressional hearings, and more | Gale subscriptions: 17th- to 19th-century newspapers and books, The Times, other British archives | 18+ million books (Google Books and more) |
| built-in tools | "visualization": geographic map, topic modelling, sentiment analysis | "analyze": document clustering, named entity recognition, topic modelling, sentiment analysis; "visualize": word frequency | "algorithms": named entity recognition, topic modelling |
| custom code | "workbench": Python and R notebooks | local Python notebooks to work with the outputs of "analyze" modules | "data capsule": command-line programming |
| dataset size | 10,000 docs for visualizations; 2,000,000 for workbench | 10,000 documents | 50,000 documents (larger on request) |

Constellate has sunset as of July 1, 2025

More information is available in their full announcement. You will still have access to select class and webinar recordings on the Constellate YouTube channel, and notebooks and tutorials in the Constellate GitHub repository.

The HathiTrust Research Center will sunset in 2026

More information is available in their FAQ. There are no changes to the HathiTrust Digital Library collection, which will remain available. By the end of 2026, HathiTrust will discontinue funding for the HathiTrust Research Center. Some of the data analysis services offered through HTRC today will continue in some form, whether directly through the HathiTrust Digital Library, via a HathiTrust partner, or through another independent entity; HathiTrust has not yet announced which specific HTRC services will be discontinued.