How do platforms for text mining work?
Usually, text mining requires writing custom code to analyze a corpus that the researcher has collected on their own computer. Platforms allow researchers to mine licensed text collections without needing to download them. Often, these platforms are the only way to text mine copyrighted texts or licensed historical materials.
Most platforms offer two ways of interacting with texts: a "no code" option using their pre-built tools, and a specialized coding environment for more customized analysis.
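To give a sense of what "custom code" can mean in practice, here is a minimal sketch of a word-frequency count in Python. It is only an illustration: the three-sentence `corpus` and the function names `tokenize` and `word_frequencies` are made up for this example and are not part of any platform's starter code. Each platform supplies its own way of loading a dataset.

```python
import re
from collections import Counter

# Hypothetical stand-in for a dataset; on a real platform the texts
# would come from the platform's own dataset loader.
corpus = {
    "doc1": "The whale surfaced near the ship.",
    "doc2": "The ship sailed on; the whale followed the ship.",
    "doc3": "No whale was seen that day.",
}


def tokenize(text):
    """Lowercase the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(corpus)
print(frequencies.most_common(3))
```

A real analysis would usually also remove common words such as "the" before counting, which the "no code" tools often handle automatically.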
Comparison of Northeastern's licensed text mining platforms
| | ProQuest TDM Studio | Constellate | Gale Digital Scholar Lab | HathiTrust Research Center |
---|---|---|---|---|
texts available | ProQuest contents: news, magazines, journals, dissertations, congressional hearings, and more | JSTOR, Portico, Chronicling America, Doc South, South Asia Open Archives, Reveal Digital | Gale subscriptions: 17thC to 19thC newspapers and books, The Times, other British archives | 18+ million books (Google Books and more) |
built-in tools | “visualization”: geographic map, topic modelling, sentiment analysis | “trends”: key phrases, document categories | “analyze”: document clustering, named entity recognition, topic modelling, sentiment analysis | “visualize”: word frequency; “algorithms”: named entity recognition, topic modelling |
custom code | “workbench”: Python and R notebooks | “lab”: Python and R notebooks, with starter code | local Python notebooks to work with the outputs of "analyze" modules | "data capsule": command line programming |
dataset size | 10,000 docs for visualizations; 2,000,000 for workbench | 50,000 documents (larger on request) | 10,000 documents | 50,000 documents (larger on request) |
Constellate will sunset on July 1, 2025
You can read their full announcement here. All existing resources will remain available through June 30, including all currently-scheduled classes. (In fact, it's not too late to register for these excellent classes here!)
Before July 1, 2025, you will want to download your datasets, Constellate lab files, and snapshots of content in your lab. You will still have access to select class and webinar recordings on the Constellate YouTube channel, and notebooks and tutorials in the Constellate GitHub repository.
The HathiTrust Research Center will sunset in 2026
You can read their FAQ here. There are no changes to the HathiTrust Digital Library collection, which will remain available. By the end of 2026, HathiTrust will discontinue funding for the HathiTrust Research Center. Some of the data analysis services offered through HTRC today will continue in some form, whether directly through the HathiTrust Digital Library, via a HathiTrust partner, or through another independent entity. No specific changes have been announced regarding which HTRC services will be discontinued.