What is text mining?
Text mining is a particular kind of data mining, meaning it is research that examines large datasets to identify patterns and relationships.
A text mining project combines two things: a corpus (a dataset of text) and a method (a program, tool, or technique that extracts a specific kind of information about the dataset). At the library, we can assist with both parts of your project.
Common corpora for text mining include newspaper archives, social media posts, and other large collections of "unstructured" text. Common methods include Natural Language Processing (NLP) techniques to identify people, places, topics, and feelings in texts. Advanced text mining is typically conducted by writing small programs in Python or R, but non-coding tools are available too.
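To make the corpus-plus-method idea concrete, here is a minimal sketch in Python using only the standard library. The three-sentence corpus and the function names `tokenize` and `term_frequencies` are illustrative assumptions, not part of any library-licensed platform. A real project would load a much larger licensed dataset and would likely use a dedicated NLP library.

```python
import re
from collections import Counter

# Hypothetical three-document corpus for illustration only;
# a real project would load text from a licensed dataset.
corpus = [
    "The library supports text mining projects.",
    "Text mining finds patterns in large collections of text.",
    "Newspaper archives are a common corpus for text mining.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

frequencies = term_frequencies(corpus)
print(frequencies.most_common(3))
```

Here the list of strings plays the role of the corpus and the word count plays the role of the method. Swapping in a different method (for example, named-entity recognition or sentiment analysis) leaves the corpus side unchanged.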
Library support in a nutshell
We can provide:
- Workshops, class visits, and training for computational text analysis methods
- Consultation with library staff on Northeastern-licensed platforms and datasets for mining and analysis
- Annual payment for membership in HathiTrust
- Negotiation with vendors to include a general text mining provision in licenses for library resources
Unfortunately, we cannot generally provide:
- Licensing for individual text mining projects -- generally, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
- Additional payments for data to enable text mining -- the library is not funded for this activity
- Secure storage for vendor data (this is sometimes possible, but depends on the case, so we cannot guarantee it)
- Guarantees on enforcing user behavior and handling of vendor data
Featured resource: Constellate
Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who brought you JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.
Featured resource: ProQuest TDM Studio
TDM Studio is the text analytics service from ProQuest, which hosts one of the largest digital collections of text, including the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard for carrying out simple analytics without coding and a Workbench Dashboard for more complex analysis with Python or R.
Featured resource: Gale Digital Scholar Lab
The Gale Digital Scholar Lab is the text analytics service from Gale, home to many of the largest digital collections of historical materials. Using Gale’s Primary Sources archive, you can build content collections and analyze them. The tools cover document clustering, Ngrams, parts of speech, sentiment analysis, and topic modeling. The Lab also allows users to upload their own document collections for analysis.
Learn more and begin using the Gale Digital Scholar Lab
Note: this resource is not available to London users at this time.
Word of advice
Generally, it's a bad idea to use automated means to scrape or download large amounts of data from any database to which the library subscribes. If a database provider allows text mining, they will want to provide the data to you in a secure manner with which they are comfortable.