What is text mining?

Text mining is a kind of data mining: research that examines large datasets, in this case large collections of text, to identify patterns and relationships.

A text mining project combines two things: a corpus (a dataset of text) and a method (a program, tool, or technique that extracts a specific kind of information about the dataset). At the library, we can assist with both parts of your project.

Common corpora for text mining include newspaper archives, social media posts, and other large collections of "unstructured" text. Common methods include Natural Language Processing (NLP) techniques to identify people, places, topics, and feelings in texts. Advanced text mining is typically conducted by writing small programs in Python or R, but non-coding tools are available too.
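To make the corpus-plus-method idea concrete, here is a minimal sketch in Python using only the standard library. The three-sentence corpus, the short stop word list, and the function names `tokenize` and `word_frequencies` are all made up for illustration. A real project would use a much larger corpus and would likely rely on an NLP library for tasks such as identifying people, places, or feelings.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "The library hosts a workshop on text mining.",
    "Text mining finds patterns in large collections of text.",
    "Newspaper archives are a common corpus for text mining.",
]

# A short, hand-picked stop word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"a", "the", "on", "in", "of", "are", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count how often each non-stop word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts


# The "method" (word counting) applied to the "corpus" (the list above).
frequencies = word_frequencies(corpus)
print(frequencies.most_common(3))
```

Word counting is about the simplest method there is, but the same pattern of loading a corpus and then applying a method to it carries over to more advanced techniques such as topic modeling or sentiment analysis.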

Workshop: Introduction to Data Cleaning in OpenRefine

Date: Wednesday, August 7, 2024
Time: 1:30pm - 3:00pm ET
Location: online

Do you ever get annoyed with a big spreadsheet that isn’t quite formatted correctly for your needs? Find yourself repeating simple tasks over and over? OpenRefine might be the way to simplify and speed up your data cleaning, especially if you are working with text data. This ninety-minute hands-on online workshop will teach you how to install OpenRefine, set up a new project, and use a few of its most useful features. At the end, we’ll demonstrate some advanced features, including integration with Wikidata, as inspiration for future projects. Sample data will be provided, but feel free to bring your own dataset too.

Register here!

Library support in a nutshell

We can provide:

  • Workshops, class visits, and training for computational text analysis methods
  • Consultation with library staff on Northeastern-licensed platforms and datasets for mining and analysis
  • Annual payment for membership in HathiTrust
  • Negotiation with vendors to include a general text mining provision in licenses for library resources

Unfortunately, we cannot generally provide:

  • Licensing for individual text mining projects -- the researcher usually needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Additional payments for data to enable text mining -- the library is not funded for this activity
  • Secure storage for vendor data (this is sometimes possible, but depends on the case, so we cannot guarantee it)
  • Guarantees on enforcing user behavior and handling of vendor data

Featured Resource: Constellate

Constellate is the text analytics service from the not-for-profit ITHAKA, the same organization that brought you JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.

Learn more and begin using Constellate

Featured Resource: ProQuest TDM Studio

TDM Studio is the text analytics service from ProQuest, which holds one of the largest digital collections of text, including the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard to carry out simple analytics without coding and a Workbench Dashboard for more complex analysis with Python or R.

Learn more and begin using ProQuest TDM Studio

Featured Resource: Gale Digital Scholar Lab

The Gale Digital Scholar Lab is the text analytics service from Gale, home to many of the largest digital collections of historical materials. Using Gale’s Primary Sources archive, you can build content collections and analyze them. The tools cover document clustering, Ngrams, parts of speech, sentiment analysis, and topic modeling. The Lab also allows users to upload their own document collections for analysis.

Learn more and begin using the Gale Digital Scholar Lab

Note: this resource is not available to London users at this time.

Word of advice

Generally, it's a bad idea to use automated means to scrape or download large amounts of data from any database to which the library subscribes. If a database provider allows text mining, they will want to provide the data to you in a secure manner with which they are comfortable.