Library support in a nutshell
We can provide:
- This research guide with resources, contact information, and other details
- Consultation with library staff on Northeastern-licensed platforms and datasets for mining and analysis
- Annual payment for membership in HathiTrust
- Negotiation with vendors to include a general text mining provision in licenses for library resources
Unfortunately, we cannot generally provide:
- Licensing for individual text mining projects -- generally, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
- Additional payments for data to enable text mining -- the library is not funded for this activity
- Secure storage for vendor data (this is sometimes possible, but depends on the case, so we cannot guarantee it)
- Guarantees on enforcing user behavior and handling of vendor data
Featured resource: Constellate
Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who brought you JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.
Text & data mining of library subscription databases
This site is intended to support members of the Northeastern University community interested in text and data mining resources.
For researchers interested in using the Libraries’ subscription content for the purposes of text or data mining, here are some things you should know. The process can sometimes be a little arcane, so if you have any questions, please contact us -- we are here to help!
You may need a librarian’s assistance. Many publishers and vendors such as Thomson Reuters, EBSCO, and ProQuest restrict automated data scraping and large-scale access to their journal articles and databases. They believe these restrictions are necessary to protect their copyright, so that end users do not abuse their subscription privileges by mass downloading and sharing with others who have not paid for a subscription. They also want to avoid heavy loads on their systems. We can help you contact publishers in order to gain access.
It might take some time. Text and data mining is sometimes negotiated with publishers on a case-by-case basis. Other University offices may be involved, and it may take weeks or months to work out terms with vendors. Some cases go relatively quickly, but many do not. Be sure to build this extra time into your research schedule.
Negotiations with publishers are mostly handled by the researcher. Librarians can help you make initial contact with vendors, but sometimes the responsibility to communicate with the vendor and acquire the data falls to you, the researcher. Depending on the contract, if you are asked to represent or be liable for the University, you will want to consult with the University Office of General Counsel. In some cases, vendors may ask for an addendum to the existing library license -- we will happily work with you in those cases.
It might cost you money (and resources). Only a few of our current vendor contracts include text or data mining access as part of the general license, and there is often a data preparation fee. We are advocating on researchers' behalf to include text and data mining access in our licenses wherever possible, with minimal additional fees. However, depending on the vendor, one-time access can cost thousands of dollars. Writing expected costs into grant proposals is a good way to ensure access.
You will need to have your own data storage provisions and tools. Depending on the license and level of security needed, the Library's Digital Repository Service may be available. Your license may require storage on a secure server (that is, not your laptop or desktop!). If your license requires high security, you may need to contact Research Computing for guidance, and/or establish access to the University's High Performance Computing Center.
Most text and data mining (TDM) licenses allow access only to Northeastern researchers. If you are working with colleagues from other institutions, it’s likely that each institution will need to seek access to the content through its own library.
All licenses prohibit using the files to create another database or for any commercial purpose.
Most licenses require that you delete the files after your project is completed unless your grant requires that the files be deposited in a repository. You will need to clarify this with the data provider if you have special grant requirements.
For certain types of text mining research, Open Access journals and repositories can be good alternatives. Publishers such as Hindawi, PLoS, and BioMed Central welcome text mining and reuse of their content, as do some institutional and subject repositories such as PubMed Central.
We are continuing to advocate for the inclusion of text mining rights in our licenses, so if there are particular vendors in which you are interested, please let us know.
Adapted from text used by Indiana University - Bloomington Libraries, with thanks to Stacy Konkiel for providing the text.
Word of advice
Generally, it's a bad idea to use automated means to scrape or download large amounts of data from any database to which the library subscribes. If a database provider allows text mining, they will want to provide the data to you in a secure manner with which they are comfortable.