Datasets licensed for Northeastern use

Northeastern affiliates may use any of the datasets in the licensed data collection in the Digital Repository Service for text and data mining.

 

Text & data mining of library subscription databases

This site is intended to support members of the Northeastern University community interested in text/data mining library-provided subscription resources. If you're planning to use Northeastern's subscription content for the purposes of text or data mining, here are some things you should know. The process can sometimes be a little arcane, so if you have any questions, please contact us -- we are here to help!

  1. You may need a librarian’s assistance. Many publishers and vendors, such as Thomson Reuters, EBSCO, and ProQuest, restrict automated data scraping and large-scale access to their journal articles and databases. They believe doing so is necessary to protect their copyright, so that end users do not abuse their subscription privileges by mass downloading and sharing with others who have not paid for a subscription. They also want to avoid heavy loads on their systems. We can help you contact publishers in order to gain access.

  2. It might take some time. Text and data mining is sometimes negotiated with publishers on a case-by-case basis. Other University offices may be involved, and it may take weeks or months to work out terms with vendors. Some cases go relatively quickly, but many do not. Be sure to build this extra time into your research schedule.

  3. Negotiations with publishers are mostly handled by the researcher. Librarians can help you make initial contact with vendors, but sometimes the responsibility to communicate with the vendor and acquire the data falls to you, the researcher. Depending on the contract, if you are asked to represent or be liable for the University, you will want to consult with the University Office of General Counsel. In some cases, vendors may ask for an addendum to the existing library license -- we will happily work with you in those cases. 

  4. It might cost you money (and resources). Only a few of our current vendor contracts include text- or data-mining access as part of the general license, and there is often a data preparation fee. We are advocating on researchers' behalf to include text and data mining access in our licenses wherever possible, with minimal additional fees. However, depending on the vendor, one-time access can cost thousands of dollars. Writing expected costs into grant proposals is a good way to ensure access.

  5. You will need to have your own data storage provisions and tools. Depending on the license and level of security needed, the Library's Digital Repository Service may be available. Your license may require storage on a secure server (that is, not your laptop or desktop!). If your license requires high security, you may need to contact Research Computing for guidance, and/or establish access to the University's High Performance Computing Center.

  6. Most TDM licenses allow for access only to Northeastern researchers. If you are working with colleagues from other institutions, it’s likely that each institution will need to seek access to the content through their individual libraries.

  7. All licenses prohibit use of the files for any project to create another database or for any commercial uses.

  8. Most licenses require that you delete the files after your project is completed unless your grant requires that the files be deposited in a repository. You will need to clarify this with the data provider if you have special grant requirements.

  9. For certain types of text-mining research, Open Access journals and repositories can be good alternatives. Publishers such as Hindawi, PLoS, and BioMed Central welcome text mining and reuse of their content, as do some institutional and subject repositories such as PubMed Central.

We are continuing to advocate for the inclusion of text mining rights in our licenses, so if there are particular vendors in which you are interested, please let us know.

Adapted from text used by Indiana University - Bloomington Libraries, with thanks to Stacy Konkiel for providing the text.

Featured resource: Constellate

Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who brought you JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.

Learn more and begin using Constellate

Featured resource: ProQuest TDM Studio

TDM Studio is the text analytics service from ProQuest, one of the largest digital collections of text, which includes the historical archives of many of the biggest newspapers. TDM Studio includes both a Visualization Dashboard to carry out simple analytics without coding, and a Workbench Dashboard for more complex analysis with Python or R.

Learn more and begin using ProQuest TDM Studio