MIT Libraries guide to freely available text resources
MIT Libraries has created a detailed guide to freely available resources for text and data mining. Expand the entries in their 'Freely available resources and tools' list to view additional information such as:
- Description/coverage notes
- Means of access
- Access restrictions, if any
- Limitations
- Contact for technical questions
- Additional information links
English-Corpora.org
English-Corpora.org hosts a diverse collection of full-text data sets, from news content to the full text of Wikipedia to soap opera transcripts. Though some content is still under copyright, English-Corpora removes 5% of the text and argues that this process transforms the content and eliminates its market value.
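To make the 5% figure concrete, here is a minimal sketch of how that kind of removal could work. It assumes the text is split into 200-token blocks and the last 10 tokens of each block are replaced with a placeholder. The function name `mask_tokens`, the block size, and the `@` placeholder are illustrative assumptions, not a description of English-Corpora's actual process.

```python
def mask_tokens(tokens, window=200, masked=10, placeholder="@"):
    """Replace the last `masked` tokens of every full `window`-token block.

    Hypothetical illustration only: the window size, the number of masked
    tokens, and the placeholder are assumptions chosen to yield 5% removal.
    """
    out = list(tokens)
    # Walk over complete blocks only; a trailing partial block is left intact.
    for start in range(0, len(out) - window + 1, window):
        for i in range(start + window - masked, start + window):
            out[i] = placeholder
    return out


# A toy "corpus" of 1,000 tokens.
words = [f"w{i}" for i in range(1000)]
masked_words = mask_tokens(words)
removed = masked_words.count("@")
print(f"{removed} of {len(words)} tokens removed ({removed / len(words):.0%})")
```

Keep this in mind when analyzing such data: word frequencies are largely preserved, but any method that depends on long unbroken passages will hit the gaps.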
Downloadable Text Data
This is only a selective list; there are many open access sources of downloadable data. With particular thanks to the Carnegie Mellon Libraries Guide to Text and Data Mining.
- Caselaw Access Project: Based at Harvard University, an amazing and fully downloadable database of 360 years of United States caselaw. Access via API or bulk data download.
- BioMed Central: Over 250,000 full-text, peer-reviewed BioMed Central articles are available for text and data mining.
- Data Repositories: A substantive list of open data repositories across many disciplines.
- English Broadside Ballad Archive: 16th- and 17th-century English broadside ballads, hosted and provided by the University of California at Santa Barbara.
- Figshare: Features content in many file formats, including figures, datasets, media, papers, posters, presentations, and filesets.
- Folger Digital Texts: Shakespeare's plays, sonnets, and poems, downloadable in multiple formats. Provided by the Folger Shakespeare Library.
- Hathi Trust: Public Domain Data: Northeastern has a Hathi Trust membership which gives access to all data, but Hathi Trust also provides a subset of public domain items for any researcher.
- Internet Archive: Over 8 million ebooks and texts in the public domain.
- JSTOR Data for Research: JSTOR provides some freely available data and tools, including those for visualization and bulk downloads. See the FAQ for more information.
- Project Gutenberg: Also see their Terms of Use.
- Public Library of Science (PLOS): Data available via two APIs, one for search (bring content into other web applications), and one for Article-Level Metrics (usage stats, citation counts, social media coverage).
- Text Creation Partnership: 16th-, 17th-, and 18th-century English works, transcribed and encoded by libraries and released to the public domain. Hosted by the University of Oxford; contact the TCP for a bulk download. Includes the new EEBO release.
- University of Oxford Text Archive: Also see their FAQ for download info.
- University of Pennsylvania Online Books: "Listing over 2 million free books on the Web", not necessarily with bulk download features but a good source for textual data nonetheless.