MIT Libraries guide to freely available text resources
MIT Libraries has created a detailed guide to freely available resources for text and data mining. Expand the entries in their 'Freely available resources and tools' list to view additional information such as:
- Description/coverage notes
- Means of access
- Access restrictions, if any
- Limitations
- Contact for technical questions
- Additional information links
English-Corpora.org
English-Corpora.org hosts a diverse collection of full-text data sets, from news content to the full text of Wikipedia to soap opera transcripts. Though some content is still under copyright, English-Corpora removes 5% of the text and argues that this transforms the content and eliminates its market value.
Downloadable Text Data
This is only a selective list; there are many open access sources of downloadable data. With particular thanks to the Carnegie Mellon Libraries Guide to Text and Data Mining.
- Caselaw Access Project
  Based at Harvard University, an amazing and fully downloadable database of 360 years of United States caselaw. Access via API or bulk data download.
- BioMed Central
  Over 250,000 full-text, peer-reviewed BioMed Central articles are available for text and data mining.
- Data Repositories
  A substantive list of open data repositories across many disciplines.
- English Broadside Ballad Archive
  16th- and 17th-century English broadside ballads, hosted and provided by the University of California, Santa Barbara.
- Figshare
  Features content in many file formats, including figures, datasets, media, papers, posters, presentations, and filesets.
- Folger Digital Texts
  Shakespeare's plays, sonnets, and poems, downloadable in multiple formats. Provided by the Folger Shakespeare Library.
- Hathi Trust: Public Domain Data
  Northeastern has a Hathi Trust membership, which gives access to all data, but Hathi Trust also provides a subset of public domain items to any researcher.
- Internet Archive
  Over 8 million ebooks and texts in the public domain.
- JSTOR Data for Research
  JSTOR provides some freely available data and tools, including those for visualization and bulk downloads.
- Project Gutenberg
  Also see their Terms of Use.
- Public Library of Science (PLOS)
  Data available via
- Text Creation Partnership
  16th-, 17th-, and 18th-century English works, transcribed and encoded by libraries and released to the public domain. Hosted by the University of Oxford; contact the TCP for a bulk download. Includes the new EEBO release.
- University of Oxford Text Archive
  Also see their FAQ for download info.
- University of Pennsylvania Online Books
  "Listing over 2 million free books on the Web", not necessarily with bulk download features but a good source for textual data nonetheless.