Whether you're depositing data for publishing or grant requirements, or just want to make the output of your research available to your colleagues, depositing data in a repository or digital archive will ensure your research will be discoverable and usable for a long time. This guide will review some factors to keep in mind when preparing the data for deposit in the Digital Repository Service, or elsewhere.
Selecting a repository for deposit
The DRS can accept most datasets under 1TB, and will accept any file type, which makes it a suitable home for many research outputs. But the DRS may not be the best place for you to store the data! Other professionals in your discipline or subject area may have a preferred repository that is better suited to the data produced by your research. Check out https://fairsharing.org/ to find data repositories in your discipline, or contact your subject librarian for help finding the right repository for you.
Consider a few factors when selecting a repository for deposit:
- Longevity: How sustainable is the repository? How long will it hold the data?
- Audience: Where do others in your discipline go to find data?
- Requirements: Does the system impose limits or restrictions on accepted file types? What is the maximum size allowed for each file? What metadata is required?
Selecting data for deposit
You may be tempted to gather up every file used in the collecting, recording, and processing of the data, but how useful will that be for another researcher accessing the data for their own research? Or, how useful will that be for you five years from now? It's important to carefully think about the data that needs to be archived and how it should be packaged.
Here are a few things to consider when selecting data and data packages:
- What data is important? What is standard in your discipline?
- What data is essential for another researcher to use or reuse the data?
- What data is required for the research to be replicated or validated?
- What information about the data is important for the data to be reused, replicated, or validated?
Here are a few things to avoid when selecting the data:
- Files or packages of files that are not clearly documented in some way
- Working data, data that may be changed, or data that is not ready to be shared for any reason
- Data with personally identifiable information. If retaining personally identifiable information is important to the data set, find a repository designed for securing sensitive data (most are not)
- Data that you do not hold the rights to distribute, or that contains material you do not hold the rights to distribute. If copyrighted material is important to the data, get permission from the copyright holder before depositing.
When preparing the data for archiving or sharing, it's a good practice to use file formats that are open and sustainable. This may not be possible for every file type, but using a recommended format will ensure your file will remain usable for a long time. The Library of Congress keeps a list of recommended file formats for many types of files here: https://www.loc.gov/preservation/resources/rfs/
Regardless of whether or not you use a recommended file format, always check the file type requirements for the deposit system to make sure your chosen file format is allowed.
Many deposit systems set a size limit for file uploads, usually ranging from 1GB to 5GB. These limits are fixed in some systems, but others may accept files that exceed the limit through a mediated process. File size limits for the deposit system may influence how you package your files, or what system you choose for deposit.
DRS users may deposit files up to 1GB. Library staff are available to assist with depositing files larger than 1GB. There are no size limits for individual file downloads, but you should take into consideration how easy or difficult it may be for consumers of the data to download the files. For this reason, we recommend grouping data in packages no larger than 15GB each when depositing in the DRS.
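The size guidance above can be checked before you start an upload. The following is a minimal sketch, not a DRS tool: the function names are hypothetical, and it assumes 1GB means 1,000,000,000 bytes, which the deposit system may define differently.

```python
from pathlib import Path

GB = 1000 ** 3                  # assumption: decimal gigabytes
SELF_DEPOSIT_LIMIT = 1 * GB     # larger files need help from library staff
PACKAGE_TARGET = 15 * GB        # recommended maximum size for one package


def folder_sizes(folder):
    """Map each file under `folder` to its size in bytes."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def review_sizes(sizes):
    """Return (files over the self-deposit limit, total bytes, over package target?)."""
    needs_assistance = sorted(
        name for name, size in sizes.items() if size > SELF_DEPOSIT_LIMIT
    )
    total = sum(sizes.values())
    return needs_assistance, total, total > PACKAGE_TARGET
```

Calling `review_sizes(folder_sizes("my_dataset"))` on a hypothetical folder lists the files that would need a mediated deposit and reports whether the folder as a whole exceeds the recommended package size.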
Using a clear and consistent method for naming your files will help ensure your files can be accessed easily. There are a few general rules to follow when creating a system for naming files:
- Create file names that are descriptive, but brief (fewer than 30 characters)
- Use numbers and letters (either upper or lowercase)
- Avoid using special characters, especially those that may be misinterpreted by an operating system
- Use underscores instead of spaces
- Include a date in the file name formatted as YYYYMMDD (e.g. 20200101)
- Apply the chosen naming system consistently.
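These rules can be checked automatically before deposit. The sketch below is one possible reading of them, not an official validator: `check_file_name` is a hypothetical helper, it counts the extension toward the 30-character limit, and it treats anything other than letters, numbers, and underscores as a special character.

```python
import re
from datetime import datetime

MAX_LENGTH = 30  # "fewer than 30 characters" is read here as length < 30


def _is_valid_date(text):
    """True if `text` is a real calendar date written as YYYYMMDD."""
    try:
        datetime.strptime(text, "%Y%m%d")
        return True
    except ValueError:
        return False


def check_file_name(name):
    """Return a list of problems found in `name`; an empty list means none."""
    problems = []
    stem, dot, _extension = name.rpartition(".")
    if not dot:
        stem = name  # no extension present
    if len(name) >= MAX_LENGTH:
        problems.append("name is 30 characters or longer")
    if " " in stem:
        problems.append("contains spaces; use underscores instead")
    if re.search(r"[^A-Za-z0-9_ ]", stem):
        problems.append("contains special characters")
    # Look for a run of exactly eight digits that is also a valid date.
    candidates = re.findall(r"(?<!\d)\d{8}(?!\d)", stem)
    if not any(_is_valid_date(c) for c in candidates):
        problems.append("no valid YYYYMMDD date found")
    return problems
```

For example, `check_file_name("soil_ph_20200101.csv")` returns an empty list, while `check_file_name("final data (v2).csv")` reports spaces, special characters, and a missing date.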
Like file names, the files themselves should be organized in a clear, consistent manner. Consider:
- Creating a browsable hierarchy or directory structure that uses the same naming conventions as the files
- Sorting files into related groupings, like by experiments or by date
You should consider packaging the data in naturally occurring groupings for the data or research, but also keep in mind how other researchers may expect to access or use the data. If possible, compress the data files into ZIP or TAR packages. Compressing your files will reduce the upload and download sizes, reduce the number of files you will have to deposit and the number of files to be downloaded, and will preserve the desired organization of the files. Compressing also protects your file names. Some systems, including the DRS, will change the file name for every file deposited to avoid collisions between similarly named files in the storage system and when packaging files for bulk downloading (for example, when project A uses data.csv and project B also uses data.csv, storing these files together will cause issues). If the file names are important to accessing the data, compressing data files into a ZIP package will ensure the original file names are retained.
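As an illustration of the packaging step, the sketch below builds a ZIP package from a folder with Python's standard `zipfile` module, keeping the folder structure and original file names inside the archive. The function name and the example folder names are hypothetical; any ZIP or TAR tool will achieve the same result.

```python
import zipfile
from pathlib import Path


def package_folder(folder, archive_path):
    """Compress every file under `folder` into a ZIP package.

    Paths inside the archive start at the folder's own name, so two files
    named data.csv in different subfolders stay distinct. Returns the
    list of names stored in the archive.
    """
    root = Path(folder).resolve()
    archive_path = Path(archive_path).resolve()
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself if it sits inside the folder.
            if path.is_file() and path != archive_path:
                archive.write(path, arcname=path.relative_to(root.parent).as_posix())
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()
```

Packaging a folder `study` that contains `project_a/data.csv` and `project_b/data.csv` stores both files under their original names and paths, so neither is overwritten or renamed when the single ZIP file is deposited.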
Documentation and Description
Documentation is crucial to the reusability and reproducibility of data. If you have documentation that can be shared, like a codebook, those guiding documents should be deposited alongside the data. If you don't have shareable documentation, consider creating a README file that describes and provides context for the data. A README might include:
- General information, like project name, project and data author names, funding sources, and affiliated institutions
- Author or creator contact information
- A description of the data
- A description of the file naming convention
- Information about how to use or interpret the data
- Information about system or tool requirements for using the data (specific software versions or equipment, etc.)
See the box on the right of this page for more information on README files.
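If you are starting a README from scratch, a short script can produce a skeleton covering the items listed above. This is only a sketch: the field keys and the `build_readme` helper are hypothetical, and the layout is one option among many.

```python
# Each pair is (key used in the input dictionary, label written to the README).
README_FIELDS = [
    ("project_name", "Project name"),
    ("authors", "Project and data authors"),
    ("funding", "Funding sources"),
    ("institutions", "Affiliated institutions"),
    ("contact", "Contact information"),
    ("description", "Description of the data"),
    ("naming", "File naming convention"),
    ("usage", "How to use or interpret the data"),
    ("requirements", "System or tool requirements"),
]


def build_readme(info):
    """Return README text, marking any field that has not been supplied."""
    lines = []
    for key, label in README_FIELDS:
        value = info.get(key, "").strip() or "[to be completed]"
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"
```

Writing the result of `build_readme({"project_name": "Example study"})` to a file named README.txt gives you a template with the remaining fields flagged for completion.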
Expect to provide information about each file you deposit, including:
- A descriptive title
- The names of the people responsible for creating the data, including the names of the organizations involved, like the college or Northeastern University generally
- A few keywords
- Date the file was created
- A general description of the data package
- A statement on how the data can be used or reused
- Contact information
Only a title and one keyword are required for deposit into the DRS, but names, dates, and descriptions are highly recommended. Supplying that information will help your data be discovered, and it will help users decide whether or not the data is useful to them.
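If you keep descriptive information in a spreadsheet or script while preparing a deposit, a quick check can flag gaps before you begin. The sketch below assumes a simple dictionary per file; the field names are hypothetical and are not the DRS's own metadata schema.

```python
REQUIRED = ("title", "keywords")                           # required for DRS deposit
RECOMMENDED = ("creators", "date_created", "description")  # highly recommended


def review_metadata(record):
    """Return (missing required fields, missing recommended fields)."""
    missing_required = [field for field in REQUIRED if not record.get(field)]
    missing_recommended = [field for field in RECOMMENDED if not record.get(field)]
    return missing_required, missing_recommended
```

A record with only a title and one keyword passes the required check but is reported as missing all three recommended fields, which is a prompt to add names, dates, and a description before depositing.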
Other things to consider
- If you're depositing to satisfy grant requirements, does the funder have explicit rules for where or how the data should be made available?
- Packaging the data in a useful way can be time consuming, as can navigating the upload process for any system, so be sure to allocate time for this part of the project when estimating project deadlines and work time.
- If the data supports research published in an article, ask your publisher whether you can also deposit a copy of the article in the repository.
- Digital Object Identifiers (DOIs) are unique, permanent IDs for digital material. Library staff can generate DOIs for your research outputs, including data and reports - just ask.
- When sharing the data after deposit in the DRS, be sure to use the permanent URL. Sharing the permanent URL, rather than the system URL, will ensure the link persists and remains active through any future system changes. In the DRS, the permanent URL starts with http://hdl.handle.net. If your data has a DOI, you may share that, as well.
- If you need to store and manage very large sets of data, need a space to work collaboratively on data, or run resource-intensive processes on your data, consider using the Discovery Cluster, supported by Northeastern's Research Computing department: https://rc.northeastern.edu/