Purpose of this Policy
Data is generated and used in many ways by the Software Preservation Network (SPN), and all members of the community and affiliates are responsible for using and safeguarding that data appropriately. This policy establishes uniform data management standards for SPN so that the integrity of the data is preserved and the data serves the needs of the organization efficiently and effectively.
Scope of this Policy
This policy applies to all data gathered by members of the Software Preservation Network for the purposes of meeting the organization’s mission. This policy does not apply to ancillary data that is not related to a research question of importance to the organization.
This policy should be followed as closely as possible within the limitations imposed by funders, institutional review boards (IRBs), and other external groups. Any limitations that prevent a specific project from following this policy should be documented by the people in charge of data management oversight for SPN.
Guiding Principles
- Procedures must be in place for how to treat data, how to keep it secure, and how to ensure that it remains usable and understandable after the initial research is done.
- Data should be as openly available and reusable as it can be. This helps with transparency and allows the community to troubleshoot and fact-check findings.
- The location and details of all data will be recorded in a common public place so that SPN members and others can find it. If data is moved, its location will be updated on the list.
Data Management Oversight
To ensure this policy is followed, two Data Stewards will be assigned by the SPN Steering Committee. These co-leads will review the policy every year, review the technology and services used, and work with working groups and members to ensure the data management policy is followed.
Definitions
Data management
Data are items of recorded information considered collectively for reference or analysis. This may include, but is not limited to, “lab books, survey responses, software and code, measurements, images, audio recording, video, and physical samples.”
Data Management describes the organization, storage, preservation, and sharing of data collected and used in a research project.
Raw Data
Raw data refers to the data as it was collected, before being cleaned up, worked on, or sanitized. Data exports from surveys are a typical example.
Working Data
This term refers to data that is actively being cleaned up, re-worked, or manipulated in some way.
Sanitized Data
The term sanitized data describes data that has had personally identifiable information (PII) removed. In the broader literature this is also called de-identified data or anonymized data. This policy uses the term sanitized data because PII is not always the only information that should not be made public. Sanitized data refers to data from which all non-public information has been removed.
Secure location
A location that SPN owns and maintains that is only accessible to SPN members who are actively working on the data.
Data Stewards
The term refers to the two people appointed by SPN to make sure this policy is followed.
Data Ethics and Legal Compliance
SPN recognizes that some of the data gathered and used by its members may contain personal or sensitive information. The Data Stewards are responsible for considering ethical and legal issues around how the data is collected, stored, made available, and retained, with the SPN Steering Committee being ultimately accountable. Managing ethical concerns includes anonymizing data; referring projects to the appropriate ethics committees where data is being collected by a member institution; and obtaining formal consent agreements that allow data to be shared and reused.
Acquiring Data
There are two methods of acquiring data. The first is gathering it through a survey or a similar process, and the second is finding data that is already available and repurposing it. Data from both of these efforts is subject to this policy: repurposed data should be treated just like data the organization has gathered. Both repurposed data and gathered data need a citation in a “readme.txt” file stored along with the data. The citation should be recorded following ICPSR’s data citation recommendations:
- Author
- Title
- Distributor
- Date
- Version
- Persistent Identifier (such as a Digital Object Identifier (DOI), Uniform Resource Name (URN), or Handle System identifier)
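The six citation elements above can be kept together as one record and written into the “readme.txt” file as a single line. The following Python sketch is only an illustration: the class name, the method name, and the ordering and punctuation of the output are assumptions made for this example, not a format prescribed by ICPSR or by this policy.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Holds the six citation elements listed in this policy."""
    author: str
    title: str
    distributor: str
    date: str
    version: str
    persistent_identifier: str

    def to_readme_line(self) -> str:
        # One possible ordering of the elements; adjust to the style
        # the working group has agreed on.
        return (
            f"{self.author} ({self.date}). {self.title}. "
            f"Version {self.version}. {self.distributor}. "
            f"{self.persistent_identifier}"
        )
```

Creating a `DataCitation` for each data set and pasting the result of `to_readme_line()` into the readme keeps every citation complete, because a missing element raises an error when the record is created.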
Metadata for Data
Metadata is defined as “data about data”. The goal of metadata is to record all the information about the data that would help it be reconcilable and useful in the future. The following should be recorded for each set of data:
- Context: Project history, aim, objectives, and hypotheses
- Data collection methods: Sampling, data collection process, instruments used, hardware and software used, scale and resolution, temporal and geographic coverage and secondary data sources used
- Structure: How is it organized, the relationship between files
- Data validation: Checking, proofing, cleaning and quality assurance procedures carried out
- Changes: Changes made to data over time, identification of different versions of data files
- Rights: Information on access and use conditions or data confidentiality
This information should be included in the “readme.txt” file that is with each set of data.
Organization of Data
File naming
File names are organizational tools that allow identification of files. Files should be named consistently and identify specific projects the file belongs to. File names can include:
- Project name or experiment name or acronym
- Date or date range of experiment
- Type of data
- Conditions
- The version number of files
- The three-letter file extension for application specific files.
The naming format and any abbreviations or codes used should be included in the “readme.txt” file that is with each set of data.
When naming files, follow these rules:
- Keep names as short as possible. Operating systems limit the length of a file path, and nested folders count toward that limit. Make sure the full path of the file does not exceed 248 characters.
- Number files with leading zeros (e.g. 001, 002, 003 instead of 1, 2, 3).
- Avoid special characters (e.g. ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ ” | ).
- Do not use spaces in the name. Instead use underscores (file_name.xxx), dashes (file-name.xxx), or camel case (fileName.xxx).
An example of a file name pattern that meets these criteria:
(organization)_(working group)_(survey type)_(number)_(type of data: raw, working, or sanitized).(extension)
For example: spn_workinggroup_Survey01_001_sanitized.csv, with a “readme.txt” file explaining what spn means and what Survey01 was.
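The naming pattern and rules in this section can be checked automatically before a file is saved. The Python sketch below is a minimal illustration, not an official SPN tool: the function names, the three-digit zero padding, and the decision to allow only letters, digits, underscores, and dashes are assumptions chosen to match the example file name above.

```python
import re

MAX_PATH_LENGTH = 248  # path limit stated in this policy
DATA_TYPES = {"raw", "working", "sanitized"}
# Letters, digits, underscores, and dashes, then one dot and the extension.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")


def build_file_name(organization, working_group, survey, number, data_type, extension):
    """Assemble a file name following the pattern in this section."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"data_type must be one of {sorted(DATA_TYPES)}")
    # Zero-pad the sequence number to three digits (001, 002, ...).
    name = f"{organization}_{working_group}_{survey}_{number:03d}_{data_type}.{extension}"
    if not SAFE_NAME.match(name):
        raise ValueError(f"name contains spaces or special characters: {name!r}")
    return name


def path_is_short_enough(path):
    """Return True if the full path stays within the 248-character limit."""
    return len(path) <= MAX_PATH_LENGTH


print(build_file_name("spn", "workinggroup", "Survey01", 1, "sanitized", "csv"))
```

Building names from parts, rather than typing them by hand, keeps the order of the elements and the zero padding consistent across a project.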
File formats
First preference: non-proprietary formats.
Second preference: formats that have an open standard, even if they are proprietary.
Save the final copy of the working data and sanitized data set in any of these preferred formats:
- Containers: TAR, GZIP, ZIP
- Databases: XML, CSV, SQLite
- Geospatial: SHP, DBF, GeoTIFF, NetCDF
- Moving images: MOV, MPEG, MXF
- Sounds: WAVE, AIFF, MP3, MXF
- Statistics: UTF-8, DTA, POR
- Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
- Tabular data: CSV
- Text: XML, PDF/A, HTML, UTF-8
- Web archive: WARC
Storage of Data
The goal of the storage of research data for SPN is to have a known and heavily used data archive with versions of all of the SPN community’s key data preserved and curated for future use.
While data is being collected, the raw data should be exported and saved in CSV format (if applicable) in a secure location. If the data contains any identifying information such as emails, names, or other identifiers, it should be stored in a location that only the direct researchers can access. These exports should be saved at regular intervals during data collection to limit the damage from data loss or system issues.
Once the data collection has ended, the last version of the raw data can be saved, and the others deleted. This would be the first version of the data.
The group working on the data can edit the data and save it in different versions, making a record of the kinds of changes that were made in the data’s applicable “readme.txt” file. All the edited versions are working data. After the data cleanup and work is done, the group should also seek to sanitize the data of personally identifiable information and other information that should not be made public and make the data available to the SPN community as soon as possible in another version (again with notes about what was done to the data included in the data set’s “readme.txt” file). This final set of data is the sanitized data set.
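For tabular data, one part of the sanitizing step described above is dropping the columns that hold non-public information. The Python sketch below shows that single step using only the standard library. The function name and the column names in the usage note are hypothetical, and which columns count as non-public is a decision for the group working on the data.

```python
import csv


def sanitize_csv(source_path, target_path, non_public_columns):
    """Copy a CSV file, leaving out the columns named in non_public_columns.

    Returns the list of column names that were kept.
    """
    with open(source_path, newline="", encoding="utf-8") as source:
        reader = csv.DictReader(source)
        kept = [name for name in (reader.fieldnames or []) if name not in non_public_columns]
        with open(target_path, "w", newline="", encoding="utf-8") as target:
            # extrasaction="ignore" silently skips the dropped columns.
            writer = csv.DictWriter(target, fieldnames=kept, extrasaction="ignore")
            writer.writeheader()
            for row in reader:
                writer.writerow(row)
    return kept
```

A call such as `sanitize_csv("raw.csv", "sanitized.csv", {"name", "email"})` writes a new file and leaves the source untouched. Dropping columns does not remove identifying details typed into free-text answers, so those still need a manual review, and the columns removed should be noted in the “readme.txt” file.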
Once the project is complete, the sanitized data and the readme.txt file for the data will be made public, with links and citations included in the SPN document on data (see the Sharing of Data section). The Data Stewards will be responsible for maintaining these links and for editing the data. The public copy should not be the only copy of the data. An exact copy of the sanitized data should be kept in a secure storage area. A secure storage area can be online or offline, as long as access can be controlled. A final copy of the raw data and a final version of the working data should also be kept in the same location, clearly labeled.
Sharing of Data
Once the data is clean and ready to be made public, deposit the data in Zenodo. The data should be kept public for as long as possible and left in Zenodo for as long as the service still meets SPN needs. The Data Stewards should be primarily responsible for maintaining these files.
Data Retention Policy
All versions of the data should be retained for the period required by any funders or partners (and in the case of multiple stakeholders, the longest retention period required). In the absence of such a requirement, all data should be retained for 5 years after the completion of the project. A final copy of the raw data, the last version of the working data, and a copy of the sanitized data will remain in an SPN-owned secure location. All exceptions to the retention policy should be documented along with the data.
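The retention rule in the paragraph above is simple date arithmetic: use the longest period any funder or partner requires, or 5 years when none applies. The Python sketch below illustrates that calculation. The function name and parameters are assumptions made for this example, and it only handles retention periods given in whole years.

```python
from datetime import date

DEFAULT_RETENTION_YEARS = 5  # used when no funder or partner sets a period


def retention_end(project_completed, required_years=()):
    """Return the date on which the retention period ends.

    required_years holds the retention periods, in years, required by
    funders or partners; the longest one applies.
    """
    years = max(required_years, default=DEFAULT_RETENTION_YEARS)
    try:
        return project_completed.replace(year=project_completed.year + years)
    except ValueError:
        # A February 29 completion date has no match in a non-leap
        # year, so fall back to February 28.
        return project_completed.replace(year=project_completed.year + years, day=28)
```

Recording the resulting date next to each data set gives the Data Stewards a concrete value to check during the yearly review.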
The Data Stewards should review all data gathered by SPN every year to check on retention policies, public copies, etc.
Sanitized data should be kept in a public location for as long as possible. Even when other versions of the data have been deleted, any sanitized data that was made public should be kept in a public archive.
Once the data has reached the end of its retention period, the Data Stewards will ask whether anyone in the group still has a use for the data. If no one is interested in the data, the Data Stewards can look for an institutional data archive willing to host it. The copy in the SPN archive can then be deleted, leaving only the public copy available.
If someone within the organization says they are interested, then the Data Stewards can set a new custom retention policy.
Appendix A: Readme.txt file template
This template is adapted from Cornell University’s Research Data Management Service Group template.
-------------------
GENERAL INFORMATION
-------------------
Organization: Software Preservation Network
Data set title:
Name and contact information for investigators (Name, institution, Email):
Date (or date range) of data collection <suggested format YYYYMMDD>:
Geographic location of data collection <City, State, County, Country and/or GPS Coordinates or bounding boxes>:
-------------------
SHARING/ACCESS INFORMATION
-------------------
Licenses/restrictions placed on the data, or limitations of reuse:
Recommended citation for the data:
Citation for and links to publications that cite or use the data:
Links to other publicly accessible locations of the data:
Links/relationships to ancillary or related data sets:
-------------------
DATA AND FILE OVERVIEW
-------------------
File list (filenames, directory structure (for zipped files) and a brief description of all data files, including explanation of naming format and any abbreviations or codes used):
Relationship between files, if important for context:
Additional related data collected that was not included in the current data package:
If data was derived from another source, list source:
If there are multiple versions of the dataset, list the files updated, when the update was made, and why:
-------------------
METHODOLOGICAL INFORMATION
-------------------
Description of methods for data collection:
Description of methods for data processing:
-------------------
DATA-SPECIFIC INFORMATION
-------------------
Variable list, with full names and definitions of column headings if tabular data:
Units of measurement:
Definitions for codes or symbols used to record missing information:
***
Last updated Aug. 1, 2019