walk-through pydabu

your data

Your data is nothing more than a data bubble, until it is:

  • described
  • shared
  • published

pydabu can help you to describe your data. Think of it as a simple, basic data management plan (cf. [wikipedia:DMP]; [RDA_DMP]), which is good research practice (cf. [DFG]; [Helmholtz]).

Like your data itself, a description of your data can be shared. For example, if a search platform (e. g. [Solr]) is running in-house, you could share the description of your data there and enable your colleagues to find it.
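
How this sharing could look depends on the platform. As a minimal sketch, assuming a local Solr instance with a core named “datasets” (core name, URL and the use of Solr's JSON document update handler are assumptions, not part of pydabu; depending on the Solr schema, nested fields may have to be flattened first), the description could be pushed like this:

import json
import requests  # third-party HTTP client

# load the description of the data bubble
with open(".dabu.json") as fd:
    doc = json.load(fd)

# push it to the (assumed) local Solr core "datasets"
resp = requests.post(
    "http://localhost:8983/solr/datasets/update/json/docs",
    params={"commit": "true"},
    json=doc)
resp.raise_for_status()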

During publication you will most probably need a description of your data in the form of metadata.

References:

[wikipedia:DMP] https://en.wikipedia.org/wiki/Data_management_plan
[RDA_DMP] Miksa, Tomasz; Walk, Paul; Neish, Peter: RDA DMP Common Standard for Machine-actionable Data Management Plans. https://doi.org/10.15497/rda00039
[DFG] Deutsche Forschungsgemeinschaft: Guidelines for Safeguarding Good Research Practice. Code of Conduct. https://doi.org/10.5281/zenodo.3923602
[Helmholtz] Good scientific practice. https://www.helmholtz.de/en/about-us/the-association/good-scientific-practice/
[Solr] https://lucene.apache.org/solr/

creating a data bubble

First of all you have to collect all data belonging to your data bubble in a directory. Use your preferred way to copy/move your data. The directory could look like this:

$ cd pydabu && ls -1a
doc/
.git
gpl.txt
install2home
INSTALL.txt
LICENSE.txt
manual_pydabu.pdf
PKG-INFO
pydabu_unittests/
README.md
setup.py
src/

Or, when storing big data, it could look like this:

$ cd foo && ls -1
glow_XIMAS1848000_001.zip
glow_XIMAS1848000_2020-08-05_00003_140552145659648.img
glow_XIMAS1848000.log
graphics/
info.txt
overview_XOMAS1848000_001.zip
overview_XOMAS1848000_2020-08-05_00004_140603729520384.img
overview_XOMAS1848000.log
pytwanrc_doc.pdf
result.pdf
result.rst
signals.pdf
twanrc_rf_trigger_AK06FZRP.log

Now let us create a description with pydabu create_data_bubble:

pydabu create_data_bubble -dir .

Two files, “.dabu.json” and “.dabu.schema”, are created as a draft for you. The JSON schema in “.dabu.schema” describes the structured data stored in the JSON instance “.dabu.json”.

The schema describes not only the type of some data, but also which metadata are required. You can adapt it yourself to your needs, or your supervisor can describe their requirements there.

The instance describes your data and holds some simple format check results. You have to fill this draft with additional information, and you should check it.
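
As an illustration of adapting the schema — the exact layout of “.dabu.schema” depends on pydabu, and the property name “author” is only a made-up example — making an additional top-level property required could look like this:

import json

# load the generated draft schema
with open(".dabu.schema") as fd:
    schema = json.load(fd)

# require an additional (hypothetical) top-level property "author";
# this assumes the top level of the schema is a standard JSON Schema object
schema.setdefault("required", [])
if "author" not in schema["required"]:
    schema["required"].append("author")

# write the adapted schema back
with open(".dabu.schema", "w") as fd:
    json.dump(schema, fd, indent=2)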

You can look at the generated files with any text editor. Here we use a viewer:

firefox .dabu.json
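
If you prefer to stay in the terminal, the JSON pretty-printer shipped with the Python standard library works as well:

python3 -m json.tool .dabu.json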

checking and fixing a data bubble

You can check whether your JSON instance is valid with respect to the schema (e. g. for the “pydabu” directory from above, you will not get any output):

jsonschema -i .dabu.json .dabu.schema
pydabu check_data_bubble -dir .

At the moment the command pydabu check_data_bubble gives an overview of errors and warnings. Mostly you will see required properties that are missing.
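
The same check can be scripted. A minimal sketch using the jsonschema Python package (the same package that provides the jsonschema command line tool) which prints every violation instead of stopping at the first one:

import json
from jsonschema.validators import validator_for

# load the JSON instance and its schema
with open(".dabu.json") as fd:
    instance = json.load(fd)
with open(".dabu.schema") as fd:
    schema = json.load(fd)

# pick the validator class matching the draft declared in the schema
validator = validator_for(schema)(schema)
for error in validator.iter_errors(instance):
    print(error.message)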

For example, for the data in the directory “foo” (from above), you will get:

$ jsonschema -i .dabu.json .dabu.schema
u'data integrity control' is a required property

Since we have not edited “.dabu.json” manually at this point, this is easy to fix. Use [pfu] to create some checksums (if you have a few GB or more, this could take a while) and recreate the data bubble:

$ pfu.py create_checksum -dir . -store single
$ rm .dabu.json .dabu.schema
$ pydabu create_data_bubble -dir .
$ jsonschema -i .dabu.json .dabu.schema
...
u'license' is a required property

Instead of pfu you can also use your preferred checksumming tool.
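
For small directories, a plain Python sketch based on hashlib would also do. Note that the file name “.checksum.sha512” and the “<digest>  <filename>” line format are assumptions based on the example above, not a documented pydabu interface:

import hashlib
import os

# compute a SHA-512 checksum for every regular, non-hidden file
# in the current directory
lines = []
for name in sorted(os.listdir(".")):
    if not os.path.isfile(name) or name.startswith("."):
        continue
    checksum = hashlib.sha512()
    with open(name, "rb") as fd:
        for chunk in iter(lambda: fd.read(1 << 20), b""):
            checksum.update(chunk)
    lines.append(checksum.hexdigest() + "  " + name + "\n")

# store all checksums in one file (file name and format are assumptions)
with open(".checksum.sha512", "w") as fd:
    fd.writelines(lines)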

Now you have to add a license, e. g. write a file “LICENSE.txt”:

$ rm .checksum.sha512 .dabu.json .dabu.schema
$ vim LICENSE.txt
$ pfu.py create_checksum -directory . -store single
$ pydabu create_data_bubble -dir .
$ jsonschema -i .dabu.json .dabu.schema

Now all metadata required by “.dabu.schema” is collected in “.dabu.json”.

References:

[pfu] pfu – Python File Utilities, https://gitlab.dlr.de/pfu/pfu