Rogue Scholar

Published March 23, 2021

Have you ever given or gotten data as CSV? Are the meanings of the columns always clear? How are they made clear? Are the given column labels/names and the given file/sheet names always enough? If additional information beyond the CSV file is needed, how is that facilitated? A separate README file that travels with the CSV as part of a zipped archive file?

Natural Sciences

A JSON File That Knows Its Schema and Context

https://doi.org/10.59350/rbjh3-e8t46

Published March 19, 2021

Author Donny Winston

If you provide JSON, either as files or as API responses, you might be one step away from ensuring that anyone encountering that JSON has a portal to what it means. This step is to provide a single extra key-value pair in each JSON document – the key is “@context”, and the value is a URL. JSON-LD is “a JSON-based format to serialize Linked Data.”
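A minimal sketch of that one extra step, using only Python's standard library. The record fields and the context URL are hypothetical placeholders, not taken from the post; a real context URL would resolve to a JSON-LD context document.

```python
import json

# Hypothetical URL, used for illustration only.
CONTEXT_URL = "https://example.org/context.jsonld"


def with_context(document: dict, context_url: str = CONTEXT_URL) -> dict:
    """Return a copy of the document with an "@context" key listed first.

    If the document already carries its own "@context", that value is kept.
    """
    return {"@context": context_url, **document}


# A made-up record standing in for a file or an API response.
record = {"name": "sample-42", "temperature_K": 300}
linked = with_context(record)

print(json.dumps(linked, indent=2))
```

The original record is left untouched, so the same helper could wrap responses on their way out of an API without changing how they are stored.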

Natural Sciences

A Universal Relation for Data -- One Table and One Table Only

https://doi.org/10.59350/51f0d-td734

Published March 17, 2021

Author Donny Winston

You have a variety of entities, each with a variety of attributes, and each involved in a variety of relationships. One approach to manage such data is a collection/spreadsheet/table approach where you partition your entities. Each entity has a primary address in a document/row in one collection/sheet/table.
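A small sketch of the two layouts, under stated assumptions. The excerpt describes the partitioned layout; the single-table alternative shown here is only one reading of the post's title, and every table, entity, and attribute name is made up for illustration.

```python
# Partitioned approach: each entity has a primary address (table name, row id).
tables = {
    "samples": {"s1": {"material": "Si", "mass_mg": 12.5}},
    "measurements": {"m1": {"sample": "s1", "instrument": "xrd-01"}},
}


def to_single_table(partitioned: dict) -> list[tuple]:
    """Flatten partitioned tables into one table of (entity, attribute, value) rows."""
    rows = []
    for table_name, records in partitioned.items():
        for row_id, attributes in records.items():
            entity = f"{table_name}/{row_id}"  # keep the old address as the entity id
            for attribute, value in attributes.items():
                rows.append((entity, attribute, value))
    return rows


rows = to_single_table(tables)
for row in rows:
    print(row)
```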

Natural Sciences

Data Sharing Is Hard Like Shared Variables Are Hard

https://doi.org/10.59350/kgs8e-tx279

Published March 15, 2021

Author Donny Winston

If you write a program that references a variable, and that variable points to a value, you likely don’t want that value to change unless you’re doing the changing. This gets tricky when you want to bring in more resources to help you get the job done faster – you might still be in control of the program, but the “you” in action may be multiple cores/threads.
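A minimal sketch of the threading case, assuming Python's standard `threading` module. The counter and the worker function are hypothetical; the lock is one common way to make sure only one thread changes the shared value at a time.

```python
import threading

counter = 0              # the shared variable
lock = threading.Lock()  # guards every change to counter


def add_many(n: int) -> None:
    """Increment the shared counter n times, one locked step at a time."""
    global counter
    for _ in range(n):
        with lock:  # without this, the read-modify-write steps of different threads can interleave
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```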

Natural Sciences

Continuous Integration for Scientific Data

https://doi.org/10.59350/rbm49-g1s94

Published March 12, 2021

Author Donny Winston

How do you check potential changes to your published data? You might set a rhythm for releases, say monthly or quarterly. You generate a new release candidate, run a set of checks, and then release. You reproduce the whole thing rather than add to it, and you do your checks at the end. You might do incremental rather than batch processing. You apply changes to your last release, run checks, and then release.
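A rough sketch of the check-then-release step, covering both the batch and the incremental case. The record shape, the two checks, and the helper names are assumptions chosen for illustration, not the author's pipeline.

```python
from typing import Callable

Record = dict
Check = Callable[[list[Record]], bool]


def no_missing_ids(records: list[Record]) -> bool:
    """Every record must carry a non-empty id."""
    return all(r.get("id") for r in records)


def unique_ids(records: list[Record]) -> bool:
    """No two records may share an id."""
    ids = [r.get("id") for r in records]
    return len(ids) == len(set(ids))


CHECKS: list[Check] = [no_missing_ids, unique_ids]


def failed_checks(candidate: list[Record], checks: list[Check] = CHECKS) -> list[str]:
    """Return the names of the checks that the release candidate fails."""
    return [check.__name__ for check in checks if not check(candidate)]


def incremental_candidate(last_release: list[Record], changes: list[Record]) -> list[Record]:
    """Incremental processing: apply new records on top of the last release."""
    return last_release + changes


last_release = [{"id": "a", "value": 1}, {"id": "b", "value": 2}]
candidate = incremental_candidate(last_release, [{"id": "b", "value": 3}])

problems = failed_checks(candidate)
print("release" if not problems else f"blocked by: {problems}")
```

In the batch case the candidate would be regenerated from scratch and passed to the same `failed_checks` call before release.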

Natural Sciences

Version Control for Data

https://doi.org/10.59350/592xg-6ch73

Published March 10, 2021

Author Donny Winston

Git is the common tool for version control of code. How does it work? It works by grouping events about lines. Lines are added or removed, and a group of add/remove events is a transaction, i.e. a “commit”. A sequence of these line-delta-group events can be replayed from the log to construct a snapshot of the codebase at any point in the commit history.
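The replay idea can be sketched as a toy model. This follows the description above rather than Git's actual storage format, and the `LineEvent` type, the `replay` function, and the sample log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineEvent:
    op: str        # "add" or "remove"
    index: int     # line position at the moment the event is applied
    text: str = ""  # content of the line, used by "add" only


# A commit is a group of line events applied together.
Commit = list[LineEvent]


def replay(log: list[Commit], upto: int) -> list[str]:
    """Rebuild the file as it stood after the first `upto` commits."""
    lines: list[str] = []
    for commit in log[:upto]:
        for event in commit:
            if event.op == "add":
                lines.insert(event.index, event.text)
            elif event.op == "remove":
                del lines[event.index]
            else:
                raise ValueError(f"unknown op: {event.op}")
    return lines


log = [
    [LineEvent("add", 0, "import sys"), LineEvent("add", 1, "print('hi')")],
    [LineEvent("remove", 1), LineEvent("add", 1, "print(sys.argv)")],
]

for n in range(len(log) + 1):
    print(n, replay(log, n))
```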

Natural Sciences

Conceptual Provenance

https://doi.org/10.59350/1hbqz-v7h12

Published March 8, 2021

Author Donny Winston

We often think of provenance as a physical thing, tracking the history of a sample and of what we measured. But the provenance of a result started when someone had the idea or the request to measure it. The metadata for a result is not just the parameters on the instrument, or how much sample, or which sample – it’s all those other steps upstream. Conceptual metadata are like tags, meaningful handles.

Natural Sciences

Shipping Context With Data -- Convention, Protocol, and Infrastructure

https://doi.org/10.59350/1e6e4-jpz26

Published March 5, 2021

Author Donny Winston

How do you effectively share a computational process? You could simply share a directory of source code and rely on shared conventions – shared programming language, shared tooling for build and runtime environments, a README.txt convention for communicating setup instructions, etc.

Natural Sciences

Are You Just Creating Another Data Silo?

https://doi.org/10.59350/b30ef-x3x14

Published March 3, 2021

Author Donny Winston

You have a data-intensive research problem. Custom software will help you solve it. Code is written. A dataset is collected to feed the code. Did you just create another silo? What would it mean to be data-centric, with only one data platform and with applications on top, where applications come and go? In your group, who decides what kind of database to use?

Natural Sciences

Relating versus Replacing

https://doi.org/10.59350/q3zsx-e4374

Published March 1, 2021

Author Donny Winston

“Centralization”. “Mapping”. These are overloaded concepts. They mean different things in different contexts. One hawk-eyed reader has rightly cried Ambiguous! on my usage of these terms in recent notes.

Donny Winston

A CSV File That Knows Its Schema and Context
