Donny Winston
Made as simple as possible, but not simpler.

Have you ever given or gotten data as CSV? Are the meanings of the columns always clear? How are they made clear? Are the given column labels/names and the given file/sheet names always enough? If additional information beyond the CSV file is needed, how is that facilitated? A separate README file that travels with the CSV as part of a zipped archive file?

If you provide JSON, either as files or as API responses, you might be one step away from ensuring that anyone encountering that JSON has a portal to what it means. This step is to provide a single extra key-value pair in each JSON document – the key is “@context”, and the value is a URL. JSON-LD is “a JSON-based format to serialize Linked Data.”
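Here is a minimal sketch of that one step in Python. The context URL, the `with_context` helper, and the sample record are all hypothetical, made up for illustration; a real publisher would point at a context document it actually hosts.

```python
import json

# Hypothetical URL, for illustration only. A real context document would
# live at an address the publisher controls.
CONTEXT_URL = "https://example.org/contexts/sample.jsonld"


def with_context(document: dict, context_url: str = CONTEXT_URL) -> dict:
    """Return a copy of `document` with an "@context" key listed first.

    If the document already carries its own "@context", that value is kept.
    """
    return {"@context": context_url, **document}


# Hypothetical record, standing in for any JSON file or API response.
record = {"name": "sample-42", "mass_mg": 12.5}
linked = with_context(record)
print(json.dumps(linked, indent=2))
```

The original record is left untouched; the copy is the same JSON plus one key that tells a reader where to look up what the other keys mean.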

If you write a program that references a variable, and that variable points to a value, you likely don’t want that value to change unless you’re doing the changing. This gets tricky when you want to bring in more resources to help you get the job done faster – you might still be in control of the program, but the “you” in action may be multiple cores/threads.
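One way to picture this in Python is a value that cannot be changed in place, shared across threads. This is only a sketch: the `Settings` class and `transform` function are hypothetical names, and a frozen dataclass is just one of several ways to get an immutable value.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Hypothetical shared value; frozen, so attribute assignment raises."""
    scale: float
    offset: float


def transform(settings: Settings, x: float) -> float:
    # Workers only read the shared value; none of them can alter it.
    return settings.scale * x + settings.offset


settings = Settings(scale=2.0, offset=1.0)

# Several threads use the same value at once without any locking.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda x: transform(settings, x), range(5)))

# "Changing" the value means building a new one; the original is untouched.
rescaled = replace(settings, scale=3.0)

try:
    settings.scale = 10.0  # attempted in-place change
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Because every thread sees the same unchangeable value, the only way a new value appears is for the program to create one explicitly and decide who gets it.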

How do you check potential changes to your published data? You might set a rhythm for releases, say monthly or quarterly. You generate a new release candidate, run a set of checks, and then release. You reproduce the whole thing rather than add to it, and you do your checks at the end. You might do incremental rather than batch processing. You apply changes to your last release, run checks, and then release.
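The two rhythms can be sketched side by side in Python. Everything here is an assumption made for illustration: the dataset is a plain dict of records, the checks are two toy rules, and the function names are hypothetical.

```python
Dataset = dict[str, dict]


def run_checks(dataset: Dataset) -> list[str]:
    """Return a list of problems; an empty list means the candidate passes."""
    problems = []
    for key, record in dataset.items():
        value = record.get("value")
        if value is None:
            problems.append(f"{key}: missing value")
        elif value < 0:
            problems.append(f"{key}: negative value")
    return problems


def batch_candidate(sources: list[Dataset]) -> Dataset:
    """Batch: rebuild the whole release candidate from every source."""
    candidate: Dataset = {}
    for source in sources:
        candidate.update(source)
    return candidate


def incremental_candidate(last_release: Dataset, changes: Dataset) -> Dataset:
    """Incremental: apply only the changes on top of the last release."""
    candidate = dict(last_release)
    candidate.update(changes)
    return candidate


def release(candidate: Dataset) -> Dataset:
    """Run the checks, and refuse to release if any of them fail."""
    problems = run_checks(candidate)
    if problems:
        raise ValueError("; ".join(problems))
    return candidate
```

Either way, the checks sit between the candidate and the release. What differs is how much gets rebuilt before the checks run.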

Git is the common tool for version control of code. How does it work? It works by grouping events about lines. Lines are added or removed, and a group of add/remove events is a transaction, i.e. a “commit”. A sequence of these line-delta-group events can be replayed from the log to construct a snapshot of the codebase at any point in the commit history.
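The replay idea can be sketched in a few lines of Python. This follows the mental model above for a single file and is not how Git stores data internally (Git keeps content snapshots and computes line diffs from them); the `LineEvent` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineEvent:
    op: str         # "add" or "remove"
    index: int      # 0-based position in the file when the event is applied
    text: str = ""  # line content for "add"; unused for "remove"


# A commit is modeled as a group of line events applied together, in order.
Commit = list[LineEvent]


def apply_commit(lines: list[str], commit: Commit) -> list[str]:
    """Apply one group of add/remove events to a copy of the file's lines."""
    result = list(lines)
    for event in commit:
        if event.op == "add":
            result.insert(event.index, event.text)
        elif event.op == "remove":
            del result[event.index]
        else:
            raise ValueError(f"unknown op: {event.op!r}")
    return result


def snapshot(log: list[Commit], upto: int) -> list[str]:
    """Replay the first `upto` commits, starting from an empty file."""
    lines: list[str] = []
    for commit in log[:upto]:
        lines = apply_commit(lines, commit)
    return lines


# Hypothetical two-commit history for one file.
log = [
    [LineEvent("add", 0, "import sys"), LineEvent("add", 1, "print('hi')")],
    [LineEvent("remove", 1), LineEvent("add", 1, "print(sys.argv)")],
]
```

Calling `snapshot(log, 1)` rebuilds the file as of the first commit, and `snapshot(log, 2)` as of the second. The log is the source of truth, and any snapshot is derived from it.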

We often think of provenance as a physical thing, tracking the history of a sample and of what we measured. But the provenance of a result started when someone had the idea or the request to measure it. The metadata for a result is not just the parameters on the instrument, or how much sample, or which sample – it’s all those other steps upstream. Conceptual metadata are like tags, meaningful handles.

You have a data-intensive research problem. Custom software will help you solve it. Code is written. A dataset is collected to feed the code. Did you just create another silo? What would it mean to be data-centric, with only one data platform and with applications on top, where applications come and go? In your group, who decides what kind of database to use?