
Donny Winston

Made as simple as possible, but not simpler.
Published

A lot of extract-transform-load (ETL) work requires unloading and un-transforming first. Rather than \(ETL\), it’s \(L_A^{-1}T_A^{-1}ET_BL_B\). What the data provider did is \(A\). What you want to do is \(B\). The data provider gave you a “dump” of their data. You don’t know what it means. If you did, you could extract (subset) from it according to your needs – filter entities by some meaningful criteria and collect selected attributes.
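As a rough illustration of that composition, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the CSV dump, the `CODEBOOK` that maps the provider's column codes back to meaningful names, and the function names. The point is only that you cannot write `extract` until `unload` and `untransform` have recovered what the data means.

```python
import csv
import io
import json

# Hypothetical dump from the data provider: cryptic column codes, CSV on disk.
DUMP = "id,c1,c2\n1,0.25,Fe\n2,0.50,Cu\n"

# Hypothetical codebook: the knowledge needed to undo the provider's transform.
CODEBOOK = {"c1": "concentration", "c2": "element"}


def unload(text):
    """Undo the provider's load step: parse the CSV text into rows."""
    return list(csv.DictReader(io.StringIO(text)))


def untransform(rows):
    """Undo the provider's transform step: restore meaningful attribute names."""
    return [{CODEBOOK.get(key, key): value for key, value in row.items()} for row in rows]


def extract(rows):
    """Filter entities by a meaningful criterion and collect selected attributes."""
    return [
        {"id": row["id"], "concentration": row["concentration"]}
        for row in rows
        if row["element"] == "Fe"
    ]


def transform(rows):
    """Our own transform: typed values, concentration expressed as a percentage."""
    return [
        {"id": int(row["id"]), "concentration_pct": float(row["concentration"]) * 100}
        for row in rows
    ]


def load(rows):
    """Our own load: serialize to JSON."""
    return json.dumps(rows)


# Applied left to right: unload, untransform, extract, transform, load.
result = load(transform(extract(untransform(unload(DUMP)))))
```

Without the codebook, `extract` has nothing meaningful to filter on, which is the situation a bare dump leaves you in.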

Published

Imagine a data system modeled as three parts: an interface, a processor, and a repository. The repository “contains” information. The processor receives symbol streams to alter or retrieve information from the repository, and the processor outputs symbol streams. The interface is the medium, the opaque surface, of symbol-stream exchange between you and the processor.[1] What information is “in” the repository?

Published

After my last note on identifiers, Leo Talirz pointed me to a great riff [1] on Tim Berners-Lee’s classic note [2] on “cool URIs”. In the “Cool DOIs” article, Fenner breaks down a DOI into three parts: proxy, prefix, and suffix. A proxy is a server that maintains a map from prefixes to registrants. Example proxies are https://doi.org/ and https://hdl.handle.net/. An example prefix is 10.5281.
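The three-part breakdown can be sketched as a small parser. This is a minimal sketch, not a validator: the function name `split_doi_url` and the suffix `example-suffix` are made up for illustration, and the check that a prefix starts with `10.` is a simplifying assumption that fits DOIs, not handles in general.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class DoiParts(NamedTuple):
    proxy: str   # e.g. "https://doi.org/"
    prefix: str  # e.g. "10.5281"
    suffix: str  # chosen by the registrant; may itself contain slashes


def split_doi_url(url: str) -> DoiParts:
    """Split a DOI expressed as a URL into proxy, prefix, and suffix."""
    parts = urlsplit(url)
    proxy = f"{parts.scheme}://{parts.netloc}/"
    # The first slash after the proxy separates prefix from suffix.
    prefix, separator, suffix = parts.path.lstrip("/").partition("/")
    if not separator or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI URL: {url!r}")
    return DoiParts(proxy, prefix, suffix)


# Hypothetical suffix, used only to show the shape of the result.
example = split_doi_url("https://doi.org/10.5281/example-suffix")
```

Only the first slash after the proxy is treated as the boundary, so a suffix containing further slashes stays intact.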

Published

Data protocols vary over project lifetimes, and many projects involve parameter sweeps. You might see filesystem directory structures evolve naming schemes like the following [1]:

    # let's not overthink this at first.
    concentration_A_0.25/
    # hierarchy is good, right?
    concentration_A/0.25/
    # paths are getting too long.
    conc_A/0.25/
    # change to percentages. clever!
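To make the cost of that drift concrete, here is a sketch of what a reader of the data might end up writing just to recover one parameter from the first three schemes. The `ALIASES` table, the regular expression, and the function name `parse_sweep_path` are all assumptions for the example, and every further renaming would need another special case.

```python
import re

# Hypothetical alias table: abbreviations that crept in over the project's life.
ALIASES = {"conc_A": "concentration_A"}

# A parameter name, then "_" or "/", then a decimal value, then an optional "/".
PATTERN = re.compile(r"^(?P<name>[A-Za-z_]+?)[_/](?P<value>\d+(?:\.\d+)?)/?$")


def parse_sweep_path(path):
    """Recover (parameter name, value) from one directory naming variant."""
    match = PATTERN.match(path)
    if match is None:
        raise ValueError(f"unrecognized naming scheme: {path!r}")
    name = ALIASES.get(match["name"], match["name"])
    return name, float(match["value"])


variants = ["concentration_A_0.25/", "concentration_A/0.25/", "conc_A/0.25/"]
parsed = [parse_sweep_path(path) for path in variants]
```

All three variants reduce to the same pair, but only because the parser encodes the naming history that the paths themselves do not record.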

Published

Against what bases are queries on your data evaluated? If you only expose a single “data base” that changes over time, then data citations cannot be a combination of query and basis. When citing a passage in a book, the edition/variation of its publication, i.e. the thing that is assigned an ISBN, is the basis. Optionally, a citation may include a “query” against this basis – a page number, page range, chapter number/title, etc.
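One way to picture a citation as query plus basis is the sketch below. It assumes, purely for illustration, that a basis can be identified by a content hash of a frozen snapshot, playing the role an ISBN plays for a book edition. The names `Citation` and `snapshot_id`, the sample records, and the query string are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    basis: str                   # identifies one fixed state of the data
    query: Optional[str] = None  # optional selection evaluated against that state


def snapshot_id(records) -> str:
    """Derive an identifier from the snapshot's content (an assumed scheme)."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two states of the "same" database over time (hypothetical records).
edition_1 = [{"id": 1, "value": 0.25}]
edition_2 = [{"id": 1, "value": 0.25}, {"id": 2, "value": 0.50}]

# The same query cited against two different bases yields two different citations.
citation_1 = Citation(basis=snapshot_id(edition_1), query="value < 0.3")
citation_2 = Citation(basis=snapshot_id(edition_2), query="value < 0.3")
```

If only the latest state is ever exposed, there is nothing stable to put in the `basis` field, and the query alone cannot pin down what was cited.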