Rogue Scholar

Published May 10, 2021

Many tests use oracles, where you know the answers for some inputs and you check those correspondences. To cover more of the input state space, you can generate random inputs and check some properties of each corresponding output. You don’t have an enumeration of exact answers as you do with oracles, but you can check properties such as the output always being greater than zero.
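A minimal sketch of the two styles side by side, assuming a hypothetical function under test (`mean_absolute_deviation` is an illustration, not something from the post). The oracle check pins one known answer; the property check draws random inputs and asserts only that the output is non-negative and no larger than the spread of the input.

```python
import random


def mean_absolute_deviation(values):
    """Hypothetical function under test: mean distance of values from their mean."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


def check_oracle():
    # Oracle style: a known input paired with its known answer.
    assert mean_absolute_deviation([1, 3]) == 1.0


def check_properties(trials=1000, seed=0):
    # Property style: random inputs, no exact answers, only invariants.
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(1, 50)
        values = [rng.uniform(-1e3, 1e3) for _ in range(size)]
        result = mean_absolute_deviation(values)
        assert result >= 0
        # Small tolerance for floating-point rounding.
        assert result <= (max(values) - min(values)) + 1e-9
    return trials


check_oracle()
check_properties()
```

Seeding the random generator keeps a failing run reproducible, which is one common design choice for randomized tests.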

Natural Sciences

Collaborative Custodianship of Data Objects

https://doi.org/10.59350/647f4-rmp28

Published May 7, 2021

Author Donny Winston

In a collaboration, data objects are produced at many sites. To make the data objects findable, you may steward a central, searchable index for their metadata. How then do you make the data objects accessible for download? One common solution is to centralize the custodianship – have all sites upload copies of their data objects to a central store. The central store may partition storage across several physical servers behind the scenes (e.g.

Natural Sciences

Exploratory Behavior With Generate-and-Test

https://doi.org/10.59350/fgzpy-5rm58

Published May 5, 2021

Author Donny Winston

One powerful mechanism of robustness is exploratory behavior, for which the desired outcome is produced by a generate-and-test mechanism. This organization allows the generator to work and be developed independently of the tester that accepts or rejects a particular result. One can make an analogy to biological evolution, where the generator is random mutation and the tester is natural selection.
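The separation of generator and tester can be sketched as follows. This is an illustrative arrangement under stated assumptions, not the post's own code: `random_candidates`, `generate_and_test`, and the divisibility predicate are hypothetical names chosen for the example.

```python
import random


def random_candidates(rng, low, high):
    """Generator: proposes candidates with no knowledge of the acceptance rule."""
    while True:
        yield rng.randint(low, high)


def generate_and_test(candidates, accept, max_tries):
    """Return the first candidate the tester accepts, or None if the budget runs out."""
    for _, candidate in zip(range(max_tries), candidates):
        if accept(candidate):
            return candidate
    return None


def divisible_by_77(n):
    """Tester: accepts or rejects a result without knowing how it was produced."""
    return n % 77 == 0


found = generate_and_test(
    random_candidates(random.Random(0), 1, 1000),
    divisible_by_77,
    max_tries=10_000,
)
```

Because the two parts only meet inside `generate_and_test`, either one can be swapped out (a smarter generator, a stricter tester) without touching the other.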

Natural Sciences

Resource Description (Ontology/Schema) Versus Resource Layout (API)

https://doi.org/10.59350/99wdn-det51

Published April 30, 2021

Author Donny Winston

Resource description refers to defining concepts and relationships that represent the content and structure of some subject matter (ontology) or a database (schema) in a formal language. The relationship between ontology and database schema is nuanced – Uschold provides a nice comparison.¹ You can formally describe resources using the Resource Description Framework (RDF), SQL’s data definition language (DDL), etc.

Natural Sciences

Metadata Harvesting From Delimited-Path Key-Value Systems

https://doi.org/10.59350/es2rp-je898

Published April 28, 2021

Author Donny Winston

In the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a repository is a means of exposing metadata to harvesters. The OAI-PMH spec goes into great detail about how a data provider should implement a repository so that a harvester can simply be a client application that issues one of six possible HTTP requests.
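To make the harvester side concrete, here is a small sketch that builds the request URL for each of the six OAI-PMH verbs. The base URL `https://example.org/oai` is a placeholder and `build_request_url` is a hypothetical helper; only the verb names and the `metadataPrefix` argument come from the protocol itself. The sketch builds URLs only and makes no network calls.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url, verb, **arguments):
    """Build a GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Placeholder repository endpoint, assumed for illustration only.
BASE_URL = "https://example.org/oai"

identify_url = build_request_url(BASE_URL, "Identify")
list_records_url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A harvester built this way needs nothing beyond an HTTP client and an XML parser, which is the simplicity the spec aims for on the client side.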

Natural Sciences

Statecharts as a Logic of Effects

https://doi.org/10.59350/4ghyt-2fr07

Published April 27, 2021

Author Donny Winston

Do your programs only compute pure functions of data, or do they also perform effects such as dynamically reading input, writing output, transitioning database state, making network requests, etc.? One sense of the term “logic” is as a general subject, i.e. the study of how to draw valid conclusions.

Natural Sciences

Data Dictionaries for Humans and Machines

https://doi.org/10.59350/6t4rj-kkm17

Published April 23, 2021

Author Donny Winston

Shared datasets often have column/field names that are ambiguous in their meaning, or contain identical/related concepts with different names, hindering reuse. This ambiguity happens regardless of the method of sharing – via files, web pages, or APIs. The traditional solution for this is to provide documentation.

Natural Sciences

When ETL Is a Symptom

https://doi.org/10.59350/zvt3g-43x40

Published April 22, 2021

Author Donny Winston

When you have several different applications (e.g. to perform simulations and analyses) that each have their own data model, it’s typical for each to also maintain its own siloed data store. Then, in order to use all the applications in concert to complete a research project, or to support an ongoing research program, you need to run extract-transform-load (ETL) pipelines to sync the data.

Natural Sciences

"Lets Not Reinvent the Wheel"

https://doi.org/10.59350/y3kz2-yr663

Published April 19, 2021

Author Donny Winston

I was reading about hidden costs of “packaged” software solutions – that is, using existing software to solve problems – and came across this sentence: ¹ Huh? I typically do not distinguish development from implementation. What McComb is calling “implementation” I just call “installation”. Weird.

Natural Sciences

Data Reduction for Science

https://doi.org/10.59350/y09r9-4xp26

Published April 16, 2021

Author Donny Winston

Earlier this week, I wrote that As luck would have it, the U.S. Department of Energy (DOE) posted a funding opportunity announcement (FOA) yesterday on Data Reduction for Science: There have been efforts for decades to identify and deal with this issue, with cute acronyms for relevant data like ROT (Redundant Obsolete and Trivial), WORN (Write Once Read Never), and WORSE (Write Once Read Seldom if Ever). However, the DOE FOA highlights that it

Donny Winston

Metamorphic Tests for Domain-Specific Properties
