Donny Winston
Made as simple as possible, but not simpler.

A scientific database cannot be everything to everyone. Jim Gray came up with the “20 queries” heuristic: what are the 20 most important questions the researchers want the data system to answer? [1] Five questions are not enough to reveal a broader pattern, and 100 questions would dilute focus. Also, the relative information in queries ranked by importance is likely to be logarithmic – a “long tail” distribution.

Organizational capabilities can be divided into three categories: resources, processes, and priorities. Resources are what you use to achieve an outcome, processes are how you achieve it, and priorities are why. Understanding capabilities in this way can aid in strategy not only across a large organization but also within units, and even for individuals. [1] Resources are tangible assets.

How do you source data relevant for some analysis? Once you “have” the data, how do you feed it to the analytic task? Traditional enterprise data integration joins paths across a handful of silos for a handful of specific analytic tasks. In data science, however, neither the set of relevant silos nor the set of relevant analytic tasks is both small and well-defined.

I was reminded of the importance of approachable, low-barrier-to-entry tools for data management by Monica Granados and Lily Zhao in their presentation of the Frictionless Data toolkit. [1] They showcased use of a browser-based interface [2] for a simple yet valuable task: associating title and description metadata with potentially cryptic column header names in a CSV file, and exporting that metadata together with the raw data
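The task described above can be sketched with the Python standard library alone. This is a rough illustration, not the toolkit's own interface: the CSV contents, the column names, and the `describe` helper are all made up, and the descriptor layout (a `resources` list whose `schema.fields` entries carry `name`, `title`, and `description`) is an assumption about the general shape of a Frictionless data package descriptor.

```python
import csv
import io
import json

# Made-up raw data with cryptic column headers (illustrative only).
raw_csv = "stn_id,t_max_c\nA1,21.5\nB2,19.0\n"

# Hypothetical human-readable metadata, keyed by column header.
column_docs = {
    "stn_id": {"title": "Station ID",
               "description": "Identifier of the weather station."},
    "t_max_c": {"title": "Maximum temperature",
                "description": "Daily maximum temperature in degrees Celsius."},
}


def describe(csv_text, docs, name, path):
    """Build a descriptor that pairs each CSV column with its metadata.

    The dictionary layout is an assumed approximation of a data package
    descriptor, not output produced by the Frictionless tools.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    # Columns without an entry in `docs` keep only their name.
    fields = [{"name": col, **docs.get(col, {})} for col in header]
    return {
        "name": name,
        "resources": [{"name": name, "path": path, "schema": {"fields": fields}}],
    }


descriptor = describe(raw_csv, column_docs, "weather", "weather.csv")
print(json.dumps(descriptor, indent=2))
```

Writing the descriptor next to the CSV file keeps the metadata and the raw data together, so a later reader does not have to guess what `t_max_c` means.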

Laws are rules that a particular community recognizes as regulating the actions of its members. From this definition, Serena Peruzzo detailed how she sought to use tools from Natural Language Processing (NLP) to “find a representation of the rules that makes them more accessible and understandable.” [1] One proposed use case is to identify and highlight ambiguities.

In an episode of the CoRecursive podcast [1], Sam Ritchie uses the phrase “portal abstraction” to describe how the use of a particular term can open a portal – a gateway – to a world of relevant prior art. He discusses issues in analytics. One issue is distributing summative calculations over data both as batches and in real-time, specialized for “big” and “fast” data, respectively.
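One way to picture a summative calculation that works over both batches and a real-time stream is a summary with an associative merge. The excerpt does not name the abstraction discussed in the episode, so the sketch below is an assumption: `MeanSummary` and `summarize` are hypothetical names, and the data is made up.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MeanSummary:
    """Running (count, total) pair; merging two summaries is associative."""
    count: int = 0
    total: float = 0.0

    @staticmethod
    def of(x):
        # Summary of a single observation.
        return MeanSummary(1, float(x))

    def merge(self, other):
        # Combine two partial summaries; grouping does not change the result.
        return MeanSummary(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def summarize(values):
    """Fold a batch of values into one summary, starting from the empty one."""
    return reduce(MeanSummary.merge, map(MeanSummary.of, values), MeanSummary())


data = [3, 5, 8, 13, 21]  # made-up observations

batch = summarize(data)  # "big" data: the whole batch at once

stream = MeanSummary()  # "fast" data: one observation at a time
for x in data:
    stream = stream.merge(MeanSummary.of(x))

# Partial batches computed separately, then merged.
partitioned = summarize(data[:2]).merge(summarize(data[2:]))
```

Because `merge` is associative and has an empty starting value, the batch, streaming, and partitioned paths all arrive at the same summary.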

As part of her introduction to ontology engineering, [1] Prof. Maria Keet has a slide depicting ontology as a layer apart from conceptual data models. (Figure: Conceptual data models vs. ontologies. [source]) I like this visualization of various project-specific conceptual models and their associated implementations in databases and codebases.

There’s a Python library called monty that supports a convention for moving between JSON objects and Python class instances. The major components are a mix-in class, MSONable, along with subclasses of json.encoder.JSONEncoder and json.decoder.JSONDecoder. An appropriate JSON object will have two special keys: @module and @class.
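The convention can be illustrated with a standard-library sketch. This is not monty's implementation: `Serializable`, `SketchEncoder`, `SketchDecoder`, and `Lattice` are hypothetical names, a small registry stands in for importing the class named by `@module` and `@class`, and the mix-in assumes constructor arguments match instance attribute names.

```python
import json

# (module name, class name) -> class. An assumed stand-in for dynamic import.
_REGISTRY = {}


class Serializable:
    """Hypothetical mix-in, loosely modeled on the idea behind MSONable."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        _REGISTRY[(cls.__module__, cls.__name__)] = cls

    def as_dict(self):
        # Record where the class lives, then its attributes.
        d = {"@module": type(self).__module__, "@class": type(self).__name__}
        d.update(vars(self))
        return d

    @classmethod
    def from_dict(cls, d):
        # Assumes constructor arguments share names with the attributes.
        return cls(**{k: v for k, v in d.items() if not k.startswith("@")})


class SketchEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Serializable):
            return o.as_dict()
        return super().default(o)


class SketchDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        kwargs["object_hook"] = self._hook
        super().__init__(*args, **kwargs)

    @staticmethod
    def _hook(d):
        # Rebuild an instance when both special keys name a known class.
        cls = _REGISTRY.get((d.get("@module"), d.get("@class")))
        return cls.from_dict(d) if cls else d


class Lattice(Serializable):
    """Made-up example class."""

    def __init__(self, a, b):
        self.a = a
        self.b = b


text = json.dumps(Lattice(1.0, 2.0), cls=SketchEncoder)
restored = json.loads(text, cls=SketchDecoder)
```

JSON objects that lack the two special keys pass through the decoder unchanged as plain dictionaries.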

I’ve heard the phrase “just the tip of the iceberg” used as a positive phrase when revealing value of which an audience might not have been previously aware. In the context of disseminating scientific data, this tip might be a publication. A reader sees paragraphs and figures that describe and show data. A Supplemental Information section might link to a much greater volume of data – the rest of the iceberg.