
Merge in the Temporal Lens


The Temporal Lens provides a facility for identifying temporal trends. This can be used for:
- Extrapolating beyond the end of a dataset (given N records, synthesize a table with N+K records)
- Interpolating/imputing fields (could be plugged into the missing value lens)

Unfortunately, it’s a bit out of date — from before we split models and lenses into two separate fragments. We need someone to dig into the old code and pull out the model components, plug it into the missing value lens, and create a new extrapolation lens.

Plugging it into the missing value lens would also require adding in a statistics gathering tool that auto-detects when one or more columns represent some sort of temporal property that could be used to extrapolate trends.
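The detection-plus-extrapolation pipeline might look roughly like this (a minimal sketch, not Mimir's actual API; `looksTemporal`, `extrapolate`, and all thresholds are hypothetical):

```scala
// Hypothetical sketch: detect a monotone, roughly evenly-spaced numeric
// column (a common signature of timestamps / sequence ids) and use simple
// linear extrapolation to synthesize K additional values.
object TemporalSketch {
  // A column "looks temporal" if it is strictly increasing and its
  // step sizes don't vary wildly (threshold is made up for illustration).
  def looksTemporal(col: Seq[Double]): Boolean = {
    if (col.length < 3) return false
    val steps = col.sliding(2).map { case Seq(a, b) => b - a }.toSeq
    val mean = steps.sum / steps.length
    steps.forall(_ > 0) && steps.forall(s => math.abs(s - mean) < 0.5 * mean)
  }

  // Fit y = a + b*x by least squares and emit K points past the end.
  def extrapolate(xs: Seq[Double], ys: Seq[Double], k: Int): Seq[(Double, Double)] = {
    val n = xs.length.toDouble
    val (mx, my) = (xs.sum / n, ys.sum / n)
    val b = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum /
            xs.map(x => (x - mx) * (x - mx)).sum
    val a = my - b * mx
    val step = (xs.last - xs.head) / (n - 1)
    (1 to k).map { i => val x = xs.last + i * step; (x, a + b * x) }
  }
}
```

A real extrapolation lens would swap the linear fit for whatever model the old Temporal Lens code used, but the detect-then-extrapolate shape would be the same.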

Updated 04/07/2017 17:25

Make the Weka gods happy


As things stand right now, Weka support for missing value imputation in Mimir is haphazardly implemented, and as a result it can be finicky.

Specific goals for this project:
- [ ] Rewrite WekaModel to use a model that supports mixed datatypes (text + ordinal + numeric). This may require adding some statistics-gathering functionality to evaluate whether a given column can be treated as ordinal or not.
- [ ] Add support for more types of Weka models. In other words, first assess whether there are clearly detectable cases where one model outperforms another, and see if we can detect those cases.
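The statistics-gathering piece could start as simple heuristics over a sample of the column. A minimal sketch (names and thresholds are invented for illustration, not part of WekaModel):

```scala
// Hypothetical heuristic: classify a column of raw strings as Numeric,
// Ordinal (a small set of repeated categories), or Text, based on simple
// statistics over a sample.  Thresholds are made up for illustration.
object ColumnStats {
  sealed trait Kind
  case object Numeric extends Kind
  case object Ordinal extends Kind
  case object Text    extends Kind

  def classify(sample: Seq[String]): Kind = {
    val nonNull = sample.filter(v => v != null && v.nonEmpty)
    if (nonNull.isEmpty) return Text
    // Fraction of values that parse as numbers.
    val numericFrac =
      nonNull.count(s => scala.util.Try(s.toDouble).isSuccess).toDouble / nonNull.length
    // Fraction of values that are distinct: few distinct values suggests categories.
    val distinctFrac = nonNull.distinct.length.toDouble / nonNull.length
    if (numericFrac > 0.95) Numeric
    else if (distinctFrac < 0.1) Ordinal
    else Text
  }
}
```

The real version would feed this decision into how WekaModel builds its attribute declarations, rather than returning an enum.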

Updated 02/06/2017 17:35

Quality assessment lens


Second-order confidence is tricky: In some settings, a data cleaning tool can give you a quality score, but it can’t really give you a measure of how “good” this quality score is. For example: When you ask the Google Maps API to geocode an address, it gives you a confidence ranking for each potential match. Unfortunately, this confidence ranking may be complete garbage relative to knowledge available to you.

There are some basic tricks you can play:
* Put constraints on the data. Output violating the constraints is clearly an error.
* Manually inspect all of the data. A human can better assess what's going on and correct errors.

These aren’t ideal: Coming up with a set of precise constraints is tricky, or in some cases even impossible (e.g., You might indicate that a geotagged address is supposed to be in a given city, but that doesn’t guarantee that the address will be tagged correctly). Manual inspection is more reliable, but slow.

It would be useful if we had a library within Mimir to optimize the time a human spends asking and answering questions about the data. One approach would be to integrate or implement something along these lines in order to get a first-order approximation of how reliable the confidence scores are (e.g., to learn a transfer function from the Google Maps API confidence score to the actual confidence a human has in its output).
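One crude way to learn such a transfer function, given a (small) set of human-labeled results, is a binned calibration curve. A sketch, with entirely hypothetical names, that maps an API score to the empirical accuracy observed at that score level:

```scala
// Hypothetical sketch of a "transfer function": given (apiScore, humanSaysCorrect)
// pairs with scores in [0, 1], estimate the empirical probability of
// correctness within score bins.  This is a crude binned calibration
// curve for illustration, not Mimir code.
object Calibration {
  def fit(labeled: Seq[(Double, Boolean)], bins: Int = 10): Double => Double = {
    // Bucket the labeled examples by score bin.
    val byBin = labeled.groupBy { case (score, _) =>
      math.min((score * bins).toInt, bins - 1)
    }
    // Empirical accuracy per bin.
    val rates = byBin.map { case (b, xs) =>
      b -> xs.count(_._2).toDouble / xs.length
    }
    // Calibrated confidence = empirical accuracy of the score's bin;
    // fall back to the raw score for bins we never observed.
    (score: Double) => {
      val b = math.min((score * bins).toInt, bins - 1)
      rates.getOrElse(b, score)
    }
  }
}
```

Even a mapping this simple would reveal, for example, that a geocoder's "0.95 confidence" matches actually satisfy a human only half the time.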

Updated 26/06/2017 13:56 1 Comment

Have CTExplainer leverage TupleBundler


As of right now, CTExplainer’s explainRow and explainCell methods rely on a hack to compute statistical metrics for values. Now that we have TupleBundler and compileForSamples() (at least in the TupleBundler branch), it would make sense to start leveraging this functionality to produce samples in the backend.
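The metrics themselves are straightforward once samples exist. A minimal sketch of what explainRow/explainCell might derive from a set of sampled values (names here are illustrative, not CTExplainer's actual helpers):

```scala
// Hypothetical sketch of per-cell metrics derived from sampled worlds
// (e.g. samples produced via compileForSamples()): sample mean, sample
// variance, and the fraction of samples agreeing with the modal value.
object SampleMetrics {
  def meanAndVariance(samples: Seq[Double]): (Double, Double) = {
    val n = samples.length.toDouble
    val mean = samples.sum / n
    val variance = samples.map(x => (x - mean) * (x - mean)).sum / n
    (mean, variance)
  }

  // "Confidence" of a sampled cell: how often the most common value appears.
  def agreement[A](samples: Seq[A]): Double =
    samples.groupBy(identity).values.map(_.length).max.toDouble / samples.length
}
```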

One quick caveat is that most databases will freak out when schemas change in the middle of a query, so this task is really dependent on #193 happening first.

Updated 26/05/2017 13:19

Convert TypeInference lens into an Adaptive Schema


When Arindam first created the TypeInference lens several years back, we didn't have Adaptive Schemas. We do now. TypeInference is really something that belongs as an AdaptiveSchema. One nifty consequence is that we might be able to do the type inference entirely in the backend db, something along the lines of:

    CREATE TABLE TYPES(name varchar, regexp varchar);

    CREATE LENS TYPE_CHOICE AS
      SELECT 'col' AS col, COUNT(*) AS count
      FROM (SELECT col FROM input LIMIT 10000) input, types
      WHERE regexp_match(types.regexp, input.col)
    WITH KEY_REPAIR(col, SCORE_BY(count))
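The query above votes by counting regexp matches per candidate type over a sample and letting KEY_REPAIR pick the winner. The same idea in plain Scala, with an illustrative (not Mimir's actual) TYPES table:

```scala
// Hypothetical sketch of regexp-voting type inference: count, for each
// candidate type, how many sampled values its pattern accepts, then pick
// the highest-scoring type (SCORE_BY(count), in spirit).
object TypeVotes {
  import scala.util.matching.Regex

  // Illustrative stand-in for the TYPES(name, regexp) table.
  val types: Seq[(String, Regex)] = Seq(
    "int"   -> "^-?[0-9]+$".r,
    "float" -> "^-?[0-9]+\\.[0-9]+$".r,
    "bool"  -> "^(?i)(true|false)$".r
  )

  def infer(sample: Seq[String]): String = {
    val counts = types.map { case (name, re) =>
      name -> sample.count(v => re.findFirstIn(v).isDefined)
    }
    counts.maxBy(_._2)._1
  }
}
```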

Updated 26/05/2017 13:17 1 Comment

Regular expression lens


A modifiable lens for doing data extraction using regular expressions. Syntax should follow the general pattern:

WITH EXTRACT(A, 'regular expression pattern', OUTPUT1, OUTPUT2, ...)

The output then becomes defined as a table FOO with columns A, B, C, …, OUTPUT1, OUTPUT2, …, where OUTPUT1, OUTPUT2, … are obtained by shredding column A according to the regular expression.

For example, let's say you've already loaded in the Detroit Crime Dataset. In case the link dies, an example location has the format:

    01100 S PATRICIA (42.282°, -83.1481°)

That is, there's some text signifying the street location, followed by lat & long data in parentheses. You could get these out by running the following query:

    CREATE LENS crime_with_coords AS
      SELECT * FROM crime_data
      WITH EXTRACT(location, '[^(]+\((-?[0-9.]+)°?, +(-?[0-9.]+)°\)', lat, long)

The resulting table would have the 15 columns in the original dataset, plus two new columns: LAT and LONG. Type inference functionality should also be built into the extract lens for regular expressions; the type inference model should be completely re-usable here.

Concretely, I'd like to see the following implemented as part of the lens:
- [ ] Create new columns by using regular expressions to shred existing values in the table.
- [ ] Infer types using the existing type inference model.
- [ ] Flag errors: Annotate the extracted value as being uncertain when the regular expression does not come up with a match. The explanation for this error should include the regular expression and the original value.
- [ ] Attempt to repair errors: If there is a low edit-distance 'fix' to either the input data or the regular expression that would make the one satisfy the other, apply the fix, but mark it. For example, something along these lines, but in Scala.
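The shredding and error-flagging steps could be sketched as follows (hypothetical names, not lens code; the pattern is the lat/long example from above):

```scala
// Hypothetical sketch of the shredding step: apply the lens's regular
// expression to a source column value and either emit the capture groups
// as new columns, or flag the row as uncertain when there is no match.
object ExtractSketch {
  import scala.util.matching.Regex

  // The lat/long pattern from the Detroit Crime Dataset example.
  val locPattern: Regex = """[^(]+\((-?[0-9.]+)°?, +(-?[0-9.]+)°\)""".r

  // Right(groups) on a match; Left(original value) marks an uncertain row
  // whose explanation should carry the pattern and the raw value.
  def shred(pattern: Regex, value: String): Either[String, List[String]] =
    pattern.findFirstMatchIn(value) match {
      case Some(m) => Right(m.subgroups)
      case None    => Left(value)
    }
}
```

In the real lens, the `Left` case is where the uncertainty annotation and the edit-distance repair attempt would hook in.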

Updated 21/05/2017 16:06
