A modifiable lens for doing data extraction using regular expressions. Syntax should follow the general pattern:
CREATE LENS FOO
AS SELECT A, B, C, ... FROM BAR
WITH EXTRACT(A, 'regular expression pattern', OUTPUT1, OUTPUT2, ...)
The output is then defined as a table FOO with columns A, B, C, …, OUTPUT1, OUTPUT2, …
OUTPUT1, OUTPUT2, … are obtained by shredding column A according to the regular expression. For example, let’s say you’ve already loaded the Detroit Crime Dataset.
In case the link dies, an example location has the format:
01100 S PATRICIA (42.282°, -83.1481°)
That is, there’s some text signifying the street location, followed by the lat & long data in parentheses. You could extract these by running the following query:
CREATE LENS crime_with_coords
AS SELECT * FROM crime_data
WITH EXTRACT(location, '[^(]+\((-?[0-9.]+)°?, +(-?[0-9.]+)°?\)', lat, long)
The resulting table would have the 15 columns in the original dataset, plus two new columns: lat and long.
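To make the intended per-row behavior concrete, here is a minimal sketch (in Python, purely for illustration; the lens itself would be implemented in Scala) of the shredding the EXTRACT clause would perform. The function name and the `(None, None)` failure convention are assumptions, not part of the spec:

```python
import re

# The pattern from the example query: street text, then "(lat°, long°)".
PATTERN = re.compile(r'[^(]+\((-?[0-9.]+)°?, +(-?[0-9.]+)°?\)')

def extract(location):
    """Shred one value of the `location` column into (lat, long) strings.

    Returns (None, None) when the regex does not match; in the lens this
    is the case that would be annotated as uncertain.
    """
    m = PATTERN.match(location)
    if m is None:
        return (None, None)
    return m.groups()

print(extract('01100 S PATRICIA (42.282°, -83.1481°)'))
# → ('42.282', '-83.1481')
```

Each extracted group becomes one of the named output columns (lat, long); the original columns pass through untouched.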
Type inference should also be built into the extract lens; the existing type inference model should be completely reusable here.
Concretely, I’d like to see the following implemented as part of the lens:
- [ ] Create new columns by using regular expressions to shred existing values in the table.
- [ ] Infer types using the existing type inference model.
- [ ] Flag errors: Annotate the extracted value as being uncertain when the regular expression does not come up with a match. The explanation for this error should include the regular expression and the original value.
- [ ] Attempt to repair errors: If there is a low edit-distance ‘fix’ to either the input data or the regular expression that would make the input match the pattern, apply the fix, but mark the result as uncertain. For example, something like https://hackerboss.com/approximate-regex-matching-in-python/, but in Scala.
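One possible shape for the repair step, again sketched in Python for brevity: fall back to trying single-character edits of the input when the exact match fails, and report an “uncertain” flag alongside the extracted groups. This toy version only tries deletions of the input (real approximate regex matching, as in the linked article, also handles insertions and substitutions); the function name, edit budget, and return convention are all assumptions:

```python
import re

def fuzzy_extract(pattern, value, max_edits=1):
    """Match `value` against `pattern`, allowing up to `max_edits`
    single-character deletions of the input as a cheap repair.

    Returns (groups, uncertain): groups is None when no repair works,
    and uncertain is True whenever the exact match failed.
    """
    rx = re.compile(pattern)
    m = rx.fullmatch(value)
    if m:
        return m.groups(), False
    if max_edits >= 1:
        # Try deleting each character once; the first repaired value
        # that matches wins, and the row is flagged as uncertain.
        for i in range(len(value)):
            repaired = value[:i] + value[i + 1:]
            m = rx.fullmatch(repaired)
            if m:
                return m.groups(), True
    return None, True

pat = r'\((-?[0-9.]+)°, (-?[0-9.]+)°\)'
print(fuzzy_extract(pat, '(42.282°,, -83.1481°)'))
# → (('42.282', '-83.1481'), True)  — one stray comma deleted
```

In the lens, the uncertain flag would drive the error annotation from the previous checklist item, with the explanation recording the pattern, the original value, and the applied fix.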