Edit: the single command written out across multiple (continued) lines in the PowerShell "long" form - i.e. no aliases, and all parameter names spelled out explicitly.
Replying to self because I learned something new about PowerShell today :)
On the challenge GitHub page dfinke used irm (Invoke-RestMethod) instead of iwr (Invoke-WebRequest).
It turns out that irm actually looks at the content type of the response and implicitly converts from JSON to objects if the content type indicates JSON.
Thanks to dfinke.
The above PowerShell one-liner can thus be written shorter (omitting the explicit ConvertFrom-Json invocation):
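Something along these lines, for example (a sketch, not the exact command from the repo - the URL and the date-stamped output file name are placeholders):

# irm parses the JSON body itself and emits one PowerShell object per record
irm 'https://example.com/data.json' | ? { $_.creditcard } | select name, creditcard | epcsv -Path ("{0:yyyyMMdd}.csv" -f (Get-Date)) -NoTypeInformation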
Yeah. And unlike many of the Unixy solutions, it will actually produce correct results even if the input data contains quotes ("), apostrophes ('), commas (,), backslashes (\) and other characters that could throw parsing and text synthesis off.
* The iwr (Invoke-WebRequest) cmdlet reads the text file from the URI.
* The ConvertFrom-Json cmdlet reads the content as JSON and parses it according to JSON rules. The output is a sequence of objects (not JSON objects, but PowerShell objects) with properties corresponding to the JSON fields. This will respect specially encoded characters.
* The ? (Where-Object) cmdlet can filter on complex expressions, but in this case it just filters on the presence of a creditcard property.
* The select (Select-Object) cmdlet selects just the properties we want to export, out of all of the properties for each object.
* The epcsv (Export-Csv) cmdlet exports objects from the pipeline and outputs them to a file with proper escaping for e.g. quotes, apostrophes etc.
By not relying on home-cooked text munging you can actually produce a robust solution that is also readable.
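For reference, a sketch of what that pipeline looks like in the long form described above (the URL, the date-stamped output name and the intermediate variable are my own placeholders; the variable is only there so the array coming out of ConvertFrom-Json gets enumerated on older PowerShell versions):

# Download the JSON text and parse it into PowerShell objects
$people = Invoke-WebRequest -Uri 'https://example.com/data.json' |
    Select-Object -ExpandProperty Content |
    ConvertFrom-Json

# Keep only records with a credit card, pick the two columns, write proper CSV
$people |
    Where-Object -FilterScript { $_.creditcard } |
    Select-Object -Property name, creditcard |
    Export-Csv -Path ("{0:yyyyMMdd}.csv" -f (Get-Date)) -NoTypeInformation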
I recently found out about Microsoft Virtual Academy. There are some seriously good (free) online courses there, featuring Jeffrey Snover himself (author of the Monad Manifesto and inventor of PowerShell). For example this one: http://www.microsoftvirtualacademy.com/training-courses/gett...
The downside is that you do not control the pace yourself as much as with written resources.
There's a very friendly community at http://powershell.org where you can ask questions, get advice etc.
If this was a list of cat pictures, I would not consider posting to a third party cheating. But seeing as how it's credit card data, even though it is already "in the wild," it is still worth not posting to another random website.
Considering these are credit card numbers and you are trying to protect the account holders, is it really a smart idea to upload the names and numbers to some third party website?
Notepad++ search and replace using the regular expression
^\{"name":("[^"]+").+"creditcard":("[^"]+")\},?$
and replace it with
\1,\2
then sort the file using the TextFX plugin and delete all the lines without a credit card number at the bottom. Type the header line and done. Five to ten minutes, including some manual sanity checking. Of course, don't forget to look at the calendar and save with the correct file name.
Lowest effort solution: Post to stackexchange/github/reddit with the question "I've got a programming challenge. Can you find the shortest code to convert this JSON to CSV?" Then sit back and watch the answers roll in.
It's been years since I wrote perl. Any kind soul care to comment some of this code? It looks like the logic starts at the bottom, but I must admit defeat in truly understanding how it works!
Perl is water in the cupped hands of our minds. I have yet to meet someone who can keep it from slipping between their fingers without consistent effort.
* <> reads from file(s) specified as command-line arguments (data.json in this case)
* Option -0777 tells Perl to slurp the whole file
* Option -MJSON loads the JSON module
There are actually two dates, the 24th and the 25th, in the data sample.
You can do this in a manner that's both fairly comprehensible and succinct, for an arbitrary number of dates, using the JSON type provider in F#.
#r @"../Fsharp.Data.dll"
open FSharp.Data
open System
type PersonsData = JsonProvider<"../data.sample.json">
let dateTriple (d : PersonsData.Root) = d.Timestamp.Year, d.Timestamp.Month, d.Timestamp.Day
let info = PersonsData.Load ("./data.json")
let uniqueDates = info |> Array.map dateTriple |> set
let createCsv (d : PersonsData.Root seq) =
    d
    |> Seq.filter (fun p -> Option.isSome p.Creditcard)
    |> Seq.map (fun p -> sprintf "%s,%s" p.Name p.Creditcard.Value)
    |> String.concat "\n"

info
|> Seq.groupBy dateTriple
|> Seq.iter (fun ((y, m, d), data) ->
    IO.File.WriteAllText (sprintf "%d%02d%02d.csv" y m d, createCsv data))
With the below as a sample (though the dataset itself could have been used, since it's not so large):
I think it's usually better to do this with a reproducible script instead of manual editing in a text editor. There are too many times where the requirements change after the fact for "simple one-off" transformations.
I was going to say the same thing. I think of everything as black boxes, so choice of language is not as important as approach. I would use PHP because it's informal - it's a hybrid between shell and C, so expressiveness and speed come basically for free and there isn't much friction. Being able to iterate quickly often makes relationships apparent at a meta level. For example, a quick lookup of a parsing error might reveal that the data was generated by a standard tool, so I could drop what I'm doing and grab that instead. It's not so much about individual choices, but a way of attacking problems that, compounded over time, leads to a great deal of leverage.
They shouldn't have put each record on a single line; that makes it way too easy for text editors. Assume instead that this is all on a single line, or that there are arbitrary newlines inserted, and then use a proper parser/filter, e.g. jq in the shell as mentioned in the comments.
Replace },{ with },\r\n{ and done. Using more sophisticated tools than a text editor only makes sense if you know them well, the task is more complex or the file larger.
Sure, which is why I suggested making the task complex enough so that the "usual" *nix text tools aren't enough, or are significantly more complex to use than a real parser. I mean, it's basically a rehash of "how do I parse HTML with regexes".
I have to admit that I have only skimmed the answers, but I didn't notice much in the way of testing that the entire file conforms to the assumptions about how it is formatted that are implicitly built into each solution's design.
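For what it's worth, that kind of check is only a few lines in PowerShell - a sketch, where the input path is assumed and "name"/"creditcard" are the only field names taken from the thread:

# Flag any record that doesn't have the fields the one-liners silently assume.
$records = Get-Content -Path '.\data.json' -Raw | ConvertFrom-Json
$suspect = $records | Where-Object {
    $_.PSObject.Properties.Name -notcontains 'name' -or
    $_.PSObject.Properties.Name -notcontains 'creditcard'
}
if ($suspect) { Write-Warning ('{0} record(s) do not match the assumed shape' -f @($suspect).Count) }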
The first way I did it was to just read the file, parse the JSON, check for nulls, and print. Then I piped this to a file. That is too boring though.
The next way I did it was with Vim replaces. There might be sexier ways of doing this?
1) A big replace to get rid of all the junk.
%s/\%Vemail.*creditcard/creditcard/g
2) A big replace to fix nulls in the credit card
%s/\%V.*creditcard.*null.*//g
Then I could either change my code to not check for nulls or keep cleaning up this data until I have a CSV.
This Vim flow is the exact thing I do every week or two at work when building the Chrome HSTS preload list into our product. Anyone know the sexier ways to make this really really fun?
I should note, for those not super familiar with Vim who may try this: my Vim commands are applied over a visual selection (the JSON string), since this was in the same file as my code.
The one thing that was not fully specified: should the filename in YYYYMMDD.csv be the date that the program is run, or should a file of that spec be created for each date, grouping the lines in the .json file by their date field? If it is the latter, none of the GitHub solutions handle it correctly.
My PostgreSQL 9.3 solution using json_populate_recordset and @cjauvin's Python solution are the only two which group by date. It wasn't much harder, but it nudges it out of shell one-liner territory.
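For comparison, the grouping interpretation is not much extra work in PowerShell either - a sketch, assuming a local data.json and a "timestamp" field in a format that [datetime] can parse:

$records = Get-Content -Path '.\data.json' -Raw | ConvertFrom-Json
$records |
    Where-Object { $_.creditcard } |
    Group-Object -Property { ([datetime]$_.timestamp).ToString('yyyyMMdd') } |
    ForEach-Object {
        # one output file per date actually present in the data
        $file = '{0}.csv' -f $_.Name
        $_.Group | Select-Object -Property name, creditcard | Export-Csv -Path $file -NoTypeInformation
    }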
The R example posted on the GitHub thread is very nice. [1]
I use R every day, often parsing JSON and writing CSV files. The solution posted by mrdwab is much more succinct than my go-to method, especially the use of pipes. I've read about pipes in the dplyr/magrittr packages before, but this is the first time I've seen them used and thought they made complete sense.
I agree, the solution in R seems by far the most simple and intelligible. But then R is specifically designed to do this kind of thing, so that's not too surprising.
I am disappointed in this challenge on two levels: (1) A great many solutions fail to actually produce proper CSV: what would you do if any of the names or arbitrary credit card number inputs had quotes or commas in them? It's a big waste of time to roll your own, and you would likely have failed to prevent fraud without even realizing it. (2) Our JSON-dumping hackers didn't put quotes or commas in the strings to foil do-gooders.
Looks like you completely missed the point of the "challenge". This is reproducing a real-world situation to figure out what solutions different devs will think of first.
Though I guess "complain about the situation" is a valid answer to that question.
In 2015, we continue to develop incorrect CSV parsing and production when there are ready solutions in the wild for the spec (http://www.rfc-editor.org/rfc/rfc4180.txt - 2005), such as Ruby's csv package, Perl's Text::CSV, CL's CL-CSV, and so forth. It's a solved problem. I quite understand this challenge, but this issue is important to me because this same print "%s,%s\n" stuff shows up in the wild from professional developers who have the time on their hands to use the right tool, some of which I have personally worked with. Perhaps, like in this challenge, they are under pressure to get their feature finished, and this is the first thing they think of because, after all, it's just "comma-separated values".
This challenge is especially amusing in that, were it a real-world situation, it would involve people's financial welfare, and developers in a rush could very well have screwed it up. Isn't that cause for worry?
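To illustrate the point in PowerShell with a made-up record (nothing from the actual data set): a real CSV emitter quotes the field and doubles the embedded quote, which the printf-style approach silently gets wrong.

[pscustomobject]@{ name = 'O"Brien, Pat'; creditcard = '1234-5678-9012-3456' } |
    ConvertTo-Csv -NoTypeInformation
# "name","creditcard"
# "O""Brien, Pat","1234-5678-9012-3456"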
That did cross my mind - my solution used a full json parser (jq), and I did a quick grep for commas in name or creditcard fields before using a "%s,%s"-style solution to generate the CSV - if there had been any present, I would have fallen back to a slightly longer form of python -c 'import csv...'
Single line, no assumptions about data previously stored, no invalid CSV with quotes getting the entire line, and no useless spaces at the beginning of each line:
My first thought was to just manipulate it in Sublime Text, the file isn't too large to work on in memory. I also noticed the data seemed to be very consistently formatted.
Search "name":".+?".*?"creditcard":".+?" to grab every line with non-empty Name and CC. Alt+Enter to edit all lines at once. Remove all the other data and json syntax. This was pretty easy because "name" is always the first field on the line and "creditcard" always the final field.
When I see CSV I think RFC4180. It's not a real "standardised standard", but good enough to be one, and probably what a lot of tools will produce and consume if asked to process CSV.
In my experience, CSV is very fuzzy. Do you include headers? What character is used for the separator? Comma? Semicolon? Tab? Are fields enclosed in quotes? How are quotes within quotes escaped? What encoding do you use for non-ASCII symbols? &c &c. Without a sample or reference, these are all very common variables.
What character is used for separator? Comma? Semi colon? Tab?
It isn't Comma-Separated-Values if fields are not separated by commas, but the others are valid points. Failure to handle quoting correctly is the most common form of broken CSV support I've seen. Encoding is usually specified out-of-band.
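A tiny made-up example (PowerShell here, but any RFC 4180-ish CSV library behaves the same way) - a quoted field containing the delimiter, parsed with a non-default separator:

@(
    'name;creditcard'
    '"Smith; Jane";1111-2222-3333-4444'
) | ConvertFrom-Csv -Delimiter ';'
# yields one object: name = 'Smith; Jane', creditcard = '1111-2222-3333-4444'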