Spark has sliding window operations:

    windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
Search for "window operations" on http://spark.apache.org/docs/latest/streaming-programming-gu.... Unless you meant something different?

You can replay streams against Spark, too. streamingContext.textFileStream will stream data from files dumped in a directory - to replay them, just dump them there again.
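Something like this (the directory path is illustrative):

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="ReplayFromFiles")
  ssc = StreamingContext(sc, 10)

  # textFileStream watches the directory and ingests any file that newly
  # appears in it; it ignores files it has already processed, so to replay
  # old data, copy the files back in under fresh names.
  lines = ssc.textFileStream("/tmp/stream-in")
  lines.count().pprint()

  ssc.start()
  ssc.awaitTermination()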




Yeah, that's exactly what I used in the Spark version I threw together yesterday.

reduceByKeyAndWindow is what my simple non-Spark proof of concept is missing; otherwise the code is basically the same.

My non-Spark code is essentially:

  from pprint import pprint

  while True:
      data = {}
      for line in input:               # parse, aggregate, is_bad are my helpers
          rec = parse(line)
          data = aggregate(data, rec)
      data = dict(filter(is_bad, data.items()))  # keep only the offending entries
      pprint(data)
The Spark version is 99% the same code:

  # add/sub here would be e.g. operator.add and operator.sub
  (lines.map(parse)
        .reduceByKeyAndWindow(add, sub, 3600, 60)
        .filter(is_bad)
        .pprint())
Figuring out how to do stateful processing is a little tricky, but updateStateByKey seems to do what I need: I need to dedup the output per key for time period t. Though some recommendations are just to use something like Redis or Memcached, which would also work.
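Not my actual code, but roughly what I have in mind with updateStateByKey (here "bad" stands for the filtered DStream above; the per-time-period expiry, or the Redis/Memcached route, is left out):

  # Flag each key the first time it appears; later batches see it as
  # "old" and it gets filtered out. Needs ssc.checkpoint(...) enabled.
  def dedup(values, state):
      if state is None and values:
          return "new"    # first occurrence of this key
      if state is not None:
          return "old"    # already reported once; keep it suppressed
      return None         # no data and no state: nothing to track

  fresh = bad.updateStateByKey(dedup).filter(lambda kv: kv[1] == "new")
  fresh.pprint()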



