Thursday, 5 September 2013

Removing corrupted JSON lines

I'm submitting a job to Hadoop and it keeps failing on the map part. I have
no GUI, which makes debugging very difficult, but after much trial and error
I've realised it's down to the input.
My input is some 1.5 million files in JSON format, each in the form of a
defaultdict (mapping keys to lists). If I load only about 2,000 files, the
program works fine. This leads me to believe that there is some corrupt
line (or something like that) somewhere in the input that is stopping
Hadoop.
Currently, I'm loading the JSON like this:

import json
import sys
from collections import defaultdict

myDict = defaultdict(list)
for line in sys.stdin:
    line = line.strip()
    try:
        myDict = json.loads(line)
    except ValueError:
        continue
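(A side note on the loop above: `myDict = json.loads(line)` rebinds `myDict` to the parsed line, discarding the defaultdict and every previously parsed line. Assuming each input line is a JSON object mapping keys to lists, as described above, a sketch that accumulates lines instead could look like this; the helper name `load_records` is mine, not from the original script:)

```python
import json
import sys
from collections import defaultdict

def load_records(lines):
    """Merge JSON lines of the form {"key": [values, ...]} into one
    defaultdict, skipping any line that fails to parse."""
    merged = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # corrupted line: skip it
        # Merge this line's lists into the accumulated dict
        # (assumes each line is an object of key -> list).
        for key, values in record.items():
            merged[key].extend(values)
    return merged

if __name__ == "__main__":
    myDict = load_records(sys.stdin)
```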
I then start iterating through the defaultdict:
for key, value in myDict.iteritems():
In debugging, I stripped everything out except the very first for loop,
where the JSON loads each line. It passed on the small input but failed on
the large one.
Is this segment correct? Is there another way I can search for corrupted
lines?
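One way to hunt for the bad input outside Hadoop is a small standalone checker that reports the line number, the parser's error message, and a snippet of every line that `json.loads` rejects (a sketch; the function name `find_bad_lines` is mine):

```python
import json
import sys

def find_bad_lines(lines):
    """Yield (line_number, error_message, snippet) for every line that
    fails to parse as JSON; blank lines are ignored."""
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
        except ValueError as exc:
            # Keep only the first 80 characters so huge lines stay readable.
            yield number, str(exc), line[:80]

if __name__ == "__main__":
    for number, error, snippet in find_bad_lines(sys.stdin):
        print("line %d: %s | %s" % (number, error, snippet))
```

Piping the same input through this script (e.g. via cat, exactly as Hadoop streaming would feed the mapper) should point at the offending lines without involving the cluster at all.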
Input to Hadoop:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -file $HOME/hadoop/positive_words.txt -file $HOME/hadoop/negative_words.txt \
    -file $HOME/hadoop/hadoop-mapper.py -mapper $HOME/hadoop/hadoop-mapper.py \
    -file $HOME/hadoop/hadoop-reducer.py -reducer $HOME/hadoop/hadoop-reducer.py \
    -input /smallDataset -output /parsedTweetsOutput
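Since Hadoop streaming only pipes records through the scripts via stdin/stdout, the mapper and reducer from the command above can also be exercised locally on a small slice of the data, which sidesteps the missing GUI (a sketch; `sample.json` is a placeholder file name, not from the post):

```shell
# Mimic the streaming job by hand: mapper, shuffle-sort, reducer.
cat sample.json \
  | python $HOME/hadoop/hadoop-mapper.py \
  | sort \
  | python $HOME/hadoop/hadoop-reducer.py
```

Any traceback the mapper throws shows up directly in the terminal instead of being buried in task logs.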
