Friday, 23 August 2013

Strategy for normalising and interpreting location data

I'm a bit stuck with a problem involving the normalisation of location
input data from the user (which comes from a third party).
Aim
To logically break down and interpret the user's location input field and
determine whether it lists a single location or multiple locations, from
one or more countries.



The Problem
The data I receive from the user input is messy and has no logical
structure or consistency, as shown below. Google limits the number of
geocoding requests per day, so I have to use them sparingly. I want to
process the user's location input efficiently and send the correct
geocoding query to Google to get the right result.



Data Sources
The data is irregular and could be supplied in any of the following
formats:
"London, UK": Alternative format
"England, London": Reversed order
"London": Generic location
"London Sheffield, Newcastle": Three separate locations in the same
country, without consistent commas
"London, Sales, Sales Assistant": Non-location content inserted
"London [NOT SPECIFIED]": Other non-location content inserted, with
non-alphabetic characters and not separated by commas
"London, Washington, Brazil, England": Mix of unrelated locations,
including cities and countries
"Washington, London, Kent": Mix of places within a single country



Proposed Solution
Step 1: Break down data
Put each separate word in an array.
Step 2: Sanitize data
Strip out invalid chars, commas, additional spaces etc.
Strip out any words that appear in a stoplist.txt (like job, sales, in,
at, etc).
Step 3: Determine if valid location
See if each individual array item has been geocoded before; if not,
geocode it and store the result.
Log any words which have been geocoded with no result, and add these to
the stoplist file to avoid pointless geocodes.
Step 4: Interlink values
Check whether one place's coordinates fall within the coordinate range of
another array item. If they do, we know one is the parent of the other and
we treat them as a single item.
London + England -> London's coordinates fall within the coordinate range
of England, so we know it is a single location, not two separate ones.
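
The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: the stoplist contents, the FAKE_GEOCODER table (which stands in for the real geocoding service), the class and function names, and all coordinates are assumptions made up for the example. It assumes the geocoder returns a point plus a bounding box for each place.

```python
import re

# Hypothetical stoplist; the post keeps these words in stoplist.txt.
STOPLIST = {"sales", "assistant", "job", "in", "at", "not", "specified"}

# Stand-in for the real geocoder: name -> ((lat, lng), (south, west, north, east)).
# Coordinates are rough illustrative values, not authoritative data.
FAKE_GEOCODER = {
    "london": ((51.51, -0.13), (51.28, -0.51, 51.69, 0.33)),
    "england": ((52.36, -1.17), (49.90, -6.40, 55.80, 1.80)),
    "sheffield": ((53.38, -1.47), (53.30, -1.80, 53.50, -1.32)),
}


def tokenise(raw):
    """Steps 1-2: split into words, strip non-letters, drop stoplisted words."""
    words = re.findall(r"[A-Za-z]+", raw)
    return [w.lower() for w in words if w.lower() not in STOPLIST]


class GeocodeCache:
    """Step 3: geocode each word at most once and remember misses."""

    def __init__(self, geocode):
        self.geocode = geocode  # callable: word -> (point, box) or None
        self.known = {}
        self.calls = 0          # number of (quota-limited) geocoder calls made

    def lookup(self, word):
        if word in STOPLIST:
            return None
        if word not in self.known:
            self.calls += 1
            result = self.geocode(word)
            if result is None:
                STOPLIST.add(word)  # avoid paying for this word again
                return None
            self.known[word] = result
        return self.known[word]


def contains(box, point):
    """True if point lies inside box (ignores boxes crossing the antimeridian)."""
    south, west, north, east = box
    lat, lng = point
    return south <= lat <= north and west <= lng <= east


def interpret(raw, cache):
    """Step 4: drop any place whose box contains another place's point."""
    places = {}
    for word in tokenise(raw):
        hit = cache.lookup(word)
        if hit is not None:
            places[word] = hit
    return [
        name
        for name, (_, box) in places.items()
        if not any(
            other != name and contains(box, point)
            for other, (point, _) in places.items()
        )
    ]


cache = GeocodeCache(FAKE_GEOCODER.get)
print(interpret("England, London", cache))         # ['london']
print(interpret("London [NOT SPECIFIED]", cache))  # ['london']
print(interpret("London Sheffield, Foo", cache))   # ['london', 'sheffield']
```

Here the parent (England) is dropped and the more specific place (London) is kept, which is one possible reading of "treat them as a single item". Note that splitting on individual words, as Step 1 describes, would also break multi-word names such as New York into two tokens, so a real version would probably need to try adjacent words together before geocoding them separately.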



Issues
Issue 1: "Kent, London, Sussex". Technically, there is a Kent in the USA,
and it is the first one that comes up when you type it into Google Maps.
However, since all the other results are in England, it is extremely
unlikely that the result we want is the USA one.
Issue 2: "England, Washington, New York". There is a Washington in
England, but it doesn't seem likely that the English one is meant here.
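
One possible heuristic for Issue 1, sketched below, is to fetch every candidate match for each name and keep the combination whose members lie closest together. This is an assumption about how the ambiguity could be handled, not something the proposed solution already does. The CANDIDATES table, its labels and its coordinates are hypothetical, rough values standing in for real geocoder results.

```python
from itertools import combinations, product
from math import asin, cos, radians, sin, sqrt

# Hypothetical candidate matches per name: (label, lat, lng).
# Coordinates are rough illustrative values.
CANDIDATES = {
    "kent": [("Kent, Ohio, USA", 41.15, -81.36), ("Kent, England", 51.28, 0.52)],
    "london": [
        ("London, England", 51.51, -0.13),
        ("London, Ontario, Canada", 42.98, -81.25),
    ],
    "sussex": [("Sussex, England", 50.93, 0.10)],
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * 6371 * asin(sqrt(h))


def resolve(names):
    """Pick one candidate per name so the summed pairwise distance is smallest."""
    best, best_cost = None, float("inf")
    for combo in product(*(CANDIDATES[n] for n in names)):
        cost = sum(
            haversine_km(p[1:], q[1:]) for p, q in combinations(combo, 2)
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return [label for label, _, _ in best]


print(resolve(["kent", "london", "sussex"]))
# ['Kent, England', 'London, England', 'Sussex, England']
```

The brute-force search grows quickly with the number of names and candidates, so it only suits short inputs like the ones listed above. Proximity alone may also not settle Issue 2, where England and New York pull the choice of Washington in opposite directions.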



Question
Is my proposed solution of breaking the input down into separate words and
relinking them the most logical approach? Any help or advice would be much
appreciated; I know it's not an easy problem.
