Version 3 (modified by dennisw, 13 years ago)

Data validation

A fairly large amount of data is involved in this project, and it's not uncommon for some of it to be incomplete. Possible scenarios are listed here, together with their possible solutions. Every scenario assumes that you have measured a 'regular and complete' round and that your measurement equipment is fully functional (in other words, no half-broken GPS equipment or the like).

Invalid coordinate(s)

Detecting

Detecting invalid coordinates doesn't have to be very complicated. Values such as '0.000' or 'null' should be easy to catch. The problem lies with 'valid but invalid' values. Say you have:

latitude, longitude
52.1000, 4.1000
52.2000, 100.0000
52.3000, 4.2000

It's obvious the longitude '100.0000' is out of place here. Since this project focuses on Leiden, a boundary can be set around the city, and every value that falls outside it can be marked as invalid.
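
A boundary check like the one described above could be sketched as follows. The function name and the limits are assumptions made for illustration: the limits reuse the 52-55 / 4-7 range from the next example as placeholders, and a real boundary around Leiden would be much tighter.

```python
from typing import Optional

# Placeholder bounds (assumption): latitude 52..55, longitude 4..7.
LAT_RANGE = (52.0, 55.0)
LON_RANGE = (4.0, 7.0)


def is_valid_coordinate(lat: Optional[float], lon: Optional[float]) -> bool:
    """Return True if both values are present, non-zero and inside the boundary."""
    if lat is None or lon is None:
        return False
    if lat == 0.0 or lon == 0.0:
        # '0.000' is treated as a missing reading, as in the text.
        return False
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


rows = [(52.1000, 4.1000), (52.2000, 100.0000), (52.3000, 4.2000)]
flags = [is_valid_coordinate(lat, lon) for lat, lon in rows]
```

With the three rows from the example, only the second row is rejected.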

The other problem is that a value can be invalid but still lie inside the boundary:

latitude, longitude (valid range: latitude 52-55, longitude 4-7)
52.1000, 4.1000
54.5000, 4.2000
52.2000, 4.3000

The second latitude seems invalid, but in this case it is still within the valid range. It might still be possible to spot this value. You could calculate the average offset in lat/lon between consecutive measurements, and mark every value whose offset exceeds that average as invalid. The catch is that this might be CPU-intensive, since a lot of calculation is needed.
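
One possible reading of the average-offset idea is sketched below. It is an interpretation, not a fixed method: a point is flagged only when the jump to both of its neighbours is larger than the average jump over the whole track. The function name and the sample track are made up for illustration, and the track is longer than the three-row example because an average over two jumps says very little.

```python
import math


def flag_jumps(points):
    """Return indices of interior points whose offset to BOTH neighbours
    exceeds the average offset between consecutive points."""
    # Offset (in degrees) between each pair of consecutive points.
    steps = [math.hypot(b[0] - a[0], b[1] - a[1])
             for a, b in zip(points, points[1:])]
    if not steps:
        return []
    average = sum(steps) / len(steps)
    # steps[i - 1] leads into point i, steps[i] leads out of it.
    return [i for i in range(1, len(points) - 1)
            if steps[i - 1] > average and steps[i] > average]


# Made-up track with one suspicious latitude at index 3.
track = [(52.10, 4.10), (52.11, 4.11), (52.12, 4.12),
         (54.50, 4.13), (52.14, 4.14), (52.15, 4.15)]
suspects = flag_jumps(track)
```

This version makes a single pass over the track, so its cost grows linearly with the number of rows. The first and last points are never flagged because they have only one neighbour.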

Solutions

Let's say there is a missing coordinate like the following:

latitude, longitude
52.1000, 4.1000
52.2000, invalid value
52.3000, 4.2000

A missing value like this can easily be estimated by taking the last known value before it and the first known value after it, and using their average as a replacement. The calculated value shouldn't be far off from the real one (unless you made a strange or unexpected turn at that specific point).
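
As a minimal sketch of this replacement, assuming invalid values have already been turned into None (the function name is hypothetical):

```python
def fill_single_gaps(values):
    """Replace a lone None with the mean of its two neighbours.
    Longer gaps and gaps at the ends are left untouched."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        if (values[i] is None
                and values[i - 1] is not None
                and values[i + 1] is not None):
            filled[i] = (values[i - 1] + values[i + 1]) / 2
    return filled


longitudes = [4.1000, None, 4.2000]
result = fill_single_gaps(longitudes)
```

For the example above, the missing longitude becomes the midpoint of 4.1000 and 4.2000.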

A harder case like this might need some thinking:

latitude, longitude
52.1000, 4.1000
invalid values (100 rows)
53.1000, 5.1000

Here 100 rows of coordinates are missing. First of all, the impact of these missing values depends on the speed at which you traveled while measuring. If your first coordinate was at the NE point of Leiden and the last at the SW point, there is quite a large gap. (A side note: you should measure more frequently if this happens.) For large gaps like these, it's hard to calculate an expected route. Even if an occasional valid value were measured among the missing ones (say a ratio of 10 missing to 1 valid), it might be wise to just ignore these values, since it's hard to reconstruct a correct route over long distances.

Say we still have these 100 missing rows, but your first coordinate is at the start of a street and the last is at the end of that same street. This is more likely to occur when you measure at reasonable intervals. In this case, it can't hurt to calculate the estimated route. However, if you followed an 'L'-shaped street, the estimated coordinates are likely to cut through houses and the like.
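
The distinction between the two cases could be made with a distance check before interpolating. The sketch below is an assumption-laden illustration: the 300 m limit is an arbitrary stand-in for "roughly one street", the function names are hypothetical, and the estimated route is a straight line, so it has the 'L'-shape problem described above.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def bridge_gap(start, end, missing, max_gap_m=300.0):
    """Linearly interpolate `missing` points between start and end, or
    return None when the endpoints are further apart than max_gap_m."""
    if haversine_m(*start, *end) > max_gap_m:
        return None  # gap too large to guess a route
    n = missing + 1
    return [(start[0] + (end[0] - start[0]) * k / n,
             start[1] + (end[1] - start[1]) * k / n)
            for k in range(1, n)]


# The large gap from the example is refused ...
large = bridge_gap((52.1000, 4.1000), (53.1000, 5.1000), 100)
# ... while a short, made-up stretch is filled in.
short = bridge_gap((52.1600, 4.4900), (52.1610, 4.4910), 3)
```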

Invalid signal strength

Detecting

Again, the '0.000' and 'null'-like values won't be hard to spot. Beyond that, it's hard to find 'valid but invalid' values. You might measure a 50% strength at exactly the same position for five consecutive days, and 25% on the sixth. This doesn't mean the 25% is invalid, since a lot of factors can influence the strength you measure. Therefore it might be enough to just set a valid range, e.g. -100 dBm to +35 dBm, and go from there.
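
A range check along these lines might look like the sketch below. The function name is hypothetical, and treating a reading of exactly 0 as missing follows the '0.000' convention in the text rather than anything about the radio itself.

```python
DBM_MIN, DBM_MAX = -100.0, 35.0  # range suggested in the text


def parse_signal(raw):
    """Return the reading as a float, or None if it is missing,
    zero or outside the valid range."""
    try:
        dbm = float(raw)
    except (TypeError, ValueError):
        return None  # 'null', None and other non-numeric input
    if dbm == 0.0 or not (DBM_MIN <= dbm <= DBM_MAX):
        return None
    return dbm


readings = [parse_signal(v) for v in ["-80", "null", "0.000", "-120", "-50"]]
```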

Solutions

Let's assume the following is measured from the same access point. (Right now, signal strength is '100 + signal_dbm'.)

signal_dbm, strength %
-80, 20%
invalid values
-50, 50%
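
The conversion noted above, together with a plain average of the readings on either side of a gap, could be sketched like this (both function names are made up for illustration):

```python
def strength_percent(signal_dbm):
    """Strength as currently defined in the text: 100 + signal_dbm."""
    return 100 + signal_dbm


def average_dbm(before, after):
    """Midpoint of the last valid reading before and the first after a gap."""
    return (before + after) / 2


estimate_dbm = average_dbm(-80, -50)
estimate_percent = strength_percent(estimate_dbm)
```

For the two valid rows in the example, this gives -65 dBm, which corresponds to 35%.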

Again, it might be wise to calculate an average to replace the invalid values. But say someone else measures the same access point around the same location and receives valid values. Our average would be 35%, but he gets 90%, or maybe 5%. In that case it might be better to look at the history of the access point (assuming there is one; if not, there probably will be one in time). You could take the most recent dBm values measured at around the same location:

signal_dbm (older), strength % (older)
-75, 25%
-90, 10%
-60, 40%

The older first and third values don't differ much from the new ones, so it should be fairly safe to take the old second value and use it for the new measurement.
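
This history-based replacement can be sketched as follows. It assumes the old and new series are aligned row by row and that "not a lot of difference" means within 10 dB. Both the alignment and the tolerance are assumptions, and the function name is hypothetical.

```python
def fill_from_history(new, old, tolerance_db=10):
    """Fill None entries in `new` with the value from `old`, but only when
    the readings on both sides agree with the older series."""
    filled = list(new)
    for i in range(1, len(new) - 1):
        if new[i] is not None or old[i] is None:
            continue
        neighbours = [(new[i - 1], old[i - 1]), (new[i + 1], old[i + 1])]
        if all(n is not None and o is not None and abs(n - o) <= tolerance_db
               for n, o in neighbours):
            filled[i] = old[i]
    return filled


new_dbm = [-80, None, -50]
old_dbm = [-75, -90, -60]
patched = fill_from_history(new_dbm, old_dbm)
```

With the values from the two tables, the neighbours differ by 5 dB and 10 dB, so the old second value is reused.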

Invalid SSID/BSSID

Detecting

As before, the 'null'-like values are easy to spot.

Solutions

Missing an SSID: Assuming people don't tinker with their BSSID, a missing SSID value can be easily solved if the BSSID is already in the database with an SSID. If the access point hasn't been measured before, and the SSID is missing in the measurement as well as in the database, a placeholder like 'unnamed' can be set. If the SSID is measured later, the placeholder can simply be overwritten with the new one.
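
A minimal sketch of this lookup, with a plain dict standing in for the database (the function name and the sample BSSIDs and SSIDs are made up):

```python
PLACEHOLDER_SSID = "unnamed"


def resolve_ssid(bssid, measured_ssid, known):
    """`known` maps BSSID -> SSID and stands in for the database."""
    if measured_ssid:
        # A measured name overwrites whatever was stored, placeholder included.
        known[bssid] = measured_ssid
        return measured_ssid
    # No SSID measured: fall back to the stored name or the placeholder.
    ssid = known.get(bssid, PLACEHOLDER_SSID)
    known[bssid] = ssid
    return ssid


db = {"00:11:22:33:44:55": "example-node"}
first = resolve_ssid("00:11:22:33:44:55", None, db)
second = resolve_ssid("aa:bb:cc:dd:ee:ff", None, db)
third = resolve_ssid("aa:bb:cc:dd:ee:ff", "late-name", db)
```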

Missing a BSSID: If we have an SSID but no BSSID, we can't simply look for a similar SSID in the database and use its BSSID as a replacement, since SSIDs aren't unique. What might work: check the coordinates. If you measured the SSID at 50.000/4.000, and there is a two-day-old entry in the database matching the SSID and measured around the same location, there's a fair chance they match. In that case you can take the BSSID and use it for your freshly measured SSID. Otherwise it could help to set a placeholder BSSID (same idea as before).
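
The coordinate check could be sketched as below. The 50 m radius, the 7-day age limit, the row layout and the function names are all assumptions. The sketch returns nothing when several different BSSIDs match, since SSIDs aren't unique.

```python
import math
from datetime import datetime, timedelta


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation, adequate over short distances."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371000.0 * math.hypot(dx, dy)


def guess_bssid(ssid, lat, lon, when, history,
                max_m=50.0, max_age=timedelta(days=7)):
    """`history` is a list of dicts with keys ssid, bssid, lat, lon, seen."""
    candidates = {
        row["bssid"] for row in history
        if row["ssid"] == ssid
        and when - row["seen"] <= max_age
        and approx_distance_m(lat, lon, row["lat"], row["lon"]) <= max_m
    }
    # Only accept an unambiguous match; otherwise use a placeholder instead.
    return candidates.pop() if len(candidates) == 1 else None


now = datetime(2011, 1, 10, 12, 0)
history = [{"ssid": "example-node", "bssid": "00:11:22:33:44:55",
            "lat": 50.0001, "lon": 4.0001, "seen": now - timedelta(days=2)}]
match = guess_bssid("example-node", 50.000, 4.000, now, history)
```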

Missing both: When both are missing, it's more a question of whether you want to display unnamed access points on the map. Right now, only the Wireless Leiden node SSIDs are used. All other access point SSIDs and BSSIDs are just stored in the database for possible later use. It might be wise to just use placeholder names again, since it's wasteful to discard good measured data.
