Changes between Version 1 and Version 2 of datavali


Timestamp: Jun 23, 2011, 2:23:17 PM
Author: dennisw
= Data validation =
There is a fairly large amount of data involved in this project. It's not uncommon for data to be incomplete. Possible scenarios are listed here, together with their possible solutions. For every scenario, let's assume you have measured a 'regular and complete' round, and that your measurement equipment is fully functional (in other words, no half-broken GPS equipment or the like).

== Invalid coordinate(s) ==
=== Detecting ===
Detecting invalid coordinates doesn't have to be very complicated. Values such as '0.000' or 'null' should be easy to catch. The problem lies with values that look valid but aren't. Say you have:
{{{
latitude, longitude
52.1000, 4.1000
52.2000, 100.0000
52.3000, 4.2000
}}}
It's obvious the longitude '100.0000' is out of place here. Since this project's focus is Leiden, a boundary could be set, and every value that falls outside that boundary can be marked as invalid.
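The boundary check itself is cheap. A minimal sketch is shown below; the exact bounding-box values are illustrative assumptions chosen to match the example data, not surveyed limits for Leiden:

```python
# Illustrative bounding box; the exact limits are an assumption,
# not real surveyed boundaries for Leiden.
LAT_MIN, LAT_MAX = 52.0, 52.5
LON_MIN, LON_MAX = 4.0, 4.6

def within_bounds(lat, lon):
    """True if the coordinate falls inside the configured boundary."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

rows = [(52.1, 4.1), (52.2, 100.0), (52.3, 4.2)]
# keep only the rows that fail the check, i.e. mark them as invalid
invalid = [r for r in rows if not within_bounds(*r)]  # [(52.2, 100.0)]
```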

The other problem is that a value can be invalid yet still fall inside the boundary:
{{{
latitude, longitude (range 52-4/55-7)
52.1000, 4.1000
54.5000, 4.2000
52.2000, 4.3000
}}}
The second latitude looks invalid, but in this case it is still within the valid range.
It might be possible to spot such a value, though: you could calculate the average offset between consecutive coordinates, and mark everything that deviates far from it as invalid. The catch is that this can be heavy CPU-wise, since a lot of calculation is needed.
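A cheaper variant of that idea is to compare each point against the midpoint of its two neighbours rather than a global average. The function name and the threshold below are assumptions for illustration, not part of the project:

```python
def flag_outliers(points, threshold=1.0):
    """Flag points that deviate strongly from the midpoint of their
    neighbours. 'threshold' (in degrees) is a hypothetical tuning knob."""
    flagged = []
    for i in range(1, len(points) - 1):
        mid_lat = (points[i - 1][0] + points[i + 1][0]) / 2
        mid_lon = (points[i - 1][1] + points[i + 1][1]) / 2
        # crude lat/lon deviation; no real geodesic distance needed here
        deviation = abs(points[i][0] - mid_lat) + abs(points[i][1] - mid_lon)
        if deviation > threshold:
            flagged.append(i)
    return flagged

points = [(52.1, 4.1), (54.5, 4.2), (52.2, 4.3)]
flag_outliers(points)  # → [1]: the 54.5 latitude is flagged
```

This only touches each point once, so it avoids most of the CPU cost of a full average-offset pass.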

=== Solutions ===
Let's say there is a missing coordinate like the following:
{{{
latitude, longitude
52.1000, 4.1000
52.2000, invalid value
52.3000, 4.2000
}}}
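For a single gap like the one above, an obvious repair (a sketch; the page doesn't prescribe this exact approach) is to interpolate linearly between the valid neighbours:

```python
def fill_gaps(values):
    """Linearly interpolate over runs of None, assuming the first and
    last entries are valid (a sketch, not production code)."""
    out = list(values)
    i = 1
    while i < len(out):
        if out[i] is None:
            j = i
            while out[j] is None:  # find the next valid value
                j += 1
            lo, hi = out[i - 1], out[j]
            span = j - (i - 1)
            for k in range(i, j):
                # evenly space the missing values between lo and hi
                out[k] = lo + (hi - lo) * (k - (i - 1)) / span
            i = j
        i += 1
    return out

fill_gaps([4.1000, None, 4.2000])  # the middle value becomes ~4.15
```

Applied to the longitude column above, the invalid value would be replaced by roughly 4.1500, which is a sensible guess for a single missing row.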
     
A harder case like this might need some thinking:
{{{
latitude, longitude
52.1000, 4.1000
invalid values (100 rows)
53.1000, 5.1000
}}}
     
For large gaps like these, it's hard to calculate an expected route. Even if the occasional valid value turns up inside the gap (say a ratio of 10 missing to 1 valid), it might be wise to just ignore these values, since it's hard to reconstruct a correct route over long distances.

Say we still have these 100 missing rows, but the first coordinate is at the start of a street and the last at the end of that same street. This is more likely to occur when you measure at fair intervals. In this case, it can't hurt to calculate the estimated route. However, if the street has an 'L' shape that you followed, interpolated coordinates are likely to intersect with houses and such.

== Invalid signal strength ==
You might encounter missing or invalid signal values. Let's assume the following was measured from the same access point. (Right now, signal strength is calculated as '100 + signal_dbm'.)
{{{
signal_dbm, strength %
-80, 20%
invalid values
-50, 50%
}}}
Again, it might be wise to calculate an average to replace the invalid values. But say someone else measures the same access point around the same location and receives valid values. Our average would be 35%, but he gets 90%, or maybe 5%.
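A minimal sketch of both the '100 + signal_dbm' conversion and the averaging repair follows; the function names and the clamping to 0-100 are assumptions:

```python
def strength_percent(signal_dbm):
    """The project's percentage scale: '100 + signal_dbm', clamped
    to 0-100 (the clamp is an assumption for out-of-range readings)."""
    return max(0, min(100, 100 + signal_dbm))

def fill_with_average(values):
    """Replace None entries with the average of the valid ones."""
    valid = [v for v in values if v is not None]
    avg = sum(valid) / len(valid)
    return [avg if v is None else v for v in values]

strengths = [strength_percent(-80), None, strength_percent(-50)]  # [20, None, 50]
fill_with_average(strengths)  # → [20, 35.0, 50]
```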
For this, it might be better to look at the history of the access point (assuming there is one; if not, there probably will be in time). You could take the most recent dBm values measured around the same location:
{{{
signal_dbm (older), strength % (older)
-75, 25%
-90, 10%
-60, 40%
}}}
The first and third values don't differ much from the new measurements, so it should be fairly safe to take the old second value and use it for the new measurement.
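That history-based repair could be sketched like this. The tolerance value and the assumption that old and new rows line up by position are illustrative choices, not part of the project:

```python
def fill_from_history(new, old, tolerance=10):
    """Replace invalid (None) strengths in 'new' with the value from an
    older series at the same location, but only when the valid neighbours
    roughly agree with their older counterparts. 'tolerance' (in
    percentage points) is a hypothetical knob."""
    out = list(new)
    for i, value in enumerate(out):
        if value is None:
            # the old value is only trustworthy if nearby readings match
            neighbours_ok = all(
                abs(out[j] - old[j]) <= tolerance
                for j in (i - 1, i + 1)
                if 0 <= j < len(out) and out[j] is not None
            )
            if neighbours_ok:
                out[i] = old[i]
    return out

new = [20, None, 50]   # current round: middle strength % is invalid
old = [25, 10, 40]     # older measurements around the same locations
fill_from_history(new, old)  # → [20, 10, 50]
```

If the neighbours disagree too much with history, the gap is left alone, which matches the cautious tone of the section: only reuse an old value when the surrounding measurements make it plausible.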