Tuesday, August 5, 2008

Notes from 8/5/08 Patient Matching Call

1. The HQL-based data source reader (DSR) is nearly complete. James needs to add patient_id (== person_id?) to the matching objects. About 1 day is needed to complete the DSR. James will then move the DSR to the link VM for testing/implementation. He will test the DSR's ability to use multiple blocking columns and multiple data types as blocking columns.

2. Nyoman completed the data loading utility. It takes approximately 1 second per patient to load. He will post his code to the OpenMRS website and get others' feedback on speed, etc.

3. Nyoman has implemented introspection of person attributes. We discussed what to do if an OpenMRS implementer adds a new person_attribute after creating blocking runs in an operational system. The decision was that the user would need to manually delete the existing blocking runs and create new ones.

4. Nyoman will ensure that the config.xml file contains the presets for random sampling (="true"); number of samples (100,000); uid_field=??.

5. James will review how the DSR interacts with the following processes and confirm that it functions appropriately with each: 1) random sampling, 2) EM analysis, 3) scoring pairs. There may be some performance issues with Hibernate/caching that we'll need to address.

6. Transitive grouping function will be tested as part of the end-to-end workflow (Nyoman/James). The transitive grouping report requires a "Group ID". The transitive grouping report will initially generate a flat file.
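To make the "Group ID" idea concrete, here is a minimal sketch of transitive grouping and the flat-file report. It is not the existing code: the function names, the pipe delimiter, and the column names are assumptions for illustration only. If A matches B and B matches C, all three records get the same Group ID.

```python
import csv
import io


def transitive_groups(pairs):
    """Map each record ID to a Group ID shared by all transitively linked records."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller ID always becomes the root of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Number the groups 1, 2, 3, ... in order of their smallest member.
    group_ids = {}
    result = {}
    for record in sorted(parent):
        root = find(record)
        group_ids.setdefault(root, len(group_ids) + 1)
        result[record] = group_ids[root]
    return result


def write_flat_file(groups, out):
    """Write one line per record, ordered by group (hypothetical layout)."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(["group_id", "patient_id"])
    for patient_id, group_id in sorted(groups.items(), key=lambda kv: (kv[1], kv[0])):
        writer.writerow([group_id, patient_id])


groups = transitive_groups([(1, 2), (2, 3), (7, 9)])
report = io.StringIO()
write_flat_file(groups, report)
print(report.getvalue())
```

Records 1, 2 and 3 end up in one group even though 1 and 3 were never paired directly, which is the behavior the report relies on.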

7. (New code) Each blocking run typically has a different cut-off score, and that cut-off score has been manually determined in the past. A method is needed to automatically determine the cut-off score. The cut-off score will be calculated based on the total number of EM-estimated true matches.
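The exact rule was not settled on the call. One possible reading, sketched below, is to rank the pairs in a blocking run by score and use the score of the N-th pair as the cut-off, where N is the EM-estimated number of true matches. The function name and the rounding behavior are assumptions.

```python
def cutoff_score(scores, estimated_true_matches):
    """Return the score of the N-th highest-scoring pair, N = EM-estimated true matches.

    Returns None when no pairs are expected to match.
    """
    n = int(round(estimated_true_matches))
    if n <= 0 or not scores:
        return None
    ranked = sorted(scores, reverse=True)
    # If EM estimates more matches than there are pairs, fall back to the lowest score.
    return ranked[min(n, len(ranked)) - 1]


scores = [12.5, 3.1, 9.8, -2.0, 7.4]
cutoff = cutoff_score(scores, 2.0)
matches = [s for s in scores if s >= cutoff]
print(cutoff, matches)
```

With this rule, pairs scoring at or above the cut-off are treated as true matches, so each blocking run gets its own threshold without manual tuning.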

8. The following workflow will need to be implemented to deliver the de-duplication functionality:
  1. User clicks "Generate De-dupe Report" (new code)
  2. XML matching configuration file is read, matching objects are created (code exists, needs testing, modification)
  3. Potential pairs are formed using the Blocking fields configured by the user. The patient_id (person_id) will be used to avoid redundant pairs being formed (code exists, needs testing, modification)
  4. The pairs are randomly sampled to calculate u-values (code exists, needs testing, modification)
  5. The pairs are evaluated by EM to calculate m-values and estimate the number of true matches (code exists, needs testing, modification)
  6. The match cut-off score is calculated (new code)
  7. True matches (determined by score cut-off) from all 3 blocking runs are "squeezed" based on unique IDs (new code)
  8. The "squeezed" pairs are processed by the transitive grouping function. The transitive grouping function should include a "Group ID" to identify records that belong to the same duplicate group. (code exists, needs testing, modification)
  9. The grouped pairs are output to a flat file. (new code)
  10. The OpenMRS User reviews the flat file for duplicates
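Steps 3 and 7 of the workflow can be sketched as follows. This is an illustration under stated assumptions, not the existing code: the record layout (dicts with a patient_id key), the function names, and the choice to skip records with a missing blocking value are all hypothetical. Ordering each pair as (smaller ID, larger ID) is one way to use the patient_id to avoid redundant pairs, and it also makes the "squeeze" across blocking runs a simple set union.

```python
from collections import defaultdict
from itertools import combinations


def form_pairs(records, blocking_fields):
    """Pair records that agree on every blocking field; each pair is (smaller id, larger id)."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in blocking_fields)
        if any(v is None for v in key):
            continue  # assumption: records missing a blocking value are skipped
        blocks[key].append(rec["patient_id"])
    pairs = set()
    for ids in blocks.values():
        # Sorting the IDs means (1, 2) is produced but (2, 1) never is.
        for a, b in combinations(sorted(set(ids)), 2):
            pairs.add((a, b))
    return pairs


def squeeze(*runs):
    """Union the matched pairs of several blocking runs, keeping each ID pair once."""
    merged = set()
    for run in runs:
        merged.update((min(a, b), max(a, b)) for a, b in run)
    return merged


records = [
    {"patient_id": 1, "last_name": "Smith", "birth_year": 1980},
    {"patient_id": 2, "last_name": "Smith", "birth_year": 1980},
    {"patient_id": 3, "last_name": "Smith", "birth_year": 1975},
    {"patient_id": 4, "last_name": "Jones", "birth_year": 1980},
]
by_name = form_pairs(records, ["last_name"])
by_year = form_pairs(records, ["birth_year"])
print(sorted(squeeze(by_name, by_year)))
```

In the real workflow the pairs passed to the squeeze step would be only those above each run's cut-off score; the example passes all candidate pairs to keep it short.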
