Thursday, October 23, 2008

Notes from 10/21/2008 Patient Matching Call

(A) We're experiencing out of memory errors in Tomcat with 100,000 patients in OpenMRS and 512 MB allocated to Tomcat. The first blocking run (blocking on postal code) appears to be completing; the out of memory error appears to occur during the analysis phase (either during random sampling or during Expectation Maximization) of the second blocking run (blocking on SSN).

Memory snapshots from the profiler reveal that a large amount of memory is allocated for the MySQL database connection. Large result sets are not being released and closed. Even though we are attempting to close the result set, it remains open, we believe because unidentified resources are still accessing the result set.

James has updated the code to address releasing the result set, and Win is testing the revision on his computer.
If this code update doesn't resolve the issue, we'll re-examine the profiler output for other places where memory use could be further optimized.

Update 10/22/2008: The new code ran successfully, with no out-of-memory errors observed.
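
For illustration only, here is a minimal sketch of the cleanup pattern discussed above. It is not James's actual revision. The class and method names (ResultSetCleanup, countRows) are hypothetical. The sketch closes the result set and then the statement even if an exception is thrown, and it assumes MySQL Connector/J, where a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE streams rows one at a time instead of buffering the whole result set in memory. Whether the module should use streaming is an assumption we have not tested.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ResultSetCleanup {

    /**
     * Iterates over every row returned by the query and returns the row count.
     * try-with-resources closes the ResultSet first, then the Statement,
     * whether the loop finishes normally or throws.
     */
    public static long countRows(Connection conn, String sql) throws SQLException {
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Assumption: MySQL Connector/J treats this value as a request
            // to stream rows rather than load the full result set.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery(sql)) {
                long rows = 0;
                while (rs.next()) {
                    rows++; // real code would build one record per row here
                }
                return rows;
            }
        }
    }
}
```

The caller keeps ownership of the Connection, so a pooled connection can be reused after the result set has been released.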


(B) In some cases, the duplicate report listing may be large. So rather than displaying the report in a web browser (which is not designed to display large amounts of line-list data), it may be more practical to output the duplicate report directly to a file. Consequently, we're in the process of modifying the module to accommodate this. Win is implementing a more robust reporting workflow using AJAX. The user will be able to initiate a report, navigate away from the administrator interface, and return to check status; when the report is complete, a link to that report will be displayed.

We’ve implemented a lock-out feature so that once a report is started, no other reports can be initiated until the current report is complete.
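
A lock-out of this kind can be sketched with an atomic flag. This is an illustration under assumptions, not the module's implementation: the ReportLock class and its method names are hypothetical, and the sketch only covers a single Tomcat instance (the flag lives in one JVM).

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ReportLock {

    // true while a report is being generated
    private final AtomicBoolean running = new AtomicBoolean(false);

    /**
     * Attempts to claim the lock. Returns true for exactly one caller
     * until finish() is called; every other caller gets false.
     */
    public boolean tryStart() {
        return running.compareAndSet(false, true);
    }

    /** Releases the lock; call this in a finally block when the report ends. */
    public void finish() {
        running.set(false);
    }

    /** Lets the status page show whether a report is in progress. */
    public boolean isRunning() {
        return running.get();
    }
}
```

The request handler would call tryStart() before launching the background report and refuse the request if it returns false; the status check polled via AJAX could read isRunning().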


(C) So far we have been unable to run the YourKit Java profiler on Linux, so we are using Windows to profile memory usage. If this issue further hinders progress, we will need to address whatever is preventing YourKit from running on Linux.

Tuesday, September 23, 2008

Notes from 9/23/2008 Patient Matching Call

We modified our tactics for matching because the matching process was taking longer to complete than we hoped.

We believe the performance issues relate to the large number of Hibernate queries required to create blocks of potential pairs. For example, if the patient table contains 4,500 unique SSNs, then 4,500 different Hibernate queries must be issued to create potential pairs where SSNs match.

To minimize the number of Hibernate queries, the batch de-duplication process first examines the following tables: person, patient, patient_identifiers and person_attributes. A "flat", non-normalized table is then created with all fields from the above 4 tables.

All further analyses and scoring are performed against the flat, non-normalized table.
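
To show why a flat table helps, here is a rough in-memory sketch of blocking in a single pass rather than one query per blocking value. It is an assumption-laden illustration: the real module works against a database table, and the class name FlatTableBlocking, the method formPairs, and the use of a simple ID-to-value map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FlatTableBlocking {

    /**
     * Groups record IDs by blocking value (for example SSN) in one pass,
     * then forms every pair of IDs that share a value. Records with a
     * missing or empty blocking value are skipped.
     */
    public static List<int[]> formPairs(Map<Integer, String> blockingValueById) {
        Map<String, List<Integer>> blocks = new TreeMap<>();
        for (Map.Entry<Integer, String> e : blockingValueById.entrySet()) {
            String key = e.getValue();
            if (key == null || key.isEmpty()) {
                continue;
            }
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }

        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> ids : blocks.values()) {
            Collections.sort(ids);
            // j starts at i + 1, so each pair appears once, smaller ID first
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    pairs.add(new int[] {ids.get(i), ids.get(j)});
                }
            }
        }
        return pairs;
    }
}
```

With 4,500 unique SSNs this still produces 4,500 blocks, but they come from one scan of the data instead of 4,500 separate queries.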

A recent timing test found that 10,000 patients could be extracted and stored in the flat table in about 20 minutes, a rate of about 9 patients extracted from OpenMRS per second.

Once extracted from OpenMRS, analyzing and scoring the flat table took about 20 seconds (SSN was the blocking variable and email/SSN were the include variables).

Next Steps Include:

1. Verify that the module can handle multiple blocking runs, and that it joins the multiple runs appropriately for the human-readable report.

2. Verify that the current patient extraction process execution time increases linearly with the number of patients. We need to load 20,000-40,000 patients into OpenMRS and measure how long it takes to extract patients into a flat table. If time is not linear, we will need to consider other optimizations.

3. Create blocking runs that use neither patient_identifiers table nor the person_attributes table and measure how long these take to complete. I suspect that the "stacked" nature of these tables impacts efficiency.

4. Examine which specific tasks in the data extraction process are taking up time. To do so, we discussed the following approach:
  • Comment out PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate through all Patients.
  • Access only 1 or 2 properties in PatientToRecord. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate through all Patients.
  • Fully execute the PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate through all Patients.
5. Because the de-duplication process will need to run when an OpenMRS system is not heavily loaded, it will likely need to be scheduled. We need to explore implementing a scheduling component.

6. We need to create two separate modules, one for each distinct linkage process: one for the batch de-duplication use case, and one for the real-time matching use case (NBS). We envision creating a common package of linkage utilities that can be re-used in both modules.
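
As an alternative to commenting code in and out for the timing comparison in step 4, the stages could be timed with a small helper. This is only a sketch: ExtractionTimer and timeNanos are hypothetical names, and the stage passed in stands for whatever work is being measured (no-op iteration, partial PatientToRecord, full PatientToRecord).

```java
import java.util.List;
import java.util.function.Consumer;

public class ExtractionTimer {

    /**
     * Applies the given stage to every item and returns the elapsed
     * wall-clock time in nanoseconds.
     */
    public static <T> long timeNanos(List<T> items, Consumer<? super T> stage) {
        long start = System.nanoTime();
        for (T item : items) {
            stage.accept(item);
        }
        return System.nanoTime() - start;
    }
}
```

Calling timeNanos three times over the same patient list, once per stage, gives directly comparable numbers; subtracting the no-op time from the others approximates the cost of the conversion itself. A single run is noisy, so each measurement is best repeated a few times.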

Tuesday, September 2, 2008

Notes from 9/2/2008 Patient Matching Call

1) Implemented: Display additional matching variables in the web-based report so the user can better evaluate the nature/quality of the match

2) Implemented: Create a flat file output containing (at least) all blocking and include variables, if not additional variables

3) Implemented: Because different blocking runs may have different include variables, we need to be able to include the union of all fields used across all blocking runs

4) Testing: We need to test the de-dupe module with different data sets. We need to test both for A) performance (how long does it take to match larger numbers of patients), and B) bugs -- what issues will we run into with different data?

Results to date: An out of memory error occurs with an OpenMRS patient table containing 10,000 patients (likely a Tomcat memory error)

  • Had increased Tomcat memory allocation to 1024M, no change
  • We may need to use a software profiler to see what processes are using memory
  • Question: are match results objects (MR's) stored in memory? MR's should not be stored in memory -- to minimize memory use, each MR should be handled and written to a file, etc.
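
The idea of handling each MR as it is produced, instead of accumulating MR's in memory, could look roughly like the sketch below. The Result class is a hypothetical stand-in for the real MatchResult object, and the pipe-delimited output format is an assumption chosen only for the example.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

public class MatchResultStreamer {

    /** Hypothetical minimal stand-in for a match result. */
    public static final class Result {
        final int idA;
        final int idB;
        final double score;

        public Result(int idA, int idB, double score) {
            this.idA = idA;
            this.idB = idB;
            this.score = score;
        }
    }

    /**
     * Writes each result as one line as soon as it is pulled from the
     * iterator, so only one result is held in memory at a time.
     * Returns the number of results written.
     */
    public static int writeAll(Iterator<Result> results, Writer out) throws IOException {
        int written = 0;
        while (results.hasNext()) {
            Result r = results.next();
            out.write(r.idA + "|" + r.idB + "|" + r.score + "\n");
            written++;
        }
        out.flush();
        return written;
    }
}
```

Memory use stays flat only if the iterator itself produces results lazily; wrapping an already-built list gains nothing.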

Tuesday, August 5, 2008

Notes from 8/5/08 Patient Matching Call

1. The HQL-based data source reader (DSR) is nearly complete. James needs to add patient_id (== person_id?) to the matching objects. About 1 day is needed to complete the DSR. James will move DSR to link VM for testing/implementation. He will test the DSR's ability to use multiple blocking columns and multiple data types as blocking columns.

2. Nyoman completed data loading utility. Takes approximately 1 second per patient to load. He will post his code to the OpenMRS website and get others' feedback on speed, etc.

3. Nyoman has implemented introspection of person attributes. We discussed what to do if an OpenMRS implementer adds a new person_attribute after creating blocking runs in an operational system. Decision was that the existing blocking runs would need to be manually deleted by the user and new blocking runs created.

4. Nyoman to ensure the config.xml file contains the presets for random sampling (="true"); number of samples (100,000); uid_field=??.

5. James will review how the DSR interacts with, and functions appropriately with, the following processes: 1) random sampling, 2) EM analysis, 3) Scoring pairs. There may be some performance issues with Hibernate/caching that we'll need to address.

6. Transitive grouping function will be tested as part of the end-to-end workflow (Nyoman/James). The transitive grouping report requires a "Group ID". The transitive grouping report will initially generate a flat file.

7. (New code) Each blocking run typically has a different cut-off score, and that cut-off score has been manually determined in the past. A method is needed to automatically determine the cut-off score. The cut-off score will be calculated based on the total number of EM-estimated true matches.

8. The following workflow will need to be implemented to deliver the de-duplication functionality:
  1. User clicks "Generate De-dupe Report" (new code)
  2. XML matching configuration file is read, matching objects are created (code exists, needs testing, modification)
  3. Potential pairs are formed using the Blocking fields configured by the user. The patient_id (person_id) will be used to avoid redundant pairs being formed (code exists, needs testing, modification)
  4. The pairs are randomly sampled to calculate u-values (code exists, needs testing, modification)
  5. The pairs are evaluated by EM to calculate m-values and estimate the number of true matches (code exists, needs testing, modification)
  6. The match cut-off score is calculated (new code)
  7. True matches (determined by score cut-off) from all 3 blocking runs are "squeezed" based on unique ID's (new code)
  8. The "squeezed" pairs are processed by transitive grouping function. The transitive grouping function should include a "Group ID" to identify records that belong to the same duplicate group. (code exists, needs testing, modification)
  9. The grouped pairs are output to a flat file. (new code)
  10. The OpenMRS User reviews the flat file for duplicates
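
For the automatic cut-off in item 7 and workflow step 6, one plausible reading (an assumption, since the exact rule has not been decided here) is to rank the pairs by score and take the score of the N-th highest pair, where N is the EM-estimated number of true matches. The class and method names below are hypothetical.

```java
import java.util.Arrays;

public class CutoffEstimator {

    /**
     * Returns the score of the N-th highest-scoring pair, where N is the
     * EM-estimated number of true matches rounded to the nearest integer
     * and clamped to the range 1..scores.length. Pairs scoring at or above
     * the returned value would be treated as matches.
     */
    public static double cutoff(double[] scores, double estimatedTrueMatches) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores supplied");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best scores are at the end
        int n = (int) Math.round(estimatedTrueMatches);
        n = Math.max(1, Math.min(n, sorted.length));
        return sorted[sorted.length - n];
    }
}
```

Ties at the cut-off score would admit more than N pairs, which may or may not be acceptable; that is a design question for the call.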

Tuesday, July 29, 2008

Notes from 7/29/08 Patient Matching call

OpenMRS De-duplication Process:

1. As a first pass, James will use an SQL query approach to create a data source reader (DSR) to access OpenMRS patient data from the Patient, Name, Address, and Attribute tables. To accomplish this we will need to map java object properties to SQL database field names. We will eventually evaluate using HQL after demonstrating functionality using SQL.

2. Nyoman will load "test5" synthetic data into OpenMRS. To do this he will need to create the following attributes: SSN, CC Num, CCV, CC expiration date. We may need to ask OPENMRS-DEV if any existing functions exist to do this (semi) automatically.

3. Nyoman will add functionality to introspect the attribute table, and will add attribute fields to the management web interface.

4. After the web GUI and DSR are implemented, James and Nyoman will ensure that the standard XML linkage configuration file is created. This will persist the deduplication configuration data. The XML config file will have defaults for certain parameters. These defaults include data source (OpenMRS), number of random samples (100,000), and string comparator (exact match).

5. James will test the new FormPairs implementation that avoids redundant pairs; he will determine whether the EM analysis and random sampling analysis are using the non-redundant FormPairs when deduplicating.

6. James will test the nascent transitive grouping functionality, which ultimately identifies the groups of potential duplicates.
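
Transitive grouping (item 6) can be sketched with a union-find structure: if A matches B and B matches C, all three land in one group. This is a generic illustration, not the existing RecMatch code. The class name TransitiveGrouping is hypothetical, and using the smallest record ID in each group as its group identifier is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class TransitiveGrouping {

    // Maps each record ID to its parent; a root is its own parent.
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Finds the root of x's group, compressing the path along the way. */
    private int find(int x) {
        parent.putIfAbsent(x, x);
        int root = x;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        while (parent.get(x) != root) {
            int next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    /** Records that a and b are a matched pair, merging their groups. */
    public void addPair(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra != rb) {
            // Keep the smaller ID as root so the group ID is the minimum member.
            parent.put(Math.max(ra, rb), Math.min(ra, rb));
        }
    }

    /** Returns record ID -> group ID for every record seen so far. */
    public Map<Integer, Integer> groupIds() {
        Map<Integer, Integer> out = new TreeMap<>();
        for (Integer id : new ArrayList<>(parent.keySet())) {
            out.put(id, find(id));
        }
        return out;
    }
}
```

Feeding in the pairs (1,2), (2,3) and (7,8) yields two groups, one containing records 1, 2 and 3 and one containing 7 and 8.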


User Workflow script for OpenMRS deduplication:

1. User logs in to OpenMRS

2. User selects admin page

3. User selects "Manage Configuration" from Patient Matching Module section

4. User creates 1-to-n "blocking runs" configuration from Patient, Address, Name, and Attribute table fields

5. User clicks generate report


GUI for Manual Review of Record Pairs:

1. As an initial "low-hanging fruit" strategy, the Manual Review GUI will access MatchResult (MR) objects in memory, rather than in a persistent database. Ultimately we want to persist MR's in a relational data model, but for initial testing and prototyping, we will use in-memory MatchResults.

2. Nyoman will design and prototype the Manual Review GUI features using the PowerPoint slide to prioritize features.

3. The Manual review GUI will be incorporated into the larger RecMatch GUI application.

Tuesday, July 22, 2008

Notes from 7/22/08 Patient Matching call

1. Ubuntu VM up and running

2. OpenMRS re-installed

3. Ticket 897 reviewed -- no major revisions

4. Minor Issues to address:
  • Lock the global config panel size (not adjustable in the vertical)
  • Ensure the config file history is updated in a timely fashion
  • Review De-duplicate check box workflow (Data source config complete check box is inactivated when un-checking de-duplicate check box)
  • Re-ordering blocking sessions is not saved
5. We briefly reviewed the evolving data model for the RecMatch process. Shaun will review further with Nyoman/James at a later time.

6. Nyoman will review requirements for the manual review GUI.

Still to be done:

7. Generate synthetic data (Shaun)

8. Load synthetic data into OpenMRS (James/Nyoman)

9. James to evaluate (speed/efficiency) using an HQL query approach to accessing patient data in OpenMRS

10. James/Nyoman to integrate UID/de-duplicate workflow; build uid rules into FormPairs

Notes from 7/15/08 Patient Matching call

1. Re-instantiate VM (Shaun/James)

2. Reload OpenMRS on the VM (Nyoman)

3. Generate synthetic data (Shaun)

4. Load synthetic data into OpenMRS (James/Nyoman)

5. James to evaluate (speed/efficiency) using an HQL query approach to accessing patient data in OpenMRS

6. James/Nyoman to integrate UID/de-duplicate workflow; build uid rules into FormPairs

7. Shaun to test ticket 897 changes