Tuesday, September 23, 2008

Notes from 9/23/2008 Patient Matching Call

We modified our tactics for matching because the matching process was taking longer to complete than we hoped.

We believe the performance issues stem from the large number of Hibernate queries required to create blocks of potential pairs. For example, if the patient table contains 4,500 unique SSNs, then 4,500 separate Hibernate queries must be issued to create potential pairs where SSNs match.

To minimize the number of Hibernate queries, the batch de-duplication process first reads the following tables: person, patient, patient_identifiers and person_attributes. A "flat", non-normalized table is then created containing all fields from these four tables.

All further analyses and scoring are performed against the flat, non-normalized table.
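As a rough illustration of why the flat table helps, the sketch below contrasts one query per distinct SSN with a single self-join over a flat table. It uses Python and an in-memory SQLite database purely for demonstration; the table name `flat_patient`, its columns, and both function names are assumptions made for this example, not the module's actual schema or code.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table; the real one holds all fields from the four source tables.
conn.execute(
    "CREATE TABLE flat_patient (patient_id INTEGER PRIMARY KEY, ssn TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO flat_patient VALUES (?, ?, ?)",
    [
        (1, "111-11-1111", "a@example.org"),
        (2, "111-11-1111", "a@example.org"),
        (3, "222-22-2222", "b@example.org"),
        (4, "222-22-2222", "c@example.org"),
        (5, "333-33-3333", "d@example.org"),
    ],
)


def pairs_one_query_per_ssn(conn):
    """Build candidate pairs with one query per distinct SSN (the slow pattern)."""
    pairs, queries = [], 1
    ssns = [r[0] for r in conn.execute("SELECT DISTINCT ssn FROM flat_patient ORDER BY ssn")]
    for ssn in ssns:
        ids = [
            r[0]
            for r in conn.execute(
                "SELECT patient_id FROM flat_patient WHERE ssn = ? ORDER BY patient_id",
                (ssn,),
            )
        ]
        queries += 1
        pairs.extend(itertools.combinations(ids, 2))
    return pairs, queries


def pairs_single_query(conn):
    """Build the same candidate pairs with a single self-join on the blocking field."""
    sql = """
        SELECT a.patient_id, b.patient_id
        FROM flat_patient a
        JOIN flat_patient b
          ON a.ssn = b.ssn AND a.patient_id < b.patient_id
        ORDER BY a.patient_id, b.patient_id
    """
    return conn.execute(sql).fetchall()


slow_pairs, query_count = pairs_one_query_per_ssn(conn)
fast_pairs = pairs_single_query(conn)
print(query_count, slow_pairs, fast_pairs)
```

With three distinct SSNs the first function issues four queries (one to list the SSNs, then one per SSN), while the second issues one regardless of how many distinct values exist.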

A recent timing test found that 10,000 patients could be extracted and stored in the flat table in about 20 minutes, a rate of roughly 8 to 9 patients extracted from OpenMRS per second.

Once the data had been extracted from OpenMRS, analyzing and scoring the flat table took about 20 seconds (SSN was the blocking variable; email and SSN were the include variables).
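Taking the extraction figures above at face value (10,000 patients in about 20 minutes), the per-patient cost and a projection for a larger load can be worked out as below. This is only a back-of-the-envelope sketch: the projection assumes extraction time scales linearly, which has not yet been verified, and the helper names and the 25% tolerance are arbitrary choices for illustration.

```python
def seconds_per_patient(patients, seconds):
    """Average extraction cost per patient for one timed run."""
    return seconds / patients


def is_roughly_linear(measurements, tolerance=0.25):
    """Check whether per-patient cost stays near that of the smallest run.

    measurements: list of (patient_count, elapsed_seconds) tuples.
    tolerance: allowed relative deviation from the smallest run's cost
               (0.25 is an arbitrary illustrative threshold).
    """
    ordered = sorted(measurements)
    baseline = seconds_per_patient(*ordered[0])
    return all(
        abs(seconds_per_patient(n, s) - baseline) <= tolerance * baseline
        for n, s in ordered
    )


# The one measured run: 10,000 patients in about 20 minutes.
cost = seconds_per_patient(10_000, 20 * 60)

# Projection for 40,000 patients, valid only IF scaling is linear.
projected_minutes_40k = 40_000 * cost / 60

print(f"{cost:.2f} s per patient, about {projected_minutes_40k:.0f} minutes for 40,000")
```

Once runs at larger sizes have actually been timed, passing the measured `(patient_count, elapsed_seconds)` pairs to `is_roughly_linear` gives a quick yes/no answer on whether further optimization needs to be considered.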

Next Steps Include:

1. Verify that the module can handle multiple blocking runs and that it joins the results of those runs appropriately for the human-readable report.

2. Verify that the execution time of the current patient extraction process increases linearly with the number of patients. We need to load 20,000-40,000 patients into OpenMRS and measure how long it takes to extract them into a flat table. If extraction time does not scale linearly, we will need to consider other optimizations.

3. Create blocking runs that use neither the patient_identifiers table nor the person_attributes table, and measure how long these take to complete. I suspect that the "stacked" nature of these tables hurts efficiency.

4. Examine which specific tasks in the data extraction process are taking up time. To do so, we discussed the following approach:
  • Comment out the PatientToRecord method. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.
  • Access only 1 or 2 properties in PatientToRecord. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.
  • Fully execute the PatientToRecord method. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.
5. Because the de-duplication process will need to run when an OpenMRS system is not heavily loaded, it will likely need to be scheduled. We need to explore implementing a scheduling component.

6. We need to create a separate module for each distinct linkage process: one for the batch de-duplication use case and one for the real-time matching use case (NBS). We envision creating a common package of linkage utilities that can be reused in both modules.
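The staged timing approach in step 4 could be sketched along these lines. This is a Python stand-in, not the module's Java code: `patient_to_record_partial`, `patient_to_record_full`, `time_stage`, and the `flat_patient` table are hypothetical names invented for the example, the patients are synthetic in-memory dictionaries, and the fourth stage (conversion plus INSERT) is an added baseline that the call did not list.

```python
import sqlite3
import time


def patient_to_record_partial(patient):
    """Stand-in for PatientToRecord touching only one or two properties."""
    return {"patient_id": patient["patient_id"], "ssn": patient["ssn"]}


def patient_to_record_full(patient):
    """Stand-in for PatientToRecord reading every property."""
    return {
        "patient_id": patient["patient_id"],
        "ssn": patient["ssn"],
        "email": patient["email"],
    }


def time_stage(patients, convert=None, insert=None):
    """Time one pass over all patients with the given steps enabled."""
    start = time.perf_counter()
    for patient in patients:
        record = convert(patient) if convert else None
        if insert is not None and record is not None:
            insert(record)
    return time.perf_counter() - start


# Synthetic patients; a real test would iterate over patients loaded from OpenMRS.
patients = [
    {"patient_id": i, "ssn": f"{i:09d}", "email": f"p{i}@example.org"}
    for i in range(1000)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flat_patient (patient_id INTEGER, ssn TEXT, email TEXT)")


def insert_record(record):
    conn.execute("INSERT INTO flat_patient VALUES (:patient_id, :ssn, :email)", record)


timings = {
    "iterate only": time_stage(patients),
    "partial conversion": time_stage(patients, patient_to_record_partial),
    "full conversion": time_stage(patients, patient_to_record_full),
    "full conversion + insert": time_stage(patients, patient_to_record_full, insert_record),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Because the patients here are plain dictionaries, the numbers say nothing about the real system; the point is the structure, where the difference between consecutive stages attributes time to iteration, property access, and the INSERT respectively.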

Tuesday, September 2, 2008

Notes from 9/2/2008 Patient Matching Call

1) Implemented: Display additional matching variables in the web-based report so the user can better evaluate the nature/quality of the match

2) Implemented: Create a flat file output containing (at least) all blocking and include variables, if not additional variables

3) Implemented: Because different blocking runs may have different include variables, we need to be able to include the union of all fields used across all blocking runs

4) Testing: We need to test the de-dupe module with different data sets. We need to test both for A) performance (how long does it take to match larger numbers of patients), and B) bugs -- what issues will we run into with different data?

Results to date: An out-of-memory error occurs with an OpenMRS patient table containing 10,000 patients (likely a Tomcat memory error)

  • Increasing the Tomcat memory allocation to 1024M made no difference
  • We may need to use a software profiler to see what processes are using memory
  • Question: are match result objects (MRs) stored in memory? MRs should not be stored in memory -- to minimize memory use, each MR should be handled and written out (to a file, for example) as it is produced.
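A minimal sketch of the two output ideas from this call follows: taking the union of fields across blocking runs for the flat file, and writing each match result as it is produced instead of collecting them all in memory. It is written in Python for brevity rather than in the module's own language. The run configuration format, the function names, the `_a`/`_b` column naming, and the agreement-count "score" are all assumptions for illustration; the module's real scoring is not shown here.

```python
import csv
import io


def union_of_fields(blocking_runs):
    """Ordered union of blocking and include fields across all runs."""
    fields = []
    for run in blocking_runs:
        for name in run["blocking"] + run["include"]:
            if name not in fields:
                fields.append(name)
    return fields


def score_pairs(pairs, fields):
    """Yield one match result at a time; nothing is accumulated in a list."""
    for a, b in pairs:
        row = {"id_a": a["patient_id"], "id_b": b["patient_id"]}
        agreements = 0
        for f in fields:
            row[f"{f}_a"] = a.get(f, "")
            row[f"{f}_b"] = b.get(f, "")
            agreements += int(a.get(f) is not None and a.get(f) == b.get(f))
        row["score"] = agreements  # placeholder score: number of agreeing fields
        yield row


def write_results(results, fields, out):
    """Stream match results to a CSV file object, returning the row count."""
    columns = ["id_a", "id_b", "score"] + [
        f"{f}_{side}" for f in fields for side in ("a", "b")
    ]
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    count = 0
    for result in results:
        writer.writerow(result)  # each result is written, then discarded
        count += 1
    return count


runs = [
    {"blocking": ["ssn"], "include": ["email", "ssn"]},
    {"blocking": ["last_name"], "include": ["last_name", "birthdate"]},
]
fields = union_of_fields(runs)

a = {"patient_id": 1, "ssn": "111-11-1111", "email": "a@example.org",
     "last_name": "Smith", "birthdate": "1970-01-01"}
b = {"patient_id": 2, "ssn": "111-11-1111", "email": "a@example.org",
     "last_name": "Smyth", "birthdate": "1970-01-01"}

out = io.StringIO()  # stands in for an open file
written = write_results(score_pairs([(a, b)], fields), fields, out)
print(out.getvalue())
```

Because `score_pairs` is a generator, only one match result exists at a time, so memory use stays flat no matter how many pairs a blocking run produces.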