Tuesday, September 23, 2008

Notes from 9/23/2008 Patient Matching Call

We modified our tactics for matching because the matching process was taking longer to complete than we hoped.

We believe the performance issues relate to the myriad Hibernate queries that are required to create blocks of potential pairs. For example, if the patient table contains 4,500 unique SSN's, then 4,500 different Hibernate queries must be called to create potential pairs where SSN's match.

To minimize the number of Hibernate queries, the batch de-duplication process first examines the following tables: person, patient, patient_identifiers and person_attributes. A "flat", non-normalized table is then created with all fields from the above 4 tables.

All further analyses and scoring are performed against the flat, non-normalized table.

Recent timing test found that 10,000 patients could be extracted and stored in the the flat table in about 20 minutes, a rate of about 9 patients extracted from OpenMRS per second.

Once extracted from OpenMRS, analyzing and scoring the flat table took about 20 seconds. (SSN was the blocking variable and email/SSN were the include variables)

Next Steps Include:

1. Verify that the module can handle multiple blocking runs, and will join the multiple runs appropriately for the human readable report.

2. Verify that the current patient extraction process execution time increases linearly with the number of patients. We need to load 20-40,000 patients into OpenMRS and measure how long it takes to extract patients into a flat table. If time is not linear, we will need to consider other optimizations.

3. Create blocking runs that use neither patient_identifiers table nor the person_attributes table and measure how long these take to complete. I suspect that the "stacked" nature of these tables impacts efficiency.

4. Examine which specific tasks in the data extraction process are taking up time. To do so, we discussed the following approach:
  • Comment out PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
  • Access only 1 or 2 properties in PatientToRecord. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
  • Fully execute the PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
5. Because the de-duplication process will need to run when an OpenMRS system is not heavily loaded, it will likely need to be scheduled. We need to explore implementing a scheduling component.

6. We need to create two separate modules for each distinct linkage process: One for the batch duplication use-case, and one for the real-time matching use case (NBS). We envision creating a common package of linkage utilities that can be re-used in both modules.

1 comment:

Paul Biondich said...

Shaun, I know you recognize this, but the performance delta is likely due to a host of issues: the hibernate layer being one of them. Additionally, working from a referent point of a fairly relational model of a "person" likely contributes equally, if not more, to performance constraints. Single queries vs. 4-5x queries create considerable speed deltas.

Just thought it was worth the comment for historical purposes, but thanks a ton for documenting the process. I follow your blog religiously. :)