We believe the performance issues relate to the myriad Hibernate queries that are required to create blocks of potential pairs. For example, if the patient table contains 4,500 unique SSN's, then 4,500 different Hibernate queries must be called to create potential pairs where SSN's match.
To minimize the number of Hibernate queries, the batch de-duplication process first examines the following tables: person, patient, patient_identifiers and person_attributes. A "flat", non-normalized table is then created with all fields from the above 4 tables.
All further analyses and scoring are performed against the flat, non-normalized table.
Recent timing test found that 10,000 patients could be extracted and stored in the the flat table in about 20 minutes, a rate of about 9 patients extracted from OpenMRS per second.
Once extracted from OpenMRS, analyzing and scoring the flat table took about 20 seconds. (SSN was the blocking variable and email/SSN were the include variables)
Next Steps Include:
1. Verify that the module can handle multiple blocking runs, and will join the multiple runs appropriately for the human readable report.
2. Verify that the current patient extraction process execution time increases linearly with the number of patients. We need to load 20-40,000 patients into OpenMRS and measure how long it takes to extract patients into a flat table. If time is not linear, we will need to consider other optimizations.
3. Create blocking runs that use neither patient_identifiers table nor the person_attributes table and measure how long these take to complete. I suspect that the "stacked" nature of these tables impacts efficiency.
4. Examine which specific tasks in the data extraction process are taking up time. To do so, we discussed the following approach:
- Comment out PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
- Access only 1 or 2 properties in PatientToRecord. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
- Fully execute the PatientToRecord method. Comment out SQL INSERT statement to load record into flat table. Measure how long it takes to iterate thru all Patients.
6. We need to create two separate modules for each distinct linkage process: One for the batch duplication use-case, and one for the real-time matching use case (NBS). We envision creating a common package of linkage utilities that can be re-used in both modules.