Science and A Life of Adventure: 2008

Thursday, October 23, 2008

Notes from 10/21/2008 Patient Matching Call

(A) We're experiencing out of memory errors in Tomcat with 100,000 patients in OpenMRS and 512 MB allocated to Tomcat. The first blocking run (blocking on postal code) appears to be completing; the out of memory error appears to occur during the analysis phase (either during random sampling or during Expectation Maximization) of the second blocking run (blocking on SSN).

Memory snapshots from the profiler reveal that a large amount of memory is allocated for the MySQL database connection. Large result sets are not being released and closed. Even though we attempt to close the result set, it remains open; we believe this is because unidentified resources are still accessing it.

James has updated the code to address releasing the result set, and Win is testing the revision on his computer.
If this code update doesn't resolve the issue, we'll re-examine the profiler output for other places where memory use could be further optimized.

Update 10/22/2008: The new code ran successfully, with no out-of-memory errors observed.

(B) In some cases, the duplicate report listing may be large. Rather than displaying the report in a web browser (which is not designed to display large amounts of line-list data), it may be more practical to output the duplicate report directly to a file. We're in the process of modifying the module to accommodate this. Win is implementing a more robust reporting workflow using AJAX. The user will be able to initiate a report, navigate away from the administrator interface, and return to check status; when the report is complete, a link to that report will be displayed.

We’ve implemented a lock-out feature so that once a report is started, no reports can be initiated until the current report is complete.

(C) We have so far been unable to run the YourKit Java profiler on Linux, so we are using Windows to profile memory usage. If this issue further hinders progress, we will need to resolve whatever is preventing YourKit from running on Linux.

Tuesday, September 23, 2008

Notes from 9/23/2008 Patient Matching Call

We modified our tactics for matching because the matching process was taking longer to complete than we hoped.

We believe the performance issues relate to the many Hibernate queries that are required to create blocks of potential pairs. For example, if the patient table contains 4,500 unique SSNs, then 4,500 different Hibernate queries must be called to create potential pairs where SSNs match.

To minimize the number of Hibernate queries, the batch de-duplication process first examines the following tables: person, patient, patient_identifiers and person_attributes. A "flat", non-normalized table is then created with all fields from the above 4 tables.

All further analyses and scoring are performed against the flat, non-normalized table.
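
As a rough illustration of why working from the flat table helps, the sketch below groups already-extracted records by a blocking field in a single pass and then forms candidate pairs inside each block, rather than issuing one query per distinct value. It is written in Python rather than the module's Java, and the record layout and the names `block_records` and `candidate_pairs` are assumptions made for illustration, not the module's API.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, blocking_field):
    """Group flat records by the value of one blocking field in a single pass."""
    blocks = defaultdict(list)
    for record in records:
        key = record.get(blocking_field)
        if key:  # records with a missing blocking value form no pairs
            blocks[key].append(record)
    return blocks


def candidate_pairs(records, blocking_field):
    """Yield each unordered pair of records sharing a blocking value exactly once."""
    for members in block_records(records, blocking_field).values():
        for a, b in combinations(members, 2):
            yield a, b


# Hypothetical rows standing in for the flat, non-normalized table.
flat_table = [
    {"patient_id": 1, "ssn": "111-11-1111", "email": "a@example.org"},
    {"patient_id": 2, "ssn": "111-11-1111", "email": "a@example.org"},
    {"patient_id": 3, "ssn": "222-22-2222", "email": "b@example.org"},
    {"patient_id": 4, "ssn": None, "email": "c@example.org"},
]

pairs = [(a["patient_id"], b["patient_id"])
         for a, b in candidate_pairs(flat_table, "ssn")]
print(pairs)
```

Because `combinations` emits each unordered pair once, no pair is formed twice within a block.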

A recent timing test found that 10,000 patients could be extracted and stored in the flat table in about 20 minutes, a rate of about 8 patients extracted from OpenMRS per second.

Once extracted from OpenMRS, analyzing and scoring the flat table took about 20 seconds. (SSN was the blocking variable and email/SSN were the include variables)

Next Steps Include:

1. Verify that the module can handle multiple blocking runs, and will join the multiple runs appropriately for the human readable report.

2. Verify that the execution time of the current patient extraction process increases linearly with the number of patients. We need to load 20,000-40,000 patients into OpenMRS and measure how long it takes to extract patients into a flat table. If time is not linear, we will need to consider other optimizations.

3. Create blocking runs that use neither patient_identifiers table nor the person_attributes table and measure how long these take to complete. I suspect that the "stacked" nature of these tables impacts efficiency.

4. Examine which specific tasks in the data extraction process are taking up time. To do so, we discussed the following approach:

Comment out the PatientToRecord method. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.
Access only 1 or 2 properties in PatientToRecord. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.
Fully execute the PatientToRecord method. Comment out the SQL INSERT statement that loads the record into the flat table. Measure how long it takes to iterate through all Patients.

5. Because the de-duplication process will need to run when an OpenMRS system is not heavily loaded, it will likely need to be scheduled. We need to explore implementing a scheduling component.

6. We need to create a separate module for each of the two distinct linkage processes: one for the batch de-duplication use case, and one for the real-time matching use case (NBS). We envision creating a common package of linkage utilities that can be re-used in both modules.
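
The staged timing described in step 4 could be sketched along these lines. This is a Python illustration under stated assumptions, not the module's Java code: `iterate_only`, `convert_partial` and `convert_full` are hypothetical stand-ins for the three variants of `PatientToRecord`, and the in-memory patient list stands in for iterating over OpenMRS patients. The SQL INSERT is left out of all three stages, as in the plan above.

```python
import time


def time_stage(label, patients, convert):
    """Time one full pass over the patients using the given conversion step."""
    start = time.perf_counter()
    count = 0
    for patient in patients:
        convert(patient)
        count += 1
    elapsed = time.perf_counter() - start
    return {"stage": label, "patients": count, "seconds": elapsed}


# Hypothetical stand-ins for the three variants discussed above.
def iterate_only(patient):
    return None  # conversion commented out: measures iteration cost only


def convert_partial(patient):
    return {"patient_id": patient["patient_id"]}  # touches a single property


def convert_full(patient):
    return dict(patient)  # touches every property


patients = [{"patient_id": i, "given_name": "Test", "ssn": str(i)}
            for i in range(1000)]
results = [
    time_stage("iterate only", patients, iterate_only),
    time_stage("1-2 properties", patients, convert_partial),
    time_stage("full conversion", patients, convert_full),
]
for row in results:
    print(f"{row['stage']}: {row['patients']} patients in {row['seconds']:.4f} s")
```

Comparing the three timings shows how much of the total is spent on iteration, on property access, and on full conversion.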

Tuesday, September 2, 2008

Notes from 9/2/2008 Patient Matching Call

1) Implemented: Display additional matching variables in the web based report so the user can better evaluate the nature/quality of the match

2) Implemented: Create a flat file output containing (at least) all blocking and include variables, if not additional variables

3) Implemented: Because different blocking runs may have different include variables, we need to be able to include the union of all fields used across all blocking runs

4) Testing: We need to test the de-dupe module with different data sets. We need to test both for A) performance (how long does it take to match larger numbers of patients), and B) bugs -- what issues will we run into with different data?

Results to date: An out of memory error occurs with an OpenMRS patient table containing 10,000 patients (likely a Tomcat memory error)

Increasing the Tomcat memory allocation to 1024M made no difference
We may need to use a software profiler to see what processes are using memory
Question: are match results objects (MR's) stored in memory? MR's should not be stored in memory -- to minimize memory use, each MR should be handled and written to a file, etc.
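
To illustrate the idea of handling each MR as it is produced instead of holding all of them in memory, here is a minimal Python sketch (the module itself is Java). The placeholder scoring rule, the field names, and the functions `score_pairs` and `write_results` are assumptions for illustration only.

```python
import csv
import io


def score_pairs(pairs):
    """Yield one match result at a time instead of building a list."""
    for left, right in pairs:
        # Placeholder score (an assumption): 1.0 when the SSNs agree, else 0.0.
        score = 1.0 if left["ssn"] == right["ssn"] else 0.0
        yield {"id_a": left["patient_id"], "id_b": right["patient_id"], "score": score}


def write_results(results, out):
    """Write each result as soon as it is produced; return the row count."""
    writer = csv.DictWriter(out, fieldnames=["id_a", "id_b", "score"])
    writer.writeheader()
    written = 0
    for result in results:
        writer.writerow(result)
        written += 1
    return written


pairs = [
    ({"patient_id": 1, "ssn": "111"}, {"patient_id": 2, "ssn": "111"}),
    ({"patient_id": 3, "ssn": "222"}, {"patient_id": 4, "ssn": "333"}),
]
buffer = io.StringIO()  # stands in for an open report file
rows_written = write_results(score_pairs(pairs), buffer)
print(buffer.getvalue())
```

Since `score_pairs` is a generator, only one match result exists in memory at any moment, regardless of how many pairs are scored.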

Tuesday, August 5, 2008

Notes from 8/5/08 Patient Matching Call

1. The HQL-based data source reader (DSR) is nearly complete. James needs to add patient_id (== person_id?) to the matching objects. About 1 day is needed to complete the DSR. James will move the DSR to the link VM for testing/implementation. He will test the DSR's ability to use multiple blocking columns and multiple data types as blocking columns.

2. Nyoman completed the data loading utility. It takes approximately 1 second per patient to load. He will post his code to the OpenMRS website and get others' feedback on speed, etc.

3. Nyoman has implemented introspection of person attributes. We discussed what to do if an OpenMRS implementer adds a new person_attribute after creating blocking runs in an operational system. Decision was that the existing blocking runs would need to be manually deleted by the user and new blocking runs created.

4. Nyoman to ensure the config.xml file contains the presets for random sampling (="true"); number of samples (100,000); uid_field=??.

5. James will review how the DSR interacts with, and functions appropriately with, the following processes: 1) random sampling, 2) EM analysis, 3) Scoring pairs. There may be some performance issues with Hibernate/caching that we'll need to address.

6. Transitive grouping function will be tested as part of the end-to-end workflow (Nyoman/James). The transitive grouping report requires a "Group ID". The transitive grouping report will initially generate a flat file.

7. (New code) Each blocking run typically has a different cut-off score, and that cut-off score has been manually determined in the past. A method is needed to automatically determine the cut-off score. The cut-off score will be calculated based on the total number of EM-estimated true matches.

8. The following work flow will need to be implemented to deliver the de-duplication functionality:

User clicks "Generate De-dupe Report" (new code)
XML matching configuration file is read, matching objects are created (code exists, needs testing, modification)
Potential pairs are formed using the Blocking fields configured by the user. The patient_id (person_id) will be used to avoid redundant pairs being formed (code exists, needs testing, modification)
The pairs are randomly sampled to calculate u-values (code exists, needs testing, modification)
The pairs are evaluated by EM to calculate m-values and estimate the number of true matches (code exists, needs testing, modification)
The match cut-off score is calculated (new code)
True matches (determined by score cut-off) from all 3 blocking runs are "squeezed" based on unique ID's (new code)
The "squeezed" pairs are processed by transitive grouping function. The transitive grouping function should include a "Group ID" to identify records that belong to the same duplicate group. (code exists, needs testing, modification)
The grouped pairs are output to a flat file. (new code)
The OpenMRS User reviews the flat file for duplicates
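
The notes do not spell out how the cut-off score is derived or how pairs are "squeezed". One possible reading, sketched here in Python under that assumption, is to take the score of the N-th highest-scoring pair, where N is the EM-estimated number of true matches, and then merge the accepted pairs from all blocking runs so each pair of IDs appears once. The names `cutoff_score` and `squeeze`, and the choice to keep the highest score for a repeated pair, are hypothetical.

```python
def cutoff_score(scores, estimated_true_matches):
    """Return the score of the N-th highest pair, N = estimated true matches."""
    if not scores:
        raise ValueError("no scored pairs")
    n = max(1, min(round(estimated_true_matches), len(scores)))
    ranked = sorted(scores, reverse=True)
    return ranked[n - 1]


def squeeze(runs):
    """Merge accepted pairs from several blocking runs, one entry per ID pair."""
    best = {}
    for run in runs:
        for id_a, id_b, score in run:
            key = (min(id_a, id_b), max(id_a, id_b))  # order-independent key
            if key not in best or score > best[key]:
                best[key] = score  # assumption: keep the highest score seen
    return best


scores = [12.4, 3.1, 9.8, -2.0, 11.0, 0.5]
threshold = cutoff_score(scores, estimated_true_matches=3)
matches = [s for s in scores if s >= threshold]

run_ssn = [(1, 2, 12.4), (3, 4, 9.8)]
run_postal = [(2, 1, 10.0), (5, 6, 8.0)]
squeezed = squeeze([run_ssn, run_postal])
print(threshold, matches, squeezed)
```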

Tuesday, July 29, 2008

Notes from 7/29/08 Patient Matching call

OpenMRS De-duplication Process:

1. As a first pass, James will use an SQL query approach to create a data source reader (DSR) to access OpenMRS patient data from the Patient, Name, Address, and Attribute tables. To accomplish this we will need to map java object properties to SQL database field names. We will eventually evaluate using HQL after demonstrating functionality using SQL.

2. Nyoman will load "test5" synthetic data into OpenMRS. To do this he will need to create the following attributes: SSN, CC Num, CCV, CC expiration date. We may need to ask OPENMRS-DEV if any existing functions exist to do this (semi) automatically.

3. Nyoman will add functionality to introspect the attribute table, and will add attribute fields to the management web interface.

4. After the web GUI and DSR are implemented, James and Nyoman will ensure that the standard XML linkage configuration file is created. This will persist the deduplication configuration data. The XML config file will have defaults for certain parameters. These defaults include data source (OpenMRS), number of random samples (100,000), and string comparator (exact match).

5. James will test the new FormPairs implementation that avoids redundant pairs; he will determine whether the EM analysis and random sampling analysis are using the non-redundant FormPairs when deduplicating.

6. James will test the nascent transitive grouping functionality, which ultimately identifies the groups of potential duplicates.
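
Transitive grouping can be illustrated with a small union-find sketch: if record 1 matches 2 and 2 matches 3, all three land in the same group. This is a generic Python illustration, not the RecMatch implementation; the function name `transitive_groups` and the use of the smallest record ID as the group ID are assumptions.

```python
def transitive_groups(pairs):
    """Map every record ID to a group ID shared by all linked records."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller, so the root of a
            # group is always its smallest record ID.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {record: find(record) for record in list(parent)}


groups = transitive_groups([(1, 2), (2, 3), (5, 6), (8, 9), (3, 9)])
print(groups)
```

Here records 1, 2, 3, 8 and 9 end up in group 1 because the pair (3, 9) links two previously separate groups, while 5 and 6 form group 5.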

User Workflow script for OpenMRS deduplication:

1. User logs in to OpenMRS

2. User selects admin page

3. User selects "Manage Configuration" from Patient Matching Module section

4. User creates 1-to-n "blocking runs" configuration from Patient, Address, Name, and Attribute table fields

5. User clicks generate report

GUI for Manual Review of Record Pairs:

1. As an initial "low-hanging fruit" strategy, the Manual Review GUI will access MatchResult (MR) objects in memory, rather than in a persistent database. Ultimately we want to persist MR's in a relational data model, but for initial testing and prototyping, we will use in-memory MatchResults.

2. Nyoman will design and prototype the Manual Review GUI features, using the PowerPoint slide to prioritize features.

3. The Manual review GUI will be incorporated into the larger RecMatch GUI application.

Tuesday, July 22, 2008

Notes from 7/22/08 Patient Matching call

1. Ubuntu VM up and running

2. OpenMRS re-installed

3. Ticket 897 reviewed -- no major revisions

4. Minor Issues to address:

Lock the global config panel size (not adjustable in the vertical)
Ensure the config file history is updated in a timely fashion
Review De-duplicate check box work flow (Data source config complete check box is inactivated when un-checking de-duplicate check box)
Re-ordering blocking sessions is not saved

5. We briefly reviewed the evolving data model for the RecMatch process. Shaun will review further with Nyoman/James at a later time.

6. Nyoman will review requirements for the manual review GUI.

Still to be done:

7. Generate synthetic data (Shaun)

8. Load synthetic data into OpenMRS (James/Nyoman)

9. James to evaluate (speed/efficiency) using an HQL query approach to accessing patient data in OpenMRS

10. James/Nyoman to integrate UID/de-duplicate workflow; build uid rules into FormPairs

Notes from 7/15/08 Patient Matching call

1. Re-instantiate VM (Shaun/James)

2. Reload OpenMRS on the VM (Nyoman)

3. Generate synthetic data (Shaun)

4. Load synthetic data into OpenMRS (James/Nyoman)

5. James to evaluate (speed/efficiency) using an HQL query approach to accessing patient data in OpenMRS

6. James/Nyoman to integrate UID/de-duplicate workflow; build uid rules into FormPairs

7. Shaun to test ticket 897 changes

Tuesday, July 8, 2008

Notes from 7/8/08 Patient Matching Call

1. James will complete the code to test patient demographic queries using HQL

2. Nyoman to begin work on ticket #897

3. OpenMRS linkage VM set-up for development and testing. Nyoman to provide a public key for SSH access.

4. Nyoman to revise OpenMRS Web Admin screens

Create 3-column view for new configuration: field name, "blocking" check box, "include" check box
"Create Report" screen will directly invoke the de-duplication report
Scheduled report generation will be added later

5. James will implement a FormPairs object that implements rules specific to de-duplication

6. We discussed the concept of 'derived traits'. Any additional data that is added to the raw data source is a 'derived trait'. Examples of potential derived traits include Soundex of name, NYSIIS of name, and unique record ID ("uid"). Although we'll assume for the time being that the UID is present, we'll soon need to be able to add UIDs to data sources that don't contain them natively, and this requires substantial change to the RecMatch workflow. Shaun will draft an overview of the work required to implement a derived trait framework.

7. Shaun will generate synthetic patient data using DBGen.
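
To make the derived trait idea concrete, the sketch below adds two derived traits to raw records: a `uid` for records that lack one, and the American Soundex code of a name field. It is a Python illustration only; the field names `family_name` and `name_soundex` and the function `add_derived_traits` are assumptions, not part of RecMatch.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code, or '' for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def add_derived_traits(records):
    """Return copies of the records with 'uid' and 'name_soundex' added."""
    enriched = []
    for position, record in enumerate(records, start=1):
        row = dict(record)
        row.setdefault("uid", position)  # keep a native UID when present
        row["name_soundex"] = soundex(row.get("family_name", ""))
        enriched.append(row)
    return enriched


raw = [{"family_name": "Robert"}, {"family_name": "Rupert"},
       {"family_name": "Ashcraft", "uid": 99}]
for row in add_derived_traits(raw):
    print(row)
```

A real framework would also need to guarantee that generated UIDs never collide with native ones, which this sketch does not attempt.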

Monday, June 16, 2008

OpenMRS Patient Matching Module

Hi! This is the initial post describing the collective efforts toward implementing an open-source patient identity management framework in OpenMRS.

Motivation
As you may know, health care information is increasingly distributed across many independent databases and systems, both within and among organizations, as separate islands with different patient identifiers. This is the case for data collected about the same patient at different health care institutions, different pharmacy systems, different payers, and so on. This situation interferes with the aggregation of information about individuals across such databases as needed for many health care use cases: public health reporting, clinical research, outcomes management, and administrative reporting. Aggregation is important not only to determine a patient's health care status, but also for population-based studies.

Why is a patient matching module needed for OpenMRS?
As mentioned above, health care data is scattered across disparate systems, and as OpenMRS implementations grow, they will begin facing multiple instances of the same patient across and within their implementation. One OpenMRS implementation is known to contain more than 700,000 patients(!) In a related fashion, it's also the case that duplicate patient registrations will accumulate over time in the same single system. Processes to link entities (e.g., patients) across and within OpenMRS implementations will become increasingly important.

What functionality will patient identity management add to OpenMRS?
The patient matching module will initially provide two core types of functionality. First, it provides a stand-alone application that implements sophisticated probabilistic matching algorithms for both a) identifying duplicates in a single data source, and b) identifying matches between two generic data sources. The current output from the stand-alone application is a delimited file containing matches with associated match scores: the higher the score, the more likely the match.

Screen-shot of Stand-alone Matching Application

The second core function addresses the issue of duplicate patient registrations in an instance of OpenMRS. Many OpenMRS implementations have tens if not hundreds of thousands of patient records. Over time, duplicate patient records will creep in. The patient matching module will identify and provide a list of likely duplicates to OpenMRS administrators. Because patient identifiers vary across countries and cultures, we've designed the patient matching module to adapt to widely varying patient identifiers.

Screen-shot of the OpenMRS de-duplication Admin Screen

When will this functionality be available to the OpenMRS community?
The source code for the patient matching module is currently available in the OpenMRS Subversion repository (http://svn.openmrs.org/openmrs-modules/patientmatching). The stand-alone application has expanding functionality. The OpenMRS de-duplication module is currently under active development, with great support provided through the Google Summer of Code 2008 initiative! We anticipate delivering the de-duplication functionality to the community by summer's end.

Join In the Fun!
We welcome those with an interest in this area (you know who you are)! To become further acquainted and involved, we encourage you to:

read the OpenMRS developers "Where to Get Started" page
check out the blog of our excellent GSoC intern, Nyoman Ribea
peruse the source code
review outstanding developer tickets
find us on the OpenMRS IRC Channel (sgrannis, james_regen, nribeka)
email shaun: s g r a n n i s { a t } r e g e n s t r i e f { d o t } o r g
check back here from time-to-time
