Open Source EDC

I have received a number of questions about the existence of an open source EDC system that can be used in the context of academic research. The best known one is OpenClinica, and the people I know who have used it found it to work well. You can get more information about that tool here: http://www.openclinica.org/

De-identifying protocol amendment test data for e-clinical trials

In the last posting I spoke about masking as a technology to help protect the privacy of patients when using real clinical trial data for testing protocol amendments. Masking is not enough, however. Let me illustrate through an example.

Let's assume that we are involved in a medical device trial. The data set contains the site information, as well as patient demographics (such as date of birth, gender), physical characteristics (eg, weight, height), a list of medications being taken by the patient, and the results of sensitive medical tests. There are no names or patient addresses in the database. So ostensibly this is considered an anonymous database.

However, knowing the site gives us information about the community where the patient lives and a set of most likely Forward Sortation Areas (the first three characters of the postal code).

If the patient is quite old (older than 89 years, for example) then it would be quite easy to know who the patient is because people at that age are quite rare in any community. If the patient was too heavy or too tall/short for their age they would also be easier to re-identify because they would stand out within the community. If a neighbour/spouse/ex-spouse/employer knew that the patient participated in the trial they could re-identify the patient's record because individuals tend to be relatively unique on birthdate, gender, and geography. Uniqueness makes it easier to re-identify individuals using these background/demographic variables.

There are a number of de-identification techniques. These will often remove the highest risk records and perturb the data to make it less likely that an individual can be re-identified. For example, you can add a bit of noise to the date of birth, or change the day to the first of every month (effectively reducing the date to a month and year).
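The two date-of-birth perturbations mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names and the choice of a uniform day offset are my assumptions, not a description of any particular de-identification tool.

```python
import random
from datetime import date, timedelta

def generalize_to_month(dob: date) -> date:
    """Move the day to the first of the month, leaving only month and year."""
    return dob.replace(day=1)

def add_noise(dob: date, max_days: int, rng: random.Random) -> date:
    """Shift the date by a random number of days in [-max_days, +max_days].

    The uniform offset is an assumption made for this sketch.
    """
    return dob + timedelta(days=rng.randint(-max_days, max_days))
```

Passing in the random number generator explicitly makes the perturbation repeatable when the same test data set has to be regenerated.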

Proper de-identification techniques will provide good protection to the clinical trials data making it suitable for testing purposes. We have published an article on de-identification recently, which you can access from the JAMIA site. This describes some improvements to k-anonymity, a popular de-identification framework, but also gives you an extensive literature review to follow-up on.
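To give a feel for the basic k-anonymity idea (each record should be indistinguishable from at least k-1 others on the identifying variables), here is a minimal Python sketch. It is only the plain check, not the improvements described in the article, and the field names birth_year, gender, and fsa are hypothetical.

```python
from collections import Counter

# Hypothetical quasi-identifier fields; a real data set will have its own.
QUASI_IDENTIFIERS = ("birth_year", "gender", "fsa")

def class_key(record):
    """The combination of quasi-identifier values that defines a record's group."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)

def records_below_k(records, k):
    """Return the records that share their quasi-identifiers with fewer than k records."""
    sizes = Counter(class_key(r) for r in records)
    return [r for r in records if sizes[class_key(r)] < k]
```

Records returned by records_below_k are the ones that would need to be generalized further or suppressed before the data set could be considered k-anonymous.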

It should be noted that we are talking about generating data for testing here. De-identification often results in some records having to be suppressed. In clinical trials the cost of collecting data for each patient is so high that it would be quite painful to suppress records.

Protecting test data privacy when testing protocol amendments

When testing an electronic clinical trial system after a protocol amendment, you will probably need to mask real data so that you can use it for testing. Masking is the first step in protecting the privacy of the patient data. In this entry I will explain what masking means.

If the original clinical trial data has names and addresses, for example, you cannot send this data across to the testers to test with. But these variables cannot be removed either because the testing cannot be done properly otherwise. To take a simple example, if there is a function to allow browsing of patients by their names sorted in alphabetical order, then the best way to test this function is to have real patient names.

The masking approach is to replace the real names with random names. The Census Bureau has a list of common North American names that can be used as replacements for the real names. You can do it in a gender correct way as well. There is a slight chance that the replacement random name is the same as the original name, but this chance will likely be very small and it is not possible for the tester to know where this has happened. Also, if you randomize the first and last names independently then the chance of a replacement full name matching the original is really remote.
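A minimal sketch of gender-correct name masking in Python follows. The short name lists are hypothetical stand-ins for a real census list, and mask_name is a name I made up for illustration.

```python
import random

# Tiny hypothetical stand-ins for a census list of common names.
FIRST_NAMES = {
    "F": ["Mary", "Linda", "Susan"],
    "M": ["James", "John", "Robert"],
}
LAST_NAMES = ["Smith", "Johnson", "Brown", "Miller"]

def mask_name(gender: str, rng: random.Random) -> tuple[str, str]:
    """Pick a replacement first and last name independently.

    The first name is drawn from the list matching the patient's gender,
    so the masked record stays plausible for testing.
    """
    first = rng.choice(FIRST_NAMES[gender])
    last = rng.choice(LAST_NAMES)
    return first, last
```

Because the first and last names are drawn independently, the chance of reproducing a patient's real full name shrinks as the lists grow.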

You can also mask addresses. For example, a postal code can be replaced with another randomly selected postal code from the same Forward Sortation Area, city, or province. But if postal codes are masked, then all other geographic information also needs to be consistently masked. For example, if there is a phone number or a street address it has to be equally distorted so that it is in the same postal code as the random replaced postal code.
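Postal code masking within a Forward Sortation Area could look roughly like the Python sketch below. The lookup table is hypothetical and its values are illustrative only (not checked against a real postal directory); a real tool would load a full directory and would also have to adjust phone numbers and street addresses to match.

```python
import random

# Hypothetical lookup of postal codes grouped by Forward Sortation Area (FSA).
# The codes are illustrative values, not verified against a real directory.
POSTAL_CODES_BY_FSA = {
    "K1H": ["K1H 8L1", "K1H 7W9", "K1H 5B2"],
    "M5V": ["M5V 2T6", "M5V 3L9"],
}

def mask_postal_code(postal_code: str, rng: random.Random) -> str:
    """Replace a postal code with a random one from the same FSA."""
    fsa = postal_code.replace(" ", "")[:3].upper()
    return rng.choice(POSTAL_CODES_BY_FSA[fsa])
```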

Masking can also deal with health insurance card numbers. I have seen clinical studies that collect this information to facilitate linking to administrative databases at a later point in time. But this kind of information is additionally sensitive because it can facilitate medical identity theft.

Such masking must be done every time production data is used in testing.

There are some freely available masking tools for Canadian data sets that have been developed by our research group. If you are interested let me know. I can also point you to commercial tools.

The other type of protection that is needed is de-identification. This deals with residual privacy risk from the remaining data. That will be my next posting.

Getting good data for testing protocol amendments in your e-clinical trial

After making changes to an e-clinical trial, say due to a protocol amendment, it is necessary to re-validate the system. This necessitates running tests on the data collection forms, any logic embedded within them, real-time or batch rules for the validation of the data, alerts and notifications, randomization setup, and reports.

But you would not want to do this testing on the real "production" data from the clinical trial. It will be necessary to stage the new version of the system somewhere and run tests on that. Once the tests pass, propagate the changes to the real "production" system.

Note here that I am not talking about testing any generic EDC software functionality, but the functionality pertinent to the clinical trial itself. If the EDC software itself changes, then that introduces its own set of issues that need to be addressed, as I discuss here.

Of course, some EDC systems do not differentiate between e-clinical trial changes and EDC system changes because the EDC is custom developed for a particular trial. In such a case any change to the e-clinical trial, say new form validation logic, entails making changes to the EDC software. In those situations the sponsor needs to make sure that both sets of issues are dealt with appropriately.

To do proper testing on a staged version of the e-clinical trial requires some data. For example, if you will enter valid and invalid data to test form logic or test the calculations in a report, you need to either enter new data or have data already in your test database. Where do you get this data from?

There are three options. First, you can just copy data from the real clinical trial to the staging area and use that for testing. This is a very dangerous strategy and has serious privacy implications. The first question is whether the testers are authorized to access the personal health information of the patients on the trial. Testers are not necessarily screened as diligently as other staff, and in some cases this work is outsourced to faraway lands where labor is cheap. The second question is whether the information security infrastructure at the testing site is sufficient to handle real patient data. In most organizations the answer to both of these questions is no. This increases the risk of inadvertent privacy breaches. In many jurisdictions it is now a legal requirement to notify patients if their personal health information is leaked. If this happens it will not help recruitment and retention of subjects: patients will not be pleased if their sensitive health information is lost. So you do not want to be giving real patient data to your testers.

Also, there will likely be many protocol amendments, so this will have to be done repeatedly as the e-clinical trial system evolves.

The second option is to create artificial data to use for testing. This can work, but it is time consuming to create artificial data that is realistic and that reflects the actual distributions of your real clinical trial data. If the data is not realistic then there is also the risk that the testing will not uncover important flaws in the new version of the e-clinical trial.

The third option is to anonymize real production data and use the anonymized data for testing. The advantages here are that this process can be fully automated (so it is easier to do repeatedly) and the data used in testing is as realistic as it gets.

There are two ways to anonymize production data: masking and de-identification. Ideally you would want to do both.

In subsequent posts I will discuss each of these two approaches to anonymizing real clinical trial data so that you can do effective and secure testing of protocol amendments.

Archiving data from e-clinical trials

Most national regulations require that the data from a clinical trial be archived and available for a number of years after the trial's completion. That number of years varies by country. This means that the clinical trial needs to have an archival strategy.

When using an EDC system, it is also important to archive the meta-data used during the trial. Meta-data includes information about the eCRFs, the questions, and the validation logic. The meta-data plays a critical role in understanding how data was collected and handled during the study.

An obvious choice of format for archiving is the CDISC Operational Data Model (ODM). This is a de facto standard for representing clinical trial data in XML. The advantage of using this is that one would expect that in five or ten years' time there will be a company out there with a viewer for ODM that will allow a regulator or auditor to look at the ODM trial data in a usable way. This is the advantage of a standard - if the standard is adopted, which ODM is, a market emerges to support it.

Of course, an ODM viewer is only the starting point. What is really needed is an ODM Navigator that allows the user to query the data to extract subsets, to understand trends, and to look at data on a site by site basis.

However, ODM has important weaknesses as well that you need to be aware of. Of course, these may be addressed in future iterations of the standard.

The first is that it is not good at representing logic. Therefore any complex validation logic, calculations, and notifications will be difficult to capture in ODM format. That information can be critical for understanding what happened in a clinical trial many years from now.

Status information is also not so straightforward to represent. For example, whether a patient is withdrawn, excluded, lost to follow-up, and so on.

A second, and perhaps more critical issue, is that to be able to truly recreate a trial at any point it is necessary to maintain site specific snapshots of its meta-data and its data. Conceptually, this means that each site would need to have an ODM export before any change was made to the data.

As I mentioned in another posting, not all sites will be ready to adopt a protocol amendment at the same time: each site may be using a different version of the eCRFs at any one point.

Let's imagine we have two sites, A and B. Because they can get ethics approvals at different times, site A is able to deploy a protocol amendment before site B. We will denote the eCRFs before the amendment as v1 and those after as v2. To archive this data we will have four snapshots: site A with v1 data, site A with v2 data, site B with v1 data, and site B with v2 data. The archive must distinguish among these four snapshots.
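One way to picture the four snapshots is to key the archive on the pair (site, eCRF version). The Python sketch below assumes each archived record carries hypothetical site and ecrf_version fields; it does not reflect how ODM or any specific archiving product stores this.

```python
from collections import defaultdict

def build_snapshots(records):
    """Group archived records by (site, eCRF version).

    Each key identifies one snapshot, so site A under v1 is kept apart
    from site A under v2, and likewise for every other site.
    """
    snapshots = defaultdict(list)
    for record in records:
        snapshots[(record["site"], record["ecrf_version"])].append(record)
    return dict(snapshots)
```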

When examining an archiving solution, make sure that it can address this issue. Such a capability will save a lot of agony later on.

Although I did not discuss it here, audit trails also need to be part of the archive. They will also have the same versioning issue discussed above.

What's in an audit trail?

Maintaining an audit trail is a 21 CFR Part 11 compliance requirement. But what makes a good audit trail that is effective and meets the regulation's intentions?

I have seen audit trails that capture every single transaction that runs on a database. This is important to do because in some cases people will not come in through the front door, so to speak. Therefore, a detailed audit trail is needed to forensically analyze any intrusions.

In theory if you do that then you have met the letter of the regulations. But in practice this is not enough. And some auditors will not be satisfied with an audit trail that only a database expert who understands the exact data model behind the EDC system can interpret.

Audit trails must be viewable/accessible to end-users. For example, a site coordinator should be able to see all changes made to an eCRF, by whom, and when, without having to go through SQL. So a subset of the audit trail must be consumable by end-users. This subset includes:

All modifications to data and meta-data (eg, someone changes an eCRF design)
All system logins and attempted logins
All randomizations

An audit trail must include a time stamp, as well as the account name and IP address of the user.

The above information should be viewable by an end-user. Of course there needs to be access control on the audit trails so that a user cannot view information about another user or site that they are not allowed to see.
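As a rough sketch of what an end-user-facing audit entry and its access control could look like, consider the Python below. The AuditEntry fields and the action names are my assumptions for illustration; an actual EDC system will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime   # when the event happened
    account: str          # login account that performed it
    ip_address: str       # address the user connected from
    site: str             # site the affected data belongs to (hypothetical field)
    action: str           # eg, "data_change", "login", "login_failed", "randomization"
    detail: str           # human-readable description of the change

def visible_entries(entries, allowed_sites):
    """Return only the entries for sites the viewing user is allowed to see."""
    return [e for e in entries if e.site in allowed_sites]
```

A site coordinator restricted to one site would then be shown only visible_entries(all_entries, {"A"}), without ever touching SQL.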

The importance of having audit trails viewable by end-users is evident when you consider that users can check changes and see who made them. This can help catch errors or even malicious attempts to manipulate data quite quickly.

Readily accessible audit trails are very useful for investigating unexpected changes to eCRFs and data, and to determine whether a potential security or privacy breach has resulted in inappropriate disclosure of personal information.

There are issues with storing such a large volume of data, but there are also good architectural solutions to make this work. Therefore, storage should not be a reason for not having good audit trails.

What are electronic signatures?

When using an electronic data capture (EDC) system, it is important to ensure that all data entry, modification, and deletion, as well as any access and sign-offs, are done by authorized people. The most common approach is to authenticate users with a username and password. If the person who is logged in is authorized to manipulate the clinical trial data then there is no problem. The electronic signature is effectively the audit trail showing that it was this login account that made that change.

Recently, I have seen auditors starting to get uncomfortable with this simple approach to authentication: username/password. The concern is that there are increasing ways for accounts to be compromised using this type of authentication, and that it is therefore more plausible that someone will fake the electronic signature.

There are a number of solutions that have been proposed. The first is to use biometrics. A good example is a fingerprint reader that is installed on the computer of each end-user. This biometric can be used to log in to the EDC system. The major disadvantage here is cost. There are two elements to cost.

The first is the hardware. If a site has, say, three possible computers that can be used to access the EDC then three readers are needed. Now multiply this by the number of sites. Now assume that a certain percentage of these will fail or be damaged every year. For a large long-lasting trial these costs can add up.
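The hardware arithmetic above can be written down as a small Python function. All of the input figures in the usage note are made up for illustration, and the model (a fixed fraction of the installed readers replaced each year) is my simplifying assumption.

```python
def total_readers(sites: int, computers_per_site: int,
                  annual_failure_rate: float, years: int) -> float:
    """Estimate how many readers are needed over the life of a trial.

    Assumes one reader per computer, plus a fixed fraction of the
    installed readers replaced every year.
    """
    initial = sites * computers_per_site
    replacements = initial * annual_failure_rate * years
    return initial + replacements
```

For example, with hypothetical figures of 50 sites, three computers per site, a 10% annual failure rate, and a four year trial, the estimate is 150 initial readers plus about 60 replacements.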

The second element is IT support costs. Generally speaking, end-users are not able to install new software or hardware on their work computers. So the IT department has to do that for them. Most IT departments are stretched, so it may take them some time to install things. The only way to create an incentive for them to pay attention to your trial's IT needs is to pay them. Also, over the duration of the trial end-users will have questions or problems about the biometric system (eg, it is not working, too slow, a user cannot login, etc.). Therefore it is necessary to allocate support staff for the duration of the trial to remotely troubleshoot user problems with the biometric system.

An alternative approach is to use one time passwords. These can be very secure. This means that every end-user is issued with a small electronic device that generates temporary passwords as needed. The end-user has to carry this with him/her all the time. From a hardware cost perspective, each person needs a device. There will be no site IT support costs here but the overall trial support cost can still be significant. This is because users lose those devices (they are easier to lose than devices attached to a computer).

The second difficulty with one time passwords is that if a study coordinator is involved in multiple trials using different EDC systems and each requires them to carry one of these devices, it can become unwieldy.

In any case, if you have the budget the above are good options.

Two low budget solutions augment usernames and passwords: secret questions and out-of-band confirmation.

Many people have seen secret questions used, say by their on-line bank. It is the same idea here: when a user tries to log in they are asked a secret question, and if they get the answer right then they are logged in. Typically the user will provide answers to multiple secret questions and the system will select one of these at random for the user to answer at each login. This approach also makes it more difficult, in theory, for someone to phish an EDC site because the user is expecting to be presented with one of their secret questions. If a phishing site presents them with a question that they never provided an answer to before, the user may get suspicious.

There are two things to keep in mind with secret questions. First, relatives or friends of an end-user will often know the answers to the most commonly used secret questions (pet's name, school name, favorite movie, etc.). So this approach is not safe from that kind of intruder.

The second issue is that users are often easily tricked into bypassing such controls. Experiments have shown that when presented with a message such as "our secret question module is being upgraded / under maintenance", users will accept that and think nothing of it. So as a mechanism to alert users to potential phishing sites it may not be very effective.

Nevertheless, this additional level of authentication is an improvement over plain usernames and passwords, and entails only a small amount of additional effort on the user's side to log in. It is also low cost (no hardware) and will provide additional assurance that the person manipulating the trial data is the owner of the account.

Another option is to let the user log in as usual, but before you let them access the data you ask for a six digit PIN. That PIN is generated automatically and sent to the user by SMS or email. It would only be valid for, say, five minutes. So the user has to read the PIN from their SMS or email and enter it to log in. This type of out-of-band communication makes it more likely that the person logging in is the account holder because it is difficult to intercept someone's messages and it is unlikely that someone will deliberately give someone else their email or SMS account information.
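The PIN step can be sketched in Python as follows. This covers only generating and checking the PIN; delivery by SMS or email is left out, and the function names and the way the expiry time is passed around are assumptions made for this sketch.

```python
import hmac
import secrets
import time

PIN_LIFETIME_SECONDS = 5 * 60  # "say five minutes"

def issue_pin(now=None):
    """Generate a random six digit PIN and the time at which it expires."""
    now = time.time() if now is None else now
    pin = f"{secrets.randbelow(10**6):06d}"  # zero-padded, eg "004217"
    return pin, now + PIN_LIFETIME_SECONDS

def verify_pin(submitted, issued_pin, expires_at, now=None):
    """Accept the PIN only if it has not expired and matches what was issued."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    # Constant-time comparison so timing does not leak how many digits matched.
    return hmac.compare_digest(submitted.encode(), issued_pin.encode())
```

The secrets module is used rather than random because the PIN is a security credential and should not be predictable.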

Neither of these latter two solutions is bullet-proof. But when combined they provide an effective authentication mechanism to establish reliable electronic signatures for the purposes of Part 11.

Open source and 21 CFR Part 11

The first evidence that an auditor usually asks for is the set of training records of the individuals responsible for developing, maintaining, and operating an electronic data capture (EDC) system. These records typically include an updated CV, a job description, and a list of all training received by the individual since getting responsibility for some aspect of the EDC system. If available, a helpful document would map the qualifications with the job requirements.

Open source software is based on the premise of openness. This means that anyone can report bugs, contribute bug fixes, and implement whole new pieces of functionality to the open source product. In fact, many new programmers see participation in open source projects as an opportunity to learn programming skills from their peers and gain experience.

The people making these contributions can come from anywhere in the world and they do change/churn over time. Each project will have a set of gatekeepers. These individuals will be more experienced, but they will also change over time.

The challenge of using an open source EDC system is how to provide training records for a transient development team that is not part of a formal structure. This can work if the gatekeepers work for a single organization and scrutinize every piece of code that is submitted; one can then argue that the gatekeepers are effectively the development team.

However, when an open source EDC system is truly a community effort and not tied to a specific organization employing the bulk of gatekeepers, this becomes a tricky case to make.

The issue does not go away if you just "operate" an open source EDC system since the question remains about the qualifications of the individuals who developed and are maintaining the code for the system. There is an exception, which I will discuss below.

The second question an auditor will ask is about the standard operating procedures (SOPs), including those that cover the development, maintenance, and operation of the EDC system. The auditor will want evidence that these SOPs are followed, that the individuals working on the EDC system have been properly trained on them, and that they have been re-trained after every change made to the SOPs. With a distributed community of developers who are volunteering their time, it may be a challenge to demonstrate that they always follow the SOPs, and they may not be motivated to be educated (repeatedly) on SOPs. Furthermore, there is always the question of who is developing and maintaining the SOPs to make sure that they are accurate and up-to-date. There needs to be an SOP owner and maintainer. It is well known that open source developers are not big fans of producing detailed documentation. And maintaining a large set of SOPs and keeping them consistent is no trivial task.

Many auditors still like to see hand written signatures authorizing SOP changes (I know, I know). This can be challenging with an open source project where most communication is electronic.

The exception is that when a system is deemed widely used (such as Excel or Linux), you can take it for granted that it has been validated through extensive usage. This is reasonable because it is not realistic to expect an EDC system developer or operator to validate every piece of software used, including the operating system. For third party components (including open source ones), if the case can be made that they are widely used in industry, then most auditors will put them out of scope for validation and audit.

Having multiple electronic clinical trial versions

With an Electronic Data Capture (EDC) system running as a service (ie, SaaS model), one of the big advantages is that whenever there are new features they can be deployed immediately to all sites. It also means that any protocol amendments can go through to all sites instantaneously. This is a long way from the old days where volumes of paper had to be copied and sent to all sites.

This instant deployment of amendments, however, can cause a problem because not all sites are ready to implement the amendments at the same time. For example, a protocol amendment may require a research ethics board approval if it is substantial and not all sites will be able to get their approvals at the same time or be ready when the changes to the EDC system are pushed out. Some amendments may require additional resourcing or changes in their staffing and not all sites will be ready at the same time.

Practically, to deal with this the EDC system must allow multiple versions of an electronic clinical trial to be live at the same time. It may be more than two versions if some sites are really slow coming on-board.

This also means that a version number needs to be associated with everything - every data record, eCRF, report, rule, logic, and so on. For example, when data is exported then the end-user needs to be able to specify the version number(s) of the clinical trial that should be exported.

Maintaining versions is important because there may be subtle changes in the data collection instruments that have an impact on the collected data. For example, a measurement unit may change during a protocol amendment. If it is not clear which version of the trial a particular data element pertains to, subsequent analysis may be questionable.
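To make the unit example concrete, suppose (purely hypothetically) that weight was collected in pounds under version v1 of the trial and in kilograms under v2. If every record carries its version number, an export can be filtered by version and converted to a common unit, as in this Python sketch. The field names and the version-to-unit table are my assumptions.

```python
# Hypothetical mapping from trial version to the unit used for weight.
WEIGHT_UNIT_BY_VERSION = {"v1": "lb", "v2": "kg"}
KG_PER_LB = 0.45359237  # exact definition of the pound in kilograms

def export_records(records, versions):
    """Return only the records collected under the requested trial version(s)."""
    wanted = set(versions)
    return [r for r in records if r["trial_version"] in wanted]

def weight_in_kg(record):
    """Convert a record's weight to kilograms using its version's unit."""
    unit = WEIGHT_UNIT_BY_VERSION[record["trial_version"]]
    return record["weight"] * KG_PER_LB if unit == "lb" else record["weight"]
```

Without the trial_version field there would be no way to tell whether a stored value of 200 means pounds or kilograms.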

Even though this is not a data tampering issue, it certainly is a data quality issue. Auditors and sponsors need to ask versioning questions about their EDC system.

Phishing electronic clinical trials

Many operators of EDC platforms will send emails regularly to their end-users. These emails may notify users of maintenance windows, new features or releases of the software, or upcoming training sessions and webinars.

This also opens up all clinical trials hosted on that platform to some relatively simple phishing attacks.

Imagine an intruder who constructs a login page that looks exactly like the login page of your favourite EDC product. Except that this web page is hosted by the intruder, and as soon as someone enters their username and password, it gets sent to the intruder. After a user attempts to log in, they get an error message saying that their username and password are incorrect, and are then redirected to the real EDC platform login page. On the second attempt the user is able to log in successfully.

The end-user thinks nothing of this. But now the intruder has their username and password.

How does the end-user actually get to that fake login screen?

The intruder will send them an email that looks like it is coming from the EDC platform operator. This email will require urgent action. For example, a serious adverse event has been submitted that requires immediate review. Then it will have a link to the login page. Except that the link will be to the fake page. If the end-user clicks on that link then the sequence above starts.

This is a classic phishing attack.

An intruder can then log in as that phished end-user and modify data. The audit trail will show that all changes have been made by that end-user.

This raises an interesting scenario. If there is ever data tampering under an end-user's account, that end-user can make a very plausible defense that they were the victim of a phishing attack and it was not really them who made the change.

The end-user can be a study coordinator, site investigator, a nurse involved in the trial, a data entry clerk, a monitor, or a clinical trial manager. These individuals implicitly trust the EDC platform operator - they have entrusted their most sensitive patients' data with the EDC platform operator. So if they get an email that looks like it is coming from this trusted source they will not suspect it at all. This is especially true if the EDC platform operator regularly sends emails to end-users.

To mitigate such risks all communication with end-users should be within the EDC platform itself rather than through regular email. For example, the operator can send important notices through a messaging system within the EDC platform. If that is the approach taken then the end-user can be trained never to expect emails from the EDC platform operator, and any such emails would be treated as suspicious. Most banks follow this practice.

An alternative is to send emails but never embed clickable links in them. Then the end-users are trained not to expect links. This is, however, a subtle point and many end-users will forget the details over time. Never sending regular emails is the safest option.

One important capability to make it possible to forensically detect a phishing attack is to keep the IP address of the end-user in the audit trail. So for example, if someone logs in from Russia and modifies some data and there are no sites in Russia and no end-user was travelling in that part of the world at that point in time, then that should raise some red flags. It also makes it more difficult for an end-user to make the "I was phished" claim to deny data tampering (for example, if the IP address that made the changes was from their own hospital).
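A simple version of this check can be sketched in Python. Instead of looking up the country of each address, which needs a geolocation database, this sketch flags audit entries whose IP address does not fall inside any known site network. The site networks below are placeholder documentation ranges, and the entry layout is hypothetical.

```python
import ipaddress

# Hypothetical site networks, using reserved documentation address ranges.
SITE_NETWORKS = {
    "site_a": ipaddress.ip_network("192.0.2.0/24"),
    "site_b": ipaddress.ip_network("198.51.100.0/24"),
}

def matching_site(ip: str):
    """Return the name of the site whose network contains this address, if any."""
    address = ipaddress.ip_address(ip)
    for site, network in SITE_NETWORKS.items():
        if address in network:
            return site
    return None

def suspicious_entries(entries):
    """Return audit entries made from an address outside every known site network."""
    return [e for e in entries if matching_site(e["ip_address"]) is None]
```

An entry that comes back from suspicious_entries is not proof of an intrusion (a user may have been travelling), but it is a red flag worth following up.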

As a CRO, sponsor, or auditor you should find out what anti-phishing strategies are being used by your EDC platform operator.