
ISS
Istituto Superiore di Sanità
EpiCentro - L'epidemiologia per la sanità pubblica


Use of Different Identifiers for Record Linkage of Health Archives

Carlotta Sacerdote (1), Marco Dalmasso (2), Giovannino Ciccone (1), Moreno Demaria (3), Roberto Gnavi (2)

(1) San Giovanni Battista of Turin Hospital Authority; Cancer Epidemiology and Oncologic Prevention Center, CPO Piedmont Region, Turin

(2) Epidemiology Service – Local Health Authority 5 – Piedmont Region, Turin

(3) Environmental Health, ARPA-Piedmont Region, Turin

 

For many years, health data from sources that differ in their timeliness, access, and quality has been linked (1). In the absence of a unique and universal personal identifier in Italy to link data from multiple sources, basic identifiers such as name, date of birth, and sex that are routinely collected as part of virtually all information systems are a useful substitute.

 

Such basic variables have been used to link information regarding the same person from different data sets (2,3,4) or from the same data set over time, e.g., repeated hospitalizations of the same person. The purpose of this paper is to provide simple indicators regarding the validity of various combinations of basic variables using three different data sets of epidemiologic interest.

 

The evaluation of basic variables that can be used for linkage was performed using the following data sets:

1. The Turin cohort of the EPIC project (5), which contained personal identifiers for each participant and information on lifestyles and dietary habits for 10,604 residents of Turin. The expectation was that each individual would have his or her own unique combination of identifiers.

2. Vital records of the Commune of Turin, which contain data regarding the 1,944,080 citizens registered in the communal registry as of December 31, 1998 (including persons who had died or who had immigrated after 1971). The expectation was that each individual would have his or her own unique combination of identifiers.

3. The hospital discharge records (schede di dimissione ospedaliera; SDO) of the Piedmont Region for 1997. This data set included 923,289 episodes of hospitalization in the hospitals of the region. While no duplication of subjects was expected in the two previous archives, in the SDO the same person may be hospitalized several times and thus appear more than once in the data set. To identify those with multiple admissions, we used the Fiscal Code, which was complete and correct in 668,744 admissions (72.4% of the total). This portion of the analysis was conducted on 335,725 episodes of multiple admissions involving 122,884 different individuals. The expectation in this data set was that the identifiers would appear more than once but would be identical for a given individual.
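The selection of multiple admissions described for the hospital discharge records can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `(episode_id, fiscal_code)` record layout and the function name are hypothetical, and the regular expression only checks the standard 16-character layout of a Fiscal Code (it does not verify the check character, and the source does not say how "complete and correct" was defined).

```python
import re
from collections import defaultdict

# Assumed layout of a standard 16-character Fiscal Code:
# 6 letters, 2 digits, 1 letter, 2 digits, 1 letter, 3 digits, 1 letter.
# Format check only; the final check character is not verified.
FISCAL_CODE_PATTERN = re.compile(
    r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$"
)


def multiple_admissions(episodes):
    """Group episodes by Fiscal Code and keep persons admitted more than once.

    `episodes` is an iterable of (episode_id, fiscal_code) pairs
    (a hypothetical layout chosen for this sketch).
    """
    by_person = defaultdict(list)
    for episode_id, fiscal_code in episodes:
        code = (fiscal_code or "").strip().upper()
        # Episodes with a missing or malformed code are left out.
        if FISCAL_CODE_PATTERN.match(code):
            by_person[code].append(episode_id)
    # Keep only the persons who appear in two or more episodes.
    return {code: ids for code, ids in by_person.items() if len(ids) > 1}
```

Applied to the full archive, the number of keys in the returned dictionary would correspond to the individuals with multiple admissions, and the total length of its values to their episodes.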

 

Considering the impossibility of using the individual’s name, at least for the hospital discharge records, we constructed four different combinations of identifiers:

a) the first four letters of the family name (eliminating spaces, apostrophes, and all characters other than letters), one character for the individual’s sex, the two-letter code of the commune of birth, and the date of birth (month, day, and year)

b) same as a but using only the first three letters of the family name

c) same as a but using four letters from the family name extracted using the method used to create the Fiscal Codes (ed. note: the Fiscal Code uses consonants only)

d) same as c but using three rather than four letters from the family name
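The four combinations can be illustrated with a short sketch. Several details are assumptions, not taken from the source: the function names, the accent stripping, the `MMDDYYYY` date layout, and the order in which the fields are concatenated. For combinations c and d the sketch follows the standard Fiscal Code rule for family names (consonants in order, then vowels, padded with `X`), and assumes that the four-letter variant simply extends that rule by one character.

```python
import re
import unicodedata
from datetime import date

VOWELS = set("AEIOU")


def clean_letters(family_name):
    """Uppercase the name and drop everything that is not a letter A-Z."""
    # Accent stripping (e.g. "È" -> "E") is an assumption of this sketch.
    decomposed = unicodedata.normalize("NFKD", family_name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Z]", "", ascii_only.upper())


def first_letters(family_name, n):
    """Combinations a and b: the first n letters of the cleaned name."""
    return clean_letters(family_name)[:n]


def fiscal_letters(family_name, n):
    """Combinations c and d: consonants first, then vowels, padded with X."""
    letters = clean_letters(family_name)
    consonants = [c for c in letters if c not in VOWELS]
    vowels = [c for c in letters if c in VOWELS]
    return "".join(consonants + vowels)[:n].ljust(n, "X")


def linkage_key(family_name, sex, birth_commune_code, birth_date, combination):
    """Build the identifier for combination 'a', 'b', 'c', or 'd'."""
    if combination not in ("a", "b", "c", "d"):
        raise ValueError("combination must be one of 'a', 'b', 'c', 'd'")
    n = 4 if combination in ("a", "c") else 3
    extract = first_letters if combination in ("a", "b") else fiscal_letters
    return (
        extract(family_name, n)
        + sex.upper()
        + birth_commune_code.upper()
        + birth_date.strftime("%m%d%Y")  # assumed month-day-year layout
    )
```

For a hypothetical person named D'Amico, the first-letters method yields `DAMI` (or `DAM`), while the Fiscal Code method yields `DMCA` (or `DMC`), so the two methods can separate family names that share the same opening letters.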

The EPIC cohort and the vital records from Turin permitted us to identify the number of cases in which the four combinations of identifiers identified more than one individual; this value reflects the lack of specificity of the combination. The hospital discharge records, in which the assumption was made that records with the same Fiscal Code were from the same individual, allowed us to calculate the lack of sensitivity, defined as the number of episodes in which the various combinations of identifiers failed to link records that were linked on the basis of their Fiscal Code.
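The two kinds of error described above can be computed along the following lines. This is a hedged sketch: the function names and input layouts are hypothetical, and the counting unit (individuals in both functions) is an assumption, since the source does not spell out the exact denominators.

```python
from collections import Counter, defaultdict


def pct_shared_keys(keys):
    """Percentage of individuals whose key is shared with someone else.

    `keys` holds one linkage key per distinct individual, as in the
    EPIC cohort or the vital records, where no duplicates are expected.
    """
    counts = Counter(keys)
    shared = sum(c for c in counts.values() if c > 1)
    return 100.0 * shared / len(keys)


def pct_split_persons(records):
    """Percentage of persons whose records carry more than one key.

    `records` holds (fiscal_code, linkage_key) pairs, one per episode;
    the Fiscal Code is taken as the reference identity of the person.
    """
    keys_by_person = defaultdict(set)
    for fiscal_code, key in records:
        keys_by_person[fiscal_code].add(key)
    split = sum(1 for keys in keys_by_person.values() if len(keys) > 1)
    return 100.0 * split / len(keys_by_person)
```

The first function applies to archives with one record per person, the second to archives such as the hospital discharge records, where repeated records of the same person are expected to carry identical keys.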

 

The Table presents the percentage of individuals correctly identified, and the changes obtained, using the four combinations of identifiers for the EPIC and vital records data sets and for two groups of patients from the hospital discharge records: those who had multiple hospitalizations and those who had only two hospitalizations. With the EPIC data, only minor differences were observed when 4 letters were used rather than 3, for both the first-letters and Fiscal Code systems of letter extraction; for all four combinations, correct identification exceeded 99.85%. Correct identification was lower for the vital records, ranging from 97.71% to 99.10%; here the differences between using 4 and 3 letters were greater. For the hospital discharge records, adding more letters actually slightly decreased the percentage successfully linked; for multiple hospitalizations, the differences were -0.12% for the first-letters system and -0.99% for the Fiscal Code system, with overall values ranging from 94.96% to 96.06%. For those with only two hospitalizations, the percentage of accurate identifications increased (range 96.08%-96.85%), and the difference attributable to the number of characters used decreased. Thus, as might have been expected, a greater number of hospitalizations increased the risk of an incorrect identification.

Perhaps more interesting is the analysis comparing the methods used for letter extraction (Table). Using the Fiscal Code system resulted in a higher percentage of correct matches for both the EPIC and vital records data. As seen above, for the hospital discharge records, the changes observed were in the opposite direction, with the Fiscal Code combinations performing more poorly than the first-letters combinations.

 

As expected, the comparison between the combinations using 3 or 4 letters of the family name demonstrates that as the number of characters increases, the probability of assigning the same identifier to different persons decreases, while the probability of mistakenly treating distinct events happening to the same individual as belonging to different persons increases.

 

Regardless of the combination used, the probability of incorrectly identifying a subject was less than 2.5%. The percent error was reduced by an average of 1% when 4 characters were used instead of 3 and by 0.5% when the Fiscal Code method was used for letter extraction rather than the first letters of the family name. The probability of having different identifiers for the same individual in the hospital discharge data set was about 4% for all the combinations except the one using the 3-letter Fiscal Code algorithm (error 5%).

 

The two combinations that minimized errors with both the EPIC/vital records and hospital discharge types of data sets were 1) the one that used the first 4 characters of the family name and 2) the one that used 3 letters extracted according to the Fiscal Code algorithm. The choice of combination will depend on the type of linkage to be performed and the quality and completeness of the data available.

 

References

1. Rosso S. Archivi e liste di popolazione: accessibilità, completezza, aggiornamento. Atti del Convegno della Associazione Italiana di Epidemiologia, 1986

2. Lagorio S, Forastiere F, Michelozzo P, et al. Accertamento delle cause di morte in studi di follow-up: confronto di procedure utilizzabili in Italia. Epid Prev 1987; 31: 57-61.

3. Costa G, Demaria M, Bisanti L, et al. Uso di dati amministrativi per la ricerca epidemiologica. La consultazione dell’archivio dei codici fiscali per l’accertamento di esistenza in vita negli studi di coorte. Epid Prev 1988; 35: 40-46.

4. Costa G, Demaria M. Un sistema longitudinale di sorveglianza della mortalità secondo le caratteristiche socio-economiche, come rilevati ai censimenti di popolazione: descrizione e documentazione del sistema. Epid Prev 1988; 36: 37-47.

5. Riboli E, Kaaks R. The EPIC Project: Rationale and study design. Int J Epidemiol 1997; 26: 1 (suppl. 1).