Modeling Group Life Mortality Experience: A How-To Guide - Part 4: Validation of Industry and Area Data

By Jonathan Polon on 4/12/2018

Overview

The focus of this, the fourth part of our series on modeling Group Life Mortality Experience, is to continue the data validation of the Basic Life Death and Disability dataset that we started in the previous post.

Specifically, we will focus on the industry (SIC code) and area (zip code) data. The data validation exercise will uncover some concerns about the integrity of the data in these data fields.

 

Summary of Previous Posts

Before we move forward, here is a brief reminder of what we covered in the most recent post in this series.

Part 3: Data Validation

• Outlined a general approach to data validation

• Provided examples of validating data fields of different data classes, including:

o Factor

o Integer

o Numeric

• Indicated that, for this specific dataset, the data validation approach would reveal concerns for both Industry and Area data fields... and that these would be reviewed in the following post

General Approach to Data Validation – Quick Reminder

The general approach to data validation was outlined in the previous post.

As a reminder, our objective at this stage of the project is to validate the data quality. We’re not yet ready to begin analyzing mortality experience. We simply want to review each data field to ensure that (i) we understand the data within and (ii) the distribution of the data is generally consistent with our expectations. 

Data Exploration in R

As in previous posts, we will not require any scripting for this data exploration in R; we will continue to work interactively at the command line. By convention, I will show the commands that I enter in grey text and the output returned in blue text. The number sign (#) will be used to denote comments.

Industry Data Fields

The previous post included a list of the data fields in our dataset. Three of these data fields relate to Industry:

• four_digit_sic_code (data class = “Factor”)

• three_digit_sic_code (data class = “Factor”)

• two_digit_sic_code (data class = “Integer”)

Notice that the data class differs across the three Industry-related data fields. For the 2-digit SIC codes the data class, as expected, is Integer. But for both the 3-digit and 4-digit SIC codes the data class is Factor. It is possible that the 3-digit and 4-digit SIC code data fields have integer-type data that has, for some reason, been stored in R as Factor data class. But it is also possible that there are some records for which these fields do not contain integers – or even numbers. Let’s begin our validation of Industry data with an attempt to convert the 4-digit SIC code data to a new data field with data class of Numeric.

The first two functions we will use in R will be:

• as.character: Will coerce the inputted data to a field where data class = “Character”.

• as.numeric: Will coerce the inputted data to a field where data class = “Numeric”.

# Create a new data field where four_digit_sic_code is converted to a “Numeric” data class

# This is achieved by first converting the data field from “Factor” to “Character” and then

# Converting the data from “Character” to “Numeric”

 

SOA_Basic_Data$sic4 <- as.numeric(as.character(SOA_Basic_Data$four_digit_sic_code))

Warning message:

NAs introduced by coercion

# Some of the data entries could not be converted to numeric

# This likely indicates that some of the entries include non-numeric characters
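Why go through as.character first? Applied directly to a factor, as.numeric returns the underlying level indices rather than the values – a classic R pitfall, illustrated here on a small toy factor (not part of the dataset):

```r
# Toy factor: levels sort alphabetically as "00UN", "4911", "8211"
f <- factor(c("4911", "00UN", "8211"))

as.numeric(f)                 # 2 1 3 -- the level indices, not SIC codes
as.numeric(as.character(f))   # 4911 NA 8211, with a coercion warning for "00UN"
```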

We now know that the four_digit_sic_code data field includes values that could not be converted to Numeric. However, we do not know the contents of these data entries, nor do we know the number of exposures and deaths related to these records. To get this information, we can use the is.na, summary and sum functions.

• is.na: The generic function is.na indicates which elements have a missing value indicator.

• summary: A generic function used to produce result summaries; applied to a Factor, it tabulates the counts for each level.

• sum: Returns the sum of all the values present in its arguments.

# Find the non-numeric entries in the four_digit_sic_code data field

# Apply “is.na” function to the “sic4” data field.

# Then summarize on the 4-digit SIC code.

# The “maxsum” parameter indicates to display 4 values – the 3 most common and “Other”

 

summary(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$four_digit_sic_code, maxsum = 4)

00UN      0000      0013      (Other)

 871         0         0            0

# All of the non-numeric entries in the four_digit_sic_code field have value of “00UN”

# And there are 871 such entries

 

 

# Find the number of death_policies_exposed and death_count where sic4 is “NA”

sum(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$death_policies_exposed)

845.5364

sum(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$death_count)

0

I was a little concerned about the presence of non-numeric data in the Industry data fields. However, we have now determined that these records make up an immaterial amount of our experience – only 845 exposures (out of 25 million total exposures) and 0 deaths.

Of course, looking at the non-numeric values is just the first step in validating the integrity of the Industry data. As a next step, let’s apply the summary function to the sic4 data field that we created (which contains 4-digit SIC codes as a numeric data class).

# Summarize the “sic4” data field from the SOA_Basic_Data data frame

 

summary(SOA_Basic_Data$sic4)

Min.        1st Qu.         Median        Mean       3rd Qu.       Max.     NA's

0               4911               6722            6220          8211         9999       871


The quartiles and averages are not meaningful because the SIC codes are just that – codes, not values. But there are two interesting observations:

• The minimum value is 0, meaning an SIC code of “0000”.

o   There is no industry with SIC code of “0000” – this is most likely used for unknowns.

• The number of NA records is 871 – consistent with our earlier findings.

Let’s also summarize the 2-digit SIC code data field to find out if the records with 4-digit code of “00UN” have value of “0” or “NA” for the 2-digit SIC code.

# Summarize the “two_digit_sic_code” data field from the SOA_Basic_Data data frame

 

summary(SOA_Basic_Data$two_digit_sic_code)

Min.         1st Qu.          Median         Mean          3rd Qu.        Max.

0.00           49.00              67.00           61.77            82.00          99.00

All entries in the two_digit_sic_code data field are, in fact, integers. There are no “NA” values.

Convert SIC Codes to Industry Class

The 2-digit SIC code field will contain, at most, 100 unique values. This is clear because the data class is “Integer” and the values range from 0 to 99.
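As a quick sketch of that kind of check (using a toy integer vector as a stand-in for the real two_digit_sic_code column):

```r
# Toy stand-in for two_digit_sic_code: Integer class, values confined to 0-99
two_digit <- c(0L, 7L, 34L, 61L, 99L)

stopifnot(is.integer(two_digit), !any(is.na(two_digit)))  # Integer class, no NAs
range(two_digit)            # 0 99 -- within the valid 2-digit SIC range
length(unique(two_digit))   # 5 distinct codes in this toy vector
```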

We would like to look at the distribution of exposures by Industry. We could look at the distribution for the 100 or so unique values of 2-digit SIC codes. But it is probably more meaningful to consolidate these values to a smaller number of industry groupings.

There are a couple of different approaches that we could apply to map SIC codes to industry groupings. One option is to import a translation table – but we will save this approach for zip code analysis. The other option, which we will apply here, is to use “if” statements. In R, we do this using the ifelse function.

• ifelse: Returns a value with the same shape as the test object, filled with elements selected from either yes or no depending on whether each element of the test is TRUE or FALSE.

• as.factor: Will coerce the inputted data to a field where data class = “Factor”.

Before we start working in R, here is the translation we will use:

2-Digit SIC Code          Industry Group

00                        Unknown

01-09                     Agriculture, Forestry and Fishing

10-17                     Mining and Construction

18-19                     Not in Use

20-39                     Manufacturing

40-49                     Transportation, Communication and Utilities

50-59                     Trade

60-67                     Finance, Insurance and Real Estate

68-69                     Not in Use

70-89, ex 80 and 82       Services

80                        Health Services

82                        Educational Services

90-99                     Public Administration

The use of “if” statements will be a little cumbersome, given that we are mapping to 12 unique Industry groupings. But part of the objective of this blog series is to provide some readers with exposure to the R environment – and so we will take this opportunity to gain experience with “if” statements.

# Map 2-digit SIC codes to a new data field for Industry Groupings

# Call this new data field “IndGrp”

 

SOA_Basic_Data$IndGrp <- as.factor(ifelse(SOA_Basic_Data$two_digit_sic_code == 0, "Unk",
  ifelse(SOA_Basic_Data$two_digit_sic_code < 10, "Agri",
    ifelse(SOA_Basic_Data$two_digit_sic_code < 18, "Mining",
      ifelse(SOA_Basic_Data$two_digit_sic_code < 20, "NotUsed",
        ifelse(SOA_Basic_Data$two_digit_sic_code < 40, "Manu",
          ifelse(SOA_Basic_Data$two_digit_sic_code < 50, "Trans",
            ifelse(SOA_Basic_Data$two_digit_sic_code < 60, "Trade",
              ifelse(SOA_Basic_Data$two_digit_sic_code < 68, "Fin",
                ifelse(SOA_Basic_Data$two_digit_sic_code < 70, "NotUsed",
                  ifelse(SOA_Basic_Data$two_digit_sic_code < 80, "Srvcs",
                    ifelse(SOA_Basic_Data$two_digit_sic_code == 80, "Health",
                      ifelse(SOA_Basic_Data$two_digit_sic_code == 81, "Srvcs",
                        ifelse(SOA_Basic_Data$two_digit_sic_code == 82, "Edu",
                          ifelse(SOA_Basic_Data$two_digit_sic_code < 90, "Srvcs",
                            "PubAd")))))))))))))))
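As an aside, the same mapping can be expressed more compactly with findInterval, using a breaks vector that mirrors the translation table. This is just a sketch on toy codes – it is not how we built the field above – but it avoids the deep nesting, and it handles the repeated “Srvcs” and “NotUsed” labels without complaint:

```r
# Breaks and labels follow the 2-digit SIC translation table above
breaks <- c(0, 1, 10, 18, 20, 40, 50, 60, 68, 70, 80, 81, 82, 83, 90)
labels <- c("Unk", "Agri", "Mining", "NotUsed", "Manu", "Trans", "Trade",
            "Fin", "NotUsed", "Srvcs", "Health", "Srvcs", "Edu", "Srvcs", "PubAd")

# Toy 2-digit SIC codes, one from each interval
sic2 <- c(0, 7, 15, 19, 23, 45, 52, 65, 69, 75, 80, 81, 82, 85, 94)

# findInterval returns the index of the interval each code falls into
IndGrp <- factor(labels[findInterval(sic2, breaks)])
```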

Next, we will view the distribution of exposures by industry as a table. A bar chart would have been a nice way to summarize the data visually; however, it would be difficult to squeeze in the data labels for all 12 categories.

# Display the proportionate exposures by IndGrp in a table

 

# First, get the exposure by industry

exp_by_ind <- aggregate(death_policies_exposed ~ IndGrp, data = SOA_Basic_Data, sum)

 

# Next, calculate the exposures as a proportion

exp_by_ind$prop <- round(exp_by_ind$death_policies_exposed /
  sum(exp_by_ind$death_policies_exposed) * 100, 1)

 

# Finally, view the distribution of exposures by industry

exp_by_ind

 

IndGrp                   death_policies_exposed                 prop

Agri                              2.738011e+05                                0.6

Edu                              2.690616e+06                                6.1

Fin                               3.609390e+06                                8.1

Health                        4.238318e+06                               9.5

Manu                          6.619285e+06                              14.9

Mining                       1.952551e+06                               4.4

NotUsed                    7.093636e+02                              0.0

PubAd                        2.229721e+06                              5.0

Srvcs                         1.154338e+07                               26.0

Trade                        6.130366e+06                              13.8

Trans                       1.874369e+06                                4.2

Unk                          3.300237e+06                               7.4

 

I didn’t have any strong a priori assumptions about the distribution of exposures by industry. On the whole, it seems quite plausible to me. My only concern is that 7.4% of exposures relate to lives where the industry is Unknown. That is higher than I would have liked. More on this later…
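As an aside to the earlier comment about squeezing in 12 category labels: a horizontal barplot with a widened left margin can usually accommodate them. The proportions below are copied from the table above (the 0.0% NotUsed row omitted); the styling choices are just one option:

```r
# Proportionate exposures by industry group, from the table above
prop <- c(Agri = 0.6, Edu = 6.1, Fin = 8.1, Health = 9.5, Manu = 14.9,
          Mining = 4.4, PubAd = 5.0, Srvcs = 26.0, Trade = 13.8,
          Trans = 4.2, Unk = 7.4)

op <- par(mar = c(4, 7, 1, 1))                   # widen the left margin for labels
barplot(prop, horiz = TRUE, las = 1, col = "red")
par(op)                                          # restore the previous settings
```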

Area Data Fields

The previous post included a list of the data fields in our dataset. Three of these data fields relate to Area:

• five_digit_zip_code (data class = “Factor”)

• four_digit_zip_code (data class = “Factor”)

• three_digit_zip_code (data class = “Factor”)

Two quick observations about the Area-related data fields: first, they contain only zip code data – no state or region indicator; second, all three data fields are of data class “Factor”, likely indicating that there are some non-numeric data entries.

For data validation, we will first want to learn more about the non-numeric entries in the zip code data fields. Then, we will want to consolidate the zip code data to a higher level to facilitate a look at the distribution of exposures. A carrier could use its own area groupings; in a generic example like this one, consolidating zip code to state would make good sense. However, to simplify the visuals, I will instead consolidate from zip code to region (as defined in Appendix II of the SOA’s Group Life Mortality Report).

We will begin working with the Area data fields in a similar manner to how we worked with the Industry data fields. We will convert the 3-digit zip codes to numeric and then identify those values that could not be converted.

# Create a new data field where three_digit_zip_code is converted to a “Numeric” data class

# This is achieved by first converting the data field from “Factor” to “Character” and then

# Converting the data from “Character” to “Numeric”

 

SOA_Basic_Data$zip3 <- as.numeric(as.character(SOA_Basic_Data$three_digit_zip_code))

Warning message:

NAs introduced by coercion

# Some of the data entries could not be converted to numeric

# This likely indicates that some of the entries include non-numeric characters

 

 

# Summarize the “zip3” data field from the SOA_Basic_Data data frame

 

summary(SOA_Basic_Data$zip3)

Min.          1st Qu.           Median           Mean        3rd Qu.           Max.       NA's

0.0               235.0              480.0             491.4           752.0               999        5588

 

# Find number of exposures for entries where 3-digit zip is non-numeric

sum(SOA_Basic_Data[is.na(SOA_Basic_Data$zip3),]$death_policies_exposed)

17607.67

 

# Find the non-numeric values in the 3-digit zip code data field

 

summary(SOA_Basic_Data[is.na(SOA_Basic_Data$zip3),]$three_digit_zip_code, maxsum = 40)

L4B     R3B      M1S     L4M     M9C     V6X     T2P     M2N     T4B     M2K     K2K     V6P    H3B

979     528      404        351      330      324      320      228      209       201      157     138     119

V6B    R3Y      M1H     R3C      K6H     L9T      V7V     L8L      M4S      T6E      T2H    S4P     H4P

113    108      101        99         98          96        94       79         59        55         54        48       41

V8V    J2G      M1K      R2J       H3Z     G0R     N1G    N2H    H4S      T2C      L7N     M5J    000 (Other)

40      34         30          30        25         24        19         16        12        12         9         4          0        0


From the above analysis, we learn the following about the non-numeric data in the 3-digit zip code field:

• These cases have exposures of about 17,600. This is quite small relative to our total exposures of 25 million.

• All of these cases appear to relate to exposures in Canada. Canadian postal codes are of the form “A1A 1A1” – i.e., six characters alternating between letters and digits. Thus, the first three characters of a Canadian postal code have the form “A1A”.
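Letter-bearing entries like these can be flagged directly with a regular expression rather than relying on coercion warnings. A toy sketch (the values below are invented stand-ins, not from the dataset):

```r
# Toy 3-digit "zip" values: some US-style digits, some Canadian-style prefixes
zip3 <- c("902", "L4B", "100", "M1S", "V6X")

has_letters <- grepl("[A-Za-z]", zip3)  # TRUE where any letter appears
zip3[has_letters]                       # "L4B" "M1S" "V6X"
```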

Our next step is to translate the zip code data to a higher-level area grouping, such as state or region. As previously mentioned, for this exercise, I will translate to the regions as defined in the SOA report.

My first step is to create a translation table from 5-digit zip codes to region. I will create this as a csv file using Excel. The table will contain nearly 100,000 entries so I will not publish it within this blog post. Feel free to contact me if you would like a copy of the file.

We will read the translation table into R. Then we will use the merge function to append the appropriate region category to each record in the SOA_Basic_Data data frame.

• merge: Merge two data frames by common columns or row names, or do other versions of database join operations.

# First, read in the zip code to region translation table

zip_region <- read.csv("C:\\SOA 2016 GL Exp\\zip_region.csv")

 

 

# Next, take a quick look at column names and data classes

sapply(zip_region,class)

five_digit_zip_code       State          Region

"factor"                       "factor"       "factor"

 

 

# Append the “region” to the SOA_Basic_Data data frame

## use “merge” function to merge two data frames:

## data frame “x” is SOA_Basic_Data, data frame “y” is zip_region

## merge using the “five_digit_zip_code” field from both data frames

## Note the last argument in the function: “all.x = TRUE” – this ensures that all rows from

## “SOA_Basic_Data” will remain in the new data frame – even if there is no match of 5-digit zip

## codes to the “zip_region” data frame

 

SOA_Basic_Data <- merge(x = SOA_Basic_Data, y = zip_region,
  by = "five_digit_zip_code", all.x = TRUE)
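Before aggregating, it can be worth checking how many rows failed to match in the merge: with all.x = TRUE, unmatched zip codes receive NA in the appended columns, and because aggregate’s formula interface drops NA rows by default, they would silently vanish from the region table. A toy sketch of the pattern (the frames below are invented):

```r
# Toy left table (exposures) and right table (zip-to-region translation)
left  <- data.frame(five_digit_zip_code = c("10001", "90210", "00000"),
                    death_policies_exposed = c(10, 20, 5))
right <- data.frame(five_digit_zip_code = c("10001", "90210"),
                    Region = c("MidAtl", "Pacific"))

# Left join: every row of "left" is kept, matched or not
m <- merge(x = left, y = right, by = "five_digit_zip_code", all.x = TRUE)

sum(is.na(m$Region))   # 1 -- the unmatched zip carries NA in the appended column
```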

 

Now let’s get the distribution of exposures by region:

# Calculate the distribution of exposures by region

 

# First, aggregate exposures by region

exp_by_reg <- aggregate(death_policies_exposed ~ Region, data = SOA_Basic_Data, sum)

 

# Calculate the exposures by region as a proportion

exp_by_reg$prop <- round(exp_by_reg$death_policies_exposed /
  sum(exp_by_reg$death_policies_exposed) * 100, 1)

 

# View the distribution of exposures by region

exp_by_reg

 

Region            death_policies_exposed               prop

ENC                   8224359.97                                    18.5

ESC                   2428963.83                                       5.5

MidAtl             5181649.64                                    11.7

Mountain      2191707.68                                      4.9

NewEng         2557672.01                                      5.8

Other             28155.48                                           0.1

Pacific            5388243.70                                    12.1

SthAtl            6857990.72                                    15.4

Unknown     2815669.90                                      6.3

WNC               3935781.53                                      8.9

WSC               4852552.20                                    10.9

 

# Or view the distribution of exposures in chart form

barplot(exp_by_reg$prop, names.arg = exp_by_reg$Region, col = "red")

 

 

The distribution of exposures is reasonably consistent with general population numbers. East North Central (“ENC” includes IL, IN, MI, OH, WI) stands out as being a few percentage points higher than expected, but within the realm of reasonability. An insurer working with its own data would have the ability to review and validate these distributions against internal reports. One challenge of working with aggregated industry data is that our ability to validate against other information sources is limited.

Similar to the distribution of exposures by Industry, there are a lot of exposures where the Region is unknown (6.3%). At this point, it makes sense to look in greater detail at the records missing Industry and Region information.

Industry and Area Data

Thus far we have uncovered that Industry is unknown for 7.4% of exposures and Area is unknown for 6.3% of exposures. The numbers are quite similar and so I wonder if there is a lot of overlap – are the exposures that are missing Industry information the same as those that are missing Area information?

Let’s create a data field that indicates which of Industry and Area data are known / unknown. Then we can aggregate by this data field to determine the proportion of exposures where both Industry and Area are unknown.

# Create a data field to indicate known/unknown status of Industry and Area

 

SOA_Basic_Data$IndArea <- as.factor(ifelse(SOA_Basic_Data$IndGrp == "Unk" &
    SOA_Basic_Data$Region == "Unknown", "Both Unk",
  ifelse(SOA_Basic_Data$IndGrp == "Unk", "Ind Unk",
    ifelse(SOA_Basic_Data$Region == "Unknown", "Area Unk", "Both Known"))))

 

 

# Aggregate exposures by this new data field

 

exp_by_indarea <- aggregate(death_policies_exposed ~ IndArea, data = SOA_Basic_Data, sum)

 

exp_by_indarea$prop <- round(exp_by_indarea$death_policies_exposed/

sum(exp_by_indarea$death_policies_exposed)*100,1)

 

# View the distribution of exposures by Industry/Area known/unknown indicator

exp_by_indarea

 

IndArea                death_policies_exposed                        prop

Area Unk                          7342.843                                           0.0

Both Known                    41155166.460                                92.6

Both Unk                         2808327.059                                    6.3

Ind Unk                           491910.301                                       1.1

For nearly all the exposures where the Area is unknown, the Industry is also unknown – this is 6.3% of total exposures for which we are missing both Industry and Area information. This is big enough that we should do further investigation into these exposures.

If this data were our own company’s internal data then we would likely begin our investigation at the data source and try to learn why these data fields are not entered for so many exposures. But, as we are studying industry data, we do not have this luxury. Instead, we will focus on the data – and look at these exposures in more detail.

Let’s begin by comparing the age and gender distributions of the exposures where Industry and Area are unknown to the corresponding distributions for the rest of the exposures. First, we will introduce a new R function.

•  tapply: Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

# Calculate the distribution of exposures by gender

## Separately for exposures where Industry/Area are unknown vs all other exposures

 

# First, aggregate exposures by Industry/Area indicator, gender and age

exp_by_mult <- aggregate(death_policies_exposed ~ IndArea + sex + central_age,

data = SOA_Basic_Data, sum)

 

# Demonstrate the “tapply” function

## Determine total exposures by gender for all cases

### Use “tapply” function on “death_policies_exposed” by “sex” and get the “sum”

tapply(exp_by_mult$death_policies_exposed, exp_by_mult$sex, sum)

 

Female                   Male

19311789               25150958

 

# Let’s repeat, but only including cases where Industry/Area are both unknown

## I.e., rows where: exp_by_mult[exp_by_mult$IndArea == “Both Unk”,]

 

tapply(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed,

exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$sex, sum)

 

Female               Male

900169.6            1908157.5

 

# Let’s repeat, but (i) dividing each value by the sum to get a proportion, (ii) multiplying by 100

## and rounding to one decimal point to simplify viewing and (iii) saving to a variable

 

both_unk_by_sex <- round(100 *

tapply(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed,

exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$sex, sum) /

sum(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed), 1)

 

# View Results

both_unk_by_sex

 

Female         Male

32.1               67.9

 

# Let’s repeat, but for cases where at least one of Industry or Area is known

 

other_by_sex <- round(100 *

tapply(exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$death_policies_exposed,

exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$sex, sum) /

sum(exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$death_policies_exposed), 1)

 

# View Results

other_by_sex

 

Female             Male

44.2                   55.8

 

# Now let’s combine the output into one table – to view “unknown” and “other” side-by-side

 

# First, create an empty data frame of 2x2 dimension

df_sex <- data.frame(matrix(ncol = 2, nrow = 2))

 

# Assign the values of “Unknowns” to the first column and “Others” to the second column

 

df_sex[,1] <- both_unk_by_sex

df_sex[,2] <- other_by_sex

 

# Finally, add row and column names

 

row.names(df_sex) <- c("Female", "Male")

colnames(df_sex) <- c("Both Unk", "Other")

 

# Let’s view the results

df_sex

                   Both Unk        Other

Female          32.1              44.2

Male              67.9              55.8

When we compare cases where both Industry and Area are unknown to the Other cases, there are some differences in the distribution of exposures by gender. However, these differences aren’t large enough to raise immediate concerns.

Next, we can repeat this exercise to get the distributions of exposures by age (rather than by gender). I won’t provide the R code as it is the same as above – just substituting “central_age” for “sex”. Here are the results:

 

Central Age      Both Unk      Other

17                    0.1              0.2

22                    1.8              4.7

27                    5.6            10.8

32                    8.2            11.9

37                    9.1            11.6

42                  10.5            12.5

47                  11.7            13.0

52                  12.1            13.0

57                  11.4            10.7

62                    9.4              7.1

67                    9.9              2.8

72                    3.3              1.0

77                    2.5              0.4

82                    2.1              0.2

87                    1.6              0.1

92                    0.6              0.0

97                    0.1              0.0

There is a significant difference in the distribution of exposure by age for these two subsets of the data. In particular, focus on central age of 67 or greater (i.e., at or beyond normal retirement age). For the cases where both Industry and Area are unknown, 20.1% of exposures are at these ages versus only 4.5% of the cases where at least one of Industry or Area are known.

Take a moment to reflect on these numbers and consider the context. Ask yourself what proportion of Group Life exposures would you expect to be aged 65 or greater. I think a good estimate is somewhere around 5% - and so I am comfortable with the data for the cases where at least one of Industry and Area are known (as the observed proportion is 4.5%). For the cases where both Industry and Area are unknown, the observed proportion is 20.1% - much higher than my expectations and so I have concerns about the quality of this data.
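The 20.1% and 4.5% figures can be reproduced directly from the table’s printed proportions (keyed in below; the age-52 “Other” entry is taken as 13.0, which is what makes that column sum to 100):

```r
# Central ages and exposure proportions, as printed in the table above
age      <- c(17, 22, 27, 32, 37, 42, 47, 52, 57, 62, 67, 72, 77, 82, 87, 92, 97)
both_unk <- c(0.1, 1.8, 5.6, 8.2, 9.1, 10.5, 11.7, 12.1, 11.4, 9.4,
              9.9, 3.3, 2.5, 2.1, 1.6, 0.6, 0.1)
other    <- c(0.2, 4.7, 10.8, 11.9, 11.6, 12.5, 13.0, 13.0, 10.7, 7.1,
              2.8, 1.0, 0.4, 0.2, 0.1, 0.0, 0.0)

sum(both_unk[age >= 67])   # 20.1 -- share at or beyond normal retirement age
sum(other[age >= 67])      # 4.5
```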

Claims Experience for Exposures Missing Industry and Area Data

Thus far, we have been focused on risk characteristics of the exposures and we have not given any consideration to the outcome (mortality rates) we will be studying. This is a deliberate decision which will be explained in more detail in a future post. But, at a high-level, the explanation is that we will want to put aside some of the data – before we begin looking at the outcomes – so that we will have out-of-sample data to use to validate our final model.

But in our current situation – where we have serious concerns about data quality – it really is necessary to peek at the outcomes (mortality rates). There are two fields we will look at:

• death_count: the number of claims

• expected_death_by_policy: the expected number of claims, as determined by the study’s authors, based on the observed mortality rate by age and gender

Specifically, we are interested to see the (i) proportion of total deaths that are related to exposures where both the Industry and Area data are missing and (ii) the Actual/Expected ratios by known status of Industry and Area data.

# Determine # death claims for cases where Industry and Area are unknown

 

aggregate(death_count ~ IndArea, data = SOA_Basic_Data, sum)

IndArea                death_count

Area Unk                   15

Both Known           67226

Both Unk               27894

Ind Unk                  1754

 

## 27,894 death claims relate to exposures where Industry and Area are both unknown

## What is this as a proportion of total death claims?

 

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$death_count) /

sum(SOA_Basic_Data$death_count)

0.2878965

 

# What are the Actual and Expected death counts by Area/Industry indicator?

 

aggregate(cbind(death_count, expected_death_by_policy) ~ IndArea, data = SOA_Basic_Data, sum)

IndArea         death_count               expected_death_by_policy

Area Unk             15                                     10.76177

Both Known     67226                              74010.58226

Both Unk         27894                               21021.78983

Ind Unk            1754                                 1845.86614

 

 

# What are the Actual/Expected ratios by Area/Industry indicator?

## First, where Industry and Area are both unknown

 

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$death_count) /

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$expected_death_by_policy)

1.326909

 

## Next, all other cases

 

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea != "Both Unk",]$death_count) /

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea != "Both Unk",]$expected_death_by_policy)

0.9094179

The cases where neither Industry nor Area are known account for 6.3% of lives exposed but 28.8% of deaths. This discrepancy is partly explained by the very high proportion of these exposures relating to individuals aged 65 or greater (20.1%) and partly by poor mortality experience (Actual/Expected ratio of 132.7%).

Again, one challenge of working with data collected from multiple carriers is that our ability to further investigate the data is limited. In this specific instance, I suspect that our data may have been tainted with some Retiree life insurance data – as this would explain both the high proportion of exposures at or beyond age 65 and the high Actual/Expected ratios observed for cases where both Industry and Area data are missing.

I was able to present this data to the SOA and they were able to confirm that nearly all of the death counts for exposures missing both Industry and Area (27,820 of 27,894) were submitted by one carrier. My decision will thus be to exclude these exposures from the analysis.
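The exclusion itself is a one-line subset on the IndArea indicator we built earlier. A sketch on a toy frame with the same coding:

```r
# Toy frame with the same IndArea coding used in the post
df <- data.frame(IndArea = c("Both Known", "Both Unk", "Ind Unk", "Both Unk"),
                 death_policies_exposed = c(100, 40, 10, 20))

# Keep every row except those where both Industry and Area are unknown
df_clean <- df[df$IndArea != "Both Unk", ]

nrow(df_clean)   # 2
```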

Final Thoughts on Data Validation

Data validation is critical to any analytics exercise, and so we spend a lot of time and effort reviewing the data and ensuring its integrity before modeling begins. But it is important to remember that data validation is not a one-time exercise. We need to maintain our vigilance in questioning data integrity throughout the modeling process. As we begin to analyze the data, if we see results that are inconsistent with our expectations, we need to make an extra effort to validate the accuracy of those results – and that often begins with another review of the integrity of the underlying data.