Healthcare is integral to the quality of life in a modern society, yet the United States is the sickest of the rich countries: the costs in the US health-care system are higher than in any other capitalist countries, but this does not lead to better outcomes - 39 countries have longer life expectancies. (Rosling, Hans. Factfulness)
This makes it a prime candidate for analytical intervention. In the last couple of years, Centers for Medicare and Medicaid Services (CMS) has been releasing some previously unaccessible data, including payments drug companies make to physicians (OpenPayments dataset), as well as doctors’ medication choices in Medicare’s prescription drug program (Medicare Part D dataset). ProPublica has previously analyzed the OpenPayments data in conjunction with the prescription data, showing that doctors who received money from drug and device makers — even just a meal – are two to three times as likely to prescribe brand-name drugs than doctors who didn’t. So I took the opportunity to get a closer look at what else drives the prescription choices (and what might be inflating drug prices and the opioid crisis).
The yearly data files consist of 2 datasets: the raw dataset assembled per a distinct combination of a doctor’s and drug’s name (about 25 million rows in 2017) and a summary table containing each doctors’ prescription costs and claims record for groups such as brand/generic drugs, insurance plans, low/non-low incomes, opioid/antibiotic/antipsychotic disorders, in addition to doctors’ and their patients’ demographic. Although it’s good practice to work with as unprocessed data as possible, I was more interested in the demographic and utilization than pure transactions, so I chose to work with the aggregated dataset.
After wrangling the data from 2015 to 2017 into my own dataset using a DB Browser for SQLite, here are the spending and volume highlights I found:
To be able to look at the data through the lenses of prescription patterns, as well as predict future patterns, I’ve created two variables:
To better understand the costs and counts every doctor prescribed, it helps to use a cost per claim proportion. This is how brand and generic costs per claim are distributed:
Across the country, brand cost per claim are on average 17 higher than generic cost per claim.
Looking at the disorder and demographic groups, most values cluster around 0 (distributions are extremely skewed, spreading thinly over the right tail). Even after engineering proportional representations of the disorder and demographic variables, such as cost per claim, cost per beneficiary and claim per beneficiary, and looking at the boxplot grahps of those, you mostly see a long tail of outliers. Having collected close to 200 outliers from each group, we can tell what they practice, where they live and, why they are an outlier:
An outlier doctor’s average prescription costs $320, it can go up to $6500. The average claim among non-outliers is $30, and $3500 is as high as it will go. That’s for generic prescription; for brand prescription, you can expect to pay at least $77 and up to $52k at an outlier doctors office. The minimum claim among non-outliers is $1, and the maximum is $30k. Non-outliers’ average patients’ age is as old as 94 years old. Outliers’ patient don’t treat patients above 88.. Outliers also tend to have patients with higher risk scores.
The top 10 features most correlated to percentage brand prescribed among the outlier group are:
nonlis_cost_per_claim 0.576560 pdp_cost_per_claim 0.484523 la_opioid_prescriber_rate -0.454956 opioid_claim_per_bene -0.431869 opioid_prescriber_rate -0.364436 la_opioid_claim_per_bene -0.364383 average_age_of_beneficiaries -0.334875 lis_cost_per_claim 0.283848 antibiotic_cost_per_bene 0.024989 beneficiary_average_risk_score 0.022957
I was able to identify 6 doctors who made it to the outlier group for more than 2 reasons.
In terms of modeling, I fit the data to Linear Regression, Decision Tree, Random Forest and Ensemble’s Voting Regressor models. The latter slightly underperformed compared to Random Forest, which tells me the component models are not perfectly independent. When re-training on the top 20 features, performance did not change, which tells me the model is not biased. When I used only the engineered features, I got the best performance yet. Looking at the residual plots, however, all other models’ and feature subsets had random distributions of fitted VS residual values, but engineered features had some quite clear patterns.
This is how a single tree sees engineered features importances (good resolution):
Let’s see how this tree makes predictions to estimate what kind of practice corresponds with doctors’ preference for brand-name drugs. Starting at the root node (depth 0, at the top): the single best question we can ask is whether his costs per claim for beneficiaries without a low-income subsidy are less than 131 dollars. If it is, then you move down to the root’s left child node (depth 1, left) asking what are his costs per claim for generic drugs, as well as, what are his cost per claim for beneficiaries covered by a stand-alone prescription drug plan (as opposed to plans that cover drugs and health services) and generally decide that this doctor will be in the lower quartile of brand prescription percentage.
If the answer to the first, root question is that the doctor’s costs per claim for beneficiaries without a low-income subsidy are more than 131 dollars, then you move down to the root’s right child node and ask what are what are his costs per claim for generic drugs, as well as, his costs per claim for beneficiaries with a low-income subsidy and generally see that this would be a higher volume of brand prescription kind of practice. A doctor might even end up among the outlier brand prescribers, prescribing brand in 84% of all cases, which happened in almost 1 percent of sample population, if his:
I could certainly get some diversity in questions the tree is asking, i.e. expand the features selection, but it sounds like a big decider here is who is picking up the bill.
These are the features that Random Forest deemed as most important: ‘pdp_drug_cost’, ‘lis_drug_cost’,’lis_claim_count’, ‘nonlis_claim_count’, ‘antipsych_drug_cost_ge65’, ‘nppes_provider_state_FL’, ‘beneficiary_age_65_74_count’, ‘antibiotic_cost_per_claim’, ‘antipsych_claim_per_bene_ge65’, ‘antibiotic_claim_per_bene’, ‘beneficiary_average_risk_score’, ‘total_day_supply’, ‘generic_cost_per_claim’, ‘antipsych_cost_per_claim_ge65’, ‘antipsych_cost_per_bene_ge65’, ‘total_day_supply_ge65’, ‘opioid_day_supply’, ‘nonlis_drug_cost’, ‘nppes_provider_city_NEWTON’, ‘la_opioid_prescriber_rate’.
We cannot say, however, whether the relationship between these predictors and the dependent variable is positive or negative, like we can in linear models. An ensemble of decision trees can have arbitrary complex rules for each feature, as we saw above. Lastly, feature importance is an artifact of a model, not an underlying dataset, i.e. it can work in a different way under different conditions.
In the next steps, I would like to join the prescription dataset with other data, such as (FDA’s Drug Safety-related Labeling Changes) or indeed, the prescription data for population at large, not just for those enrolled at Medicare PartD, which would allow further investigation into the income inequality hypothesis.
This research did not find any considerable insights in terms of what drives drug prices or the opioid crises, but my hunch is, as with other public health accomplishments in human history, such as antibiotics and public sanitation (sewer systems and chlorinated water), the new improvements will come from social infrastructure reforms with an across-the-board effect on diseases and disorders.