Subset data frame based on character value - R

I'm trying to subset a data frame that I imported with read.csv using the colClasses='character' option.
A small sample of the data can be found here
Full99 <- read.csv("File.csv", header=TRUE, colClasses='character')
After removing duplicates, missing values, and all unnecessary columns, I get a data frame with these dimensions:
> dim(NoMissNoDup99)
[1] 81551 6
I'm interested in reducing the data to only include observations of a specific Service.Type
I've tried with the subset function:
MU99 <- subset(NoMissNoDup99, Service.Type=='Apartment' |
                 Service.Type=='Duplex' |
                 Service.Type=='Triplex' |
                 Service.Type=='Fourplex',
               select=Service.Type:X.13)
dim(MU99)
[1] 0 6
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type!='Hospital'
& NoMissNoDup99$Service.Type!= 'Hotel or Motel'
& NoMissNoDup99$Service.Type!= 'Industry'
& NoMissNoDup99$Service.Type!= 'Micellaneous'
& NoMissNoDup99$Service.Type!= 'Parks & Municipals'
& NoMissNoDup99$Service.Type!= 'Restaurant'
& NoMissNoDup99$Service.Type!= 'School or Church or Charity'
& NoMissNoDup99$Service.Type!='Single Residence'),]
but that doesn't remove observations.
I've tried that same method but slightly tweaked...
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type=='Apartment'
|NoMissNoDup99$Service.Type=='Duplex'
|NoMissNoDup99$Service.Type=='Triplex'
|NoMissNoDup99$Service.Type=='Fourplex'), ]
but that removes every observation...
The final subset should have somewhere around 8000 observations
I'm pretty new to R and Stack Overflow, so I apologize if there's some convention of posting I've neglected to follow, but if anyone has a magic bullet to get this data to cooperate, I'd love your insights :)

The different methods would work if you were matching the right values. Your issue is likely extra spaces in your values. You can avoid this kind of issue by using grep, for example:
NoMissNoDup99[grep("Apartment|Duplex|Business",NoMissNoDup99$Service.Type),]
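If stray whitespace really is the culprit (an assumption here, since it is not visible in the posted sample), trimming the values first lets the exact-match approaches from your question work again. A minimal sketch:
# remove leading/trailing whitespace from the values, then match exactly
NoMissNoDup99$Service.Type <- trimws(NoMissNoDup99$Service.Type)
MU99 <- subset(NoMissNoDup99,
               Service.Type %in% c('Apartment', 'Duplex', 'Triplex', 'Fourplex'))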

## exclude
MU99 <- subset(NoMissNoDup99, !(Service.Type %in% c('Hospital', 'Hotel or Motel')))
## include
MU99 <- subset(NoMissNoDup99, Service.Type %in% c('Apartment', 'Duplex'))

Related

How to use regexpr to identify patterns in ICD-10 data

I am working with ICD-10 data, and I wish to create a new variable called complication based on the pattern "E1X.9X", using a regular expression, but I keep getting an error. Please help.
dm_2$icd9_9code<- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications<- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$",dm_2$icd9_code)]<- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <-
# "None" : only 0's may be mixed with negative subscripts
I want:
icd9_9code complications
E10.49     present
E11.51     present
E13.52     present
E13.9      none
E10.9      none
E11.21     present
This problem has already been solved. The 'icd' R package, which my co-authors and I have been maintaining for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes with complications you seek, from AHRQ, the original Elixhauser, Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO, and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
                               "encounter-4", "encounter-5", "encounter-6"),
                  icd10 = c("I70401", "E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
If you give an example of the source data and your goal, I can help more.
It seems like there are a few errors in your code; I'll note them in the code below.
You'll want to start by wrapping your ICD codes in quotes: "E13.9"
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0"))
Next, let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column; your code above attempts to use dm_2$icd9_code rather than dm_2$icd9_9code:
dm_2$DM.complications <- "present"
# note: the unescaped "." matches any character; "^E\\d{2}\\.9$" would be stricter if only a literal dot should match
dm_2$DM.complications[grepl("^E\\d{2}.9$", dm_2$icd9_9code)] <- "None"
Finally,
dm_2
#> icd9_9code DM.complications
#> 1 E10.49 present
#> 2 E11.51 present
#> 3 E13.52 present
#> 4 E13.9 None
#> 5 E10.9 None
#> 6 E11.21 present
#> 7 E16.0 present
A quick side note -- there is a wonderful ICD package you may find handy as well: https://cran.r-project.org/web/packages/icd/index.html

Code to filter data where observations may or may not be present in the dataframe

I am trying to make a template of code to filter data. The problem I am having is that there are various levels of categorical data, and if I use the dplyr function filter, R returns no data when the filtering level is not present in the data.
For example:
library(dplyr)
lease <-c(1,2,1)
year<-c(2010,2011,2010)
beg <-c(1,2,1)
gas<-c(1,2,2)
pelelts<-c(1,2,2)
df<-data.frame(lease, year, beg, gas, pelelts)
df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  filter(pelelts == 3)
this returns
<0 rows> (or 0-length row.names), which I believe is because pelelts==3 doesn't exist (I get data if I remove that line of code). The problem is that I don't want to check every data set for which levels are present, as it will vary on a subset-by-subset basis. Any help will be much appreciated.
By saying pelelts == 3 you're telling R that you only want 3. You need to modify your code to catch the other conditions that are acceptable. If 3 doesn't exist then something else has to happen, or no results will be returned.
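One way to make the template tolerant of levels that may be absent (a sketch, assuming "keep everything" is the right fallback when the level does not occur) is to let the last filter fall back to TRUE:
# using df and library(dplyr) from the question above
df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  # keep rows matching the level if it exists; otherwise keep all remaining rows
  filter(if (any(pelelts == 3)) pelelts == 3 else TRUE)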
This code got the required result, thanks to #H5470:
df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  mutate(pelelts = case_when(pelelts %in% '3' ~ '3',
                             pelelts %in% c('0', '1', '2') ~ '',
                             TRUE ~ as.character(pelelts)))
returning:
  lease year beg gas pelelts
1     1 2010   1   2

Extracting matching rows from data frame

I have a data frame with 30+ columns. I want to extract the rows where three specific columns match some reference values. For example, col A has state names, col B has site types, and col C has the number of annual visitors. I want to find out the number of visitors (col C) going to the capital (col B) of New Jersey (col A).
How about
subset(my_df, A=="New Jersey" & B=="capital")$C
or
with(my_df, my_df[A=="New Jersey" & B=="capital", "C"])
You should probably check out some introductory R material: e.g. http://www.ats.ucla.edu/stat/r/faq/subset_R.htm ; http://digitheadslabnotebook.blogspot.ca/2009/07/select-operations-on-r-data-frames.html (results of googling "selecting rows from a data frame")
This is pretty easy with a subset command.
subset(data, A=="New Jersey" & B=="capital", select=C)
Or with standard indexing
data$C[ data$A=="New Jersey" & data$B=="capital" ]
I strongly recommend reading a basic introduction to R because this is pretty elementary stuff.

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt", header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do the following:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As #Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
The first option is generally preferred (from what I've seen).
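Applied to your data, storing the two columns as plain numeric vectors looks like this (a small sketch using the column names from the read.csv above):
area   <- tbl$area      # numeric vector of length 7
energy <- tbl$energy    # numeric vector of length 7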
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.

Pasting (or merging) two elements of a column together

I have two sources of clinical procedure billing information that I have added together (with rbind). In each row there is a CPT field and a CPT.description field that supplies a brief explanation. However, the descriptions are slightly different between the two sources. I want to be able to combine them. That way, if different words or abbreviations are used, I can just do a string search to find what I am looking for.
So lets make up a simplified representation of a data table that I was able to generate.
cpt <- c(23456,23456,10000,44555,44555)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy")
cpt.desc <- data.frame(cpt,description)
And here is what I want to get to.
cpt.wanted <- c(23456,10000,44555)
description.wanted <- c("tonsillectomy; tonsillectomy in >12 year old","brain transplant","castration; orchidectomy")
cpt.desc.wanted <- data.frame(cpt.wanted,description.wanted)
I have tried using functions such as unstack and then lapply(list, paste), but that does not paste the elements of each list. I also tried reshape, but there was no categorical variable to differentiate the first or second (or, in some cases, a third) version of the description. The really annoying part is that I had a similar problem a few months or years ago and someone helped me, either on Stack Overflow or on r-help, and for the life of me I cannot find it.
So the underlying problem is: imagine that I have a spreadsheet in front of me. I need to do a vertical merge (paste) of two or maybe even three description cells that have the same CPT code in the adjacent column.
What buzzwords should I have been using to search for a solution to this problem?
Thank you so much for your help.
sapply( sapply(unique(cpt), function(x) grep(x, cpt)),
        # creates sets of index vectors as a list
        function(x) paste(description[x], collapse=";") )
        # ... and this pastes each set of selected items from the "description" vector
[1] "tonsillectomy;tonsillectomy in >12 year old"
[2] "brain transplant"
[3] "castration;orchidectomy"
Here is an approach that uses plyr.
library("plyr")
cpt.desc.wanted <- ddply(cpt.desc, .(cpt), summarise,
                         description.wanted = paste(unique(description), collapse="; "))
which gives
> cpt.desc.wanted
cpt description.wanted
1 10000 brain transplant
2 23456 tonsillectomy; tonsillectomy in >12 year old
3 44555 castration; orchidectomy
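If you prefer to stay in base R, aggregate() expresses the same group-and-paste idea (a sketch, equivalent in spirit to the ddply call above):
# group descriptions by cpt and paste the unique values together
cpt.desc.wanted <- aggregate(description ~ cpt, data = cpt.desc,
                             FUN = function(x) paste(unique(x), collapse = "; "))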
