R - Using stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past ("R- How to test multiple 100s of similar variables against a condition"), but now have some more nuances I need to work through.
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to create a new variable that flags whether any of the "dxs" equals a certain number, without using hundreds of if statements. All the different variables have the same format (dx#). So I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be defined as:
410.1,
410.71,
410.62,
410.42,
...this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string) to identify the common code component (410, for the example above), but I am not sure how to use it in the context of rowSums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!

You can use the grepl() function, which returns TRUE if a pattern is found in a string. To check all columns simultaneously, just collapse them into one character string per row:
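df$dx.410 <- NA
for (i in seq_len(nrow(df))) {
  # Collapse all diagnosis columns of row i into one space-separated string
  row_codes <- paste(df[i, 2:200], collapse = " ")
  # "(^| )410" only matches codes that start with 410 (not e.g. 1410)
  if (grepl("(^| )410", row_codes)) {
    df$dx.410[i] <- "Present"
  }
}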
This loops through all rows, builds one long character string containing all diagnoses for that case, and writes "Present" in column dx.410 if any of the columns contains a 410 diagnosis.
(The solution expects the data structure you show here, with the dx variables in columns 2 to 200. If the columns are laid out differently, just adjust these numbers.)
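Since the question asks how to combine prefix matching with rowSums, here is a minimal vectorized sketch in base R. The toy data frame and the new column name heart_attack are assumptions for illustration; only the dx# naming scheme and the 410 prefix come from the question.

```r
# Toy data standing in for the real table (dx# naming follows the question)
df <- data.frame(
  ID  = 1:4,
  dx1 = c("410.1", "34", "12", "1410"),
  dx2 = c("432", "410.71", NA, "56"),
  stringsAsFactors = FALSE
)

# Positions of the diagnosis columns dx1, dx2, ...
dx_cols <- grep("^dx[0-9]+$", names(df))

# One logical column per dx column: does the code start with "410"?
# startsWith() returns NA for missing codes.
hits <- sapply(df[dx_cols], function(x) startsWith(as.character(x), "410"))

# na.rm = TRUE treats missing codes as "no match"
df$heart_attack <- rowSums(hits, na.rm = TRUE) > 0
```

Because startsWith() anchors the match at the beginning of each code, a value such as "1410" is not flagged. With stringr loaded, str_starts(x, fixed("410")) could be swapped in for startsWith().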

Related

Specify multiple conditions in long form data in R

How do I index the rows I need based on several conditions?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
1 & 2. m01time & m12time: measure the number of years elapsed before becoming a level 1 manager (manage1), and then from manage1 to manage2, regardless of whether or not it's at the same job (numeric, in years).
3. change: capture whether or not they experienced a job change between manage1 and manage2, i.e. whether 'left' happens somewhere in between manage1 and manage2 (0 or 1).
4 & 5. m1p & m2p: capture the position held before becoming manage1 and manage2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
id m01time m12time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some conditions, but some tasks seem to need a for loop (such as capturing a "left" somewhere in between), and I am not sure how to handle those.
I'd calculate the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
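library(data.table)
mydt <- data.table(mydf)
# which() gives positions within each id group (.I would give row numbers of the whole table)
mydt[, .(m1p = stat[which(stat == "manage1") - 1],
         m2p = stat[which(stat == "manage2") - 1]), by = id]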
The other variables are more conveniently calculated in a wide data format:
dt <- dcast(unique(mydt,by=c("id","stat")),
formula=id~stat,value.var="age")
dt[,.(m01time = manage1-intern,
m12time = manage2-manage1,
change = manage1<left & left<manage2)]
Two caveats:
reshaping might be quite costly for larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat

Keep only the Cells that contain certain value

I have a dataset "Cases" with 16000+ rows that contains 25 columns with various diagnosis codes, 1 code per column. I also have a key which contains 56 unique diagnosis codes. I need to parse out the codes that match those in the key. How can I condense this down to a more concise dataset? In general, only one code should remain. Unfortunately I don't have any code to share because I don't know where to start. I am open to using Excel or R to do this. Thank you in advance and apologies in advance for how vague this question is, I honestly just don't know where to start.
Principal.DX DX2 DX3 DX4 DX5 DX6 DX7
1 D57.01 J18.9 Z86.711 Z79.01 Z87.891 E55.9
2 E66.2 J18.9 J96.21 J96.22 Z68.45 J20.9 I10
3 T82.7XXA A41.01 J18.9 N18.6 L03.114 L02.414 I12.0
4 J18.9 J96.21 R57.1 I42.0 I50.32 K52.1 N17.9
5 J10.08 J12.9 H66.91 L30.9
6 J15.212 E84.0 E44.0 J44.1 J15.6 K86.81
So in this example, let's say I only need to keep the values "J96.21", "J12.9", and "E55.9".
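A minimal base R sketch of one way to approach this, assuming the codes are stored as character columns and that the first matching code per row is the one to keep. The small cases data frame and the column name matched are made up for illustration; the key values are the three from the example.

```r
# Toy stand-in for the "Cases" data (a few codes taken from the example)
cases <- data.frame(
  Principal.DX = c("D57.01", "E66.2", "J10.08"),
  DX2 = c("J18.9", "J18.9", "J12.9"),
  DX3 = c("E55.9", "J96.21", "H66.91"),
  stringsAsFactors = FALSE
)
key <- c("J96.21", "J12.9", "E55.9")

# For each row, keep the first code that appears in the key (NA if none match)
cases$matched <- apply(cases, 1, function(codes) {
  hit <- codes[codes %in% key]
  if (length(hit) > 0) hit[[1]] else NA_character_
})
```

If a row can contain more than one key code and all of them matter, returning paste(hit, collapse = ";") instead of hit[[1]] would keep every match.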

Merge columns with the same name R

I'm fairly new to R. I'm working with a data set that is incredibly redundant with a lot of columns (~400). There are several duplicate column names, however the data is not duplicate, so I need to sum the columns when collapsing them.
The columns all have a similar name that allows easy identification, so I'm hoping I can use that to my advantage.
I attempted to perform the following:
ColNames <- unique(colnames(df))
CombinedDf <- data.frame(sapply(ColNames, function(i)rowSums(Test[,ColNames==i, drop=FALSE])))
This works if I sum over the range of columns that only contain integers, but the issue is that other columns have strings and such in them, so rowSums throws a fit.
Assuming that the identifier is "XXX", how can I aggregate all the columns that are of the same name leaving the other columns as is?
Thank you for your time.
Edit: Sample data has been asked for, I cannot give the exact data as it is sensitive, but I will give an example:
Name COL1XXX COL2XXX COL1XXX COL3XXX COL2XXX Type
Henry 5 15 25 31 1 Orange
Tom 8 16 12 4 3 Green
Should return
Name COL1XXX COL2XXX COL3XXX Type
Henry 30 16 31 Orange
Tom 20 19 4 Green
I'm not really sure, but you may try transposing the data and then aggregating by unique names.
t_df=as.data.frame(t(df))
new_df=aggregate(t_df, by=list(rownames(t_df)),sum)
Again, without sample data I'm unsure if it'll work, but based on what you said, that might work.
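As an alternative that avoids transposing mixed-type data, here is a sketch in base R that sums only the columns carrying the identifier and leaves the others as they are. It assumes the duplicate column names are preserved (check.names = FALSE) and that the identifier sits at the end of the name; the toy data mirrors the example from the question.

```r
# Toy data mirroring the example; check.names = FALSE keeps the duplicate names
df <- data.frame(
  Name = c("Henry", "Tom"),
  COL1XXX = c(5, 8), COL2XXX = c(15, 16), COL1XXX = c(25, 12),
  COL3XXX = c(31, 4), COL2XXX = c(1, 3),
  Type = c("Orange", "Green"),
  check.names = FALSE, stringsAsFactors = FALSE
)

is_xxx <- grepl("XXX$", names(df))        # columns to aggregate
xxx_names <- unique(names(df)[is_xxx])    # one entry per distinct name

# Sum all columns sharing a name; drop = FALSE keeps single columns as data frames
summed <- sapply(xxx_names, function(nm) {
  rowSums(df[, names(df) == nm, drop = FALSE])
})

# Non-aggregated columns first, then the summed ones
combined <- data.frame(df[, !is_xxx, drop = FALSE], summed, check.names = FALSE)
```

Note that the column order differs from the desired output (Type comes before the summed columns); reordering with combined[c("Name", xxx_names, "Type")] would restore it for this example.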

R readr package - written and read in file doesn't match source

I apologize in advance for the limited reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck them all together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and pare it down each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes to run, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
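If the saved file only ever needs to be read back into R, one option is to skip CSV altogether and store the object with base R's saveRDS() and readRDS(). This is a sketch under that assumption, not a diagnosis of the parsing failures above; the small data frame and the temporary path are stand-ins for sa_all and the real location.

```r
# Small stand-in for sa_all; the column names come from the question
sa_all <- data.frame(
  src = c("gen", "gen", "res"),
  pmt_nature = c("Food, \"beverage\"", "Gift", "Travel"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".rds")   # hypothetical location; use your own path
saveRDS(sa_all, path)                # binary serialization, no delimiters or quoting
sa_all2 <- readRDS(path)

# Compare the round-tripped object with the original
all.equal(sa_all, sa_all2)
table(sa_all2$src)
```

Because nothing is parsed as text on the way back in, commas, quotes, or unusual characters inside the fields cannot shift values between columns.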

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique elements of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of the mean score of his/her students over time, save the graphs in a folder and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A@school.org
A 2 43 65 A@school.org
B 1 66 54 B@school.org
A 3 97 99 A@school.org
C 1 23 45 C@school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School=data.frame(Teacher=c("A","B"), ScoreNovember=10:11, ScoreDec=13:14)
for (teacher in unique(School$Teacher)) {
teacher_df=subset(School, Teacher==teacher)
MeanScoreNovember=mean(teacher_df$ScoreNovember)
MeanScoreDec =mean(teacher_df$ScoreDec)
# do your plot
# send your email
}
I think you have 3 questions here, which should be asked separately. How do I:
1. Create graphs
2. Automatically email output
3. Compute a subset mean based on group
For the 3rd one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want the mean per student per teacher, etc., just add the extra grouping column inside .(), like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets. Otherwise, compute both means in one call:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
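For the base R route mentioned above, a minimal sketch with aggregate() might look like this. The School data frame is rebuilt from the sample rows in the question (without the email column), and teacher_means is just an illustrative name.

```r
# Sample rows from the question
School <- data.frame(
  Teacher = c("A", "A", "B", "A", "C"),
  Student = c(1, 2, 1, 3, 1),
  ScoreNovember = c(35, 43, 66, 97, 23),
  ScoreDec = c(45, 65, 54, 99, 45)
)

# Mean of both score columns per teacher; cbind() lists the columns to summarise
teacher_means <- aggregate(cbind(ScoreNovember, ScoreDec) ~ Teacher,
                           data = School, FUN = mean)
teacher_means
```

The result has one row per teacher, so it can be looped over (or merged back onto the email addresses) before plotting.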
