I am trying to create a custom function that gives labels to a modified list of data frames. For example, I have a data frame like the one below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: You want to take a list of data frames (ldat). For each of the dataframes in the list (df, df2) you want to take some existing columns (Q1, Q2, Q3) and replicate them with new names in the same data frame (Q1_new, Q2_new, Q3_new). This you could achieve like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
  gender country   Q1        Q2      Q3               Q4       Q5 cc2 Q1_new    Q2_new  Q3_new
1   Male     USA  Yes Available Partner Depends on sales    Local   1    Yes Available Partner
2 female     USA  Yes Available Partner             <NA> Overseas  NA    Yes Available Partner
3   Male     CAN <NA> Available    <NA>             <NA>    Local   1   <NA> Available    <NA>
4 female      EU <NA> Available    <NA>             <NA>    Local   1   <NA> Available    <NA>
5   Male      UK <NA> Available    <NA>             <NA>    Local   1   <NA> Available    <NA>
6 female      BR <NA>      <NA>    <NA>             <NA> Overseas  NA   <NA>      <NA>    <NA>
Is this what you had in mind?
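Since `ldat` itself is not shown in the question, here is a minimal self-contained sketch of the same idea. The two small data frames (and the list element name `SUB`) are made up for illustration; only the column names Q1 to Q3 and the `_new` suffix come from the post.

```r
# Toy stand-ins for the data frames in `ldat` (hypothetical values)
ldat <- list(
  ALL = data.frame(Q1 = c("Yes", NA), Q2 = c("Available", "Available"),
                   Q3 = c("Partner", NA), stringsAsFactors = FALSE),
  SUB = data.frame(Q1 = c(NA, "Yes"), Q2 = c(NA, "Available"),
                   Q3 = c("Partner", "Partner"), stringsAsFactors = FALSE)
)

variables <- c("Q1", "Q2", "Q3")
new_label <- paste0(variables, "_new")

# Copy the selected columns under their new names in every data frame
newdfs <- lapply(ldat, function(x) {
  x[new_label] <- x[variables]
  x
})

names(newdfs$ALL)
```

Building `new_label` with `paste0()` keeps the two name vectors in step if the list of variables changes.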
I have two datasets:
Contacts2: This contains a list of ~100,000 contacts, their respective titles and a set of columns which describes the types of work contacts could be involved in. Here's an example dataset:
First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA
Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))
First Last Title Finance Executive Communications
1 George Washington CEO <NA> <NA> <NA>
2 Thomas Jefferson Accountant <NA> <NA> <NA>
3 James Madison Communications Specialist <NA> <NA> <NA>
4 Jimmy Carter President <NA> <NA> <NA>
5 Howard Taft Accountant <NA> <NA> <NA>
6 Herbert Hoover CFO <NA> <NA> <NA>
Note the last three columns are numeric.
TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns, which describe the types of work the contacts could be involved in. For each title I've put a 1 in the column(s) of the roles that describe that person's job.
Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))
Title Finance Executive Communications
1 CEO <NA> 1 <NA>
2 Accountant 1 <NA> <NA>
3 Communications Specialist <NA> <NA> 1
4 President 1 <NA> <NA>
5 CFO 1 1 <NA>
Note the last three columns are numeric.
I'm now trying to carry the check boxes in TableOfTitle over to Contacts2 based on the contact's title. For example, since TableOfTitle shows that anyone with the title of CFO should have a 1 in the Finance and Executive fields, the record for Herbert Hoover in Contacts2 should have 1s in those columns as well.
Here's a solution that uses dplyr. It is essentially what some commenters have already recommended, except that this fulfills the request of not copying over any pre-existing data in the last 3 columns of Contacts2.
Note that ifelse() can be very slow with large datasets, but for your stated task this shouldn't really be noticeable. Algorithmically, this solution is also a bit clumsy in other ways, but I went for maximum readability here.
library(dplyr)
Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
transmute(First = First,
Last = Last,
Title = Title,
Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))
Example output:
First Last Title Finance Executive Communications
George Washington CEO <NA> 1 <NA>
Thomas Jefferson Accountant 1 <NA> <NA>
James Madison Communications Specialist <NA> <NA> 1
Jimmy Carter President 1 <NA> <NA>
Howard Taft Accountant 1 <NA> <NA>
Herbert Hoover CFO 1 1 <NA>
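For comparison, the same fill-only-the-gaps logic can be sketched in base R with `match()`. This is an alternative to the dplyr answer, not a restatement of it. One assumption to flag: the tables are built here with `data.frame()` rather than `as.data.frame(cbind(...))`, because `cbind()` on a mix of character and numeric vectors coerces everything to character.

```r
# Build the tables with data.frame() so the flag columns stay numeric
Contacts2 <- data.frame(
  First = c("George", "Thomas", "James", "Jimmy", "Howard", "Herbert"),
  Last  = c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover"),
  Title = c("CEO", "Accountant", "Communications Specialist",
            "President", "Accountant", "CFO"),
  Finance = NA_real_, Executive = NA_real_, Communications = NA_real_,
  stringsAsFactors = FALSE
)
TableOfTitle <- data.frame(
  Title = c("CEO", "Accountant", "Communications Specialist", "President", "CFO"),
  Finance        = c(NA, 1, NA, 1, 1),
  Executive      = c(1, NA, NA, NA, 1),
  Communications = c(NA, NA, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Row of TableOfTitle that corresponds to each contact's title
idx <- match(Contacts2$Title, TableOfTitle$Title)

for (col in c("Finance", "Executive", "Communications")) {
  missing <- is.na(Contacts2[[col]])
  # Fill only cells that are still NA; existing values are left alone
  Contacts2[[col]][missing] <- TableOfTitle[[col]][idx[missing]]
}

Contacts2
```

Titles that do not appear in TableOfTitle give `NA` in `idx`, so those contacts simply keep their `NA` flags.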
I have a dataset with categorical and NA observations of 10 variables. I want to replace the NA values of each column with the mode. I did a histogram of each variable for identifying the density for each observation and got the mode. I know what values to replace the NAs in each column with.
I saw there was a related post, but I already know what values to replace. Here's the link: Replace mean or mode for missing values in R
Here's code to reproduce the dataset:
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
Here's an example:
> #The head of the first five observations
> head(SmallStoredf, n=5)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 <NA> Male <NA> <NA> <NA> <NA> <NA>
2 45-54 Female <NA> <NA> <NA> <NA> <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
Occupation Education LengthofResidence
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
5 <NA> Completed High School 9 Years
6 <NA> Completed High School 11-15 years
7 <NA> Completed High School 2 Years
In this example, I want NAs in HomeOwnerStatus replaced with Own, HomeMarketValue with 350K-500K, and Occupation with Professional.
EDIT: I tried inputting the values in, but got an error about three of the columns.
> replacementVals <- c(Age = "45-54", Gender = "Male", HouseholdIncome = "50K-75K",
+ MaritalStatus = "Single", PresenceofChildren = "No",
+ HomeOwnerStatus = "Own", HomeMarketValue = "350K-500K",
+ Occupation = "Professional", Education = "Completed High School",
+ LengthofResidence = "11-15yrs")
> indx1 <- replacementVals[col(df2)][is.na(df2[,names(replacementVals)])]
> df2[is.na(df2[,names(replacementVals)])] <- indx1
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, thisvar, value = c("50K-75K", "50K-75K", :
invalid factor level, NA generated
#2: In `[<-.factor`(`*tmp*`, thisvar, value = c("350K-500K", "350K-500K", :
invalid factor level, NA generated
#3: In `[<-.factor`(`*tmp*`, thisvar, value = c("11-15yrs", "11-15yrs", :
invalid factor level, NA generated
Here's the output:
> head(SmallStoredf)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 45-54 Male <NA> Single No Own <NA>
2 45-54 Female <NA> Single No Own <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
8 55-64 Male 75k-100k Married No Own 150k-200k
Occupation Education LengthofResidence
1 Professional Completed High School <NA>
2 Professional Completed High School <NA>
5 Professional Completed High School 9 Years
6 Professional Completed High School 11-15 years
7 Professional Completed High School 2 Years
8 Professional Completed High School 16-19 years
Only NA values in some columns were replaced.
I amended your reproducible example a little bit, here's the setup
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
> dat[6,2]<-NA
#output
# x y
#1 a 1.511781168450847978590
#2 b 0.389843236411431093291
#3 b -0.621240580541803755210
#4 c -2.214699887177499881830
#5 <NA> 1.124930918143108193874
#6 c NA
#7 c -0.016190263098946087311
#8 b 0.943836210685299215051
#9 b 0.821221195098088552200
#10 <NA> 0.593901321217508826322
#11 a 0.918977371608218240873
#12 a 0.782136300731067102276
#13 c 0.074564983365190601328
#14 b -1.989351695863372793127
#15 <NA> 0.619825747894710232799
#16 b -0.056128739529000784558
#17 c -0.155795506705329295238
#18 c -1.470752383899274429169
#19 b -0.478150055108620353206
#20 c 0.417941560199702411005
Now define your replacement values, named by the columns whose NAs you want replaced:
replacementVals<-c(x="Xreplace", y="Yreplace")
and the next call replaces them all in one shot:
dat[is.na(dat[,names(replacementVals)])]<-replacementVals
# x y
#1 a 1.51178116845085
#2 b 0.389843236411431
#3 b -0.621240580541804
#4 c -2.2146998871775
#5 Xreplace 1.12493091814311
#6 c Yreplace
#7 c -0.0161902630989461
#8 b 0.943836210685299
#9 b 0.821221195098089
#10 Yreplace 0.593901321217509
#11 a 0.918977371608218
#12 a 0.782136300731067
#13 c 0.0745649833651906
#14 b -1.98935169586337
#15 Xreplace 0.61982574789471
#16 b -0.0561287395290008
#17 c -0.155795506705329
#18 c -1.47075238389927
#19 b -0.47815005510862
#20 c 0.417941560199702
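But as akrun pointed out, and subsequently solved, this does not map well to your second data frame example: the replacement vector is simply recycled over the NA positions in column-major order rather than matched by column name, which is why row 10 of x received "Yreplace" above. The fix below is taken straight from akrun's comments (so either way they should probably get the check on this question).
Here is the setup; I'm only going to print the final result.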
HomeOwnerStatus = c(NA,NA,NA ,"Rent", "Rent" )
HomeMarketValue = c(NA,NA,NA, "350k", "350k")
Occupation = c(NA,NA,NA, NA, NA)
SmallStoreddf<-data.frame(HomeOwnerStatus,HomeMarketValue,Occupation, stringsAsFactors=FALSE)
replacementVals<-c("HomeOwnerStatus" = "Own", "HomeMarketValue"="350k", "Occupation"="Professional")
Then in two steps (which could be combined into one really long line) you go
#get the values that we will be replacing
indx1<-replacementVals[col(SmallStoreddf)][is.na(SmallStoreddf[, names(replacementVals)])]
#do the replacement
SmallStoreddf[is.na(SmallStoreddf[,names(replacementVals)])] <-indx1
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350k Professional
#2 Own 350k Professional
#3 Own 350k Professional
#4 Rent 350k Professional
#5 Rent 350k Professional
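The "invalid factor level, NA generated" warnings in the question appear when a value is assigned into a factor column and that value is not one of the column's existing levels. A minimal sketch of one way around it, assuming it is acceptable to rebuild each factor: replace column by column, going through character first. The small data frame `df` below is made up for illustration.

```r
# Example data as factors, which is what triggers "invalid factor level"
df <- data.frame(
  HomeOwnerStatus = c(NA, NA, NA, "Rent", "Rent"),
  HomeMarketValue = c(NA, NA, NA, "350k", "350k"),
  stringsAsFactors = TRUE
)
replacementVals <- c(HomeOwnerStatus = "Own", HomeMarketValue = "350k")

for (nm in names(replacementVals)) {
  col <- as.character(df[[nm]])          # drop the factor levels temporarily
  col[is.na(col)] <- replacementVals[[nm]]
  df[[nm]] <- factor(col)                # rebuild the factor, now including the new level
}

df
```

Because each column is looked up by name, the replacement values cannot be recycled into the wrong column. It is also worth checking that the replacement strings match the existing levels exactly, including case and spacing.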
Try: (Using your second example as it was a bit unclear when you showed two datasets)
indx <- which(is.na(SmallStoredf), arr.ind=TRUE)
SmallStoredf[indx] <- c("Own", "350K-500K", "Professional")[indx[,2]]
SmallStoredf
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350K-500K Professional
#2 Own 350K-500K Professional
#3 Own 350K-500K Professional
#4 Rent 350k-500k Professional
#5 Rent 500k-1mm Professional
Upgrading comment.
If you want to replace the missing data with the most frequent category, bear in mind that several categories within a variable may share the highest count. So in the code below, the replacements are randomly sampled from the categories that are most frequent.
# some example data with missing
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
w=rnorm(20),
z=sample(letters[1:3],20,TRUE),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
# function to get replacement for missing
# sample is used to randomly select categories, allowing for the case
# when the maximum frequency is shared by more than one category
f <- function(x) {
tab <- table(x)
l <- sum(is.na(x))
sample(names(tab)[tab==max(tab)], l, TRUE)
}
# as we are using sample, set.seed before replacing
set.seed(1)
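for(i in seq_along(dat)){
  # dat[[i]] is the column itself; is.numeric(dat[i]) would test a
  # one-column data frame and is always FALSE
  if(!is.numeric(dat[[i]]))
    dat[[i]][is.na(dat[[i]])] <- f(dat[[i]])
}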
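To see the tie handling in isolation, here is a small sketch that applies the same helper to a made-up vector in which two categories share the highest count. The vector `v` and the seed are arbitrary choices for illustration.

```r
# Same helper as above: sample among the most frequent categories
f <- function(x) {
  tab <- table(x)          # table() ignores NA by default
  l <- sum(is.na(x))       # number of values to impute
  sample(names(tab)[tab == max(tab)], l, TRUE)
}

# Hypothetical vector in which "a" and "b" tie for the highest count
v <- c("a", "a", "b", "b", "c", NA, NA, NA)

set.seed(42)
v[is.na(v)] <- f(v)
table(v)
```

Each missing value is filled with either "a" or "b", never "c", and the exact split depends on the seed.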
A gentle warning: you should think carefully before imputing missing data this way. For example, income is often more likely to be missing in the highest and lowest categories, so this method may incorrectly impute a middling income for those cases. You should consider why each variable is missing and whether it is reasonable to assume the data are MCAR (missing completely at random) or MAR (missing at random). If so, I would then consider a more robust method of imputation (for example, the mice package).
Thank you for your patience.
I am dealing with a large dataset detailing patients and medications.
Medications are hard to code, as they are (usually) meaningless unless matched with doses.
I have a dataframe with vectors (Drug1, Drug2..... Drug 16) where individual patients are represented by rows.
The vectors are actually factors, with 100s of possible levels (all the drugs the patient could be on).
All I want to do is produce a vector of logicals (TTTTFFFFTTT......) that I could then cbind into a dataframe, which will tell me whether a patient is or is not on a particular drug.
I could then use particularly important drugs' presence or absence as categorical covariates in a model.
I've tried grep, to search along the rows, and I can generate a vector of identifiers, but I cannot seem to generate the vector of logicals.
I realise I'm doing something simply wrong.
names(drugindex)
[1] "book.MRN" "DRUG1" "DRUG2" "DRUG3" "DRUG4" "DRUG5"
[7] "DRUG6" "DRUG7" "DRUG8" "DRUG9" "DRUG10" "DRUG11"
[13] "DRUG12" "DRUG13" "DRUG14" "DRUG15" "DRUG16"
> truvec<-drugindex$book.MRN[as.vector(unlist(apply(drugindex[,2:17], 2, grep, pattern="Lamotrigine")))]
> truvec
[1] 0024633 0008291 0008469 0030599 0027667
37 Levels: 0008291 0008469 0010188 0014217 0014439 0015822 ... 0034262
> head(drugindex)
book.MRN DRUG1 DRUG2 DRUG3 DRUG4 DRUG5
4 0008291 Venlafaxine Procyclidine Flunitrazepam Amisulpiride Clozapine
31 0008469 Venlafaxine Mirtazapine Lithium Olanzapine Metoprolol
3 0010188 Flurazepam Valproate Olanzapine Mirtazapine Esomeprazole
13 0014217 Aspirin Ramipril Zuclopenthixol Lorazepam Haloperidol
15 0014439 Zopiclone Diazepam Haloperidol Paracetamol <NA>
5 0015822 Olanzapine Venlafaxine Lithium Haloperidol Alprazolam
DRUG6 DRUG7 DRUG8 DRUG9 DRUG10 DRUG11 DRUG12
4 Lamotrigine Alprazolam Lithium Alprazolam <NA> <NA> <NA>
31 Lamotrigine Ramipril Alprazolam Zolpidem Trifluoperazine <NA> <NA>
3 Paracetamol Alprazolam Citalopram <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
DRUG13 DRUG14 DRUG15 DRUG16
4 <NA> <NA> <NA> <NA>
31 <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>
And what I want is a vector of logicals for each drug, saying whether that patient is on it
Thank you all for your time.
Ross Dunne MRCPsych
"Te occidere possunt sed te edere ne possunt, nefas est".
You were close with your apply attempt, but MARGIN=2 applies the function over columns, not rows. Also, grep returns the locations of the matches; you want grepl, which returns a logical vector. Try this:
apply(x[,-1], 1, function(x) any(grepl("Aspirin",x)))
You could also use %in%, which you may find more intuitive:
apply(x[,-1], 1, "%in%", x="Aspirin")
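As a self-contained sketch of the row-wise approach, the example below uses a tiny made-up stand-in for `drugindex` (three patients, three drug columns) and also shows one way to get a logical column for several drugs at once. The object names `drug_cols`, `drugs` and `flags` are illustrative only.

```r
# Small hypothetical stand-in for `drugindex`
drugindex <- data.frame(
  book.MRN = c("0008291", "0010188", "0014217"),
  DRUG1 = c("Venlafaxine", "Flurazepam", "Aspirin"),
  DRUG2 = c("Lamotrigine", "Valproate", "Ramipril"),
  DRUG3 = c(NA, "Olanzapine", NA),
  stringsAsFactors = TRUE
)

drug_cols <- drugindex[, -1]

# One logical per patient: is "Lamotrigine" in any drug column of the row?
on_lamotrigine <- apply(drug_cols, 1, function(row) "Lamotrigine" %in% row)

# Several drugs at once: a logical matrix with one column per drug
drugs <- c("Lamotrigine", "Aspirin", "Olanzapine")
flags <- sapply(drugs, function(d) apply(drug_cols, 1, function(row) d %in% row))

cbind(drugindex["book.MRN"], flags)
```

`%in%` tests for an exact match, whereas `grepl()` matches a pattern anywhere in the string, so the two can differ for names that contain one another.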
First, a comment on data structure. You have data in what some call a "wide" format, with a single row per patient and multiple columns for the drugs. It is usually the case that the "long" format, with repeated rows per patient and a single column for drugs, is more amenable to data manipulation. To reshape your data from wide to long and vice versa, take a look at the reshape package. In this case, you would have something like:
library(reshape)
dnow <- melt(drugindex, id.var='book.MRN')
subset(dnow, value=='Lamotrigine')
Much cleaner, and obvious, code, if I may say so ...
Edit: If you need the old structure back you can use cast:
cast(subset(dnow, value=='Lamotrigine'), book.MRN ~ value)
as suggested by @jonw in the comments.
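If the reshape package is not available, the same wide-to-long idea can be sketched in base R. This swaps `melt()` for a manual stack and uses a tiny made-up `drugindex`; the names `long`, `hits` and `on_lamotrigine` are illustrative.

```r
# Small hypothetical stand-in for `drugindex`
drugindex <- data.frame(
  book.MRN = c("0008291", "0010188", "0014217"),
  DRUG1 = c("Venlafaxine", "Flurazepam", "Aspirin"),
  DRUG2 = c("Lamotrigine", "Valproate", "Ramipril"),
  stringsAsFactors = FALSE
)

drug_cols <- setdiff(names(drugindex), "book.MRN")

# Stack the drug columns: one row per patient/drug pair
long <- data.frame(
  book.MRN = rep(drugindex$book.MRN, times = length(drug_cols)),
  variable = rep(drug_cols, each = nrow(drugindex)),
  value    = unlist(drugindex[drug_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Patients with a Lamotrigine row, mapped back to one logical per patient
hits <- long$book.MRN[!is.na(long$value) & long$value == "Lamotrigine"]
on_lamotrigine <- drugindex$book.MRN %in% hits
```

The final `%in%` step gives the vector of logicals asked for in the question, in the original patient order.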