Recoding multiple variables based on logical rules in external table. - r

Objective
Given data ds, compute a new variable ds$h1 from ds$raw1 and ds$raw2 according to the harmonization rule specified in the object hrule.
The reproducible example contains the responses of 10 individuals on 2 measures, raw1 and raw2:
>ds
id raw1 raw2
1 1 1 1
2 2 1 0
3 3 0 1
4 4 0 0
5 5 NA 1
6 6 NA 0
7 7 1 NA
8 8 0 NA
9 9 NA NA
10 10 1 1
These two variables need to be transformed into a single, harmonized variable according to a rule (developed qualitatively). The harmonization rules are encoded in the object hrule:
>hrule
raw1 raw2 h1
1 0 0 0
2 0 1 1
3 0 NA 0
4 1 0 1
5 1 1 1
6 1 NA 1
7 NA 0 0
8 NA 1 1
9 NA NA NA
Thus, the rule should be read for row 1 as:
if the respondent provides a value of 0 on raw1 and a value of 0 on raw2, then the value of h1 should be 0.
Functional objective
Develop a function that takes ds, hrule, the names of the input variables as a character vector (c("raw1", "raw2")), and the name of the harmonization variable ("h1"), and outputs a new harmonized variable (ds$h1).
Starter code
library(magrittr) # provides %>%
(ds <- data.frame("id" = 1:10,
                  "raw1" = c(1,1,0,0,NA,NA,1 ,0 ,NA,1),
                  "raw2" = c(1,0,1,0,1 ,0 ,NA,NA,NA,1)))
(response_profile <- ds %>% dplyr::group_by(raw1, raw2) %>% dplyr::summarize(count=dplyr::n()))
(hrule <- cbind(response_profile, "h1" = c(0,1,0,1,1,1,0,1,NA)))
new_function <- function(ds, hrule,
                         variable_names, # variable_names = c("raw1", "raw2"), the number will vary
                         harmony_name    # harmony_name = "h1", there might be "h2"
){
}
Thanks in advance for your ideas!

Here's the full solution, suggested by @Symbolix
rm(list=ls(all=TRUE)) #Clear the memory of variables from previous run. This is not called by knitr, because it's above the first chunk.
cat("\f")
library(magrittr)
(ds <- data.frame("id" = 1:10,
"raw1" = c(1,1,0,0,NA,NA,1 ,0 ,NA,1),
"raw2" = c(1,0,1,0,1 ,0 ,NA,NA,NA,1)))
response_profile <- ds %>% dplyr::group_by(raw1, raw2) %>% dplyr::summarize(count=dplyr::n()) %>% dplyr::select(-count)
(hrule <- cbind(response_profile,
"h1" = c(0,1,0 ,1,1,1 ,0 ,1 ,NA), # at least one 1 to produce 1
"h2"= c(0,0,NA,0,1,NA,NA,NA,NA) # both must be 1
))
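recode_from_meta <- function(ds, hrule, variable_names, harmony_name){
  # Left join: keep every row of ds and attach the matching value from hrule.
  # merge() sorts the result by the key columns, so the row order of ds changes.
  merge(ds, hrule[, c(variable_names, harmony_name)], by = variable_names, all.x = TRUE)
}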
> hrule
raw1 raw2 h1 h2
1 0 0 0 0
2 0 1 1 0
3 0 NA 0 NA
4 1 0 1 0
5 1 1 1 1
6 1 NA 1 NA
7 NA 0 0 NA
8 NA 1 1 NA
9 NA NA NA NA
> (d <- recode_from_meta(ds, hrule,variable_names=c("raw1", "raw2"), harmony_name="h1"))
raw1 raw2 id h1
1 0 0 4 0
2 0 1 3 1
3 0 NA 8 0
4 1 0 2 1
5 1 1 1 1
6 1 1 10 1
7 1 NA 7 1
8 NA 0 6 0
9 NA 1 5 1
10 NA NA 9 NA
> (d <- recode_from_meta(ds, hrule,variable_names=c("raw1", "raw2"), harmony_name="h2"))
raw1 raw2 id h2
1 0 0 4 0
2 0 1 3 0
3 0 NA 8 NA
4 1 0 2 0
5 1 1 1 1
6 1 1 10 1
7 1 NA 7 NA
8 NA 0 6 NA
9 NA 1 5 NA
10 NA NA 9 NA
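One thing to note in the output above: merge() sorts the result by the key columns, so the rows no longer follow the order of ds. If the original order matters, a lookup with match() is one option. The following is a minimal sketch, not part of the original answer. The function name recode_keep_order is hypothetical, hrule is typed in by hand so the example runs without dplyr, and it assumes every combination of values appears only once in hrule.

```r
ds <- data.frame(id   = 1:10,
                 raw1 = c(1, 1, 0, 0, NA, NA, 1, 0, NA, 1),
                 raw2 = c(1, 0, 1, 0, 1, 0, NA, NA, NA, 1))

# Rule table typed in by hand (same content as hrule in the question)
hrule <- data.frame(raw1 = c(0, 0, 0, 1, 1, 1, NA, NA, NA),
                    raw2 = c(0, 1, NA, 0, 1, NA, 0, 1, NA),
                    h1   = c(0, 1, 0, 1, 1, 1, 0, 1, NA))

# Hypothetical helper: looks up the rule for each row without reordering ds
recode_keep_order <- function(ds, hrule, variable_names, harmony_name) {
  # Build one text key per row. paste() turns NA into the string "NA",
  # so missing values take part in the lookup like any other value.
  make_key <- function(d) {
    do.call(paste, c(unname(as.list(d[variable_names])), sep = "|"))
  }
  position <- match(make_key(ds), make_key(hrule))
  ds[[harmony_name]] <- hrule[[harmony_name]][position]
  ds
}

out <- recode_keep_order(ds, hrule, c("raw1", "raw2"), "h1")
print(out)
```

Rows whose combination is not listed in hrule would get NA, the same as with all.x = TRUE in merge().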

Related

Conditioning error, progression of logic in mutate/elseif_ pipeline

I'm trying to work out why a code like this won't give me the expected results. I understand there are better ways of achieving the results (cut, etc.) but I am specifically trying to understand why the mutate>ifelse pipeline progression to replace values doesn't work.
A <- c(1,0,0,0,NA,0,1,0,1,0,0,1,1,1,NA,NA,NA,1,0,0,0,1,1,1,0,1,NA)
B <- c(1,0,0,NA,0,1,1,1,0,1,NA,1,0,1,NA,NA,1,0,01,0,0,0,NA,0,1,0,1)
C <- c(0,NA,0,1,0,1,NA,1,0,1,NA,0,1,0,NA,NA,1,0,01,NA,0,0,NA,1,NA,NA,1)
df <- data.frame(A, B, C)
df$D <- NA
df <- df %>%
mutate(D=ifelse(A==0 & B==0 & C==0,0,D)) %>% #assign 0 to d IF all 3 variables 0
mutate(D=ifelse(A==0 | B==0 | C==0,0,D)) %>% #now assign 0 to d IF ANY of 3 variables 0
mutate(D=ifelse(A==1 | B==1 | C==1,1,D)) #now reassign d to 1 if any of the variables has the value 1
> summary(as.factor(df$D))
0 1 NA's
2 19 6
But looking at the cross tabulation, my aim is to get 0=2 and NA=2, with the rest assigned 1. I can't figure out why my code's logic is not working.
> ftable(xtabs(~A+B+C, df, addNA = TRUE, na.action = NULL)) #matches AV variable
C 0 1 NA
A B
0 0 2 0 2
1 0 4 1
NA 0 1 1
1 0 3 2 1
1 3 0 1
NA 0 0 1
NA 0 1 0 0
1 0 2 0
NA 0 0 2
Edit: corrected typo
Look at your code step by step, specifically the two mutate commands with the OR conditions. For rows that contain NAs and 1s (but no zeroes), R can't tell whether the row contains a zero, because it does not know what the NA might be. So the condition in the second mutate evaluates to NA for any row that has only 1s and NAs, and ifelse returns NA for it. The third step does the same, just with 1s: any row that contains only 0s and NAs gets NA, which overwrites the 0 assigned in the previous step.
You can verify this by:
x <- c(0, 0, NA)
any(x == 0)
[1] TRUE
any(x == 1)
[1] NA
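The same logic can be traced for a single row. The sketch below replays the three steps from the question on one set of values (A = 0, B = 0, C = NA). The names step1, step2 and step3 are only there to record the intermediate results.

```r
A <- 0; B <- 0; C <- NA
D <- NA

# Step 1: TRUE & TRUE & NA is NA, and ifelse() returns NA for an NA test
D <- ifelse(A == 0 & B == 0 & C == 0, 0, D)
step1 <- D

# Step 2: TRUE | TRUE | NA is TRUE, so D becomes 0
D <- ifelse(A == 0 | B == 0 | C == 0, 0, D)
step2 <- D

# Step 3: FALSE | FALSE | NA is NA, so the 0 from step 2 is replaced by NA
D <- ifelse(A == 1 | B == 1 | C == 1, 1, D)
step3 <- D

print(c(step1 = step1, step2 = step2, step3 = step3))
```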
You can do:
library(tidyverse)
df2 <- df %>%
mutate(D = case_when(A == 0 & B == 0 & C == 0 ~ 0,
is.na(A) & is.na(B) & is.na(C) ~ NA_real_,
TRUE ~ 1))
which gives:
A B C D
1 1 1 0 1
2 0 0 NA 1
3 0 0 0 0
4 0 NA 1 1
5 NA 0 0 1
6 0 1 1 1
7 1 1 NA 1
8 0 1 1 1
9 1 0 0 1
10 0 1 1 1
11 0 NA NA 1
12 1 1 0 1
13 1 0 1 1
14 1 1 0 1
15 NA NA NA NA
16 NA NA NA NA
17 NA 1 1 1
18 1 0 0 1
19 0 1 1 1
20 0 0 NA 1
21 0 0 0 0
22 1 0 0 1
23 1 NA NA 1
24 1 0 1 1
25 0 1 NA 1
26 1 0 NA 1
27 NA 1 1 1
And then
df2 %>% count(D)
D n
1 0 2
2 1 23
3 NA 2
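For comparison, the same rule can be written in base R by handling the NAs explicitly. This is only a sketch on a small made-up data frame (toy is not the data from the question): a row gets NA when every value is missing, 0 when every value is 0, and 1 otherwise.

```r
toy <- data.frame(A = c(0, NA, 0, 1),
                  B = c(0, NA, 0, NA),
                  C = c(0, NA, NA, 0))

m <- as.matrix(toy)

# TRUE only where all three values are observed and equal to 0
all_zero <- rowSums(m == 0, na.rm = TRUE) == ncol(m)
# TRUE only where all three values are missing
all_na <- rowSums(is.na(m)) == ncol(m)

toy$D <- ifelse(all_na, NA_real_, ifelse(all_zero, 0, 1))
print(toy)
```

Because both conditions are computed with rowSums() and is.na(), neither of them is ever NA, so ifelse() never has to deal with an undecidable test.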

Creating dummies with apply in R

I have data about different study strategies for individuals (stored in columns labeled Strategy_A, Strategy_B, Strategy_C). The strategies are coded 1-15. I want to create a dummy for each strategy (e.g. strategy1, strategy2, etc.) because each student can list up to 3 strategies.
Example Data
ID = c(1, 2, 3, 4, 5)
Strategy_A = c(10, 12, 13, 1, 2)
Strategy_B = c(1, 2, 1, 4, 5)
Strategy_C = c(2, 3, 6, 8, 15)
all = data.frame(ID, Strategy_A, Strategy_B, Strategy_C)
I thought about using apply and creating a function linked to the fastDummies package.
dummies = function(x){
dummy_cols(x)
}
new = apply(all [,-1], 2, dummies)
new = as.data.frame(new)
However, this creates dummies for StrategyA_1 StrategyA_2 StrategyA_3 rather than summarizing the dummies as Strategy1 Strategy2 Strategy3. Any ideas how to fix this?
After a small transformation of all, you can use dummy.data.frame() from dummies (you can also use dummy_cols() from fastDummies) and then aggregate per ID.
all <- data.frame(ID = rep(all$ID, 3),
Strategy = c(all$Strategy_A, all$Strategy_B, all$Strategy_C)) # data frame "all" with one column Strategy
library(dummies)
all <- dummy.data.frame(all, "Strategy") # or fastDummies::dummy_cols(all, "Strategy")
aggregate(. ~ ID, all, sum) # since strategies are now dummies, the sum will always be 0 or 1
# output
ID Strategy1 Strategy2 Strategy3 Strategy4 Strategy5 Strategy6 Strategy8 Strategy10 Strategy12 Strategy13 Strategy15
1 1 1 1 0 0 0 0 0 1 0 0 0
2 2 0 1 1 0 0 0 0 0 1 0 0
3 3 1 0 0 0 0 1 0 0 0 1 0
4 4 1 0 0 1 0 0 1 0 0 0 0
5 5 0 1 0 0 1 0 0 0 0 0 1
Here is a tidyverse approach.
library(tidyverse)
new <- all %>% gather(select = -ID) %>%
mutate(key = NULL, num = 1) %>%
spread(value, num)
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 NA NA NA NA NA 1 NA NA NA
# 2 2 NA 1 1 NA NA NA NA NA 1 NA NA
# 3 3 1 NA NA NA NA 1 NA NA NA 1 NA
# 4 4 1 NA NA 1 NA NA 1 NA NA NA NA
# 5 5 NA 1 NA NA 1 NA NA NA NA NA 1
new[is.na(new)] <- 0
new
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 0 0 0 0 0 1 0 0 0
# 2 2 0 1 1 0 0 0 0 0 1 0 0
# 3 3 1 0 0 0 0 1 0 0 0 1 0
# 4 4 1 0 0 1 0 0 1 0 0 0 0
# 5 5 0 1 0 0 1 0 0 0 0 0 1
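If an extra package is not wanted, a cross-tabulation in base R gives the same kind of result. The sketch below is an assumption-laden alternative rather than one of the posted answers: the data frame is named strategies here (instead of all, which masks the base function), and long_id, long_strategy, dummy and result are illustrative names.

```r
strategies <- data.frame(ID = c(1, 2, 3, 4, 5),
                         Strategy_A = c(10, 12, 13, 1, 2),
                         Strategy_B = c(1, 2, 1, 4, 5),
                         Strategy_C = c(2, 3, 6, 8, 15))

# Stack the three strategy columns into one long vector, repeating the IDs
long_id <- rep(strategies$ID, times = 3)
long_strategy <- unlist(strategies[-1], use.names = FALSE)

# Cross-tabulate ID by strategy, then reduce the counts to 0/1 indicators
tab <- table(long_id, long_strategy)
dummy <- unclass(tab) > 0
storage.mode(dummy) <- "integer"
colnames(dummy) <- paste0("Strategy", colnames(dummy))

result <- data.frame(ID = as.numeric(rownames(dummy)), dummy)
print(result)
```

Only strategies that occur in the data get a column, as in the outputs above.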

how to merge in R and automatically have 0 counts instead of NA

I have two (really: multiple) vectors and am merging them together (really: into a very large df). When I merge, if a value (such as a) occurs in one vector but not in the other, its count appears as NA in the merged df:
> a=data.frame(table(letters[1:4]))
> b=data.frame(table(letters[4:10]))
> merge(a,b,by='Var1',all=1)
Var1 Freq.x Freq.y
1 a 1 NA
2 b 1 NA
3 c 1 NA
4 d 1 1
5 e NA 1
6 f NA 1
7 g NA 1
8 h NA 1
9 i NA 1
10 j NA 1
> m=merge(a,b,by='Var1',all=1)
Is it possible to convert the NAs straight to 0 during the merge, without adding an extra line of code like this?
> m[is.na(m)]=0
> m
Var1 Freq.x Freq.y
1 a 1 0
2 b 1 0
3 c 1 0
4 d 1 1
5 e 0 1
6 f 0 1
7 g 0 1
8 h 0 1
9 i 0 1
10 j 0 1
Reason: my df is quite large, and I do not want to use lots of processing power
What you suggested should work directly, at least as I understand your question:
> d <- data.frame(rep(NA,10),rep(1,10),rep(NA,10))
> d
rep.NA..10. rep.1..10. rep.NA..10..1
1 NA 1 NA
2 NA 1 NA
3 NA 1 NA
4 NA 1 NA
5 NA 1 NA
6 NA 1 NA
7 NA 1 NA
8 NA 1 NA
9 NA 1 NA
10 NA 1 NA
> d[is.na(d)] <- 0
> d
rep.NA..10. rep.1..10. rep.NA..10..1
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
5 0 1 0
6 0 1 0
7 0 1 0
8 0 1 0
9 0 1 0
10 0 1 0
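When the counts come from table(), another way to avoid the NAs is to tabulate every vector over the same set of levels, so that absent values are counted as 0 from the start and no merge is needed. This is a sketch under the assumption that the raw vectors are still available. The names x, y and lev are illustrative.

```r
x <- letters[1:4]
y <- letters[4:10]

# Shared set of levels: every value that occurs in either vector
lev <- sort(union(x, y))

# table() on a factor reports 0 for levels that do not occur
m <- data.frame(Var1   = lev,
                Freq.x = as.vector(table(factor(x, levels = lev))),
                Freq.y = as.vector(table(factor(y, levels = lev))))
print(m)
```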

Dynamic column name in for loop

I have a problem with my code. I have a data frame like this:
A <- c(21, 234, NA, 286,NA)
B <- c(3,NA,NA, 8, 10)
data <- data.frame(A,B)
data
A B
1 21 3
2 234 NA
3 NA NA
4 286 8
5 NA 10
And the result I want is:
A B A_NA B_NA
1 21 3 0 0
2 234 NA 0 1
3 NA NA 1 1
4 286 8 0 0
5 NA 10 1 0
Here is my simple code, but something doesn't work:
for(i in c(1:ncol(data)))
{
data[, ncol(data) + 1] <- ifelse(is.na(data[i]), 1, 0)
names(data)[ncol(data)] <- paste0(colnames(data[i]), "_NA")
}
because the result is:
A B A A B A A
1 21 3 0 0 0 0 0
2 234 NA 0 0 1 0 0
3 NA NA 1 1 1 0 0
4 286 8 0 0 0 0 0
5 NA 10 1 1 0 0 0
We can use lapply to loop over the columns of 'data', check whether the elements are NA (is.na(x)), convert to integer (as.integer) and assign the output to new columns
data[paste0(names(data), "_NA")] <- lapply(data, function(x) as.integer(is.na(x)))
data
# A B A_NA B_NA
#1 21 3 0 0
#2 234 NA 0 1
#3 NA NA 1 1
#4 286 8 0 0
#5 NA 10 1 0
Adding columns based on a condition:
data$A_NA<-ifelse(is.na(data$A),1,0)
data$B_NA<-ifelse(is.na(data$B),1,0)
Addition
In a loop (each assignment is built as text and then evaluated):
for(nm in names(data))
eval(parse(text = paste0("data$",nm,"_NA<-ifelse(is.na(data$",nm,"),1,0)")))
addition2
Alternatively one can use:
for(nm in names(data)){
assign(paste0(nm,"_NA"), ifelse(is.na(data[nm]),1,0))
tempo<-data.frame(get(paste0(nm,"_NA")));names(tempo)<-paste0(nm,"_NA")
data<-cbind(data,tempo)
}
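A plain for loop can also do this without eval(parse()) or assign(), by addressing columns by name with [[ ]]. This is a minimal sketch on the data from the question, not a replacement for the lapply answer above.

```r
data <- data.frame(A = c(21, 234, NA, 286, NA),
                   B = c(3, NA, NA, 8, 10))

# names(data) is evaluated once, before any column is added,
# so the loop visits only the original columns "A" and "B"
for (nm in names(data)) {
  data[[paste0(nm, "_NA")]] <- as.integer(is.na(data[[nm]]))
}
print(data)
```

Using data[[nm]] returns the column as a plain vector, so is.na() also returns a vector and the new column gets exactly the name built by paste0().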

Convert 0s to NAs when that is the only value in a row. Keep all other 0s using R

I have a dataframe 'animals':
ID A B C D
1 Bear 1 1 1 0
2 Tiger 0 0 0 0
3 Horse 1 0 1 0
4 Badger 0 0 0 1
5 Rabbit 1 1 0 1
6 Otter 0 0 0 0
7 Peacock 1 0 0 0
I would like to convert the zeros in rows only containing zeros to NAs but to leave other zeros as they are. I can do this as follows:
animals$Result = rowSums(animals[2:ncol(animals)])
df = data.frame()
for(row in 1:nrow(animals)) {
row = as.data.frame(animals[row,])
if(row$Result == 0){
row[2:5] = NA
}
df = rbind(df,row)
print(row)}
df$Result = NULL
To obtain this:
ID A B C D
Bear 1 1 1 0
Tiger NA NA NA NA
Horse 1 0 1 0
Badger 0 0 0 1
Rabbit 1 1 0 1
Otter NA NA NA NA
Peacock 1 0 0 0
However, I feel there should be an easier way to do this. Is there? Thank you!
We can do this without a loop by creating a logical vector from the number of non-zero values in each row, counted with rowSums. Rows where that count is 0 are selected, and all their columns except the first are assigned NA (df1 here is the data frame called animals in the question).
df1[!rowSums(df1[-1]!=0), -1] <- NA
df1
# ID A B C D
#1 Bear 1 1 1 0
#2 Tiger NA NA NA NA
#3 Horse 1 0 1 0
#4 Badger 0 0 0 1
#5 Rabbit 1 1 0 1
#6 Otter NA NA NA NA
#7 Peacock 1 0 0 0
Here is a second base R method that uses Reduce to find the rows to set to NA and then lapply to loop through the variables, with replace doing the replacement work (df is the data frame called animals in the question).
# find rows to set to NA
nas <- !Reduce("|", df[-1])
# run through relevant variables, setting desired elements to NA
df[-1] <- lapply(df[-1], replace, nas, NA)
This returns
df
ID A B C D
1 Bear 1 1 1 0
2 Tiger NA NA NA NA
3 Horse 1 0 1 0
4 Badger 0 0 0 1
5 Rabbit 1 1 0 1
6 Otter NA NA NA NA
7 Peacock 1 0 0 0
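Neither answer shows how the data frame is built, so here is a self-contained sketch of the rowSums idea using the name animals from the question. The helper names value_cols and all_zero are illustrative, and the sketch assumes the value columns contain no NAs to begin with.

```r
animals <- data.frame(
  ID = c("Bear", "Tiger", "Horse", "Badger", "Rabbit", "Otter", "Peacock"),
  A  = c(1, 0, 1, 0, 1, 0, 1),
  B  = c(1, 0, 0, 0, 1, 0, 0),
  C  = c(1, 0, 1, 0, 0, 0, 0),
  D  = c(0, 0, 0, 1, 1, 0, 0)
)

# Every column except the identifier takes part in the check
value_cols <- setdiff(names(animals), "ID")

# TRUE for rows in which no value differs from 0
all_zero <- rowSums(animals[value_cols] != 0) == 0

# Blank out only those rows, leaving the ID column untouched
animals[all_zero, value_cols] <- NA
print(animals)
```

If the value columns could already contain NAs, rowSums() would return NA for those rows and the assignment would fail, so that case would need na.rm = TRUE and a decision about how such rows should be treated.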
