Recoding multiple variables on different scales using across() - r

Let's say I have the following data:
data <- data.frame("ID" = c(1:5),
"Var1" = c(1,0,1,0,0),
"Var2" = c(99,2,1,3,2))
Each variable beginning with "Var" has a different numeric scale. I want to recode these numeric values into text. To do this I can use something like:
Var1_recode <- c("1 = 'yes'; 0 = 'no'")
Var2_recode <- c("99 = 'unknown'; 1 = 'weak'; 2 = 'moderate'; 3 = 'strong'")
data_recoded <- data %>%
mutate(Var1 = car::recode(Var1, Var1_recode),
Var2 = car::recode(Var2, Var2_recode))
However, in a large dataset with lots of columns to be recoded, specifying each recoded variable in mutate would lead to lots of repetition. My question: is there a way to use across to recode all of my "Var" variables with the relevant recode variables? The output for this example would look like this:
ID Var1 Var2
1 1 yes unknown
2 2 no moderate
3 3 yes weak
4 4 no strong
5 5 no moderate
I've tried searching for a solution like the following, but I can't work out a way of specifying the relevant recode vector for each column in my data:
data_recoded <- data %>%
mutate(across(.cols = starts_with("Var"), ~ car::recode(.x, relevant_recode_vector_here)))
Any help would be much appreciated.

One option could be:
data %>%
mutate(across(Var1:Var2, ~ car::recode(., get(paste0(cur_column(), "_recode")))))
ID Var1 Var2
1 1 yes unknown
2 2 no moderate
3 3 yes weak
4 4 no strong
5 5 no moderate
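A variation on the same idea, as a minimal sketch: instead of looking up separate `Var1_recode`, `Var2_recode` objects with get(), the specifications can live in one named list that is indexed with cur_column(). The list name `recode_specs` is an assumption made for illustration; the data and recode strings are the ones from the question.

```r
library(dplyr)

data <- data.frame(ID = 1:5,
                   Var1 = c(1, 0, 1, 0, 0),
                   Var2 = c(99, 2, 1, 3, 2))

# Hypothetical container: one car::recode() specification per column,
# named after the column it applies to
recode_specs <- list(
  Var1 = "1 = 'yes'; 0 = 'no'",
  Var2 = "99 = 'unknown'; 1 = 'weak'; 2 = 'moderate'; 3 = 'strong'"
)

# cur_column() gives the name of the column currently being processed,
# which selects the matching specification from the list
data_recoded <- data %>%
  mutate(across(all_of(names(recode_specs)),
                ~ car::recode(.x, recode_specs[[cur_column()]])))

data_recoded
```

Keeping the rules in one list means that adding a column to recode only requires adding one list element.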

Include the recode rules in a list and apply them using Map:
recode_rules <- list(c("1 = 'yes'; 0 = 'no'"),
c("99 = 'unknown'; 1 = 'weak'; 2 = 'moderate'; 3 = 'strong'"))
data[-1] <- Map(car::recode, data[-1], recode_rules)
data
# ID Var1 Var2
#1 1 yes unknown
#2 2 no moderate
#3 3 yes weak
#4 4 no strong
#5 5 no moderate
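The Map call above pairs columns and rules by position, so the list has to be in the same order as the columns. A minimal base R sketch that pairs them by name instead is shown here. It swaps car::recode for a plain named-vector lookup, and the object name `lookups` is an assumption made for illustration.

```r
data <- data.frame(ID = 1:5,
                   Var1 = c(1, 0, 1, 0, 0),
                   Var2 = c(99, 2, 1, 3, 2))

# Hypothetical lookup tables: names are the old codes (as text),
# values are the new labels
lookups <- list(
  Var1 = c("1" = "yes", "0" = "no"),
  Var2 = c("99" = "unknown", "1" = "weak", "2" = "moderate", "3" = "strong")
)

cols <- names(lookups)

# Index each lookup vector by the column's codes converted to character;
# codes without an entry in the lookup become NA
data[cols] <- Map(function(x, key) unname(key[as.character(x)]),
                  data[cols], lookups)

data
```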

Related

Create a chi-square table from 4 columns and pair 2 of the values together to make one dependent and the other independent

I have a list of columns below.
col 1|col 2|col 3|col 4|col 5|Yes Col_B|No Col_B|Yes Col_W|No Col_W
1 1 3 3 5 7 9 3 2
What I would like to do is take the last four columns (Yes Col_B, No Col_B, Yes Col_W, and No Col_W) and treat them as two columns:
Yes or No| B or W
7 B
9 B
3 W
2 W
Now that I have two temporary columns, I could run a chi-square test to indicate whether Yes or No depends on B or W:
test <- chisq.test(table(data$YesorNo, data$BorW))
First we use pivot_longer from tidyr, and set it to create one group (line) for every column:
newdf = tidyr::pivot_longer(df[,6:9], cols=everything())
Which gives:
name value
1 Yes Col_B 7
2 No Col_B 9
3 Yes Col_W 3
4 No Col_W 2
Now we need to separate the name column into two, one for the yes or no, one for the B or W. We do that with finding a pattern in those names (regular expressions):
The pattern is (yes or no)( Col_)(B or W), we write that as "(Yes|No) Col_(B|W)". Then we run a loop to create one column for the first group - where the groups are set by the brackets - (given by "\\1"), and another for the second ("\\2"), and use paste0("\\",i) to do this.
newdf = cbind(NA, NA, newdf) #Creating 2 empty columns
for(i in c(1,2)){
newdf[,i] = gsub("(Yes|No) Col_(B|W)",
paste0("\\",i),
newdf$name)}
newdf$name = NULL #Getting rid of the name column
colnames(newdf) = c("Yes or No", "B or W", "Value")
Output:
Yes or No B or W Value
1 Yes B 7
2 No B 9
3 Yes W 3
4 No W 2
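To get from this long table to the test itself, the counts in the value column have to be used as cell counts; calling table() on the two label columns alone would only count rows. A minimal sketch with xtabs() follows. The column names `YesNo`, `BW` and `Value` are simplified stand-ins (an assumption) for the names used above, and the numbers are retyped from the output.

```r
# Long-format counts, retyped from the output above
long <- data.frame(
  YesNo = c("Yes", "No", "Yes", "No"),
  BW    = c("B", "B", "W", "W"),
  Value = c(7, 9, 3, 2)
)

# xtabs() sums Value within each YesNo x BW cell, giving a 2 x 2 table
tab <- xtabs(Value ~ YesNo + BW, data = long)
tab

# Run the test on the table of counts; R may warn that the chi-squared
# approximation is unreliable with counts this small
test <- chisq.test(tab)
test
```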
Here is an alternative to Ricardo's answer, where most of the name splitting and separation is accomplished within the pivot_longer function:
df<-data.frame(`Yes Col_B`=7, `No Col_B`=9, `Yes Col_W`=3, `No Col_W`=2)
library(tidyr)
library(dplyr)
library(stringr) # needed for str_replace()
answer <- pivot_longer(df, contains("Col_"), names_sep = "_", names_to=c("Yes_No", ".value")) %>%
mutate(Yes_No=str_replace(Yes_No, "\\.Col", ""))
answer
## A tibble: 2 x 3
# Yes_No B W
# <chr> <dbl> <dbl>
#1 Yes 7 3
#2 No 9 2
chisq.test(answer[ , c("B", "W")])
# some expected counts are below 5, so Fisher's exact test is the safer choice
fisher.test(answer[ , c("B", "W")])
The chi-squared test generally needs an expected count of at least 5 in every cell, so I have included Fisher's exact test as an alternative.
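The rule of thumb can be checked directly, because chisq.test() returns the expected counts it computed. A minimal sketch, with the four counts from the question typed into a matrix; the names `counts` and `expected` are illustrative.

```r
# 2 x 2 table of the counts from the question (filled column by column)
counts <- matrix(c(7, 9, 3, 2), nrow = 2,
                 dimnames = list(Answer = c("Yes", "No"),
                                 Group  = c("B", "W")))

# Expected counts under independence: row total * column total / grand total
expected <- suppressWarnings(chisq.test(counts))$expected
expected

# TRUE if any cell falls below the usual threshold of 5
any(expected < 5)

# Fisher's exact test does not rely on the large-sample approximation
fisher.test(counts)
```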

Find rows with incomplete set depending on a factor, then replace values that exist by NA for the incomplete set

I cannot work this one out.
I have an incomplete dataset (many rows and variables) with one factor that specifies whether all the other variables are pre- or post- something. I need to get summary statistics for all variables pre- and post-, only including rows where the pre- AND post- values are not NA.
I am trying to find a way to replace existing values with NA if the set is incomplete separately for each variable.
The following is a simple example of what I am trying to achieve:
df = data.frame(
id = c(1,1,2,2),
myfactor = as.factor(c(1,2,1,2)),
var2change = c(10,10,NA,20),
var3change = c(5,10,15,20),
var4change = c(NA,2,3,8)
)
which leads to:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 2
3 2 1 NA 15 3
4 2 2 20 20 8
My desired output would be:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 NA
3 2 1 NA 15 3
4 2 2 NA 20 8
I have much more than one variable to deal with and the set is incomplete in a different way for each variable independently. I have the feeling this may be achieved with smart use of existing functions from the plyr / tidyr packages but I cannot find an elegant way to apply the concepts to my problem.
Any help would be appreciated.
You can group by id and if any value has NA in it replace all of them with NA. To apply a function to multiple columns we use across.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('var'), ~if(any(is.na(.))) NA else .))
#for dplyr < 1.0.0 we can use `mutate_at`
#mutate_at(vars(starts_with('var')), ~if(any(is.na(.))) NA else .)
# id myfactor var2change var3change var4change
# <dbl> <fct> <dbl> <dbl> <dbl>
#1 1 1 10 5 NA
#2 1 2 10 10 NA
#3 2 1 NA 15 3
#4 2 2 NA 20 8
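For comparison, the same masking can be sketched in base R with ave(), followed by the per-level summary the question is ultimately after. This is a sketch under the assumption that every column to mask starts with "var"; the helper name `vars` is illustrative.

```r
df <- data.frame(
  id = c(1, 1, 2, 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20),
  var3change = c(5, 10, 15, 20),
  var4change = c(NA, 2, 3, 8)
)

vars <- grep("^var", names(df), value = TRUE)

df[vars] <- lapply(df[vars], function(x) {
  # 1 for every row whose id has at least one NA in this column, else 0
  incomplete <- ave(as.integer(is.na(x)), df$id, FUN = max)
  # blank out the whole pre/post set when it is incomplete
  replace(x, incomplete == 1, NA)
})

df

# Means per pre/post level, now based on complete sets only
aggregate(df[vars], by = list(myfactor = df$myfactor),
          FUN = mean, na.rm = TRUE)
```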
It would help to have a grouping variable (group) as well as your time variable (myfactor). Then you can do some finagling to create the variables you want with dplyr.
library(dplyr)
df = data.frame(
group = rep(c(1,2), each = 2),
myfactor = as.factor(c(1,2,1,2)),
var2change = c(10,10,NA,20)
)
df %>% group_by(group) %>%
mutate(var3change = all(!is.na(var2change)),
var4change = if_else(var3change, var2change, as.numeric(NA)))
I'm assuming that the dataset you have is ordered, so each pair of observations is grouped by their row index.
By default, the mean() function will return an NA if any of the inputs to it are NA. This is therefore a neat way of getting an NA by group, using dplyr.
library(dplyr)
df = data.frame(
myfactor = as.factor(c(1,2,1,2)),
var2change = c(10,10,NA,20)
)
# 1 Create ID variable to group rows in pairs (0, 0, 1, 1, ...)
df$id = (seq_len(nrow(df)) - 1) %/% 2
# Set all variables within group to NA if one of them is
df = df %>%
group_by(id) %>%
mutate(var_changed = mean(var2change))
If you have an explicit ID variable in your data, you can replace the first part of this solution.
EDIT: doing this for multiple variables (based on change to the question):
df = data.frame(
id = c(1,1,2,2),
myfactor = as.factor(c(1,2,1,2)),
var2change = c(10,10,NA,20),
var3change = c(5,10,15,20),
var4change = c(NA,2,3,8)
)
for (col in 2:4) {
col = paste0("var", col, "change")
df = df %>%
group_by(id) %>%
mutate(new_col = mean(get(col)))
df[["new_col"]] = ifelse(is.na(df[["new_col"]]), NA, df[[col]])
df[col] = NULL
names(df)[names(df) == "new_col"] <- col
}
If speed is an issue, you could speed this up by moving the group_by outside the loop.

Assigning values to patterns of letters in character strings using R

I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to add an extra column to this data frame that holds the total value of the different patterns of 'M' in each row.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
We create a named vector ('nm1'), split the 'shotchart' to extract only 'M' and then use the named vector to change the values to get the sum
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
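To see why the named-vector lookup works, it can help to walk through a single string. The sketch below uses base R's regmatches()/gregexpr() to pull out the runs of "M"; the variable names `shot` and `runs` are illustrative.

```r
# Named vector: names are the patterns, values are their scores
nm1 <- setNames(1:3, strrep("M", 2:4))
nm1

shot <- "BMMMBMMBMMBM"

# Extract every run of consecutive "M" characters
runs <- regmatches(shot, gregexpr("M+", shot))[[1]]
runs

# Indexing by name returns the score of each run; a run with no entry
# in nm1 (here the single "M") yields NA
nm1[runs]

# Dropping the NA and summing gives the score for this row
sum(nm1[runs], na.rm = TRUE)
```

For this string the runs are "MMM", "MM", "MM" and "M", so the sum is 2 + 1 + 1 = 4, matching the first row of the expected output.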
You can also split on "B" and base the result on the count of "M" characters -1 as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
score = NA_integer_,
stringsAsFactors = FALSE)
# sapply() returns a plain numeric vector; lapply() would create a list column
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5

Turn long dataset of classes taken into wide dataset where variables are dummy code for each class

Say I have a dataset where rows are classes people took:
attendance <- data.frame(id = c(1, 1, 1, 2, 2),
class = c("Math", "English", "Math", "Reading", "Math"))
I.e.,
id class
1 1 "Math"
2 1 "English"
3 1 "Math"
4 2 "Reading"
5 2 "Math"
And I want to create a new dataset where rows are ids and the variables are class names, like this:
class.names <- names(table(attendance$class))
attedance2 <- matrix(nrow=length(table(attendance$id)),
ncol=length(class.names))
colnames(attedance2) <- class.names
attedance2 <- as.data.frame(attedance2)
attedance2$id <- unique(attendance$id)
I.e.,
English Math Reading id
1 NA NA NA 1
2 NA NA NA 2
I want to fill in the NAs with whether that particular id took that class or not. It can be Yes/No, 1/0, or counts of the classes
I.e.,
English Math Reading id
1 "Yes" "Yes" "No" 1
2 "No" "Yes" "Yes" 2
I'm familiar with dplyr, so it'd be easier for me if that was used in the solution but not necessary. Thank you for your help!
Using:
library(reshape2)
attendance$val <- 'yes'
dcast(unique(attendance), id ~ class, value.var = 'val', fill = 'no')
gives:
id English Math Reading
1 1 yes yes no
2 2 no yes yes
A similar approach with data.table:
library(data.table)
dcast(unique(setDT(attendance))[,val:='yes'], id ~ class, value.var = 'val', fill = 'no')
Or with dplyr/tidyr:
library(dplyr)
library(tidyr)
attendance %>%
distinct() %>%
mutate(var = 'yes') %>%
spread(class, var, fill = 'no')
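In more recent versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping follows, assuming a tidyr version in which `values_fill` accepts a single value; the result name `wide` is illustrative.

```r
library(dplyr)
library(tidyr)

attendance <- data.frame(id = c(1, 1, 1, 2, 2),
                         class = c("Math", "English", "Math", "Reading", "Math"))

wide <- attendance %>%
  distinct() %>%                      # drop repeated id/class pairs
  mutate(var = "yes") %>%             # marker for "took this class"
  pivot_wider(names_from = class,     # one column per class
              values_from = var,
              values_fill = "no")     # classes not taken become "no"

wide
```

Unlike spread(), pivot_wider() keeps the class columns in order of first appearance rather than sorting them alphabetically.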
Another, somewhat more convoluted option might be to reshape first and then replace the counts with yes and no (see here for an explanation about the default aggregate option of dcast):
att2 <- dcast(attendance, id ~ class, value.var = 'class')
which gives:
id English Math Reading
1 1 1 2 0
2 2 0 1 1
Now you can replace the count with:
# create index which counts are above zero
idx <- att2[,-1] > 0
# replace the non-zero values with 'yes'
att2[,-1][idx] <- 'yes'
# replace the zero values with 'no'
att2[,-1][!idx] <- 'no'
which finally gives:
> att2
id English Math Reading
1 1 yes yes no
2 2 no yes yes
We can do this with base R
attendance$val <- "yes"
d1 <- reshape(attendance, idvar = 'id', direction = 'wide', timevar = 'class')
d1[is.na(d1)] <- "no"
names(d1) <- sub("val\\.", '', names(d1))
d1
# id Math English Reading
#1 1 yes yes no
#4 2 yes no yes
Or with xtabs
xtabs(val ~id + class, transform(unique(attendance), val = 1))
# class
# id English Math Reading
# 1 1 1 0
# 2 0 1 1
NOTE: The binary can be easily converted to 'yes', 'no', but it is better to have either 1/0 or TRUE/FALSE
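As a sketch of that conversion in base R: table() gives the counts, a comparison with zero gives TRUE/FALSE, and ifelse() turns that into labels if they are really needed. The object names `counts`, `took`, `yes_no` and `result` are illustrative.

```r
attendance <- data.frame(id = c(1, 1, 1, 2, 2),
                         class = c("Math", "English", "Math", "Reading", "Math"))

# Counts of each class per id, as a plain integer matrix
counts <- unclass(table(attendance$id, attendance$class))

# TRUE/FALSE: did this id take the class at least once?
took <- counts > 0

# Optional: text labels instead of logicals
yes_no <- ifelse(took, "yes", "no")

# Back to a data frame with the id as a regular column
result <- data.frame(id = rownames(yes_no), yes_no, row.names = NULL)
result
```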

How to refer to a tibble column, using a variable name, in a pipe (R)

I am pretty new to R, so this question may be a bit naive.
I have a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins. This is done in a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = RT), because I want this to be flexible.
I've tried several ways of referring to a column name using this variable, but I cannot get it to work. Amongst others I have tried get(), eval(), [[]]
simplified example code
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
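In more recent dplyr versions the underscore verbs such as mutate_() are deprecated, and a column named by a string can be reached through the `.data` pronoun instead. A minimal sketch of the same binning under that assumption; set.seed() is added here only to make the sketch reproducible.

```r
library(dplyr)
library(ggplot2)  # provides cut_number()

set.seed(1)  # not in the original; only for reproducibility
Subject <- c(rep(1, 100), rep(2, 100))
RT <- runif(200, 300, 800)
data_st <- tibble(Subject, RT)

bin2use <- "RT"  # name of the column to bin, as a string
nbin <- 5

binned_data <- data_st %>%
  group_by(Subject) %>%
  # .data[[bin2use]] looks up the column whose name is stored in bin2use
  mutate(Bin = cut_number(.data[[bin2use]], nbin, labels = FALSE))

binned_data
```

Changing `bin2use` at the top of the script is then enough to bin a different column.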
