Recoding multiple variables in R - r

I would like to recode multiple variables at once in R. The variables are within a larger dataframe. Here is some example data:
z <- data.frame (A = c(1,2,300,444,555),
B = c(555,444,300,2,1),
C = c(1,2,300,444,555),
D = c(1,2,300,444,555))
What I would like to do is recode all values that equal 300 as 3, 444 as 4, and 555 as 5.
I thought I could possibly do this in a list. Here is what I tried:
example_list = list(c("A", "B", "C", "D"))
example_list <- apply(z[,example_list], 1, function(x) ifelse(any(x==555, na.rm=F), 0.5,
ifelse(any(x==444), 0.25),
ifelse(any(x==300), 3, example_list)))
I get this error:
Error during wrapup: invalid subscript type 'list'
Then tried using "lapply" and I got this error:
Error during wrapup: '1' is not a function, character or symbol
Even then I'm not sure this is the best way to go about doing this... I would just like to avoid doing this line by line for multiple variables. Any suggestions would be amazing, as I'm new to R and don't entirely understand what I'm doing wrong.
I did find a similar question on SO: Question, but I'm not sure how to apply that to my specific problem.

Using case_when:
library(dplyr)
z %>% mutate_all(
function(x) case_when(
x == 300 ~ 3,
x == 444 ~ 4,
x == 555 ~ 5,
TRUE ~ x
)
)
A B C D
1 1 5 1 1
2 2 4 2 2
3 3 3 3 3
4 4 2 4 4
5 5 1 5 5

Here's a base R attempt which should be neatly extendable and pretty fast:
# set find and replace vectors
f <- c(300,444,555)
r <- c(3, 4, 5)
# replace!
m <- lapply(z, function(x) r[match(x,f)] )
z[] <- Map(function(z,m) replace(m,is.na(m),z[is.na(m)]), z, m)
# A B C D
#1 1 5 1 1
#2 2 4 2 2
#3 3 3 3 3
#4 4 2 4 4
#5 5 1 5 5
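The match-based lookup above can be wrapped into a small reusable function and applied to only some of the columns. This is a minimal sketch, not part of the original answer: the helper name recode_values and the column subset cols are assumptions made for illustration.

```r
# Hypothetical helper (not from the answers above): values of x found in
# `from` are replaced by the value at the same position in `to`.
recode_values <- function(x, from, to) {
  idx <- match(x, from)      # NA where x has no entry in `from`
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]     # everything else is left untouched
  x
}

z <- data.frame(A = c(1, 2, 300, 444, 555),
                B = c(555, 444, 300, 2, 1),
                C = c(1, 2, 300, 444, 555),
                D = c(1, 2, 300, 444, 555))

cols <- c("A", "B")          # assumed subset of columns to recode
z[cols] <- lapply(z[cols], recode_values,
                  from = c(300, 444, 555), to = c(3, 4, 5))
```

Columns C and D keep their original values here because they are not listed in cols.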

This seems a bit clunky but it works:
mutate_cols <- c('A', 'B')
z[, mutate_cols] <- as.data.frame(lapply(z[, mutate_cols], function(x) ifelse(x == 300, 3,
ifelse(x == 444, 4,
ifelse(x== 555, 5, x)))))

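This should work with plyr::mapvalues, applied column by column so that the result stays a data frame:
library(plyr)
new.z <- z
# Assigning into new.z[] keeps the data frame structure; apply(z, 1, ...) would return a transposed matrix
new.z[] <- lapply(z, mapvalues, from = c(300, 444, 555), to = c(3, 4, 5), warn_missing = FALSE)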

z = data.frame (A = c(1,2,300,444,555),
B = c(555,444,300,2,1),
C = c(1,2,300,444,555),
D = c(1,2,300,444,555))
library(expss)
to_recode = c("A", "B", "C", "D")
recode(z[, to_recode]) = c(300 ~ 3, 444 ~ 4, 555 ~ 5)

If you already have factor variables and also want factor variables as the result, you can use the following code:
library(tidyverse)
z <- data.frame (A = factor(c(1,2,300,444,555)),
B = factor(c(555,444,300,2,1)),
C = factor(c(1,2,300,444,555)),
D = factor(c(1,2,300,444,555)))
new.z <- z %>%
mutate_all(function(x) recode_factor(x, "300" = "3", "444" = "4", "555" = "5"))

Related

Remove all records that have duplicates based on more than one variable

I have data like this
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"), var2 = c(1, 2, 3, 4, 5, 5, 6 ))
# var1 var2
# 1 A 1
# 2 A 2
# 3 B 3
# 4 B 4
# 5 C 5
# 6 D 5
# 7 E 6
A is mapped to 1, 2
B is mapped to 3, 4
C and D are both mapped to 5 (and vice versa: 5 is mapped to C and D)
E is uniquely mapped to 6 and 6 is uniquely mapped to E
I would like to filter the dataset so that only
var1 var2
7 E 6
is returned. Base or tidyverse solutions are welcome.
I have tried
unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)
but without the wanted result.
Using igraph::components.
Represent data as graph and get connected components:
library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
Grab components where cluster size (csize) is 2. Output vertices as a two-column character matrix:
matrix(names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)]),
ncol = 2, dimnames = list(NULL, names(df))) # wrap in as.data.frame if desired
# var1 var2
# [1,] "E" "6"
Alternatively, use names of relevant vertices to index original data frame:
v = names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)])
df[df$var1 %in% v[1:(length(v)/2)], ]
# var1 var2
# 7 E 6
Visualize the connections:
plot(g)
Using a custom function to determine if the mapping is unique you could achieve your desired result like so:
df <- data.frame(
var1 = c("A", "A", "B", "B", "C", "D", "E"),
var2 = c(1, 2, 3, 4, 5, 5, 6)
)
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(x) length(unique(x)) == 1)
df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#> var1 var2
#> 7 E 6
Another igraph option
decompose(graph_from_data_frame(df)) %>%
subset(sapply(., vcount) == 2) %>%
sapply(function(g) names(V(g)))
which gives
[,1]
[1,] "E"
[2,] "6"
A base R solution:
df[!(duplicated(df$var1) | duplicated(df$var1, fromLast = TRUE) |
duplicated(df$var2) | duplicated(df$var2, fromLast = TRUE)), ]
var1 var2
7 E 6
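The base R approach can be generalised to any number of columns by testing each column separately and combining the results. This is a minimal sketch under the assumption that a row should be kept only when its value is unique within every listed column; the helper name occurs_once is hypothetical.

```r
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"),
                 var2 = c(1, 2, 3, 4, 5, 5, 6))

# Hypothetical helper: TRUE where a value occurs exactly once in x
occurs_once <- function(x) {
  !(duplicated(x) | duplicated(x, fromLast = TRUE))
}

check_cols <- c("var1", "var2")   # columns that must each be unique
keep <- Reduce(`&`, lapply(df[check_cols], occurs_once))
res <- df[keep, ]
```

Adding another column name to check_cols extends the check without changing the rest of the code.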

Finding the index for 2nd Min value in a data frame

I have a data frame df1. I would like to find the index for the second smallest value from this dataframe. With the function which.min I was able to get the row index for the smallest value but is there a way to get the index for the second smallest value?
> df1
structure(list(x = c(1, 2, 3, 4, 3), y = c(2, 3, 2, 4, 6), z = c(1,
4, 2, 3, 11)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
>df1
x y z
1 2 1
2 3 4
3 2 2
4 4 3
3 6 11
This is my desired output. For example, in x, the value 2 in row 2 is the second smallest value. Thank you.
>df2
x 2
y 2
z 3
Updated answer
You can write a function like the following, using factor:
which_min <- function(x, pos) {
sapply(x, function(y) {
which(as.numeric(factor(y, sort(unique(y)))) == pos)[1]
})
}
which_min(df1, 2)
# x y z
# 2 2 3
Testing it out with other data:
df2 <- df1
df2$new <- c(1, 1, 1, 2, 3)
which_min(df2, 2)
# x y z new
# 2 2 3 4
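A variant without factor is to look up the k-th smallest distinct value with match, which returns the position of its first occurrence in the original vector. This is a sketch only; the function name which_kth_min and the choice to return NA when there are fewer than k distinct values are assumptions.

```r
df1 <- data.frame(x = c(1, 2, 3, 4, 3),
                  y = c(2, 3, 2, 4, 6),
                  z = c(1, 4, 2, 3, 11))

# Hypothetical helper: index of the first occurrence of the
# k-th smallest distinct value of x
which_kth_min <- function(x, k = 2) {
  vals <- sort(unique(x))
  if (length(vals) < k) return(NA_integer_)   # assumed fallback
  match(vals[k], x)
}

res <- sapply(df1, which_kth_min, k = 2)
```

Because the index is taken from x itself rather than from unique(x), repeated values earlier in the column do not shift the result.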
Original answer
Instead of sort, you can use order:
sapply(df1, function(x) order(unique(x))[2])
# x y z
# 2 2 3
Or you can make use of the index.return argument in sort:
sapply(df1, function(x) sort(unique(x), index.return = TRUE)$ix[2])
# x y z
# 2 2 3
You can do:
sapply(df1, function(x) which.max(x == sort(unique(x))[2]))
#x y z
#2 2 3
Or with dplyr:
library(dplyr)
df1 %>%
summarise(across(everything(), ~which.max(. == sort(unique(.))[2])))
# x y z
# <int> <int> <int>
#1 2 2 3
Another base R version using rank
> sapply(df1, function(x) which(rank(unique(x)) == 2))
x y z
2 2 3
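You could try something like the following, which returns the second smallest value across the whole data frame (the value itself, not its index):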
sort(unique(unlist(df1)))[2]

Creating a function to remove columns with different names from a list of dataframes

I have many dataframes that contain the same data, except for a few column differences between them that I want to remove. Here's something similar to what I have:
df1 <- data.frame(X = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
df2 <- data.frame(X..x = c(1, 2, 3, 4, 5),
X..y = c(1, 2, 3, 4, 5),
var1 = c('f', 'g', 'h', 'i', 'j'),
var2 = c(0, 1, 0, 1, 1))
df_list <- list(df1=df1,df2=df2)
I am trying to create a function to remove the X, X..x, and X..y columns from each of the dataframes. Here's what I've tried with the given error:
remove_col <- function(df){
df = subset(df, select = -c(X, X..x, X..y))
return(df)
}
df_list <- lapply(df_list, remove_col)
# Error in eval(substitute(select), nl, parent.frame()) :
# object 'X..x' not found
I'm running into problems because not all dataframes contain X, and similarly not all dataframes contain X..x and X..y. How can I update the function so that it can be applied to all dataframes in the list and successfully remove its given columns?
Using R version 3.5.1, Mac OS X 10.13.6
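You can try:
# Function: drop the columns listed in `name`, if they are present
remove_col <- function(df, name) {
  # Logical indexing is safe when none of the names match;
  # df[, -which(...)] would return zero columns in that case
  keep <- !(names(df) %in% name)
  df[, keep, drop = FALSE]
}
df_list <- lapply(df_list, remove_col, name = c('X', 'X..x', 'X..y'))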
$df1
var1 var2
1 a 1
2 b 1
3 c 0
4 d 0
5 e 1
$df2
var1 var2
1 f 0
2 g 1
3 h 0
4 i 1
5 j 1
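Dropping columns by name can also be sketched with setdiff, which simply ignores names that are not present in a given data frame. The helper name drop_if_present is an assumption for illustration.

```r
df1 <- data.frame(X = c(1, 2, 3, 4, 5),
                  var1 = c('a', 'b', 'c', 'd', 'e'),
                  var2 = c(1, 1, 0, 0, 1))
df2 <- data.frame(X..x = c(1, 2, 3, 4, 5),
                  X..y = c(1, 2, 3, 4, 5),
                  var1 = c('f', 'g', 'h', 'i', 'j'),
                  var2 = c(0, 1, 0, 1, 1))
df_list <- list(df1 = df1, df2 = df2)

# Hypothetical helper: keep every column whose name is not in `drop`
drop_if_present <- function(df, drop) {
  df[setdiff(names(df), drop)]
}

cleaned <- lapply(df_list, drop_if_present, drop = c("X", "X..x", "X..y"))
```

Single-bracket list-style indexing (df[...]) always returns a data frame, so no drop = FALSE is needed here.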
If you want to keep only the columns with "var":
lapply(df_list, function(x) x[grepl("var",colnames(x))])
Or, if you really just want those removed explicitly:
lapply(df_list, function(x) x[!grepl("^X$|^X\\.\\.x$|^X\\.\\.y$",colnames(x))])
$df1
var1 var2
1 a 1
2 b 1
3 c 0
4 d 0
5 e 1
$df2
var1 var2
1 f 0
2 g 1
3 h 0
4 i 1
5 j 1
Instead of checking each list element for the same column names, this can be automated by extracting the column names common to all data frames in the list. Loop over the list, get the column names, find the intersecting elements with Reduce, and use them to subset the columns:
nm1 <- Reduce(intersect, lapply(df_list, names))
lapply(df_list, `[`, nm1)
#$df1
# var1 var2
#1 a 1
#2 b 1
#3 c 0
#4 d 0
#5 e 1
#$df2
# var1 var2
#1 f 0
#2 g 1
#3 h 0
#4 i 1
#5 j 1
Or with tidyverse
library(dplyr)
library(purrr)
map(df_list, names) %>%
reduce(intersect) %>%
map(df_list, select, .)

Using a column as a column index to extract value from a data frame in R

I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index number, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
df[cbind(
seq_len(nrow(df)),
match(df$choice, colnames(df))
)]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 4 1 1
# 2 2 5 2 2
# 3 3 6 1 6
# 4 8 9 3 NA
My newValue column returns all NAs.
# x y choice newValue
# 1 1 4 1 NA
# 2 2 5 2 NA
# 3 3 6 1 NA
# 4 8 9 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds column numbers, match is not needed here. However, choice is itself a column of the data frame and should not be indexed into, so values outside the range of the other columns must be set to NA before subsetting:
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
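The same idea can be written so that the lookup is restricted explicitly to the value columns, which avoids relying on the position of choice in the data frame. This is a sketch; the vector value_cols is an assumption naming the columns that choice is meant to index.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

value_cols <- c("x", "y")   # assumed: columns that `choice` refers to

# Two-column index matrix: one (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), df$choice)

# Column numbers outside the value columns become NA, which yields NA
idx[idx[, 2] > length(value_cols), 2] <- NA

df$newValue <- as.matrix(df[value_cols])[idx]
```

Indexing a matrix with a two-column matrix picks one element per (row, column) pair, and a pair containing NA produces NA in the result.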

How to simplify data checking that iterates by rows and columns (two nested loops) with if_else

Q1 <- c(1, 1, 2, 2)
Q2_1 <- c(3, 3, 3, 3)
Q2_2 <- c(3, 4, 2, 1)
data <- data.frame(cbind(Q1, Q2_1, Q2_2))
I need to check that the values in the Q1 variables do not appear in the Q2 variables (neither Q2_1 nor Q2_2), and I need the result in a single variable.
For now I am using two nested for loops (over rows and columns) with the if_else function from dplyr, but it's quite a lot of code and I have to do similar checks multiple times. Is there any way to simplify the code?
This is what I'm doing for now:
Q2_index <- grep("Q2_", names(data))
data$Q2_error <- 0
for(i in 1:dim(data)[1]){
for(j in 1:length(Q2_index)){
data$Q2_error[i] <- if_else(data$Q2_error != 1 & data$Q1 == data[, Q2_index[j]], 1, 0, 0)[i]
}
}
Second example:
ID <- 1:3
Q1_1 <- 1:3
Q1_2 <- c(3, NA, 1)
Q1_3 <- c(4, 2, 1)
Q2_1 <- c(5, 2, 1)
Q2_2 <- c(1, NA, NA)
Q2_3 <- c(NA, NA, NA)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)
Q1_index <- grep("Q1_", names(data))
Q2_index <- grep("Q2_", names(data))
data$Q1Q2error <- 0
for(i in 1:dim(data)[1]){
for(j in 1:length(Q1_index)){
data$Q1Q2error[i] <- if_else(data[, Q1_index[j]] >= 1 & data[, Q2_index[j]] != data[, Q1_index[j]] & is.na(data[, Q2_index[j]]), 0, 1, 1)[i]
}
}
Evaluated conditions vary from check to check. As a result I need a single variable that indicates whether there is an error, so I can easily match the error to an ID (either 1 and 0, or TRUE and FALSE). Please note that this is a simplified example and I have to deal with around 10-20 Q1 or Q2 variables at the same time.
Why not create a generic function for the desired operation, so you can reuse it on new data frames:
aggreg_it <- function(data){
cols_Q1 <- names(data)[grep("Q1", names(data))]
cols_Q2 <- names(data)[grep("Q2", names(data))]
mapply(function(i,j) {ifelse(length(intersect(i, j))>0,1,0)},
strsplit(apply(data[, cols_Q1], 1, paste, collapse=","),","),
strsplit(apply(data[, cols_Q2] , 1, paste ,collapse=","),","))
}
data$result <- aggreg_it(data)
# ID Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 result
#1 1 10 3 4 5 1 NA 0
#2 2 11 NA 2 2 NA NA 1
#3 3 12 1 1 1 NA NA 1
#4 4 13 5 6 7 8 9 0
As you did not state any assumptions about NAs, note that NAs are treated as valid values in this example. Hope this helps you move forward. Similarly, you could make a function from your own code.
Data used:
ID <- 1:4
Q1_1 <- c(10,11,12,13)
Q1_2 <- c(3, NA, 1, 5)
Q1_3 <- c(4, 2, 1, 6)
Q2_1 <- c(5, 2, 1, 7)
Q2_2 <- c(1, NA, NA, 8)
Q2_3 <- c(NA, NA, NA, 9)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)
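If NAs should instead be treated as missing rather than as matching values, the row-wise overlap can be sketched without pasting and splitting strings. This is an illustrative variant, not the answer's method: dropping NAs before intersecting is an assumption about the intended behaviour, and the names q1, q2 and overlap are hypothetical.

```r
ID <- 1:4
Q1_1 <- c(10, 11, 12, 13)
Q1_2 <- c(3, NA, 1, 5)
Q1_3 <- c(4, 2, 1, 6)
Q2_1 <- c(5, 2, 1, 7)
Q2_2 <- c(1, NA, NA, 8)
Q2_3 <- c(NA, NA, NA, 9)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)

q1 <- as.matrix(data[grep("^Q1_", names(data))])
q2 <- as.matrix(data[grep("^Q2_", names(data))])

# For each row: do the non-missing Q1 and Q2 values share any element?
overlap <- vapply(seq_len(nrow(data)), function(i) {
  a <- q1[i, ]
  b <- q2[i, ]
  length(intersect(a[!is.na(a)], b[!is.na(b)])) > 0
}, logical(1))

data$result <- as.integer(overlap)
```

With this data the result happens to agree with the output shown above, but a row whose only shared entry is NA would be flagged 0 here.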
No need for a loop, these operations are vectorized in R.
(I changed your input data a bit to show differentiated results)
Base R:
data$Q1_in_Q2 <- data$Q1 %in% data$Q2_1 | data$Q1 %in% data$Q2_2
data
#> Q1 Q2_1 Q2_2 Q1_in_Q2
#> 1 1 1 5 TRUE
#> 2 3 3 4 TRUE
#> 3 2 3 2 TRUE
#> 4 6 3 1 FALSE
With dplyr:
library(dplyr)
data <- data %>%
mutate(Q1_in_Q2_1 = Q1 %in% Q2_1,
Q1_in_Q2_2 = Q1 %in% Q2_2,
Q1_in_Q2 = Q1_in_Q2_1 | Q1_in_Q2_2) %>%
select(Q1, Q2_1, Q2_2, Q1_in_Q2_1, Q1_in_Q2_2, Q1_in_Q2)
data
#> Q1 Q2_1 Q2_2 Q1_in_Q2_1 Q1_in_Q2_2 Q1_in_Q2
#> 1 1 1 5 TRUE TRUE TRUE
#> 2 3 3 4 TRUE FALSE TRUE
#> 3 2 3 2 FALSE TRUE TRUE
#> 4 6 3 1 FALSE FALSE FALSE
Data:
Q1 <- c(1, 3, 2, 6)
Q2_1 <- c(1, 3, 3, 3)
Q2_2 <- c(5, 4, 2, 1)
data <- data.frame(cbind(Q1, Q2_1, Q2_2))
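Note that %in% tests each Q1 value against the entire Q2 column, not only against the value in the same row. If a strictly row-wise comparison is what is needed, one possible sketch is to compare with == across all Q2_ columns at once. The names q2_cols, hits and Q2_error below are chosen for illustration.

```r
data <- data.frame(Q1   = c(1, 3, 2, 6),
                   Q2_1 = c(1, 3, 3, 3),
                   Q2_2 = c(5, 4, 2, 1))

q2_cols <- grep("^Q2_", names(data), value = TRUE)

# Q1 is recycled down each column, so every Q2 value is compared
# with the Q1 value in the same row
hits <- as.matrix(data[q2_cols]) == data$Q1

# TRUE when Q1 equals at least one Q2 value in that row; NAs count as no match
data$Q2_error <- unname(rowSums(hits, na.rm = TRUE) > 0)
```

This scales to 10-20 Q2 columns without extra code, since the columns are selected by name pattern.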
