case_when using variable name to change data value - r

I have the following dataframe:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
I want to change specific values in each column using the following logic:
If the variable name contains "_lag1", the first element of the column becomes NA.
If the variable name contains "_lag2", the first and second elements of the column become NA.
Otherwise the column stays as it is.
The expected result should look like this:
df_new <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(NA,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(NA,NA,34,5,45,7)
)

Since the original unlagged variables are in your df, you could simply recompute the lagged values, e.g. with dplyr::lag, which by default fills the leading positions with NA:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
library(dplyr)
df %>% mutate(var1_lag1 = dplyr::lag(var1_lag0, n = 1), var2_lag2 = dplyr::lag(var2_lag0, n = 2))
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7

A base R solution might look like this:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
df_new <- df
df_new[1 , grep(pattern="_lag1", colnames(df))] <- NA
df_new[c(1,2) , grep(pattern="_lag2", colnames(df))] <- NA
df_new
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7
Created on 2021-01-06 by the reprex package (v0.3.0)

Here is a for loop that checks the column names of the df for the key words "_lag1" and "_lag2" and turns the corresponding values to NA.
for (i in seq_along(df)) {
  if (grepl("_lag1", colnames(df)[i])) {
    # first element only
    df[1, i] <- NA
  } else if (grepl("_lag2", colnames(df)[i])) {
    # first and second elements
    df[1:2, i] <- NA
  }
}

You can wrap a case_when call inside a helper function and use mutate_at with contains to select the proper columns. Define the helper first (it is called fix here, which masks utils::fix for the session):
fix <- function(x, lag){
  # translate the lag label into the number of leading values to blank out
  real_lag <- case_when(stringr::str_detect(lag, "lag1") ~ 1,
                        stringr::str_detect(lag, "lag2") ~ 2)
  x[1:real_lag] <- NA
  return(x)
}
Then apply it to the matching columns:
df %>%
  mutate_at(vars(contains("lag1")), function(x) fix(x, "lag1")) %>%
  mutate_at(vars(contains("lag2")), function(x) fix(x, "lag2"))
Which produces
var1_lag0 var1_lag1 var2_lag0 var2_lag2
1 1 NA 34 NA
2 2 1 5 NA
3 3 2 45 34
4 4 3 7 5
5 5 4 2 45
6 6 5 1 7
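As a possible alternative to the helper approach: in dplyr 1.0.0 and later, mutate_at is superseded by across(), and cur_column() returns the name of the column currently being processed. The sketch below reads the lag from the column name itself. Note that blank_lag is a hypothetical helper name, and the code assumes every lag column ends in "_lag" followed by digits.

```r
library(dplyr)

df <- data.frame(var1_lag0 = c(1, 2, 3, 4, 5, 6),
                 var1_lag1 = c(0, 1, 2, 3, 4, 5),
                 var2_lag0 = c(34, 5, 45, 7, 2, 1),
                 var2_lag2 = c(0, 0, 34, 5, 45, 7))

# Hypothetical helper: extract the lag from the column name and
# set that many leading values to NA (lag 0 leaves the column alone).
blank_lag <- function(x, name) {
  n <- as.integer(sub(".*_lag", "", name))
  if (n > 0) x[seq_len(n)] <- NA
  x
}

df_new <- df %>%
  mutate(across(matches("_lag[0-9]+$"), ~ blank_lag(.x, cur_column())))
```

Using seq_len(n) rather than 1:n keeps the helper safe for a lag of 0, where 1:0 would count backwards.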

Related

dplyr/purrr iterate over columns as well as rows

I'm trying to drop (set to NA) values in 1 column, based on values in another column; and to do this over a large set of columns. The idea is to then pass the data to a plotting function, to generate different plots for different cuts of the data.
Here's a reproducible example:
d <- data.frame("A_agree" = sample(1:7, 20, replace=T),
"B_agree" = sample(1:7, 20, replace=T),
"C_agree" = sample(1:7, 20, replace=T),
"A_change" = sample(1:5, 20, replace=T),
"B_change" = sample(1:5, 20, replace=T),
"C_change" = sample(1:5, 20, replace=T))
I've already found the following solution using base R, but it is slow, and since I'm trying to learn more dplyr, I was wondering how to achieve this with dplyr:
d.positive <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.positive)) {
d.positive[i, paste0(n, "_agree")] <- ifelse(d.positive[i, paste0(n, "_change")] > 3,
d.positive[i, paste0(n, "_agree")],
NA)
}
}
d.neutral <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.neutral)) {
d.neutral[i, paste0(n, "_agree")] <- ifelse(d.neutral[i, paste0(n, "_change")] == 3,
d.neutral[i, paste0(n, "_agree")],
NA)
}
}
d.negative <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.negative)) {
d.negative[i, paste0(n, "_agree")] <- ifelse(d.negative[i, paste0(n, "_change")] < 3,
d.negative[i, paste0(n, "_agree")],
NA)
}
}
I thought I would use gather(), and then check for each row whether the corresponding column (hence the !!dimension) is bigger than a certain value (3 in this case), but it doesn't seem to work?
d %>%
gather(dimension,
value,
paste0(c("A","B","C"), "_agree")
) %>%
case_when(!!dimension > 3 ~ value=NA)
Alternatively, I thought I'd use map2_dfr from purrr, but I don't think it iterates over cells, just takes the entire column, hence this doesn't work:
map2_dfr(.x = d %>%
select( paste0(c("A","B","C"), "_agree") ),
.y = d %>%
select( paste0(c("A","B","C"), "_change") ),
~ if_else(.y > 3, x, NA)} )
Any pointers would be really helpful, to keep learning about the wonderful world of dplyr !
I get that you want to learn about purrr, but base R is just easier here:
d.positive <- d
check <- d.positive[4:6] <= 3 # TRUE where change is not > 3, i.e. where agree must become NA
d.positive[,1:3][check] <- NA
> d.positive
A_agree B_agree C_agree A_change B_change C_change
1 1 NA NA 4 3 2
2 2 2 NA 4 5 2
3 4 NA NA 4 3 1
4 1 NA NA 4 1 2
5 NA 1 NA 2 4 1
6 NA 7 NA 3 5 1
7 NA 6 NA 1 5 1
8 NA 6 4 2 5 5
9 4 NA NA 4 1 2
10 1 NA NA 5 1 2
11 NA NA NA 3 1 2
12 NA NA NA 1 3 3
13 NA NA NA 1 1 1
14 NA NA NA 3 2 3
15 1 NA NA 5 3 3
16 2 NA NA 4 3 2
17 NA NA 6 1 1 4
18 NA NA NA 1 1 2
19 NA NA NA 2 3 1
20 NA NA NA 1 3 1
I would suggest using the tidyr package in combination with dplyr. It provides the newer functions pivot_longer and pivot_wider, which replace the older gather and spread.
Using a combination of both, the solution could be as follows:
d.neutral1 =
d %>%
mutate(row = row_number() ) %>%
pivot_longer(-row, names_sep = "_", names_to = c("name","type") ) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(result = if_else(change == 3, agree, NA_integer_))
And if you want a shape similar to the original:
d.neutral1 %>%
select(-agree, -change) %>%
pivot_wider(names_from = name, values_from = result)
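The map2 idea from the question can also be made to work, because if_else is vectorised: each call receives one whole "agree" column and its matching "change" column, so no cell-by-cell iteration is needed. A minimal sketch, assuming dplyr and purrr are installed; the seed and the agree_cols / change_cols names are only illustrative.

```r
library(dplyr)
library(purrr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
d <- data.frame(A_agree  = sample(1:7, 20, replace = TRUE),
                B_agree  = sample(1:7, 20, replace = TRUE),
                C_agree  = sample(1:7, 20, replace = TRUE),
                A_change = sample(1:5, 20, replace = TRUE),
                B_change = sample(1:5, 20, replace = TRUE),
                C_change = sample(1:5, 20, replace = TRUE))

agree_cols  <- paste0(c("A", "B", "C"), "_agree")
change_cols <- paste0(c("A", "B", "C"), "_change")

# Pair each agree column with its change column and keep agree only
# where change > 3; map2 returns a list named after agree_cols.
d.positive <- d
d.positive[agree_cols] <- map2(d[agree_cols], d[change_cols],
                               ~ if_else(.y > 3, .x, NA_integer_))
```

Swapping the condition for `.y == 3` or `.y < 3` gives the neutral and negative cuts.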

Issue with local variables in r custom function

I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
For every NA, I would like to extract the previous non-NA value in the same column (or the next one, if the NA is in the first row) and store it as a local variable in a custom function, because I have to perform other operations on each row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want:
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which takes the "last observation carried forward". Since the top row is a special case, we do it twice, the second time in reverse. This gives us the nearest non-NA value in the column (above or below, depending on where the NA sits).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
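To see what the two passes do, here is a small sketch on a single vector (the values in v are made up for illustration). The forward pass leaves leading NAs alone, and the reverse pass then fills them from the first observed value.

```r
library(zoo)

v <- c(NA, NA, 3, NA, 5)

# Forward pass: carry the last observation downwards; leading NAs stay
down <- na.locf(v, na.rm = FALSE)

# Reverse pass: fill what is left from the next observation
both <- na.locf(down, fromLast = TRUE, na.rm = FALSE)

down
# [1] NA NA  3  3  5
both
# [1] 3 3 3 3 5
```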

Merge data.frame columns on set number of columns removing na's unless not enough values in row

I'd like to remove the NA values from my columns and merge all columns into four columns, keeping NAs only where a row has fewer than four values.
Say I have data like this,
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
#> a b c d e f g
#> 1 1 3 NA 4 NA 1 NA
#> 2 4 NA 2 2 5 NA NA
#> 3 NA 3 NA NA 3 NA NA
#> 4 3 NA NA NA NA 4 4
My desired outcome would be,
df.desired <- data.frame('a' = c(1,4,3,3),
'b' = c(3,2,3,4),
'c' = c(4,2,NA,4),
'd' = c(1,5,NA,NA))
df.desired
#> a b c d
#> 1 1 3 4 1
#> 2 4 2 2 5
#> 3 3 3 NA NA
#> 4 3 4 4 NA
You can get there by combining two existing answers on SO, in two steps:
1. Shift all the numbers in each row to the left, pushing the NAs to the end
2. Remove the columns that contain only NAs
Result:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df.new<-do.call(rbind,lapply(1:nrow(df),function(x) t(matrix(df[x,order(is.na(df[x,]))])) ))
colnames(df.new)<-colnames(df)
df.new
df.new[,colSums(is.na(df.new))<nrow(df.new)]
Output:
> df.new[,colSums(is.na(df.new))<nrow(df.new)]
a b c d
[1,] 1 3 4 1
[2,] 4 2 2 5
[3,] 3 3 NA NA
[4,] 3 4 4 NA
There are probably more efficient ways, but here is my attempt:
x00=sapply(1:nrow(df),function(x) df[x,][!is.na( df[x,])])
x01=lapply(x00,function(x) x=c(x,rep(NA,7-length(x)-1)))
x02=as.data.frame(do.call("rbind",x01))
x02 <- x02[,colSums(is.na(x02))<nrow(x02)]
I have the following solution:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df
x <-list()
for(i in 1:nrow(df)){
x[[i]] <- df[i,]
x[[i]] <- x[[i]][!is.na(x[[i]])]
# x[[i]] <- as.data.frame(x[[i]], stringsAsFactors = FALSE)
# pad with NA up to four values per row, matching the desired output
x[[i]] <- c(x[[i]], rep(NA, 4 - length(x[[i]])))
}
result <- do.call(rbind, x)
result
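For comparison, a compact base R sketch of the same two steps (shift the non-NA values left, then drop the all-NA columns). It relies on `length<-` padding a vector with NA, so no target width has to be hard-coded; the names shifted, keep and result are only illustrative.

```r
df <- data.frame(a = c(1, 4, NA, 3),
                 b = c(3, NA, 3, NA),
                 c = c(NA, 2, NA, NA),
                 d = c(4, 2, NA, NA),
                 e = c(NA, 5, 3, NA),
                 f = c(1, NA, NA, 4),
                 g = c(NA, NA, NA, 4))

# Step 1: move the non-NA values of each row to the front;
# `length<-` pads the shortened vector back to ncol(df) with NA.
shifted <- t(apply(df, 1, function(r) {
  v <- unname(r[!is.na(r)])
  length(v) <- ncol(df)
  v
}))

# Step 2: drop the columns that are entirely NA and reuse the first names
keep <- colSums(!is.na(shifted)) > 0
result <- as.data.frame(shifted[, keep, drop = FALSE])
names(result) <- names(df)[seq_len(ncol(result))]
```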

Why is using list() critical for .dots = setNames() uses in dplyr?

I am calling mutate using dynamic variable names. An example that mostly works is:
df <- data.frame(a = 1:5, b = 1:5)
func <- function(a,b){
return(a+b)
}
var1 = 'a'
var2 = 'b'
expr <- interp(~func(x, y), x = as.name(var1), y = as.name(var2))
new_name <- "dynamically_created_name"
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
Which produces
temp
a b func(a, b)
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
This is mostly fine, except that setNames ignored the nm argument. This is solved by wrapping my expression in list():
temp <- df %>% mutate_(.dots = setNames(list(expr), nm = new_name))
temp
a b dynamically_created_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
My question is: why is setNames ignoring its nm argument in the first place, and how does list() solve this problem?
As noted in the other answer, the .dots argument is assumed to be a list, and setNames is a convenient way to rename elements in a list.
What is the .dots argument doing? Let's first think about the actual dots ... argument. It is a series of expressions to be evaluated. Below the dots ... are the two named expressions c = ~ a * scale1 and d = ~ a * scale2.
scale1 <- -1
scale2 <- -2
df %>%
mutate_(c = ~ a * scale1, d = ~ a * scale2)
#> a b c d
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
We could just bundle those expressions together beforehand in a list. That's where .dots comes in. That parameter lets us tell mutate_ to evaluate the expressions in the list.
bundled <- list(
c2 = ~ a * scale1,
d2 = ~ a * scale2
)
df %>%
mutate_(.dots = bundled)
#> a b c2 d2
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
If we want to programmatically update the names of the expressions in the list, then setNames is a convenient way to do that. If we want to programmatically mix and match constants and variable names when making expressions, then the lazyeval package provides convenient ways to do that. Below I do both to create a list of expressions, name them, and evaluate them with mutate_
# Imagine some dropdown boxes in a Shiny app, and this is what user requested
selected_func1 <- "min"
selected_func2 <- "max"
selected_var1 <- "a"
selected_var2 <- "b"
# Assemble expressions from those choices
bundled2 <- list(
interp(~fun(x), fun = as.name(selected_func1), x = as.name(selected_var1)),
interp(~fun(x), fun = as.name(selected_func2), x = as.name(selected_var2))
)
bundled2
#> [[1]]
#> ~min(a)
#>
#> [[2]]
#> ~max(b)
# Create variable names
exp_name1 <- paste0(selected_func1, "_", selected_var1)
exp_name2 <- paste0(selected_func2, "_", selected_var2)
bundled2 <- setNames(bundled2, c(exp_name1, exp_name2))
bundled2
#> $min_a
#> ~min(a)
#>
#> $max_b
#> ~max(b)
# Evaluate the expressions
df %>%
mutate_(.dots = bundled2)
#> a b min_a max_b
#> 1 1 1 1 5
#> 2 2 2 1 5
#> 3 3 3 1 5
#> 4 4 4 1 5
#> 5 5 5 1 5
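mutate_ and the lazyeval helpers have been deprecated since dplyr 0.7, so here is a hedged sketch of the same "bundle, name, evaluate" idea with rlang instead: call2() builds each call from strings, and !!! splices the named list into mutate(). The name bundled3 is only illustrative.

```r
library(dplyr)
library(rlang)

df <- data.frame(a = 1:5, b = 1:5)

selected_func1 <- "min"
selected_func2 <- "max"
selected_var1 <- "a"
selected_var2 <- "b"

# Assemble unevaluated calls such as min(a) and max(b)
bundled3 <- list(
  call2(selected_func1, sym(selected_var1)),
  call2(selected_func2, sym(selected_var2))
)

# The list names become the new column names
names(bundled3) <- c(paste0(selected_func1, "_", selected_var1),
                     paste0(selected_func2, "_", selected_var2))

# Splice the named expressions into mutate()
out <- df %>% mutate(!!!bundled3)
```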
From vignette("nse"):
If you also want output variables to vary, you need to pass a list of quoted objects to the .dots argument
So perhaps the reason why
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
doesn't do what you want is that, while you successfully set the names attribute here, expr is still a formula, not a list:
foo <- setNames(expr, nm = new_name)
names(foo) #"dynamically_created_name" ""
class(foo) #"formula"
So if you make it a list, it works as expected:
expr <- interp(~func(x, y), x = as.name(var1),
y = as.name(var2))
df %>% mutate_(.dots = list(new_name = expr))
a b new_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
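Note that list(new_name = expr) names the column literally "new_name". In current dplyr the dynamic name can be supplied without mutate_ at all: a sketch using the := operator with !! for the name and the .data pronoun for the input columns, reusing the variables from the question.

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 1:5)
func <- function(a, b) {
  a + b
}
var1 <- "a"
var2 <- "b"
new_name <- "dynamically_created_name"

# !! unquotes the string on the left of :=, so it becomes the column name;
# .data[[...]] looks up the input columns by their string names.
temp <- df %>%
  mutate(!!new_name := func(.data[[var1]], .data[[var2]]))
```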

How to ignore case when using subset in R

How to ignore case when using subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select the columns named x, y, z, w, t regardless of case. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT: If your column names are longer than one letter, the single character class above no longer works. Assuming you can put your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' package is a very neat wrapper for all of this functionality, and it has an 'ignore.case' option.
Also, you may want to consider using match rather than subset.
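The code for this answer is not shown, so the following is only a sketch of both suggestions. It assumes a current stringr release, where case-insensitive matching is requested with regex(ignore_case = TRUE); the data frame and the target vector are made up for illustration.

```r
library(stringr)

dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5, y = 1:5)

# stringr: case-insensitive match on the column names
keep <- str_detect(names(dat), regex("^(x|y)$", ignore_case = TRUE))
by_regex <- dat[, keep, drop = FALSE]

# match(): select by lower-cased name, keeping the order of the target vector
target <- c("y", "xy")
by_match <- dat[, match(target, tolower(names(dat))), drop = FALSE]

names(by_regex)
# [1] "x" "y"
names(by_match)
# [1] "y"  "XY"
```

match() returns NA for a target that is missing, so it is worth checking for NAs before indexing with real data.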
