dplyr::mutate_if() with multiple conditions including column class not working - r

I'm really confused about why this is not working:
df <- data.frame(a = c("1", "2", "3"),
b = c(2, 3, 4),
c = c(4, 3, 2),
d = c("1", "5", "9"))
varnames = c("a", "c")
df %>%
mutate_if((is.character(.) & names(.) %in% varnames),
funs(mean(as.numeric(.))))
a b c d
1 1 2 4 1
2 2 3 3 5
3 3 4 2 9
Expected output would be
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
It works with a single condition, but I have only gotten the class condition to work using this formulation (which I don't know how to combine with the column-name condition):
df %>%
mutate_if(function(col) is.character(col),
funs(mean(as.numeric(.))))
a b c d
1 2 2 4 5
2 2 3 3 5
3 2 4 2 5
However is.factor seems to work fine with the column names?
df %>%
mutate_if(!is.factor(.) & (names(.) %in% varnames),
funs(mean(as.numeric(.))))
a b c d
1 2 2 3 1
2 2 3 3 5
3 2 4 3 9

Note that mutate_if has been superseded by across(), so the following is perhaps what you want:
df %>%
mutate(across(where(is.character) & matches(varnames), ~mean(as.numeric(.))))
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
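One caveat with the across() call above: matches() treats each entry of varnames as a regular expression, so "a" would also select any other character column whose name merely contains an "a". Below is a minimal sketch of an exact-name variant, assuming dplyr 1.0.0 or later (where across(), where() and all_of() are available). The column extra_a is hypothetical and is added only to show the difference.

```r
library(dplyr)

df <- data.frame(a = c("1", "2", "3"),
                 b = c(2, 3, 4),
                 c = c(4, 3, 2),
                 d = c("1", "5", "9"),
                 extra_a = c("7", "8", "9"),  # hypothetical column, not in the question
                 stringsAsFactors = FALSE)
varnames <- c("a", "c")

# where() selects by column type, all_of() selects by exact name;
# "&" keeps only the columns that satisfy both
out <- df %>%
  mutate(across(where(is.character) & all_of(varnames),
                ~ mean(as.numeric(.x))))
out
```

With all_of(), only column a is changed here; extra_a is left alone because its name is not in varnames.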

mutate_if() doesn't work the way you expect. According to its help page, the second argument, which sets the condition, needs to be one of the following:
A predicate function to be applied to the columns. (In this case, it can be a normal function or a lambda function, i.e. the form of ~ fun(.))
A logical vector.
If you want to calculate means for character columns, the correct syntax is
Code 1:
df %>% mutate_if(~ is.character(.), funs(mean(as.numeric(.))))
instead of
df %>% mutate_if(is.character(.), funs(mean(as.numeric(.))))
which results in an error message. Then, let's talk about the following code:
Code 2:
df %>% mutate_if(names(.) %in% varnames, funs(mean(as.numeric(.))))
In principle, mutate_if only passes column values to the predicate, not column names, so ~ names(.) would make no sense in it. Why, then, does Code 2 work without the ~ symbol in front of names(.)? The reason is that, owing to the pipe operator (%>%), the "." in names(.) represents df itself rather than each column of df. Code 2 is therefore equivalent to
df %>% mutate_if(names(df) %in% varnames, funs(mean(as.numeric(.))))
where a logical vector is passed rather than a predicate function. names(df) %in% varnames returns TRUE FALSE TRUE FALSE, and hence a and c are selected. This explains why your first block fails but the last one works.
The first block
df %>% mutate_if(is.character(.) & names(.) %in% varnames,
funs(mean(as.numeric(.))))
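Replacing every "." with df, you find that
is.character(df) returns FALSE
names(df) %in% varnames returns TRUE FALSE TRUE FALSE
The & operator recycles the single FALSE, so the final condition is FALSE FALSE FALSE FALSE and no column is selected. The same reasoning applies to the last block: !is.factor(df) returns TRUE, so the condition reduces to names(df) %in% varnames, and a and c are selected.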
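To combine the class condition with the name condition in mutate_if(), one option is to build the per-column logical vector yourself. This is a minimal sketch, assuming the same df and varnames as in the question. The name sel is only illustrative, and a formula lambda is used in place of funs(), which is deprecated in recent dplyr versions.

```r
library(dplyr)

df <- data.frame(a = c("1", "2", "3"),
                 b = c(2, 3, 4),
                 c = c(4, 3, 2),
                 d = c("1", "5", "9"),
                 stringsAsFactors = FALSE)
varnames <- c("a", "c")

# is.character(df) tests the data frame as a whole, so it is a single value
is.character(df)

# vapply() applies the test to each column, giving one value per column,
# which can then be combined element-wise with the name condition
sel <- unname(vapply(df, is.character, logical(1))) & names(df) %in% varnames

# pass the logical vector (one entry per column) as the predicate
out <- df %>% mutate_if(sel, ~ mean(as.numeric(.x)))
out
```

Here sel is TRUE only for column a, which is both character and listed in varnames, so only a is replaced by its mean.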

Related

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place ... there could be any expression.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing first part of the column, so i.e. c("v1", "v2")
fun: a vector containing second part of the column, so i.e. c("min"), or c("max", "min")
and returns filtered dataframe, so - for example:
createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use i.e. select(contains()) from dplyr package.
createData <- function(data, fun, cols)
{
X %>% select(contains())
return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more?) strings, i.e. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work, and my expressions aren't fixed anyway: they depend on the vector I pass,
how to filter multiple columns with different names, i.e. c("v1", "v2"), passed in a vector? and how to combine it with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
subv1trwmax = c(4,5,6),
ss25v2xxmin = c(7,8,9),
cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains, it acts like an OR condition, while chaining multiple select statements acts like an AND. So for your example data:
We can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
select(contains(c('v1','v2'))) %>%
select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
if (!is.null(fun)) data <- select(data, contains(fun))
if (!is.null(cols)) data <- select(data, contains(cols))
return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
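Since the question mentions not being tied to dplyr, the same "(any of cols) AND (any of fun)" logic can be sketched in base R with grepl(). The function name create_data_base is hypothetical, and the sketch assumes the entries of cols and fun contain no regular-expression metacharacters. Note that the pattern tried in the question, v1*min, means "v, then zero or more 1s, then min", which is why it did not match.

```r
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# hypothetical helper: keep columns matching any of cols AND any of fun
create_data_base <- function(data, cols = NULL, fun = NULL) {
  keep <- rep(TRUE, ncol(data))
  # paste(..., collapse = "|") builds an alternation such as "v1|v2"
  if (!is.null(cols)) keep <- keep & grepl(paste(cols, collapse = "|"), names(data))
  if (!is.null(fun))  keep <- keep & grepl(paste(fun, collapse = "|"), names(data))
  data[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
}

res_both <- create_data_base(data, cols = c("v1", "v2"), fun = "min")
res_cols <- create_data_base(data, cols = "v1")
names(res_both)
names(res_cols)
```

Because each condition is tested separately, the order in which the two strings appear inside a column name does not matter.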

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to replace column names that include "open" with "a" and column names that include "close" with "b":
Namely I want to obtain the below data frame:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefixes (here "df.") vary, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to @akrun's insightful suggestion, we can do it in one go. str_replace() is vectorised over its pattern and replacement arguments, so we can pass character vectors to both and carry out both operations at once. Each can be a character vector of length one or more; if longer than one, the two vectors must have the same length. Note that the patterns are matched element-wise against the column names, so this relies on the "open" column coming first and the "close" column second. As the documentation says about the replacement argument:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
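If the column order is not guaranteed, a named pattern vector passed to str_replace_all() may be safer, because every pattern is then tried on every name. This is a minimal sketch, assuming the stringr package is installed; the data frame df_rev (columns deliberately in reverse order) is made up for illustration.

```r
library(stringr)

# hypothetical data frame with "close" before "open"
df_rev <- data.frame(df.close = c(2, 8, 3), df.open = c(1, 4, 5))

# names of the vector are the patterns, values are the replacements;
# each pattern is applied to all column names in turn
new_names <- str_replace_all(names(df_rev),
                             c(".*open$" = "a", ".*close$" = "b"))
names(df_rev) <- new_names
df_rev
```

The same named vector could be used inside rename_with() in place of the str_replace() call.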
Another base R option using sub + match + setNames
setNames(
df,
c("a", "b")[match(
# capture the keyword itself; "[^open|close]" would be a character class, not an alternation
sub(".*(open|close)$", "\\1", names(df)),
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3

Is there a way to count values by presence per rows in R?

I want a way to count values on a dataframe based on its presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written this code to count a value when it is present in a row, but I want it to do this automatically for all the values present in the dataframe:
#for counting the value a and assigning the count to the b dataframe
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from the column, unlist to a vector and get the frequency count with table. If needed convert the table object to a two column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
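The row-wise reading of the question (a value counts once per row in which it appears, no matter how many columns hold it) can also be sketched with a plain loop over row indices. This is only an illustrative base R version using the same data as above; the name per_row is made up.

```r
a <- data.frame(let  = c("a", "b", "c", "d", "f"),
                let2 = c("a", "b", "a", "b", "d"),
                stringsAsFactors = FALSE)

# for each row, keep each value once, then tabulate across all rows
per_row <- table(unlist(lapply(seq_len(nrow(a)),
                               function(i) unique(unlist(a[i, ])))))
per_row
```

Row 1 holds "a" in both columns but contributes only one to the count of "a", which is the difference from simply tabulating every cell.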

Repeated function over varnames

Using the following dataframe (similar to my data, but much smaller):
id <- c(1:10)
clo_a <- c(rep(c("Yes","No"), 5))
clo_f <- c(rep(c(4,5), 5))
man_a <- c(rep(c("Yes","No"), each = 5))
man_f <- c(rep(c(c(3,7)), each = 5))
pho_a <- c(rep(c(NA, NA, "Yes", NA, "No"), 2))
pho_f <- c(rep(c(1,2,3,4,5), 2))
ds <- data.frame(id, clo_a, clo_f, man_a, man_f, pho_a, pho_f)
produces a dataframe ds as follows:
id clo_a clo_f man_a man_f pho_a pho_f
1 1 Yes 4 Yes 3 <NA> 1
2 2 No 5 Yes 3 <NA> 2
3 3 Yes 4 Yes 3 Yes 3
4 4 No 5 Yes 3 <NA> 4
5 5 Yes 4 Yes 3 No 5
6 6 No 5 No 7 <NA> 1
7 7 Yes 4 No 7 <NA> 2
8 8 No 5 No 7 Yes 3
9 9 Yes 4 No 7 <NA> 4
10 10 No 5 No 7 No 5
I now want to select the ids for which the variables ending in _a are "Yes", and also the values of the variables ending in _f, ideally separately.
As one example I can write:
upset_clo_a <- ds$id[which( ds$clo_a == "Yes")]
producing:
> upset_clo_a
[1] 1 3 5 7 9
I'd now like to repeat this for all variables, ideally using a vector with the common prefix of each set of variables, such as:
ai_list <- c("clo", "man", "pro")
Obviously the following example doesn't work. I tried several variants of using paste() or substitute(), not yielding anything useful.
lapply(ai_list, function(x) {
upset_x <- ds$id[which( ds$x == "Yes")]
})
The output is the same for all variants I tried:
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
integer(0)
In the end I want to (for example) read the ID-vectors per variable (e.g. upset_clo_f) as list of vectors into an upSet plot.
Maybe you have a great idea. Thanks!
You can paste the missing part (_a) in each vector element and use it to subset your data.frame. Then loop over those columns and get the index of Yes, i.e.
sapply(ds[names(ds) %in% paste0(ai_list, '_a')], function(i)which(i == 'Yes'))
Breaking down the code
paste0(ai_list, '_a') - Pastes the suffix _a onto each name (clo_a, man_a, pro_a)
names(ds) %in% paste0(ai_list, '_a') - Returns a logical vector (TRUE, FALSE) which is used to subset the columns of interest
ds[names(ds) %in% paste0(ai_list, '_a')] - Returns a data frame with only the columns of interest (as per the logical vector from the step above)
sapply(ds[names(ds) %in% paste0(ai_list, '_a')], function(i)which(i == 'Yes')) - Finally, we loop over the columns and apply the which() function to get the row indices of "Yes"
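The lapply() attempt in the question returns integer(0) because ds$x looks for a column literally named x; the $ operator does not evaluate its argument. A minimal sketch of the same loop using [[ ]] with a constructed column name follows. It assumes the third prefix is meant to be "pho" (the question's ai_list has "pro", which matches no column), and it returns the ids rather than the row positions. Only the columns needed here are recreated.

```r
ds <- data.frame(id = 1:10,
                 clo_a = rep(c("Yes", "No"), 5),
                 man_a = rep(c("Yes", "No"), each = 5),
                 pho_a = rep(c(NA, NA, "Yes", NA, "No"), 2),
                 stringsAsFactors = FALSE)

ai_list <- c("clo", "man", "pho")  # assumption: "pho", not "pro"

# setNames() makes the result a named list; [[ ]] evaluates the pasted name
upset_ids <- lapply(setNames(ai_list, ai_list), function(x) {
  ds$id[which(ds[[paste0(x, "_a")]] == "Yes")]
})
upset_ids
```

The result is a named list of id vectors, one per prefix, which is the usual input shape for set-intersection plots.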

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a data.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply() function instead of calling replace() on each column one by one, because there are actually many variables that need to be replaced in the real data. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.
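The within()/sapply() attempt in the question has no effect because sapply(var, ...) iterates over the strings "b" and "c" rather than over the columns, and the assignment inside the anonymous function is local to that function. A minimal sketch that keeps the replace() idea but loops over the actual columns is shown below; the values of b are fixed here (instead of sampled) so that the result is reproducible.

```r
df <- data.frame(a = 1:5, b = c(4L, 3L, 5L, 2L, 1L), c = 5:1)
var <- c("b", "c")

# df[var] is a list of the selected columns; lapply() returns the modified
# columns, which are assigned back in place
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))
df
```

Column a is not listed in var, so its 5 is left untouched.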
