Repeated function over varnames - r

Using the following dataframe (similar to my data, but much smaller):
id <- c(1:10)
clo_a <- c(rep(c("Yes","No"), 5))
clo_f <- c(rep(c(4,5), 5))
man_a <- c(rep(c("Yes","No"), each = 5))
man_f <- c(rep(c(c(3,7)), each = 5))
pho_a <- c(rep(c(NA, NA, "Yes", NA, "No"), 2))
pho_f <- c(rep(c(1,2,3,4,5), 2))
ds <- data.frame(id, clo_a, clo_f, man_a, man_f, pho_a, pho_f)
produces a dataframe ds as follows:
id clo_a clo_f man_a man_f pho_a pho_f
1 1 Yes 4 Yes 3 <NA> 1
2 2 No 5 Yes 3 <NA> 2
3 3 Yes 4 Yes 3 Yes 3
4 4 No 5 Yes 3 <NA> 4
5 5 Yes 4 Yes 3 No 5
6 6 No 5 No 7 <NA> 1
7 7 Yes 4 No 7 <NA> 2
8 8 No 5 No 7 Yes 3
9 9 Yes 4 No 7 <NA> 4
10 10 No 5 No 7 No 5
I now want to select the ids for which the variables ending in _a equal "Yes", and also the corresponding values of the variables ending in _f, ideally separately.
As one example I can write:
upset_clo_a <- ds$id[which( ds$clo_a == "Yes")]
producing:
> upset_clo_a
[1] 1 3 5 7 9
I'd now like to repeat this for all variables, ideally using a vector with the common stem of each variable pair, such as:
ai_list <- c("clo", "man", "pho")
Obviously the following example doesn't work; I tried several variants using paste() and substitute(), none of which yielded anything useful.
lapply(ai_list, function(x) {
upset_x <- ds$id[which( ds$x == "Yes")]
})
The output is the same for all variants I tried:
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
integer(0)
In the end I want to (for example) read the ID vectors per variable (e.g. upset_clo_a) as a list of vectors into an UpSet plot.
Maybe you have a great idea. Thanks!

You can paste the missing part (_a) onto each vector element and use it to subset your data.frame, then loop over those columns and get the indices of "Yes", i.e.
sapply(ds[names(ds) %in% paste0(ai_list, '_a')], function(i) which(i == 'Yes'))
Breaking down the code
paste0(ai_list, '_a') - Pastes the suffix _a onto each name (clo_a, man_a, pho_a)
names(ds) %in% paste0(ai_list, '_a') - Returns a logical vector (TRUE, FALSE) used to subset the columns of interest
ds[names(ds) %in% paste0(ai_list, '_a')] - Returns a data frame with only the columns of interest (as per the logical vector from the previous step)
sapply(ds[names(ds) %in% paste0(ai_list, '_a')], function(i) which(i == 'Yes')) - Finally, we loop over those columns and apply which() to get the indices of "Yes"
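If what you ultimately need is the ids themselves rather than the row indices, here is a minimal sketch along the same lines, assuming the UpSetR package for the final plot (the UpSetR part is my addition, not something stated in the question):
# Collect the ids per "_a" variable as a named list; UpSetR::fromList()
# accepts such a list of element vectors (assumed usage, check against your version)
library(UpSetR)
id_sets <- lapply(ds[paste0(ai_list, "_a")], function(col) ds$id[which(col == "Yes")])
names(id_sets) <- ai_list
# upset(fromList(id_sets))  # draw the UpSet plot from the list of id vectors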

Related

dplyr::mutate_if() with multiple conditions including column class not working

I'm really confused about why this is not working:
df <- data.frame(a = c("1", "2", "3"),
                 b = c(2, 3, 4),
                 c = c(4, 3, 2),
                 d = c("1", "5", "9"))
varnames = c("a", "c")
df %>%
  mutate_if((is.character(.) & names(.) %in% varnames),
            funs(mean(as.numeric(.))))
a b c d
1 1 2 4 1
2 2 3 3 5
3 3 4 2 9
Expected output would be
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
It works with a single condition, but the class condition I've actually only gotten to work using this formulation (which I don't know how to combine with the column name condition):
df %>%
  mutate_if(function(col) is.character(col),
            funs(mean(as.numeric(.))))
a b c d
1 2 2 4 5
2 2 3 3 5
3 2 4 2 5
However is.factor seems to work fine with the column names?
df %>%
  mutate_if(!is.factor(.) & (names(.) %in% varnames),
            funs(mean(as.numeric(.))))
a b c d
1 2 2 3 1
2 2 3 3 5
3 2 4 3 9
Note that mutate_if is being phased out in favour of across, so the following is perhaps what you want...
df %>%
  mutate(across(where(is.character) & matches(varnames), ~ mean(as.numeric(.))))
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
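One hedged caveat on that answer: matches() treats the entries of varnames as regular expressions (harmless here, since the names are single letters). If you want literal name matching instead, an alternative along the same lines could be:
df %>%
  mutate(across(where(is.character) & all_of(varnames), ~ mean(as.numeric(.))))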
mutate_if() doesn't work the way you expect. Its help page says the second argument, which sets the condition, needs to be one of the following:
A predicate function to be applied to the columns (either a normal function or a lambda of the form ~ fun(.)).
A logical vector.
If you want to calculate means for character columns, the correct syntax is
Code 1:
df %>% mutate_if(~ is.character(.), funs(mean(as.numeric(.))))
instead of
df %>% mutate_if(is.character(.), funs(mean(as.numeric(.))))
which results in an error message. Then, let's talk about the following code:
Code 2:
df %>% mutate_if(names(.) %in% varnames, funs(mean(as.numeric(.))))
In theory, mutate_if() only extracts column values, not column names, so names(.) should make no sense inside it. Why, then, does Code 2 work without the ~ symbol in front of names(.)? The reason is that the "." in names(.) actually refers to df itself rather than to each column of df, because of how the pipe operator (%>%) passes its left-hand side along. Code 2 is therefore executed as if it were
df %>% mutate_if(names(df) %in% varnames, funs(mean(as.numeric(.))))
where a logical vector is passed rather than a predicate function. names(df) %in% varnames returns TRUE FALSE TRUE FALSE, so a and c are selected. This explains why your first block fails but the last one works.
The first block
df %>%
  mutate_if(is.character(.) & names(.) %in% varnames,
            funs(mean(as.numeric(.))))
Replacing every "." with df, you can see that:
is.character(df) returns FALSE
names(df) %in% varnames returns TRUE FALSE TRUE FALSE
The & operator therefore makes the final condition FALSE FALSE FALSE FALSE, so no column is selected. The same reasoning explains the last block: !is.factor(df) returns TRUE, so the condition reduces to names(df) %in% varnames and columns a and c are selected.
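For completeness, here is a hedged sketch of one way to combine both conditions while staying with mutate_if(): compute the logical vector over df explicitly, so every piece refers to the whole data frame rather than to individual columns (this variant is my own addition, not part of the original answers):
# Explicit logical vector: per-column class check AND name check
df %>%
  mutate_if(sapply(df, is.character) & names(df) %in% varnames,
            ~ mean(as.numeric(.)))
# Only a is both character and listed in varnames, so only a becomes its mean (2)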

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place of ... there could be any string.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing the first part of the column name, e.g. c("v1", "v2")
fun: a vector containing the second part of the column name, e.g. c("min") or c("max", "min")
and returns the filtered dataframe. For example,
createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use something like select(contains()) from the dplyr package.
createData <- function(data, fun, cols)
{
  X %>% select(contains())
  return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more?) strings, i.e. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work, and my expressions aren't fixed anyway - they depend on the vector I pass,
how to filter multiple columns with different names, i.e. c("v1", "v2"), passed in a vector? And how to combine that with the first question?
I don't really need to stick with the dplyr package; it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
                  subv1trwmax = c(4,5,6),
                  ss25v2xxmin = c(7,8,9),
                  cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains(), it functions like an OR, while chaining multiple select() statements combines the conditions (an AND effect). So for your example data,
we can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
  select(contains(c('v1','v2'))) %>%
  select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
  if (!is.null(fun)) data <- select(data, contains(fun))
  if (!is.null(cols)) data <- select(data, contains(cols))
  return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
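If you prefer to require both name parts in a single call, closer to the grepl() idea from the question, here is a hedged sketch that builds one regular expression from the two vectors. createData2 is a hypothetical name, and it assumes the cols part always appears before the fun part in a column name:
createData2 <- function(data, cols = NULL, fun = NULL) {
  # Collapse each vector into an alternation group; a NULL part matches anything
  part1 <- if (is.null(cols)) "" else paste0("(", paste(cols, collapse = "|"), ")")
  part2 <- if (is.null(fun))  "" else paste0("(", paste(fun,  collapse = "|"), ")")
  dplyr::select(data, dplyr::matches(paste0(part1, ".*", part2)))
}
createData2(data, cols = c("v1", "v2"), fun = "min")  # same two columns as the chained select() above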

How to inverse subset in R?

I am trying to make non-overlapping subsets of a totally inclusive group in R. The first subset contains pairs of elements from the totally inclusive group. The other subset should be all of the elements in the totally inclusive group, but not in the first subset.
poplength <- 10
samples <- 7
numpair <- 2
totallyinclusivegroup <- sample(1:poplength, samples)
Subset1 <- sample(totallyinclusivegroup, size = numpair*2)
I don't know how to get a "Subset2" that includes everything in "totallyinclusivegroup" but not in Subset 1. I've tried using the "-" operator, with no success. For example,
Subset2 <- totallyinclusivegroup[-Subset1]
does not work, and includes elements from Subset1. Any advice/help is appreciated.
We can negate with ! the logical vector from %in%, so that TRUE becomes FALSE and vice versa
out <- totallyinclusivegroup[!totallyinclusivegroup %in% Subset1]
Output:
Subset1
#[1] 2 6 9 7
totallyinclusivegroup
#[1] 3 2 6 1 9 7 8
out
#[1] 3 1 8
Or an easier option is setdiff
setdiff(totallyinclusivegroup, Subset1)
#[1] 3 1 8
If there are duplicate elements, it is better to use vsetdiff from vecsets
library(vecsets)
vsetdiff(totallyinclusivegroup, Subset1)
Try:
#Code
Subset2 <- totallyinclusivegroup[-which(totallyinclusivegroup %in% Subset1)]
Output:
totallyinclusivegroup
[1] 8 5 10 2 9 1 3
Subset1
[1] 5 10 3 9
Subset2
[1] 8 2 1
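As a side note on why the original totallyinclusivegroup[-Subset1] attempt fails: negative indexing removes elements by position, not by value, so it drops whatever happens to sit at those positions (this is also why the -which(...) version above works: which() turns the value match back into positions). A small hedged illustration, with a seed added only to make it reproducible:
set.seed(42)                         # hypothetical seed, just for a reproducible example
totallyinclusivegroup <- sample(1:10, 7)
Subset1 <- sample(totallyinclusivegroup, 4)
totallyinclusivegroup[-Subset1]                             # wrong: drops by position
totallyinclusivegroup[!totallyinclusivegroup %in% Subset1]  # right: drops by value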

Choose some items in a dataframe and change them

I have a data frame with some information. Some data is NA. Something like:
id fact sex
1 1 3 M
2 2 6 F
3 3 NA <NA>
4 4 8 F
5 5 2 F
6 6 2 M
7 7 NA <NA>
8 8 1 F
9 9 10 M
10 10 10 M
I have to change fact according to some rule (e.g. multiply by 3 the elements where sex == "M").
I tried survey$fact[survey$sex == "M"] <- survey$fact[survey$sex == "M"] * 3, but I get an error because of the NA values.
I know I can check whether an element is NA with is.na(x) and add that condition inside [...], but I hope a more elegant solution exists.
I really like ifelse, it always seems to have the desired behaviour with respect to NA values for me.
survey$fact <- ifelse(survey$sex == "M", survey$fact * 3, survey$fact)
?ifelse shows that the first argument is the test, the second the value assigned if the test is true and the final argument the value if false. If you assign the original data.frame column as the false return value, it will assign rows for which the test fails without modifying them.
This is an extension of what you asked, to show that you can also test for NA values.
survey$fact <- ifelse(is.na(survey$sex), survey$fact * 2, survey$fact)
I also like that it's very readable.
which() can filter out those NAs:
survey$fact[which(survey$sex == "M")] <- survey$fact[which(survey$sex == "M")] * 3
There are many ways you can make that a little cleaner, e.g.:
males <- which(survey$sex == "M")
survey$fact[males] <- 3 * survey$fact[males]
or
survey <- within(survey, fact[males] <- 3 * fact[males])
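If you are already using dplyr, here is a hedged alternative sketch, not part of the original answers, that is explicit about the NA handling:
library(dplyr)
survey <- survey %>%
  mutate(fact = if_else(!is.na(sex) & sex == "M", fact * 3, fact))
# !is.na(sex) turns the NA comparisons into FALSE, so those rows are left untouched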

Subsetting R data frame results in mysterious NA rows

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.
When subsetting my data, I occasionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. E.g.:
example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example
var1 var2
1 A X
2 B Y
3 A Z
then I run:
example[example$var1=="A",]
var1 var2
1 A X
3 A Z
NA <NA> <NA>
Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.
Maybe it has to do with the fact that I'm importing my original data set with read.xlsx and then running a wide-to-long reshape before subsetting.
Thanks
Wrap the condition in which:
df[which(df$number1 < df$number2), ]
How it works:
which() returns the row numbers where the condition is TRUE, and the data frame is subset on those rows accordingly.
Say that:
which(df$number1 < df$number2)
returns row numbers 1, 2, 3, 4 and 5.
As such, writing:
df[which(df$number1 < df$number2), ]
is the same as writing:
df[c(1, 2, 3, 4, 5), ]
Or an even simpler version is:
df[1:5, ]
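The key point is that which() silently drops the NA results of the comparison, so no NA index, and hence no NA row, is ever produced. A tiny hedged illustration:
x <- c(1, NA, 3)
x > 2           # FALSE NA TRUE
which(x > 2)    # 3 -- the NA is dropped, only the TRUE positions remain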
I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).
First of all, some sample data:
> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
6 F 6 5
7 G 7 4
8 H 8 3
9 I 9 NA
10 J 10 NA
Now for a simple filter:
> df[df$number1 < df$number2, ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
NA <NA> NA NA
NA.1 <NA> NA NA
The problem here is that the NAs in the third column make the comparison return NA, and subsetting with a logical vector that contains NA yields one all-NA row for each NA in the index, so those rows still show up in the result. Here's my fix, which requires knowing which column contains the NAs:
> df[df$number1 < df$number2 & !is.na(df$number2), ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
I get the same problem when using code similar to what you posted. Using the function subset() instead,
subset(example, example$var1 == "A")
the NA rows are excluded.
Using dplyr:
library(dplyr)
filter(df, number1 < number2)
I find that using %in% instead of == solves this issue, although I am still wondering why.
For example, instead of:
df[df$num == 1, ]
use:
df[df$num %in% c(1), ]
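The reason, for what it's worth: == propagates NA (NA == 1 is NA), while %in% never returns NA (NA %in% 1 is FALSE), so the logical index contains no NAs. A quick hedged check:
NA == 1      # NA    -> becomes a mysterious NA row when used as a row index
NA %in% 1    # FALSE -> the row is simply excluded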
> example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
> example
var1 var2
1 A X
2 <NA> Y
3 A Z
> example[example$var1=="A",]
var1 var2
1 A X
NA <NA> <NA>
3 A Z
This is probably the output you are seeing. Try wrapping the condition in which() to avoid the NAs:
example[which(example$var1 == "A"), ]
var1 var2
1 A X
3 A Z
Another cause may be getting the condition wrong, such as checking whether a factor column equals a value that is not among its levels. This troubled me for a while.
