How to ignore case when using subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select the columns named x, y, z, w, t, regardless of case. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT: If your column names are longer than one letter, the above approach won't work, because the character class [XYZWT] only matches single-letter names. Assuming you can get your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' package is a very neat wrapper for all of this functionality. It has an 'ignore.case' option.
Also, you may want to consider using match rather than subset.
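As a minimal sketch of the match suggestion, in base R only: the data frame and the target vector below are made up for illustration. match looks up each lower-cased target name in the lower-cased column names, so the result follows the order of the target vector, and targets that are not found are dropped.

```r
# Toy data, assumed for illustration (mixed-case column names)
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5, y = 1:5)

# Names we want, written in lower case, in the order we want them
target <- c("x", "xy")

# Position of each target among the lower-cased column names (NA if absent)
cols <- match(target, tolower(names(dat)))
cols <- cols[!is.na(cols)]

# drop = FALSE keeps a data.frame even if only one column matches
res <- dat[, cols, drop = FALSE]
names(res)
```

Unlike the %in% approach, reversing target here reverses the column order of the result.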
I am looking for a way to find clusters of size 2 (pairs).
Is there a simple way to do that?
Imagine I have some kind of data where I want to match on x and y, like
library(cluster)
set.seed(1)
df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))
I want to find the closest pair of distances between the x_coord and y_coord:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I get a dendrogram like the one below. What I would like is for the pairs (9,10), (1,3), (6,7), (4,5) to be grouped together, and for cases 8 and 2, which have no partner, to be removed.
Maybe there is a more effective alternative for doing this than clustering.
Ultimately, I would like to remove the unmatched ids, keep the pairs, and end up with a dataset like this one:
id x_coord y_coord pair_id
1 9 3 1
3 7 5 1
4 1 8 2
5 2 2 2
6 5 6 3
7 3 10 3
9 6 4 4
10 8 7 4
You could use the element h$merge. Any row of this two-column matrix in which both entries are negative represents a pairing of two singletons. Therefore you can do:
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
df
#> id x_coord y_coord pair
#> 1 1 9 3 4
#> 3 3 7 5 4
#> 4 4 1 8 1
#> 5 5 2 2 1
#> 6 6 5 6 2
#> 7 7 3 10 2
#> 9 9 6 4 3
#> 10 10 8 7 3
Note that the pair numbers follow the order in which the pairs merge in the dendrogram, so pair 1 is the pair joined at the lowest height. If you want them to be in ascending order according to the order of their appearance in the dataframe you can add the line
df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
And we can see the method scales nicely:
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
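The h$merge idea above can be wrapped in a small function. This is only a sketch: pair_ids is a hypothetical helper name and the toy data are made up. It assumes the observations passed to dist are in the same order as the rows you want to label. Using drop = FALSE keeps the selection a matrix even when only one singleton pair exists, a case where plain indexing would collapse it to a vector.

```r
# Hypothetical helper: give each observation the number of the singleton
# pair it belongs to, or NA if it was never merged with another singleton.
pair_ids <- function(h) {
  # Rows of h$merge where both entries are negative join two singletons
  is_pair <- h$merge[, 1] < 0 & h$merge[, 2] < 0
  pairs <- -h$merge[is_pair, , drop = FALSE]

  ids <- rep(NA_integer_, nrow(h$merge) + 1)   # one slot per observation
  ids[pairs[, 1]] <- seq_len(nrow(pairs))
  ids[pairs[, 2]] <- seq_len(nrow(pairs))
  ids
}

# Toy data, assumed for illustration: two obvious pairs and one outlier
toy <- data.frame(id = 1:5, x = c(0, 1, 10, 12, 50))
h_toy <- hclust(dist(toy$x))
toy$pair <- pair_ids(h_toy)
toy_paired <- toy[!is.na(toy$pair), ]
toy_paired
```

Here observations 1 and 2 form one pair, 3 and 4 the other, and observation 5 is dropped.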
I have the following data:
set.seed(1)
df_1 <- data.frame(x = replicate(n = 2, expr = sample(x = 1:3, size = 20, replace = T)),
y = as.factor(sample(x = 1:5, size = 20, replace = TRUE)))
I want to replace the numbers >= 2 by 9 in x.1 and x.2 simultaneously:
df_1[df_1$x.1, df_1$x.2 >= 2] <- 9
Error in [<-.data.frame(*tmp*, df_1$x.1, df_1$x.2 >= 2, value = 9) :
duplicate subscripts for columns
And replace the number 3 by 99 in y.
df_1$y[df_1$y %in% c('3')] <- 99
Warning message:
In [<-.factor(*tmp*, df_1$y %in% c("3"), value = c(2L, 5L, 2L, :
invalid factor level, NA generated
Tks.
We can use replace:
df_1[1:2] <- replace(df_1[1:2], df_1[1:2] >=2, 9)
Another option is to create a logical matrix from the subset of 'x.' columns, use it to select the matching values, and assign 9 to them:
df_1[1:2][df_1[1:2] >= 2] <- 9
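To make the logical-matrix step easier to follow, here is a small sketch on made-up data. toy, num_cols and mask are illustrative names, not part of the question. Comparing a data frame with a number gives a logical matrix of the same shape, and that matrix can index the same columns for assignment.

```r
# Toy data, assumed for illustration
toy <- data.frame(x.1 = c(1, 2, 3), x.2 = c(3, 1, 2), y = c("a", "b", "c"))

num_cols <- c("x.1", "x.2")

# Logical matrix: TRUE wherever a value in the selected columns is >= 2
mask <- toy[num_cols] >= 2

# Assign 9 to every TRUE cell; column y is untouched
toy[num_cols][mask] <- 9
toy
```

Selecting the columns by name rather than by position (1:2) is a design choice that keeps the code working if the column order changes.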
To change the factor, we either need to call factor again or add the new level beforehand:
levels(df_1$y) <- c(levels(df_1$y), "99")
df_1$y
#[1] 4 4 4 2 4 1 1 4 1 2 3 2 2 5 2 1 3 3 4 3
#Levels: 1 2 3 4 5 99
df_1$y[df_1$y == '3'] <- '99'
df_1$y
#[1] 4 4 4 2 4 1 1 4 1 2 99 2 2 5 2 1 99 99 4 99
##Levels: 1 2 3 4 5 99
Or, as @thelatemail mentioned, we can rename the level directly, which also drops the old level "3" during the replacement:
levels(df_1$y)[levels(df_1$y) == '3'] <- "99"
Or we can use fct_recode from forcats:
library(forcats)
df_1$y <- fct_recode(df_1$y, "99" = "3")
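The two base R routes above differ only in what happens to the old level. A minimal sketch on a made-up factor (f, f1 and f2 are illustrative names) shows that adding the level first leaves "3" behind as an unused level until droplevels is called, while renaming the level replaces it in one step.

```r
# Toy factor, assumed for illustration
f <- factor(c("1", "3", "2", "3"))

# Route 1: add the new level, assign, then drop the now-unused level "3"
f1 <- f
levels(f1) <- c(levels(f1), "99")
f1[f1 == "3"] <- "99"
levels_before_drop <- levels(f1)   # still contains "3"
f1 <- droplevels(f1)

# Route 2: rename the level in place; no unused level is left behind
f2 <- f
levels(f2)[levels(f2) == "3"] <- "99"

as.character(f1)
as.character(f2)
```

Both routes end with the same values and the same set of levels.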
I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
For every NA, I would like to extract the previous non-NA value in the same column (or the next one, if the NA is in the first row) and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which applies "last observation carried forward". Since the top row is a special case, we do it twice, the second time in reverse. This gives us the nearest non-NA value in the column (above if there is one, otherwise below).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
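If zoo is not available, the same two-pass fill can be sketched in base R. The helpers fill_forward and fill_both are hypothetical names, and this is a swapped-in technique, not what zoo::na.locf does internally. It relies on cumsum(!is.na(v)) counting how many non-NA values have been seen so far, which is the index of the value to carry forward.

```r
# Hypothetical helper: carry the last non-NA value forward;
# leading NAs stay NA because nothing precedes them.
fill_forward <- function(v) {
  seen <- cumsum(!is.na(v))          # 0 while no value has been seen yet
  c(NA, v[!is.na(v)])[seen + 1]      # slot 1 is the NA used for seen == 0
}

# Forward pass first, then a backward pass to fill any leading NAs
fill_both <- function(v) {
  rev(fill_forward(rev(fill_forward(v))))
}

# First column of dat2 from the example above
v1 <- c(NA, NA, 3, 4, NA)
fill_forward(v1)
fill_both(v1)
```

Applied to each column, fill_both plays the role of the two apply calls in myfunc2.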
I have a data frame similar to the following format:
Doc Category val
A aa 1
B ab 6
C ab 3
D cc 6.....
I am using the following code to identify all combinations of sums of val and then extracting the rows that add up to a target sum I have already identified.
#all combinations
res <- Map(combn, list(val), seq_along(val), simplify = FALSE)
x=unlist(res, recursive = FALSE)
z=lapply(x, function(x) sum(x))
My issue is determining the best way to preserve the character columns in the data frame as the code above only gives numerical values. The way I am doing it now is a mapping based on val, which normally works fine, however, I can run into issues when there are duplicated values.
For example, if my target sum is 7, I eventually want output that looks like this (there are other ways to get to this value, but for now just returning the first instance works):
Doc Category val
A aa 1
B ab 6
Is there a better way to map back to the non-numerical columns to achieve this output?
Would this solution work for you:
df <- data.frame(Doc = LETTERS[1:7],
Category = c("aa","ab","ab","cc","ca","cb","bb"),
val = c(1,6,3, 6, 4, 5, 2),
stringsAsFactors=FALSE)
df
# Doc Category val
# 1 A aa 1
# 2 B ab 6
# 3 C ab 3
# 4 D cc 6
# 5 E ca 4
# 6 F cb 5
# 7 G bb 2
target.sum=7
# create an "id" variable that is equal to the index of all rows
df$id <- seq_along(df$val)
id.res <- Map(combn, list(df$id), seq_along(df$id), simplify = FALSE)
x=unlist(id.res, recursive = FALSE)
#remove all elements in the list where the sum of
# values in column val is not equal to target value
x.list <- lapply(x,FUN=function(x){ if(sum(df$val[x]) == target.sum ) df[x,] else NA})
#remove missing values
x.list <-x.list[!is.na(x.list)]
x.list
# [[1]]
# Doc Category val id
# 1 A aa 1 1
# 2 B ab 6 2
#
# [[2]]
# Doc Category val id
# 1 A aa 1 1
# 4 D cc 6 4
#
# [[3]]
# Doc Category val id
# 3 C ab 3 3
# 5 E ca 4 5
#
# [[4]]
# Doc Category val id
# 6 F cb 5 6
# 7 G bb 2 7
#
# [[5]]
# Doc Category val id
# 1 A aa 1 1
# 5 E ca 4 5
# 7 G bb 2 7
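A more compact variant of the same idea, as a sketch: it enumerates combinations of row indices and keeps those whose val values add up to the target, so duplicated values cannot be confused. The names idx_sets, hits and first_hit are illustrative, and the all.equal comparison is an added assumption that tolerates floating-point rounding when val is not integer-valued. Like the code above, it enumerates every subset, so it is only practical for a small number of rows.

```r
df <- data.frame(Doc = LETTERS[1:7],
                 Category = c("aa", "ab", "ab", "cc", "ca", "cb", "bb"),
                 val = c(1, 6, 3, 6, 4, 5, 2),
                 stringsAsFactors = FALSE)
target_sum <- 7

# Every combination of row indices, for subset sizes 1..nrow(df)
idx_sets <- unlist(
  lapply(seq_len(nrow(df)), function(k) combn(nrow(df), k, simplify = FALSE)),
  recursive = FALSE
)

# Keep the index sets whose values sum to the target (with tolerance)
hits <- Filter(function(i) isTRUE(all.equal(sum(df$val[i]), target_sum)),
               idx_sets)

# Rows of the first matching combination, character columns preserved
first_hit <- df[hits[[1]], ]
first_hit
```

Because the row indices are carried through, no extra id column is needed.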
I am calling mutate using dynamic variable names. An example that mostly works is:
library(dplyr)
library(lazyeval)  # provides interp()
df <- data.frame(a = 1:5, b = 1:5)
func <- function(a,b){
return(a+b)
}
var1 = 'a'
var2 = 'b'
expr <- interp(~func(x, y), x = as.name(var1), y = as.name(var2))
new_name <- "dynamically_created_name"
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
Which produces
temp
a b func(a, b)
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
This is mostly fine, except that setNames appears to ignore the nm argument. This is solved by wrapping my expression in list():
temp <- df %>% mutate_(.dots = setNames(list(expr), nm = new_name))
temp
a b dynamically_created_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
My question is: why does setNames ignore its nm argument in the first place, and how does list() solve this problem?
As noted in the other answer, the .dots argument is assumed to be a list, and setNames is a convenient way to rename elements in a list.
What is the .dots argument doing? Let's first think about the actual dots ... argument. It is a series of expressions to be evaluated. Below the dots ... are the two named expressions c = ~ a * scale1 and d = ~ a * scale2.
scale1 <- -1
scale2 <- -2
df %>%
mutate_(c = ~ a * scale1, d = ~ a * scale2)
#> a b c d
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
We could just bundle those expressions together beforehand in a list. That's where .dots comes in. That parameter lets us tell mutate_ to evaluate the expressions in the list.
bundled <- list(
c2 = ~ a * scale1,
d2 = ~ a * scale2
)
df %>%
mutate_(.dots = bundled)
#> a b c2 d2
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
If we want to programmatically update the names of the expressions in the list, then setNames is a convenient way to do that. If we want to programmatically mix and match constants and variable names when making expressions, then the lazyeval package provides convenient ways to do that. Below I do both to create a list of expressions, name them, and evaluate them with mutate_:
# Imagine some dropdown boxes in a Shiny app, and this is what user requested
selected_func1 <- "min"
selected_func2 <- "max"
selected_var1 <- "a"
selected_var2 <- "b"
# Assemble expressions from those choices
bundled2 <- list(
interp(~fun(x), fun = as.name(selected_func1), x = as.name(selected_var1)),
interp(~fun(x), fun = as.name(selected_func2), x = as.name(selected_var2))
)
bundled2
#> [[1]]
#> ~min(a)
#>
#> [[2]]
#> ~max(b)
# Create variable names
exp_name1 <- paste0(selected_func1, "_", selected_var1)
exp_name2 <- paste0(selected_func2, "_", selected_var2)
bundled2 <- setNames(bundled2, c(exp_name1, exp_name2))
bundled2
#> $min_a
#> ~min(a)
#>
#> $max_b
#> ~max(b)
# Evaluate the expressions
df %>%
mutate_(.dots = bundled2)
#> a b min_a max_b
#> 1 1 1 1 5
#> 2 2 2 1 5
#> 3 3 3 1 5
#> 4 4 4 1 5
#> 5 5 5 1 5
From vignette("nse"):
If you also want output variables to vary, you need to pass a list of quoted objects to the .dots argument
So perhaps the reason why
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
doesn't do what you want is that, while you successfully set the names attribute here, expr is still a formula, not a list:
foo <- setNames(expr, nm = new_name)
names(foo) #"dynamically_created_name" ""
class(foo) #"formula"
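So if you make it a list, it works as expected. Note that list(new_name = expr) names the column literally new_name; to use the string stored in the variable new_name, write setNames(list(expr), new_name) as in the question: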
expr <- interp(~func(x, y), x = as.name(var1),
y = as.name(var2))
df %>% mutate_(.dots = list(new_name = expr))
a b new_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
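The difference between naming a formula and naming a list can be checked in base R alone, without dplyr or lazyeval. In this sketch the formula is written by hand instead of being built with interp, and func is never evaluated. A formula is a language object of length 2 here (the ~ and the call), so setNames attaches the name to its first element, whereas wrapping it in list() gives a length-one list whose single element carries the name.

```r
new_name <- "dynamically_created_name"
expr <- ~ func(a, b)               # hand-written stand-in for the interp() result

# Naming the formula itself: the name lands on the first of its 2 elements
foo <- setNames(expr, nm = new_name)
length(foo)
class(foo)

# Naming a list that contains the formula: one element, one name
wrapped <- setNames(list(expr), nm = new_name)
length(wrapped)
names(wrapped)
```

mutate_ uses the names of the list passed to .dots as the output column names, which is why only the second form gives the column the intended name.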