Selecting multiple columns using Regular Expressions - r

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame in R.
The best code that I could write to get this done is:
data %>% select(grep("r[5-9][^0-9]", names(data), value = TRUE),
                grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans a day, I was wondering if anyone could help me write better, more compact code for this!

Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for names such as r51, and the call can also be shortened. Instead, you will need a slightly longer regex, wrapped in select()'s matches() helper:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
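To see the difference, here is a quick sketch (the name r51x below is made up purely to illustrate the failure mode):
nms <- c("r5e", "r12a", "r51x", "r4b")
grepl("r([5-9]|1[0-2])", nms)           # TRUE  TRUE  TRUE FALSE -- r51x slips through
grepl("r([5-9]|1[0-2])([^0-9]|$)", nms) # TRUE  TRUE FALSE FALSE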

Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
# Split the names on the letters, drop the empty pieces, and keep the numbers
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments: the first is the vector of variable names, and the second and third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
  y <- unlist(strsplit(x, "[[:alpha:]]"))
  y <- as.integer(y[sapply(y, `!=`, "")])
  x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"

Remove the non-digits, scan the remainder in, and check whether each is in 5:12:
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
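The scan() step could also be swapped for as.integer() on the stripped names (a sketch under the assumption that every name contains exactly one run of digits):
DF[as.integer(gsub("\\D", "", names(DF))) %in% 5:12]
##   r5e r7g r9i r11k
## 1   3   4   5    6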

Related

Nested ifelse() or case_when() for unknown number of queries in R

I have a data frame which I would like to group according to the value in a given row and column of the data frame.
my_data <- data.frame(matrix(ncol = 3, nrow = 4))
colnames(my_data) <- c('Position', 'Group', 'Data')
my_data[,1] <- c('A1','B1','C1','D1')
my_data[,3] <- c(1,2,3,4)
grps <- list(c('A1','B1'),
             c('C1','D1'))
grp.names = c("Control", "Exp1", "EMPTY")
my_data$Group <- case_when(
  my_data$Position %in% grps[[1]] ~ grp.names[1],
  my_data$Position %in% grps[[2]] ~ grp.names[2]
)
OR
my_data$Group <- with(my_data, ifelse(Position %in% grps[[1]], grp.names[1],
                               ifelse(Position %in% grps[[2]], grp.names[2],
                                      grp.names[3])))
These examples work and produce a Group column with appropriate labels, however I need to have flexibility in the length of the grps list from 1 to approximately 25.
I see no way to iterate through case_when or ifelse in a for loop, e.g.
my_data$Group <- for (i in 1:length(grps)){
  case_when(
    my_data$Well %in% grps[[i]] ~ grp.names[i])
}
This example simply deletes the Group column (a for loop returns NULL, so assigning its result removes the column).
What is the most appropriate way to handle a variable grps length?
I believe your question implies that the grps variable is a list and every element in that list is itself an array that holds all the positions that belong to that group.
Specifically, in your grps variable below, if the Position is "A1" or "B1" it belongs to whatever your first entry in grp.names is. Similarly, if the Position is "C1" or "D1" it belongs to whatever your second entry in grp.names is.
> grps
[[1]]
[1] "A1" "B1"
[[2]]
[1] "C1" "D1"
Assuming that to be the case you can do the following:
matching_group_df <- sapply(grps, function(x){ my_data$Position %in% x})
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
my_data$Group <- grp.names[selected_group]
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
The way it works is as follows:
matching_group_df is a matrix of TRUE/FALSE values (created via the sapply function) that specifies which group index each position belongs to:
> matching_group_df
[,1] [,2]
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] FALSE TRUE
You then select the column that has the TRUE value row by row using an apply command:
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
> selected_group
[1] 1 1 2 2
Finally you pass those indices to your grp.names list to select the appropriate ones and set them into your original dataframe.
grp.names[selected_group]
[1] "Control" "Control" "Exp1" "Exp1"
This also has the small side benefit of just using base R functions if that is important to you.
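One caveat: if a Position belongs to no group, which(x == TRUE) returns integer(0) and the indexing into grp.names breaks. A hedged base-R variant (a sketch, not part of the original answer) builds a lookup table with stack() and uses match(), so unmatched positions simply become NA:
lookup <- stack(setNames(grps, grp.names[seq_along(grps)]))  # columns: values, ind
my_data$Group <- as.character(lookup$ind[match(my_data$Position, lookup$values)])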
Approach 1: Hash table
I would opt for a different approach here, as group makeup might change during analysis: specifically, a lookup table of key-value pairs plus a small accessor function.
library(tidyverse)
# First, a small adjustment to `grps` to reflect an empty group.
grps <- list(c('A1','B1'),
             c('C1','D1'),
             NULL)
names <- unlist(grps, use.names = F)
values <- rep(grp.names, map_dbl(grps, length))
h <- as.list(values) %>%
  set_names(names) %>%
  list2env()
# find x in h
f <- Vectorize(function(x) h[[x]], c("x")) # scoping here
This takes some time to set up, but usage is quite convenient:
my_data %>%
  mutate(Group = f(Position))
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
This avoids having to change your code in multiple places, and can take on arbitrary length of groups.
Approach 2: Dynamic switch
Alternatively, we can make an arbitrary length switch expression, building it from the group names and their unique values.
constructor <- function(ids, names){
  purrr::imap_chr(as.character(ids), ~paste(paste0("\"", .x, "\""),
                                            paste0("\"", names[.y], "\""),
                                            sep = "=")) %>%
    paste0(collapse = ", ") %>%
    paste0("Vectorize(function(x) switch(as.character(x), ", ., ", NA))", collapse = "") %>%
    str2expression()
}
my_data %>%
  mutate(Group = eval(constructor(names, values))(Position))
In this case, it would evaluate the expression
expression(Vectorize(function(x) switch(as.character(x), A1 = "Control",
B1 = "Control", C1 = "Exp1", D1 = "Exp1",
NA)))
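As a quick sanity check (assuming names and values from Approach 1 are still in scope), the generated function can be evaluated and called directly on a couple of positions:
eval(constructor(names, values))(c("A1", "C1"))
#        A1        C1 
# "Control"    "Exp1" 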
For each item in my_data$Position you want to go through each of the grps and, if there is a match, assign the corresponding grp.names entry. If you don't find a match in any group, assign grp.names[3]:
my_data$Group <- lapply(my_data$Position, function(position){ # Goes through each my_data$Position
  for(i in 1:length(grps)){
    if(position %in% grps[[i]]){
      return(grp.names[i]) # Give matching index of grp.names to grps
    } else if (i == length(grps)){ # if no matches assign grp.names[3]
      return(grp.names[3])
    }
  }
}) %>% unlist() # Put the list into a vector

R Applying self made formatting function over data frame R

I am using R and I need to format the numbers within a dataframe, in particular by imposing the number of digits before the decimal separator as well as after. E.g. 3.56 must become "0003,56000".
So I built my own function:
format <- function(x, nbr_before_comma, nbr_after_comma){
  x = round(x, nbr_after_comma)
  x = toString(x)
  l = strsplit(x, "[.]")[[1]]
  #print(l)
  #print(nchar(l[2]))
  before_comma = paste0(strrep("0", nbr_before_comma - nchar(l[1])), l[1])
  after_comma = ifelse(length(l) > 1,
                       paste0(l[2], strrep("0", nbr_after_comma - nchar(l[2]))),
                       strrep("0", nbr_after_comma))
  res = paste0(before_comma, ",", after_comma)
  return(res)
}
Trying this on a single number will work. Now I am trying to apply this to a dataframe. Let's take the toy example:
df <- data.frame("a" = c(2.5,3.56,4.5))
I define more precisely what I want:
format44 <- function(x){
  return(format(x, 4, 4))
}
I have tried several possibilities:
df[] <- lapply(df, format44)
with dplyr:
df <- df %>%
  mutate(a = format44(a))
and finally:
df["a"] <- lapply(df["a"],format44)
None of these work. Actually, I get the same output every time:
a
1 0002,5, 3
2 0002,5, 3
3 0002,5, 3
Any idea what the problem is?
Use sprintf and then translate the decimal points to comma:
before <- after <- 4
fmt <- sprintf("%%0%d.%df", before + after + 1, after)
transform(df, a = chartr(".", ",", sprintf(fmt, a)))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
or writing this with dplyr:
library(dplyr)
before <- after <- 4
df %>%
  mutate(a = "%%0%d.%df" %>%
           sprintf(before + after + 1, after) %>%
           sprintf(a) %>%
           chartr(".", ",", .))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
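In both versions the sprintf() pattern "%%0%d.%df" expands, with before = after = 4, to a fixed-width zero-padded format string; for reference (an illustrative check of the fmt built above):
fmt
## [1] "%09.4f"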
In this case, mapply suits you better:
df$b <- mapply(format44, df$a)
You do not even need the format44 wrapper. You can use:
df$c <- mapply(format, df$a, 4,4)
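A hedged alternative, under the same 4-digits-before and 4-after assumption, is formatC(), which handles the zero padding directly (the column name d is just for illustration):
df$d <- chartr(".", ",", formatC(df$a, format = "f", digits = 4, width = 9, flag = "0"))
df$d
# [1] "0002,5000" "0003,5600" "0004,5000"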

Flipping two sides of string

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of and after the underscore, e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach, using chartr, dirname and paste, that might be more efficient. But we need to get the indices of the column names that contain "NORM" first:
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
        grepl("NORM", x),
        sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
          paste(rev(x), collapse = "_")
        }))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
  filter(str_detect(column, "NORM")) %>%
  mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
  select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
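If you would rather keep all the column names in one vector instead of only the NORM ones, a hedged variant (a sketch using str_replace with backreferences, not part of the original answer) rewrites only the matching entries:
my_data %>%
  mutate(column_2 = ifelse(str_detect(column, "NORM"),
                           str_replace(column, "^(.*)_(NORM)$", "\\2_\\1"),
                           column))
# column_2 becomes NORM_X99, NORM_X101, X76_110_T02_09747, NORM_X30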

R: Apply a function over only character columns without type coercion

I have a data frame with many columns. The columns differ in their types: some are numeric, some are character, etc. Here's a small example where we just have 3 variables with 2 types:
# Generate data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
I want to replace the value 3 with a space, but only for character columns. Here's my current code:
out <- as.data.frame(lapply(dat, function(x) {
  ifelse(is.character(x),
         gsub("3", " ", x),
         x)}),
  stringsAsFactors = FALSE)
The problem is that the ifelse() function ignores that y and z are numeric and that it also seems to coerce the numeric variables to character anyway.
One idea has been to pull out the character columns, gsub() them, then bind them back to the original data frame. This, however, changes the ordering of the columns. Key to any solution is that I should not have to specify variables by name but only by type.
One can also do this trivially using dplyr:
# Load package
library(dplyr)
# Create data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
# Replace 3's with spaces for character columns
dat <- dat %>% mutate_if(is.character, function(x) gsub(pattern = "3", " ", x))
I tried your code and for me it seems like ifelse did not work, but separating if and else does. Below is the code which works:
# Generate data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
> lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x })
$x
[1] "1" "2" " "
$y
[1] 1.0 2.5 3.3
$z
[1] 1 2 3
> as.data.frame(lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x }))
x y z
1 1 1.0 1
2 2 2.5 2
3 3.3 3
It comes down to this line in ?ifelse:
ifelse returns a value with the same shape as test ...
is.character(x) has length one, so the returned value has length 1. You can use if(...) yes else no instead, as #Heikki has suggested.
Similar to #user3614648's solution:
library(dplyr)
dat %>%
  mutate_if(is.character, funs(ifelse(. == "3", " ", .)))
x y z
1 1 1.0 1
2 2 2.5 2
3 3.3 3
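Note that funs() has since been deprecated in dplyr; with newer dplyr (1.0 or later) the same idea is usually written with across() — a sketch:
dat %>%
  mutate(across(where(is.character), ~ ifelse(.x == "3", " ", .x)))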

In R, how can I copy rows from one dataframe to another when the df being copied to has 2 additional columns?

I have a tab delimited text file with 12 columns that I am uploading to my program. I go on to create another dataframe with a structure similar to the one uploaded and add 2 more columns to it.
excelfile = read.delim(ExcelPath)
matchedPictures<- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
Now I have a function in which I do the following:
Based on a condition, I obtain the row number pictureMatchNum of the row I need to copy from excelfile to matchedPictures.
I should then copy the row from excelfile to matchedPictures. I tried a couple of different ways so far.
a.
rowNumber = nrow(matchedPictures) + 1
matchedPictures[rowNumber,1:12] <<- excelfile[pictureMatchNum,1:12]
b.
matchedPictures[rowNumber,1:12] <<- rbind(matchedPictures, excelfile[pictureWordMatches,1:12], make.row.names = FALSE)
2a. doesn't seem to work because it copies the indices from the excelfile and uses them as row names in matchedPictures - which is why I decided to go with rbind.
2b. doesn't seem to work because rbind needs the columns to be identical, and matchedPictures has 2 extra columns.
EDIT START - Including reproducible example.
Here is some reproducible code (with fewer columns and fake data)
library(stringr) # provides str_detect() and the words/fruit example vectors
excelfile <- data.frame(x = letters, y = words[length(letters)], z = fruit[length(letters)])
matchedPictures <- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
pictureMatchNum1 = match(1, str_detect("A", regex(excelfile$x, ignore_case = TRUE)))
rowNumber1 = nrow(matchedPictures) + 1
pictureMatchNum2 = match(1, str_detect("D", regex(excelfile$x, ignore_case = TRUE)))
rowNumber2 = nrow(matchedPictures) + 1
The 2 options I tried are
2a.
matchedPictures[rowNumber1,1:3] <<- excelfile[pictureMatchNum1,1:3]
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- excelfile[pictureMatchNum2,1:3]
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
OR
2b.
matchedPictures[rowNumber1,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum1,1:3], make.row.names = FALSE)
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum2,1:3], make.row.names = FALSE)
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
EDIT END
Additionally, I have also seen the suggestions in many places that rather than using empty dataframes, one should have vectors and append data to the vectors and then combine them into a dataframe. Is this suggestion valid when I have so many columns and would need to have 14 separate vectors and copy each one of them individually?
What can I do to make this work?
You could
first determine the row indices of excelfile that match your criteria
extract these rows
then generate the data to fill your columns beforeName and afterName
then append these columns to your new data frame
Example:
excelfile <- data.frame(x = letters, y = words[length(letters)],
                        z = fruit[length(letters)])
## Vector of patterns:
patternVec <- c("A", "D", "M")
## Look for appropriate rows in file 'excelfile':
indexVec <- vapply(patternVec,
                   function(myPattern) which(str_detect(myPattern,
                     regex(excelfile$x, ignore_case = TRUE))),
                   integer(1))
## Extract these rows:
matchedPictures <- excelfile[indexVec,]
## Somehow generate the data for columns 'beforeName' and 'afterName':
## I do not know how this information is generated so I just insert
## some dummy code here:
beforeNameVec <- c("xxx", "uuu", "mmm")
afterNameVec <- c("yyy", "www", "nnn")
## Then assign these variables:
matchedPictures$beforeName <- beforeNameVec
matchedPictures$afterName <- afterNameVec
matchedPictures
# x y z beforeName afterName
# a air dragonfruit xxx yyy
# d air dragonfruit uuu www
# m air dragonfruit mmm nnn
You can make this much simpler by using dplyr
library(dplyr)
library(stringr)
excelfile <- data.frame(x = letters, y = words[length(letters)], z = fruit[length(letters)],
                        stringsAsFactors = FALSE) # add stringsAsFactors to have character columns
pictureMatch <- excelfile %>%
  # create a match column
  mutate(match = ifelse(str_detect(x, "a") | str_detect(x, 'd'), 1, 0)) %>%
  # filter to only the rows that match your condition
  filter(match == 1)
pictureMatch <- pictureMatch[['x']] #convert to a vector
matchedPictures <- excelfile %>%
  filter(x %in% pictureMatch) %>% # grab the rows that match your condition
  mutate(beforeName = c('xxx','uuu'), # add your names
         afterName = c('yyy','www'))
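Finally, since the stumbling block in approach 2b was that rbind() needs identical columns: dplyr::bind_rows() (already loaded above) combines data frames even when their columns differ, padding missing ones with NA, so there is no need to pre-build an empty matchedPictures with the two extra columns. A hedged sketch using the question's pictureMatchNum1/pictureMatchNum2:
matchedPictures <- bind_rows(
  excelfile[pictureMatchNum1, ] %>% mutate(beforeName = "xxx", afterName = "yyy"),
  excelfile[pictureMatchNum2, ] %>% mutate(beforeName = "uuu", afterName = "www")
)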
