Extract multiple variables by naming convention, for more than two types of naming convention - r

I'm trying to extract multiple variables whose names start with certain strings. For this example I'd like to write code that extracts all variables that start with X1 or Y2.
set.seed(123)
df <- data.frame(X1_1=sample(1:5,10,TRUE),
X1_2=sample(1:5,10,TRUE),
X2_1=sample(1:5,10,TRUE),
X2_2=sample(1:5,10,TRUE),
Y1_1=sample(1:5,10,TRUE),
Y1_2=sample(1:5,10,TRUE),
Y2_1=sample(1:5,10,TRUE),
Y2_2=sample(1:5,10,TRUE))
I know I can use the following to extract variables that begin with "X1"
Vars_to_extract <- c("X1")
tempdf <- df[ , grep( paste0(Vars_to_extract,".*" ) , names(df), value=TRUE)]
X1_1 X1_2
1 3 5
2 3 4
3 2 1
4 2 2
5 3 3
But I need to adapt the above code to extract several variable types at once, specified like this
Vars_to_extract <- c("X1","Y2")
I've been trying to do it using %in% with .* within the grep part, but with little success. I know I can write the following, which is pretty manual, merging each set of variables separately.
tempdf <- data.frame(df[, grep("X1.*", names(df), value=TRUE)] , df[, grep("Y2.*", names(df), value=TRUE)] )
X1_1 X1_2 Y2_1 Y2_2
1 3 5 1 5
2 3 4 1 5
3 2 1 2 3
4 2 2 3 1
5 3 3 4 2
However, in real-world situations I often work with lots of variables and would have to do this numerous times. Is it possible to write it this way using %in%, or do I need to use a loop? Any help or tips would be gratefully appreciated. Thanks

We could use contains if we want to extract column names that have the substring anywhere in the string
library(dplyr)
df %>%
select(contains(Vars_to_extract))
Or with matches, we can use a regex to specify that the string starts (^) with the specific substring
library(stringr)
df %>%
select(matches(str_c('^(', Vars_to_extract, ')', collapse="|")))
With grep, we could create a single pattern by paste with collapse = "|"
df[grep(paste0("^(",paste(Vars_to_extract, collapse='|'), ")"), names(df))]
# X1_1 X1_2 Y2_1 Y2_2
#1 3 5 5 3
#2 3 3 5 5
#3 2 3 3 3
#4 2 1 1 2
#5 3 4 4 5
#6 5 1 1 5
#7 4 1 1 3
#8 1 5 3 2
#9 2 3 4 2
#10 3 2 1 2
Or another approach is to use startsWith with lapply and Reduce
df[Reduce(`|`, lapply(Vars_to_extract, startsWith, x = names(df)))]
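As a minimal sketch, the two base R approaches above can be wrapped in a small helper. The function name `select_by_prefix` is hypothetical (it does not come from any package), and the data frame here is a cut-down stand-in for the one in the question. Because it relies on `startsWith`, the prefixes are matched literally rather than as regular expressions.

```r
# Hypothetical helper: keep the columns whose names start with any given prefix.
select_by_prefix <- function(data, prefixes) {
  # One logical vector per prefix, combined element-wise with OR
  keep <- Reduce(`|`, lapply(prefixes, function(p) startsWith(names(data), p)))
  data[, keep, drop = FALSE]
}

# Small stand-in data frame (values are arbitrary)
df <- data.frame(X1_1 = 1:3, X1_2 = 4:6, X2_1 = 7:9,
                 Y1_1 = 1:3, Y2_1 = 4:6, Y2_2 = 7:9)
Vars_to_extract <- c("X1", "Y2")

names(select_by_prefix(df, Vars_to_extract))
# "X1_1" "X1_2" "Y2_1" "Y2_2"

# The equivalent anchored regular expression, built once from the vector
pattern <- paste0("^(", paste(Vars_to_extract, collapse = "|"), ")")
pattern
# "^(X1|Y2)"
names(df)[grepl(pattern, names(df))]
# "X1_1" "X1_2" "Y2_1" "Y2_2"
```

The `drop = FALSE` keeps the result a data frame even when only one column matches.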

Related

Rename dataframe column names by switching string patterns

I have following dataframe and I want to rename the column names to c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
> dataf <- data.frame(
+ WBC_D7_MIN=1:4,WBC_D7_MAX=1:4,DBP_D3_MIN=1:4
+ )
> dataf
WBC_D7_MIN WBC_D7_MAX DBP_D3_MIN
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
> names(dataf)
[1] "WBC_D7_MIN" "WBC_D7_MAX" "DBP_D3_MIN"
Probably the rename_with function in tidyverse can do it, but I cannot figure out how.
You can use capture groups with sub to extract values in order -
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
Same regex can be used in rename_with -
library(dplyr)
dataf %>% rename_with(~ sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', .))
# WBC_MIN_D7 WBC_MAX_D7 DBP_MIN_D3
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
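To see what the capture groups do, here is a small sketch that works on the name vector alone. Note that `\\w` also matches an underscore, so for names with more than three fields the split can be surprising; the `[^_]+` variant shown here is an alternative I am assuming is acceptable, where each group matches exactly one underscore-free field.

```r
old_names <- c("WBC_D7_MIN", "WBC_D7_MAX", "DBP_D3_MIN")

# Three underscore-free fields; swap the second and third in the replacement
new_names <- sub("^([^_]+)_([^_]+)_([^_]+)$", "\\1_\\3_\\2", old_names)
new_names
# "WBC_MIN_D7" "WBC_MAX_D7" "DBP_MIN_D3"

# Names that do not have exactly three fields are returned unchanged
sub("^([^_]+)_([^_]+)_([^_]+)$", "\\1_\\3_\\2", c("ID", "A_B"))
# "ID" "A_B"
```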
If you already know the new names, you can assign them directly as a character vector with names(yourDF) <- c("A","B",...,"Z"):
names(dataf) <- c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")

Excluding variables with grep in R

I have a dataset like the following. Of course mine is a lot bigger with much more variables. I want to compute some stuff, for which I need to choose specific variables. For example I want to choose the variables T_H_01 - T_H_03, but I don't want to have T_H_G and T_H_S within. I tried doing it with grep, but I don't know how to tell the grep function to take all the "T_H" Items but exclude specific variables such as T_H_G and T_H_S.
df <- read.table(header=TRUE, text="
T_H_01 T_H_02 T_H_03 T_H_G T_H_S
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
df[,grep("T_H.",names(df))]
Thank you!
If you just want columns T_H_ followed by a number, then simply phrase that in your call to grep:
df[, grep("^T_H_\\d+$", names(df))]
If instead you want to phrase the search as explicitly excluding T_H_G and T_H_S, then you could use a negative lookahead for that:
df[, grep("^T_H_(?![GS]$).+$", names(df), perl=TRUE)]
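The difference between the two patterns shows up once a name such as `T_H_GX` exists. That name is hypothetical and added only for illustration; the sketch below tests both patterns on the bare name vector.

```r
nms <- c("T_H_01", "T_H_02", "T_H_03", "T_H_G", "T_H_S", "T_H_GX")

# Whitelist: T_H_ followed only by digits
grepl("^T_H_\\d+$", nms)
# TRUE TRUE TRUE FALSE FALSE FALSE

# Blacklist: anything after T_H_ except exactly "G" or exactly "S"
grepl("^T_H_(?![GS]$).+$", nms, perl = TRUE)
# TRUE TRUE TRUE FALSE FALSE TRUE
```

The lookahead needs `perl = TRUE` because R's default regex engine does not support it.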
You could do something like this
ex <- c('T_H_G', 'T_H_S' )
df[,grepl("T_H.", names(df)) & !names(df) %in% ex]
You can use this approach to filter out the columns you don't need:
df[,grep("T_H.",names(df))[!(grep("T_H.",names(df)) %in% c(grep("T_H_G",names(df)),grep("T_H_S",names(df))))]]
T_H_01 T_H_02 T_H_03
1 5 1 2
2 3 1 3
3 2 1 3
4 4 2 5
5 5 1 4
If you have a generic pattern to exclude specific columns, you can improve the grep condition with it.
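One way to read "a generic pattern to exclude" is to keep an include pattern and an exclude pattern as two separate logical vectors and combine them. This is only a sketch; the suffix rule `_(G|S)$` and the extra `Other` column are assumptions made for the example.

```r
df <- data.frame(T_H_01 = 1:2, T_H_02 = 3:4, T_H_G = 5:6, T_H_S = 7:8, Other = 9:10)

include <- grepl("^T_H_", names(df))     # candidate columns
exclude <- grepl("_(G|S)$", names(df))   # assumed rule for unwanted columns

subset_df <- df[, include & !exclude, drop = FALSE]
names(subset_df)
# "T_H_01" "T_H_02"
```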

Convert list df (with multiple columns) to numeric

I have a df below:
view(fds)
#1 #2 #3 #4
1# 1 3 4 2
2# 4 5 3 2
3# 2 5 3 1
4# 3 5 1 3
I want to run fds.sum <- rowSums(fds) but I get an "Error in rowSums(fds) : 'x' must be numeric"... Then, when I try fds.num <- as.numeric(fds), I get an "Error: 'list' object cannot be coerced to type 'double'"...
I have tried fds.num <- lapply(fds, as.numeric) but that gives me:
fds.num list[4] List of Length 4
1# double[101] 1 4 2 3
2# double[101] 3 5 5 5
3# double[101] 4 3 3 1
4# double[101] 2 2 1 3
I just want a sum of my rows in a new column such that:
#1 #2 #3 #4 sum
1# 1 3 4 2 10
2# 4 5 3 2 14
3# 2 5 3 1 11
4# 3 5 1 3 12
Anyone know how to do that?
If we want to use the OP's code, just Reduce with +
fds$sum <- Reduce(`+`, lapply(fds, as.numeric) )
Or after converting to numeric, bind them as a matrix or update the original data
fds[] <- lapply(fds, as.numeric)
fds$sum <- rowSums(fds, na.rm = TRUE)
Or it can be done on the fly with sapply
fds$sum <- rowSums(sapply(fds, as.numeric))
Or even without doing as.numeric, can be automated with type.convert
fds$sum <- rowSums(type.convert(fds, as.is = TRUE))
The error shown in the OP's code is a result of applying rowSums directly to a list, as lapply always returns a list.
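A reproducible sketch of the situation, assuming the columns were read in as character (the question does not show the column types, and the names `a` to `d` are placeholders). Assigning into `fds[]` keeps the data frame structure while replacing each column.

```r
# Placeholder data: numbers stored as text, as often happens after import
fds <- data.frame(a = c("1", "4", "2", "3"),
                  b = c("3", "5", "5", "5"),
                  c = c("4", "3", "3", "1"),
                  d = c("2", "2", "1", "3"),
                  stringsAsFactors = FALSE)

# Convert every column in place, then sum across rows
fds[] <- lapply(fds, as.numeric)
fds$sum <- rowSums(fds)
fds$sum
# 10 14 11 12
```

If the columns are factors rather than character, `as.numeric` returns the internal level codes, so `as.numeric(as.character(x))` is the safer conversion in that case.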

How to apply function to colnames in tidyverse

Just like in title: is there any function that allows applying another function to column names of data frame? I mean something like forcats::fct_relabel that applies some function to factor labels.
To give an example, suppose I have a data.frame as below:
X<-data.frame(
firababst = c(1,1,1),
secababond = c(2,2,2),
thiababrd = c(3,3,3)
)
X
firababst secababond thiababrd
1 1 2 3
2 1 2 3
3 1 2 3
Now I wish to get rid of abab from column names by applying stringr::str_remove. My workaround involves magrittr::set_colnames:
X %>%
set_colnames(colnames(.) %>% str_remove('abab'))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
Can you suggest a more straightforward way? Ideally, something like:
X %>%
magic_foo(str_remove, 'abab')
You can do:
X %>%
rename_all(~ str_remove(., "abab"))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
With base R, we can do
names(X) <- sub("abab", "", names(X))
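The `magic_foo(str_remove, 'abab')` shape asked for in the question can be sketched in a few lines of base R. The names `rename_cols` and `remove_text` are hypothetical helpers written for this example; `remove_text` stands in for `stringr::str_remove` so that the block has no package dependencies.

```r
# Hypothetical helper: apply f (plus extra arguments) to the column names
rename_cols <- function(data, f, ...) {
  names(data) <- f(names(data), ...)
  data
}

# Stand-in for stringr::str_remove, using a literal (non-regex) match
remove_text <- function(x, what) sub(what, "", x, fixed = TRUE)

X <- data.frame(firababst = c(1, 1, 1),
                secababond = c(2, 2, 2),
                thiababrd = c(3, 3, 3))

X2 <- rename_cols(X, remove_text, "abab")
names(X2)
# "first" "second" "third"
```

In recent dplyr versions `rename_all()` is superseded by `rename_with()`, which follows the same idea; extra arguments to the function should be passed by name, for example `rename_with(X, str_remove, pattern = "abab")`, since the third positional argument is the column selection.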

R: Converting wide format to long format with multiple 3 time period variables [duplicate]

Apologies if this is a simple question, but I haven't been able to find a simple solution after searching. I'm fairly new to R, and am having trouble converting wide format to long format using either the melt (reshape2) or gather (tidyr) functions. The dataset that I'm working with contains 22 different time variables, each measured at 3 time periods. The problem occurs when I try to convert all of these from wide to long format at once. I have had success in converting them individually, but that is very long and inefficient, so I was wondering if anyone could suggest a simpler solution. Below is a sample dataset I created that is formatted in a similar way to the dataset I am working with:
Subject <- c(1, 2, 3)
BlueTime1 <- c(2, 5, 6)
BlueTime2 <- c(4, 6, 7)
BlueTime3 <- c(1, 2, 3)
RedTime1 <- c(2, 5, 6)
RedTime2 <- c(4, 6, 7)
RedTime3 <- c(1, 2, 3)
GreenTime1 <- c(2, 5, 6)
GreenTime2 <- c(4, 6, 7)
GreenTime3 <- c(1, 2, 3)
sample.df <- data.frame(Subject, BlueTime1, BlueTime2, BlueTime3,
RedTime1, RedTime2, RedTime3,
GreenTime1,GreenTime2, GreenTime3)
A solution that has worked for me is to use the gather function from tidyr, arranging the data by Subject (so that each subject's data is grouped together), and then selecting only the subject, time period, and rating. This was done for each variable (in my case 22).
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
BlueGather <- gather(sample.df, Time_Blue, Rating_Blue, c(BlueTime1,
BlueTime2,
BlueTime3))
BlueSorted <- arrange(BlueGather, Subject)
BlueSubtracted <- select(BlueSorted, Subject, Time_Blue, Rating_Blue)
After this code, I combine everything into one data frame. This seems very slow and inefficient to me, and I was hoping that someone could help me find a simpler solution. Thank you!
The idea here is to gather() all the time variables (all variables but Subject), use separate() on key to split them into a label and a time and then spread() the label and value to obtain your desired output.
library(dplyr)
library(tidyr)
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time"), "(?<=[a-z])(?=[0-9])") %>%
spread(label, value)
Which gives:
# Subject time BlueTime GreenTime RedTime
#1 1 1 2 2 2
#2 1 2 4 4 4
#3 1 3 1 1 1
#4 2 1 5 5 5
#5 2 2 6 6 6
#6 2 3 2 2 2
#7 3 1 6 6 6
#8 3 2 7 7 7
#9 3 3 3 3 3
Note
Here we use the regex in separate() from this answer by @RichardScriven to split the column on the first encountered digit.
Edit
I understand from your comments that your dataset column names are actually in the form ColorTime_Pre, ColorTime_Post, ColorTime_Final. If that is the case, you don't have to specify a regex in separate() as the default one sep = "[^[:alnum:]]+" will match your _ and split the key into label and time accordingly:
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time")) %>%
spread(label, value)
Will give:
# Subject time BlueTime GreenTime RedTime
#1 1 Final 1 1 1
#2 1 Post 4 4 4
#3 1 Pre 2 2 2
#4 2 Final 2 2 2
#5 2 Post 6 6 6
#6 2 Pre 5 5 5
#7 3 Final 3 3 3
#8 3 Post 7 7 7
#9 3 Pre 6 6 6
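The gather/separate/spread pipeline above can also be written as a single call in tidyr 1.0.0 or later, where `pivot_longer()` is the recommended replacement for `gather()`. This is a sketch rather than part of the original answer: the special `".value"` entry in `names_to` tells tidyr that the first captured group names the output column, and the pattern `^(\\D+)(\\d+)$` assumes every measure column is a non-digit label followed by a trailing number.

```r
library(tidyr)

sample.df <- data.frame(
  Subject    = c(1, 2, 3),
  BlueTime1  = c(2, 5, 6), BlueTime2  = c(4, 6, 7), BlueTime3  = c(1, 2, 3),
  RedTime1   = c(2, 5, 6), RedTime2   = c(4, 6, 7), RedTime3   = c(1, 2, 3),
  GreenTime1 = c(2, 5, 6), GreenTime2 = c(4, 6, 7), GreenTime3 = c(1, 2, 3)
)

# Group 1 (label) becomes the column name, group 2 (digits) becomes `time`
long <- pivot_longer(
  sample.df,
  cols = -Subject,
  names_to = c(".value", "time"),
  names_pattern = "^(\\D+)(\\d+)$"
)
long
```

The result has one row per Subject and time, with `BlueTime`, `RedTime` and `GreenTime` as value columns; `time` comes back as character unless it is converted afterwards.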
We can use melt from data.table which can take multiple measure columns as a regex pattern
library(data.table)
melt(setDT(sample.df), measure = patterns("^Blue", "^Red", "^Green"),
value.name = c("BlueTime", "RedTime", "GreenTime"), variable.name = "time")
# Subject time BlueTime RedTime GreenTime
#1: 1 1 2 2 2
#2: 2 1 5 5 5
#3: 3 1 6 6 6
#4: 1 2 4 4 4
#5: 2 2 6 6 6
#6: 3 2 7 7 7
#7: 1 3 1 1 1
#8: 2 3 2 2 2
#9: 3 3 3 3 3
Or, as @StevenBeaupré mentioned in the comments, if there are many patterns, one option would be to use the names of the dataset after extracting the substring as the patterns argument
melt(setDT(sample.df), measure = patterns(as.list(unique(sub("\\d+", "",
names(sample.df)[-1])))),value.name = c("BlueTime", "RedTime",
"GreenTime"), variable.name = "time")
If your goal is to convert the three colors to long this can be accomplished with the base R reshape function:
reshape(sample.df, idvar="subject", varying=2:length(sample.df), sep="", direction="long")
Subject time BlueTime RedTime GreenTime subject
1.1 1 1 2 2 2 1
2.1 2 1 5 5 5 2
3.1 3 1 6 6 6 3
1.2 1 2 4 4 4 1
2.2 2 2 6 6 6 2
3.2 3 2 7 7 7 3
1.3 1 3 1 1 1 1
2.3 2 3 2 2 2 2
3.3 3 3 3 3 3 3
The time variable captures the 1,2,3 in the names of the wide variables. The varying argument tells reshape which variables should be converted to long. The sep argument tells reshape to look for numbers at the end of the varying variables that are not separated by any characters, while the direction argument tells the function to attempt a long conversion.
I always add the id variable for future reference, even if it is not strictly necessary.
If your data.frame doesn't actually have the numbers for the time variable, a fairly simple solution is to change the variable names so that they do. For example, the following would replace "_Pre" with "1" at the end of any such variables.
names(df)[grep("_Pre$", names(df))] <- gsub("_Pre$", "1",
names(df)[grep("_Pre$", names(df))])
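Putting the rename step and the reshape call together, a minimal sketch might look like the following. The data frame `wide`, the suffix-to-number mapping, and the choice of `idvar = "Subject"` (the existing id column) are all assumptions made for this example.

```r
# Hypothetical wide data with text suffixes instead of numbers
wide <- data.frame(
  Subject        = c(1, 2, 3),
  BlueTime_Pre   = c(2, 5, 6), BlueTime_Post = c(4, 6, 7), BlueTime_Final = c(1, 2, 3),
  RedTime_Pre    = c(2, 5, 6), RedTime_Post  = c(4, 6, 7), RedTime_Final  = c(1, 2, 3)
)

# Assumed mapping from suffix to time index
suffix_map <- c(Pre = "1", Post = "2", Final = "3")
for (s in names(suffix_map)) {
  # e.g. "BlueTime_Pre" becomes "BlueTime1"
  names(wide) <- sub(paste0("_", s, "$"), suffix_map[[s]], names(wide))
}

# sep = "" lets reshape split each name into label and trailing number
long <- reshape(wide, idvar = "Subject", varying = 2:ncol(wide),
                sep = "", direction = "long")
long
```

Each subject ends up with three rows, one per time index, with `BlueTime` and `RedTime` as the value columns.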
