Excluding variables with grep in R

I have a dataset like the following. Of course mine is a lot bigger, with many more variables. I want to compute some statistics, for which I need to select specific variables. For example, I want to select the variables T_H_01 - T_H_03, but without T_H_G and T_H_S. I tried doing it with grep, but I don't know how to tell grep to take all the "T_H" items while excluding specific variables such as T_H_G and T_H_S.
df <- read.table(header=TRUE, text="
T_H_01 T_H_02 T_H_03 T_H_G T_H_S
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
df[,grep("T_H.",names(df))]
Thank you!

If you just want columns T_H_ followed by a number, then simply phrase that in your call to grep:
df[, grep("^T_H_\\d+$", names(df))]
If instead you want to phrase the search as explicitly excluding T_H_G and T_H_S, then you could use a negative lookahead for that:
df[, grep("^T_H_(?![GS]$).+$", names(df), perl=TRUE)]

You could do something like this
ex <- c('T_H_G', 'T_H_S' )
df[,grepl("T_H.", names(df)) & !names(df) %in% ex]

You can use this approach to filter out the unwanted columns:
df[,grep("T_H.",names(df))[!(grep("T_H.",names(df)) %in% c(grep("T_H_G",names(df)),grep("T_H_S",names(df))))]]
T_H_01 T_H_02 T_H_03
1 5 1 2
2 3 1 3
3 2 1 3
4 4 2 5
5 5 1 4
If the columns you want to exclude share a common pattern, you can fold that pattern into the grep condition instead of listing them individually.
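For example, a minimal sketch with the sample data, assuming the columns to drop all match a single pattern such as T_H_[GS]$:
keep <- grep("T_H", names(df))
drop <- grep("T_H_[GS]$", names(df))
df[, setdiff(keep, drop)]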

Related

Extract multiple variables by naming convention, for more than two types of naming convention

I'm trying to extract multiple variables that start with certain strings. For this example, I'd like to write code that extracts all variables that start with X1 and Y2.
set.seed(123)
df <- data.frame(X1_1=sample(1:5,10,TRUE),
                 X1_2=sample(1:5,10,TRUE),
                 X2_1=sample(1:5,10,TRUE),
                 X2_2=sample(1:5,10,TRUE),
                 Y1_1=sample(1:5,10,TRUE),
                 Y1_2=sample(1:5,10,TRUE),
                 Y2_1=sample(1:5,10,TRUE),
                 Y2_2=sample(1:5,10,TRUE))
I know I can use the following to extract variables that begin with "X1"
Vars_to_extract <- c("X1")
tempdf <- df[ , grep( paste0(Vars_to_extract,".*" ) , names(df), value=TRUE)]
X1_1 X1_2
1 3 5
2 3 4
3 2 1
4 2 2
5 3 3
But I need to adapt the above code to extract multiple variable types at once, specified like this:
Vars_to_extract <- c("X1","Y2")
I've been trying to do it using %in% with .* within the grep part, but with little success. I know I can write the following, which is pretty manual, merging each set of variables separately:
tempdf <- data.frame(df[, grep("X1.*", names(df), value=TRUE)] , df[, grep("Y2.*", names(df), value=TRUE)] )
X1_1 X1_2 Y2_1 Y2_2
1 3 5 1 5
2 3 4 1 5
3 2 1 2 3
4 2 2 3 1
5 3 3 4 2
However, in a real-world situation I often work with lots of variables and would have to do this numerous times. Is it possible to write it this way using %in%, or do I need to use a loop? Any help or tips will be gratefully appreciated. Thanks
We could use contains if we want to extract column names that have the substring anywhere in the string
library(dplyr)
df %>%
  select(contains(Vars_to_extract))
Or with matches, we can use a regex to specify that the string starts (^) with the given substring:
library(stringr)
df %>%
  select(matches(str_c('^(', Vars_to_extract, ')', collapse="|")))
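For Vars_to_extract <- c("X1", "Y2"), the pattern built here evaluates to an alternation of anchored prefixes:
str_c('^(', Vars_to_extract, ')', collapse="|")
# [1] "^(X1)|^(Y2)"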
With grep, we could create a single pattern by pasting the prefixes together with collapse = "|":
df[grep(paste0("^(",paste(Vars_to_extract, collapse='|'), ")"), names(df))]
# X1_1 X1_2 Y2_1 Y2_2
#1 3 5 5 3
#2 3 3 5 5
#3 2 3 3 3
#4 2 1 1 2
#5 3 4 4 5
#6 5 1 1 5
#7 4 1 1 3
#8 1 5 3 2
#9 2 3 4 2
#10 3 2 1 2
Another approach is to use startsWith with lapply and Reduce:
df[Reduce(`|`, lapply(Vars_to_extract, startsWith, x = names(df)))]
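Here lapply returns one logical vector per prefix and Reduce(`|`, ...) ORs them element-wise, so a column is kept if its name starts with any of the prefixes. Roughly:
lapply(Vars_to_extract, startsWith, x = names(df))
# [[1]]: TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE (starts with "X1")
# [[2]]: FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE (starts with "Y2")
Reduce(`|`, lapply(Vars_to_extract, startsWith, x = names(df)))
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE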

Expand.grid with unknown number of columns

I have the following data frame:
map_value LDGroup ComboNum
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 3 1
1 3 2
I want to find all combinations, selecting one from each LDGroup. expand.grid seems to work for this:
expand.grid(df[df$LDGroup==1,3],df[df$LDGroup==2,3],df[df$LDGroup==3,3])
My problem is that I have about 500 map_values I need to do this for and I do not know what number of LDGroups will exist for each map_value. Is there a way to dynamically provide the function arguments?
We can split the third column by 'LDGroup' and apply expand.grid:
out <- expand.grid(split(df$ComboNum, df$LDGroup))
names(out) <- paste0("Var", names(out))
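With the sample data this yields every combination of one ComboNum per LDGroup (3 x 2 x 2 = 12 rows). To repeat it for each of the ~500 map_values, one possible sketch (assuming the full data frame is called df) is to split by map_value first and apply the same idea per piece:
res <- lapply(split(df, df$map_value), function(d) {
  out <- expand.grid(split(d$ComboNum, d$LDGroup))
  names(out) <- paste0("Var", names(out))
  out
})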

How to tidy up a character column?

What I have:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),label=c(1,1,1,1,2,2,2,2,2),alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"))
> test_df
isolate label alignment
1 1 1 --at
2 2 1 at--
3 3 1 --at
4 4 1 --at
5 1 2 a--
6 2 2 acg
7 3 2 a--
8 4 2 a--
9 5 2 agg
What I want:
I'd like to explode the alignment field into two columns, position and character:
> test_df
isolate label aln_pos aln_char
1 1 1 1 -
2 1 1 2 -
3 1 1 3 a
4 1 1 4 t
...
Not all alignments are the same length, but all alignments with the same label have the same length.
What I've tried:
I was thinking I could use separate to first give each position its own column, then use gather to turn those columns into key-value pairs. However, I haven't been able to get the separate part right.
Since you mentioned tidyr::gather, you could try this:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),
                      label=c(1,1,1,1,2,2,2,2,2),
                      alignment=c("--at","at--","--at","--at","a--","acg","a--","a--","agg"),
                      stringsAsFactors = FALSE)
library(tidyverse)
test_df %>%
  mutate(alignment = strsplit(alignment, "")) %>%
  unnest(alignment)
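This gives one row per character but no position column. If you also want aln_pos, one possible sketch (not part of the original answer) is to number the characters within each isolate/label pair after unnesting:
test_df %>%
  mutate(aln_char = strsplit(alignment, "")) %>%
  unnest(aln_char) %>%
  group_by(isolate, label) %>%
  mutate(aln_pos = row_number()) %>%
  ungroup() %>%
  select(isolate, label, aln_pos, aln_char)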
In base R, you can use indexing along with a list created by strsplit, like this.
# make variable a character vector
test_df$alignment <- as.character(test_df$alignment)
# get list of individual characters
myList <- strsplit(test_df$alignment, split="")
Then build the data.frame:
# construct data.frame
final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
                          c("isolate", "label")],
                  aln_pos=sequence(lengths(myList)),
                  aln_char=unlist(myList))
Here, we take the first two columns of the original data.frame and repeat the rows using rep, whose second argument is a vector telling it how many times to repeat the corresponding element of its first argument; those counts are calculated with lengths. The second argument of cbind is a call to sequence on the same lengths output, which produces counts from 1 up to each corresponding length. The third argument is the unlisted character values.
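With the sample data, the intermediate pieces look like this:
lengths(myList)
# [1] 4 4 4 4 3 3 3 3 3
head(sequence(lengths(myList)), 12)
# [1] 1 2 3 4 1 2 3 4 1 2 3 4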
This returns:
head(final_df, 10)
isolate label aln_pos aln_char
1 1 1 1 -
1.1 1 1 2 -
1.2 1 1 3 a
1.3 1 1 4 t
2 2 1 1 a
2.1 2 1 2 t
2.2 2 1 3 -
2.3 2 1 4 -
3 3 1 1 -
3.1 3 1 2 -

How to remove columns of data from a data frame using a vector with a regular expression

I am trying to remove columns from a data frame using a vector of numbers, where those numbers are just part of the whole column header. What I'm looking for is something like the wildcard "*" in Unix, so that I can say I want to remove columns with labels xxxx, xxkx, etc. To illustrate what I mean, if I have the following data:
data_test_read <- read.table("batch_1_8c9.structure-edit.tsv",sep="\t", header=TRUE)
data_test_read[1:5,1:5]
samp pop X12706_10 X14223_16 X14481_7
1 BayOfIslands_s088.fq 1 4 1 3
2 BayOfIslands_s088.fq 1 4 1 3
3 BayOfIslands_s089.fq 1 4 1 3
4 BayOfIslands_s089.fq 1 4 3 3
5 BayOfIslands_s090.fq 1 4 1 3
And if I want to take out, for example, the columns with headers X12706_10 and X14481_7, the following works:
data_subs1=subset(data_test_read, select = -c(X12706_10, X14481_7))
data_subs1[1:4,1:4]
samp pop X14223_16 X15213_19
1 BayOfIslands_s088.fq 1 1 3
2 BayOfIslands_s088.fq 1 1 3
3 BayOfIslands_s089.fq 1 1 3
4 BayOfIslands_s089.fq 1 3 3
However, what I need is to be able to identify these columns by the numbers only, e.g. (12706, 14481). But if I try this, I get the following:
data_subs2=subset(data_test_read, select = -c(12706,14481))
data_subs2[1:4,1:4]
samp pop X12706_10 X14223_16
1 BayOfIslands_s088.fq 1 4 1
2 BayOfIslands_s088.fq 1 4 1
3 BayOfIslands_s089.fq 1 4 1
4 BayOfIslands_s089.fq 1 4 3
This is clearly because I haven't specified anything to do with the "x", or the "_" or what is after the underscore. I've read so many answers on using regular expressions, and I just can't seem to sort it out. Any thoughts, or pointers to what I might turn to would be appreciated.
First you can just extract the numbers from the headers
# for testing
col_names <- c("X12706_10","X14223_16","X14481_7")
# in practice, use
# col_names <- names(data_test_read)
samples <- gsub("X(\\d+)_.*","\\1",col_names)
Then find the indexes of the samples you want to drop:
samples_to_drop <- c(12706, 14481)
cols_to_drop <- match(samples_to_drop, samples)
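With the test col_names above, the intermediate values are:
samples
# [1] "12706" "14223" "14481"
cols_to_drop
# [1] 1 3
match() happily compares the numeric samples_to_drop against the character samples because both are coerced to a common type before matching.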
Then you can use
data_subs2 <- subset(data_test_read, select = -cols_to_drop)
to actually get rid of those columns.
Perhaps put this all in a function to make it easier to use
sample_subset <- function(x, drop) {
  samples <- gsub("X(\\d+)_.*", "\\1", names(x))
  subset(x, select = -match(drop, samples))
}
sample_subset(data_test_read, c(12706, 14481))

Generate dummy variable with multiple levels in R

My question involves how to generate a dummy variable in R from a character variable whose values repeat. The number of times a certain value is repeated varies.
There are several questions about this topic, but none of them seem to address my specific problem.
Below is a minimal example of the data:
df <- data.frame(ID=c("C/004","C/004","C/005","C/005","C/005","C/007",
                      "C/007","C/007"))
The result I expect is as follows:
> df
ID newID
1 C/004 1
2 C/004 1
3 C/005 2
4 C/005 2
5 C/005 2
6 C/007 3
7 C/007 3
8 C/007 3
I would like the resulting variable newID to be of numeric class and not a factor, so I would rather not use factor(..., levels=...), since it results in a factor and would also require me to supply far too many levels.
Any assistance would be greatly appreciated.
You can do this in a few ways.
match(df$ID, unique(df$ID))
#[1] 1 1 2 2 2 3 3 3
Or
as.numeric(factor(df$ID))
#[1] 1 1 2 2 2 3 3 3
Or
cumsum(!duplicated(df$ID))
#[1] 1 1 2 2 2 3 3 3
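The cumsum approach works because !duplicated(df$ID) is TRUE exactly at the first occurrence of each ID, so the running total increases by one whenever a new ID appears:
!duplicated(df$ID)
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE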
Factors are stored as integer codes underneath. Therefore, if you want a numeric variable, simply convert. (Note that as.numeric(factor(df$ID)) numbers the groups in sorted level order, while the match and cumsum approaches number them by order of first appearance; for this data the results coincide.)
df$newID <- as.numeric(factor(df$ID))
