Remove rows containing string in any vector in data frame


I have a data frame with several columns that contain strings, and I would like to remove every row that contains a certain string in any of those columns.
df <- data.frame(id = 1:10,
                 foo = runif(10),
                 sapply(letters[1:5], function(x) sample(letters, 10, TRUE)),
                 bar = runif(10))
This can be done on a single vector by specifying the vector name i.e.
df <- df[!grepl("b", df$a),]
which I can then repeat specifying each vector e.g.
df <- df[!grepl("b", df$b),]
df <- df[!grepl("b", df$c),]
df <- df[!grepl("b", df$d),]
df <- df[!grepl("b", df$e),]
but is it possible to do it in one line without having to specify which columns contain the string? Something like:
df <- df[!grepl("b", df),]

You could try
df[-which(df=="b", arr.ind=TRUE)[,1],]
or, as suggested by @docendodiscimus,
df[rowSums(df == "b") == 0,]
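The second option is preferable. When no cell equals "b", which() returns an empty index, and negative indexing with an empty vector drops every row, whereas the rowSums() test simply keeps them all. Note that both forms compare whole cell values with ==, so they remove rows where a cell is exactly "b", not rows where a cell merely contains it.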
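The difference between the two forms shows up when nothing matches. A minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from the question):

```r
# Hypothetical toy data; no cell equals "z".
toy <- data.frame(id = 1:3,
                  a = c("b", "c", "d"),
                  b = c("e", "b", "f"),
                  stringsAsFactors = FALSE)

# which() finds no matches, so the row index is empty ...
hits <- which(toy == "z", arr.ind = TRUE)[, 1]
# ... and negative indexing with an empty index selects no rows at all.
n_which <- nrow(toy[-hits, ])

# The logical test keeps every row when nothing matches.
n_rowsums <- nrow(toy[rowSums(toy == "z") == 0, ])

# With an actual match, rows 1 and 2 contain "b" and only row 3 survives.
kept <- toy[rowSums(toy == "b") == 0, ]
```

Here n_which is 0 while n_rowsums is 3, which is why the logical form is the safer default.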

Paste the columns together, then use grepl:
df[!grepl("b", paste0(df$a, df$b, df$c, df$d, df$e)), ]
Or identify the factor or character columns first, then paste each row:
is_text <- sapply(df, function(col) is.factor(col) || is.character(col))
df[!grepl("b",
          apply(df[, is_text, drop = FALSE], 1, paste0, collapse = ",")), ]

target_cols <- c("a", "b", "c", "d", "e")
df[!Reduce(`|`, lapply(df[,target_cols], function(col) grepl("b", col))),]
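The Reduce approach can probably be extended so that the columns do not have to be listed by name, which is what the question asked for. The sketch below picks the text columns automatically; the toy data frame and the helper names is_text and has_b are made up for illustration, and fixed = TRUE is an assumption that a literal substring match is wanted rather than a regular expression.

```r
# Hypothetical toy data with two text columns and two numeric columns.
toy <- data.frame(id = 1:4,
                  a = c("ab", "c", "d", "e"),
                  b = c("x", "y", "bz", "w"),
                  val = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

# Find the columns that hold text (character or factor).
is_text <- vapply(toy, function(col) is.character(col) || is.factor(col),
                  logical(1))

# One logical vector per text column, combined row-wise with OR.
has_b <- Reduce(`|`, lapply(toy[is_text], grepl, pattern = "b", fixed = TRUE))

# Keep the rows where no text column contains "b".
result <- toy[!has_b, ]
```

Unlike the == comparison, grepl matches substrings, so the rows holding "ab" and "bz" are dropped as well.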

Related

R change column names, retain part of colnames

I have a df with colnames thus:
Resampled.Band.1..raster.bsq...404.502014.Nanometers.
...
Resampled.Band.74..raster.bsq...950.851990.Nanometers.
I want them like this:
950.851990_nm
With:
orig_names <- names(df)
new_name <- gsub("Resampled.Band.", "", orig_names)
and
new_name <- gsub(".Nanometers.", "_nm", new_name)
names(all_roi_rfl) <- new_name
I achieve part of what I want: to change the first and last parts of the colnames:
1..raster.bsq...404.502014_nm
I could repeat this to clean the colnames up most of the way.
But how do I deal with the part of the colnames that varies itself, the band number?

Extract the values that you want using regex and replace the column names.
x <- c('Resampled.Band.1..raster.bsq...404.502014.Nanometers.',
'Resampled.Band.74..raster.bsq...950.851990.Nanometers.')
sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', x)
#[1] "404.502014_nm" "950.851990_nm"
This extracts the number that occurs between "raster.bsq" and "Nanometers" and appends "_nm" to it.
In your case, replacing the column names would be:
names(all_roi_rfl) <- sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', names(all_roi_rfl))
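One thing to keep in mind is that sub() returns a name unchanged when the pattern does not match, so a data frame with extra columns keeps those names as they are. A small sketch that makes this explicit; the extra column name "id" and the variable names old, pat, matched and new are hypothetical, and the pattern is a slightly stricter variant (anchored, with the dot in "raster.bsq" escaped) rather than the exact one above:

```r
old <- c("Resampled.Band.1..raster.bsq...404.502014.Nanometers.",
         "Resampled.Band.74..raster.bsq...950.851990.Nanometers.",
         "id")  # hypothetical column that should be left alone

# Anchored pattern; the capture group holds the wavelength.
pat <- "^.*raster\\.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.$"

# Record which names match so non-matching ones can be checked or reported.
matched <- grepl(pat, old)
new <- ifelse(matched, sub(pat, "\\1_nm", old), old)
```

Checking matched before assigning the new names is a cheap way to notice columns that do not follow the expected naming scheme.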

A similar answer to Ronak's but using gsub instead.
First generate a data frame:
df <-
data.frame(
Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(1, 2, 1, 2),
Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c('a', 'b', 'c', NA))
Then use gsub, matching the text before and after the piece you want to keep:
colnames(df) <- gsub(".*raster.bsq...(.+).Nanometers.", "\\1_nm", colnames(df))

Provide column and row names for data.frame in 1 line

I have a vector of row names (x) and I want to name my two columns "A" and "B".
I want to do it in one line of code, something like data.frame(row.names = x, "A", "B").
What am I doing wrong? Should I use multiple lines of code for this?

I am not quite sure what you are after, but you can rename rows and columns using dimnames as shown here; this extends to multidimensional arrays as well.
df <- data.frame(A=c(1:3), B=c(4:6))
dimnames(df)[[1]] <- row_names_vector
dimnames(df)[[2]] <- col_names_vector
Another option is
rownames(df) <- row_names_vector
colnames(df) <- col_names_vector
In one line:
dimnames(df) <- list(row_names_vector, col_names_vector)
Example:
row_names_vector <- letters[1:3]
col_names_vector <- letters[1:2]
dimnames(df) <- list(row_names_vector, col_names_vector)
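If the goal is to create the data frame and name its rows and columns in a single call, one possibility is to go through a matrix, whose dimnames argument takes both sets of names at once. This is only a sketch: the row names in x and the NA fill values are assumptions, since the question does not say what the columns should contain.

```r
x <- c("r1", "r2", "r3")  # hypothetical row names

# Build an empty numeric matrix with both sets of names, then convert it.
df_a <- as.data.frame(matrix(NA_real_, nrow = length(x), ncol = 2,
                             dimnames = list(x, c("A", "B"))))

# Equivalent with data.frame(): columns are named by the argument names,
# and each column must already have one value per row name.
df_b <- data.frame(A = rep(NA_real_, length(x)),
                   B = rep(NA_real_, length(x)),
                   row.names = x)
```

In the call from the question, the unnamed "A" and "B" are treated as column values, not column names, which is likely why it did not behave as expected.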

Removing rows from a data frame

I have this data.frame:
set.seed(1)
df <- data.frame(id1=LETTERS[sample(26,100,replace = T)],id2=LETTERS[sample(26,100,replace = T)],stringsAsFactors = F)
and this vector:
vec <- LETTERS[sample(26,10,replace = F)]
I want to remove from df any row in which either df$id1 or df$id2 is not in vec.
Is there a faster way of finding the row indices that meet this condition than this:
rm.idx <- which(!apply(df,1,function(x) all(x %in% vec)))

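With dplyr, the rows that fail the test (the ones to be removed) can be selected like this; to keep the valid rows instead, drop the negations and combine the two conditions with &:
library(dplyr)
df1 <- df %>% filter(!(id1 %in% vec) | !(id2 %in% vec))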

Looping over the columns might be faster than looping over the rows. Use lapply to loop over the columns and build a list of logical vectors with %in%, combine them element-wise with Reduce and | to flag the rows where at least one id is in vec, and use the result to subset 'df':
df[Reduce(`|`, lapply(df, `%in%`, vec)),]
If both ids must be in vec, replace | with &:
df[Reduce(`&`, lapply(df, `%in%`, vec)),]
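As a quick sanity check, the column-wise & version can be compared with the row-wise apply() from the question on the same data. The sketch below rebuilds the question's data; the names keep_apply, keep_reduce and rm_idx are introduced here for illustration only.

```r
set.seed(1)
df <- data.frame(id1 = LETTERS[sample(26, 100, replace = TRUE)],
                 id2 = LETTERS[sample(26, 100, replace = TRUE)],
                 stringsAsFactors = FALSE)
vec <- LETTERS[sample(26, 10, replace = FALSE)]

# Row-wise version from the question: TRUE where both ids are in vec.
keep_apply <- unname(apply(df, 1, function(x) all(x %in% vec)))

# Column-wise version: one %in% per column, combined with AND.
keep_reduce <- Reduce(`&`, lapply(df, `%in%`, vec))

# Both give the same answer; the rows to remove are the complement.
same <- identical(keep_apply, keep_reduce)
rm_idx <- which(!keep_reduce)
```

The column-wise form makes two vectorized %in% calls instead of one function call per row, which is where any speed-up would be expected to come from.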

Actually
rm.idx <- unique(which(!(df$id1 %in% vec) | !(df$id2 %in% vec)))
is also fast.

strsplit intermediate pattern in first column in a data frame

I have a data frame and I would like to split the first column into two columns. The separator also occurs elsewhere in the string, and I only want to split at its fourth occurrence.
data frame:
TCGA-TS-A7P1-01A-41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A-11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A-11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A-11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A-11D-A34E-05 0.8882558
desired output:
TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A 11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A 11D-A34E-05 0.8882558
I tried:
sapply(strsplit(as.character(df$ID), "-"), '[', 1:4)
However, this does not give the desired output shown above. Thank you very much.

It seems all the elements of your first column have the same length, so one simple way could be:
df <- data.frame(col1 = c("TCGA-TS-A7P1-01A-41D-A39S-05","TCGA-NQ-A57I-01A-11D-A34E-05","TCGA-3H-AB3O-01A-11D-A39S-05"),
col2 = c(0.8637304,0.7812147,0.8963944), stringsAsFactors = FALSE)
df$col1bis <- substr(df$col1,18,28)
df$col1 <- substr(df$col1,1,16)
Then I rearrange the order of the columns:
df <- df[, c(1,3,2)]
resulting in:
> df
              col1     col1bis      col2
1 TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
2 TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
3 TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
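The substr() approach relies on every ID having the same length. If that cannot be guaranteed, a regular expression that splits at the fourth hyphen is one possible alternative. This is a sketch rather than part of the original answer; the names ids, pat, first and second are made up, and perl = TRUE is used so that the non-capturing group (?: ) is available.

```r
ids <- c("TCGA-TS-A7P1-01A-41D-A39S-05",
         "TCGA-NQ-A57I-01A-11D-A34E-05")

# Group 1: four hyphen-separated fields; group 2: everything after the
# fourth hyphen.
pat <- "^((?:[^-]+-){3}[^-]+)-(.*)$"

first  <- sub(pat, "\\1", ids, perl = TRUE)
second <- sub(pat, "\\2", ids, perl = TRUE)
```

The two vectors can then be assigned to separate columns in the same way as col1 and col1bis above.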

I tried this one and it worked well.
df <- cbind(df[,1],df)
df[,1] <- substr(df[,1],1,16)
df[,2] <- substr(df[,2],18,28)

Systematic replace part of variable name with 1st element of an associated R vector

I have a dataframe in which the 1st element of an associated 'name' vector is related to subsequent named numerical vectors. I am attempting to replace the meaningless number with the 1st element of the associated name vector.
Here is an example dataframe:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2))
Each number.name vector is associated with a construct (either A or B in this case) and each number.time is associated with a time dimension. So, data.0.one_minute_ago is actually the number of A's you had one_minute_ago.
What I would like to do (because I have a large dataset with lots of these transformations) is to replace number.dimension with construct.dimension, and of course do that for each number from 0 to 9.
I've written some grep code to begin this task, but to no avail (I am stuck on retaining everything after the number).
grep( "data.[0-9].name" ,names(df), perl=TRUE)
as.character(df[1, 1])
as.character(df[1, 4])
as.character(names(df[2]))
as.character(names(df[3]))
as.character(names(df[5]))
as.character(names(df[6]))
df.1 <- (df[1, grep( "data.[0-9].name" ,names(df))])
df.1 <- data.frame(lapply(df.1, as.character), stringsAsFactors=FALSE)
constructs <- as.character(df.1[1,c(1:2)])
Here the 1st and 2nd element of constructs are the constructs associated with 0.name/0.dimension and 1.name/1.dimension respectively.
constructs [1]
constructs [2]
From there, I'm fairly certain the code would involve some names(df)[] <- but am uncertain on where to go from here.
Any and all help appreciated.
EDIT: here is the desired variable name output, simply changing the variable names (and of course retaining the values associated with them):
data.A.name data.A.one_minute_ago data.A.one_hour_ago data.B.name data.B.one_minute_ago data.B.one_hour_ago
EDIT 2: In my true dataset, the number of repetitions per dimensions (i.e., one_minute_ago, one_hour_ago, one_day_ago) can vary across construct (i.e, two dimensions for one construct and 3 for another, and 9 for another). I would like the solution to take that into account.
Here is a modified sample dataset to reflect this subtlety:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2),
data.2.name = c("C", "C", "C"), data.2.one_minute_ago = c(3,3,2), data.2.one_hour_ago = c(5,6,2), data.2.one_day_ago = c(3,2,3))

We create a grouping index 'indx' from the number in each column name and split the column names by it ('lst'). We then take one element from each column whose name ends in 'name' ('r1'), and use 'Map' and gsub to replace the number in each element of 'lst' with the corresponding element of 'r1'.
indx <- gsub('[^0-9]+', '', names(df))
lst <- split(names(df), indx)
r1 <- as.character(unlist(df[1,grep('name', names(df))]))
lst2 <- Map(function(x,y) gsub('[0-9]+', y, x), lst, r1)
names(df) <- unsplit(lst2, indx)
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago"
#[4] "data.B.name" "data.B.one_minute_ago" "data.B.one_hour_ago"
#[7] "data.C.name" "data.C.one_minute_ago" "data.C.one_hour_ago"
#[10] "data.C.one_day_ago"
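The same idea can also be written with a lookup vector that maps each number to its construct, which may be easier to follow and avoids stripping digits from the whole column name. This is a sketch under the assumption that every column is named data.<number>.<dimension>; the shortened example data and the names nm, num, name_cols and lookup are introduced for illustration.

```r
# Shortened, hypothetical version of the question's data.
df <- data.frame(data.0.name = c("A", "A"),
                 data.0.one_minute_ago = c(1, 2),
                 data.1.name = c("B", "B"),
                 data.1.one_hour_ago = c(5, 6),
                 stringsAsFactors = FALSE)

nm <- names(df)

# The number embedded in each column name, as a string.
num <- sub("^data\\.([0-9]+)\\..*$", "\\1", nm)

# Positions of the data.<number>.name columns.
name_cols <- grep("^data\\.[0-9]+\\.name$", nm)

# Lookup: number -> first value of the matching name column.
lookup <- setNames(
  vapply(df[name_cols], function(col) as.character(col[1]), character(1)),
  num[name_cols]
)

# Rebuild each name with the construct in place of the number.
names(df) <- paste0("data.", lookup[num], sub("^data\\.[0-9]+", "", nm))
```

Because the lookup is indexed by the number itself, constructs with different numbers of dimensions are handled without any extra bookkeeping.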

I think this works, assuming the constructs vector from the question and exactly three columns per construct:
library(stringr)
splits <- str_split(names(df), "\\.")
trailing_name <- sapply(splits, "[[", 3)
constructs <- rep(constructs, each = 3)
constructs
# [1] "A" "A" "A" "B" "B" "B"
names(df) <- str_c("data", constructs, trailing_name, sep=".")
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago" "data.B.name"
# [5] "data.B.one_minute_ago" "data.B.one_hour_ago"
