Match patterns in a vector against strings in a data frame [duplicate] - r

This question already has answers here: Matching multiple patterns (6 answers). Closed 6 years ago.
I have a data frame with two columns of different types and a vector of names.
How can I select the rows of the data frame whose name matches any of the strings in the vector?
name = c("p4#HPS1", "p7#HPS2", "p4#HPS3", "p7#HPS4", "p7#HPS5", "p9#HPS6", "p11#HPS7", "p10#HPS8", "p15#HPS9")
expression = c(118.84, 90.04, 106.6, 104.99, 93.2, 66.84, 90.02, 108.03, 111.83)
dataset <- data.frame(name, expression)  # cbind() would coerce expression to character
nam <- c("HPS5", "HPS6", "HPS9", "HPS2")
The function should return a data frame containing only the matching rows.
I tried
dataset[mapply(grepl,nam,dataset$name)]
but it didn't work.

We can collapse 'nam' into a single alternation pattern with paste(collapse = "|"), pass it as the pattern argument of grep to get the indices of the matching rows, and subset 'dataset' with those indices.
dataset[grep(paste(nam, collapse="|"), dataset$name),]
To keep the OP's mapply approach, wrap the 'name' column in a list. Otherwise mapply pairs individual elements of 'name' with individual elements of 'nam', and because the two lengths differ (9 and 4), it warns that the longer argument is not a multiple of the length of the shorter. With the list wrapper, mapply returns a logical matrix with one row per element of 'name' and one column per pattern. Taking rowSums and checking whether it is greater than 0 gives a logical vector marking the rows that match at least one pattern, which we use to subset the rows.
dataset[rowSums(mapply(grepl, nam, list(dataset$name)))>0,]
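As a self-contained sketch of the two approaches above, the snippet below rebuilds the question's data and checks that both return the same rows. The names pattern, by_regex, hits and by_matrix are introduced here for illustration only, and data.frame() is used instead of cbind() so that expression stays numeric.

```r
# Data from the question
name <- c("p4#HPS1", "p7#HPS2", "p4#HPS3", "p7#HPS4", "p7#HPS5",
          "p9#HPS6", "p11#HPS7", "p10#HPS8", "p15#HPS9")
expression <- c(118.84, 90.04, 106.6, 104.99, 93.2, 66.84, 90.02, 108.03, 111.83)
dataset <- data.frame(name, expression, stringsAsFactors = FALSE)
nam <- c("HPS5", "HPS6", "HPS9", "HPS2")

# Approach 1: one alternation pattern, "HPS5|HPS6|HPS9|HPS2"
pattern <- paste(nam, collapse = "|")
by_regex <- dataset[grepl(pattern, dataset$name), ]

# Approach 2: one grepl() call per pattern, giving a 9 x 4 logical matrix
hits <- mapply(grepl, nam, list(dataset$name))
by_matrix <- dataset[rowSums(hits) > 0, ]

print(by_regex)
```

Both approaches match substrings, so a pattern such as "HPS1" would also match a name ending in "HPS10". If that matters for your data, anchoring each pattern (for example by appending "$") is one possible fix.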

Related

Passing a vector through a select statement [duplicate]

This question already has answers here: grep using a character vector with multiple patterns (11 answers). Closed 3 years ago.
I'm looking for a way to pass a vector of strings into a select statement. I want to subset a data frame so that it only keeps the variables whose names contain one of the strings in my vector. I don't want exact matches, so I need something like contains, because the variable names in the data frame include some text that is not in my vector.
Here is an example of the vector I want to pass into my select statement.
c("clrs_name", "_clrs_sitedetails_value", "_clrs_targetlicence_value",
  "clrs_licenceclass", "clrs_licenceownership", "clrs_type", "statuscode")
For example, I want to extract the variable "odate_value_clrs_name" from my data frame, and the string "clrs_name" in the vector should pick it up, but I am not sure how to combine contains and a vector in a select statement.
We can use matches in select after collapsing the pattern vector with | using either paste from base R or str_c (note that str_c returns NA if any element is NA). This does not raise an error or warning if one of the patterns has no match among the column names.
library(dplyr)
library(stringr)
df1 %>%
  select(matches(str_c(v1, collapse = "|")))
where
v1 <- c("clrs_name", "_clrs_sitedetails_value", "_clrs_targetlicence_value",
        "clrs_licenceclass", "clrs_licenceownership", "clrs_type", "statuscode")
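For readers without dplyr, the same idea can be sketched in base R. The data frame df1 below is hypothetical (only the column name odate_value_clrs_name comes from the question), and pattern, keep and selected are names made up for this example.

```r
v1 <- c("clrs_name", "_clrs_sitedetails_value", "_clrs_targetlicence_value",
        "clrs_licenceclass", "clrs_licenceownership", "clrs_type", "statuscode")

# Hypothetical data frame standing in for the asker's data
df1 <- data.frame(odate_value_clrs_name = 1:2,
                  new_clrs_type = c("a", "b"),
                  statuscode = c(0L, 1L),
                  other_col = c(TRUE, FALSE))

# Collapse the vector into one alternation pattern
pattern <- paste(v1, collapse = "|")

# TRUE for every column name that contains at least one of the strings
keep <- grepl(pattern, names(df1))

# drop = FALSE keeps a data frame even if only one column matches
selected <- df1[, keep, drop = FALSE]
print(names(selected))
```

One difference to keep in mind: matches() ignores case by default, whereas grepl() is case sensitive unless ignore.case = TRUE is passed.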

R: filtering elements of large vector that appear in a smaller vector [duplicate]

This question already has answers here: Select rows from a data frame based on values in a vector (3 answers). Closed 3 years ago.
Suppose we have a numeric vector. Actually, suppose we have a dataframe consisting of a single column.
example = data.frame("column" = rnorm(10000, 10, 3))
We'll be treating it as a dataframe in order to use the filter function of the dplyr package.
Also, suppose we have another vector of smaller length. This particular vector is just for the sake of the example. It doesn't necessarily have to be a sequence.
numbers = 8:100
What I would like to do is to keep those values of the larger vector that are equal to any of the values of the smaller vector and discard those values that are not.
Fair enough. The filter function can do that. Except that I would have to write this:
filtered = dplyr::filter(example, column == numbers[1] | column == numbers[2] | ... | column == numbers[length(numbers)])
I would have to write the condition column == numbers[i] for each of the elements of the numbers vector.
Executing this code
filtered = dplyr::filter(example, column == numbers)
gives as output a dataframe called filtered that consists of a single column with no rows. There are no rows because, since all the rows of the example dataframe consist of scalars, none of those rows is equal to the whole numbers vector.
Is there a smarter method that doesn't require me to write that condition for each element of the numbers vector?
You can use the %in% operator to check whether each of your values occurs anywhere in the vector.
Code:
new_data <- old_data %>%
  dplyr::filter(column %in% numbers)
Are you looking for:
filtered <- dplyr::filter(example, column %in% numbers)
An option with base R
subset(example, column %in% numbers)
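A small base R sketch may help show why %in% behaves differently from ==. The vector x and the names pos_match, set_match and filtered are made up for this example. Note also that values drawn with rnorm() are almost never exactly equal to an integer, so the sketch rounds them first; that rounding is an assumption about what the asker wants, not part of the original question.

```r
numbers <- 8:100
x <- c(8, 1, 9, 200)

# == compares position by position, recycling the shorter vector:
# c(8, 1, 9, 200) is compared against c(8, 9, 8, 9)
pos_match <- x == c(8, 9)

# %in% asks, for each element of x, whether it occurs anywhere in the set
set_match <- x %in% c(8, 9)

print(pos_match)
print(set_match)

# Applied to a data frame like the one in the question (values rounded so
# that exact matches with the integers in numbers are possible)
set.seed(1)
example <- data.frame(column = round(rnorm(10000, 10, 3)))
filtered <- subset(example, column %in% numbers)
print(nrow(filtered))
```

Here == misses the 9 in the third position because it is compared against 8, while %in% finds it.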

How do I extract elements from a dataframe by pattern? [duplicate]

This question already has answers here: Subset data to contain only columns whose names match a condition (10 answers). Closed 3 years ago.
I have a dataframe dat that has many variables like
"x_tp1_y"
"g_tp1_z"
"f_tp2_h"
I would like to extract elements that include "tp1".
I already tried this:
grep("tp1", dat)
grepl("tp1", dat)
dat["tp1",]
I just want R to give me elements with this pattern so I do not have to type in all variable names that are in the dataframe dat.
Like this:
command that extracts elements with pattern "tp1"
R returns parts of the dataframe that have pattern "tp1":
x_tp1_y g_tp1_z
      1       2
      0       3
And then I would like to create a new dataframe.
I know that I can just use
newdat <- data.frame( dat[[1]], dat[ c(1:30)])
but I have so many elements in my dataframe that this would take ages.
Thank you for your help!
dat[,grep("tp1", colnames(dat))]
grep returns the positions of the column names of the data.frame (the vector colnames(dat)) that contain the pattern, and "[" then subsets the data frame to those columns.
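A minimal sketch of this answer on a toy data frame follows. The two tp1 columns use the values shown in the question; the values in f_tp2_h and the names tp1_cols and newdat are made up for illustration.

```r
# Toy data frame with the three column names from the question
dat <- data.frame(x_tp1_y = c(1, 0),
                  g_tp1_z = c(2, 3),
                  f_tp2_h = c(5, 6))

# Positions of the column names that contain "tp1"
tp1_cols <- grep("tp1", colnames(dat))

# drop = FALSE keeps the result a data frame even if only one column matches
newdat <- dat[, tp1_cols, drop = FALSE]
print(newdat)
```

The attempts in the question did not work because grep("tp1", dat) searches the contents of the columns rather than their names, and dat["tp1", ] looks for a row named "tp1".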

Loop Through Column Names with Similar Structure [duplicate]

This question already has answers here: How to extract columns with same name but different identifiers in R (3 answers). Closed 3 years ago.
I have a very large dataset. Of those, a small subset have the same column name with an indexing value that is numeric (unlike the post "How to extract columns with same name but different identifiers in R" where the indexing value is a string). For example
Q_1_1, Q_1_2, Q_1_3, ...
I am looking for a way to either loop through just those columns using the indices or to subset them all at once.
I have tried to use paste() to write their column names but have had no luck. See sample code below
# Define Dataframe
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5))
# Define the Column Name Using Paste
cn <- as.symbol(paste("Q_1_",1, sep=""))
cn
df$cn
df$Q_1_1
I want df$cn to return the same thing as df$Q_1_1, but df$cn returns NULL.
If you are just trying to subset your data frame by column name, you could use dplyr to select all your indexed columns at once, with a regex that matches every column name following a certain pattern:
library(dplyr)
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5), "A_1" = rep(4,5))
newdf <- df %>%
  dplyr::select(matches("Q_[0-9]_[0-9]"))
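Each [0-9] in the regex matches a single digit, so the pattern matches column names containing Q_, a digit, _, and another digit. Depending on which variables you're trying to match, you might have to change the regular expression.
The problem with your solution is that df$cn does not evaluate cn: $ looks for a column literally named "cn", which doesn't exist, so it returns NULL. To index by a name stored in a variable, keep the name as a character string (drop as.symbol) and use df[[cn]].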
I hope this helps!
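For the looping part of the question, a base R sketch might look like the following. It assumes the column names are kept as plain character strings and indexed with [[ ]], which accepts a name stored in a variable; first_col, q_cols and col_means are names made up for this example, and the anchored regex is just one possible way to restrict the match.

```r
df <- data.frame(Q_1_1 = rep(1, 5), Q_1_2 = rep(2, 5),
                 Q_1_3 = rep(3, 5), A_1 = rep(4, 5))

# Build the column name as a string and index with [[ ]]
cn <- paste0("Q_1_", 1)
first_col <- df[[cn]]

# All column names of the form Q_<digits>_<digits>
q_cols <- grep("^Q_[0-9]+_[0-9]+$", names(df), value = TRUE)

# Subset them all at once ...
newdf <- df[q_cols]

# ... or loop over them by name, here computing each column's mean
col_means <- vapply(q_cols, function(nm) mean(df[[nm]]), numeric(1))
print(col_means)
```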

Replace column names with single line of code [duplicate]

This question already has answers here: Rename multiple columns by names (20 answers). Closed 4 years ago.
I have a data frame with 4 columns and want to replace only the 2nd and 3rd column names.
data frame = df
col.names = A, B, C, D
New col.names = Z, F
I have tried the code below:
colnames(df)[2]<-"Z"
colnames(df)[3]<-"F"
Is it possible to do the renaming in a single line of code?
The actual data frame contains 150+ column names, so I am looking for a better solution.
Because df is a data.frame, names works in place of colnames: the names of a data.frame are its column names. Subset the names by index with [2:3] (for a contiguous range) or [c(2, 3)] (for arbitrary positions) and assign the new names, combined into a vector with c.
names(df)[2:3] <- c("Z", "F")
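A short sketch of this on a toy data frame is shown below, together with a variant that looks the columns up by their old names instead of by position. The variant is an addition that is not part of the answer above; it may be easier to maintain when there are 150+ columns. The column values and the names df2, old and new are made up for illustration.

```r
df <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8)

# Rename by position, as in the answer
names(df)[2:3] <- c("Z", "F")
print(names(df))

# Variant: rename by old name, so column positions don't need to be known
df2 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8)
old <- c("B", "C")
new <- c("Z", "F")
names(df2)[match(old, names(df2))] <- new
print(names(df2))
```

The match() variant assumes every name in old exists in the data frame; a missing name would produce an NA index and an error.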
