R subsetting if both conditions are met

I am trying to subset a data frame df.1 based on two conditions:
observations in Accession variable should contain ;
observations in kinase.or.not should be kinase
Below is the code I used. But it seems that the first condition grep(";", df.1$Accession) is ignored. Why is that? Thanks!
df.2 <- df.1[grep(";", df.1$Accession) & df.1$kinase.or.not == "Kinase",]

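We need grepl instead of grep. grep returns the numeric positions of the matching elements, whereas grepl returns a logical vector with one TRUE/FALSE per element, which can be combined with & to build the compound condition. When the integer positions from grep are used with &, every non-zero position is coerced to TRUE (and recycled to the length of the other vector), which is why the first condition appears to be ignored.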
df.1[grepl(";", df.1$Accession) & df.1$kinase.or.not == "Kinase",]
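To see the difference on something small, here is a minimal sketch. The toy rows are made up for illustration; only the column names come from the question.

```r
# Hypothetical toy data; only the column names follow the question.
df.1 <- data.frame(
  Accession     = c("P1;P2", "P3", "P4;P5", "P6"),
  kinase.or.not = c("Kinase", "Kinase", "Other", "Kinase"),
  stringsAsFactors = FALSE
)

idx  <- grep(";", df.1$Accession)    # integer positions of the matches: 1 3
flag <- grepl(";", df.1$Accession)   # one logical per row: TRUE FALSE TRUE FALSE

# With grep(), & coerces the positions to logical (non-zero means TRUE) and
# recycles them, so the ";" condition no longer filters anything.
wrong <- df.1[idx & df.1$kinase.or.not == "Kinase", ]   # keeps rows 1, 2 and 4

# With grepl(), both conditions are checked row by row.
right <- df.1[flag & df.1$kinase.or.not == "Kinase", ]  # keeps row 1 only
```

In the wrong version the result is exactly what the kinase condition alone would return, which matches the symptom described in the question.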

Related

R: filtering elements of large vector that appear in a smaller vector

Suppose we have a numeric vector. Actually, suppose we have a dataframe consisting of a single column.
example = data.frame("column" = rnorm(10000, 10, 3))
We'll be treating it as a dataframe in order to use the filter function of the dplyr package.
Also, suppose we have another vector of smaller length. This particular vector is just for the sake of the example. It doesn't necessarily have to be a sequence.
numbers = 8:100
What I would like to do is to keep those values of the larger vector that are equal to any of the values of the smaller vector and discard those values that are not.
Fair enough. The filter function can do that. Except that I would have to write this:
filtered = dplyr::filter(example, column == numbers[1] | column == numbers[2] | ... | column == numbers[length(numbers)])
I would have to write the condition column == numbers[i] for each of the elements of the numbers vector.
Executing this code
filtered = dplyr::filter(example, column == numbers)
gives as output a dataframe called filtered that consists of a single column with no rows. There are no rows because, since all the rows of the example dataframe consist of scalars, none of those rows is equal to the whole numbers vector.
Is there a smarter method that doesn't require me to write that condition for each element of the numbers vector?
You can use the operator %in% to check if your values are "in" the vector.
Code:
# %>% is available once dplyr (or magrittr) is attached, e.g. library(dplyr)
new_data <- old_data %>%
  dplyr::filter(column %in% numbers)
Are you looking for:
filtered <- dplyr::filter(example, column %in% numbers)
An option with base R
subset(example, column %in% numbers)
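As a small illustration of why %in% works where == does not, here is a sketch in base R. The values in example are made up (not the rnorm data from the question) so that some of them really do equal elements of numbers; continuous random draws will almost never be exactly equal to an integer.

```r
# Hypothetical values chosen so that some of them appear in `numbers`.
example <- data.frame(column = c(3, 8, 9.5, 50, 100, 250))
numbers <- 8:100

# %in% asks, for each element on the left, "is it anywhere in the right-hand vector?"
keep <- example$column %in% numbers      # FALSE TRUE FALSE TRUE TRUE FALSE
filtered <- example[keep, , drop = FALSE]

# == instead compares position by position (recycling the shorter vector),
# so the order of the values matters:
by_position <- c(10, 8) == c(8, 10)      # FALSE FALSE
by_membership <- c(10, 8) %in% c(8, 10)  # TRUE TRUE
```

The drop = FALSE keeps filtered as a one-column data frame instead of collapsing it to a plain vector.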

How to subset data with multiple criteria from one column

I need to create a data subset from multiple "inclusion" criteria from a column (V5:Format) of my df.
I have tried :
new.data <- old.data[grep("text1", old.data$V5), ]
This works for one inclusion criterion. I want to add a second one: to be kept in the subset, a row's V5 must contain both "text1" and "text2".
Thanks in advance.
You can use grepl() instead of grep() to get a boolean vector which tells you which strings contain the pattern. On these vectors, you can use logical conditions like &:
new.data <- old.data[grepl("text1", old.data$V5) & grepl("text2", old.data$V5), ]
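A minimal sketch of the same idea on made-up data (the strings in V5 are hypothetical). Storing each grepl() result under its own name makes it easy to switch between "both" and "either". fixed = TRUE is used here on the assumption that the criteria are literal strings and not regular expressions.

```r
# Hypothetical data; only the column name V5 comes from the question.
old.data <- data.frame(
  V5 = c("text1 only", "text1 and text2", "text2 only", "neither"),
  stringsAsFactors = FALSE
)

has1 <- grepl("text1", old.data$V5, fixed = TRUE)  # TRUE TRUE FALSE FALSE
has2 <- grepl("text2", old.data$V5, fixed = TRUE)  # FALSE TRUE TRUE FALSE

both   <- old.data[has1 & has2, , drop = FALSE]    # row 2 only
either <- old.data[has1 | has2, , drop = FALSE]    # rows 1, 2 and 3
```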

create flag based on row values in grep()

I have a 10-row data frame of tweets about potatoes and need to flag them based on the punctuation each tweet contains (question marks or exclamation points). The grep function will return row numbers where these characters appear:
grep("\\?", potatoes$tweet)
grep("!", potatoes$tweet)
I've tried to create the flag variable question with mutate in dplyr as shown...
potatoes$question <- NA
potatoes <- mutate(potatoes, question = +row_number(grep("\\?", potatoes$tweet)))
Error in mutate_impl(.data, dots) :
Column `question` must be length 10 (the number of rows) or one, not 3
I'm also happy to consider more elegant solutions than conditioning on the output of grep. Any help appreciated!
We can use grepl instead of grep. grep returns the index/position where the matches occur, whereas grepl returns a logical vector in which TRUE denotes a matching element and FALSE a non-matching one. It can be used as a flag:
i1 <- grepl("!", potatoes$tweet)
and if we need to change to row numbers,
potatoes$question <- i1 * seq_len(nrow(potatoes))
Similarly, grep with row index can be used for assignment
i2 <- grep("!", potatoes$tweet)
potatoes$question[i2] <- seq_len(nrow(potatoes))[i2]
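If a plain flag per tweet is all that is needed, the logical vector from grepl() can be stored directly as a column. A minimal sketch with made-up tweets follows; the column names exclaim and question_int are hypothetical. fixed = TRUE treats "?" as a literal character, so no escaping is needed.

```r
# Hypothetical tweets for illustration.
potatoes <- data.frame(
  tweet = c("Potatoes?", "I love fries!", "Mash", "Really?!"),
  stringsAsFactors = FALSE
)

# grepl() returns one value per row, so it fits a data frame column as is.
potatoes$question <- grepl("?", potatoes$tweet, fixed = TRUE)  # TRUE FALSE FALSE TRUE
potatoes$exclaim  <- grepl("!", potatoes$tweet, fixed = TRUE)  # FALSE TRUE FALSE TRUE

# A 0/1 flag, if that is preferred over TRUE/FALSE.
potatoes$question_int <- as.integer(potatoes$question)         # 1 0 0 1
```

The same grepl() call should also work inside dplyr's mutate(), since its result has one element per row, which is what the error message in the question was complaining about.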

Subsetting data frames while dropping rows which fulfill certain conditions

I want to subset a data frame. Most of the time one reduces an original data frame by keeping the observations which fulfil certain conditions on its variables and dropping the rest.
A working code is:
Companies.Exchanges.1 <- subset(Companies.Exchanges.0,
(Frankfurt == 1 & London == 1))
I want to do it the other way round: dropping all observations which fulfil certain conditions and keeping the rest (those which violate at least one condition) in a new data frame.
How do I have to reformulate the above code to do this?
Try negating your filtering conditions with !
Companies.Exchanges.1 <- subset(Companies.Exchanges.0,
!(Frankfurt == 1 & London == 1))
When you specify filtering conditions for subset or in general, R takes all of your rows and checks them against the conditions you set. Think of it as adding another boolean vector to your dataframe where matching criteria = TRUE, and not matching = FALSE. The ! operator reverses this invisible vector.
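A minimal sketch with made-up companies (the Company column and its values are hypothetical), showing that negating the whole condition keeps exactly the rows that fail at least one part of it. By De Morgan's law, the negated condition can also be written with | and !=.

```r
# Hypothetical data; Frankfurt and London follow the question.
Companies.Exchanges.0 <- data.frame(
  Company   = c("A", "B", "C", "D"),
  Frankfurt = c(1, 1, 0, 0),
  London    = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Rows listed on both exchanges: company A.
kept <- subset(Companies.Exchanges.0, Frankfurt == 1 & London == 1)

# Everything else: companies B, C and D.
dropped <- subset(Companies.Exchanges.0, !(Frankfurt == 1 & London == 1))

# Equivalent form: "not (x and y)" is the same as "(not x) or (not y)".
same <- subset(Companies.Exchanges.0, Frankfurt != 1 | London != 1)
```

The parentheses after ! matter: without them, only the first comparison would be negated.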

Extracting parts of a dataframe

I need to extract parts of a dataframe, using the values which I have generated previously. For example, I have the following data:
a<-c(1,2,3,4,6,7,10,12,17,20)
df1<-data.frame(a)
I then want to exclude these values (the values in "a") from a second data frame wherever they appear in its column b:
b<-c(1,2,3,4,5,6,6,6,7,8,9,10,11,11,11,12,13,14,14:20)
c<-c(1:25)
df1<-data.frame(b,c)
So, I should be left with a dataframe with rows 5,8,9,11 etc...
Can anyone help me out with the code to remove these values from my dataframe (df1).
Many thanks.
subset() will be a good friend to you for this sort of thing:
subset(df1, !b %in% a)
(The sub-expression b %in% a tests each element of b to determine whether or not it is in a, returning a vector of TRUEs and FALSEs. !b %in% a just negates/flips those Boolean values, so that you end up with a logical vector indexing with TRUEs the rows of df1 that you would like to keep (i.e. those whose value of b doesn't appear in a).)
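For a quick check, here is a sketch using the vectors from the question. The second column is passed as c = 1:25 directly, and result is just a name picked for illustration. R parses !b %in% a as !(b %in% a), because %in% binds more tightly than !, so the extra parentheses are optional and only add readability.

```r
a <- c(1, 2, 3, 4, 6, 7, 10, 12, 17, 20)
b <- c(1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 14:20)
df1 <- data.frame(b = b, c = 1:25)

# Keep only the rows whose value of b does not occur in a.
result <- subset(df1, !b %in% a)

# Both spellings give the same logical vector.
same_parse <- identical(!b %in% a, !(b %in% a))   # TRUE

# The values of b that survive: 5 8 9 11 13 14 15 16 18 19
remaining <- sort(unique(result$b))
```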
