Extracting parts of a dataframe - r

I need to extract parts of a dataframe, using the values which I have generated previously. For example, I have the following data:
a<-c(1,2,3,4,6,7,10,12,17,20)
df1<-data.frame(a)
I then want to exclude these values (the values in "a") from a second dataframe, defined below and also named df1, wherever they appear in column b:
b<-c(1,2,3,4,5,6,6,6,7,8,9,10,11,11,11,12,13,14,14:20)
c<-c(1:25)
df1<-data.frame(b,c)
So, I should be left with a dataframe containing only the rows where b is 5, 8, 9, 11, etc.
Can anyone help me out with the code to remove these values from my dataframe (df1)?
Many thanks.

subset() will be a good friend to you for this sort of thing:
subset(df1, !b %in% a)
(The sub-expression b %in% a tests each element of b to determine whether or not it is in a, returning a vector of TRUEs and FALSEs. !b %in% a just negates/flips those Boolean values, so that you end up with a logical vector indexing with TRUEs the rows of df1 that you would like to keep, i.e. those whose b value doesn't appear in a.)
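To make that concrete, here is a minimal sketch using the data from the question. The names keep_rows, kept and kept_bracket are only illustrative; the bracket form is shown as an equivalent way of applying the same logical vector.

```r
# Data from the question
a <- c(1, 2, 3, 4, 6, 7, 10, 12, 17, 20)
b <- c(1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 14:20)
c <- 1:25
df1 <- data.frame(b, c)

# TRUE for rows whose b value does NOT occur in a
keep_rows <- !(df1$b %in% a)

# Two ways to keep only those rows
kept <- subset(df1, !b %in% a)
kept_bracket <- df1[keep_rows, ]

# The distinct b values that survive the filter
unique(kept$b)
```

Both versions keep the original row order; only the rows whose b value occurs in a are dropped.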

Related

Compare two lists, create new column and assign 1 to matching data (RNA-seq genes)

I have two dataframe:
df1: big data frame, a gene list with ca. 20000 genes
df2: small gene list (e.g. 50 genes).
Now I would like to find which genes from df2 are present in df1, add a new column in df1 and mark/highlight the matching genes with 1 and the nonmatching with 0.
I usually find out which genes are present in a list by using the subset function. I also know how to create a new table with only the matching genes etc. but I really do not know how to "mark them" in the same file, thus I thought of adding a column with 1 or 0. But it can be also true or false.
I hope I have explained myself; let me know if that is not the case.
Thanks, Lore
Using %in%, then convert to integer:
df1$match <- as.integer(df1$gene %in% df2$gene)
Note the caveats mentioned by jay.sf:
Be careful with using %in%, if you have NA it gives unexpected results. Try NA %in% 1, it should give NA but gives FALSE.
%in% is a great feature as it is. However, it is often recommended as a multivariate substitute for ==. And there's the rub; where == yields NA, %in% won't, and through unawareness the naïve user might generate observations out of nothing.
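A small sketch of both the answer and the caveat, using made-up gene names (the data and the extra column match_na are assumptions for illustration, not part of the original question):

```r
# Hypothetical gene lists; the column name "gene" follows the answer above
df1 <- data.frame(gene = c("BRCA1", "TP53", NA, "MYC", "EGFR"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(gene = c("TP53", "EGFR"), stringsAsFactors = FALSE)

# 1 if the gene is in df2, 0 otherwise (a missing gene name also gets 0)
df1$match <- as.integer(df1$gene %in% df2$gene)

# Optional: keep missing gene names as NA instead of 0
df1$match_na <- ifelse(is.na(df1$gene), NA_integer_, df1$match)

# The difference the quoted comments describe
NA %in% 1   # FALSE
NA == 1     # NA
```

Whether a missing gene name should count as 0 or stay NA depends on what the marker column is used for downstream.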

R: filtering elements of large vector that appear in a smaller vector [duplicate]

Suppose we have a numeric vector. Actually, suppose we have a dataframe consisting of a single column.
example = data.frame("column" = rnorm(10000, 10, 3))
We'll be treating it as a dataframe in order to use the filter function of the dplyr package.
Also, suppose we have another vector of smaller length. This particular vector is just for the sake of the example. It doesn't necessarily have to be a sequence.
numbers = 8:100
What I would like to do is to keep those values of the larger vector that are equal to any of the values of the smaller vector and discard those values that are not.
Fair enough. The filter function can do that. Except that I would have to write this:
filtered = dplyr::filter(example, column == numbers[1] | column == numbers[2] | ... | column == numbers[length(numbers)])
I would have to write the condition column == numbers[i] for each of the elements of the numbers vector.
Executing this code
filtered = dplyr::filter(example, column == numbers)
gives as output a dataframe called filtered that consists of a single column with no rows. There are no rows because, since all the rows of the example dataframe consist of scalars, none of those rows is equal to the whole numbers vector.
Is there a smarter method that doesn't require me to write that condition for each element of the numbers vector?
You can use the operator %in% to check if your values are "in" the vector.
Code:
library(dplyr)  # provides %>% and filter()
new_data <- old_data %>%
  dplyr::filter(column %in% numbers)
Are you looking for:
filtered <- dplyr::filter(example, column %in% numbers)
An option with base R
subset(example, column %in% numbers)
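One thing worth noting about the question's example: rnorm() produces continuous values, so practically none of them will exactly equal an integer in numbers, whichever operator is used. The sketch below rounds the values (my assumption, purely so that there is something to match) and contrasts %in% with ==, which compares element by element and recycles the shorter vector instead of testing membership. It uses base subset() so that it runs without extra packages.

```r
set.seed(1)  # only so the sketch is reproducible

# Rounded so that exact matches against integers are possible
example <- data.frame(column = round(rnorm(10000, 10, 3)))
numbers <- 8:100

# Membership test: keep rows whose value occurs anywhere in numbers
filtered <- subset(example, column %in% numbers)

# Element-wise test: numbers is recycled along the column, so row i is
# compared with only one element of numbers (R warns about the lengths)
recycled <- suppressWarnings(example$column == numbers)

nrow(filtered)   # rows kept by the membership test
sum(recycled)    # rows that happen to match position by position
```

Every row that matches position by position also passes the membership test, so the second count can never exceed the first.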

Check if a column has more than one value

I have a dataframe on which I only want to run a function if I know that certain columns (say there are 11 columns and I want to check 4 of them) contain more than one distinct value (e.g. they are not all 2).
Is there any specific function to find this out or would I have to loop through each of the columns and check?
We can use sapply to loop over the columns, get the unique elements in each column, check whether the length is greater than 1. It gives a logical vector which can be used for subsetting the dataset if needed.
i1 <- sapply(df1, function(x) length(unique(x)) >1)
df1[i1]
Or another option to subset columns is Filter. This relies on the variance being non-zero, so it is meant for numeric columns:
Filter(var, df1)
For each column, run length(unique(x)). This returns the number of unique values in that column. If you provide more information, this can be nested into a function that decides whether or not to run based on the values of length(unique(x)).
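As a rough sketch of how that check could gate a function call, assuming a small made-up data frame and using summary() as a stand-in for whatever function you actually want to run (the column names and cols_to_check are hypothetical):

```r
# Hypothetical data: x1 and x4 are constant, x2 and x3 vary
df1 <- data.frame(x1 = c(2, 2, 2),
                  x2 = c(1, 2, 3),
                  x3 = c("a", "a", "b"),
                  x4 = c(5, 5, 5))

# Only these columns matter for the decision (names are illustrative)
cols_to_check <- c("x1", "x2", "x3")

# TRUE for each checked column holding more than one distinct value
varies <- sapply(df1[cols_to_check], function(x) length(unique(x)) > 1)

# Run the (placeholder) function only if every checked column varies
result <- if (all(varies)) summary(df1) else NULL
```

Here x1 is constant, so all(varies) is FALSE and the function is skipped. Swap all() for any() if a single varying column is enough.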

R: Assign values to a new column based on values of another column where a condition is satisfied

I want to create a new column in a data.frame where its value is equal to the value in another data.frame where a particular condition is satisfied between two columns in each data frame.
The R pseudo-code being something like this:
DF1$Activity <- DF2$Activity where DF2$NAME == DF1$NAME
In each data.frame values for $NAME are unique in the column.
Use the ifelse function. Here, I put NA when the condition is not met. However, you may choose any value or values from any vector.
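Recycling rules apply, and the comparison is element-wise, so this assumes that both data frames list NAME in the same row order.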
DF1$Activity <- ifelse(DF2$NAME == DF1$NAME, DF2$Activity, NA)
I'm not sure this one actually needs an example. What happens when you create a column filled with NA values and then assign to the required rows, using the same logical vector on both sides?
DF1$Activity <- NA
DF1$Activity[DF2$NAME == DF1$NAME] <- DF2$Activity[DF2$NAME == DF1$NAME]
Without an example it's quite hard to tell, but from your description it sounds like a base::merge or dplyr::inner_join operation. Those are quite fast in comparison to if statements.
Cheers
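For the case where the two data frames are not in the same row order, here is a minimal sketch with made-up data. It uses match(), which none of the answers above mention, so treat it as one possible alternative, alongside the merge() approach suggested above. It relies on NAME being unique in each data frame, as the question states.

```r
# Hypothetical data: NAME is unique in each data frame, rows are not aligned
DF1 <- data.frame(NAME = c("c", "a", "d"), stringsAsFactors = FALSE)
DF2 <- data.frame(NAME = c("a", "b", "c"),
                  Activity = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# match() gives, for each DF1$NAME, its row position in DF2 (NA if absent)
DF1$Activity <- DF2$Activity[match(DF1$NAME, DF2$NAME)]

# The same lookup as a left join; note that merge() sorts by the key
merged <- merge(DF1["NAME"], DF2, by = "NAME", all.x = TRUE)
```

Names in DF1 that have no counterpart in DF2 ("d" here) end up with NA in both versions.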

R - is subset guaranteed to return the same order of values in repeated calls?

When subsetting a data.frame or vector, is the same subset call guaranteed to return the same order of values/rows no matter how many times the call is made?
For a vector, definitely yes. From the documentation for subset:
For ordinary vectors, the result is simply x[subset & !is.na(subset)].
For data frames, the same would appear to be true, since the subsetting is just applied to each row effectively as a vector. For instance, the following will always return just entries from the b column of d whose corresponding a value is greater than 5. No reordering of rows occurs.
d <- data.frame(a=1:10, b=20:29)
subset(d, a>5, b)
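A quick way to convince yourself, as a sketch rather than a proof, is to repeat the call and compare the results (the names first and second are only illustrative):

```r
d <- data.frame(a = 1:10, b = 20:29)

# Same call made twice
first  <- subset(d, a > 5, b)
second <- subset(d, a > 5, b)

# Both results hold the same rows in the same (original) order
identical(first, second)   # TRUE
first$b                    # 25 26 27 28 29
```

The rows come back in the order they have in d, because the logical condition is simply used as a row index.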
