Extracting rows from data frame based on another data frame - r

I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read that I should use filter and the %in% operator, but I'm unsure how to write it properly.
example:
my large database:
Gene  Week1  Week2  Week3
A     20     14     5
B     5      10     15
C     2      4      6
D     20     18     19
my small data base:
Gene
A
C
D
And I want my result to be:
Gene  Week1  Week2  Week3
A     20     14     5
C     2      4      6
D     20     18     19
Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)

Using logical row indexes:
large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
Explanation:
large_database$Gene %in% unique(small_data_base$Gene)
This checks, for each entry (i.e. row) in large_database$Gene, whether it appears in unique(small_data_base$Gene), i.e. the list of unique values in the column Gene of small_data_base, and returns a boolean vector (a vector of TRUE and FALSE).
We can then use this vector as a row 'index' to select only the rows where the vector is TRUE (i.e. the value of large_database$Gene was in unique(small_data_base$Gene)).
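As a runnable sketch of the above, the example tables from the question can be rebuilt by hand. The column names Week1, Week2 and Week3 are my reading of the question's header, and the names keep and result are only illustrative.

```r
# Rebuild the question's example data (column names assumed from the post)
large_database <- data.frame(
  Gene  = c("A", "B", "C", "D"),
  Week1 = c(20, 5, 2, 20),
  Week2 = c(14, 10, 4, 18),
  Week3 = c(5, 15, 6, 19)
)
small_data_base <- data.frame(Gene = c("A", "C", "D"))

# Logical vector: TRUE where the gene is one of the genes of interest
keep <- large_database$Gene %in% small_data_base$Gene

# Use it as a row index; the empty column slot keeps all columns
result <- large_database[keep, ]
print(result)
```

If the genes are stored as row names rather than in a Gene column, the same idea should work with rownames(large_database) %in% small_data_base$Gene as the index.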

Related

R: Deleting rows from a data frame based on values of other vector

So I have a data frame with baskets of products purchased by individuals. A row stands for the basket of products of one individual. I want to remove all the rows (baskets) that contain a product (expressed as an integer) that is listed in a vector named products.to.delete.
I also have a vector containing a large number of values that must be deleted. I would like to delete all the rows that contain a value from this vector.
Here is some code to make it reproducible:
dataframe <- as.data.frame( matrix(data = sample(10000,1000,replace = TRUE),20,50))
products.to.delete <- sample(10000,200,replace = FALSE)
Thank you in advance for helping me out!
If your data is data, and your vector of target values is vals, you could do this:
data[apply(data,1,\(r) !any(r %in% vals)),]
That is, within each row of data (i.e. apply(data,1,...)), you check whether any of the values are in vals. Negating the result with ! gives a logical vector that is TRUE for the rows to keep.
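A small worked sketch of this row-wise check, using a tiny made-up data frame so the result is easy to verify by eye. The values and the names p1, p2, p3, keep and remaining are assumptions for illustration, not from the question.

```r
# Three baskets of three products each (made-up values)
data <- data.frame(p1 = c(1, 4, 7), p2 = c(2, 5, 8), p3 = c(3, 6, 9))
vals <- c(5, 9)  # products whose baskets should be dropped

# For each row: TRUE if none of its values occur in vals
keep <- apply(data, 1, function(r) !any(r %in% vals))

# Only the first basket contains neither 5 nor 9
remaining <- data[keep, ]
print(remaining)
```

Note that apply() converts the data frame to a matrix first, so this is best suited to data whose columns all have the same type, as in the question.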
For your next questions, please create reproducible examples such as the one below.
What you're after is called filtering and can be done in base R by the following.
First, create an object called for example myfilter which is a boolean vector with the same length as the number of rows in your data.frame.
mydat <- data.frame("col1"=1:5, "col2"=letters[1:5])
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
myfilter <- mydat$col2 %in% c("a", "c")
[1] TRUE FALSE TRUE FALSE FALSE
mydat[myfilter,]
col1 col2
1 1 a
3 3 c
Then simply put this object inside the brackets [], before the comma. R will keep the rows where the value is TRUE.

Compare value in R data frame after certain index

I have a data.frame as given below. I want to get the index/row number where (b-a)>8, but I want the comparison to start after row 7, not from row 1. I have written code that gives me the row number where b-a>8 holds, but it checks from row 1. How can I make it check from row 7?
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
b <- c(2,12,4,5,2,5,8,5,7,19,6,7,4,23,1,2)
df <- data.frame(a,b)
which((df$b-df$a)>8)[1]
Desired output: Row number 10 not 2.
One can apply the offset to both vectors:
which((df$b[7:nrow(df)]-df$a[7:nrow(df)])>8)
#[1] 4 8
These positions are relative to row 7, so add 6 to recover the original row numbers (10 and 14).
This is just a math calculation
(which(with(df[-(1:7),],b-a>8))+7)[1]
[1] 10
idx <- which((df$b-df$a)>8)
idx[idx>7][1]
[1] 10
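The approaches above can be pulled together in one sketch that searches the whole data frame and then discards hits before the starting row. The names start, hits and first_hit are illustrative, and start is set to 7 on the assumption that row 7 itself may count as a match.

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
b <- c(2, 12, 4, 5, 2, 5, 8, 5, 7, 19, 6, 7, 4, 23, 1, 2)
df <- data.frame(a, b)

start <- 7  # first row that is allowed to match (assumed inclusive)

# All rows where b - a > 8, as original row numbers
hits <- which(df$b - df$a > 8)

# Keep only hits at or after the starting row, then take the first
first_hit <- hits[hits >= start][1]
print(first_hit)
```

Because which() is run on the full columns, the result is already an original row number and no offset arithmetic is needed. If no row qualifies, first_hit is NA.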

Using list of row numbers as criteria to populate field

I have a list of row numbers that represent row containing outliers in a data set. I would like to add an "outlier" column to the original data set that flags the rows containing outliers, but I can't figure out how to use row numbers as criteria in r.
Example:
I have a dataframe like this:
id <-c("a","b","c","d")
values <-c(10,11,22,33)
df<-data.frame(id,values)
id values
1 a 10
2 b 11
3 c 22
4 d 33
And a vector like this containing row numbers (more correctly, "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test<-ifelse( if row number is on my list, "outlier","")
to produce something like this:
id values outlier_test
1 a 10
2 b 11 outlier
3 c 22
4 d 33 outlier
Spent quite a while trying to puzzle this out and had inspiration as soon as I posted the question. For anyone else who comes here with this question:
First:
df$rownumber<- row.names(df)
then:
df$outlier_test<- ifelse(df$rownumber %in% outliers,"outlier","")
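A variant that skips the helper column, sketched under the assumption that the data frame still has its default row names 1..n (so row position and row name coincide):

```r
id <- c("a", "b", "c", "d")
values <- c(10, 11, 22, 33)
df <- data.frame(id, values)

outliers <- c(2, 4)  # row numbers flagged as outliers

# seq_len(nrow(df)) yields 1, 2, ..., n; flag positions found in outliers
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")
print(df)
```

If the rows have been reordered or subset so that row names no longer match positions, comparing row.names(df) against the list, as in the answer above, is the safer choice.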

R select multiple rows by conditional row number

I have a R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use the square brackets to extract values, R expects everything before the comma to indicate the rows desired, and everything after the comma to indicate the columns desired (see: ?"["). Hence, df[1:3,5] means rows 1 through 3, from column 5. To turn your desired rows into a single vector, you need to concatenate them (i.e., c(1:3,5)). That all goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
As an alternative (that might be more appropriate for a data frame with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
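A short sketch comparing the equivalent forms on the question's data. The name rows is illustrative. Since df$a is a plain vector with a single dimension, it can also be indexed directly with one combined index vector, with no comma.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
df <- data.frame(a, b)

rows <- c(1:3, 5)  # rows 1, 2, 3 and 5 as one index vector

by_position <- df[rows, 1]    # column chosen by position
by_name     <- df[rows, "a"]  # column chosen by name
from_vector <- df$a[rows]     # index the column vector itself

print(from_vector)
```

All three return the same numeric vector. Selecting by name is less fragile if the column order changes.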

Problems with using subset in r

I need to subset my data frame, but I do not know what condition to use.
df2<-subset(df, condition )
A part of the dataframe, `df`:
state value
a 1
b 2
c 3
a 1
b 4
c 5
I compute the sum of the value column for each state using: table(df$state)
I need to create a data frame that shows just the rows where the sum of the value column is bigger than a given value x.
If x is 3, I need to have in the new data frame just the rows that have the "state" column equal to b or c.
What should I replace "condition" with? How can I use : table(df$state) in the condition?
It is not clear what you are trying to do.
table(df$state) counts the occurrences of each state in your data, not the sum of the variable "value" for each "state". You should instead use something like this:
vv <- tapply(df$value,df$state,sum)
vv
a b c
2 6 8
Now you can use the result within subset to keep the rows whose state has a summed value bigger than a given value x. For example, with x = 3:
subset(df,state %in% names(vv)[vv>3])
or without using subset (more efficient):
df[df$state %in% names(vv)[vv>3],]
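To make this reproducible, here is a sketch that rebuilds the question's df and applies the tapply approach, followed by an alternative using base R's ave(), which returns the group sum aligned to each row. The names x, group_sum, df2 and df2_alt are illustrative.

```r
df <- data.frame(
  state = c("a", "b", "c", "a", "b", "c"),
  value = c(1, 2, 3, 1, 4, 5)
)
x <- 3  # threshold on the per-state sum

# Per-state sums as a named vector: a = 2, b = 6, c = 8
vv <- tapply(df$value, df$state, sum)
df2 <- df[df$state %in% names(vv)[vv > x], ]

# Alternative: ave() repeats each state's sum on every row of that state
group_sum <- ave(df$value, df$state, FUN = sum)
df2_alt <- df[group_sum > x, ]

print(df2)
```

Both versions keep the four rows belonging to states b and c. The ave() form avoids the name lookup, which may read more directly when the condition is purely numeric.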
