Subsetting a dataframe - r

I have a dataframe with 23,000 rows and 8 columns.
I want to subset it using only the unique identifiers in column 1. I do this with:
total_res2 <- unique(total_res['Entrez.ID']);
This produces 17,000 rows, but with only the information from column 1.
I am wondering how to extract the unique rows based on this column while also keeping the information from the other 7 columns for those rows.

This returns the rows of total_res containing the first occurrences of each Entrez.ID value:
subset(total_res, ! duplicated( Entrez.ID ) )
Or did you mean you want only the rows whose Entrez.ID is not duplicated at all:
subset(total_res, ave(seq_along(Entrez.ID), Entrez.ID, FUN = length) == 1 )
Next time please provide test data and expected output.
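For example, with some made-up test data (Entrez.ID is the only column name taken from the question; logFC is a hypothetical second column standing in for the other 7):
# hypothetical test data: Entrez.ID 10 appears twice, 20 and 30 once each
total_res <- data.frame(Entrez.ID = c(10, 10, 20, 30),
                        logFC     = c(1.2, 0.8, -0.5, 2.1))
# first occurrence of every Entrez.ID -> rows 1, 3, 4
subset(total_res, !duplicated(Entrez.ID))
# only Entrez.IDs that occur exactly once -> rows 3, 4
subset(total_res, ave(seq_along(Entrez.ID), Entrez.ID, FUN = length) == 1)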

Related

Conditional removal of rows in R

I have a data frame with 2 columns and 26 rows; the first column contains characters and the second contains numbers.
I also have a vector with a random selection of 5 characters.
I want to sum the column-two values for those 5 random characters.
How can I calculate this sum?
We can use aggregate
aggregate(ints ~ char, data1, sum)
Maybe what you need is:
result <- sum(data1$ints[data1$char %in% sample1], na.rm = TRUE)
This sums the ints values in data1 whose char appears in sample1.
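Putting the two answers side by side on made-up data (the names data1, char, ints and sample1 are the ones used above; the values are invented):
# made-up data in the shape described: 26 characters, one number each
data1 <- data.frame(char = letters, ints = 1:26, stringsAsFactors = FALSE)
# a random selection of 5 characters
sample1 <- sample(letters, 5)
# single total of the ints for the sampled characters
sum(data1$ints[data1$char %in% sample1], na.rm = TRUE)
# per-character sums over the whole data frame (what aggregate returns)
aggregate(ints ~ char, data1, sum)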

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. The first two columns are IDs (ID1 and ID2) referring to the same item, and the third column is a count of how many times the item with these two IDs appears. The dataframe has many rows, so I want to use binary search to first find the row where both IDs match and then add 1 to the count column in that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which function is very inefficient: because I have to do this inside a for loop more than a trillion times, it takes a lot of time. Also, there is only ever one row in the data frame with a given ID combination, so while which scans every row, a function that stops once it finds the matching row should suffice. I have looked into using data.table and setkey for this but do not know how to implement it for my purpose. Thank you in advance.
Indeed you can use data.table with a key on both columns. setkeyv takes a character vector of column names; setkey(DF, ID1, ID2) with unquoted names does the same thing.
library(data.table)
# example data: a Count column is added so there is something to increment
DF <- data.frame(ID1 = sample(1:100, 100000, replace = TRUE),
                 ID2 = sample(1:100, 100000, replace = TRUE),
                 Count = 0L)
# convert DF to a data.table
DF <- as.data.table(DF)
# key the table on both ID1 and ID2, in that order
setkeyv(DF, c("ID1", "ID2"))
# random x and y values
x <- 10
y <- 18
# keyed (binary search) lookup: add 1 to Count where ID1 == x and ID2 == y
DF[.(x, y), Count := Count + 1L]
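As a further sketch, if the (x, y) pairs you loop over are themselves stored in a table, the whole for loop can be replaced by a single update join; lookups below is a hypothetical table of pairs, not something from the question:
# hypothetical table of (ID1, ID2) pairs to increment, possibly with repeats
lookups <- data.table(ID1 = c(10, 10, 25), ID2 = c(18, 18, 3))
# how many times each pair occurs
increments <- lookups[, .(n = .N), by = .(ID1, ID2)]
# one update join instead of incrementing inside a loop
DF[increments, Count := Count + i.n, on = c("ID1", "ID2")]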

Get row numbers of unique rows in a matrix

I have a matrix that has some unique rows and I would like to get the row names of those unique rows only.
m <- matrix( data = c(1,1,2,1,1,2,1,1,2), ncol = 3 )
If the expected row index is 3, because the other two rows are duplicates of each other, then use duplicated from both directions to get a logical index and wrap it in which for the numeric index.
which(!(duplicated(m)|duplicated(m,fromLast=TRUE)))
#[1] 3
If we consider the 1st and 3rd rows as the unique ones (the first occurrence of each distinct row), then
which(!duplicated(m))
#[1] 1 3
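The question asks for row names rather than indices; if the matrix has row names, the same logical index pulls them out. A small sketch, assuming example names "a", "b", "c" that are not in the original question:
rownames(m) <- c("a", "b", "c")   # example row names (not from the question)
rownames(m)[!(duplicated(m) | duplicated(m, fromLast = TRUE))]
#[1] "c"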

Select only rows if the value in a particular set of columns is 'NA' in R

I have a data frame with many rows and columns (3000 x 37) and I want to select only the rows that have NA in >= 2 columns. The columns have different data types. I know how to do this when checking a single column:
df[is.na(df$col.name), ]
How do I make this selection based on two (or more) columns?
First create a vector nn with the number of NA's in each row, then select only those rows with >= 2 NA's: d[nn >= 2, ]
d = data.frame(x=c(NA,1,2,3), y=c(NA,"a",NA,"c"))
nn = apply(d, 1, FUN=function (x) {sum(is.na(x))})
d[nn>=2,]
x y
1 NA <NA>
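A shorter, usually faster equivalent uses rowSums on the logical matrix returned by is.na, avoiding the row-wise apply; same d as above:
d[rowSums(is.na(d)) >= 2, ]
#   x    y
#1 NA <NA>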

remove rows from data frame whose column values don't match another data frame's column values - R

So I have two data frames of different dimensions.
The first one, x, is about 10,000 rows long and looks like:
Year ID Number
2008.1 38573 1
2008.2 24395 3
(a lot of data in between)
2008.4 532 4
The second one, x2, is about 80,000 rows long and looks like:
Year ID Number
2008.1 38573 2
2008.2 24395 3
(a lot of data in between)
2008.4 532 4
Basically, I want to remove the rows in the second data frame whose Year, ID and Number values don't match any row of the first data frame. So in the above example, I'd remove row 1 from the second data frame, because the Number doesn't match.
I've tried:
x2new <- x2[(x2$ID == x$ID && x2$Year==x$Year && x2$Number == x$Number),]
But it doesn't work because the lengths of the two data frames are different.
I've tried a double for loop to remove rows that don't meet all 3 conditions, but R simply can't handle that many iterations.
Please help! Thanks.
A simple merge
merge(dat1,dat2)
Using your data for example:
dat1 <- read.table(text='Year,ID,Number
2008.1,38573,1
2008.4,532,4
2008.2,24395,3',header=TRUE,sep=',')
dat2 <- read.table(text='Year,ID,Number
2008.1,38573,2
2008.4,532,4
2008.2,24395,3',header=TRUE,sep=',')
Then you get:
merge(dat1,dat2)
Year ID Number
1 2008.2 24395 3
2 2008.4 532 4
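merge() with no by argument joins on all columns the two data frames share, so only rows whose Year, ID and Number all appear in the first data frame survive. If you would rather filter x2 directly (keeping its row order and any extra columns), a semi join does the same filtering; a sketch assuming the dplyr package is available:
library(dplyr)
# keep only the rows of x2 whose (Year, ID, Number) combination exists in x
x2new <- semi_join(x2, x, by = c("Year", "ID", "Number"))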
I understood that you want to remove all rows where none of the three columns has a match in the first data frame, and keep all rows where at least one column has a match, right? If so, just do this:
newX2 <- x2[ x2$ID %in% x$ID | x2$Year %in% x$Year | x2$Number %in% x$Number,]
