Subset based on frequency level [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I want to generate a df that keeps only the rows whose "ID" value occurs more often than a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID <- rep(c(1,2,3,4,5), times=c(5,7,9,11,13))
sub1 <- rnorm(45)
sub2 <- rnorm(45)
df1 <- data.frame(ID, sub1, sub2)
library(plyr)                 # needed for the count(data, "ID") form used below
IDfreq <- count(df1, "ID")
cutoff <- 9
df2 <- subset(df1, subset=(IDfreq$freq > cutoff))   # the problematic line

df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
First, what your last line actually does: IDfreq$freq > cutoff is a logical vector of length 5 (FALSE FALSE FALSE TRUE TRUE), and subset() recycles it across the 45 rows of df1, so df2 keeps the 4th and 5th row of each successive block of 5 rows; the result has nothing to do with which ID the rows belong to. The expression above instead tests whether each df1$ID value falls in a category with more than 9 rows. The resulting logical vector, passed as the "i" argument with the "j" item left empty, makes the [ function return the entire row wherever the element is TRUE.
See:
?`[`
?'%in%'
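To see the pieces of that expression at work, here is a small sketch using the df1 built above (results shown as comments):
table(df1$ID)                            # counts per ID: 5, 7, 9, 11, 13
names(table(df1$ID))[table(df1$ID) > 9]  # "4" "5"
df1$ID %in% c(4, 5)                      # TRUE for the 24 rows to keep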

Using dplyr
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)

Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
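For reference, ave() returns the group size for every row, aligned with df1, which is why it can be compared with cutoff directly; a quick sketch (result shown as a comment):
head(ave(df1$ID, df1$ID, FUN = length), 8)  # 5 5 5 5 5 7 7 7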

R: subsetting first 30 groups in data frame [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 2 years ago.
I'm trying to find a way to subset the first 30 groups in my data frame (171 in total, of unequal length).
Here's a smaller dummy data frame I've been practicing with (in this case I only try to subsample the first 3 groups):
groups <- c(rep("A", times=5), rep("B", times=2), rep("C", times=3),
            rep("D", times=2), rep("E", times=8))
value <- c(1,2,4,3,5,7,6,8,7,5,2,3,5,7,1,1,2,3,5,4)
dummy <- data.frame(groups, value)
So far, I've tried variations of:
subset <- c("A","B","C")
dummy2 <- dummy[dummy$groups == subset, ]
but I get the following warning: longer object length is not a multiple of shorter object length
Would anyone know how to fix this or have other options?
We can use filter from dplyr: get the first n unique elements of 'groups' with head, then use %in% inside filter to build a logical vector that subsets the rows.
library(dplyr)
n <- 3
dummy %>%
  filter(groups %in% head(unique(groups), n))
or subset in base R
subset(dummy, groups %in% head(unique(groups), n))
== works either with two vectors of equal length (elementwise comparison) or when the second vector has length 1; anything in between relies on recycling, which is rarely what you want. To test each element against multiple values, use %in%.
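A standalone illustration of the difference (the values are arbitrary; results shown as comments):
c("A","C","B","C") == c("A","B")    # TRUE FALSE FALSE FALSE (compared against the recycled A,B,A,B)
c("A","C","B","C") %in% c("A","B")  # TRUE FALSE TRUE FALSE  (membership test, no recycling)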

Subsetting a dataframe based on a vector of strings [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns: the first one is patientID, which is sometimes duplicated, and the other 3 describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
This only gives me the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern=dedupedGenID),]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the repeated IDs, i.e. subset the df so that I get the IDs and other columns of the patients that appear more than once?
Any thoughts would be appreciated.
We can use duplicated to get the IDs that appear more than once and use that to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and select the IDs that have more than one row.
In base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[, is_duplicated := duplicated(ID)]
This chunk makes a data.table from your data and adds a new column which is TRUE if the ID is a repeat and FALSE if not. Note that duplicated() marks only the later occurrences, so the first row of each repeated ID stays FALSE.
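If instead you want to flag every row of a repeated ID, including its first occurrence, here is a sketch (the column name is_dup_any is just illustrative):
genetics[, is_dup_any := .N > 1, by = ID]   # TRUE on every row of a repeated ID
# equivalently, scan duplicated() from both ends:
genetics[, is_dup_any := duplicated(ID) | duplicated(ID, fromLast = TRUE)]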

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. The first two columns are IDs (ID1 and ID2) referring to the same item, and the third column is a count of how many times items with these two IDs appear. The dataframe has many rows, so I want to use binary search to find the row where both IDs match and then add 1 to the count column of that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which() approach is very inefficient: it scans all the rows, and because I have to do this inside a for loop more than a trillion times, it takes a lot of time. Also, there is only one row in the data frame with any given ID combination, so a function that stops once it finds the correct row should suffice. I have looked into using data.table and setkey for this purpose but do not know how to implement that. Thank you in advance.
Indeed you can use data.table with a two-column key; setkeyv takes the column names as a character vector (setkey(DF, ID1, ID2) is equivalent).
library(data.table)
DF <- data.frame(ID1=sample(1:100,100000,replace=TRUE),
                 ID2=sample(1:100,100000,replace=TRUE),
                 Count=0L)   # a Count column is added so there is something to update
# convert DF to a data.table
DF <- as.data.table(DF)
# put both ID1 and ID2 as indexes, in that order
setkeyv(DF,c("ID1","ID2"))
# random x and y values
x <- 10
y <- 18
# select the rows where ID1 == x and ID2 == y via binary search on the key
# and add 1 to their Count column by reference
DF[.(x, y), Count := Count + 1]
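A quick check with another keyed lookup (note that the sampled data above, unlike the question's one-row-per-pair setup, may contain several rows per ID pair, all of which get incremented):
DF[.(x, y)]   # the rows with ID1 == 10 and ID2 == 18, Count now 1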

Filter specific column of a data.frame with also specific range [duplicate]

This question already has answers here:
Filter each column of a data.frame based on a specific value
(4 answers)
Closed 7 years ago.
I would like to select rows of a data.frame using filter(). The condition to select a row is that at least one value out of five variables should be in an interval. I don't know how to apply such a condition.
I have checked similar questions and tried their solutions, but no luck. For example:
Filter each column of a data.frame based on a specific value
Here is a reproducible example:
xx <- rep(rep(seq(0,800,200),each=10),times=2)
yy <- replicate(5, c(replicate(2, sort(10^runif(10,-1,0), decreasing=TRUE)),
                     replicate(2, sort(10^runif(10,-1,0), decreasing=TRUE)),
                     replicate(2, sort(10^runif(10,-2,0), decreasing=TRUE)),
                     replicate(2, sort(10^runif(10,-3,0), decreasing=TRUE)),
                     replicate(2, sort(10^runif(10,-4,0), decreasing=TRUE))))
V <- rep(seq(100,2500,length.out=10),times=2)
No <- rep(1:10,each=10)
df <- data.frame(V,xx,yy,No)
I want to filter the X1:X5 columns so that a row is selected if any value in X1 to X5 lies between 0.5 and 0.55.
library(dplyr)
f_1 <- df %>%
  filter(X1:X5 >= 0.5 & X1:X5 <= 0.55)
I got these warnings:
Warning messages:
1: In c(0.867315118241628, 0.720280300480341, 0.673805202395872, 0.489167242541468, :
numerical expression has 100 elements: only the first used
(the same warning is repeated three more times)
You could adapt the solution presented in this answer. It looks for rows where at least one of the values meets the condition; since logical vectors can be summed, rowSums counts how many of X1:X5 fall in the interval:
df %>%
  filter(rowSums(.[, names(.) %in% paste0("X", 1:5)] >= 0.50 &
                 .[, names(.) %in% paste0("X", 1:5)] <= 0.55) > 0)
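With a recent dplyr (if_any() was added in 1.0.4), the same row-wise "any column in range" condition can be written more directly; a sketch under that assumption:
df %>%
  filter(if_any(X1:X5, ~ .x >= 0.5 & .x <= 0.55))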

Row index for a data.table "binary search" on a subset of columns [duplicate]

This question already has answers here:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
(2 answers)
Closed 9 years ago.
I have a larger set of data and need the row numbers of the rows that fulfill certain conditions, using package data.table.
days <- strptime(c("2013-01-01 8:00:00", "2013-02-01 8:00:00"), format="%Y-%m-%d %H:%M:%S")
DateTime <- rep(seq(days[1], days[2], length.out=1e6/5), 5)
Update <- rep(LETTERS[3:1], length.out=1e6)
Group <- rep(c("AAA", "BBB", "CCC"), length.out=1e6)
Weight <- trunc(rnorm(1e6, 110, 3))
Weight2 <- rnorm(1e6, 100, 1.5)
DT <- data.table(DateTime, Update, Group, Weight, Weight2)
setkey(DT, DateTime, Update, Group, Weight, Weight2)
Exp <- DT[1e6/2]
I cannot create another data.table as a subset without the column DateTime since this column is used in the key. Creating a new key on the subset could change the order and I need certainty that the original order is preserved.
It is possible to get the row numbers I need with either of the two following commands.
system.time(DT[, which(DT$Update==Exp$Update & DT$Group==Exp$Group & DT$Weight==Exp$Weight & DT$Weight2==Exp$Weight2)])
system.time(which(DT$Update==Exp$Update & DT$Group==Exp$Group & DT$Weight==Exp$Weight & DT$Weight2==Exp$Weight2))
However I need a faster way to do that.
Thank you for any suggestions.
It is possible to get the row numbers the following way:
which(!is.na(DT[list(DT$DateTime, DT$Update, DT$Group, DT$Weight, Exp$Weight2),
                which=TRUE]))
However, it is about 4 times slower than the vector-scan examples from the question.
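For comparison, here is a sketch of the cross-join form from the linked duplicate: it constrains only the trailing key columns and lets binary search run across every DateTime, returning matching row numbers directly:
# cross all observed DateTime values with Exp's values for the remaining key
# columns; nomatch=0L drops the combinations that do not occur in DT
DT[CJ(unique(DT$DateTime), Exp$Update, Exp$Group, Exp$Weight, Exp$Weight2),
   which=TRUE, nomatch=0L]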
