This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 2 years ago.
I'm trying to find a way to subset the first 30 groups in my data frame (171 in total, of unequal length).
Here's a smaller dummy data frame I've been practicing with (in this case I only try to subsample the first 3 groups):
groups=c(rep("A",times=5),rep("B",times=2), rep("C",times=3),rep("D",times=2), rep("E",times=8))
value=c(1,2,4,3,5,7,6,8,7,5,2,3,5,7,1,1,2,3,5,4)
dummy<-data.frame(groups,value)
So far, I've tried variations of:
subset<-c("A","B","C")
dummy2<-dummy[dummy$groups==subset,]
but I get the following warning: longer object length is not a multiple of shorter object length
Would anyone know how to fix this or have other options?
We can use filter from dplyr: take the first n unique values of groups with head(unique(groups), n), then use %in% inside filter to build the logical vector that selects the rows.
library(dplyr)
n <- 4
dummy %>%
filter(groups %in% head(unique(groups), n))
or subset in base R
subset(dummy, groups %in% head(unique(groups), n))
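== compares element by element, so it is appropriate when both vectors have the same length or when the second vector has length 1. If the lengths differ, R recycles the shorter vector, which is what triggers the warning in the question. To test each element against several values, use %in%.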
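To make the difference concrete, here is a small sketch on the dummy data from the question (the name keep is introduced only for illustration). With ==, the shorter vector is recycled along the longer one, so the comparison is positional rather than a membership test:

```r
groups <- c(rep("A", times = 5), rep("B", times = 2), rep("C", times = 3),
            rep("D", times = 2), rep("E", times = 8))
value <- c(1, 2, 4, 3, 5, 7, 6, 8, 7, 5, 2, 3, 5, 7, 1, 1, 2, 3, 5, 4)
dummy <- data.frame(groups, value)

keep <- c("A", "B", "C")  # hypothetical name for the groups to keep

# `==` recycles keep as A, B, C, A, B, C, ... and compares position by position;
# 20 is not a multiple of 3, hence the warning (suppressed here).
by_equal <- suppressWarnings(dummy$groups == keep)

# `%in%` asks, for each row, whether its group occurs anywhere in keep.
by_in <- dummy$groups %in% keep

sum(by_equal)  # rows that match only by positional coincidence
sum(by_in)     # all rows belonging to groups A, B and C
dummy[by_in, ]
```

The positional comparison keeps only a handful of rows, while %in% keeps every row of the three groups.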
This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 1 year ago.
I have a dataframe with 10 columns. One column gives bird species' names. There are actually 300 species, but I'm only interested in 200 of them. I would like to keep only the information about these 200 species.
Screenshot of my table: https://i.stack.imgur.com/OcJyI.png
I can't just write: filter(Species == "Mallard" & Species == "Wood-pigeon")
I have a matrix with all 200 selected species, but I don't know how to use this matrix to select the relevant rows in my dataframe. Is it possible to select rows based on a matrix with a function such as subset or filter?
What is the correct code, please?
Combining == with & cannot work here, because a single row never holds two different 'Species' values at once; with == the conditions would have to be joined with | instead of &. This is easier to do with %in% on a vector of values, e.g.
subset(df1, Species %in% c("Mallard", "Wood-pigeon"))
The vector c("Mallard", "Wood-pigeon") can be extended to any number of species.
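If the selected species are stored in a matrix, as in the question, one option is to flatten it to a plain vector first. A minimal sketch, assuming a data frame df1 with a Species column and a hypothetical one-column character matrix named selected (the data, the Count column and the names selected, wanted and kept are made up for illustration):

```r
# Made-up example data; the real table has 10 columns and 300 species
df1 <- data.frame(
  Species = c("Mallard", "Wood-pigeon", "Robin", "Mallard", "Wren"),
  Count   = c(3, 1, 4, 2, 5)
)

# Hypothetical matrix holding the species of interest
selected <- matrix(c("Mallard", "Wood-pigeon", "Wren"), ncol = 1)

# as.vector() drops the matrix dimensions, leaving a character vector
wanted <- as.vector(selected)

# Keep only the rows whose Species appears in the vector
kept <- subset(df1, Species %in% wanted)
kept
```

The same wanted vector can be passed to dplyr's filter in place of subset.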
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
Will only give me the unique IDs, without the column.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern=dedupedGenID),]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that occur more than once, i.e. how do I subset the df so that I get the IDs and other columns of the patients that repeat?
Any thoughts would be appreciated.
We can use duplicated to find the IDs that occur more than once and use them to subset the data
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and keep the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[,':='(is_duplicated = duplicated(ID))]
This chunk converts your data to a data.table and adds a new column that is TRUE if the ID is duplicated and FALSE if not. Note that duplicated only marks repeats: the first occurrence of each ID is marked FALSE.
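If every row of a repeated ID should be flagged, including the first occurrence, one common idiom is to combine duplicated() with its fromLast = TRUE variant. A small base R sketch on made-up data (the toy data frame and the column name is_repeated are assumptions for illustration):

```r
# Made-up stand-in for the genetics data
toy <- data.frame(ID  = c("p1", "p2", "p1", "p3", "p2", "p1"),
                  age = c(40, 35, 40, 62, 35, 40))

# duplicated() flags the 2nd, 3rd, ... occurrence of a value;
# with fromLast = TRUE it scans from the end and flags all but the last one.
# Combining both flags every row whose ID occurs more than once.
toy$is_repeated <- duplicated(toy$ID) | duplicated(toy$ID, fromLast = TRUE)

# All rows of the patients that repeat, with the other columns intact
repeated_rows <- toy[toy$is_repeated, ]
repeated_rows
```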
This question already has answers here:
Filter multiple values on a string column in dplyr
(6 answers)
Closed 3 years ago.
I have a large dataframe "Marks", containing marks each year from 2014/5-2017/8. I have separated the dataframe into 4 smaller ones, by year of completion using:
marks14 <-
Marks%>%
filter(YearOfCompletion == "2014/5")
marks15 <-
Marks%>%
filter(YearOfCompletion == "2015/6")
marks16 <-
Marks%>%
filter(YearOfCompletion == "2016/7")
marks17 <-
Marks%>%
filter(YearOfCompletion == "2017/8")
I am now attempting to separate the "2016/7" and "2017/8" marks into one dataframe. I have tried to manipulate the filter function, but I'm unable to figure it out and I can't find the code for this in online cookbooks.
We can use %in% to filter on a vector of one or more values
library(dplyr)
Marks %>%
filter(YearOfCompletion %in% c("2016/7", "2017/8"))
This question already has answers here:
Sum rows in data.frame or matrix
(7 answers)
Closed 7 years ago.
I need to sum the columns of a table whose names start with a particular string.
An example table might be:
tbl<-data.frame(num1=c(3,2,9), num2=c(3,2,9),n3=c(3,2,9),char1=c('a', 'b', 'c'))
I get the list of columns (in this example I wrote only 2, but the real case has more than 20).
a<-colnames(tbl)[grep('num', colnames(tbl))]
I tried with
sum(tbl[,a])
But I get only one number with the total sum of the elements in both vectors.
What I need is the result of:
tbl$num1+ tbl$num2
We can either use Reduce
Reduce(`+`, tbl[a])
Or rowSums. The rowSums also has the option of removing the NA elements with na.rm=TRUE.
rowSums(tbl[a])
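Applied to the example table, both approaches give the row-wise sum of the matched columns. The sketch below also anchors the pattern with ^ so that only names starting with num are picked up, and adds a made-up NA (in the hypothetical copy tbl_na) to show the effect of na.rm:

```r
tbl <- data.frame(num1 = c(3, 2, 9), num2 = c(3, 2, 9),
                  n3 = c(3, 2, 9), char1 = c("a", "b", "c"))

# "^num" matches only column names that start with "num";
# value = TRUE returns the names instead of their positions
a <- grep("^num", colnames(tbl), value = TRUE)

by_reduce  <- Reduce(`+`, tbl[a])  # adds the columns one after another
by_rowsums <- rowSums(tbl[a])      # same sums, computed row by row

# Hypothetical variant with a missing value, to illustrate na.rm
tbl_na <- tbl
tbl_na$num2[2] <- NA
with_na    <- rowSums(tbl_na[a])                # the NA propagates to row 2
without_na <- rowSums(tbl_na[a], na.rm = TRUE)  # the NA is skipped
```

Reduce has no na.rm switch, so rowSums is usually the more convenient choice when the columns may contain missing values.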
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I want to generate a df that selects rows based on how many rows share their "ID", using a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests, for each row, whether its df1$ID value belongs to a category with more than 9 rows. If it does, the corresponding element of the logical vector is TRUE. Passing that vector as the "i" argument makes the [ function return the entire row, since the "j" argument is empty.
See:
?`[`
?'%in%'
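As for what the original last line does: IDfreq$freq > cutoff has one element per ID (five in total), not one per row, so when it is used to index the 45 rows of df1 it is recycled. A sketch of that behaviour, using table() instead of count() to stay in base R (a substitution made here only to avoid extra packages; the names freq, per_id, recycled, per_row and intended are illustrative):

```r
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

freq <- as.vector(table(df1$ID))  # one count per ID: 5, 7, 9, 11, 13
per_id <- freq > cutoff           # length 5, one flag per ID

# Indexing 45 rows with a length-5 logical vector repeats it nine times,
# so the rows kept depend on their position, not on their ID value.
recycled <- df1[per_id, ]

# A flag computed per row selects the intended rows.
per_row <- df1$ID %in% names(table(df1$ID))[table(df1$ID) > cutoff]
intended <- df1[per_row, ]

nrow(recycled)
nrow(intended)
unique(intended$ID)
```

The recycled version keeps rows 4, 5, 9, 10, 14, 15 and so on, regardless of ID, which explains the unexpected result.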
Using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)