Subsetting a dataframe based on a vector of strings [duplicate] - r

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns: the first is patientID, which is sometimes duplicated, and the other 3 describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
will only give me the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the duplicated IDs, i.e. how do I subset the df so that I get the IDs and other columns of the patients that appear more than once?
Any thoughts would be appreciated.

We can use duplicated to get the IDs that occur more than once and use that to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and select the groups that have more than 1 row.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
with dplyr:
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and with data.table:
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]

library(data.table)
genetics <- data.table(genetics)
genetics[,':='(is_duplicated = duplicated(ID))]
This chunk converts your data to a data.table and adds a new column that contains TRUE if the ID is duplicated and FALSE if not. Note, however, that duplicated() marks only the second and later occurrences, so the first row of each duplicated ID is still marked FALSE.
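To flag every copy of a duplicated ID (including the first occurrence), you can combine the forward and reverse passes of duplicated(). A minimal sketch on made-up data (the column name patientID is taken from the question):

```r
library(data.table)

# Toy stand-in for the Genetics table
genetics <- data.table(patientID = c("p1", "p2", "p1", "p3", "p2"),
                       trait     = 1:5)

# duplicated() alone marks only the 2nd+ occurrence; OR-ing it with the
# reverse-order pass flags every row whose ID appears more than once
genetics[, is_duplicated := duplicated(patientID) | duplicated(patientID, fromLast = TRUE)]

genetics[is_duplicated == TRUE]  # all four rows for p1 and p2
```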

Related

Adding new column to R tibble based on values in existing column [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 1 year ago.
I have a tibble of participant data from an experiment, and I need to replace a column of identifiable IDs with a new column of anonymous IDs. I have found the ids package which can generate random IDs for me, but I'm not sure how to do this so that they match up with the existing ones.
Specifically, each participant has multiple rows in the tibble (but not always the same number of rows per participant), so I need to insert a column of random IDs such that all of e.g. Bob's rows will get the ID 123, and all of Alice's rows will get the ID 456.
I was assuming it might be best to do this with apply, but I'm just not sure what the function should be so that I don't get a different random ID for every row.
data$randomID <- apply(data, 1, function(x) {random_id(bytes=5)} )
A simple base-R alternative maps each distinct ID to a consecutive integer, so every row for the same participant gets the same new ID:
data$randomID <- as.integer(factor(data$ID, levels = unique(data$ID)))
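If you specifically want random-looking IDs (as with the ids package) rather than consecutive integers, the trick is to generate one ID per unique participant and then look it up per row. A base-R sketch, with a hypothetical random_hex() helper standing in for ids::random_id():

```r
# Made-up data; the ID column name follows the question
data <- data.frame(ID = c("Bob", "Alice", "Bob", "Alice", "Bob"))

# Hypothetical stand-in for ids::random_id(bytes = 5)
random_hex <- function(n, bytes = 5) {
  vapply(seq_len(n),
         function(i) paste(sprintf("%02x", sample(0:255, bytes, replace = TRUE)),
                           collapse = ""),
         character(1))
}

# One random ID per *unique* participant, looked up per row
lookup <- setNames(random_hex(length(unique(data$ID))), unique(data$ID))
data$randomID <- unname(lookup[data$ID])
```

All of Bob's rows share one random ID and all of Alice's share another, because the IDs are drawn once per participant, not once per row.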

R: subsetting first 30 groups in data frame [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 2 years ago.
I'm trying to find a way to subset the first 30 groups in my data frame (171 in total, of unequal length).
Here's a smaller dummy data frame I've been practicing with (in this case I only try to subsample the first 3 groups):
groups <- c(rep("A", times = 5), rep("B", times = 2), rep("C", times = 3), rep("D", times = 2), rep("E", times = 8))
value <- c(1, 2, 4, 3, 5, 7, 6, 8, 7, 5, 2, 3, 5, 7, 1, 1, 2, 3, 5, 4)
dummy <- data.frame(groups, value)
So far, I've tried variations of:
subset <- c("A", "B", "C")
dummy2 <- dummy[dummy$groups == subset, ]
but I get the following warning: longer object length is not a multiple of shorter object length
Would anyone know how to fix this or have other options?
We can use filter from dplyr. Get the first 'n' unique elements of 'groups' with head, then use %in% to build a logical vector for filter to subset the rows:
library(dplyr)
n <- 4
dummy %>%
filter(groups %in% head(unique(groups), n))
or subset in base R
subset(dummy, groups %in% head(unique(groups), n))
== can be used either with equal-length vectors (for elementwise comparison) or when the second vector has length 1; with other lengths R recycles the shorter vector, which is what produces the warning above. For multiple elements, use %in%.
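A minimal illustration of the recycling behaviour behind that warning:

```r
groups <- c("A", "A", "B", "C", "E")

# == recycles the length-3 vector over the length-5 column, so the
# comparison is really against c("A", "B", "C", "A", "B") elementwise
# (and R warns that the lengths are not multiples of each other)
eq <- suppressWarnings(groups == c("A", "B", "C"))
eq   # TRUE FALSE FALSE FALSE FALSE

# %in% asks, for each element, "is it one of these?"
inn <- groups %in% c("A", "B", "C")
inn  # TRUE TRUE TRUE TRUE FALSE
```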

I am trying to sum rows in a column by a unique ID [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a table of data that has a unique ID in the first column and data in the next 5 columns. Some rows share the same ID. I want to add up all of the rows with the same ID so that my output has only one row per ID. I have seen methods that do this over just one other column, but I need it over 5 other columns.
You might want to use dplyr to work with data frames. Install it if you do not have it:
install.packages("dplyr")
Assuming your dataframe is df with ID column and other five columns are numeric, this will sum all those 5 cols by ID.
library(dplyr)
df %>%
group_by(ID) %>%
summarise_all(sum)
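summarise_all() still works but has been superseded in recent dplyr releases; the same grouped sum can be written with across(). A sketch on made-up data, assuming all the non-ID columns are numeric:

```r
library(dplyr)

df <- data.frame(ID = c("a", "b", "a", "b"),
                 x  = c(1, 2, 3, 4),
                 y  = c(10, 20, 30, 40))

out <- df %>%
  group_by(ID) %>%
  summarise(across(where(is.numeric), sum))

out  # one row per ID: a -> x = 4, y = 40; b -> x = 6, y = 60
```

The base-R one-liner aggregate(. ~ ID, data = df, FUN = sum) gives the same result without any packages.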

How to present data for repeated Title [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
Sample file Imdb sample
In the film data set, some titles appear multiple times: 'A star is born' aka 'Narodziny gwiazdy' appears four times, and 'Halloween' three times. These are different movies, released in different years.
How to filter only these titles which are present multiple times and display the details for them?
(titleDetails <- imdb_movies.csv %>%
group_by(Title) %>%
summarise(count = n()) %>%
filter(count > 2))
titleDetails
The code above displays only the title and the count.
How to display all details which I have in the data set?
You can call df[duplicated(df$Title) | duplicated(df$Title, fromLast = TRUE), ].
duplicated(df$Title) returns a logical vector with TRUEs for all rows with a duplicated title. The first occurrence of the duplicated title will show as FALSE.
duplicated(df$Title, fromLast = TRUE) does the same thing, except in reverse order. This time, from the standpoint of the data you've supplied, the last occurrence of the duplicated title is marked FALSE.
Then, you can get all of the rows with duplicated titles by using the | (or) operator on these two duplicated() calls and index your original data using the resulting logical vector.
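A small reproducible sketch of that indexing, with made-up titles standing in for the IMDB data:

```r
df <- data.frame(Title = c("A Star Is Born", "Halloween", "A Star Is Born",
                           "Unique Film", "Halloween"),
                 Year  = c(1937, 1978, 2018, 2000, 2018))

# Forward and reverse duplicated() passes together catch every copy
dups <- df[duplicated(df$Title) | duplicated(df$Title, fromLast = TRUE), ]
dups  # all four rows for the two repeated titles, with every column kept
```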

subset based on frequency level [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I want to generate a df that selects rows associated with an "ID" that in turn is associated with a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column (the only IDs with more than 9 rows). Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
Your last line fails because IDfreq$freq has only 5 elements (one per ID), so subset() recycles that short logical vector over the 45 rows of df1, selecting essentially arbitrary rows. The line above instead tests whether each df1$ID value belongs to a group with more than 9 rows. Where it does, the logical element of the returned vector is TRUE, and used as the "i" argument it makes the [ function return the entire row, since the "j" argument is empty.
See:
?`[`
?'%in%'
Using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
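On the question's own ID layout the ave() version can be checked directly; the only IDs with more than 9 rows are 4 and 5 (11 + 13 = 24 rows):

```r
# Same ID structure as the question; a deterministic column replaces rnorm()
ID  <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = seq_along(ID))

cutoff <- 9
out <- subset(df1, ave(ID, ID, FUN = length) > cutoff)

nrow(out)       # 24
unique(out$ID)  # 4 5
```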
