How to extract unique values from a data frame in r [duplicate] - r

This question already has answers here:
list unique values for each column in a data frame
(2 answers)
Closed 2 years ago.
I would like to extract the unique values from this data frame as an example
test <- data.frame(position=c("chr1_13529", "chr1_13529", "chr1_13538"),
genomic_regions=c("gene", "intergenic", "intergenic"))
The resulting data frame should give me only
chr1_13538 intergenic
Basically I want to extract rows that have a unique position

Here is a tidyverse/dplyr solution.
You are just grouping by position, counting occurrences, and keeping the positions that occur only once.
library(tidyverse)
test %>%
group_by(position) %>%
mutate(count = n()) %>%
filter(count == 1) %>%
select(-count)

Here is a base R approach:
There are two parts:
1. We create a vector of the positions that occur at least twice, using duplicated.
2. We look for positions that are not in that vector of duplicated positions.
Then we subset test on condition 2.
test[!test$position %in% test$position[duplicated(test$position)],]
# position genomic_regions
#3 chr1_13538 intergenic
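The two parts described above can also be spelled out with named intermediate objects. This is only a minimal sketch on the question's test data frame; the names dup_positions, keep and unique_rows are illustrative and do not appear in the original answer.

```r
test <- data.frame(
  position = c("chr1_13529", "chr1_13529", "chr1_13538"),
  genomic_regions = c("gene", "intergenic", "intergenic")
)

# Part 1: positions that occur at least twice
dup_positions <- test$position[duplicated(test$position)]

# Part 2: TRUE for rows whose position is not among the duplicated ones
keep <- !test$position %in% dup_positions

# Subset on the condition from part 2
unique_rows <- test[keep, ]
```

Keeping the logical vector keep separate makes it easy to inspect which rows are dropped before subsetting.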

Related

How to sum values in a group_by pipe except certain values? [duplicate]

This question already has answers here:
filtering within the summarise function of dplyr
(3 answers)
Opposite of %in%: exclude rows with values specified in a vector
(13 answers)
Closed 3 months ago.
EDIT: I want to specify which values NOT to include in my calculation by providing a list of values for records to skip. I do NOT want to provide a list of values to include in my calculation because my dataset is too large.
I want to group records based on a certain value, and then I want to do some other calculations for certain variables; however, I want to exclude certain values from one of those calculations. Here is an example of what the data transformation would look like without any exclusions:
library(dplyr)
grouped <- starwars %>%
group_by(species) %>% #group my data by a particular value
summarise(Total_Mass = sum(mass), #make a calculation
Average_Height = mean(height)) # make another calculation
and here's what I am attempting to do:
exclude <- c("R2-D2","Luke","Darth") #make a list of the names of records I would like to exclude
grouped2 <- starwars %>%
group_by(species) %>%
summarise(Total_Mass = sum(mass) where name !%in% exclude, #sum mass for all records except those where name is in the exclude list
Average_Height = mean(height)) # make another calculation without any exclusions
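One common way to express this kind of exclusion is to subset the column inside the aggregating call, so that only that one calculation skips the listed records. The sketch below is an assumption about what the asker wants, not an answer taken from the post: people is a hypothetical toy data frame and the names in exclude are made up. Note that %in% compares whole strings, so the exclusion list must match the name column exactly.

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's data set
people <- data.frame(
  name    = c("Ann", "Bob", "Cy", "Dee"),
  species = c("Human", "Human", "Droid", "Droid"),
  mass    = c(60, 80, 30, 40),
  height  = c(160, 180, 100, 120)
)

# Names of the records to skip in the mass total only
exclude <- c("Bob", "Cy")

grouped2 <- people %>%
  group_by(species) %>%
  summarise(
    Total_Mass = sum(mass[!name %in% exclude]),  # sum, skipping excluded names
    Average_Height = mean(height)                # no exclusions here
  )
```

If the summed column contains missing values, na.rm = TRUE would also be needed inside sum.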

I am trying to sum rows in a column by a unique ID [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a table of data that has a unique ID in the first column, and the next 5 columns hold data. Some rows in the first column share the same ID. I want to sum all of the rows with the same ID so that my output has only one row per ID. I have seen some methods to do this over just one other column, but I need it over 5 other columns.
You might want to use dplyr to work with data frames. Install it if you do not have it already.
install.packages("dplyr")
Assuming your data frame is df, with an ID column and five other numeric columns, this will sum all five columns by ID.
library(dplyr)
df %>%
group_by(ID) %>%
summarise_all(sum)
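If dplyr is not available, the same grouping can be sketched in base R with aggregate. The toy data frame below is hypothetical and uses only two value columns (x1, x2) to keep the example short; the formula . ~ ID sums every other column by ID.

```r
# Hypothetical toy data: an ID column plus numeric columns
df <- data.frame(
  ID = c("a", "b", "a", "c", "b"),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(10, 20, 30, 40, 50)
)

# Sum all non-ID columns within each ID; one output row per ID
summed <- aggregate(. ~ ID, data = df, FUN = sum)
```

With the formula interface, rows containing missing values are dropped by default before summing.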

Subsetting a dataframe based on a vector of strings [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
Will only give me the unique IDs, without the column.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern=dedupedGenID),]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that occur more than once, i.e. how do I subset the df so that I get the IDs and the other columns of the patients that repeat?
Any thoughts would be appreciated.
We can use duplicated to find the IDs that occur more than once and use them to subset the data
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and keep the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
With dplyr:
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
And with data.table:
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
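The first part of the question (one row per ID, keeping the other columns) is not covered by the snippets above. A minimal base R sketch, assuming the ID column is called ID and that keeping the first row seen for each ID is acceptable; the toy Genetics data frame here is hypothetical.

```r
# Hypothetical toy data: some IDs appear more than once
Genetics <- data.frame(
  ID    = c("p1", "p2", "p1", "p3", "p2"),
  trait = c("a", "b", "c", "d", "e")
)

# Keep the first row for each ID, with all other columns intact
first_rows <- Genetics[!duplicated(Genetics$ID), ]
```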
library(data.table)
genetics <- data.table(genetics)
genetics[,':='(is_duplicated = duplicated(ID))]
This chunk makes a data.table from your data and adds a new column that contains TRUE if the ID is duplicated and FALSE if not. Note that duplicated marks only the second and later occurrences, so the first occurrence of each ID is marked FALSE.
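If every occurrence of a repeated ID should be flagged, including the first, one option is to combine a forward and a backward pass of duplicated. This is a sketch in base R on a hypothetical toy data frame, not part of the original answer; the same logical expression can be used inside a data.table.

```r
# Hypothetical toy data
genetics <- data.frame(
  ID    = c("p1", "p2", "p1", "p3"),
  trait = c("a", "b", "c", "d")
)

# TRUE for every row whose ID appears more than once, first occurrence included
genetics$is_duplicated <- duplicated(genetics$ID) |
  duplicated(genetics$ID, fromLast = TRUE)
```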

Deleting duplicate rows based on logical operation in R [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Select the row with the maximum value in each group
(19 answers)
Closed 3 years ago.
I have data like this:
ID SHape Length
180139746001000 2
180139746001000 1
For each duplicated ID, I want to delete the row with the smaller shape length.
Can anyone help me with this?
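With
library(data.table)
df <- data.table(matrix(c(102:106,106:104,1:3,1:3,5:6),nrow = 8))
setnames(df, c("ID", "Shape Length"))
just use duplicated after sorting:
setkeyv(df, "Shape Length") # sort ascending by shape length
df[!duplicated(ID, fromLast = TRUE)] # keep the last (largest) row for each ID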
You can select the highest shape length for each ID by performing
df %>%
group_by(ID) %>%
arrange(desc(SHape.Length)) %>%
slice(1) %>%
ungroup()
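In dplyr 1.0.0 or later, the sort-then-take-first pattern can also be written with slice_max. A minimal sketch on hypothetical toy data; the column name Shape_Length is an assumption, and with_ties = FALSE is used so that exactly one row per ID is kept.

```r
library(dplyr)

# Hypothetical toy data: ID "a" appears twice
shapes <- data.frame(
  ID           = c("a", "a", "b"),
  Shape_Length = c(2, 1, 5)
)

# Keep the single row with the largest Shape_Length within each ID
longest <- shapes %>%
  group_by(ID) %>%
  slice_max(Shape_Length, n = 1, with_ties = FALSE) %>%
  ungroup()
```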

Sorting Column in R [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 3 years ago.
I have data that includes a treatment group, which is indicated by a 1, and a control group, which is indicated by a 0. This is all contained under the variable treat_invite. How can I separate these and take the mean of pct_missing for the 1's and 0's? I've attached an image for clarification.
Assuming your data frame is called df:
df <- df %>% group_by(treat_invite) %>% mutate(MeanPCTMissing = mean(pct_missing))
Or, if you just want the summary table (rather than the original table with an additional column):
df <- df %>% group_by(treat_invite) %>% summarise(MeanPCTMissing = mean(pct_missing))
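The group means can also be sketched in base R with tapply. The data frame below is a hypothetical stand-in for the asker's data, using the column names as written in the question (treat_invite and pct_missing).

```r
# Hypothetical toy data: 1 = treatment, 0 = control
df <- data.frame(
  treat_invite = c(1, 0, 1, 0),
  pct_missing  = c(0.2, 0.4, 0.6, 0.8)
)

# Mean of pct_missing within each level of treat_invite;
# the result is a named vector with names "0" and "1"
group_means <- tapply(df$pct_missing, df$treat_invite, mean)
```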

Resources