This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
Sample file Imdb sample
In my film data set, some titles appear more than once: 'A star is born' (aka 'Narodziny gwiazdy') appears four times and 'Halloween' three times. These are different movies, released in different years.
How can I keep only the titles that appear multiple times and display the details for them?
(titleDetails <- imdb_movies.csv %>%
  group_by(Title) %>%
  summarise(count = n()) %>%
  filter(count > 2))
titleDetails
The code above displays only the title and the count.
How can I display all the details I have in the data set?
You can call df[duplicated(df$Title) | duplicated(df$Title, fromLast = TRUE), ].
duplicated(df$Title) returns a logical vector that is TRUE for every row whose title has already appeared earlier in the column. The first occurrence of each repeated title is therefore FALSE.
duplicated(df$Title, fromLast = TRUE) does the same thing, but scans from the end of the column, so this time the last occurrence of each repeated title is FALSE.
Combining the two duplicated() calls with the | (or) operator gives a logical vector that is TRUE for every row whose title appears more than once; use it to index your original data.
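To make the two scans concrete, here is a minimal sketch on made-up data. The data frame, its Year column, and the helper names from_start and from_end are assumptions for illustration only; your real data set will have more columns.

```r
# Hypothetical sample data (not the real IMDb file)
df <- data.frame(
  Title = c("Halloween", "Alien", "Halloween",
            "A Star Is Born", "Halloween", "A Star Is Born"),
  Year  = c(1978, 1979, 2007, 1954, 2018, 2018)
)

from_start <- duplicated(df$Title)                   # first occurrence of each title is FALSE
from_end   <- duplicated(df$Title, fromLast = TRUE)  # last occurrence of each title is FALSE

# TRUE in either vector means the title occurs more than once
repeated <- df[from_start | from_end, ]
print(repeated)
```

Every "Halloween" and "A Star Is Born" row is kept with all its columns, while the single "Alien" row is dropped.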
This question already has answers here:
rearrange a data frame by sorting a column within groups
(3 answers)
Subsetting from a Data Frame
(2 answers)
Closed 2 months ago.
I have two columns in my dataset in RStudio right now: one is "experience level", which contains four different two-letter abbreviations ("SE", "MI", "EX", "EN") for the experience level of an employee. The second column is "salary", which is the employee's salary in USD. How can I create a data frame or sort the data by a specific experience level, such as showing only the salaries of "EN" employees?
I am not even sure where to start. I have tried using group_by, but to no avail.
Showing only the salaries that belong to one group can be done with filter().
Sorting can be done with the arrange() function.
library(tidyverse)
df %>%
  filter(experience == "EN") %>%  # keep only the EN rows
  arrange(desc(salary))           # sort by salary, descending (high to low)
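If you would rather not load any packages, the same two steps can be sketched in base R. The data below is invented, and the column names experience and salary are assumed to match the ones used above.

```r
# Hypothetical data; column names are assumptions
df <- data.frame(
  experience = c("SE", "EN", "MI", "EN", "EX", "EN"),
  salary     = c(120000, 55000, 90000, 62000, 200000, 48000)
)

en <- df[df$experience == "EN", ]                 # keep only the EN rows
en <- en[order(en$salary, decreasing = TRUE), ]   # sort by salary, high to low
print(en)
```

Logical indexing plays the role of filter() here, and order() plays the role of arrange().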
This question already has answers here:
filtering within the summarise function of dplyr
(3 answers)
Opposite of %in%: exclude rows with values specified in a vector
(13 answers)
Closed 3 months ago.
EDIT: I want to specify which values NOT to include in my calculation by providing a list of values for records to skip. I do NOT want to provide a list of values to include in my calculation because my dataset is too large.
I want to group records based on a certain value, and then I want to do some other calculations for certain variables; however, I want to exclude certain values from one of those calculations. Here is an example of what the data transformation would look like without any exclusions:
library(dplyr)
grouped <- starwars %>%
  group_by(species) %>%                     # group my data by a particular value
  summarise(Total_Mass = sum(mass),         # make a calculation
            Average_Height = mean(height))  # make another calculation
and here's what I am attempting to do:
exclude <- c("R2-D2","Luke","Darth") # make a list of the names of records I would like to exclude
grouped2 <- starwars %>%
  group_by(species) %>%
  summarise(Total_Mass = sum(mass) where name !%in% exclude, # sum mass for all records except those where name is in the exclude list
            Average_Height = mean(height)) # make another calculation without any exclusions
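One possible way to express that "where" clause is to subset the vector inside summarise() with a negated %in%. The sketch below uses a small invented tibble (chars) in place of starwars so the numbers are easy to check; the values are assumptions, not real data. Keep in mind that %in% matches whole values, so with the real starwars data "Luke" would not match "Luke Skywalker".

```r
library(dplyr)

# Hypothetical toy data standing in for starwars
chars <- tibble(
  name    = c("R2-D2", "C-3PO", "Luke", "Leia", "Darth"),
  species = c("Droid", "Droid", "Human", "Human", "Human"),
  mass    = c(32, 75, 77, 49, 136),
  height  = c(96, 167, 172, 150, 202)
)

exclude <- c("R2-D2", "Luke", "Darth")  # names to skip in one calculation only

grouped2 <- chars %>%
  group_by(species) %>%
  summarise(
    Total_Mass     = sum(mass[!name %in% exclude]),  # excluded names are left out of this sum
    Average_Height = mean(height)                    # every row in the group is used here
  )
print(grouped2)
```

Because the exclusion is applied to the mass vector rather than to the rows, the other summaries still see every record in the group.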
This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 1 year ago.
In the code below, I look at three variables in a dataset. However, I would like to look at those three variables only for rows where the year column is equal to 72. Is there a way to do this using the View function?
library(plm)
data("Cigar")
View(Cigar[, c("year","price", "sales")])
You can do this in several ways. One way is to use subset() with select. You don't need to quote column names.
For example:
View(subset(Cigar, select = c(year, price, sales), year == 72))
In R version 4.1.0 or newer you can also use the |> pipe:
Cigar |>
  subset(year == 72, select = c(year, price, sales)) |>
  View()
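Plain bracket indexing is another option. This is a minimal sketch on an invented stand-in for Cigar (cigar_demo, with made-up values), so it runs without the plm package; print() is used so it also works outside an interactive session.

```r
# Hypothetical stand-in for the Cigar data; values are made up
cigar_demo <- data.frame(
  state = c(1, 1, 2, 2),
  year  = c(71, 72, 71, 72),
  price = c(28.6, 29.8, 27.5, 28.9),
  sales = c(93.9, 95.4, 104.3, 103.9)
)

# Rows where year is 72, and only the three columns of interest
y72 <- cigar_demo[cigar_demo$year == 72, c("year", "price", "sales")]
print(y72)  # in an interactive session, View(y72) would open the viewer instead
```

With bracket indexing the column names do have to be quoted, unlike in subset().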
This question already has answers here:
list unique values for each column in a data frame
(2 answers)
Closed 2 years ago.
I would like to extract the unique values from this data frame as an example
test <- data.frame(position=c("chr1_13529", "chr1_13529", "chr1_13538"),
genomic_regions=c("gene", "intergenic", "intergenic"))
The resulting data frame should give me only
chr1_13538 intergenic
Basically I want to extract rows that have a unique position
Here is a tidyverse/dplyr solution.
You group by position, count the occurrences, and keep only the positions that occur exactly once.
library(tidyverse)
test %>%
  group_by(position) %>%
  mutate(count = n()) %>%
  filter(count == 1) %>%
  select(-count)
Here is a base R approach.
There are two parts:
1. We create a vector of the positions that occur at least twice, using duplicated().
2. We look for the positions that are not in that vector of duplicated positions.
Then we subset test on condition 2.
test[!test$position %in% test$position[duplicated(test$position)],]
# position genomic_regions
#3 chr1_13538 intergenic
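The one-liner can be unpacked into its two parts to see what each step produces. The intermediate names dup_positions, keep, and unique_rows are only illustrative.

```r
test <- data.frame(position = c("chr1_13529", "chr1_13529", "chr1_13538"),
                   genomic_regions = c("gene", "intergenic", "intergenic"))

# Part 1: positions that occur at least twice
dup_positions <- test$position[duplicated(test$position)]

# Part 2: TRUE for rows whose position is not among the duplicated ones
keep <- !test$position %in% dup_positions

unique_rows <- test[keep, ]
print(unique_rows)
```

Note that test[!duplicated(test$position), ] would not be enough on its own, because it still keeps the first row of every repeated position.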
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID <- unique(Genetics$ID)
This only gives me the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that occur more than once, i.e. how do I subset the df so that I get the IDs and other columns of the patients that repeat?
Any thoughts would be appreciated.
We can use duplicated() to find the IDs that occur more than once and use them to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and select the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
With dplyr:
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
And with data.table:
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
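As a quick check, here is a minimal sketch that runs the two base R variants on invented data and confirms they select the same rows. The toy Genetics frame and its gene column are assumptions for illustration.

```r
# Hypothetical toy data; the real Genetics has more columns
Genetics <- data.frame(
  ID   = c("p1", "p2", "p1", "p3", "p2", "p4"),
  gene = c("A", "B", "C", "D", "E", "F")
)

# Variant 1: IDs flagged by duplicated(), then all rows with those IDs
by_dup <- subset(Genetics, ID %in% unique(ID[duplicated(ID)]))

# Variant 2: per-ID row count attached to every row, keep counts above 1
by_count <- subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)

print(by_dup)
```

Both variants keep every row of p1 and p2, including the first occurrence, and drop p3 and p4.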
library(data.table)
genetics <- data.table(genetics)
genetics[, is_duplicated := duplicated(ID)]
This chunk converts your data to a data.table and adds a new column that is TRUE if the ID has already appeared in an earlier row and FALSE otherwise. Note that it flags only the repeats: the first occurrence of each ID is marked FALSE.