Count the number of observations in the data frame in R [duplicate]

I want to know how to count the number of distinct values in a column using R.
For example, let's say I have a data frame df as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,5,5,5,9,9))
Even though the largest id is 9, only 5 distinct values appear: 1, 2, 3, 5, and 9. I want to count how many distinct values exist in id.

In base R:
length(unique(df$id))
[1] 5
Here, unique() keeps only the distinct values, and length() then counts how many values are in the resulting vector.
In dplyr:
df %>%
  summarise(n = length(unique(id)))
Alternatively:
nrow(distinct(df))
Here, distinct() reduces the whole data frame (not just the column id!) to its unique rows before nrow() counts the rows that remain.
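If df had more columns, distinct() can be restricted to id so the extra columns do not affect the count. A minimal sketch, where the second column value is hypothetical:
library(dplyr)
# Hypothetical data frame with a column besides id
df2 <- data.frame(id = c(1, 1, 2, 3, 3),
                  value = c(10, 20, 30, 40, 50))
# Keep one row per distinct id, then count the rows
nrow(distinct(df2, id))
#> [1] 3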

Here are another two options:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,5,5,5,9,9))
sum(!duplicated(df$id))
#> [1] 5
library(dplyr)
n_distinct(df$id)
#> [1] 5
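n_distinct() also slots directly into the earlier dplyr pipeline, which reads a little cleaner than summarise(n = length(unique(id))):
library(dplyr)
df %>%
  summarise(n = n_distinct(id))
# returns a one-row data frame with n = 5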


Select rows in dataframe based on a specific value in any column [duplicate]

I have a data frame that contains e.g. 5 rows and 3 columns.
I would like to select those rows which contain, for example, the text "yellow" (here, rows 1 and 4).
Use the following to select rows that contain "yellow" in any column:
library(tidyverse)
result <- mydata %>%
  filter_all(any_vars(. == "yellow"))
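filter_all() and any_vars() still work but are superseded in current dplyr; a sketch of the same selection with if_any(), assuming the data frame is called mydata as above:
library(dplyr)
# Keep rows where any column equals "yellow"
result <- mydata %>%
  filter(if_any(everything(), ~ .x == "yellow"))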
A base R option using subset + rowSums:
subset(df, rowSums(df == "yellow") > 0)
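The original post's example table is not reproduced here, so the data frame below is only a hypothetical stand-in to run either approach against; "yellow" appears in rows 1 and 4, matching the question:
mydata <- data.frame(col1 = c("yellow", "blue", "green", "red", "blue"),
                     col2 = c("red", "green", "blue", "yellow", "red"),
                     col3 = c("blue", "red", "red", "green", "green"))
# Both approaches return rows 1 and 4
subset(mydata, rowSums(mydata == "yellow") > 0)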

I am trying to sum rows in a column by a unique ID [duplicate]

I have a table of data with a unique ID in the first column and data in the next 5 columns. Some rows in the first column share the same ID. I want to sum all of the rows with the same ID so my output has only one row for each of those IDs. I have seen methods that do this over just one other column, but I need it over all 5 columns.
You might want to use dplyr to work with data frames. Install it if you do not have it already:
install.packages("dplyr")
Assuming your data frame is df, with an ID column and five other numeric columns, this will sum all five columns by ID:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise_all(sum)
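summarise_all() still works but is superseded in current dplyr; a sketch of the equivalent with across() (dplyr >= 1.0.0):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(across(everything(), sum))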

Subsetting a dataframe based on a vector of strings [duplicate]

I have a large dataset called genetics which I need to break down. There are 4 columns: the first is patientID, which is sometimes duplicated, and the other 3 describe the patients.
As said before, some of the patient IDs are duplicated, and I want to know which ones, without losing the remaining columns.
dedupedGenID <- unique(Genetics$ID)
will only give me the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the duplicated ones, i.e. how do I subset the df so that I get the IDs and other columns of those patients whose IDs appear more than once?
Any thoughts would be appreciated.
We can use duplicated to get the IDs that occur more than once and use them to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and select the IDs that have more than one row.
In base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[,':='(is_duplicated = duplicated(ID))]
This chunk converts your data to a data.table and adds a new column which is TRUE if the ID is duplicated and FALSE if not. Note that duplicated() marks only the second and later occurrences, so the first row of each repeated ID is still FALSE.
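If you want every row of a repeated ID flagged, including its first occurrence, a sketch using a per-group row count instead of duplicated():
library(data.table)
# TRUE for every row whose ID appears more than once
genetics[, is_duplicated := .N > 1, by = ID]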

data mining: subset based on maximum criteria of several observations [duplicate]

Consider the example data
Zip_Code <- c(1,1,1,2,2,2,3,3,3,3,4,4)
Political_pref <- c('A','A','B','A','B','B','A','A','B','B','A','A')
income <- c(60,120,100,90,80,60,100,90,200,200,90,110)
df1 <- data.frame(Zip_Code, Political_pref, income)
I want to group_by each $Zip_Code and obtain the maximum $income for each $Political_pref factor within it.
The desired output is a df of 8 obs. of 3 variables: for each $Zip_Code, the row with the greatest income for each $Political_pref level (an A and a B for each).
I am playing with dplyr, but I am happy with a solution using any package (possibly data.table):
library(dplyr)
df2 <- df1 %>%
  group_by(Zip_Code) %>%
  filter(....)
We can use slice with which.max:
library(dplyr)
df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice(which.max(income))
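Since the question mentions data.table as an option, a sketch of the same grouped maximum there (an alternative, not part of the dplyr answer above):
library(data.table)
# Row with the maximum income within each Zip_Code / Political_pref group
setDT(df1)[, .SD[which.max(income)], by = .(Zip_Code, Political_pref)]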

subset based on frequency level [duplicate]

I want to generate a df that keeps only the rows whose "ID" value is associated with more rows than a variable called cutoff. For this example I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests whether each row's df1$ID value falls in a category with more than 9 rows. Where it does, the corresponding element of the logical vector is TRUE, and using that vector as the "i" argument makes [ return the entire row, since the "j" argument is left empty. Your original subset() call misbehaves because IDfreq has only one row per ID, so the length-5 logical vector IDfreq$freq > cutoff is silently recycled across the 45 rows of df1, selecting rows that have nothing to do with their ID's frequency.
See:
?`[`
?'%in%'
Using dplyr
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
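A sketch that stays closer to the original attempt, assuming IDfreq came from plyr::count(df1, "ID") as in the question: merge the per-ID frequencies back onto df1 so the logical test has one value per row, then subset (the merged freq column can be dropped afterwards if you don't want it):
df2 <- subset(merge(df1, IDfreq, by = "ID"), freq > cutoff)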
