data mining: subset based on maximum criteria of several observations [duplicate] - r

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 6 years ago.
Consider the example data
Zip_Code <- c(1,1,1,2,2,2,3,3,3,3,4,4)
Political_pref <- c('A','A','B','A','B','B','A','A','B','B','A','A')
income <- c(60,120,100,90,80,60,100,90,200,200,90,110)
df1 <- data.frame(Zip_Code, Political_pref, income)
I want to group by each Zip_Code and obtain the maximum income for each Political_pref level.
The desired output is a data frame of 3 variables containing up to 2 observations for each Zip_Code (one for A and one for B): the ones with the greatest income. That would be 8 observations if every zip code had both preferences; in the sample data Zip_Code 4 only has A, so 7 rows come back.
I am experimenting with dplyr, but I am happy with a solution using any package (possibly data.table).
library(dplyr)
df2 <- df1 %>%
group_by(Zip_Code) %>%
filter(....)

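We can group by both Zip_Code and Political_pref and then use slice with which.max:
library(dplyr)
df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  # which.max() returns the position of the first maximum, so ties yield one row
  slice(which.max(income))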
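For reference, a self-contained version of this answer on the question's sample data might look like the sketch below. The slice_max() variant is an addition here, not part of the original answer, and assumes dplyr 1.0.0 or later; with_ties = FALSE is what makes it keep a single row when incomes tie, as which.max() does.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  Zip_Code       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Political_pref = c("A", "A", "B", "A", "B", "B", "A", "A", "B", "B", "A", "A"),
  income         = c(60, 120, 100, 90, 80, 60, 100, 90, 200, 200, 90, 110)
)

# Approach from the answer: first row holding the maximum income per group
res_slice <- df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice(which.max(income)) %>%
  ungroup()

# Assumed alternative (dplyr >= 1.0.0): slice_max() with ties dropped
res_slice_max <- df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice_max(income, n = 1, with_ties = FALSE) %>%
  ungroup()

res_slice
```

On the sample data both versions return 7 rows, because Zip_Code 4 contains only preference A.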

Related

Use of aggregate function in R [duplicate]

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 5 months ago.
I have data like this:
ID <- c(1001,1001, 1001, 1002,1002,1002)
activity <- c(123,123,123, 456,456,789)
df<- data.frame(ID,activity)
I want to count the number of unique activity values within ID to end up with a dataframe like this:
N<- c(1,1,1,2,2,2)
data.frame(df,N)
So we can see that person 1001 did only 1 activity while person 1002 did two.
I think it can be done with aggregate but am happy to use another approach.
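dplyr option:
library(dplyr)
# Count distinct activities per ID, then join the counts back onto the original rows
sum_df <- df %>%
  group_by(ID) %>%
  summarize(count_distinct = n_distinct(activity)) %>%
  left_join(df, by = 'ID')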
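If the goal is just to add the count as a column, the join can be avoided. The sketch below is an alternative to the answer above, not a restatement of it: it uses mutate() with n_distinct() and, for comparison, base R's ave(). The column name N is taken from the question; res_dplyr and res_base are names chosen here for illustration.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID       = c(1001, 1001, 1001, 1002, 1002, 1002),
  activity = c(123, 123, 123, 456, 456, 789)
)

# dplyr: mutate() keeps every row and repeats the per-group count
res_dplyr <- df %>%
  group_by(ID) %>%
  mutate(N = n_distinct(activity)) %>%
  ungroup()

# Base R: ave() applies FUN within each ID and recycles the result over the group
res_base <- df
res_base$N <- ave(df$activity, df$ID, FUN = function(x) length(unique(x)))

res_base
```

Both versions keep the original column order (ID, activity, N), which matches the layout asked for in the question.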

Count the number of observations in the data frame in R [duplicate]

This question already has answers here:
Count number of unique levels of a variable
(7 answers)
Count number of distinct values in a vector
(6 answers)
Closed 2 months ago.
I want to know how to count the number of distinct observations using R.
For example, let's say I have a data df as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,5,5,5,9,9))
Even though the largest value of id is 9, there are only 5 distinct values: 1, 2, 3, 5, and 9. I want to count how many distinct values exist in id.
In base R:
length(unique(df$id))
[1] 5
Here, unique keeps only the distinct values, and length then counts the number of values in the resulting vector.
In dplyr:
df %>%
summarise(n = length(unique(id)))
Alternatively:
nrow(distinct(df))
Here, distinct reduces the whole data frame (not just the column id!) to its unique rows before nrow counts the remaining rows.
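The caveat about distinct() matters as soon as the data frame has more than one column. A minimal sketch, where the data frame df2 and its extra column grp are hypothetical and only added for illustration:

```r
library(dplyr)

# id as in the question, plus a hypothetical second column
df2 <- data.frame(
  id  = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 9, 9),
  grp = c("a", "b", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b", "a", "b")
)

n_rows_all <- nrow(distinct(df2))      # unique (id, grp) combinations
n_rows_id  <- nrow(distinct(df2, id))  # unique values of id only
n_ids      <- n_distinct(df2$id)       # same count without building a data frame
```

Here the first count is 9, while the other two are 5, so naming the column explicitly is the safer habit.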
Here are another two options:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3,5,5,5,9,9))
sum(!duplicated(df$id))
#> [1] 5
library(dplyr)
n_distinct(df$id)
#> [1] 5
Created on 2022-07-09 by the reprex package (v2.0.1)

R - find rows corresponding to maximum value of a column among multiple rows [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 3 years ago.
With data like the below, I have hourly values for each day for each area, loc pair. I need to find the rows for each day and area, loc pair for which the value of a is maximum.
day,area,loc,hour,a,b,c
20181231,ar01,loc01,00,99,11.3,18.2
20181231,ar01,loc01,22,96,12.3,15.2
20190101,ar01,loc01,00,98,10.9,22.5
20190101,ar01,loc01,23,97,10.9,22.1
20181231,ar02,loc01,00,93,11.3,18.2
20181231,ar02,loc01,22,96,12.3,15.2
20190101,ar02,loc01,00,97,10.9,22.5
20190101,ar02,loc01,23,97.2,10.9,22.1
Expected output:
day,area,loc,hour,a,b,c
20181231,ar01,loc01,00,99,11.3,18.2
20190101,ar01,loc01,00,98,10.9,22.5
20181231,ar02,loc01,22,96,12.3,15.2
20190101,ar02,loc01,23,97.2,10.9,22.1
I could do an aggregation using dplyr, like df %>% group_by(day, area, loc), but how do I get the result rows from there?
You can try:
library(dplyr)
df %>%
  group_by(day, area, loc) %>%
  # keep the rows whose a equals the group maximum (ties are all kept)
  filter(a == max(a))
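To make this reproducible, the sample rows can be read from text first. This is a sketch that assumes the columns are in the order day, area, loc, hour, a, b, c, which is what the values suggest. Note that filter(a == max(a)) keeps every tied row, so a group can return more than one row.

```r
library(dplyr)

# Sample rows from the question, with the assumed column order
df <- read.csv(text = "day,area,loc,hour,a,b,c
20181231,ar01,loc01,00,99,11.3,18.2
20181231,ar01,loc01,22,96,12.3,15.2
20190101,ar01,loc01,00,98,10.9,22.5
20190101,ar01,loc01,23,97,10.9,22.1
20181231,ar02,loc01,00,93,11.3,18.2
20181231,ar02,loc01,22,96,12.3,15.2
20190101,ar02,loc01,00,97,10.9,22.5
20190101,ar02,loc01,23,97.2,10.9,22.1")

# One group per day / area / loc; keep the rows where a is the group maximum
res <- df %>%
  group_by(day, area, loc) %>%
  filter(a == max(a)) %>%
  ungroup()

res
```

On this sample the result has 4 rows, one per group, in the original row order. read.csv() parses the hour column as integer, so "00" is shown as 0.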

Sorting Column in R [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 3 years ago.
I have data that includes a treatment group, which is indicated by a 1, and a control group, which is indicated by a 0. This is all contained under the variable treat_invite. How can I separate these and take the mean of pct_missing for the 1's and 0's? I've attached an image for clarification.
Assuming your data frame is called df:
df <- df %>% group_by(treat_invite) %>% mutate(MeanPCTMissing = mean(PCT_missing))
Or, if you just want the summary table (rather than the original table with an additional column):
df <- df %>% group_by(treat_invite) %>% summarise(MeanPCTMissing = mean(PCT_missing))
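Since the data is only shown as an image, here is a minimal base R sketch on made-up values. The data frame and its numbers are hypothetical; only the column names treat_invite and pct_missing come from the question (the answer above spells the second one PCT_missing, so adjust to whatever the real column is called).

```r
# Hypothetical data: 1 = treatment, 0 = control
df <- data.frame(
  treat_invite = c(1, 1, 0, 0, 1, 0),
  pct_missing  = c(0.10, 0.30, 0.20, 0.40, 0.20, 0.30)
)

# One mean per group, returned as a data frame (rows ordered by treat_invite)
means_by_group <- aggregate(pct_missing ~ treat_invite, data = df, FUN = mean)

# The same means as a vector named by group value
means_vec <- tapply(df$pct_missing, df$treat_invite, mean)

means_by_group
```

aggregate() is handy when the result should stay a data frame; tapply() is shorter when a named vector is enough.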

how to obtain summary of statistics for distinct values of a column in dataframe in R? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
Consider we have a data.frame named IND, in which we have a column called dept. There are in total 100 rows and there are 20 distinct values in dept.
Now I would like to obtain summary statistics for each of these 20 subsets of the data.frame (5 rows each), working from the main data.frame.
summary(IND) gives the summary statistics for whole dataset but what should I do in my case?
Something like this
mtcars %>% group_by(cyl) %>% summarise_each(funs(sum, mean))
can be used for your case as
IND %>% group_by(dept) %>% summarise_each(funs(sum, mean))
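Note that summarise_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later and using mtcars because IND is not available; the by() call is a base R addition for getting summary() output per group, and res and per_group are names chosen here.

```r
library(dplyr)

# Sum and mean of every non-grouping column, per value of cyl
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), list(sum = sum, mean = mean)))

# Base R: the usual summary() output, computed separately for each group
per_group <- by(mtcars, mtcars$cyl, summary)

res
```

Result columns are named like mpg_sum and mpg_mean. For the question's data, IND and dept would take the place of mtcars and cyl, provided the remaining columns are numeric.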
