When summarizing after grouping, I would like to count the occurrences of a specific level of another factor.
In the working example below, I would like to count the number of "male" values in each group. I've tried many things with count, tally and so on, but cannot find a straightforward and neat way to do it.
df <- data.frame(Group = replicate(20, sample(c("A", "B"), 1)),
                 Value = rnorm(20),
                 Factor = replicate(20, sample(c("male", "female"), 1)))
library(dplyr)

df %>%
  group_by(Group) %>%
  summarize(Value = mean(Value),
            n_male = ???)
Thanks for your help!
We can use sum on a logical vector, i.e. Factor == "male". The TRUE/FALSE values are coerced to 1/0, so the sum gives the frequency of "male" elements.
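A quick illustration of the coercion:
sum(c(TRUE, FALSE, TRUE, TRUE))
# [1] 3
Applied to the grouped summary: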
df %>%
  group_by(Group) %>%
  summarise(Value = mean(Value),
            n_male = sum(Factor == "male"))
I kindly request help with grouping by ID, counting the non-zeros, and presenting the results as a percentage of the total in that particular ID.
My data:
library(dplyr)

id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x <- c(0, 1, 0, 1, 0, 0, 0, 1, 0)
df1 <- data.frame(id, x)
head(df1)
In my results, after grouping by id = 1, I need a column with the total of ones, which is 2, and another column with the percentage, (2/5) = 40. For group id = 2, I need the total as 1 and the percentage as (1/4) = 25.
df1 %>%
  group_by(id) %>%
  summarise(sum_of_1 = sum(x != 0),
            pct = round((sum_of_1 / n()) * 100))
This one works. Thanks for the help, Hulk!
Try this:
df1 %>%
  group_by(id) %>%
  summarise(sum_of_1 = sum(x, na.rm = TRUE),
            pct = round((sum_of_1 / n()) * 100))
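Both answers give the same result here, because x contains only 0s and 1s; sum(x != 0) also copes with non-zero values other than 1. The expected output matches the question:
# A tibble: 2 x 3
#      id sum_of_1   pct
# 1     1        2    40
# 2     2        1    25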
I have a large time series dataset and would like to choose the top 10 observations from each date, based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10).
However, if the values for the 10th and 11th observations are equal, both are picked, so I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
You can arrange the data and select the first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
arrange(Date, desc(col_name)) %>%
group_by(Date) %>%
filter(row_number() <= 10)
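If your dplyr is 1.0.0 or newer, slice_max() solves the tie problem directly: with_ties = FALSE keeps at most n rows per group (col_name again stands in for your ranking column):
df %>%
  group_by(Date) %>%
  slice_max(col_name, n = 10, with_ties = FALSE) %>%
  ungroup()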
With data.table you can do
library(data.table)
setDT(df)
df[order(Date, -value)][, .SD[1:10], by = Date]
Change value to the name of the column that determines which observations are kept in case of ties. You can also do:
df[order(Date, -value)][, head(.SD, 10), by = Date]
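Note that .SD[1:10] pads any Date with fewer than 10 rows with NA rows, while head(.SD, 10) simply returns what is there. If you prefer sorting by reference, setorder() gives an equivalent result (value is still the placeholder ranking column):
setorder(df, Date, -value)       # sorts df in place, by reference
df[, head(.SD, 10), by = Date]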
We can use base R
df1 <- df[with(df, order(Date, -value)), ]
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along) <= 10), ]
I have a tibble with many observations and variables.
What I want to do is simply calculate (grouping by variable1 and variable2) the mean of variableXXX, and the total number of missing values for each group.
This is what I have written so far:
data %>%
  group_by(variable1, variable2) %>%
  summarise(mean(variableXXX))
How can I calculate the number of missing values for each group? I am new to R, so the easiest solution is best.
We can get the sum of the logical vector created with is.na:
library(dplyr)

data %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),
            MissingCount = sum(is.na(variableXXX)))
NOTE: Assuming that we are interested in the count of NAs in the 'variableXXX' column grouped by 'variable1' and 'variable2'
If we need the NA count across the whole subset of the dataset:
library(purrr)

data %>%
  group_split(variable1, variable2) %>%
  map_dfr(~ .x %>%
            summarise(Mean = mean(variableXXX, na.rm = TRUE),
                      MissingCount = sum(is.na(.))))
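With dplyr 1.1.0 or later, the same whole-subset count can be written without purrr by using pick() to grab every non-grouping column; a sketch reusing the question's placeholder names, which also keeps the grouping keys in the output:
data %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),
            MissingCount = sum(is.na(pick(everything()))),
            .groups = "drop")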
The title is self-explanatory: looking to calculate percent NA by ID group in R. There are lots of posts on calculating NA by variable (column), but almost nothing on doing it by row groups.
If there are multiple columns, after grouping by 'ID', use summarise_at to loop over the columns, create a logical vector with is.na, get the mean, and multiply by 100
library(dplyr)

df1 %>%
  group_by(ID) %>%
  summarise_at(vars(-group_cols()), ~ 100 * mean(is.na(.)))
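In dplyr 1.0.0+ the superseded summarise_at() can be swapped for across():
df1 %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ 100 * mean(is.na(.x))))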
If we want to get the percentage across all the other variables taken together:
library(tidyr)
df1 %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  summarise(Perc = 100 * mean(is.na(value)))
Or with aggregate from base R
aggregate(. ~ ID, df1, FUN = function(x) 100 * mean(is.na(x)), na.action = na.pass)
Or unlist the other columns, build a table of the is.na() logical vector against the (appropriately repeated) 'ID' column, and use prop.table to get the percentage:
prop.table(table(ID = rep(df1$ID, ncol(df1) - 1),
                 value = is.na(unlist(df1[setdiff(names(df1), "ID")]))),
           margin = 1) * 100
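Since the question includes no sample data, here is a hypothetical df1 (columns a and b are made up) to sanity-check the answers above. The per-column answers (summarise_at, aggregate) should give a = 50, b = 50 for ID 1 and a = 50, b = 0 for ID 2, while the across-column answers should give 50 for ID 1 and 25 for ID 2:
df1 <- data.frame(ID = c(1, 1, 2, 2),
                  a = c(NA, 2, 3, NA),
                  b = c(1, NA, 3, 4))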
I have a data set of prices in which the first column is year and the rest of the columns are regions. I'm trying to count the number of negative values by year by region.
I've tried to use dplyr to group_by(Year) and then summarise_at(), but I can't figure out the exact code to use.
Neg_Count <- select(BaseCase, Year, Hub1:Hub15)

Neg_Count <- Neg_Count %>%
  group_by(Year) %>%
What is the best way to do this?
We can use summarise_at: create a logical vector and get the sum.
library(dplyr)

Neg_Count %>%
  group_by(Year) %>%
  summarise_at(vars(starts_with("Hub")), ~ sum(. < 0))
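The across() equivalent for dplyr 1.0.0+:
Neg_Count %>%
  group_by(Year) %>%
  summarise(across(starts_with("Hub"), ~ sum(.x < 0)))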