Grouping characters in R using as.factor [duplicate] - r

I'm trying to find the maximum number of delayed flights by origin using the nycflights13 package, but I can't figure out how to group by a character ("chr") column.
library(nycflights13)
library(dplyr)
# Convert the character column origin to a factor
flights2 <- mutate(flights, factori = as.factor(origin))
flights2 %>%
  filter(dep_delay > 2) %>%
  select(dep_delay, factori) %>%
  group_by(factori)
How can I get them grouped together? How can I find the max count?

group_by doesn't change the structure of the data: the number of rows and columns stays the same after group_by. It is what you do after group_by that determines the output.
To get the maximum dep_delay for each factori, you can do:
library(nycflights13)
library(dplyr)
flights2 %>%
  filter(dep_delay > 2) %>%
  select(dep_delay, factori) %>%
  group_by(factori) %>%
  summarise(max = max(dep_delay, na.rm = TRUE))
# factori max
#* <fct> <dbl>
#1 EWR 1126
#2 JFK 1301
#3 LGA 911
summarise usually returns one row per group, whereas mutate keeps the same number of rows as the original data.
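As a rough illustration of that difference, and of counting delayed flights per group, here is a minimal sketch. The toy data frame and the names per_origin, with_max and busiest are made up for the example and stand in for the real flights table. It also groups directly on the character column, which group_by accepts without a conversion to factor.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights table
toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "JFK", "LGA"),
  dep_delay = c(5, 30, 10, 3, 60, 7)
)

# summarise(): one row per group
per_origin <- toy %>%
  group_by(origin) %>%
  summarise(n_delayed = n(), max_delay = max(dep_delay))

# mutate(): same number of rows as the input, group value repeated on each row
with_max <- toy %>%
  group_by(origin) %>%
  mutate(max_delay = max(dep_delay)) %>%
  ungroup()

# Origin with the highest count of delayed flights
busiest <- per_origin %>%
  slice_max(n_delayed, n = 1)
```

Here per_origin has one row per origin, while with_max keeps all six input rows.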


how to determine the number of unique values based on multiple criteria dplyr

I've got a df that looks like:
df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).
I would like to calculate the number of distinct years in which each species appears at each site, stored in a new column called nYear. I've tried filtering, grouping, and using mutate combined with n_distinct, but it is not quite working.
Here is part of the code I have been using:
Df1 <- Df %>%
filter(Year>1985)%>%
mutate(nYear = n_distinct(Year[Year %in% site]))%>%
group_by(Species,Site, Year) %>%
arrange(Species, .by_group=TRUE)
ungroup()
The approach is good, a few things to correct.
First, let's make some reproducible data (your code gave errors).
df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)
Use summarise rather than mutate when you want one summary value per group. It returns a shortened tibble containing only the grouping columns and the summary figures (fewer columns and rows).
mutate, on the other hand, modifies an existing tibble, keeping all rows and columns by default.
The order of the functions in your chain also needs to change.
df %>%
  filter(Year > 1985) %>%
  group_by(species, site) %>%
  summarise(nYear = length(unique(Year))) %>% # instead of mutate
  arrange(species, .by_group = TRUE) %>%
  ungroup()
First, group_by(species,site), not year, then summarise and arrange.
# A tibble: 5 × 3
species site nYear
<int> <chr> <int>
1 1 A 5
2 2 B 5
3 3 C 5
4 4 D 5
5 5 E 5
You can use distinct() on the filtered frame to keep one row per Site, Species and Year, and then count by your groups of interest:
Df %>%
  filter(Year > 1985) %>%
  distinct(Site, Species, Year) %>%
  count(Site, Species, name = "nYear")
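To see that the two approaches agree when a year is recorded more than once, here is a small sketch. The data frame obs, its count column, and the result names are invented for illustration. It uses n_distinct(Year), which is equivalent to length(unique(Year)) here.

```r
library(dplyr)

# Hypothetical data in which (A, 1, 1990) is recorded twice
obs <- data.frame(
  site    = c("A", "A", "A", "B", "B"),
  species = c(1, 1, 1, 1, 1),
  Year    = c(1990, 1990, 1991, 1990, 1992),
  count   = c(3, 4, 2, 5, 6)
)

# Approach 1: group, then count distinct years per group
by_summarise <- obs %>%
  group_by(site, species) %>%
  summarise(nYear = n_distinct(Year), .groups = "drop")

# Approach 2: keep one row per site/species/year, then count rows
by_count <- obs %>%
  distinct(site, species, Year) %>%
  count(site, species, name = "nYear")
```

Both results give nYear = 2 for each site, because the repeated 1990 row for site A is counted once.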

R programming: built-in Titanic data set [duplicate]

I am new to R programming and am working with the built-in Titanic dataset. I want to find out how many children and adults there are in the dataset. Can someone give me a hint?
I tried using the length() function, but it did not give the result.
Here's a solution in tidyverse syntax. Titanic is a contingency table rather than one row per passenger, so converting it to a tibble (a type of data frame) gives one row per combination of Class, Sex, Age and Survived, with the number of passengers in the column n. Group by Age and sum n to get the number of children and adults. Counting rows with n() would only count the combinations, 16 per age group.
library(tidyverse)
Titanic %>%
  as_tibble() %>%
  group_by(Age) %>%
  summarise(N = sum(n))
This gives the output:
# A tibble: 2 x 2
  Age       N
  <chr> <dbl>
1 Adult  2092
2 Child   109
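Because Titanic is stored as a 4-way table of passenger counts, not as one row per passenger, the number of children and adults can also be cross-checked in base R by summing over the Age dimension. This is only a sketch; the names age_totals and age_margin are chosen for the example.

```r
# Titanic ships with base R (datasets package) as a 4-way table of counts
# Sum the counts over every dimension except Age
age_totals <- apply(Titanic, "Age", sum)
print(age_totals)

# The same totals via margin.table(); margin 3 is the Age dimension
age_margin <- margin.table(Titanic, 3)
```

Both give 109 children and 2092 adults, 2201 passengers in total.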

R: how can I calculate the percentages a variable takes on a certain value by group?

I'm trying to get R to report the share of rows in which a variable takes a specific value within each group.
For example, consider a dataset with groups 1, 2 and 3. I would like to know the percentage of rows in which Var1 takes the value 500 in each group, and to add this as a new variable.
Is there a convenient way to do this?
So it should look something like this:
df
Group Var1 Var1_perc
1 0 50
1 400 50
1 500 50
1 500 50
and so on for the other groups
I would use the tidyverse for this. First, count how often the variable takes each value within a group:
library(tidyverse)
df %>%
  group_by(Group, Var1) %>%
  summarise(count = n())
To calculate the percentage of each value within a group:
df %>%
  left_join(df %>%
              group_by(Group) %>%
              summarise(n = n()), by = "Group") %>%
  group_by(Group, Var1) %>%
  summarise(percentage = 100 * n() / first(n))
The left_join is only there to attach each group's total row count to every row. I couldn't think of a better way right now.
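A possibly simpler alternative, sketched under the assumption that the goal is the layout shown in the question (a new Var1_perc column repeated on every row of the group): the mean of a logical vector is the share of TRUE values, so a grouped mutate avoids the join. The sample data below is hypothetical.

```r
library(dplyr)

# Hypothetical data matching the layout in the question
df <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2, 2),
  Var1  = c(0, 400, 500, 500, 500, 500, 500, 100)
)

# Share of rows per group where Var1 equals 500, as a percentage
df_perc <- df %>%
  group_by(Group) %>%
  mutate(Var1_perc = 100 * mean(Var1 == 500)) %>%
  ungroup()
```

Group 1 gets 50 on every row (2 of 4 rows equal 500) and group 2 gets 75 (3 of 4 rows).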

data mining: subset based on maximum criteria of several observations [duplicate]

Consider the example data
Zip_Code <- c(1,1,1,2,2,2,3,3,3,3,4,4)
Political_pref <- c('A','A','B','A','B','B','A','A','B','B','A','A')
income <- c(60,120,100,90,80,60,100,90,200,200,90,110)
df1 <- data.frame(Zip_Code, Political_pref, income)
I want to group by each Zip_Code and obtain the maximum income for each Political_pref level.
The desired output is a data frame of 3 variables with at most 2 observations for each Zip_Code (one for A and one for B), each holding the greatest income.
I am playing with dplyr, but I'm happy with a solution using any package (possibly data.table).
library(dplyr)
df2 <- df1 %>%
group_by(Zip_Code) %>%
filter(....)
We can group by both columns and use slice with which.max, which keeps the first row with the largest income in each group:
library(dplyr)
df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice(which.max(income))
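As an alternative sketch, dplyr 1.0.0 and later provide slice_max, which expresses the same idea and makes the handling of ties explicit. The name top_income is made up for the example. with_ties = FALSE keeps a single row per group, matching what which.max does.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  Zip_Code = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Political_pref = c("A", "A", "B", "A", "B", "B", "A", "A", "B", "B", "A", "A"),
  income = c(60, 120, 100, 90, 80, 60, 100, 90, 200, 200, 90, 110)
)

# One row per (Zip_Code, Political_pref) group: the row with the highest income
top_income <- df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice_max(income, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With this data the result has 7 rows rather than 8, because Zip_Code 4 has no rows with Political_pref B.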

What is the right way to reference part of a dataframe after piping? [duplicate]

What is the correct way to do something like this? I am trying to get the colSums of each group for specific columns. The . syntax seems incorrect with this type of subsetting.
csv<-data.frame(id_num=c(1,1,1,2,2),c(1,2,3,4,5),c(1,2,3,3,3))
temp<-csv%>%group_by(id_num)%>%colSums(.[,2:3],na.rm=T)
This can be done with summarise_each; more recent dplyr versions added summarise_at and summarise_if for convenience.
csv %>%
  group_by(id_num) %>%
  summarise_each(funs(sum))
csv %>%
  group_by(id_num) %>%
  summarise_at(2:3, sum)
To select by column name, pass a character vector of names to summarise_at, or wrap bare column names in vars.
csv %>%
  group_by(id_num) %>%
  summarise_at(names(csv)[-1], sum)
NOTE: In the OP's dataset, the column names for the 2nd and 3rd columns were not specified resulting in something like c.1..2..3..4..5.
Using vars to apply the function to a selected column name:
csv %>%
  group_by(id_num) %>%
  summarise_at(vars(c.1..2..3..4..5.), sum)
# # A tibble: 2 × 2
# id_num c.1..2..3..4..5.
# <dbl> <dbl>
#1 1 6
#2 2 9
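In dplyr 1.0.0 and later, across() supersedes the scoped verbs above, and summarise_each and funs are deprecated. A minimal sketch of the same grouped sums follows. The column names a and b are hypothetical, added here so the columns can be referred to directly.

```r
library(dplyr)

# Same values as in the question, but with explicit (hypothetical) column names
csv <- data.frame(
  id_num = c(1, 1, 1, 2, 2),
  a      = c(1, 2, 3, 4, 5),
  b      = c(1, 2, 3, 3, 3)
)

# Sum the selected columns within each id_num group, ignoring missing values
sums <- csv %>%
  group_by(id_num) %>%
  summarise(across(c(a, b), ~ sum(.x, na.rm = TRUE)))
```

Naming the columns in data.frame() also avoids the auto-generated names such as c.1..2..3..4..5.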
