I would like to summarize a table using dplyr.
Here is how I would like to proceed:
I have a data.frame like this:
year region week site species gps_clutch
2017 sud 18 6 au 337
2017 sud 20 10 au 352
2017 sud 22 10 au 352
2017 sud 24 10 au 352
2017 sud 18 6 aio 337
2017 sud 20 6 aio 352
2017 sud 22 6 au 352
2018 sud 20 6 au 337
2018 sud 20 10 au 352
2018 sud 22 10 au 352
2018 sud 22 10 aio 352
2018 sud 22 6 au 352
2017 nor 19 5 au 337
2017 nor 21 2 au 352
2017 nor 23 5 au 352
2017 nor 25 2 au 352
2017 nor 19 5 aio 337
2017 nor 25 5 aio 352
2017 nor 19 5 au 337
2018 nor 21 2 aio 352
2018 nor 23 5 aio 352
2018 nor 25 2 au 352
2018 nor 23 5 aio 337
2018 nor 23 5 au 352
I would like to count the number of "gps_clutch" records for each year, region, site, species and week, and expand this to all the weeks that were sampled in each region. Let me explain: in the region "sud" I sampled weeks 18, 20, 22, 24 and in the region "nor" weeks 19, 21, 23, 25. I would like to convert implicit missing values to "0", but only for the weeks (nested in regions) that were actually sampled. I do not want to expand in a way that would give me a row for week 19 in region "sud", because this region was not sampled that specific week.
This code works well to expand the grid as I would like:
dat %>%
  group_by(region) %>%
  expand(year, site, species, week)
The following code also works to get the count values, but it does not expand the grid as I wish (I only get the weeks for which I observed something in each year, not all the weeks sampled across both years). This means that if in "sud" in 2017 I only have records for weeks 20 and 22, the grid does not get expanded to weeks 18 and 24:
field_subsetnord %>%
  group_by(year, region, site, species, week) %>%
  summarise(count_clutch = length(gps_clutch)) %>%
  complete(week, nesting(year, site, species), fill = list(count_clutch = 0))
This is the table I would like to get at the end:
year region week site species count
2017 sud 18 6 au 1
2017 sud 20 6 au 0
2017 sud 22 6 au 1
2017 sud 24 6 au 0
2017 sud 18 6 aio 1
2017 sud 20 6 aio 1
2017 sud 22 6 aio 0
2017 sud 24 6 aio 0
2017 sud 18 10 au 0
2017 sud 20 10 au 1
2017 sud 22 10 au 1
2017 sud 24 10 au 1
2017 sud 18 10 aio 0
2017 sud 20 10 aio 0
2017 sud 22 10 aio 0
2017 sud 24 10 aio 0
2018 sud 18 6 au 0
2018 sud 20 6 au 1
2018 sud 22 6 au 1
2018 sud 24 6 au 0
2018 sud 18 6 aio 0
2018 sud 20 6 aio 0
2018 sud 22 6 aio 0
2018 sud 24 6 aio 0
2018 sud 18 10 au 0
2018 sud 20 10 au 1
2018 sud 22 10 au 1
2018 sud 24 10 au 0
2018 sud 18 10 aio 0
2018 sud 20 10 aio 0
2018 sud 22 10 aio 1
2018 sud 24 10 aio 0
and so on for region "nor"...
any suggestions to mix these two codes would be appreciated :)
You are so close with your two approaches. Essentially they just need to be combined to get what you're after. :)
Group by region and complete() the dataset first, then regroup by all the variables and summarise(). Since gps_clutch will now contain missing values, you can count the non-missing values (via sum(!is.na(...))) in the summarise() statement to count the clutches.
dat %>%
  group_by(region) %>%
  complete(year, site, species, week) %>%
  group_by(year, region, site, species, week) %>%
  summarise(count_clutch = sum(!is.na(gps_clutch)))
# A tibble: 64 x 6
# Groups: year, region, site, species [16]
year region site species week count_clutch
<int> <fct> <int> <fct> <int> <int>
1 2017 nor 2 aio 19 0
2 2017 nor 2 aio 21 0
3 2017 nor 2 aio 23 0
4 2017 nor 2 aio 25 0
5 2017 nor 2 au 19 0
6 2017 nor 2 au 21 1
7 2017 nor 2 au 23 0
8 2017 nor 2 au 25 1
9 2017 nor 5 aio 19 1
10 2017 nor 5 aio 21 0
# ... with 54 more rows
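For what it's worth, you can also reach the same table in the order you originally tried (count first, then complete() within region). A rough sketch along those lines, assuming dplyr and tidyr are loaded and the data frame is called dat as above:

dat %>%
  count(year, region, site, species, week) %>%   # observed combinations, counts in column n
  rename(count_clutch = n) %>%
  group_by(region) %>%
  complete(year, site, species, week, fill = list(count_clutch = 0)) %>%
  ungroup()

Because complete() respects the grouping by region, only the weeks actually sampled in each region are filled in with 0.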
Suppose I have a dataset with people born in different years:
ID year birth_year outcome
1 10021 2015 1960 1
2 10021 2016 1960 1
3 10021 2017 1960 1
4 10021 2018 1960 0
5 10021 2019 1960 0
6 10022 2015 1968 1
7 10022 2016 1968 0
8 10022 2017 1968 0
9 10022 2018 1968 0
10 10022 2019 1968 0
11 10023 2015 1968 1
12 10023 2016 1968 1
13 10023 2017 1968 1
14 10023 2018 1968 1
15 10023 2019 1968 1
16 10024 2015 1961 0
17 10024 2016 1961 0
18 10024 2017 1961 0
19 10024 2018 1961 1
20 10024 2019 1961 1
I want to split this dataset into smaller datasets according to birth year, and store them as year1960, year1961 and year1968. Specifically,
> year1960
ID year birth_year outcome
1 10021 2015 1960 1
2 10021 2016 1960 1
3 10021 2017 1960 1
4 10021 2018 1960 0
5 10021 2019 1960 0
> year1961
   ID year birth_year outcome
1 10024 2015 1961 0
2 10024 2016 1961 0
3 10024 2017 1961 0
4 10024 2018 1961 1
5 10024 2019 1961 1
> year1968
   ID year birth_year outcome
1 10022 2015 1968 1
2 10022 2016 1968 0
3 10022 2017 1968 0
4 10022 2018 1968 0
5 10022 2019 1968 0
6 10023 2015 1968 1
7 10023 2016 1968 1
8 10023 2017 1968 1
9 10023 2018 1968 1
10 10023 2019 1968 1
How do I do this with the fewest steps possible?
There are probably shorter/better ways to do this, but this will work and you'll end up with individual data frames for each birth year.
# read data
df <-read.csv('data.csv')
# split data by 'birth_year' into list of data frames
df_split <- split(df, with(df, birth_year))
# rename elements of list
names(df_split) <- paste0('year', names(df_split))
# create individual dataframes from list
list2env(df_split, envir = .GlobalEnv)
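If you prefer to stay within dplyr, a roughly equivalent sketch (untested; assumes dplyr >= 0.8 for group_split()) would be:

library(dplyr)

# one tibble per birth year, returned in sorted key order
df_split <- df %>%
  group_split(birth_year) %>%
  setNames(paste0("year", sort(unique(df$birth_year))))

list2env(df_split, envir = .GlobalEnv)

Either way, year1960, year1961 and year1968 then exist as separate objects in the global environment.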
I am creating a shiny app that tracks various stats of 6 teams in a competition over 6 years. The df is as follows:
Year Pos Team P W L D GF GA GD G. BP Pts
1 2017 1 Southern Steel 15 15 0 0 1062 812 250 130.8 0 30
2 2017 2 Central Pulse 15 9 6 0 783 756 27 103.6 2 20
3 2017 3 Northern Mystics 15 8 7 0 878 851 27 111.3 3 19
4 2017 4 Waikato Bay of Plenty Magic 15 7 8 0 873 848 25 103.0 5 19
5 2017 5 Northern Stars 15 4 11 0 738 868 -130 85.0 1 9
6 2017 6 Mainland Tactix 15 2 13 0 676 875 -199 77.3 2 6
7 2018 1 Central Pulse 15 12 3 0 850 679 171 125.2 3 27
8 2018 2 Southern Steel 15 10 5 0 874 866 8 100.9 2 22
9 2018 3 Mainland Tactix 15 7 8 0 746 761 -15 98.0 5 19
10 2018 4 Northern Mystics 15 7 8 0 783 796 -13 98.4 3 17
11 2018 5 Waikato Bay of Plenty Magic 15 5 10 0 804 878 -74 91.6 3 13
12 2018 6 Northern Stars 15 4 11 0 832 909 -77 91.5 5 13
13 2019 1 Central Pulse 15 13 2 0 856 676 180 126.6 0 39
14 2019 2 Southern Steel 15 12 3 0 946 809 137 116.9 2 38
15 2019 3 Northern Stars 15 6 9 0 785 840 -55 93.5 3 21
16 2019 4 Waikato Bay of Plenty Magic 15 5 10 0 713 793 -80 89.9 0 15
17 2019 5 Mainland Tactix 15 5 10 0 740 849 -109 87.2 0 15
18 2019 6 Northern Mystics 15 4 11 0 786 859 -73 91.5 2 14
19 2020 1 Central Pulse 15 11 2 2 594 474 120 125.3 1 49
20 2020 2 Mainland Tactix 15 9 4 2 606 566 40 107.1 2 42
21 2020 3 Northern Mystics 15 7 6 2 582 475 7 101.2 3 35
22 2020 4 Northern Stars 15 5 7 3 590 626 -36 94.2 3 29
23 2020 5 Southern Steel 15 4 10 1 578 637 -59 90.7 3 21
24 2020 6 Waikato Bay of Plenty Magic 15 2 9 4 520 592 -72 87.8 3 19
25 2021 1 Northern Mystics 15 11 4 0 924 878 46 105.2 4 37
26 2021 2 Southern Steel 15 11 4 0 813 801 12 101.5 2 35
27 2021 3 Mainland Tactix 15 9 6 0 801 775 26 103.4 4 31
28 2021 4 Northern Stars 15 9 6 0 825 791 34 104.3 2 29
29 2021 5 Central Pulse 15 4 11 0 789 810 -21 97.4 8 20
30 2021 6 Waikato Bay of Plenty Magic 15 1 15 0 807 904 -97 89.3 6 9
31 2022 1 Central Pulse 15 10 5 0 828 732 96 113.1 4 34
32 2022 2 Northern Stars 15 11 4 0 836 783 53 106.8 1 34
33 2022 3 Northern Mystics 15 9 6 0 858 807 51 106.3 4 31
34 2022 4 Southern Steel 15 6 9 0 853 898 -45 95.0 2 20
35 2022 5 Waikato Bay of Plenty Magic 15 4 11 0 733 803 -70 91.3 4 16
36 2022 6 Mainland Tactix 15 5 0 0 788 873 -85 90.3 1 16
I need 3 graphs:
A stacked bar chart showing wins/draws/losses for each team across the 6 years.
A line chart showing the position of each team at the end of each of the 6 years.
A bubble chart showing total goals for / goals against for each team across all 6 years, with total wins dictating the size of the bubbles.
I also need to be able to filter the data for these graphs with a checkbox for choosing teams and a slider to select the year range.
I have got a stacked bar chart, but it cannot be filtered: I can't figure out how to group the original df by team AND have it connected to the reactive filter I have. Currently the graph is built from a melted df, which is no good, as I need the reactive filtered one defined in the server function. The graph is also a bit ugly: how can I flip the chart so that wins are on the bottom and draws are on top?
The second chart is all good.
For the third chart, I again need to group the data so that I have total stats across the 6 years; currently there are 36 bubbles, but I only want 6.
Screenshots of shiny app output: https://imgur.com/a/qzqlUob
Code:
library(ggplot2)
library(shiny)
library(dplyr)
library(reshape2)
library(scales)
df <- read.csv("ANZ_Premiership_2017_2022.csv")
teams <- c("Central Pulse", "Northern Stars", "Northern Mystics",
"Southern Steel", "Waikato Bay of Plenty Magic", "Mainland Tactix")
mdf <- melt(df %>%
              group_by(Team) %>%
              summarise(Wins = sum(W),
                        Losses = sum(L),
                        Draws = sum(D)),
            id.vars = "Team")

ui <- fluidPage(
  titlePanel("ANZ Premiership Analysis"),
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput("teams",
                         "Choose teams",
                         choices = teams,
                         selected = teams),
      sliderInput("years",
                  "Choose years",
                  sep = "",
                  min = 2017, max = 2022, value = c(2017, 2022))
    ),
    mainPanel(
      h2("Chart Tabs"),
      tabsetPanel(
        tabPanel("Wins/ Losses/ Draws", plotOutput("winLoss")),
        tabPanel("Standings", plotOutput("standings")),
        tabPanel("Goals", plotOutput("goals"))
      )
    )
  )
)

server <- function(input, output){
  filterTeams <- reactive({
    df.selection <- filter(df, Team %in% input$teams,
                           Year %in% (input$years[1]:input$years[2]))
  })
  output$winLoss <- renderPlot({
    ggplot(mdf, mapping = aes(Team, value, fill = variable)) +
      geom_bar(stat = "identity", position = "stack") +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
      ylab("Wins") +
      xlab("Team")
  })
  output$standings <- renderPlot({
    filterTeams() %>%
      ggplot(aes(x = Year, y = Pos, group = Team, color = Team)) +
      geom_line(size = 1.25) +
      geom_point(size = 2.5) +
      ggtitle("Premiership Positions") +
      ylab("Position")
  })
  output$goals <- renderPlot({
    filterTeams() %>%
      ggplot(aes(GF, GA, size = W, color = Team)) +
      geom_point(alpha = 0.7) +
      scale_size(range = c(5, 15), name = "Wins") +
      xlab("Goals for") +
      ylab("Goals against")
  })
}
shinyApp(ui = ui, server = server)
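One way to handle the first and third charts is to do the per-team grouping inside each renderPlot() call, on the data returned by filterTeams(), rather than on the pre-melted mdf. The sketch below is untested and assumes the reactive and libraries defined above; the explicit factor levels control the stacking order, since ggplot2 places the first level at the top of the stack by default (so listing Draws first and Wins last puts wins at the bottom).

# sketch: replacement renderPlot() bodies for the server above
output$winLoss <- renderPlot({
  wld <- filterTeams() %>%
    group_by(Team) %>%
    summarise(Wins = sum(W), Losses = sum(L), Draws = sum(D)) %>%
    melt(id.vars = "Team") %>%
    mutate(variable = factor(variable, levels = c("Draws", "Losses", "Wins")))

  ggplot(wld, aes(Team, value, fill = variable)) +
    geom_col() +   # same as geom_bar(stat = "identity"); try position_stack(reverse = TRUE) if the order still looks wrong
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    xlab("Team") +
    ylab("Games")
})

output$goals <- renderPlot({
  filterTeams() %>%
    group_by(Team) %>%
    summarise(GF = sum(GF), GA = sum(GA), W = sum(W)) %>%   # one bubble per team over the selected years
    ggplot(aes(GF, GA, size = W, color = Team)) +
    geom_point(alpha = 0.7) +
    scale_size(range = c(5, 15), name = "Wins") +
    xlab("Goals for") +
    ylab("Goals against")
})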
I have grouped data that I want to convert to ungrouped data.
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
I would like to have a dataset like the one below created from the previous one.
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
Try
mydata[rep(1:nrow(mydata),mydata$n),]
year Age n
1 2014 22 1
2 2014 23 1
3 2014 24 3
3.1 2014 24 3
3.2 2014 24 3
4 2014 25 2
4.1 2014 25 2
6 2015 23 2
6.1 2015 23 2
7 2015 24 3
7.1 2015 24 3
7.2 2015 24 3
8 2015 25 1
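If you also want to drop the n column and reset the row names, as in the desired output, a small extension of the same idea is:

out <- mydata[rep(1:nrow(mydata), mydata$n), c("year", "Age")]
rownames(out) <- NULL
out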
Here's a tidyverse solution:
library(tidyverse)
mydata %>%
uncount(n)
which gives:
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
You can also use tidyr syntax for this:
library(tidyr)
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
uncount(mydata, n)
#> year Age
#> 1 2014 22
#> 2 2014 23
#> 3 2014 24
#> 4 2014 24
#> 5 2014 24
#> 6 2014 25
#> 7 2014 25
#> 8 2015 23
#> 9 2015 23
#> 10 2015 24
#> 11 2015 24
#> 12 2015 24
#> 13 2015 25
But of course you shouldn't use tidyr just because it is tidyr :) (see, for an opposing take, "An alternate view of the Tidyverse 'dialect' of the R language, and its promotion by RStudio").
We can use tidyr::complete
library(tidyr)
library(dplyr)
mydata %>%
  group_by(year, Age) %>%
  complete(n = seq_len(n)) %>%
  select(-n) %>%
  ungroup()
# A tibble: 14 × 2
year Age
<dbl> <dbl>
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
14 2015 22
I need to aggregate the previous 5 years of the N_C variable in each row.
For example, for year 2017: Sum_Five_Year = 10 (2017) + 21 (2015) + 14 (2014) + 16 (2013) = 61.
Data:
library(dplyr)
DF <- data.frame(
  company = rep("DEL MAR PHARM", 16),
  year = c("2017","2015","2015","2015","2013","2012","2012","2012","2010","2010","2015","2014","2014","2013","2013","2012"),
  N_C = c("0","7","5","4","3","24","52","99","43","37","5","7","7","4","9","20"),
  Sum_Year = c("0","21","21","21","16","195","195","195","80","80","21","14","14","16","16","195")
)
DF <- DF %>% arrange(year)
company year N_C Sum_Year
1 DEL MAR PHARM 2010 43 80
2 DEL MAR PHARM 2010 37 80
3 DEL MAR PHARM 2012 24 195
4 DEL MAR PHARM 2012 52 195
5 DEL MAR PHARM 2012 99 195
6 DEL MAR PHARM 2012 20 195
7 DEL MAR PHARM 2013 3 16
8 DEL MAR PHARM 2013 4 16
9 DEL MAR PHARM 2013 9 16
10 DEL MAR PHARM 2014 7 14
11 DEL MAR PHARM 2014 7 14
12 DEL MAR PHARM 2015 7 21
13 DEL MAR PHARM 2015 5 21
14 DEL MAR PHARM 2015 4 21
15 DEL MAR PHARM 2015 5 21
16 DEL MAR PHARM 2017 10 10
Expected Outcome
DF$Sum_Five_Year <- cbind(c("80","80","275","275","275","275","291","291","291","305","305","246","246","246","246","61"))
> DF
company year N_C Sum_Year Sum_Five_Year
1 DEL MAR PHARM 2010 43 80 80
2 DEL MAR PHARM 2010 37 80 80
3 DEL MAR PHARM 2012 24 195 275
4 DEL MAR PHARM 2012 52 195 275
5 DEL MAR PHARM 2012 99 195 275
6 DEL MAR PHARM 2012 20 195 275
7 DEL MAR PHARM 2013 3 16 291
8 DEL MAR PHARM 2013 4 16 291
9 DEL MAR PHARM 2013 9 16 291
10 DEL MAR PHARM 2014 7 14 305
11 DEL MAR PHARM 2014 7 14 305
12 DEL MAR PHARM 2015 7 21 246
13 DEL MAR PHARM 2015 5 21 246
14 DEL MAR PHARM 2015 4 21 246
15 DEL MAR PHARM 2015 5 21 246
16 DEL MAR PHARM 2017 10 10 61
I have tried the following code but it does not work:
library(data.table)
setDT(DF)
DF[, `:=` (Sum_Five_Year= sum(N_C)), by= list(company,cut(year, breaks = c(5), right = F))]
Any suggestion would be very appreciated :)
With no additional packages beyond the dplyr you already load, you could use sapply().
The code below assumes that Sum_Year has already been created. You could apply the following directly to your example:
distinct(DF, company, year, Sum_Year) %>%
  group_by(company) %>%
  mutate(
    # the example data stores year and Sum_Year as character, so convert them first
    year = as.integer(as.character(year)),
    Sum_Year = as.numeric(as.character(Sum_Year)),
    Sum_Five_Year = sapply(year, function(x) sum(Sum_Year[between(year, x - 5 + 1, x)]))
  ) %>%
  left_join(DF %>% select(-Sum_Year) %>% mutate(year = as.integer(as.character(year))),
            by = c("company", "year"))
Output:
# A tibble: 16 x 5
# Groups: company [?]
company year Sum_Year Sum_Five_Year N_C
<chr> <int> <int> <int> <int>
1 DELMARPHARM 2010 80 80 43
2 DELMARPHARM 2010 80 80 37
3 DELMARPHARM 2012 195 275 24
4 DELMARPHARM 2012 195 275 52
5 DELMARPHARM 2012 195 275 99
6 DELMARPHARM 2012 195 275 20
7 DELMARPHARM 2013 16 291 3
8 DELMARPHARM 2013 16 291 4
9 DELMARPHARM 2013 16 291 9
10 DELMARPHARM 2014 14 305 7
11 DELMARPHARM 2014 14 305 7
12 DELMARPHARM 2015 21 246 7
13 DELMARPHARM 2015 21 246 5
14 DELMARPHARM 2015 21 246 4
15 DELMARPHARM 2015 21 246 5
16 DELMARPHARM 2017 10 61 10
Otherwise you can do:
DF %>%
  group_by(company, year) %>%
  mutate(N_C = as.numeric(as.character(N_C))) %>%
  summarise(Sum_Year = sum(N_C)) %>%
  mutate(
    year = as.integer(as.character(year)),
    Sum_Five_Year = sapply(year, function(x) sum(Sum_Year[between(year, x - 5 + 1, x)]))
  ) %>%
  left_join(DF %>% select(-Sum_Year) %>% mutate(year = as.integer(as.character(year))),
            by = c("company", "year"))
If you'd like to get rid of the duplicated rows, just leave out the join at the end:
DF %>%
  group_by(company, year) %>%
  mutate(N_C = as.numeric(as.character(N_C))) %>%
  summarise(Sum_Year = sum(N_C)) %>%
  mutate(
    year = as.integer(as.character(year)),
    Sum_Five_Year = sapply(year, function(x) sum(Sum_Year[between(year, x - 5 + 1, x)]))
  )
Output:
# A tibble: 6 x 4
# Groups: company [1]
company year Sum_Year Sum_Five_Year
<chr> <int> <dbl> <dbl>
1 DELMARPHARM 2010 80 80
2 DELMARPHARM 2012 195 275
3 DELMARPHARM 2013 16 291
4 DELMARPHARM 2014 14 305
5 DELMARPHARM 2015 21 246
6 DELMARPHARM 2017 10 61
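For reference, the same rolling window can also be written with data.table, which the question was already attempting. This is only a sketch (untested) that mirrors the dplyr logic above: convert the character columns, aggregate per company/year, compute the 5-year window with sapply(), then join the result back onto the original rows.

library(data.table)
setDT(DF)

# the example data stores year and N_C as character
DF[, year := as.integer(as.character(year))]
DF[, N_C := as.numeric(as.character(N_C))]

# one row per company/year, then the sum over the previous 5 years (including the current one)
yearly <- DF[, .(Sum_Year = sum(N_C)), by = .(company, year)]
yearly[, Sum_Five_Year := sapply(year, function(x) sum(Sum_Year[between(year, x - 4, x)])),
       by = company]

# attach the result back to the original row-level table
DF[yearly, on = .(company, year), Sum_Five_Year := i.Sum_Five_Year]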
I've got a data frame with panel data: subjects' characteristics through time. I need to create a column with a sequence from 1 up to the number of years for each subject. For example, if subject 1 is in the data frame from 2000 to 2005, I need the following sequence: 1, 2, 3, 4, 5, 6.
Below is a small fraction of my data. The last column (exp) is what I am trying to get. Additionally, if you have a look at the first subject (13), you'll see that in 2008 the value of qtty is zero. In this case I just need an NA or a code (0, 1, -9999); it doesn't matter which one.
Below the data is what I did to get that vector, but it didn't work.
Any help will be much appreciated.
subject season qtty exp
13 2000 29 1
13 2001 29 2
13 2002 29 3
13 2003 29 4
13 2004 29 5
13 2005 27 6
13 2006 27 7
13 2007 27 8
13 2008 0 NA
28 2000 18 1
28 2001 18 2
28 2002 18 3
28 2003 18 4
28 2004 18 5
28 2005 18 6
28 2006 18 7
28 2007 18 8
28 2008 18 9
28 2009 20 10
28 2010 20 11
28 2011 20 12
28 2012 20 13
35 2000 21 1
35 2001 21 2
35 2002 21 3
35 2003 21 4
35 2004 21 5
35 2005 21 6
35 2006 21 7
35 2007 21 8
35 2008 21 9
35 2009 14 10
35 2010 11 11
35 2011 11 12
35 2012 10 13
My code:
numbY<-aggregate(season ~ subject, data = toCountY,length)
colnames(numbY)<-c("subject","inFish")
toCountY$inFish<-numbY$inFish[match(toCountY$subject,numbY$subject)]
numbYbyFisher<-unique(numbY)
seqY<-aggregate(numbYbyFisher$inFish, by=list(numbYbyFisher$subject), function(x)seq(1,x,1))
I am using ddply and I distinguish 2 cases:
Either you generate a sequence along subject and replace it with NA where qtty is zero:
library(plyr)

ddply(dat, .(subject), transform, new.exp = ifelse(qtty == 0, NA, seq_along(subject)))
Or you generate a sequence along the non-zero values of qtty, with a gap where qtty is zero:
ddply(dat, .(subject), transform, new.exp = {
  hh <- seq_along(which(qtty != 0))
  if (length(which(qtty == 0)) > 0)
    hh <- append(hh, NA, which(qtty == 0) - 1)
  hh
})
EDITED
# assumes subject and qtty are available as vectors (e.g. via with(dat, ...)) and the data are sorted by subject
ind <- qtty != 0
exp <- numeric(length(subject))   # rows where qtty == 0 keep the code 0
temp <- list()
for (i in 1:length(unique(subject[ind]))) {
  temp[[i]] <- seq(from = 1, to = table(subject[ind])[i])
}
exp[ind] <- unlist(temp)
This will provide what you need.
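For completeness, the first case can also be written with dplyr (a sketch, assuming the data frame is called dat with the subject and qtty columns shown above):

library(dplyr)

dat %>%
  group_by(subject) %>%
  mutate(exp = ifelse(qtty == 0, NA, row_number())) %>%
  ungroup()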