I want to find the average over the months from Nov to March, say Nov 1982 to Mar 1983. For my result, I want one column with the year and another with the mean. If the mean runs through Mar 1983, I want the year shown next to that mean to be 1983.
This is what my data looks like.
I want my result to look like this.
1983 29.108
1984 26.012
I am not very good with R packages. If there is an easy way to do this, I would really appreciate any help. Thank you.
Here is one approach to get average of Nov-March every year.
library(dplyr)
df %>%
#Remove data for month April-October
filter(!between(month, 4, 10)) %>%
#arrange the data by year and month
arrange(year, month) %>%
#Remove 1st 3 months of the first year and
#last 2 months of last year
filter(!(year == min(year) & month %in% 1:3 |
year == max(year) & month %in% 11:12)) %>%
#Create a group column for every November entry
group_by(grp = cumsum(month == 11)) %>%
#Take average for each year
summarise(year = last(year),
value = mean(value)) %>%
select(-grp)
# A tibble: 2 x 2
# year value
# <int> <dbl>
#1 1982 0.308
#2 1983 -0.646
data
It is easier to help if you provide data in a reproducible format which can be copied easily.
set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12),month = 1:12,value = rnorm(36))
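A base R alternative, sketched under the assumption that the data has the same year, month and value columns as the df above: label each winter month with the year in which its season ends (November and December are pushed to the following year), then average per label. The names season_year and season_means are made up for this illustration.

```r
set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12), month = 1:12, value = rnorm(36))

# Keep only the November-March rows
winter <- df[df$month %in% c(11, 12, 1, 2, 3), ]

# Nov and Dec belong to the season that ends in the following year
winter$season_year <- winter$year + (winter$month >= 11)

# Keep only seasons that have all five months present
counts <- table(winter$season_year)
complete <- as.integer(names(counts)[counts == 5])
winter <- winter[winter$season_year %in% complete, ]

# One mean per season, labelled by the year of its March
season_means <- aggregate(value ~ season_year, data = winter, FUN = mean)
season_means
```

Dropping incomplete seasons plays the same role as removing the first three months of the first year and the last two months of the last year in the dplyr version.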
With dplyr
# remove the "#" at the beginning of the next line if dplyr or tidyverse is not installed
#install.packages("dplyr")
library(dplyr) # load the library
colnames(df) <- c("year","month","value") # here I assumed your dataset is named df
df<- df%>%
group_by(year) %>%
summarize(av_value =mean(value))
You can do this as follows using tidyverse
library(tidyverse)
year <- rep(1982:1984, each = 12)
month <- rep(1:12, 3)
value <- runif(length(month))
dat <- data.frame(year, month, value)
head(dat)
dat looks like your data
The trick then is to group_by and summarise
dat %>%
group_by(year) %>%
summarise(value = mean(value))
Which gives you
# A tibble: 3 × 2
year value
<int> <dbl>
1 1982 0.450
2 1983 0.574
3 1984 0.398
Related
I am new to R and I have problems with calculating the amount of bill for each month. I have the dataframe as below:
dat <- data.frame(
time = factor(c("Breakfast","Breakfast","Breakfast","Breakfast","Breakfast","Breakfast"), levels=c("Breakfast")),
date=c("2020-01-20","2020-01-21","2020-01-22","2020-02-10","2020-02-11","2020-02-12"),
total_bill = c(12.7557,14.8,17.23,15.7,16.9,13.2)
)
My goal is to calculate the amount spent on Breakfast for each month. Here we have two months, and I want to get the totals for January and February separately.
Any help for this would be much appreciated. Thank you!
Does this answer your question?
sums <- tapply(dat$total_bill, format(as.Date(dat$date), "%B"), sum)
February January
45.8000 44.7857
sums is a named array rather than a list, so if you want to access, for example, the value for February, you can index it by position (or by name, sums["February"]):
sums[1]
February
45.8
Alternatively, you can convert sums into a dataframe and access the monthly sums via the month names:
sums <- as.data.frame.list(tapply(dat$total_bill, format(as.Date(dat$date), "%B"), sum))
sums$February
45.8
Addition:
Another (fun) solution is via regex: you define the dates as a pattern and, using sub plus backreference \\1 to recall the two numbers between the dashes, reduce them to the months part:
tapply(dat$total_bill, sub("\\d{4}-(\\d{2})-\\d{2}", "\\1", dat$date), sum)
01 02
44.7857 45.8000
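If the data ever spans more than one year, grouping on the month name or on the two-digit month alone would merge, say, January 2020 with January 2021. A minimal sketch of the same tapply idea with a year-month key instead (the name sums_by_ym is made up for this illustration):

```r
dat <- data.frame(
  date = c("2020-01-20", "2020-01-21", "2020-01-22",
           "2020-02-10", "2020-02-11", "2020-02-12"),
  total_bill = c(12.7557, 14.8, 17.23, 15.7, 16.9, 13.2)
)

# "%Y-%m" keeps the year in the key and sorts chronologically
ym <- format(as.Date(dat$date), "%Y-%m")
sums_by_ym <- tapply(dat$total_bill, ym, sum)

# Access one month by its key
sums_by_ym["2020-02"]
```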
We can convert the 'date' to Date class, get the month, and use that as grouping column and sum the 'total_bill'
library(dplyr)
dat %>%
group_by(time, Month = format(as.Date(date), "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE))
# A tibble: 2 x 3
# Groups: time [1]
# time Month total_bill
# <fct> <chr> <dbl>
#1 Breakfast February 45.8
#2 Breakfast January 44.8
We can convert it to 'wide' format, if that is needed
library(tidyr)
out <- dat %>%
group_by(time, Month = format(as.Date(date), "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE)) %>%
pivot_wider(names_from = Month, values_from = total_bill)
out
# A tibble: 1 x 3
# Groups: time [1]
# time February January
# <fct> <dbl> <dbl>
# 1 Breakfast 45.8 44.8
If we also need to group by 'year'
out <- dat %>%
mutate(date = as.Date(date)) %>%
group_by(time, Year = format(date, "%Y"), Month = format(date, "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE))
library(dplyr)
d_sum <- dat %>%
group_by(substr(date, 0, 7)) %>%
summarise(sum = sum(total_bill))
d_sum
# A tibble: 2 x 2
`substr(date, 0, 7)` sum
<chr> <dbl>
1 2020-01 44.8
2 2020-02 45.8
Consider the following example:
library(tidyverse)
library(lubridate)
df = tibble(client_id = rep(1:3, each=24),
date = rep(seq(ymd("2016-01-01"), (ymd("2016-12-01") + years(1)), by='month'), 3),
expenditure = runif(72))
In df you have stored information on monthly expenditure from a bunch of clients for the past 2 years. Now you want to calculate the monthly difference between this year and the previous year for each client.
Is there any way of doing this while keeping the "long" format of the dataset? Here is the way I am currently doing it, which involves going wide:
df2 = df %>%
mutate(date2 = paste0('val_',
year(date),
formatC(month(date), width=2, flag="0"))) %>%
select(client_id, date2, expenditure) %>%
pivot_wider(names_from = date2,
values_from = expenditure)
df3 = (df2[,2:13] - df2[,14:25])
However, I find this unnecessarily complex, and in large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.
If you want to keep the data in long format, one way would be to group by client_id and by the month and day of the date, then calculate the difference within each group using diff.
library(dplyr)
df %>%
group_by(client_id, month_date = format(date, "%m-%d")) %>%
summarise(diff = -diff(expenditure))
# client_id month_date diff
# <int> <chr> <dbl>
# 1 1 01-01 0.278
# 2 1 02-01 -0.0421
# 3 1 03-01 0.0117
# 4 1 04-01 -0.0440
# 5 1 05-01 0.855
# 6 1 06-01 0.354
# 7 1 07-01 -0.226
# 8 1 08-01 0.506
# 9 1 09-01 0.119
#10 1 10-01 0.00819
# … with 26 more rows
An option with data.table
library(data.table)
# group by calendar month (not year-month) so each group holds the same month from both years
setDT(df)[, .(diff = -diff(expenditure)), .(client_id, month_date = format(date, "%m-%d"))]
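Another way to stay in long format, sketched here in base R, is a self-join: shift a copy of the data forward by one year and merge it back on client and month. The column names year, month, expenditure_prev and the object yoy are made up for this illustration. Note that this computes the current year minus the previous year, which is the opposite sign of the -diff() results above.

```r
set.seed(1)
dates <- seq(as.Date("2016-01-01"), as.Date("2017-12-01"), by = "month")
df <- data.frame(client_id = rep(1:3, each = 24),
                 date = rep(dates, 3),
                 expenditure = runif(72))

df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Copy of the data relabelled one year later, so 2016 rows line up with 2017
prev <- df[, c("client_id", "year", "month", "expenditure")]
prev$year <- prev$year + 1L
names(prev)[4] <- "expenditure_prev"

# Only rows that have a previous-year counterpart survive the merge
yoy <- merge(df, prev, by = c("client_id", "year", "month"))
yoy$diff <- yoy$expenditure - yoy$expenditure_prev
head(yoy)
```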
I have data on each avalanche that occurred. I need to calculate the number of avalanches for each year and month, but the data only gives the exact day on which each avalanche occurred. How do I count the number of occurrences in each year-month? I also only need the winter year-months (Dec (12) - March (3)). Please help!
library(XML)
library(RCurl)
library(dplyr)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep="")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T,stringsAsFactors=F)
names(thispage.avalanche)<-c('Date','Region','Location','Observer')
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
The output should look something like:
Date AvalancheTotal
2000-01 1
2000-02 2
2000-03 8
2000-12 23
2001-01 16
...
2019-03 45
Using dplyr, you could get the variable of interest ("year-month") from the Date column, group by this variable, and then compute the number of rows in each group.
In a similar way, you can filter to only get the months you like:
library(dplyr)
winter_months <- c(1:3, 12)
avalancheslc %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
mutate(YearMonth = format(Date,"%Y-%m"),
Month = as.numeric(format(Date,"%m"))) %>%
filter(Month %in% winter_months) %>%
group_by(YearMonth) %>%
summarise(AvalancheTotal = n())
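Since the scraped data is not easy to reproduce, here is a small base R sketch of the same steps on a handful of made-up dates (the values in dates_chr are hypothetical, only the "%m/%d/%Y" format is taken from the code above):

```r
# Hypothetical observation dates in the same month/day/year format
dates_chr <- c("12/24/2018", "12/30/2018", "1/5/2019",
               "3/14/2019", "7/4/2019", "3/20/2019")
obs <- data.frame(Date = as.Date(dates_chr, "%m/%d/%Y"))

# Keep only December-March observations
obs$Month <- as.integer(format(obs$Date, "%m"))
winter <- obs[obs$Month %in% c(12, 1:3), , drop = FALSE]

# Count rows per year-month
counts <- as.data.frame(table(YearMonth = format(winter$Date, "%Y-%m")),
                        responseName = "AvalancheTotal",
                        stringsAsFactors = FALSE)
counts
```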
We can convert to yearmon from zoo and use that in the group_by to get the number of rows
library(dplyr)
library(zoo)
dim(avalancheslc)
#[1] 5494 4
out <- avalancheslc %>%
group_by(Date = format(as.yearmon(Date, "%m/%d/%Y"), "%Y-%m")) %>%
summarise(AvalancheTotal = n())
If we need only output from December to March, then filter the data
subOut <- out %>%
filter(as.integer(substr(Date, 6, 7)) %in% c(12, 1:3))
Or it can be filtered earlier in the chain
library(lubridate)
out <- avalancheslc %>%
mutate(Date = as.yearmon(Date, "%m/%d/%Y")) %>%
filter(month(Date) %in% c(12, 1:3)) %>%
count(Date)
dim(out)
#[1] 67 2
Now, to fill in the year-months that have no observations with 0's
library(tidyr) # crossing(), unite() and replace_na()
mths <- month.abb[c(12, 1:3)]
out1 <- crossing(Months = mths,
Year = year(min(out$Date)):year(max(out$Date))) %>%
unite(Date, Months, Year, sep= " ") %>%
mutate(Date = as.yearmon(Date)) %>%
left_join(out) %>%
mutate(n = replace_na(n, 0))
tail(out1)
# A tibble: 6 x 2
# Date n
# <S3: yearmon> <dbl>
#1 Mar 2014 100
#2 Mar 2015 94
#3 Mar 2016 96
#4 Mar 2017 93
#5 Mar 2018 126
#6 Mar 2019 163
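The zero-filling idea can also be sketched in base R: build the full grid of year and winter-month combinations, left-join the observed counts onto it, and replace the resulting NA values with 0. The counts below are made up for this illustration.

```r
# Hypothetical observed counts; Feb 2019 and most of 2018 are missing
counts <- data.frame(Year  = c(2018L, 2019L, 2019L),
                     Month = c(12L, 1L, 3L),
                     n     = c(2L, 1L, 2L))

# Every combination of year and winter month
grid <- expand.grid(Year = 2018:2019, Month = c(12L, 1:3))

# Left join: keep all grid rows, NA where nothing was observed
filled <- merge(grid, counts, by = c("Year", "Month"), all.x = TRUE)
filled$n[is.na(filled$n)] <- 0L
filled <- filled[order(filled$Year, filled$Month), ]
filled
```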
I have a data frame with four habitats sampled over eight months. Ten samples were collected from each habitat each month. The number of individuals for species in each sample was counted. The following code generates a smaller data frame of a similar structure.
# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
I want to sum the total number of individuals by month, across all species sampled. I'm using ddply (preferred) but I'm open to other suggestions.
The closest I get is to add together the sum of each column, as shown here.
library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))
# Month tot_by_mon
# 1 Jan 84
# 2 Feb 92
# 3 Mar 67
This works, but I wonder if there is a generic method to handle cases with an "unknown" number of species. That is, the first species always begins in the 4th column but the last species could be in the 10th or 42nd column. I do not want to hard code the actual species names into the summary function. Note that the species names vary widely, such as Doryflav and Pheibica.
Similar to @useR's answer with data.table's melt, you can use tidyr to reshape with gather:
library(tidyr)
library(dplyr)
gather(df, Species, Value, matches("Species")) %>%
group_by(Month) %>% summarise(z = sum(Value))
# A tibble: 3 x 2
Month z
<fctr> <int>
1 Jan 90
2 Feb 81
3 Mar 70
If you know the columns by position instead of a pattern to be "matched"...
gather(df, Species, Value, -(1:3)) %>%
group_by(Month) %>% summarise(z = sum(Value))
(Results shown using @akrun's set.seed(123) example data.)
Here's another solution with data.table without needing to know the names of the "Species" columns:
library(data.table)
DT = melt(setDT(df), id.vars = c("Habitat", "Month", "Sample"))
DT[, .(tot_by_mon=sum(value)), by = "Month"]
or if you want it compact, here's a one-liner:
melt(setDT(df), 1:3)[, .(tot_by_mon=sum(value)), by = "Month"]
Result:
Month tot_by_mon
1: Jan 90
2: Feb 81
3: Mar 70
Data: (Setting seed to make example reproducible)
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
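For comparison, a base R sketch that also avoids naming the species columns: take everything from the 4th column onward, add it up per row with rowSums, then sum those row totals per month. It assumes, as the question states, that the species always start in column 4; the names species_cols, row_total and tot are made up for this illustration.

```r
set.seed(123)
Habitat <- factor(c(rep("Dry", 6), rep("Wet", 6)), levels = c("Dry", "Wet"))
Month <- factor(rep(c(rep("Jan", 2), rep("Feb", 2), rep("Mar", 2)), 2),
                levels = c("Jan", "Feb", "Mar"))
Sample <- rep(c(1, 2), 6)
df <- data.frame(Habitat, Month, Sample,
                 Species1 = rpois(12, 6),
                 Species2 = rpois(12, 6),
                 Species3 = rpois(12, 6))

# Species columns by position, however many there are
species_cols <- 4:ncol(df)

# Individuals per sample, then per month
row_total <- rowSums(df[species_cols])
tot <- aggregate(row_total, by = list(Month = df$Month), FUN = sum)
names(tot)[2] <- "tot_by_mon"
tot
```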
Suppose the species columns all start with Species; you can select them by the prefix and sum using group_by %>% do:
library(tidyverse)
df %>%
group_by(Month) %>%
do(tot_by_mon = sum(select(., starts_with('Species')))) %>%
unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
If column names don't follow a pattern, you can select by column positions, for instance if Species columns go from 4th to the end of data frame:
df %>%
group_by(Month) %>%
do(tot_by_mon = sum(select(., 4:ncol(.)))) %>%
unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
Here is another option with data.table without reshaping to 'long' format
library(data.table)
setDT(df)[, .(tot_by_mon = Reduce(`+`, lapply(.SD, sum))), Month,
.SDcols = Species1:Species3]
# Month tot_by_mon
#1: Jan 90
#2: Feb 81
#3: Mar 70
Or with tidyverse, we can nest the Species columns per month and sum each nested data frame with a purrr map function
library(dplyr)
library(tidyr) # nest()
library(purrr)
df %>%
select(Month, starts_with('Species')) %>%
nest(data = -Month) %>%
mutate(tot_by_mon = map_int(data, ~sum(unlist(.x)))) %>%
select(-data)
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 90
#2 Feb 81
#3 Mar 70
data
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2),
levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
I am using the baby names data in R for practice.
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
The dataframe output looked like this:
For each name, there's all the years, and the related pct for that year. I am stuck with getting the year with the highest pct for each name. How do I do this?
Pretty simple, once you know where the babynames data comes from. You had everything needed:
library(dplyr)
library(babynames)
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
You were missing this final step:
df %>%
group_by(name) %>%
filter(pct == max(pct))
# A tibble: 95,025 x 3
# Groups: name [95,025]
name year pct
<chr> <dbl> <dbl>
1 Aaban 2014 4.338256e-06
2 Aabha 2014 2.440269e-06
3 Aabid 2003 1.316094e-06
4 Aabriella 2015 1.363073e-06
5 Aada 2015 1.363073e-06
6 Aadam 2015 5.997520e-06
7 Aadan 2009 6.031433e-06
8 Aadarsh 2014 4.880538e-06
9 Aaden 2009 3.335645e-04
10 Aadesh 2011 1.370356e-06
# ... with 95,015 more rows
group_by and filter are your friends.
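The same "keep the row with the group maximum" idea can be tried without loading babynames. Here is a minimal base R sketch on a made-up table with the same name, year and pct columns (the values are hypothetical); ave() returns each group's maximum repeated on every row of that group, which plays the role of group_by plus max.

```r
# Hypothetical stand-in for the name/year/pct table
toy <- data.frame(name = c("Ann", "Ann", "Bob", "Bob", "Bob"),
                  year = c(2000, 2001, 2000, 2001, 2002),
                  pct  = c(0.1, 0.3, 0.5, 0.2, 0.4))

# Per-name maximum, aligned with the rows of toy
group_max <- ave(toy$pct, toy$name, FUN = max)

# Keep the rows where pct equals its group's maximum
top <- toy[toy$pct == group_max, ]
top
```

As with filter(pct == max(pct)), ties are kept, so a name can appear more than once if two years share the same maximum.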