Aggregating based on previous year and this year - r

I have this data set:
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext <- aggregate(Rain ~ Year, data = subdataprnext, mean)
but that gives a separate mean for each calendar year, whereas I need a single mean over all six months (217.53).
I am not getting the expected result. Thank you for your help.
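One possible approach is to label each row with a season rather than a calendar year and aggregate on that label. The sketch below assumes the six rows belong to one rainy season that starts in October; the names `rain` and `season` are introduced here for illustration and are not from the question. On the six rows shown it gives roughly 217.75, so the 217.53 quoted above may be a typo or may come from rows that are not shown.

```r
# The six rows from the question
rain <- data.frame(
  month = c(10, 11, 12, 1, 2, 3),
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  Rain  = c(376.8, 282.78, 324.58, 73.51, 225.89, 22.96)
)

# Assumption: a season starts in October, so Oct-Dec are labelled
# with the following calendar year (Oct 2010 -> season 2011)
rain$season <- ifelse(rain$month >= 10, rain$Year + 1, rain$Year)

# One mean per season instead of one mean per calendar year
season_mean <- aggregate(Rain ~ season, data = rain, FUN = mean)
print(season_mean)
```

If the seasons in the real data start in a different month, only the `>= 10` cutoff needs to change.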

Related

Updating table with custom numbers

Below is my dataset, which contains four columns id, year, quarter, and price.
df <- data.frame(id = c(1,2,1,2),
                 year = c(2010,2010,2011,2011),
                 quarter = c("2010-q1","2010-q2","2011-q1","2011-q2"),
                 price = c(10,50,10,50))
Now I want to expand this dataset to cover 2012 and 2013. First, I want to copy the rows for 2010 and 2011 and append them below the existing ones. Then, in the copied rows, I want to replace the years with 2012 and 2013 and the quarters with 2012-q1, 2012-q2, 2013-q1 and 2013-q2.
Can anybody help me solve this and produce the table below?
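library(dplyr)

# Shift the years by two, rebuild the quarter labels, then append to the original rows
df %>%
  mutate(year = year + 2, quarter = paste0(year, "-q", id)) %>%
  bind_rows(df, .)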
id year quarter price
1 1 2010 2010-q1 10
2 2 2010 2010-q2 50
3 1 2011 2011-q1 10
4 2 2011 2011-q2 50
5 1 2012 2012-q1 10
6 2 2012 2012-q2 50
7 1 2013 2013-q1 10
8 2 2013 2013-q2 50
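For readers without dplyr, the same idea can be sketched in base R. Like the pipeline above, this assumes that `id` doubles as the quarter number; the names `shifted` and `expanded` are introduced here for illustration.

```r
df <- data.frame(id = c(1, 2, 1, 2),
                 year = c(2010, 2010, 2011, 2011),
                 quarter = c("2010-q1", "2010-q2", "2011-q1", "2011-q2"),
                 price = c(10, 50, 10, 50))

# Copy the rows and move them two years forward
shifted <- transform(df, year = year + 2)

# Rebuild the quarter label from the new year
# (assumption: id is the quarter number, as in the sample data)
shifted$quarter <- paste0(shifted$year, "-q", shifted$id)

# Stack the original rows on top of the shifted copy
expanded <- rbind(df, shifted)
print(expanded)
```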

How to query NOAA for historical daily temperature averages using rnoaa?

I'm trying to find the historical average temperature over a range of dates using NOAA data and compare it to the long-term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
                     datatypeid='dly-tavg-normal',
                     startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
                     startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
                     datatypeid = "HLY-TEMP-NORMAL",
                     startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 and 33.6 degrees.
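The conversion is a division by 10. A minimal sketch, using three of the daily values from the output above (the lowest, the highest and the first) in a small hand-built data frame; the names `daily` and `temp` are introduced here for illustration.

```r
# Three rows copied from the aggregate() output above
daily <- data.frame(
  day   = c(15, 30, 31),
  value = c(323.5417, 335.7917, 308.0000)
)

# Values are stored in tenths of a degree, so divide by 10
daily$temp <- daily$value / 10

# Lowest and highest daily average in degrees
temp_range <- range(daily$temp)
print(temp_range)
```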
Also note that to calculate the average per station when the data covers multiple weather stations, simply add station to the formula passed to aggregate().
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
The answer is to grab historical weather data (meaning the actual values on the day specified, not long-term averages) from NOAA's ISD database. The USAF and WBAN values for a station can be found in the isd-history.csv file here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a year's worth of roughly hourly weather data from ISD. You can then parse and process this data however you see fit (e.g. to get daily average temperatures, as I did).
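The daily averaging step can be sketched as follows. This does not call NOAA; it uses a made-up hourly data frame as a stand-in for already cleaned ISD output. The column names `datetime` and `temp_c` are hypothetical, and the real isd() result uses its own column names and encodings that need to be parsed first.

```r
# Hypothetical stand-in for cleaned hourly observations:
# 48 hourly readings, 2 degrees on the first day and 4 on the second
hourly <- data.frame(
  datetime = as.POSIXct("2018-01-01 00:00", tz = "UTC") + 3600 * 0:47,
  temp_c   = c(rep(2, 24), rep(4, 24))
)

# Reduce each timestamp to its calendar day (in UTC)
hourly$day <- as.Date(hourly$datetime, tz = "UTC")

# Average all readings that fall on the same day
daily <- aggregate(temp_c ~ day, data = hourly, FUN = mean)
print(daily)
```

Rows with a missing temperature are dropped by the formula interface of aggregate() before the mean is taken, which is usually what is wanted for sparse hourly records.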

find number of customers added each month

customer_id transaction_id month year
1 3 7 2014
1 4 7 2014
2 5 7 2014
2 6 8 2014
1 7 8 2014
3 8 9 2015
1 9 9 2015
4 10 9 2015
5 11 9 2015
2 12 9 2015
I am familiar with R basics. Any help will be appreciated.
The expected output should look like the following:
month year number_unique_customers_added
7 2014 2
8 2014 0
9 2015 3
In month 7 of 2014, only customer_ids 1 and 2 are present, so the number of customers added is two. In month 8 of 2014, no new customer ids appear, so zero customers are added in this period. Finally, in month 9 of 2015, customer_ids 3, 4 and 5 are new, so the number of customers added in this period is 3.
Using data.table:
require(data.table)
dt[, .SD[1,], by = customer_id][, uniqueN(customer_id), by = .(year, month)]
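Explanation: We first keep only the first transaction of each customer (the one that makes her a "new customer"), and then count unique customers for each combination of year and month. Note that a month with no new customers, such as month 8 of 2014, does not appear in the result at all rather than being listed with a count of 0.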
Using dplyr, we can first create a column that indicates whether each row is a customer's first appearance, then group_by month and year and sum that column to count the new customers in each group.
library(dplyr)
df %>%
mutate(unique_customers = !duplicated(customer_id)) %>%
group_by(month, year) %>%
summarise(unique_customers = sum(unique_customers))
# month year unique_customers
# <int> <int> <int>
#1 7 2014 2
#2 8 2014 0
#3 9 2015 3
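The same flag-and-sum idea also works without extra packages. A minimal base R sketch, assuming the rows are already in chronological order so that the first occurrence of a customer_id is that customer's first transaction; the names `is_new` and `new_per_month` are introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  customer_id    = c(1, 1, 2, 2, 1, 3, 1, 4, 5, 2),
  transaction_id = 3:12,
  month          = c(7, 7, 7, 8, 8, 9, 9, 9, 9, 9),
  year           = c(2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015)
)

# 1 on the first row seen for each customer, 0 on every later row
# (assumption: rows are sorted by time)
df$is_new <- as.integer(!duplicated(df$customer_id))

# Sum the flags per month/year; months with only repeat customers get 0
new_per_month <- aggregate(is_new ~ month + year, data = df, FUN = sum)
print(new_per_month)
```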

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I think a boxplot may not be the best way to depict your data. In the sample shown there is only one value per year/month, whereas a boxplot needs several. A simple scatter plot may work better.
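One way to check whether missing values explain the empty positions is to ask boxplot() for its per-group statistics without drawing anything. The sketch below rebuilds the 24 sample rows from the question in a data frame called `ri` (a name chosen here for illustration) and counts the non-missing values in each month/year group; groups with a count of 0 cannot produce a box.

```r
# The 24 sample rows from the question: 2015 has NA for months 1-6
ri <- data.frame(
  year  = rep(2015:2016, each = 12),
  month = rep(1:12, times = 2),
  RI    = c(rep(NA, 6),
            0.387096774, 0.580645161, 0.3, 0.225806452, 0.3, 0.161290323,
            0.096774194, 0.103448276, 0.161290323, 0.366666667, 0.258064516,
            0.266666667, 0.387096774, 0.129032258, 0.133333333, 0.032258065,
            0.133333333, 0.129032258)
)

# plot = FALSE returns the statistics instead of drawing the plot
b <- boxplot(RI ~ month + year, data = ri, plot = FALSE)

# b$n holds the number of non-missing observations in each group
counts <- data.frame(group = b$names, n = b$n)
empty_groups <- counts$group[counts$n == 0]
print(empty_groups)
```

Running the same check on the full data set (all individuals) would show whether the 2014 and 2015 groups really contain any non-missing values.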

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
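For comparison, base R's reshape() can produce the same layout in a single call without data.table. This is a sketch on the sample data only; the names `long` and `wide` are introduced here for illustration, and the column order differs from the dcast result (columns are grouped by country rather than by variable).

```r
long <- read.table(header = TRUE, text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# Year stays as the ID; every other column is spread across the values of Country
wide <- reshape(long, idvar = "Year", timevar = "Country", direction = "wide")

# Rows come out in order of first appearance, so sort by Year
wide <- wide[order(wide$Year), ]
print(wide)
```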
