Calculate mean of weather station data in R

I am working with a dataset of about 800 weather stations, with monthly air temperature values for each station from 1986 to 2014. The data are split into three columns: (1) station name, (2) date (year and month), and (3) Temp. In general, the data look something like this:
STATION DATE TEMP
Station 1 198601 -15
Station 1 198602 -16
Station 1 201401 -10
Station 1 201402 -14
Station 2 198601 -11
Station 2 198602 -9
Station 2 201401 -5
Station 2 201402 -4
I need to extract the average temperature at each weather station for a given month over various year ranges. For example, I might need the average July temperature at each weather station from 1986 to 1990. My ideal output, then, would be a new list or data frame giving the average temperature for each station, based on my specified date range.
I am sure this can be accomplished using a for loop, but I am not very proficient at creating such code. Any suggestions would be greatly appreciated.

Using dplyr instead of data table
weather <- data.frame(station = c("Station 1", "Station 1", "Station 1", "Station 1",
"Station 2", "Station 2", "Station 2", "Station 2"),
date = c(198601, 198602, 201401, 201402, 198601, 198602, 201401, 201402),
temp = c(-15, -16, -10, -14, -11, -9, -5, -4))
library(dplyr)
library(stringr)
# get month and year columns in data
weather <- mutate(weather,
year = as.integer(str_extract(date, "^\\d{4}")),
month = str_extract(date, "\\d{2}$"))
# get the mean for each station for each month
mean_station <- group_by(weather, station, month) %>%
summarise(mean_temp = mean(temp, na.rm = TRUE))
If you only need this for a certain range of years, you can add a filter on the year:
mean_station <- group_by(weather, station, month) %>%
filter(year >= 1986, year <= 2015) %>%
summarise(mean_temp = mean(temp, na.rm = TRUE))
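For the specific request in the question (one month, one range of years, one mean per station), a minimal base R sketch might look like the following. The helper name `station_mean` and its arguments are hypothetical, and the sketch assumes DATE is always a six-digit yyyymm number, as in the sample data.

```r
# Sample data from the question
weather <- data.frame(
  STATION = rep(c("Station 1", "Station 2"), each = 4),
  DATE = rep(c(198601, 198602, 201401, 201402), times = 2),
  TEMP = c(-15, -16, -10, -14, -11, -9, -5, -4),
  stringsAsFactors = FALSE
)

# Split the yyyymm number into year and month with integer arithmetic
weather$year <- weather$DATE %/% 100
weather$month <- weather$DATE %% 100

# Hypothetical helper: mean TEMP per station for one month within [from, to]
station_mean <- function(data, month, from, to) {
  keep <- data$month == month & data$year >= from & data$year <= to
  aggregate(TEMP ~ STATION, data = data[keep, ], FUN = mean)
}

jan_all <- station_mean(weather, month = 1, from = 1986, to = 2014)
jan_early <- station_mean(weather, month = 1, from = 1986, to = 1990)
```

Calling `station_mean(weather, month = 7, from = 1986, to = 1990)` on the full dataset would give the July averages asked about; the sample data only contains January and February.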

Something like this...?
> df$month <- substr(df$DATE, 5, 6)
> result <- aggregate(TEMP~STATION+month, mean, data=df)
> result
STATION month TEMP
1 Station1 01 -12.5
2 Station2 01 -8.0
3 Station1 02 -15.0
4 Station2 02 -6.5

Or maybe
library(data.table)
setDT(df)[, list(MeanTemp = mean(TEMP)),
by = list(STATION, Mon = substr(DATE, 5, 6))]
# STATION Mon MeanTemp
# 1: Station 1 01 -12.5
# 2: Station 1 02 -15.0
# 3: Station 2 01 -8.0
# 4: Station 2 02 -6.5

I am also learning R and may not answer your question directly, but I wanted to mention that the seas package is helpful for analysing this type of data.
For example:
require(seas)
pdf( paste("test",".pdf", sep="") )
for (i in 1: length(STATION)){
d1 <-mksub(mdata,id=STATION[i]) # making a subset for each station based on name/unique id
dat.ss <- seas.sum(d1)
plot(dat.ss)
}
graphics.off()
You have to ensure that the structure of your dataset, as shown by str(), is in the format that seas needs.
With such a large dataset, I would advise using loops and functions, which speed up the analysis a bit. If you have another way of looping that works, I would be grateful if you could share it.

Related

How to organise a date in 3 different columns in r?

I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this:
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of data and cannot do it manually.
Can anybody help me? Thanks a lot.
Using strsplit, you can do it this way:
Logic: strsplit splits each date at the dashes, producing a list with one element per date, each holding three parts: year, month, and day. do.call applies rbind to all of those elements at once, binding them into a matrix with one row per date. Since the outcome is a matrix, we convert it into a data frame and use setNames to name the columns. The final cbind binds this 3x3 data frame to the original precipitation column.
cbind(setNames(data.frame(do.call('rbind', strsplit(as.character(df$date), '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
The following returns integer values for year, month, and day. If you really need them as zero-padded character strings, you can use formatC(x, width = 2, flag = "0") on the result.
library(clock)
library(dplyr)
df <- data.frame(
date = c("2011-03-24", "2011-02-03", "2011-01-02"),
pp = c(20, 22, 23)
)
df %>%
mutate(
date = as.Date(date),
year = get_year(date),
month = get_month(date),
day = get_day(date)
)
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
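If the zero-padded character form from the question is what is wanted, a base R sketch along these lines may be enough. It assumes every date string is in yyyy-mm-dd form; the name `out` is only illustrative.

```r
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  Precipitation = c(20, 22, 23),
  stringsAsFactors = FALSE
)

# Parse once, then let format() produce zero-padded character pieces
d <- as.Date(df$date)
out <- data.frame(
  year = format(d, "%Y"),
  month = format(d, "%m"),
  day = format(d, "%d"),
  pp = df$Precipitation,
  stringsAsFactors = FALSE
)
```

Because format() returns character strings, the leading zeros in month and day are kept; wrapping each call in as.integer() would give numbers instead.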

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
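Edit: note that the output for all the new columns is character. We can tell this without using str() because of the leading zeros in the new columns. To get rid of this, we can convert just those columns, for example out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)), or append that mutate call to the end of our existing pipeline. Avoid mutate_all(as.integer) here: it would also coerce code, whose 16-digit values are too large for an integer and become NA, and date.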
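The same idea can be sketched without any packages. This version assumes, as stated in the question, that the date always occupies characters 6 to 13 of the code in yyyymmdd form; it builds the integer columns directly.

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6 to 13 hold the date as yyyymmdd
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

dat2 <- data.frame(
  code = code,
  year = as.integer(format(date, "%Y")),
  month = as.integer(format(date, "%m")),
  day = as.integer(format(date, "%d")),
  dayofyear = as.integer(format(date, "%j")),  # %j is the day of the year
  stringsAsFactors = FALSE
)
```

Keeping code as a character column avoids any loss of precision on the 16-digit values.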

How to wrangle the data using Lubridate package and Regex instead of using the separate function?

https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis/data ---- contains the data set.
This is an exploratory data analysis performed on the shows from the Netflix data set. There are two main objectives in the data wrangling process. The first is to get only the year part from the date_added column. The second is to create a new column containing the number of seasons for a particular show from the duration column. I have relied on the separate function from the tidyr package to achieve both objectives.
The code is as follows:-
# Netflix EDA ----
# https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis
library(tidyverse)
library(lubridate)
net_flix <- read.csv("netflix_titles_nov_2019.csv")
net_flix_wrangled_tbl <- net_flix %>%
separate(date_added,
into = c("date","month","year"),
sep = "-",
remove = FALSE)%>%
separate(duration,
into = c("count","show_type"),
sep = " ",
remove = FALSE)%>%
glimpse()
Those who do not wish to download the data can use the following data frame:
sf <- data.frame(date_added = c("30-11-19", "29-11-19", "", "12-07-19", "", "16-09-19"),
duration = c("1 Season", "67 min", "135 min", "2 Seasons", "107 min", "3 Seasons"))
The separate() function works for both getting the date parts and splitting the number of seasons out of the duration column.
But can this be done in a better, more robust way, using the lubridate package to get the year, and ifelse() and filter() or a regex function to get only the number of seasons and not the minutes of movies?
Here is one alternative:
library(dplyr)
library(lubridate)
sf %>%
mutate(date_added = dmy(date_added),
date = day(date_added), month = month(date_added),
year = year(date_added),
count = readr::parse_number(as.character(duration)),
show_type = stringr::str_remove(duration, as.character(count)))
# date_added duration date month year count show_type
#1 2019-11-30 1 Season 30 11 2019 1 Season
#2 2019-11-29 67 min 29 11 2019 67 min
#3 <NA> 135 min NA NA NA 135 min
#4 2019-07-12 2 Seasons 12 7 2019 2 Seasons
#5 <NA> 107 min NA NA NA 107 min
#6 2019-09-16 3 Seasons 16 9 2019 3 Seasons
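To keep only the season counts and leave movie durations as NA, a regex-based sketch could look like this. It uses base R rather than lubridate so that it runs without extra packages (lubridate::dmy() and year() would be the equivalent calls), and it assumes two-digit years in dd-mm-yy form. The column names `year` and `seasons` are illustrative.

```r
sf <- data.frame(
  date_added = c("30-11-19", "29-11-19", "", "12-07-19", "", "16-09-19"),
  duration = c("1 Season", "67 min", "135 min", "2 Seasons", "107 min", "3 Seasons"),
  stringsAsFactors = FALSE
)

# Empty strings fail to parse and become NA
added <- as.Date(sf$date_added, format = "%d-%m-%y")
sf$year <- as.integer(format(added, "%Y"))

# Rows whose duration reads "<n> Season" or "<n> Seasons"
is_show <- grepl("^\\d+ Seasons?$", sf$duration)
sf$seasons <- NA_integer_
sf$seasons[is_show] <- as.integer(sub(" Seasons?$", "", sf$duration[is_show]))
```

Filling only the matching rows avoids the coercion warnings that as.integer() would raise on values such as "67 min".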

Sum daily values into monthly values

I am trying to sum daily rainfall values into monthly totals for a record over 100 years in length. My data takes the form:
Year Month Day Rain
1890 1 1 0
1890 1 2 3.1
1890 1 3 2.5
1890 1 4 15.2
In the example above I want R to sum all the days of rainfall in January 1890, then February 1890, March 1890.... through to December 2010. I guess what I'm trying to do is create a loop to sum values. My output file should look like:
Year Month Rain
1890 1 80.5
1890 2 72.4
1890 3 66.8
1890 4 77.2
Any easy way to do this?
Many thanks.
You can use dplyr for some pleasing syntax
library(dplyr)
df %>%
group_by(Year, Month) %>%
summarise(Rain = sum(Rain))
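The same grouping can be sketched in base R with aggregate(). The rainfall values below are made up for illustration, and `rain_df` and `monthly` are hypothetical names.

```r
# Made-up daily values in the layout described in the question
rain_df <- data.frame(
  Year = c(1890, 1890, 1890, 1890, 1890, 1891),
  Month = c(1, 1, 1, 2, 2, 1),
  Day = c(1, 2, 3, 1, 2, 1),
  Rain = c(0, 3.1, 2.5, 15.2, 1.0, 4.0)
)

# One row per Year/Month combination, holding the summed rainfall
monthly <- aggregate(Rain ~ Year + Month, data = rain_df, FUN = sum)

# Sort chronologically and reset the row names
monthly <- monthly[order(monthly$Year, monthly$Month), ]
rownames(monthly) <- NULL
```

The explicit ordering step is there because aggregate() does not return the groups in chronological order when there are several years.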
In some cases it can be beneficial to convert it to a time-series class like xts, then you can use functions like apply.monthly().
Data:
df <- data.frame(
Year = rep(1890,5),
Month = c(1,1,1,2,2),
Day = 1:5,
rain = rexp(5)
)
> head(df)
Year Month Day rain
1 1890 1 1 0.1528641
2 1890 1 2 0.1603080
3 1890 1 3 0.5363315
4 1890 2 4 0.6368029
5 1890 2 5 0.5632891
Convert it to xts and use apply.monthly():
library(xts)
dates <- with(df, as.Date(paste(Year, Month, Day), format = "%Y %m %d"))
myXts <- xts(df$rain, dates)
> head(apply.monthly(myXts, sum))
[,1]
1890-01-03 0.8495036
1890-02-05 1.2000919

R - group_by utilizing splinefun

I am trying to group my data by Year and CountyID then use splinefun (cubic spline interpolation) on the subset data. I am open to ideas, however the splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data Year = 2010 : 2013, Agegrp = 1 : 17 and CountyIDs are equal to all counties in the US.
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking Agegrp 1:17 and splitting the groups into single years of age, 0 to 84. Right now each group represents 5 years. splinefun lets me do this with a level of mathematical rigour, i.e., it allows me to provide a population total for each single year of age in each individual county in the US.
Lastly, the splinefun code by itself does work, but within the group_by function it does not; it produces:
Error: wrong result size(4), expected 68 or 1.
The splinefun code the way I am using it works like this
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)),
method = "hyman")
TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))
Which was tested on one CountyID during one Year. I need to iterate this process over "x" amount of years and roughly 3200 counties.
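To make the single-county step concrete, here is a self-contained sketch of the cumulative-spline approach described above. The first five counts are taken from the sample rows; the remaining twelve are made up for illustration, and the names `pop5`, `cum_fun`, and `single_year` are hypothetical.

```r
# 17 five-year age groups (0-4, ..., 80-84); values after the fifth are made up
pop5 <- c(3586, 3952, 4282, 4136, 3154, 3000, 3100, 3300, 3500,
          3700, 3600, 3200, 2800, 2300, 1800, 1300, 900)

# Knots at the group boundaries: 0, 5, ..., 85
age <- seq(from = 0, by = 5, length.out = 18)

# Monotone spline through the cumulative population at each boundary
cum_fun <- splinefun(age, c(0, cumsum(pop5)), method = "hyman")

# Differences of the cumulative curve give one count per single year of age
single_year <- pmax(0, diff(cum_fun(0:85)))

# Re-aggregate to five-year groups to check that the totals are preserved
regrouped <- as.numeric(tapply(single_year, rep(1:17, each = 5), sum))
```

Because the spline passes through the cumulative totals at every boundary, each block of five single-year values sums back to the original group count.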
# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100),
Year = rep(2001:2010, each = 10),
Agegrp = sample(1:17, 500, replace=TRUE),
TOT_POP = rnorm(500, 10000, 2000))
# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"
# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))
# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for( i in 1:length(split.dfs)) {
new.split.dfs[[i]] = data.frame( CountyID=split.dfs[[i]]$CountyID[1],
Year=split.dfs[[i]]$Year[1],
age=0:85,
TOT_POP=spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
library(plyr) # provides colwise(), ddply(), and the always-TRUE predicate true()
interpolate <- function(x, ageVector){
result <- splinefun(ageVector,
c(0, cumsum(x)), method = "hyman")
diff(result(c(0:85)))
}
mainFunc <- function(df){
age <- seq(from = 0, by = 5, length.out = 18)
# every column except the identifiers gets interpolated
colNames <- setdiff(colnames(df),
c("Year","CountyID","Agegrp"))
colWiseSpline <- colwise(interpolate, .cols = true,
age)(df[ , colNames])
cbind(data.frame(
Year = df$Year[1],
County = df$CountyID[1],
Agegrp = 0:84
),
colWiseSpline
)
}
CompleteMainRaw <- ddply(.data = df,
.variables = .(CountyID, Year),
.fun = mainFunc)
The code now takes each county by year and runs splinefun on that subset of the population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 age groups into 85 single-year age groups while apportioning the totals appropriately, which is what splinefun does.
Thanks!
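For readers without plyr, the same per-county, per-year application can be sketched in base R with split() and lapply(). The data here are randomly generated for two hypothetical counties in one year, and only a single population column is interpolated.

```r
# Disaggregate 17 five-year counts into 85 single-year counts
interpolate <- function(x, age) {
  cum_fun <- splinefun(age, c(0, cumsum(x)), method = "hyman")
  diff(cum_fun(0:85))
}

# Made-up data: 2 counties x 1 year x 17 age groups
set.seed(1)
df <- expand.grid(Agegrp = 1:17, Year = 2010, CountyID = c(1001, 1002))
df$TOT_POP <- round(runif(nrow(df), 1000, 5000))

age <- seq(from = 0, by = 5, length.out = 18)

# One data frame per CountyID-Year combination, then stack the results
groups <- split(df, list(df$CountyID, df$Year), drop = TRUE)
pieces <- lapply(groups, function(g) {
  g <- g[order(g$Agegrp), ]  # age groups must be in ascending order
  data.frame(CountyID = g$CountyID[1],
             Year = g$Year[1],
             age = 0:84,
             TOT_POP = interpolate(g$TOT_POP, age))
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

Each county-year contributes 85 rows, and the interpolated values for a county sum to its original total.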
