How to reshape a dataframe with "reoccurring" columns?

I am new to data analysis with R. I recently got a pre-formatted environmental observation-model dataset, an example subset of which is shown below:
date site obs mod site obs mod
2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
Basically, the data contain hourly time series of observed and modelled concentrations of a pollutant at various sites, arranged in "reoccurring" columns, i.e. site - obs - mod (the example shows only 2 of the 75 sites). I read this "wide" dataset in as a data frame and want to reshape it into the "narrower" format below:
date site obs mod
2000-09-01 00:00:00 campus NA 61.63
2000-09-01 01:00:00 campus 52 62.55
2000-09-01 02:00:00 campus 52 63.52
2000-09-01 00:00:00 city centre 66 56.69
2000-09-01 01:00:00 city centre NA 54.75
2000-09-01 02:00:00 city centre 56 54.65
I believe the "reshape2" package is the tool for this. First I tried to melt and then dcast the dataset:
test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod"))
However, it returned only half of the data: the records for the sites after the first one ("campus"), i.e. "city centre", were all cut off:
date variable value
2000-09-01 00:00:00 site campus
2000-09-01 01:00:00 site campus
2000-09-01 02:00:00 site campus
2000-09-01 00:00:00 obs NA
2000-09-01 01:00:00 obs 52
2000-09-01 02:00:00 obs 52
2000-09-01 00:00:00 mod 61.63
2000-09-01 01:00:00 mod 62.55
2000-09-01 02:00:00 mod 63.52
I then tried recast:
test.recast <- recast(test.data, date ~ site + obs + mod)
However, it returned with error message:
Error in eval(expr, envir, enclos) : object 'site' not found
I have tried to search for previous questions but have not found similar scenarios (correct me if I am wrong). Could someone please help me with this?
Many thanks in advance!

You might be better off using base R reshape after doing some variable name cleanup.
Here's your data.
test <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "date site obs mod site obs mod
'2000-09-01 00:00:00' campus NA 61.63 'city centre' 66 56.69
'2000-09-01 01:00:00' campus 52 62.55 'city centre' NA 54.75
'2000-09-01 02:00:00' campus 52 63.52 'city centre' 56 54.65")
test
# date site obs mod site.1 obs.1 mod.1
# 1 2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
# 2 2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
# 3 2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
If you read the data in correctly, you should get the same names I got: as @Chase mentions in his answer, "recurring column names is a bit of an oddity and is not normal R behaviour", so we've got to fix that.
Note: Both of these options generate a "time" variable which you can go ahead and drop. You might want to keep it just in case you wanted to reshape back into a wide format.
Option 1: If you got names like I did (which you should have), the solution is simple. Just append ".0" to the first site's column names and use base R reshape():
names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL # reshape makes UGLY rownames
test
# date time site obs mod
# 1 2000-09-01 00:00:00 0 campus NA 61.63
# 2 2000-09-01 01:00:00 0 campus 52 62.55
# 3 2000-09-01 02:00:00 0 campus 52 63.52
# 4 2000-09-01 00:00:00 1 city centre 66 56.69
# 5 2000-09-01 01:00:00 1 city centre NA 54.75
# 6 2000-09-01 02:00:00 1 city centre 56 54.65
Option 2: If you really do have duplicated column names, the fix is still easy, and follows the same logic. First, create nicer column names (easy to do using rep()), and then use reshape() as described above.
names(test)[-1] <- paste(names(test)[-1],
rep(1:((ncol(test)-1)/3), each = 3), sep = ".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL
### Or, more convenient:
# names(test) <- make.unique(names(test))
# names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
# test <- reshape(test, direction = "long",
# idvar = "date", varying = 2:ncol(test))
# rownames(test) <- NULL
Optional step: The data in this form are still not totally "long". If you need that, one more step does it:
require(reshape2)
melt(test, id.vars = c("date", "site", "time"))
# date site time variable value
# 1 2000-09-01 00:00:00 campus 0 obs NA
# 2 2000-09-01 01:00:00 campus 0 obs 52.00
# 3 2000-09-01 02:00:00 campus 0 obs 52.00
# 4 2000-09-01 00:00:00 city centre 1 obs 66.00
# 5 2000-09-01 01:00:00 city centre 1 obs NA
# 6 2000-09-01 02:00:00 city centre 1 obs 56.00
# 7 2000-09-01 00:00:00 campus 0 mod 61.63
# 8 2000-09-01 01:00:00 campus 0 mod 62.55
# 9 2000-09-01 02:00:00 campus 0 mod 63.52
# 10 2000-09-01 00:00:00 city centre 1 mod 56.69
# 11 2000-09-01 01:00:00 city centre 1 mod 54.75
# 12 2000-09-01 02:00:00 city centre 1 mod 54.65
Update (to try to address some questions from the comments)
The reshape() documentation is quite confusing. It's best to work through a few examples to get an understanding of how it works. Specifically, "time" does not have to refer to time ("date" in your problem), but is more for, say, panel data, where records are collected at different times for the same ID. In your case, the only "id" in the original data is the "date" column. The other potential "id" is the site, but not in the way the data are organized.
Imagine, for a moment, if your data looked like this:
test1 <- structure(list(date = structure(1:3,
.Label = c("2000-09-01 00:00:00",
"2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"),
obs.campus = c(NA, 52L, 52L), mod.campus = c(61.63, 62.55,
63.52), obs.cityCentre = c(66L, NA, 56L), mod.cityCentre = c(56.69,
54.75, 54.65)), .Names = c("date", "obs.campus", "mod.campus",
"obs.cityCentre", "mod.cityCentre"), class = "data.frame", row.names = c(NA,
-3L))
test1
# date obs.campus mod.campus obs.cityCentre mod.cityCentre
# 1 2000-09-01 00:00:00 NA 61.63 66 56.69
# 2 2000-09-01 01:00:00 52 62.55 NA 54.75
# 3 2000-09-01 02:00:00 52 63.52 56 54.65
Now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1)). You'll see that reshape() treats the site names as "time" (which can be overridden by adding timevar = "site" to your reshape() call).
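For example, a minimal sketch of that override, using the test1 data defined above:
reshape(test1, direction = "long", idvar = "date",
        varying = 2:ncol(test1), timevar = "site")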
When direction = "long", you must specify which columns vary with "time". In your case, that is all the columns except for the first, hence my use of 2:ncol(test) for "varying".
Question under @Chase's answer: I think you misunderstand how melt() is supposed to work. Basically, it tries to give you the "skinniest" form of your data. In this case, the skinniest form is the "optional step" described above, since date + site is the minimum needed to make up a unique ID. (I would say that "time" can safely be dropped.)
Once your data are in the format described in the "optional step" (we'll assume the output has been stored as "test.melt"), you can easily pivot the table around in different ways. As a demonstration of what I mean, try the following and see what each one does.
dcast(test.melt, date + site ~ variable)
dcast(test.melt, date ~ variable + site)
dcast(test.melt, variable + site ~ date)
dcast(test.melt, variable + date ~ site)
It is not easy to have that flexibility if you stop at "Option 1" or "Option 2".
Update (a few years later)
melt from "data.table" can now "melt" multiple columns in a similar way that reshape does. It should work whether or not the column names are duplicated.
You can try the following:
measure <- c("site", "obs", "mod")
melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure)
# date variable site obs mod
# 1: 2000-09-01 00:00:00 1 campus NA 61.63
# 2: 2000-09-01 01:00:00 1 campus 52 62.55
# 3: 2000-09-01 02:00:00 1 campus 52 63.52
# 4: 2000-09-01 00:00:00 2 city centre 66 56.69
# 5: 2000-09-01 01:00:00 2 city centre NA 54.75
# 6: 2000-09-01 02:00:00 2 city centre 56 54.65

The fact that you have recurring column names is a bit of an oddity and is not normal R behaviour. Most of the time R forces you to have valid names via the make.names() function. Regardless, I'm able to duplicate your problem. Note I made my own example since yours isn't reproducible, but the logic is the same.
#Do not force unique names
s <- data.frame(id = 1:3, x = runif(3), x = runif(3), check.names = FALSE)
#-----
id x x
1 1 0.6845270 0.5218344
2 2 0.7662200 0.6179444
3 3 0.4110043 0.1104774
#Now try to melt, note that 1/2 of your x-values are missing!
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
The solution is to make your column names unique. As I said before, R does this by default in most cases. However, you can do it after the fact via make.unique()
names(s) <- make.unique(names(s))
#-----
[1] "id" "x" "x.1"
Note that the second x column now has ".1" appended to it. Now melt() works as you'd expect:
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
4 1 x.1 0.5218344
5 2 x.1 0.6179444
6 3 x.1 0.1104774
At this point, if you want to treat x and x.1 as the same variable, a little gsub() or another regex function will get rid of the offending suffix. This is a workflow I use quite often.
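For example, a minimal sketch of that cleanup (s.melt is just a made-up name for the melted result):
s.melt <- melt(s, id.vars = 1)
# strip the ".1", ".2", ... suffixes so x and x.1 are treated as the same variable
s.melt$variable <- gsub("\\.\\d+$", "", s.melt$variable)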

Related

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a data frame (DF1) where each row gives the latitude and longitude of a point at a given time. The variables are therefore the point's lat and long and the timestamp (a date-time object). In addition, each row is a unique timestamp for a unique point (in other words, no repeats).
I'm trying to match it up to weather data contained in a netCDF file. The hourly weather data is a 3-dimensional file that includes the lats and longs of the grid cells, the timestamps, and the value of the weather variable, which I'll call u for now.
What I want to end up with: DF1 has a column for the u values, which starts out as all missing. In the end I want to replace those missing values in DF1 with the appropriate u value from the netCDF file.
How I've done this so far: I have constructed a for loop that can extract the appropriate u values for the nearest grid to each point in DF1. For example, for lat x and long y in DF1 I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u values with the appropriate data. DF1 is a rather large dataframe and the netcdf is even bigger (DF1 is just a small subset of the full netcdf).
Some pseudo data:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/5/2021 04:00:00"
, '2/1/2021 06:00:00'
, "1/7/2021 01:00:00"
, "2/2/2021 01:00:00"
, "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,-89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 NA
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
Here is an example of the type of for loop I've constructed, I've left out some of the more specific details that aren't relevant:
for(i in 1:nrow(DF1)) {
### a number of steps here that identify the closest grid to each point. ####
mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))
# Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
# To make things easier I separate geometry into a lat and a long col
df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
df_u$long <- gsub("c\\(", "", df_u$long)
df_u$lat <- gsub("\\)", "", df_u$lat)
# I find that data.tables are a bit easier for these types of tasks, so I convert the full timeseries data and DF1 to data.tables (in reality I convert DF1 outside of the for loop)
df_u <- setDT(df_u)
# Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step.
df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
windu_df <- df_u_new[, .(time, lat, long, u)]
}
This works, but given the sheer size of the data frames/data.tables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge() is probably the slowest way to do this, but I kept running into issues using match() and inner_join().
Unfortunately I'm not able to give fully reproducible data here given that I'm working with a netCDF file, but df_u looks something like this for the first iteration:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/1/2021 02:00:00"
, "1/1/2021 03:00:00"
, "1/1/2021 04:00:00"
, "1/1/2021 05:00:00"
, "1/1/2021 06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2.5)
df_u <- data.frame(ID, datetime, lat, long,u)
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/1/2021 02:00:00 34 55 3.6
3 3 1/1/2021 03:00:00 34 55 2.1
4 4 1/1/2021 04:00:00 34 55 5.6
5 5 1/1/2021 05:00:00 34 55 4.4
6 6 1/1/2021 06:00:00 34 55 2.5
Once u is amended in DF1 it should read:
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2 and then retain 2.8 and replace the NA in row 2 with a value.
EDIT: The NETCDF data covers an entire year (so every hour for a year) for a decently large spatial area. It's ERA5 data. In addition, DF1 had thousands of unique lat/long and timestamp observations.
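Not part of the original question, but as a hedged sketch of one commonly suggested alternative for that last merge/ifelse/subset block: a data.table update join fills u in windu_df by reference, so no intermediate df_u_new is created. The column names follow the code above, it assumes u was initialised as NA_real_ so the numeric types match, and whether it is actually faster would need testing on the real data.
library(data.table)
# for rows matching on lat/long/time, keep any existing u and otherwise
# fill in the value from df_u, all by reference
windu_df[df_u, on = c("lat", "long", "time"), u := fcoalesce(u, i.u)]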

Simple manipulation on large data frame in r

I have a relatively large data frame. It contains roughly 40 million rows and 12 columns; part of it is shown below. Specifically, it is 3-hour-averaged ozone data for counties in the US. Each row represents a certain county and a certain day (from 19800101 to 20161231 for 3108 counties). Note that this data file is 7.05 GB.
index date state.fips county.fp X07.30 X10.30 X13.30 X16.30 X19.30 X21.30 X01.30 X04.30
1 01001 1980-01-01 01 001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02 01 001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03 01 001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04 01 001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05 01 001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06 01 001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945
The problem is that the times in the columns above are in UTC, and I need to convert them to US local time. There are four time zones in the contiguous US, namely the Eastern, Central, Mountain, and Pacific time zones. Yes, I only cover the contiguous US. How should I start this manipulation?
Also note that the original data file is large (7.05 GB), so we may run into out-of-memory errors. I am working on a laptop with 16 GB of RAM.
Below I post my code for doing this. However, I don't know how to add dplyr::case_when() to adjust the time zones.
names(ozone) <- gsub("^X","", names(ozone)) # get rid of X in columns names
ozone <- pivot_longer(ozone, cols = c('01.30','04.30','07.30',
'10.30','13.30','16.30','19.30','21.30'),
names_to = 'time', values_to = 'ozone_val')
ozone$date <- ymd(ozone$date) # convert to date format
ozone$date = as.POSIXct(paste(ozone$date, ozone$time),
format = "%Y-%m-%d %H.%M",
tz = 'UTC')
ozone$date <- with_tz(ozone$date, "America/New_York") # how to apply case_when here
ozone$time <- substr(ozone$date, 12, 19)
ozone$year.day <- substr(ozone$date, 1, 10)
ozone <- subset(ozone, select = -date)
ozone_1 <- pivot_wider(ozone, id_cols = c('index','state.fips','county.fp','year.day'),
names_from = 'time', values_from = 'ozone_val')
Any ideas?
This should get you started, but you'll need to post a more complete reproducible example and/or more info about exactly what you are looking for. You should be able to use this general framework if you do not run out of memory (e.g., you may be able to use something like dplyr::case_when() to assign the time zone based on the state, or to subset after making the column POSIXct); see the sketch after the code below. Hope this helps!
Also, I am happy to explain anything that is unclear!
library(data.table)
setDT(data)
names(data) <- gsub("^X", "", names(data))
dt <- melt(data, id.vars = c("index", "date", "state.fips", "county.fp"),
variable.name = "time", value.name = "ozone_val")
dt[, date := as.POSIXct(paste(as.character(date), time),
format = "%Y-%m-%d %H.%M",
tz = "America/New_York")]
print(dt, nrows = 10)
index date state.fips county.fp time ozone_val
1: 1001 1980-01-01 07:30:00 1 1 07.30 29.98488
2: 1001 1980-01-02 07:30:00 1 1 07.30 29.03014
3: 1001 1980-01-03 07:30:00 1 1 07.30 30.69475
4: 1001 1980-01-04 07:30:00 1 1 07.30 30.20852
5: 1001 1980-01-05 07:30:00 1 1 07.30 35.80190
---
44: 1001 1980-01-02 04:30:00 1 1 04.30 31.14960
45: 1001 1980-01-03 04:30:00 1 1 04.30 30.58366
46: 1001 1980-01-04 04:30:00 1 1 04.30 35.78104
47: 1001 1980-01-05 04:30:00 1 1 04.30 38.09185
48: 1001 1980-01-06 04:30:00 1 1 04.30 35.05945
Data:
data <- read.table(header = T, text = "index date state.fips county.fp X07.30 X10.30 X13.30 X16.30 X19.30 X21.30 X01.30 X04.30
1 01001 1980-01-01 01 001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02 01 001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03 01 001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04 01 001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05 01 001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06 01 001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945")
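As a hedged sketch of the case_when() idea mentioned above (not part of the original answer): a POSIXct column can only carry one time zone, so one option is to keep the UTC instants (as in the question's own parsing), map each state to an IANA zone name, and store the local clock time as text, converting one zone at a time. The FIPS groupings below are illustrative placeholders only, not a complete mapping, and they assume numeric FIPS codes as in the sample data.
library(dplyr) # for case_when()
dt[, tz_name := case_when(
  state.fips %in% c(13, 36) ~ "America/New_York",    # e.g. GA, NY
  state.fips %in% c(1, 17)  ~ "America/Chicago",     # e.g. AL, IL
  state.fips %in% c(8, 35)  ~ "America/Denver",      # e.g. CO, NM
  state.fips %in% c(6, 53)  ~ "America/Los_Angeles", # e.g. CA, WA
  TRUE                      ~ "UTC"                  # fill in the remaining states
)]
# keep the local time as text, converting one zone at a time
dt[, local_time := format(date, tz = tz_name[1], usetz = TRUE), by = tz_name]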

Create time series observations, timestamps and fill in the values

I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one per month for 48 months) represent the time to maturity of each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
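As a hedged alternative to the /30.5 approximation (not part of the answer above), whole calendar months can be counted directly; months_remaining() is just a made-up helper name here, assuming Date inputs, and the "- 1" matches the convention in the expected output:
library(lubridate)
months_remaining <- function(from, to) {
  n <- (year(to) - year(from)) * 12 + (month(to) - month(from)) - 1
  ifelse(n < 0, NA_real_, n)  # loans that have already matured get NA
}
months_remaining(as.Date("2013-02-01"), as.Date("2017-01-03"))  # 46, matching Feb13 for A_111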

Finding each time of daily max variable in climate data

I have a large dataset spanning many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; my dateTime didn't come out as hourly data, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus the time) to allow for day isolation; however, most of the things I tried (splitting, xts) only returned a shortened data frame with just date and WS. Aggregate was the only one that didn't do this, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dyplr:
library(dplyr)
library(lubridate) # for hour(), used below
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if there are several hours with the same maximal wind speed (in this example: 15), only the first hour with max(WS) will be shown as the result, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
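If you need just one row per day carrying the timestamp of the maximum, here is a hedged sketch (not from the answer above, and assuming dplyr >= 1.0.0 for slice_max()):
df %>%
  filter(!is.na(WS)) %>%
  group_by(Date = as.Date(dateTime)) %>%
  slice_max(WS, n = 1, with_ties = FALSE) %>%  # keeps only the first row on ties
  ungroup()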
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15

Weighted Moving Average based on Irregular Date Intervals

I am new to time series and was hoping someone could provide some input/ideas here.
I am trying to find ways to impute missing values.
I was hoping to use a moving average, but most of the packages (smooth, mgcv, etc.) don't seem to take the time intervals into consideration.
For example, the dataset might look like something below and I would want value at 2016-01-10 to have the greatest influence in calculating the missing value:
Date Value Diff_Days
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-01-30 50 16
I have instances where NA might be the first observation or the last observation. Sometimes NA values also occur multiple times, at which point the rolling window would need to expand, and this is why I would like to use the moving average.
Is there a package that would take date intervals / separate weights into consideration?
Or please suggest if there is a better way to impute NA values in such cases.
You can use glm() or any other model.
Input
con <- textConnection("Date Value Diff_Days
2015-12-14 NA 0
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-02-14 NA 0
2016-02-18 NA 0
2016-02-29 50 16")
df <- read.table(con, header = T)
df$Date <- as.Date(df$Date)
df$Date.numeric <- as.numeric(df$Date)
fit <- glm(Value ~ Date.numeric, data = df)
df.na <- df[is.na(df$Value),]
predicted <- predict(fit, df.na)
df$Value[is.na(df$Value)] <- predicted
plot(df$Date, df$Value)
points(df.na$Date, predicted, type = "p", col="red")
df$Date.numeric <- NULL
rm(df.na)
print(df)
Output
Date Value Diff_Days
1 2015-12-14 -3.054184 0
2 2016-01-01 10.000000 13
3 2016-01-10 14.000000 4
4 2016-01-14 18.518983 0
5 2016-01-28 30.000000 14
6 2016-02-14 40.092149 0
7 2016-02-18 42.875783 0
8 2016-02-29 50.000000 16
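As another hedged option (not part of the answer above), zoo::na.approx() interpolates linearly against the actual dates, so the nearer observation (2016-01-10 in the question's example) automatically gets more weight; this assumes you start from the data frame before the glm-based fill.
library(zoo)
# linear interpolation in calendar time; rule = 2 fills leading/trailing NAs
# with the nearest observed value
df$Value <- na.approx(df$Value, x = as.numeric(df$Date), na.rm = FALSE, rule = 2)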
