How to convert 3 hourly data into hourly data? [closed] - r

I have data for several stations, some recorded every 3 hours and some every hour. I have been able to sift through the data and separate it into 1-hourly and 3-hourly sets, but I want to convert the 3-hourly datasets into hourly data. I do not need to estimate the values for the hours in between; I can fill those in as missing. What I need is a uniform data structure, because all the other data in the database I am using are already hourly, except for those few stations.
I have included some sample data showing the current hourly dataset, the 3-hourly dataset, and the expected dataset.

Here is my best guess at what you want, solved with R and the tidyverse.
I have read in your data. After row-binding the two data frames, we expand the data to include the missing time points and join back to the original data to get the desired result.
library(tidyverse)
# read in the data
df1 <- readxl::read_excel("df1.xlsx")
df2 <- readxl::read_excel("df2.xlsx")
# fix names of one dataframe
names(df1) <- names(df2)
# create proper timestamps
df <- bind_rows(df1, df2) %>%
  mutate(ts = lubridate::ymd_hm(paste0(year, "-", month, "-", day, " ", hour, ":00")))
# expand timestamps and station
expanded_ts <-
  df %>%
  tidyr::expand(ts, station)
# join for desired result
left_join(expanded_ts, df, by = c("ts", "station"))
## A tibble: 96 x 8
# ts station year month day hour T2 DP
# <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2014-08-01 00:00:00 450070 NA NA NA NA NA NA
# 2 2014-08-01 00:00:00 450110 2014 8 1 0 295 259
# 3 2014-08-01 00:00:00 450320 2014 8 1 0 295 259
# 4 2014-08-01 00:00:00 450390 2014 8 1 0 304 236
# 5 2014-08-01 01:00:00 450070 2014 8 1 1 320 250
# 6 2014-08-01 01:00:00 450110 2014 8 1 1 310 250
# 7 2014-08-01 01:00:00 450320 NA NA NA NA NA NA
# 8 2014-08-01 01:00:00 450390 NA NA NA NA NA NA
# 9 2014-08-01 02:00:00 450070 2014 8 1 2 330 250
#10 2014-08-01 02:00:00 450110 2014 8 1 2 320 250
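Note that `tidyr::expand(ts, station)` only crosses timestamps that already occur somewhere in the data, so an hour that no station reported would still be absent from the result. A minimal base R sketch that builds the hourly grid explicitly with `seq()` follows. The toy values and the choice of a UTC time zone are assumptions made for illustration; the station IDs and the `T2` column name are borrowed from the output above.

```r
# Toy data (hypothetical values): station 450070 reports every 3 hours,
# station 450110 reports every hour.
three_hourly <- data.frame(
  station = 450070,
  ts = as.POSIXct(c("2014-08-01 00:00", "2014-08-01 03:00", "2014-08-01 06:00"), tz = "UTC"),
  T2 = c(295, 304, 320)
)
hourly_obs <- data.frame(
  station = 450110,
  ts = seq(as.POSIXct("2014-08-01 00:00", tz = "UTC"), by = "hour", length.out = 7),
  T2 = 300:306
)
obs <- rbind(three_hourly, hourly_obs)

# Full hourly grid: every hour in the observed range, for every station.
grid <- expand.grid(
  ts = seq(min(obs$ts), max(obs$ts), by = "hour"),
  station = unique(obs$station)
)

# Left join onto the grid: hours without an observation get NA in T2.
hourly <- merge(grid, obs, by = c("ts", "station"), all.x = TRUE)
hourly <- hourly[order(hourly$station, hourly$ts), ]
```

Here each station ends up with seven hourly rows, and the 3-hourly station has `NA` in the four hours it did not report.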

Related

finding all flights that have at least three years of data in R

I am using the flight dataset that is freely available in R.
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")
Now, let's say I want to find all flights that have been flying for at least three consecutive years, so that there are dates available for three years in the date column. Basically, I am only interested in the year part of the date.
I was thinking of the following approach: create a unique list of all plane names, then for each plane get all the dates and see if there are three consecutive years.
I started as follows:
NOyears = 3
planes <- unique(flights$plane)
# at least 3 consecutive years
for (plane in planes){
plane = "N576AA"
allyears <- which(flights$plane == plane)
}
but I am stuck here. This whole approach is starting to look too complicated to me. Is there an easier/faster way, considering that I am working on a very large dataset?
Note: I want to be able to specify the number of years later on, which is why I included NOyears = 3 in the first place.
EDIT:
I have just noticed this question on SO. Very interesting use of diff and cumsum, which are both new to me. Maybe a similar approach is possible here using data.table?
dplyr will do the trick here
library(dplyr)
library(lubridate)
flights %>%
  mutate(year = year(date)) %>%
  group_by(plane) %>%
  summarise(range = max(year) - min(year)) %>%
  filter(range >= 2)
Though I'm not seeing any planes that meet criteria!
Edit: Per mnist's comment, consecutive years are a little more tricky, but here's a working example with consecutive months (the data you supplied only has one year) - just swap out for years!
nMonths = 6
flights %>%
  mutate(month = month(date)) %>% # Calculate month
  count(plane, month) %>% # Summarize to one row for each plane/month combo
  arrange(plane, month) %>% # Arrange by plane, month so we can look at consecutive months
  group_by(plane) %>% # Within each plane...
  mutate(consecutiveMonths = sequence(rle(cumsum(c(TRUE, diff(month) != 1)))$lengths)) %>% # ...number each row by its position within a run of consecutive months (any gap other than 1 starts a new run)
  group_by(plane) %>% # Then, for each plane...
  summarise(maxConsecutiveMonths = max(consecutiveMonths)) %>% # ...return the length of the longest run of consecutive months
  filter(maxConsecutiveMonths >= nMonths) # And keep only those planes that meet criteria!
Here is another option using data.table:
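library(data.table)
setDT(flights) #convert to a data.table in place (needed if flights was read with read_csv)
#summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]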
#add a dummy flight to demonstrate consecutive years
yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight="111", year=2011:2014)))
setkey(yearly, carrier, flight, year)
yearly[, c("rl", "rw") := {
    iscons <- cumsum(c(0L, diff(year)!=1L))
    .(iscons, rowid(carrier, flight, iscons))
}]
yearly[rl %in% yearly[rw>=3L]$rl]
output:
carrier flight year rl rw
1: ZZ 111 2011 5117 1
2: ZZ 111 2012 5117 2
3: ZZ 111 2013 5117 3
4: ZZ 111 2014 5117 4
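The diff/cumsum idea used in these answers can be isolated in a few lines of base R, which may make it easier to see what is going on. This is only a sketch: `longest_consecutive_run` and the toy `flights_toy` data frame are hypothetical names introduced here, while `NOyears` comes from the question.

```r
# Length of the longest run of consecutive integers (e.g. years) in a vector.
longest_consecutive_run <- function(years) {
  years <- sort(unique(years))
  if (length(years) == 0L) return(0L)
  # diff() != 1 marks a break between runs; cumsum() turns the breaks into run ids
  run_id <- cumsum(c(TRUE, diff(years) != 1))
  max(tabulate(run_id))
}

# Hypothetical toy data: plane "A" flies 2011-2013, plane "B" skips 2012.
flights_toy <- data.frame(
  plane = c("A", "A", "A", "B", "B", "B"),
  year  = c(2011, 2012, 2013, 2011, 2013, 2014)
)

NOyears <- 3
runs <- tapply(flights_toy$year, flights_toy$plane, longest_consecutive_run)
planes_ok <- names(runs)[runs >= NOyears]
```

With this toy input, only plane "A" has a run of at least `NOyears` consecutive years.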
Here is a data.table approach (using month, since there is only one year in that file, filtering flights that operated consecutively during 12 months):
library(data.table)
flights <- fread("http://ucl.ac.uk/~uctqiax/data/flights.csv")
flights[, month:=month(date)]
setkey(flights, plane, date)
flights[, max_run:=lapply(.SD, function(x) max(rle(cumsum(c(0, diff(unique(x))) > 1))$lengths)),
.SDcols="month", by="plane"][max_run > 11][]
#> date hour minute dep arr dep_delay arr_delay carrier
#> 1: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 2: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 3: 2011-01-01 12:00:00 NA NA NA NA NA NA XE
#> 4: 2011-01-02 12:00:00 NA NA NA NA NA NA XE
#> 5: 2011-01-02 12:00:00 NA NA NA NA NA NA XE
#> ---
#> 151636: 2011-11-21 12:00:00 10 56 1056 1359 25 37 FL
#> 151637: 2011-12-09 12:00:00 18 36 1836 2126 -5 -4 FL
#> 151638: 2011-12-13 12:00:00 17 27 1727 2013 -3 -7 FL
#> 151639: 2011-12-14 12:00:00 6 28 628 914 -2 -8 FL
#> 151640: 2011-12-14 12:00:00 11 57 1157 1438 -3 -14 FL
#> flight dest plane cancelled time dist month max_run
#> 1: 2174 PNS 1 NA 489 1 12
#> 2: 2277 BRO 1 NA 308 1 12
#> 3: 2811 MOB 1 NA 427 1 12
#> 4: 2204 OKC 1 NA 395 1 12
#> 5: 2570 BTR 1 NA 253 1 12
#> ---
#> 151636: 298 ATL N983AT 0 98 696 11 12
#> 151637: 296 ATL N983AT 0 89 696 12 12
#> 151638: 292 ATL N983AT 0 87 696 12 12
#> 151639: 290 ATL N983AT 0 86 696 12 12
#> 151640: 286 ATL N983AT 0 87 696 12 12
Created on 2020-05-14 by the reprex package (v0.3.0)

R Max of Same Date, Previous Date, and Previous Hour Value

A couple basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4
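For comparison, the same three columns can be built with base R only, using a per-day lookup instead of row-wise `sapply`. This is a sketch under two assumptions that are not in the question: the timestamps are stored as `POSIXct` in UTC, and they fall on exact hours so that subtracting 3600 seconds finds the previous hour.

```r
start <- as.POSIXct(c("2017-01-01 01:00", "2017-01-01 02:00",
                      "2017-01-02 01:00", "2017-01-02 02:00"), tz = "UTC")
values <- c(2, 5, 4, 3)
df <- data.frame(start, values)

day <- as.Date(df$start, tz = "UTC")

# 1) Max of the same day, repeated on every row of that day
df$MaxValueDay <- ave(df$values, as.character(day), FUN = max)

# 2) Max of the previous day, looked up in a per-day table (NA if no such day)
daily_max <- tapply(df$values, as.character(day), max)
df$MaxValueYesterday <- as.vector(daily_max[match(as.character(day - 1), names(daily_max))])

# 3) Value one hour earlier (NA if that timestamp is absent)
secs <- as.numeric(df$start)
df$PreviousHourValue <- df$values[match(secs - 3600, secs)]
```

The per-day table is computed once, so this avoids rescanning the whole column for every row.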

How to add columns for months in a dataframe at specific locations

I have a dataframe that looks like this:
CONTRACT_ID START_DATE SERVICE VALUE year month
1 01-01-2018 A 10 2018 1
2 01-01-2018 B 20 2018 1
3 01-01-2018 C 30 2018 1
4 01-03-2018 B 40 2018 3
5 01-03-2018 C 50 2018 3
6 01-03-2018 A 60 2018 3
And I have converted it to a form like this:
CONTRACT_ID year SERVICE 1 3
1 2018 A 10 NA
2 2018 B 20 NA
3 2018 C 30 NA
4 2018 B NA 40
5 2018 C NA 50
6 2018 A NA 60
Using reshape function like this:
reshape(df, idvar = c("year","CONTRACT_ID","SERVICE"), timevar = "month", direction = "wide")
The problem is that my current dataframe has no data for some of the months, such as 2 (Feb) here. But I would like to add columns for all the missing months, like this:
CONTRACT_ID year SERVICE 1 2 3
1 2018 A 10 NA NA
2 2018 B 20 NA NA
3 2018 C 30 NA NA
4 2018 B NA NA 40
5 2018 C NA NA 50
6 2018 A NA NA 60
How do I achieve that? I know that I can add columns in between and at the end, but that doesn't seem efficient. I am creating a script, and I want it to be efficient and less time-consuming.
EDIT:
As per the suggestion in the comment below, I used the spread function to widen the data.
But if I keep drop = FALSE, the code outputs every combination, which significantly increases the table size. If I set it to TRUE, it doesn't create the combinations, but it also removes the month columns for which I have no data in the current dataset. I want to keep the columns, but not the combinations of CONTRACT_ID, DATE, and SERVICE that don't exist. Initially I was removing those rows in subsequent steps, but the table has now grown substantially, and I need to handle this while spreading the data.
Any suggestions?
Try this.
library(tidyr)
long_data <- read.table(header=TRUE, text='
CONTRACT_ID START_DATE SERVICE VALUE year month
1 01-01-2018 A 10 2018 1
2 01-01-2018 B 20 2018 1
3 01-01-2018 C 30 2018 1
4 01-03-2018 B 40 2018 3
5 01-03-2018 C 50 2018 3
6 01-03-2018 A 60 2018 3
')
long_data
long_data$month <- factor(long_data$month, levels = 1:12, ordered = TRUE)
spread(long_data, key = month, value = VALUE, fill = NA, drop = FALSE)
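If the extra row combinations are a concern, one option is to widen only the rows that exist and then add the missing month columns afterwards. The following is a base R sketch of that idea, not the approach from the answer above; `id_cols`, `all_cols`, and `missing_cols` are hypothetical helper names, and the `VALUE.<month>` column naming is what `reshape()` produces by default.

```r
long_data <- data.frame(
  CONTRACT_ID = 1:6,
  SERVICE = c("A", "B", "C", "B", "C", "A"),
  VALUE = c(10, 20, 30, 40, 50, 60),
  year = 2018,
  month = c(1, 1, 1, 3, 3, 3)
)

id_cols <- c("CONTRACT_ID", "year", "SERVICE")

# Widen: one row per existing id combination, one VALUE.<month> column per observed month
wide <- reshape(long_data, idvar = id_cols, timevar = "month", direction = "wide")

# Add the months that never occur as all-NA columns, then put the columns in order
all_cols <- paste0("VALUE.", 1:12)
missing_cols <- setdiff(all_cols, names(wide))
wide[missing_cols] <- NA_real_
wide <- wide[, c(id_cols, all_cols)]
```

The result keeps the six original rows and has twelve month columns, so no non-existent combinations of the id columns are created.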

Looping over unique values [closed]

I have a data frame in long format, with one observation row per measurement. I want to loop through each unique ID and find the "minimum" date for each unique individual. For example, patient 1 may be measured at three different times, but I want the earliest time. I thought about sorting the dataset by the date (in increasing order) and removing all duplicates, but I'm not sure if this is the best way to go. Any help or suggestions would be greatly appreciated. Thank you!
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), order the rows by 'Date' (assuming it is of class Date; otherwise convert it with as.Date and the correct format), then group by 'ID' and take the first observation of each group with head.
library(data.table)
setDT(df1)[order(Date), head(.SD, 1), by = ID]
Here is another way using base R:
earliestDates = aggregate(list(date = df$date), list(ID = df$ID), min)
result = merge(earliestDates,df)
earliestDates is a two column data frame that has the minimum date by ID. The merge will join the values in the other columns.
Example:
set.seed(1)
ID = floor(runif(20,1,5))
day = as.Date(floor(runif(20,1,25)),origin = "2017-1-1")
weight = floor(runif(20,80,95))
df = data.frame(ID = ID, date = day, weight = weight)
> df
ID date weight
1 2 2017-01-24 92
2 2 2017-01-07 89
3 3 2017-01-17 91
4 4 2017-01-05 88
5 1 2017-01-08 87
6 4 2017-01-11 91
7 4 2017-01-02 80
8 3 2017-01-11 87
9 3 2017-01-22 90
10 1 2017-01-10 90
11 1 2017-01-13 87
12 1 2017-01-16 92
13 3 2017-01-13 86
14 2 2017-01-06 83
15 4 2017-01-21 81
16 2 2017-01-18 81
17 3 2017-01-21 84
18 4 2017-01-04 87
19 2 2017-01-19 89
20 4 2017-01-11 86
After the aggregate and merge, the result is:
> result
ID date weight
1 1 2017-01-08 87
2 2 2017-01-06 83
3 3 2017-01-11 87
4 4 2017-01-02 80
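The sort-and-drop-duplicates idea raised in the question also works and needs no grouping step. A minimal base R sketch, using a small hypothetical data frame with the same column names as the example above:

```r
df <- data.frame(
  ID = c(2, 1, 2, 1, 3),
  date = as.Date(c("2017-01-24", "2017-01-08", "2017-01-06", "2017-01-10", "2017-01-11")),
  weight = c(92, 87, 83, 90, 87)
)

# Sort by ID, then by date, so the earliest row of each ID comes first
sorted <- df[order(df$ID, df$date), ]

# duplicated() flags every occurrence of an ID after its first, so
# negating it keeps exactly the earliest row per ID
earliest <- sorted[!duplicated(sorted$ID), ]
```

Unlike the merge approach, this always returns one row per ID, even when the earliest date appears more than once.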
Try the following dplyr code:
library(dplyr)
set.seed(12345)
###Create test dataset
tb <- tibble(id = rep(1:10, each = 3),
date = rep(seq(as.Date("2017-07-01"), by=10, len=10), 3),
obs = rnorm(30))
# # A tibble: 30 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 1 2017-07-11 0.7094660
# 1 2017-07-21 -0.1093033
# 2 2017-07-31 -0.4534972
# 2 2017-08-10 0.6058875
# 2 2017-08-20 -1.8179560
# 3 2017-08-30 0.6300986
# 3 2017-09-09 -0.2761841
# 3 2017-09-19 -0.2841597
# 4 2017-09-29 -0.9193220
# # ... with 20 more rows
###Pipe the dataset through dplyr's 'group_by' and 'filter' commands
tb %>% group_by(id) %>%
filter(date == min(date)) %>%
ungroup() %>%
distinct()
# # A tibble: 10 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 2 2017-07-31 -0.4534972
# 3 2017-08-30 0.6300986
# 4 2017-07-01 -0.1162478
# 5 2017-07-21 0.3706279
# 6 2017-08-20 0.8168998
# 7 2017-07-01 0.7796219
# 8 2017-07-11 1.4557851
# 9 2017-08-10 -1.5977095
# 10 2017-09-09 0.6203798

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
# dfr is assumed to hold the summarised result from the pipeline above
reshape2::dcast(dfr, ym ~ type, value.var = "mean_value")
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
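For reference, the "12 rows per household, one column per type" layout from the question can also be sketched in base R with `xtabs()`, which sums `value` within each cell. The small data frame below is hypothetical; note that `xtabs()` reports empty cells as 0 rather than NA.

```r
df <- data.frame(
  household = c("household1", "household1", "household1", "household2"),
  date = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12", "1999-02-02")),
  value = c(100, 200, 300, 400),
  type = c("income", "water", "income", "income")
)

# Month as a factor with all 12 levels, so months without data are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# 3-d table of summed values: household x month x type
totals <- xtabs(value ~ household + month + type, data = df)

# Flatten for display: one row per household/month, one column per type
monthly <- ftable(totals, row.vars = c("household", "month"))
```

Printing `monthly` shows 12 rows for each household with one column for each expenditure type.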
