Reading in dates from Excel into R

I have multiple csv files which I need to read into R. The first column of each file contains dates and times, which I convert to POSIXlt once I have loaded the data frame. Each of my csv files has the dates and times formatted in the same way in Excel; however, some files are read in differently.
For example,
My file looks like this once imported:
date value
1 2011/01/01 00:00:00 39
2 2011/01/01 00:15:00 35
3 2011/01/01 00:30:00 38
4 2011/01/01 00:45:00 39
5 2011/01/01 01:00:00 38
6 2011/01/01 01:15:00 38
Therefore, the code I use to amend the format is:
DATA$date <- as.POSIXlt(DATA$date,format="%Y/%m/%d %H:%M:%S")
However, some files are being read in as:
date value
1 01/01/2011 00:00 39
2 01/01/2011 00:15 35
3 01/01/2011 00:30 38
4 01/01/2011 00:45 39
5 01/01/2011 01:00 38
6 01/01/2011 01:15 38
This means the format section of my code does not work and gives an error. Is there any way to automatically detect which format the date column is in? Or is there a way of knowing how it will be read, since the format of the column in Excel is the same in both?

When I use the wrong format string for a date input, I get NA values. If that is also the case for you, you can solve this problem in two steps. First, parse the dates from Excel assuming that all three of hours, minutes, and seconds are present:
date.original <- DATA$date
DATA$date <- as.POSIXlt(DATA$date,format="%Y/%m/%d %H:%M:%S")
This should leave NA values in the date column for those dates which are missing seconds. Then you can fill in only those rows, parsing the matching original strings without seconds:
DATA$date[is.na(DATA$date)] <- as.POSIXlt(date.original[is.na(DATA$date)], format="%Y/%m/%d %H:%M")
This should cover the remaining data.
Data
DATA <- data.frame(date=c('2011/01/01 00:00:00', '2011/01/01 00:15',
'2011/01/01 00:30:00', '2011/01/01 00:45'),
value=c(39, 35, 38, 39))
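The two-step idea above can be generalised to any number of candidate formats. The following is a minimal sketch in base R, not a definitive implementation: `parse_mixed_dates` is a hypothetical helper name, the time zone is assumed to be UTC, and the two formats are the ones shown in the question. It returns POSIXct rather than POSIXlt to keep the subset assignment simple.

```r
# Hypothetical helper: try each candidate format in turn and keep the first
# successful parse for each element. Elements that match no format stay NA.
parse_mixed_dates <- function(x, formats, tz = "UTC") {
  x <- as.character(x)  # also handles factor columns
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(x[todo], format = fmt, tz = tz)
  }
  out
}

# Made-up rows that mix the two layouts from the question, plus one bad value.
DATA <- data.frame(
  date  = c("2011/01/01 00:15:00", "01/01/2011 00:30", "not a date"),
  value = c(35, 38, NA)
)
DATA$date <- parse_mixed_dates(
  DATA$date,
  formats = c("%Y/%m/%d %H:%M:%S", "%d/%m/%Y %H:%M")
)
```

List the most specific format first, because strptime-style parsing ignores trailing characters and a shorter format could otherwise match a longer string.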

Related

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a dataframe (DF1) where each row provides the latitude and longitude of a point at a given time. The variables are therefore the lat and long for the point and the timestamp (a date and time object). In addition, each row is a unique timestamp for a unique point (in other words, no repeats).
I'm trying to match it up to weather data which is contained in a netcdf file. The hourly weather data I have is a 3-dimensional file that includes the lats and longs for grids, timestamps, and the value for that weather variable. I'll call the weather variable u for now.
What I want to end up with: I have a column for the u values in DF1. It starts out with only missing values. In the end I want to replace those missing values in DF1 with the appropriate u value from the netcdf file.
How I've done this so far: I have constructed a for loop that can extract the appropriate u values for the nearest grid to each point in DF1. For example, for lat x and long y in DF1 I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u values with the appropriate data. DF1 is a rather large dataframe and the netcdf is even bigger (DF1 is just a small subset of the full netcdf).
Some pseudo data:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/5/2021 04:00:00"
, '2/1/2021 06:00:00'
, "1/7/2021 01:00:00"
, "2/2/2021 01:00:00"
, "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,-89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 NA
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
Here is an example of the type of for loop I've constructed. I've left out some of the more specific details that aren't relevant:
for (i in seq_len(nrow(DF1))) {
### a number of steps here that identify the closest grid to each point. ####
mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))
# Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
# To make things easier I separate geometry into a lat and a long col
df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
df_u$long <- gsub("c\\(", "", df_u$long)
df_u$lat <- gsub("\\)", "", df_u$lat)
# I find that datatables are a bit easier for these types of tasks, so I set the full timeseries data and DF1 to data table (in reality I set df1 as a data table outside of the for loop)
df_u <- setDT(df_u)
# Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step.
df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
windu_df <- df_u_new[, .(time, lat, long, u)]
}
This works, but given the sheer size of the dataframe/datatables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge is probably the slowest way to do this, but I kept running into issues using match() and inner_join.
Unfortunately I'm not able to give fully reproducible data here, given that I'm working with a netcdf file, but df_u looks something like this for the first iteration:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/1/2021 02:00:00"
, "1/1/2021 03:00:00"
, "1/1/2021 04:00:00"
, "1/1/2021 05:00:00"
, "1/1/2021 06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2.5)
df_u <- data.frame(ID, datetime, lat, long,u)
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/1/2021 02:00:00 34 55 3.6
3 3 1/1/2021 03:00:00 34 55 2.1
4 4 1/1/2021 04:00:00 34 55 5.6
5 5 1/1/2021 05:00:00 34 55 4.4
6 6 1/1/2021 06:00:00 34 55 2.5
Once u is amended in DF1 it should read:
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2 and then retain 2.8 and replace the NA in row 2 with a value.
EDIT: The NETCDF data covers an entire year (so every hour for a year) for a decently large spatial area. It's ERA5 data. In addition, DF1 had thousands of unique lat/long and timestamp observations.
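One direction that may help is to drop the per-row merge entirely and look up all values in one vectorised step. The sketch below uses made-up stand-ins for the netcdf contents, since the real file is not available. The names `grid_lon`, `grid_lat`, `grid_time` and `nearest` are hypothetical, the array is assumed to be ordered [lon, lat, time] on a regular grid, and the datetime strings are assumed to be month/day/year.

```r
# Toy stand-ins for the netcdf axes and values (all numbers are made up).
grid_lon  <- c(50, 55, 60)
grid_lat  <- c(30, 35, 40)
grid_time <- as.POSIXct(c("2021-01-01 01:00:00", "2021-01-01 02:00:00"), tz = "UTC")
data_u    <- array(seq_len(18) / 10, dim = c(3, 3, 2))  # assumed [lon, lat, time]

pts <- data.frame(
  datetime = c("1/1/2021 01:00:00", "1/1/2021 02:00:00", "1/5/2021 04:00:00"),
  lat      = c(34, 36, 36),
  long     = c(55, 50, 50)
)
pts$time <- as.POSIXct(pts$datetime, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Index of the closest grid coordinate for each point, one axis at a time.
# This is only valid when the grid is regular in lat and lon.
nearest <- function(x, grid) {
  vapply(x, function(v) which.min(abs(grid - v)), integer(1))
}

i_lon  <- nearest(pts$long, grid_lon)
i_lat  <- nearest(pts$lat, grid_lat)
i_time <- match(as.numeric(pts$time), as.numeric(grid_time))  # NA if the hour is absent

# A three-column index matrix pulls one array element per row of pts;
# rows containing NA yield NA instead of an error.
pts$u <- data_u[cbind(i_lon, i_lat, i_time)]
```

With this layout no intermediate time series is built for each point, so the cost is three index lookups per row rather than one merge per row.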

Convert sub-hourly data to hourly and round up time in R

I have a very big dataframe in R, containing weather data with the following format.
valid temp
1 17/08/2014 00:20 14
2 17/08/2014 00:50 14
3 17/08/2014 01:20 13.5
4 17/08/2014 01:50 13
5 17/08/2014 02:20 12
6 17/08/2014 02:50 10
I would like to convert these sub-hourly data to hourly, like the following.
valid tmpc
1 2014-08-17 00:00:00 14
2 2014-08-17 01:00:00 13.75
3 2014-08-17 02:00:00 12.5
The class of df$valid is 'factor'. I have tried first converting them to Date through POSIXct, but it gives only NA values. I have also tried changing the system locale and still I get NAs.
We can do this in base R by converting to POSIXlt, setting the minute to 0, converting back to POSIXct, and aggregating to get the mean of 'temp'. Note that this truncates each timestamp down to the start of its hour:
df1$valid <- strptime(df1$valid, "%d/%m/%Y %H:%M")
df1$valid$min <- 0
df1$valid <- as.POSIXct(df1$valid)
aggregate(temp~valid, df1, FUN = mean)
Option 1: The lubridate solution using ceiling_date or round_date. It's not clear from your data frame and results whether you want to round or to take the ceiling. For instance, in the first row you are rounding and in the third using ceiling. Either way, here is an example:
library(lubridate)
df <- data.frame(i = 1, valid= "17/08/2014 01:28", temp = 14)
df$valid <- dmy_hm(df$valid)
df$valid_round <- ceiling_date(df$valid , unit="hours")
Results:
i valid temp valid_round
1 1 2014-08-17 01:28:00 14 2014-08-17 02:00:00
Option 2: using the base functions. Use:
df$valid <- as.POSIXct(strptime(df$valid, "%d/%m/%Y %H:%M", tz ="UTC"))
and then round it.
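The "round it" step can be sketched in base R as follows, using the six rows from the question. Rounding to the nearest hour appears to reproduce the averages in the question's expected output (13.75 and 12.5), although that reading of the question is an assumption. The `hour` column name is introduced here for illustration.

```r
df <- data.frame(
  valid = c("17/08/2014 00:20", "17/08/2014 00:50", "17/08/2014 01:20",
            "17/08/2014 01:50", "17/08/2014 02:20", "17/08/2014 02:50"),
  temp  = c(14, 14, 13.5, 13, 12, 10),
  stringsAsFactors = TRUE  # mimic the factor column described in the question
)

# Factors must go through as.character() before parsing.
df$valid <- as.POSIXct(strptime(as.character(df$valid), "%d/%m/%Y %H:%M", tz = "UTC"))

# round() on a date-time returns POSIXlt, so convert back before aggregating.
df$hour <- as.POSIXct(round(df$valid, "hours"))

hourly <- aggregate(temp ~ hour, data = df, FUN = mean)
```

The last reading (02:50) rounds up to 03:00, so the result has one more row than the three shown in the question.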

Removing multiple data entries based on a total number of entries per day

I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out which days have fewer than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column that sums the values in this "extra" column by the simplified strptime dates, and then apply that sum to every entry from the same date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]
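For the per-row count that the question calls "extra_sum", base R's ave() is one possible route that needs no extra packages. This is a sketch on the nine datetimes shown in the question. The column name `n_per_day` is made up for illustration, and the other columns are omitted for brevity.

```r
dat <- data.frame(
  datetime = c("8/9/2014 13:00", "8/9/2014 17:00", "8/9/2014 23:00",
               "8/10/2014 5:00", "8/10/2014 13:00", "8/10/2014 17:00",
               "8/10/2014 23:00", "8/11/2014 5:00", "8/11/2014 13:00"),
  stringsAsFactors = FALSE
)

# as.Date() ignores the trailing time part when given this format.
dat$date <- as.Date(dat$datetime, format = "%m/%d/%Y")

# ave() applies FUN within each group and returns one value per original row,
# so every row carries the number of entries recorded on its day.
dat$n_per_day <- ave(seq_along(dat$date), as.character(dat$date), FUN = length)

# Keep only the days with at least 3 entries.
dat_kept <- dat[dat$n_per_day >= 3, ]
```

Here 2014-08-11 has only two entries, so its rows are dropped and seven rows remain.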

how to concatenate 2 columns in the same data set in R

I have a data set in the following order:
Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
2 2011/12/22 02:01:00 5819.0 5820.0 5813.0 5817.0 77 57 43 23
3 2011/12/22 02:02:00 5816.5 5820.0 5816.0 5819.0 30 22 9 14
I need to add a column before column A (Date) that will be A+B ("Date" "Time"), and then I will be able to make my dataset an XTS (XTS needs a unique key).
The final result will be something like:
DateTime Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
Thanks
Use paste to combine the Date and Time columns and as.POSIXct to convert the string to date-time class.
Assuming your data frame is called df:
df$DateTime = as.POSIXct(paste(df$Date, df$Time))
After you've added DateTime to your data frame, per @RichardScriven's comment, you can rearrange the order of the columns as follows:
df = df[ , c(length(df), 1:(length(df)-1))]
Or, you can add DateTime as the first column as follows:
df = data.frame(DateTime=as.POSIXct(paste(df$Date, df$Time)), df)
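A small self-contained variant of the same idea, with an explicit format string and time zone so the parse does not depend on defaults. The UTC time zone is an assumption and only three columns of the question's data are reproduced. The final check reflects the question's point that the key should be unique.

```r
df <- data.frame(
  Date  = c("2011/12/22", "2011/12/22", "2011/12/22"),
  Time  = c("02:00:00", "02:01:00", "02:02:00"),
  Close = c(5820.5, 5817.0, 5819.0),
  stringsAsFactors = FALSE
)

# Combine the two text columns and parse them in one step.
df$DateTime <- as.POSIXct(paste(df$Date, df$Time),
                          format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Move DateTime to the front by name rather than by position.
df <- df[, c("DateTime", setdiff(names(df), "DateTime"))]

# Check that every row parsed and that the key has no duplicates.
stopifnot(!anyNA(df$DateTime), anyDuplicated(df$DateTime) == 0)
```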

Parsing dates in multiple formats in R using lubridate

I have data with dates in MM/DD/YY HH:MM format and others in plain old MM/DD/YY format. I want to parse all of them into the same format as "2010-12-01 12:12 EST." How should I go about doing that? I tried the following ifelse statement and it gave me a bunch of long integers and told me a large number of my data points failed to parse:
df_prime$date <- ifelse(!is.na(mdy_hm(df$date)), mdy_hm(df$date), mdy(df$date))
df_prime is a duplicate of the data frame df that I initially loaded in
IEN date admission_number KEY_PTF_45 admission_from discharge_to
1 12 3/3/07 18:05 1 252186 OTHER DIRECT
2 12 3/9/07 12:10 1 252186 RETURN TO COMMUNITY- INDEPENDENT
3 12 3/10/07 15:08 2 252382 OUTPATIENT TREATMENT
4 12 3/14/07 10:26 2 252382 RETURN TO COMMUNITY-INDEPENDENT
5 12 4/24/07 19:45 3 254343 OTHER DIRECT
6 12 4/28/07 11:45 3 254343 RETURN TO COMMUNITY-INDEPENDENT
...
1046334 23613488506 2/25/14 NA NA
1046335 23613488506 2/25/14 11:27 NA NA
1046336 23613488506 2/28/14 NA NA
1046337 23613488506 3/4/14 NA NA
1046338 23613488506 3/10/14 11:30 NA NA
1046339 23613488506 3/10/14 12:32 NA NA
Sorry if some of the formatting isn't right, but the date column is the most important one.
EDIT: Below is some code for a portion of my data frame via a dput command:
structure(list(IEN = c(23613488506, 23613488506, 23613488506, 23613488506, 23613488506, 23613488506), date = c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14", "3/10/14 11:30", "3/10/14 12:32")), .Names = c("IEN", "date"), row.names = 1046334:1046339, class = "data.frame")
Have you tried the function guess_formats() in the lubridate package?
A reproducible example to build a dataframe like yours could be helpful!
The lubridate package's mdy_hm has a truncated parameter that lets you supply dates that might not have all the bits. For your example:
> mdy_hm(d$date,truncated=2)
[1] "2014-02-25 00:00:00 UTC" "2014-02-25 11:27:00 UTC"
[3] "2014-02-28 00:00:00 UTC" "2014-03-04 00:00:00 UTC"
[5] "2014-03-10 11:30:00 UTC" "2014-03-10 12:32:00 UTC"
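The "long integers" reported in the question are most likely a side effect of ifelse(), which returns bare values and drops the date-time class of its arguments. The base R sketch below reproduces that effect on the dput data and shows index-based filling as one alternative that keeps the class. The UTC time zone is an assumption and the variable names are illustrative.

```r
d <- data.frame(
  date = c("2/25/14", "2/25/14 11:27", "2/28/14",
           "3/4/14", "3/10/14 11:30", "3/10/14 12:32"),
  stringsAsFactors = FALSE
)

with_time <- as.POSIXct(d$date, format = "%m/%d/%y %H:%M", tz = "UTC")  # NA when time is missing
date_only <- as.POSIXct(d$date, format = "%m/%d/%y", tz = "UTC")

# ifelse() strips the POSIXct class, leaving seconds since 1970-01-01.
y <- ifelse(!is.na(with_time), with_time, date_only)
class(y)

# Assigning into the NA positions keeps the POSIXct class instead.
parsed <- with_time
parsed[is.na(parsed)] <- date_only[is.na(parsed)]
```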
