R time series data manipulation with different data lengths: extract variables

I need some suggestions on how to better structure my approach to this problem.
I am starting from many CSV files containing the results of a parametric study (time series data). I want to analyze the influence of some parameters on a variable. The idea is to extract selected variables from the table of results for each id of the parametric study and create a data.frame for each variable, to easily make plots and analyses.
The problem is that some parameters change the time step of the parametric study, so some CSV files are much longer. One variable, for example, is temperature. Is it possible to keep the different time steps and still evaluate Delta T when varying one parameter? Can plyr do that? Or do I have to resample part of my results to make this evaluation, losing part of the information?
This is where I have got to at the moment:
head(data, 5)
names Date.Time Tout.dry.bulb RHout TsupIn TsupOut QconvIn[Wm2]
1 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:03:00 0 50 23 15.84257 -1.090683e-14
2 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:06:00 0 50 23 16.66988 0.000000e+00
3 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:09:00 0 50 23 13.83446 1.090683e-14
4 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:12:00 0 50 23 14.34774 2.181366e-14
5 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:15:00 0 50 23 12.59164 2.181366e-14
QconvOut[Wm2] Hvout[Wm2K] Qradout[Wm2] MeanRadTin MeanAirTin MeanOperTin
1 0.0000 17.76 -5.428583e-08 23 23 23
2 -281.3640 17.76 -1.151613e-07 23 23 23
3 -296.0570 17.76 -1.018871e-07 23 23 23
4 -245.7001 17.76 -1.027338e-07 23 23 23
5 -254.8158 17.76 -9.458750e-08 23 23 23
> str(data)
'data.frame': 1858080 obs. of 13 variables:
$ names : Factor w/ 35 levels "G_0-T_0-W_0-P1_0-P2_0",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Date.Time : POSIXct, format: "2005-01-01 00:03:00" "2005-01-01 00:06:00" "2005-01-01 00:09:00" ...
$ Tout.dry.bulb: num 0 0 0 0 0 0 0 0 0 0 ...
$ RHout : num 50 50 50 50 50 50 50 50 50 50 ...
$ TsupIn : num 23 23 23 23 23 23 23 23 23 23 ...
$ TsupOut : num 15.8 16.7 13.8 14.3 12.6 ...
$ QconvIn[Wm2] : num -1.09e-14 0.00 1.09e-14 2.18e-14 2.18e-14 ...
$ QconvOut[Wm2]: num 0 -281 -296 -246 -255 ...
$ Hvout[Wm2K] : num 17.8 17.8 17.8 17.8 17.8 ...
$ Qradout[Wm2] : num -5.43e-08 -1.15e-07 -1.02e-07 -1.03e-07 -9.46e-08 ...
$ MeanRadTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanAirTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanOperTin : num 23 23 23 23 23 23 23 23 23 23 ...
names(DF)
[1] "G_0-T_0-W_0-P1_0-P2_0" "G_0-T_0-W_0-P1_0-P2_1" "G_0-T_0-W_0-P1_0-P2_2"
[4] "G_0-T_0-W_0-P1_0-P2_3" "G_0-T_0-W_0-P1_0-P2_4" "G_0-T_0-W_0-P1_0-P2_5"
[7] "G_0-T_0-W_0-P1_0-P2_6" "G_0-T_0-W_0-P1_1-P2_0" "G_0-T_0-W_0-P1_1-P2_1"
[10] "G_0-T_0-W_0-P1_1-P2_2" "G_0-T_0-W_0-P1_1-P2_3" "G_0-T_0-W_0-P1_1-P2_4"
[13] "G_0-T_0-W_0-P1_1-P2_5" "G_0-T_0-W_0-P1_1-P2_6" "G_0-T_0-W_0-P1_2-P2_0"
[16] "G_0-T_0-W_0-P1_2-P2_1" "G_0-T_0-W_0-P1_2-P2_2" "G_0-T_0-W_0-P1_2-P2_3"
[19] "G_0-T_0-W_0-P1_2-P2_4" "G_0-T_0-W_0-P1_2-P2_5" "G_0-T_0-W_0-P1_2-P2_6"
[22] "G_0-T_0-W_0-P1_3-P2_0" "G_0-T_0-W_0-P1_3-P2_1" "G_0-T_0-W_0-P1_3-P2_2"
[25] "G_0-T_0-W_0-P1_3-P2_3" "G_0-T_0-W_0-P1_3-P2_4" "G_0-T_0-W_0-P1_3-P2_5"
[28] "G_0-T_0-W_0-P1_3-P2_6" "G_0-T_0-W_0-P1_4-P2_0" "G_0-T_0-W_0-P1_4-P2_1"
[31] "G_0-T_0-W_0-P1_4-P2_2" "G_0-T_0-W_0-P1_4-P2_3" "G_0-T_0-W_0-P1_4-P2_4"
[34] "G_0-T_0-W_0-P1_4-P2_5" "G_0-T_0-W_0-P1_4-P2_6"
From P1_4-P2_0 to P1_4-P2_6 the length is 113760 obs instead of 37920, because the time step changes from 3 min to 1 min.
I’d like to have a separate data frame for each variable, containing Date.Time and, in columns, the value of that variable for each of the names above.
How can I do it?
Thanks for any suggestion.

I strongly suggest using a data structure that is appropriate for working with time series. In this case, the zoo package would work well. Load each CSV file into a zoo object, using your Date.Time column to define the index (timestamps) of the data. You can use the zoo() function to create those objects, for example.
Then use the merge function of zoo to combine the objects. It will find observations with the same timestamp and put them into one row. With merge, you can specify all=TRUE to get the union of all timestamps; or you can specify all=FALSE to get the intersection of the timestamps. For the union (all=TRUE), missing observations will be NA.
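For instance, a minimal sketch with two toy series at different resolutions (the objects z3 and z1 and their values are invented purely for illustration):
library(zoo)

# Two toy series: one value every 3 minutes, one every minute
t0 <- as.POSIXct("2005-01-01 00:00:00")
z3 <- zoo(c(23.0, 22.8, 22.5), t0 + c(3, 6, 9) * 60)
z1 <- zoo(21 + 1:9 / 10,       t0 + 1:9 * 60)

merge(z3, z1, all = TRUE)    # union of timestamps: z3 gets NA at the extra 1-min stamps
merge(z3, z1, all = FALSE)   # intersection: only the timestamps both series share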
The read.zoo function can be awkward for data like yours; I suggest reading each file with read.csv and building the zoo object yourself, something like this:
library(zoo)

table <- read.csv(filepath, header = TRUE, stringsAsFactors = FALSE)  # plain data frame
dateStrings <- paste("2005/", table$Date.Time, sep = "")              # prepend the year
dates <- as.POSIXct(dateStrings)                                      # timestamps for the index
dat <- zoo(table[, -1], dates)                                        # zoo series indexed by time
(I assume that Date.Time is the first column in your file. That's why I wrote table[,-1].)
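Scaling that up, here is a minimal sketch of one way to build, per variable, a wide object with one column per parametric id (the vector filepaths, the helper name read_run, and the choice of TsupOut are assumptions for illustration):
library(zoo)

# Hypothetical helper: read one result CSV into a zoo object (as above);
# assumes Date.Time is the first column and lacks the year
read_run <- function(filepath) {
  table <- read.csv(filepath, header = TRUE, stringsAsFactors = FALSE)
  dates <- as.POSIXct(paste("2005/", table$Date.Time, sep = ""))
  zoo(table[, -1], dates)
}

# 'filepaths' is assumed to hold one CSV path per id of the parametric study
runs <- lapply(filepaths, read_run)
names(runs) <- sub("\\.csv$", "", basename(filepaths))

# One wide object for the TsupOut variable: union of all timestamps (all = TRUE),
# so 3-min runs simply show NA at the extra timestamps of the 1-min runs
TsupOut_all <- do.call(merge, c(lapply(runs, function(z) z[, "TsupOut"]), all = TRUE))
Repeating the same pattern for each variable of interest gives the per-variable tables described in the question; as.data.frame(TsupOut_all) converts the result back to a data frame for plotting.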

Related

Date problems plotting time series with ggplot2 from .csv file with individual columns for year month

I'm working on a data analysis project with hydrological modelling data. I've exported the results to .csv format and imported them into R as a data frame (Out_1). Afterwards I selected the variables I need, as you can see below.
library(dplyr)   # select() is assumed to come from dplyr

Out_1 <- read.csv("Outlets_1.csv", header = TRUE)
Out_1s <- select(Out_1, SUB, YEAR, MON, AREAkm2, EVAPcms, FLOW_OUTcms, SED_OUTtons, YYYYMM)
str(Out_1s)
'data.frame': 480 obs. of 8 variables:
$ SUB : int 19 19 19 19 19 19 19 19 19 19 ...
$ YEAR : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
$ MON : int 1 2 3 4 5 6 7 8 9 10 ...
$ AREAkm2 : int 1025 1025 1025 1025 1025 1025 1025 1025 1025 1025 ...
$ EVAPcms : num 0.00601 0.00928 0.01696 0.01764 0.02615 ...
$ FLOW_OUTcms: num 2.31 2.84 3.16 18.49 34.42 ...
$ SED_OUTtons: num 215 308 416 3994 11440 ...
$ YYYYMM : int 198301 198302 198303 198304 198305 198306 198307 198308 198309 198310 ...
typeof(Out_1s$YEAR)
[1] "integer"
typeof(Out_1s$MON)
[1] "integer"
typeof(Out_1s$YYYYMM)
[1] "integer"
What I am trying to do is create graphical summaries with ggplot2, either by combining the Out_1s$YEAR and Out_1s$MON columns or by treating the Out_1s$YYYYMM variable as a YYYY-MM or MM-YYYY date.
Out_1s$Date <- NA
typeof(Out_1s$Date)
[1] "character"
Out_1s$Date <- paste(Out_1s$YEAR,Out_1s$MON, sep = "-")
as.Date.character(Out_1s$Date, "%Y-%m")
graph1 <- ggplot(Out_1s, aes(Date, FLOW_OUTcms ))
graph1 + geom_line()
And the result is not what I expected.
Two problems here.
First, a Date object needs a year, month and day. To fix this, add a "01" to the paste statement:
Out_1s$Date <- paste(Out_1s$YEAR,Out_1s$MON, "01", sep = "-")
In your case, since the date string did not include a day, the as.Date function returned a series of NAs.
Second, you need to reassign the result of as.Date back to the original column:
Out_1s$Date <- as.Date.character(Out_1s$Date, "%Y-%m-%d")
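Putting both fixes together, a minimal sketch (assuming the Out_1s data frame from the question):
library(ggplot2)

# Build a complete year-month-day string, parse it, and assign it back
Out_1s$Date <- as.Date(paste(Out_1s$YEAR, Out_1s$MON, "01", sep = "-"), "%Y-%m-%d")

# With a proper Date column, ggplot2 draws a sensible continuous time axis
ggplot(Out_1s, aes(Date, FLOW_OUTcms)) +
  geom_line()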

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgdal)   # readOGR()
library(rgeos)   # gCentroid()

shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3")  # read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN))      # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1, 2, 3, 4) and first row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the value that is returned. Is it metres, kilometres, etc.?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should work to add the column and row names to your matrix, just as you did when adding the row names to shapefile:
crnames <- as.character(shapefile@data$UN)
colnames(dist_matrix) <- crnames
rownames(dist_matrix) <- crnames
2.
The default distance function in distm is distHaversine, which takes a radius (of the Earth) argument in metres, so the output is in metres.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
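For example, a quick comparison of the two (a sketch, assuming the centroids object created in the question):
library(geosphere)

d_hav <- distm(centroids)                 # default fun = distHaversine; metres
d_geo <- distm(centroids, fun = distGeo)  # WGS84 ellipsoid; also metres
summary(as.vector(abs(d_hav - d_geo)))    # size of the discrepancy between the two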
Edit: the answer to question 1 may be wrong if the matrix data are aggregated; I am looking at alternatives.

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Prior to applying ordered(), this variable does not contain any NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations.
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Or, alternatively, is there any other way to order my observations chronologically?
It is possible that some of your p$date.1 values don't match any of the levels. Try this ord.mon as the levels:
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Last, you can also sort the data frame after transforming the date.1 column into Date (note that you have to prepend an actual day of the month first):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
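Putting it together, a minimal sketch (assuming the p data frame from the question, before date.1 was converted):
# Full chronological set of levels, then the ordering itself
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
stopifnot(all(p$date.1 %in% ord.mon))            # any mismatch would stop here
p$date.1 <- ordered(p$date.1, levels = ord.mon)
summary(is.na(p$date.1))                         # no TRUE values expected now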

Losing time value in datetime stamp when import data from Access into R

I am importing data with a DateTime stamp from Access into R and keep 'losing' my time values. I had a similar issue a while back (posted right here) and had to convert the times to a number before importing. While this was not too difficult, it is a step I would like to avoid. This post is also helpful and suggests the reason might be the large number of records. I am currently trying to import over 110k records.
As an FYI this post is very helpful for info on dealing with times in R, but did not provide a specific solution for this issue.
My Access (2013) table has a UTC and a local time field, both of which store the date and the time in a single field.
I used the following code to read in the table and look at the head.
library(RODBC)

DataConnect <- odbcConnect("MstrMUP")               # DSN for the Access database
Temp <- sqlFetch(DataConnect, "TempData_3Nov2014")  # pull the whole table
head(Temp)
IndID UTCDateTime LocalDateTime Temp
1 MTG_030_A 2013-02-08 2013-02-08 25
2 MTG_030_A 2013-02-08 2013-02-08 26
3 MTG_030_A 2013-02-08 2013-02-08 31
4 MTG_030_A 2013-02-08 2013-02-08 29
5 MTG_030_A 2013-02-09 2013-02-08 39
6 MTG_030_A 2013-02-09 2013-02-08 44
As you can see, the time portion of the DateTime stamp is missing, and I cannot seem to locate it using str or as.numeric, both of which suggest that the time value is not stored (at least that is how I read it).
> str(Temp)
'data.frame': 110382 obs. of 4 variables:
$ IndID : Factor w/ 17 levels "BHS_034_A","BHS_035_A",..: 13 13 13 13 13 13 13 13 13 13 ...
$ UTCDateTime : POSIXct, format: "2013-02-08" "2013-02-08" ...
$ LocalDateTime: POSIXct, format: "2013-02-08" "2013-02-08" ...
$ Temp : int 25 26 31 29 39 44 42 49 42 38 ...
> head(as.numeric(MTG30$LocalDateTime))
[1] 1360306800 1360306800 1360306800 1360306800 1360306800 1360306800
Because all numeric values are the same, they must all be the same date, and do not include time. Correct...?
The Question:
Is this an R issue or Access? Any suggestions on how to import 110k rows of data from Access into R without losing the time portion of a DateTime stamp would be appreciated.
I am sure there is a better method than my earlier workaround.
Oh, I almost forgot: I am running the "Sock it to Me" version of R.
EDIT/ADDITION: in response to @Richard Scriven's thoughts on unclass.
Unfortunately, no, there is no sec, min, or hour value; all are 0.
> temp <- Temp[1:5,]
> unclass(as.POSIXlt(temp$UTCDateTime))
$sec
[1] 0 0 0 0 0
$min
[1] 0 0 0 0 0
$hour
[1] 0 0 0 0 0
$mday
[1] 8 8 8 8 9
$mon
[1] 1 1 1 1 1
$year
[1] 113 113 113 113 113
$wday
[1] 5 5 5 5 6
$yday
[1] 38 38 38 38 39
$isdst
[1] 0 0 0 0 0
$zone
[1] "MST" "MST" "MST" "MST" "MST"
$gmtoff
[1] -25200 -25200 -25200 -25200 -25200
attr(,"tzone")
[1] "" "MST" "MDT"
Thanks in advance.
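One thing that might be worth trying (an assumption on my part, not something verified against this database): ask RODBC to return the columns untyped with as.is = TRUE, so the datetime fields come back as character strings, and do the parsing in R:
library(RODBC)

DataConnect <- odbcConnect("MstrMUP")
# as.is = TRUE skips RODBC's type conversion and returns character columns
Temp <- sqlFetch(DataConnect, "TempData_3Nov2014", as.is = TRUE)

# Parse the stamps in R; the format string is a guess at what the driver returns
Temp$UTCDateTime   <- as.POSIXct(Temp$UTCDateTime,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
Temp$LocalDateTime <- as.POSIXct(Temp$LocalDateTime, format = "%Y-%m-%d %H:%M:%S")
Temp$Temp          <- as.integer(Temp$Temp)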

Plot time(x axis) and time of day & duration(y axis) of episodes

I am measuring the duration of an event, and I would like to plot the duration and the time of day at which the event takes place on each observation day.
My dataset is the following
> str(question.stack)
'data.frame': 398 obs. of 6 variables:
$ id : Factor w/ 1 level "AA11": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.recording : Factor w/ 1 level "8/15/2007": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.of.episode: Factor w/ 111 levels "1/1/2009","1/11/2009",..: 86 86 86 87 88 90 90 96 96 103 ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : Factor w/ 383 levels "0:06:01","0:17:12",..: 324 15 18 179 269 320 379 281 287 298 ...
$ duration : num 278 14 1324 18 428 ...
I would like episode.day on the x axis. The y axis should go from 00:00 to 23:59:59 (start.time). For example, for the second entry of the dataset, I would like a black bar from (x=12, y=10:55:12) to (x=12, y=11:09:12), denoting a 14 minute episode on day 12. An episode can span more than one day.
Is this possible with R? If possible, please only base R solutions.
Something similar is Plot dates on the x axis and time on the y axis with ggplot2, but not exactly what I am looking for.
Many thanks
OK, I finally found it.
On the x axis I wanted to plot dates, either as POSIXct or as the number of the recording day (integer). On the y axis I wanted the time of day, so that the graph shows a dark bar on each day (x axis) spanning the times (y axis) during which the episode takes place.
R can plot POSIXct values, but in my case the episode start and end times (for the y axis) need to be date-"less".
I did it like this:
#Cleaning the Dataset
qs<-question.stack
qs$id<-as.character(qs$id)
qs$begin.recording<-as.character(qs$begin.recording)
qs$begin.of.episode<-as.character(qs$begin.of.episode)
qs$start.time<-as.character(qs$start.time)
qs$start<-as.character(paste(qs$begin.of.episode,qs$start.time))
qs$duration<-round(qs$duration,0)
#Convert time and dates to POSIXct
qs$start<-as.POSIXct(qs$start,format="%m/%d/%Y %H:%M:%S",tz="UTC")
qs$start<-round(qs$start,"mins")
qs$end<-as.POSIXct(qs$start+qs$duration*60)
qs$start<-as.POSIXct(qs$start)
Now we have
str(qs)
'data.frame': 398 obs. of 8 variables:
$ id : chr "AA11" "AA11" "AA11" "AA11" ...
$ begin.recording : chr "8/15/2007" "8/15/2007" "8/15/2007" "8/15/2007" ...
$ begin.of.episode: chr "8/27/2007" "8/27/2007" "8/27/2007" "8/28/2007" ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : chr "6:15:12" "10:55:12" "11:15:12" "18:19:12" ...
$ duration : num 278 14 1324 18 428 ...
$ start : POSIXct, format: "2007-08-27 06:15:00" "2007-08-27 10:55:00" ...
$ end : POSIXct, format: "2007-08-27 10:53:00" "2007-08-27 11:09:00" ...
The following makes a vector that includes every minute during which there was an episode. One can fine-tune it to seconds or coarsen it to hours.
tmp<-do.call(c, apply(qs, 1, function(x) seq(from=as.POSIXct(x[7]), to=as.POSIXct(x[8]),by="mins")))
The following makes a data frame. Converting the time of day from POSIXct to a date-"less" character string and back to POSIXct guarantees that all values of time.of.day share the same (current) date. Perhaps one could also do it with the origin argument.
ep <- data.frame(sqs=tmp, date=as.Date(tmp,"%Y-%m-%d"),time.of.day=as.POSIXct(as.character(format(tmp,"%H:%M")),format="%H:%M"))
Plot
plot(ep$date, ep$time.of.day,pch=".")
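If a nicer y axis is wanted, a small refinement (still base R, assuming the ep data frame built above) is to suppress the default axis and label it with hours of the day:
plot(ep$date, ep$time.of.day, pch = ".", yaxt = "n",
     xlab = "date", ylab = "time of day")
hours <- as.POSIXct(sprintf("%02d:00", seq(0, 23, by = 4)), format = "%H:%M")
axis(2, at = hours, labels = format(hours, "%H:%M"), las = 1)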
