Date problems plotting time series with ggplot2 from .csv file with individual columns for year month - r

I'm working on a data analysis project for hydrological modelling data. I've exported the results to .csv format and read them into R as a data frame (Out_1). Afterwards I selected the variables I need, as you can see below.
library(dplyr) # select() comes from dplyr
Out_1 <- read.csv("Outlets_1.csv", header = TRUE)
Out_1s <- select(Out_1, SUB, YEAR, MON, AREAkm2, EVAPcms, FLOW_OUTcms, SED_OUTtons, YYYYMM)
str(Out_1s)
'data.frame': 480 obs. of 8 variables:
$ SUB : int 19 19 19 19 19 19 19 19 19 19 ...
$ YEAR : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
$ MON : int 1 2 3 4 5 6 7 8 9 10 ...
$ AREAkm2 : int 1025 1025 1025 1025 1025 1025 1025 1025 1025 1025 ...
$ EVAPcms : num 0.00601 0.00928 0.01696 0.01764 0.02615 ...
$ FLOW_OUTcms: num 2.31 2.84 3.16 18.49 34.42 ...
$ SED_OUTtons: num 215 308 416 3994 11440 ...
$ YYYYMM : int 198301 198302 198303 198304 198305 198306 198307 198308 198309 198310 ...
typeof(Out_1s$YEAR)
[1] "integer"
typeof(Out_1s$MON)
[1] "integer"
typeof(Out_1s$YYYYMM)
[1] "integer"
What I am trying to do is create graphical summaries with ggplot2, either by combining the Out_1s$YEAR and Out_1s$MON columns or by parsing the Out_1s$YYYYMM variable as YYYY-MM or MM-YYYY.
Out_1s$Date <- NA
typeof(Out_1s$Date)
[1] "character"
Out_1s$Date <- paste(Out_1s$YEAR,Out_1s$MON, sep = "-")
as.Date.character(Out_1s$Date, "%Y-%m")
graph1 <- ggplot(Out_1s, aes(Date, FLOW_OUTcms ))
graph1 + geom_line()
And the result is not what I expected.

There are two problems here.
First, a Date object needs a year, a month and a day. To fix this, add a "01" for the day to the paste() call:
Out_1s$Date <- paste(Out_1s$YEAR, Out_1s$MON, "01", sep = "-")
In your case, since the date did not include a day, as.Date() returned a series of NAs.
Second, you need to assign the result of as.Date() back to the original column; the bare call in the question computes the dates but throws them away:
Out_1s$Date <- as.Date(Out_1s$Date, "%Y-%m-%d")
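Putting both fixes together, a minimal sketch of the full pipeline (assuming the dplyr and ggplot2 packages; the file name is taken from the question):
library(dplyr)
library(ggplot2)
Out_1 <- read.csv("Outlets_1.csv", header = TRUE)
Out_1s <- select(Out_1, SUB, YEAR, MON, FLOW_OUTcms)
# Supply a dummy day so as.Date() can parse a complete date
Out_1s$Date <- as.Date(paste(Out_1s$YEAR, Out_1s$MON, "01", sep = "-"), "%Y-%m-%d")
ggplot(Out_1s, aes(Date, FLOW_OUTcms)) + geom_line()
With Date stored as a real Date, ggplot2 treats the x axis as continuous time and labels it with sensible date breaks.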

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgdal) # readOGR() lives in rgdal
library(rgeos) # gCentroid() lives in rgeos
shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3") # Read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN)) # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1, 2, 3, 4) and row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the value that is returned. Is it metres, kilometres, etc.?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should work to add the column and row names to your matrix, just as you did when adding the row names to shapefile:
crnames <- as.character(shapefile@data$UN)
colnames(dist_matrix) <- crnames
rownames(dist_matrix) <- crnames
2.
The default distance function in distm is distHaversine, which takes a radius (of the Earth) argument in metres, so I assume the output is in metres.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
Edit: the answer to Q1 may be wrong, since the matrix data may be aggregated; looking at alternatives.
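For question 3, a quick way to see the difference is to compare the two functions on a single pair of points (a hedged sketch using the LON/LAT values shown in the str output above):
library(geosphere)
p1 <- c(2.63, 28.2)   # lon/lat of one centroid
p2 <- c(47.4, 40.4)   # lon/lat of another
distHaversine(p1, p2) # great-circle distance on a sphere, in metres
distGeo(p1, p2)       # distance on the WGS84 ellipsoid, in metres
distGeo works on the ellipsoid and is generally the more accurate of the two; both return metres by default.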

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Prior to applying ordered(), this variable does not contain any NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations.
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Or, alternatively, is there any other possible solution for grouping my observations chronologically?
It is possible that some of your p$date.1 values don't match any of the levels you supplied; ordered() silently turns such values into NA. Try this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then you can check whether there is any mismatch between the two:
p$date.1 %in% ord.mon
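To list the offending values directly, a small sketch (run on the original factor, before calling ordered()):
# Values of date.1 that are absent from ord.mon; these are exactly
# the observations that become NA after the call to ordered()
setdiff(as.character(p$date.1), ord.mon)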
Last, you can also sort the data frame after transforming the date.1 column into Date (note that you have to prepend an actual day of the month beforehand):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]

R. Handling dates and wide format from an imported Stata file

I have been given a Stata data file (counts.dta) that contains daily counts for the years 1975 to 2006 stored in wide format. The columns are labelled month (the full name of the month as a character string), day (numeric, with values 1-31), and then the years from 1975 to 2006, labelled '_1975', '_1976', ..., '_2006'. I assume the underscore is a consequence of something in Stata. Dummy counts of zero (0) are inserted for the date 29 February when the year column is not a leap year.
I want to do several things. First, convert to long form with a sensible representation for year. Second, change the tri-partite representation of the date to something more sensible.
My approach has been to change the character string month to a factor and then to put its levels into calendar order:
require("foreign")
counts <- read.dta(file='counts.dta')
counts[['month']] <- as.factor(counts[['month']])
# Reorder the alphabetically sorted levels into calendar order
counts[['month']] <-
  factor(counts[['month']], levels(counts[['month']])[c(5, 4, 8, 1, 9, 7, 6, 2, 12, 11, 10, 3)])
I then have
str( counts )
'data.frame': 366 obs. of 34 variables:
$ month: Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ _1975: int 515 649 745 599 445 667 725 749 646 740 ...
$ _1976: int 485 685 529 467 630 723 712 685 715 504 ...
$ _1977: int 505 437 489 588 634 734 682 537 453 673 ...
and so forth. Converting to long format
lcounts <- reshape(counts,
direction="long",
varying=list(names( counts )[3:34]),
v.names="n.counts",
idvar=c("month","day"),
timevar="Year",
times=1975:2006)
str( lcounts )
gives
'data.frame': 11712 obs. of 4 variables:
$ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 1975 1975 1975 1975 1975 1975 1975 1975 1975 1975 ...
$ n.counts: int 515 649 745 599 445 667 725 749 646 740 ...
plus some further lines relating to the original Stata file.
My questions are: (1) What is now a good way to convert the factor-month, numeric-year and numeric-day to a useful date format, so that I can determine, for example, the day of the week, the interval between two dates and so on? (2) Was there a better way to have tackled the problem from the start?
This should be pretty easy, because all you have to do is paste the month, day and Year columns together and use as.Date to create a Date class vector.
Let's start with some data similar to yours:
dat <- data.frame(month = c(rep("January",31), rep("February",29)),
day = c(1:31, 1:29),
Year = 1975,
n.counts = 515)
Then the creation of the date variable is simple. One caveat: as.numeric() on a factor returns the level index, not the calendar month (here the levels sort alphabetically, so "January" would map to 2), so match the labels against month.name instead:
dat$Date <- as.Date(with(dat, paste(match(month, month.name), day, Year)), "%m %d %Y")
str(dat)
# 'data.frame': 60 obs. of 5 variables:
# $ month : Factor w/ 2 levels "February","January": 2 2 2 2 2 2 2 2 2 2 ...
# $ day : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Year : num 1975 1975 1975 1975 1975 ...
# $ n.counts: num 515 515 515 515 515 515 515 515 515 515 ...
# $ Date : Date, format: "1975-01-01" "1975-01-02" "1975-01-03" "1975-01-04" # ...
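Once Date exists, the follow-up parts of question (1) are one-liners in base R, for example:
weekdays(dat$Date) # day of the week for each observation
difftime(dat$Date[60], dat$Date[1], units = "days") # interval between two dates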
The main focus in this thread is naturally what to do in R after data import, but here I bundle together various details on the Stata side of this.
It is longstanding advice that data of this kind are much more easily handled in Stata in a long shape and reshape long is a standard command to do that conversion for data arriving with each year's data in a separate variable (R users: please read "column" as a translation). So, if possible, you should ask a provider of such Stata files to do that before export.
What the OP calls labels such as _1975 are legal variable names in Stata, and as the OP guesses the underscore is needed because variable names in Stata may not start with numeric characters.
On the information given, it would have been possible to export the data without loss from Stata in file formats other than .dta, notably as the usual kinds of text files (.csv, etc.).
Stata's preferred way of holding daily dates is as integers with origin 0 = 1 January 1960 (so 26 March 2015 would be 20173), which presumably is trivially easy to convert to any date representation in R.
In short, the particular and indeed peculiar form of the data as presented to the OP is in no sense either required by any Stata syntax or even recommended as part of good Stata practice.
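For completeness, converting those integer daily dates in R really is trivial; a one-line sketch using the example value mentioned above:
# Stata counts days from 1960-01-01; pass that as the origin
as.Date(20173, origin = "1960-01-01") # "2015-03-26"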

R time series data manipulation with different data lengths: extract variable

I need some suggestions on how to better design the solution to my problem.
I am starting from many CSV files of results from a parametric study (time series data). I want to analyze the influence of some parameters on a variable. The idea is to extract some variables from the table of results for each id of the parametric study and create a data.frame for each variable, to easily make some plots and analyses.
The problem is that some parameters change the time step of the parametric study, so some CSV files are much longer. One variable, for example, is temperature. Is it possible to keep the different time steps and still evaluate Delta T while varying one parameter? Can plyr do that? Or do I have to resample part of my results, losing part of the information, to make this evaluation?
I have got to this point at the moment:
head(data, 5)
names Date.Time Tout.dry.bulb RHout TsupIn TsupOut QconvIn[Wm2]
1 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:03:00 0 50 23 15.84257 -1.090683e-14
2 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:06:00 0 50 23 16.66988 0.000000e+00
3 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:09:00 0 50 23 13.83446 1.090683e-14
4 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:12:00 0 50 23 14.34774 2.181366e-14
5 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:15:00 0 50 23 12.59164 2.181366e-14
QconvOut[Wm2] Hvout[Wm2K] Qradout[Wm2] MeanRadTin MeanAirTin MeanOperTin
1 0.0000 17.76 -5.428583e-08 23 23 23
2 -281.3640 17.76 -1.151613e-07 23 23 23
3 -296.0570 17.76 -1.018871e-07 23 23 23
4 -245.7001 17.76 -1.027338e-07 23 23 23
5 -254.8158 17.76 -9.458750e-08 23 23 23
> str(data)
'data.frame': 1858080 obs. of 13 variables:
$ names : Factor w/ 35 levels "G_0-T_0-W_0-P1_0-P2_0",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Date.Time : POSIXct, format: "2005-01-01 00:03:00" "2005-01-01 00:06:00" "2005-01-01 00:09:00" ...
$ Tout.dry.bulb: num 0 0 0 0 0 0 0 0 0 0 ...
$ RHout : num 50 50 50 50 50 50 50 50 50 50 ...
$ TsupIn : num 23 23 23 23 23 23 23 23 23 23 ...
$ TsupOut : num 15.8 16.7 13.8 14.3 12.6 ...
$ QconvIn[Wm2] : num -1.09e-14 0.00 1.09e-14 2.18e-14 2.18e-14 ...
$ QconvOut[Wm2]: num 0 -281 -296 -246 -255 ...
$ Hvout[Wm2K] : num 17.8 17.8 17.8 17.8 17.8 ...
$ Qradout[Wm2] : num -5.43e-08 -1.15e-07 -1.02e-07 -1.03e-07 -9.46e-08 ...
$ MeanRadTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanAirTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanOperTin : num 23 23 23 23 23 23 23 23 23 23 ...
names(DF)
[1] "G_0-T_0-W_0-P1_0-P2_0" "G_0-T_0-W_0-P1_0-P2_1" "G_0-T_0-W_0-P1_0-P2_2"
[4] "G_0-T_0-W_0-P1_0-P2_3" "G_0-T_0-W_0-P1_0-P2_4" "G_0-T_0-W_0-P1_0-P2_5"
[7] "G_0-T_0-W_0-P1_0-P2_6" "G_0-T_0-W_0-P1_1-P2_0" "G_0-T_0-W_0-P1_1-P2_1"
[10] "G_0-T_0-W_0-P1_1-P2_2" "G_0-T_0-W_0-P1_1-P2_3" "G_0-T_0-W_0-P1_1-P2_4"
[13] "G_0-T_0-W_0-P1_1-P2_5" "G_0-T_0-W_0-P1_1-P2_6" "G_0-T_0-W_0-P1_2-P2_0"
[16] "G_0-T_0-W_0-P1_2-P2_1" "G_0-T_0-W_0-P1_2-P2_2" "G_0-T_0-W_0-P1_2-P2_3"
[19] "G_0-T_0-W_0-P1_2-P2_4" "G_0-T_0-W_0-P1_2-P2_5" "G_0-T_0-W_0-P1_2-P2_6"
[22] "G_0-T_0-W_0-P1_3-P2_0" "G_0-T_0-W_0-P1_3-P2_1" "G_0-T_0-W_0-P1_3-P2_2"
[25] "G_0-T_0-W_0-P1_3-P2_3" "G_0-T_0-W_0-P1_3-P2_4" "G_0-T_0-W_0-P1_3-P2_5"
[28] "G_0-T_0-W_0-P1_3-P2_6" "G_0-T_0-W_0-P1_4-P2_0" "G_0-T_0-W_0-P1_4-P2_1"
[31] "G_0-T_0-W_0-P1_4-P2_2" "G_0-T_0-W_0-P1_4-P2_3" "G_0-T_0-W_0-P1_4-P2_4"
[34] "G_0-T_0-W_0-P1_4-P2_5" "G_0-T_0-W_0-P1_4-P2_6"
From P1_4-P2_0 to P1_4-P2_6 the length is 113760 obs instead of 37920, because the time step changes from 3 min to 1 min.
I'd like to have a separate data frame for each variable, in which I have Date.Time and the value of that variable for each of the names as a column.
How can I do it?
Thanks for any suggestions.
I strongly suggest using a data structure that is appropriate for working with time series. In this case, the zoo package would work well. Load each CSV file into a zoo object, using your Date.Time column to define the index (timestamps) of the data. You can use the zoo() function to create those objects, for example.
Then use the merge function of zoo to combine the objects. It will find observations with the same timestamp and put them into one row. With merge, you can specify all=TRUE to get the union of all timestamps; or you can specify all=FALSE to get the intersection of the timestamps. For the union (all=TRUE), missing observations will be NA.
The read.zoo function could be difficult to use for reading your data. I suggest reading the CSV and building the zoo object in two steps, something like this:
library(zoo)
table <- read.csv(filepath, header = TRUE, stringsAsFactors = FALSE)
dateStrings <- paste("2005/", table$Date.Time, sep = "")
dates <- as.POSIXct(dateStrings)
dat <- zoo(table[, -1], dates)
(I assume that Date.Time is the first column in your file. That's why I wrote table[,-1].)
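As a minimal sketch of the merge step (synthetic data, not the OP's files), here are two series with 3-minute and 1-minute time steps combined on the union of their timestamps:
library(zoo)
t3 <- seq(as.POSIXct("2005-01-01 00:03:00"), by = "3 min", length.out = 5)
t1 <- seq(as.POSIXct("2005-01-01 00:01:00"), by = "1 min", length.out = 15)
z3 <- zoo(rnorm(5), t3)   # e.g. TsupOut from a 3-minute run
z1 <- zoo(rnorm(15), t1)  # e.g. TsupOut from a 1-minute run
merged <- merge(z3, z1, all = TRUE) # union of timestamps; gaps become NA
head(merged)
With all = FALSE the merge keeps only the timestamps present in both series, which is one way to compare runs without resampling.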

Plot time(x axis) and time of day & duration(y axis) of episodes

I am measuring the duration of an event, and I would like to plot the duration, and time the event takes place in each observation day.
My dataset is the following
> str(question.stack)
'data.frame': 398 obs. of 6 variables:
$ id : Factor w/ 1 level "AA11": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.recording : Factor w/ 1 level "8/15/2007": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.of.episode: Factor w/ 111 levels "1/1/2009","1/11/2009",..: 86 86 86 87 88 90 90 96 96 103 ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : Factor w/ 383 levels "0:06:01","0:17:12",..: 324 15 18 179 269 320 379 281 287 298 ...
$ duration : num 278 14 1324 18 428 ...
I would like episode.day on the x axis. The y axis should run from 00:00:00 to 23:59:59 (start.time). For example, for the second entry of the dataset, I would like a black bar starting at (x=12, y=10:55:12) and ending at (x=12, y=11:09:12), denoting a 14-minute episode on day 12. An episode can span more than one day.
Is this possible with R? If possible, please suggest only base R solutions.
Something similar is Plot dates on the x axis and time on the y axis with ggplot2, but it is not exactly what I am looking for.
Many thanks
OK, I finally found it.
On the x axis I wanted to plot dates, either as POSIXct or as the number of the recording day (integer). On the y axis I wanted the time of day, so that the graph shows a dark bar on each day (x axis) between the times (y axis) that the episode takes place.
R can plot POSIXct, but in my case the episode start and end times (for the y axis) need to be date-"less".
I did it like this:
# Cleaning the dataset
qs <- question.stack
qs$id <- as.character(qs$id)
qs$begin.recording <- as.character(qs$begin.recording)
qs$begin.of.episode <- as.character(qs$begin.of.episode)
qs$start.time <- as.character(qs$start.time)
qs$start <- paste(qs$begin.of.episode, qs$start.time)
qs$duration <- round(qs$duration, 0)
# Convert times and dates to POSIXct
qs$start <- as.POSIXct(qs$start, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
qs$start <- round(qs$start, "mins") # round() returns POSIXlt here
qs$end <- as.POSIXct(qs$start + qs$duration * 60)
qs$start <- as.POSIXct(qs$start)    # back to POSIXct
Now we have
str(qs)
'data.frame': 398 obs. of 8 variables:
$ id : chr "AA11" "AA11" "AA11" "AA11" ...
$ begin.recording : chr "8/15/2007" "8/15/2007" "8/15/2007" "8/15/2007" ...
$ begin.of.episode: chr "8/27/2007" "8/27/2007" "8/27/2007" "8/28/2007" ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : chr "6:15:12" "10:55:12" "11:15:12" "18:19:12" ...
$ duration : num 278 14 1324 18 428 ...
$ start : POSIXct, format: "2007-08-27 06:15:00" "2007-08-27 10:55:00" ...
$ end : POSIXct, format: "2007-08-27 10:53:00" "2007-08-27 11:09:00" ...
The following makes a vector that contains every minute during which there was an episode. One can fine-tune it to seconds or coarsen it to hours.
# Columns 7 and 8 of qs are start and end; build a per-minute sequence per episode
tmp <- do.call(c, apply(qs, 1, function(x) seq(from = as.POSIXct(x[7]), to = as.POSIXct(x[8]), by = "mins")))
The following makes a data frame. Switching the time of day from POSIXct to a date-"less" string and then back to POSIXct guarantees that every entry of time.of.day carries the same (current) date. Perhaps one could also do it with the origin argument.
ep <- data.frame(sqs = tmp,
                 date = as.Date(tmp, "%Y-%m-%d"),
                 time.of.day = as.POSIXct(format(tmp, "%H:%M"), format = "%H:%M"))
Plot
plot(ep$date, ep$time.of.day,pch=".")
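To make the y axis readable as clock times, one can suppress the default axis and draw the labels by hand (a sketch, still base R only, assuming ep from above):
plot(ep$date, ep$time.of.day, pch = ".", yaxt = "n",
     xlab = "date", ylab = "time of day")
# Ticks every 6 hours; built the same way as time.of.day, so they
# carry today's date and line up with the plotted values
ticks <- seq(as.POSIXct("00:00", format = "%H:%M"), by = "6 hours", length.out = 5)
axis.POSIXct(2, at = ticks, format = "%H:%M")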
