R. Handling dates and wide format from an imported Stata file

I have been given a Stata data file (counts.dta) that contains daily counts for the years 1975 to 2006 stored in wide format. The columns are labelled month (full name of the month as a character string), day (numeric with values 1-31), and then the years from 1975 to 2006 with labels '_1975', '_1976' ... '_2006'. I assume that the underscore is a consequence of something in Stata. There are dummy counts of zero (0) inserted for the date 29 February when the year-column is not a leap year.
I want to do several things. First, convert to long form with a sensible representation for year. Second, change the tri-partite representation of the date to something more sensible.
My approach has been to change the character string month to a factor and then to get it into the correct order:
require("foreign")
counts <- read.dta(file='counts.dta')
counts[['month']] <- as.factor( counts[['month']] )
counts[['month']] <-
factor(counts[['month']], levels( counts[['month']] )[c(5,4,8,1,9,7,6,2,12,11,10,3)])
I then have
str( counts )
'data.frame': 366 obs. of 34 variables:
$ month: Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ _1975: int 515 649 745 599 445 667 725 749 646 740 ...
$ _1976: int 485 685 529 467 630 723 712 685 715 504 ...
$ _1977: int 505 437 489 588 634 734 682 537 453 673 ...
and so forth. Converting to long format
lcounts <- reshape(counts,
direction="long",
varying=list(names( counts )[3:34]),
v.names="n.counts",
idvar=c("month","day"),
timevar="Year",
times=1975:2006)
str( lcounts )
gives
'data.frame': 11712 obs. of 4 variables:
$ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 1975 1975 1975 1975 1975 1975 1975 1975 1975 1975 ...
$ n.counts: int 515 649 745 599 445 667 725 749 646 740 ...
plus some further lines relating to the original Stata file.
My questions are: (1) what is now a good way to convert the factor month, numeric year and numeric day to a useful date format, so that I can determine, for example, the day of the week, the interval between two dates and so on? (2) Was there a better way to have tackled the problem from the start?

This should be pretty easy because all you have to do is paste together the month, day and year columns of your data.frame and use as.Date to create a Date class vector.
Let's start with some data similar to yours:
dat <- data.frame(month = c(rep("January",31), rep("February",29)),
day = c(1:31, 1:29),
Year = 1975,
n.counts = 515)
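Then create the date variable. Match the month names against month.name rather than using as.numeric(month), which would return the alphabetical factor level codes (February = 1, January = 2) instead of calendar month numbers:
dat$Date <- as.Date(with(dat, paste(match(as.character(month), month.name), day, Year)), "%m %d %Y")
str(dat)
# 'data.frame': 60 obs. of 5 variables:
# $ month : Factor w/ 2 levels "February","January": 2 2 2 2 2 2 2 2 2 2 ...
# $ day : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Year : num 1975 1975 1975 1975 1975 ...
# $ n.counts: num 515 515 515 515 515 515 515 515 515 515 ...
# $ Date : Date, format: "1975-01-01" "1975-01-02" "1975-01-03" "1975-01-04" ...
# (the row for 29 February 1975, a date that does not exist, gets Date = NA)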
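To cover the day-of-week and interval parts of the question, here is a minimal sketch on a few made-up rows (the counts and the names `wday` and `gap` are illustrative, not from the original data). It assumes English month names and relies on as.Date returning NA for a date that does not exist, which is one way to drop the dummy zero counts for 29 February:

```r
# Toy rows around the end of February 1975 (not a leap year)
dat <- data.frame(month    = c("February", "February", "March"),
                  day      = c(28, 29, 1),
                  Year     = 1975,
                  n.counts = c(10, 0, 12))

# Calendar month number from the month name, then build the Date
dat$Date <- as.Date(paste(match(as.character(dat$month), month.name),
                          dat$day, dat$Year), "%m %d %Y")

# 29 February 1975 does not exist, so its Date is NA; drop that dummy row
dat <- dat[!is.na(dat$Date), ]

dat$wday <- weekdays(dat$Date)   # day name in the current locale
gap <- diff(dat$Date)            # interval between consecutive dates, in days
```

Once the column has class Date, differences and comparisons between dates work with the ordinary arithmetic operators.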

The main focus in this thread is naturally what to do in R after data import, but here I bundle together various details on the Stata side of this.
It is longstanding advice that data of this kind are much more easily handled in Stata in a long shape and reshape long is a standard command to do that conversion for data arriving with each year's data in a separate variable (R users: please read "column" as a translation). So, if possible, you should ask a provider of such Stata files to do that before export.
What the OP calls labels such as _1975 are legal variable names in Stata, and as the OP guesses the underscore is needed because variable names in Stata may not start with numeric characters.
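On the R side, the underscore prefix can be stripped when the year is needed as a number. A minimal sketch, assuming the year columns follow the `_1975` pattern described in the question (the vector `wide_names` is a made-up stand-in for names(counts)):

```r
# Stand-in for names(counts); only the naming pattern matters here
wide_names <- c("month", "day", "_1975", "_1976", "_2006")

# Pick the columns that look like "_YYYY" and turn them into integer years
year_cols <- grep("^_[0-9]{4}$", wide_names, value = TRUE)
years <- as.integer(sub("^_", "", year_cols))
```

Deriving the years from the names in this way avoids hard-coding 1975:2006 and the column positions 3:34 when reshaping.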
On the information given, it would have been possible to export the data without loss from Stata in file formats other than .dta, notably as the usual kinds of text files (.csv, etc.).
Stata's preferred way of holding daily dates is as integers with origin 0 = 1 January 1960 (so 26 March 2015 would be 20173), which presumably is trivially easy to convert to any date representation in R.
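For R users, a minimal sketch of that conversion, assuming the Stata daily dates arrive as plain integers (the name `stata_days` is illustrative):

```r
# Stata daily dates count days from 1 January 1960
stata_days <- c(0, 20173)
r_dates <- as.Date(stata_days, origin = "1960-01-01")

# And back again: days elapsed since the Stata origin
back <- as.numeric(r_dates - as.Date("1960-01-01"))
```

The second element of `r_dates` is 26 March 2015, matching the example above.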
In short, the particular and indeed peculiar form of the data as presented to the OP is in no sense either required by any Stata syntax or even recommended as part of good Stata practice.

Related

Date problems plotting time series with ggplot2 from .csv file with individual columns for year month

I'm working on a data analysis project for hydrological modelling data. I've exported the results to .csv format and imported them into R as a data frame (Out_1). Afterwards I selected the variables I need, as you can see below.
Out_1 <- read.csv("Outlets_1.csv",header = TRUE)
Out_1s <- select(Out_1,SUB,YEAR,MON,AREAkm2,EVAPcms,FLOW_OUTcms,SED_OUTtons,YYYYMM)
str(Out_1s)
'data.frame': 480 obs. of 8 variables:
$ SUB : int 19 19 19 19 19 19 19 19 19 19 ...
$ YEAR : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
$ MON : int 1 2 3 4 5 6 7 8 9 10 ...
$ AREAkm2 : int 1025 1025 1025 1025 1025 1025 1025 1025 1025 1025 ...
$ EVAPcms : num 0.00601 0.00928 0.01696 0.01764 0.02615 ...
$ FLOW_OUTcms: num 2.31 2.84 3.16 18.49 34.42 ...
$ SED_OUTtons: num 215 308 416 3994 11440 ...
$ YYYYMM : int 198301 198302 198303 198304 198305 198306 198307 198308 198309 198310 ...
typeof(Out_1s$YEAR)
[1] "integer"
typeof(Out_1s$MON)
[1] "integer"
typeof(Out_1s$YYYYMM)
[1] "integer"
What I am trying to do is create graphical summaries with ggplot2, based on either combining the Out_1s$YEAR and Out_1s$MON columns or getting R to treat the Out_1s$YYYYMM variable as YYYY-MM or MM-YYYY.
Out_1s$Date <- NA
typeof(Out_1s$Date)
[1] "character"
Out_1s$Date <- paste(Out_1s$YEAR,Out_1s$MON, sep = "-")
as.Date.character(Out_1s$Date, "%Y-%m")
graph1 <- ggplot(Out_1s, aes(Date, FLOW_OUTcms ))
graph1 + geom_line()
The result is not what I expected.
Two problems here.
First, a Date object needs a year, a month and a day. Since your string did not include a day, as.Date cannot parse it and returns a series of NAs. To fix this, add a "01" to the paste statement:
Out_1s$Date <- paste(Out_1s$YEAR, Out_1s$MON, "01", sep = "-")
Second, you need to assign the result of as.Date back to the original column; otherwise Date stays a character vector:
Out_1s$Date <- as.Date(Out_1s$Date, "%Y-%m-%d")
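Both routes mentioned in the question can be sketched on a small made-up data frame (the values below are illustrative, not the real model output). The sketch assumes each month should be represented by its first day:

```r
# Toy stand-in for Out_1s with the same column names
Out_1s <- data.frame(YEAR        = c(1983L, 1983L, 1984L),
                     MON         = c(1L, 12L, 2L),
                     YYYYMM      = c(198301L, 198312L, 198402L),
                     FLOW_OUTcms = c(2.31, 2.84, 3.16))

# Route 1: combine YEAR and MON, pinning each month to day 01
Out_1s$Date <- as.Date(paste(Out_1s$YEAR, Out_1s$MON, "01", sep = "-"),
                       "%Y-%m-%d")

# Route 2: append "01" to YYYYMM and parse it without separators
Date2 <- as.Date(paste0(Out_1s$YYYYMM, "01"), "%Y%m%d")
```

Either result is a genuine Date column, so ggplot(Out_1s, aes(Date, FLOW_OUTcms)) + geom_line() should then draw a continuous time axis.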

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Prior to applying ordered(), this variable does not contain any NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations.
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Or alternatively, is there any other possible solution to group my observations chronologically?
It is possible that some values of p$date.1 do not exactly match any of the levels; values without a matching level become NA. Try using this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Lastly, you can also sort the data frame after converting the date.1 column to Date (note that you have to add a day of the month first):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
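The check and the fix can be sketched together on a toy data frame. The trailing space in one value is an assumption made purely to reproduce the symptom; the real cause in the original data is not known:

```r
# All month/year labels in chronological order, Jan/2010 ... Dec/2015
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))

# Toy data; "Feb/2011 " has a stray trailing space (hypothetical cause)
p <- data.frame(date.1 = c("Jan/2011", "Jun/2010", "Feb/2011 ", "Dec/2010"),
                stringsAsFactors = TRUE)

# Values that would turn into NA because they match no level
bad <- setdiff(as.character(p$date.1), ord.mon)

# Clean the strings, then build the ordered factor
p$date.clean <- factor(trimws(as.character(p$date.1)),
                       levels = ord.mon, ordered = TRUE)
p <- p[order(p$date.clean), ]
```

Inspecting `bad` shows exactly which strings need cleaning before ordered() is applied.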

TM - Clustering data with special date variable

I've got the following data from TripAdvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data on the basis of the date, to get two groups: winter and summer vacationers. With this clustering I want to analyse the reviews afterwards. I am using the tm package and tried it with the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it is not working, as the content says 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I have to adapt this code to get, for example just the reviews from Nov - Feb?
Do you have any idea how I can solve this problem?
Thank you.
Generic Approach:
Use substr(date, 10, nchar(date)) to strip the leading "Reviewed " and get to "1 August 2014"; call this new vector dateNew.
Use a normal date function, e.g. as.Date(dateNew, ...), to change dateNew into a vector of class Date on which you can do subsetting, subtraction and other operations.
References from http://www.statmethods.net/input/dates.html
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
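Applied to the review dates, the generic approach might look like the sketch below. The three strings are made up in the format shown in the question, the names `review_date` and `is_winter` are illustrative, and %B assumes the session locale uses English month names:

```r
# Made-up values in the same shape as x$date
date_raw <- c("Reviewed 1 August 2014\n",
              "Reviewed 15 December 2013\n",
              "Reviewed 3 February 2014\n")

# Strip the "Reviewed " prefix and the trailing newline
dateNew <- trimws(sub("^Reviewed ", "", date_raw))

# Parse day, full month name and year
review_date <- as.Date(dateNew, format = "%d %B %Y")

# Winter reviews: November through February
month_num <- as.integer(format(review_date, "%m"))
is_winter <- month_num %in% c(11, 12, 1, 2)
```

A logical vector like `is_winter` can then index the corpus, e.g. corp[is_winter], in place of the comparison with 'December'.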

Combining Two Rows with Different Levels according to Some Conditions into One in R

This is a part of my data: (The actual data contains about 10,000 observations with about 500 levels of SalesItem)
s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
s2<-c(155,153,154,150,176,165,159,143,179,150)
S<-data.frame(SalesItem=factor(s1), Sales=s2)
> str(S)
'data.frame': 10 obs. of 2 variables:
$ SalesItem: Factor w/ 10 levels "1008","1009",..: 1 2 3 4 5 6 7 8 9 10
$ Sales : num 155 153 154 150 176 165 159 143 179 150
What I want to do is merge levels of SalesItem whenever diff(SalesItem) = 1. For example, the difference between SalesItem 1008 and 1009 equals one, so I want to rename SalesItem 1009 to 1008. Later I can then compute the sum of Sales for this SalesItem as one. Because my actual data has about 10,000 observations, it is quite hard for me to do this one by one.
Is there a simple way for me to do that?
Clearly the fact that you have converted the first column to a factor indicates that you might need those factors somewhere. So I would suggest that, instead of changing any of the columns, you add a third column to your data frame that holds the SalesItem each row should be grouped under. Here are the steps:
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> s3 = ifelse((s1-1) %in% s1, s1-1, s1)
> S <- data.frame(SalesItem=s1, Sales=s2, ItemId=s3)
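Then you can just aggregate on the basis of the ItemId column. Note that in a run of more than two consecutive items (for example 1016 to 1019) this maps each item to its immediate predecessor rather than to the first item of the run.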
This is not a terribly efficient solution, but since your data only contains 10000 records, it is not going to be a big problem.
Set up provided example data, but convert the SalesItem field to an integer so that the diff() operation makes sense.
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> S<-data.frame(SalesItem=s1, Sales=s2)
Reorder data frame so that the SalesItem field is in ascending order (not necessary for current data set, but required for solution) then find the differences.
> S = S[order(S$SalesItem),]
> d = c(0, diff(S$SalesItem))
Duplicate the SalesItem data and then filter based on the values of the differences.
> labels = S$SalesItem
> for (n in 1:nrow(S)) {if (d[n] == 1) labels[n] = labels[n-1]}
> S$labels = labels
The (temporary) labels field now has the required new values for the SalesItem field. Once you are happy that this is doing the right thing, you can modify the last line in the above code to simply overwrite the existing SalesItem field.
> S
SalesItem Sales labels
1 1008 155 1008
2 1009 153 1008
3 1012 154 1012
4 1013 150 1012
5 1016 176 1016
6 1017 165 1016
7 1018 159 1016
8 1019 143 1016
9 1054 179 1054
10 1055 150 1054
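The same labels can also be produced without an explicit loop. The following is a sketch of an alternative technique (a cumsum over breaks in the sequence), not the method used in the answers above; the names `run_id` and `totals` are illustrative:

```r
s1 <- c(1008, 1009, 1012, 1013, 1016, 1017, 1018, 1019, 1054, 1055)
s2 <- c(155, 153, 154, 150, 176, 165, 159, 143, 179, 150)
S <- data.frame(SalesItem = s1, Sales = s2)
S <- S[order(S$SalesItem), ]

# A new run starts wherever the step from the previous item is not 1
run_id <- cumsum(c(TRUE, diff(S$SalesItem) != 1))

# Label every row with the smallest SalesItem in its run
S$labels <- ave(S$SalesItem, run_id, FUN = min)

# Total sales per merged item
totals <- aggregate(Sales ~ labels, data = S, FUN = sum)
```

With the example data, `totals` has one row per run of consecutive items, so 1016 to 1019 are summed together under 1016.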

Plot time(x axis) and time of day & duration(y axis) of episodes

I am measuring the duration of an event, and I would like to plot the duration and the time of day at which the event takes place on each observation day.
My dataset is the following
> str(question.stack)
'data.frame': 398 obs. of 6 variables:
$ id : Factor w/ 1 level "AA11": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.recording : Factor w/ 1 level "8/15/2007": 1 1 1 1 1 1 1 1 1 1 ...
$ begin.of.episode: Factor w/ 111 levels "1/1/2009","1/11/2009",..: 86 86 86 87 88 90 90 96 96 103 ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : Factor w/ 383 levels "0:06:01","0:17:12",..: 324 15 18 179 269 320 379 281 287 298 ...
$ duration : num 278 14 1324 18 428 ...
I would like episode.day on the x axis. The y axis should go from 00:00 to 23:59:59 (start.time). For example, for the second entry of the dataset, I would like a black bar starting at (x=12, y=10:55:12) and ending at (x=12, y=11:09:12), denoting a 14 minute episode on day 12. An episode can span more than one day.
Is this possible with R? If possible, please give only base R solutions.
Something similar is Plot dates on the x axis and time on the y axis with ggplot2 but not exactly what I am looking.
Many thanks
Ok I finally found it.
On the x axis I wanted to plot dates, either as POSIXct or as the day number of the recording (integer). On the y axis I wanted the time of day, so that the graph would show a dark bar on each day (x axis) spanning the times (y axis) during which the episode takes place.
R can plot POSIX date-times, but in my case the episode start and end times (for the y axis) should be date-"less".
I did it like this:
#Cleaning the Dataset
qs<-question.stack
qs$id<-as.character(qs$id)
qs$begin.recording<-as.character(qs$begin.recording)
qs$begin.of.episode<-as.character(qs$begin.of.episode)
qs$start.time<-as.character(qs$start.time)
qs$start<-as.character(paste(qs$begin.of.episode,qs$start.time))
qs$duration<-round(qs$duration,0)
#Convert time and dates to POSIXct
qs$start<-as.POSIXct(qs$start,format="%m/%d/%Y %H:%M:%S",tz="UTC")
qs$start<-round(qs$start,"mins")
qs$end<-as.POSIXct(qs$start+qs$duration*60)
qs$start<-as.POSIXct(qs$start)
Now we have
str(qs)
'data.frame': 398 obs. of 8 variables:
$ id : chr "AA11" "AA11" "AA11" "AA11" ...
$ begin.recording : chr "8/15/2007" "8/15/2007" "8/15/2007" "8/15/2007" ...
$ begin.of.episode: chr "8/27/2007" "8/27/2007" "8/27/2007" "8/28/2007" ...
$ episode.day : int 12 12 12 13 14 15 15 17 17 18 ...
$ start.time : chr "6:15:12" "10:55:12" "11:15:12" "18:19:12" ...
$ duration : num 278 14 1324 18 428 ...
$ start : POSIXct, format: "2007-08-27 06:15:00" "2007-08-27 10:55:00" ...
$ end : POSIXct, format: "2007-08-27 10:53:00" "2007-08-27 11:09:00" ...
The following makes a vector that includes every minute during which there was an episode. One can refine it to seconds or coarsen it to hours.
tmp<-do.call(c, apply(qs, 1, function(x) seq(from=as.POSIXct(x[7]), to=as.POSIXct(x[8]),by="mins")))
The following makes a data frame. Switching the time of day from POSIX to a date-"less" string and then back to POSIX guarantees that all values of time.of.day share the same date. Perhaps one can also do it with the origin argument.
ep <- data.frame(sqs=tmp, date=as.Date(tmp,"%Y-%m-%d"),time.of.day=as.POSIXct(as.character(format(tmp,"%H:%M")),format="%H:%M"))
Plot
plot(ep$date, ep$time.of.day,pch=".")
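A variant of the same idea that avoids the round trip through character strings is sketched below. It is an alternative under stated assumptions, not the code used above: two made-up episodes, all times in UTC, and the time of day expressed as decimal hours (the names `secs` and `hour` are illustrative):

```r
# Two made-up episodes; the second one runs past midnight
qs <- data.frame(
  start    = as.POSIXct(c("2007-08-27 10:55:00", "2007-08-27 23:50:00"),
                        tz = "UTC"),
  duration = c(14, 30)   # minutes
)
qs$end <- qs$start + qs$duration * 60

# Seconds since the epoch for every minute covered by each episode
secs <- unlist(lapply(seq_len(nrow(qs)), function(i)
  seq(as.numeric(qs$start[i]), as.numeric(qs$end[i]), by = 60)))

# Split each instant into a calendar day and a date-less time of day
ep <- data.frame(
  date = as.Date(secs %/% 86400, origin = "1970-01-01"),
  hour = (secs %% 86400) / 3600
)

plot(ep$date, ep$hour, pch = ".", ylim = c(0, 24),
     xlab = "Date", ylab = "Time of day (hours, UTC)")
```

Because the split is done per minute, an episode that crosses midnight automatically continues at the bottom of the next day's column.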
