R boxplot: How do I "reorder" the date field?

The table consists of a Month/Year field, e.g. "January 2016".
How do I reorder a boxplot so that the x axis is displayed in date order (Jan 2016, Feb 2016, ...)? This is the code I tried:
boxplot(YR$S~reorder(format(YR$MY,'%M %Y'),YR$MY),outline =FALSE)
IDX MY Day V Time G S W
24 January 2015 1 G 1821 6 11 71
25 January 2015 2 G 1600 9 15 1
26 January 2015 5 G 1700 5 14 64
27 January 2015 6 F 1805 3 14 4
28 January 2015 7 G 1716 3 15 45
29 January 2015 9 F 1910 3 8 38

As I mentioned above, the recommendation depends on the format of the data and on how you want to bin it (e.g. monthly or daily). Below are a few approaches I would consider (they may not be the best way, but they get the job done):
#Sample data
string<-rep(c("January 2016", "February 2016", "March 2016"), 3)
day<-rep(c(1:3), each=3)
value<-runif(9,10, 20)
#data frame with string, int and float
df<-data.frame(string, day, value)
#Date as string
boxplot(df$value~df$string, las=2, main="String")
#undesirable - x axis not in date order
#Date as a Date Class
#convert to Date Class
#xdate<-as.Date(paste(df$string, day), format= "%B %Y %d")
#Need to convert everything to first of month to bin by month
xdate<-as.Date(paste(df$string, 1), format= "%B %Y %d")
b<-boxplot(df$value~xdate, las=2, main="Date", names=unique(months(xdate)))
#Good - may need work on x axis labels
#Date as a factor
#convert to factor
xfactor<-as.factor(df$string)
#set the factor levels in calendar order (assumes all data are from 2016)
xfactor<-factor(xfactor, levels = paste(month.name, "2016"))
#remove unused levels
xfactor<-droplevels(xfactor)
boxplot(df$value~xfactor, las=2, main="factor")
#Good - may need work on x axis labels depending on the timeframe of interest
All three approaches have their pros and cons. Which one is best depends on the initial format, the amount of data, the reporting frequency and the result you want.
Hope this helps.
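If the data span more than one year, a locale-independent variant of the factor approach is to build a sortable key from the month name and the year, and to order the factor levels by that key. This is only a minimal sketch, not the code from the answer above: the vectors my and s are made-up stand-ins for YR$MY and YR$S.

```r
# Made-up stand-ins for YR$MY (month-year strings) and YR$S (values)
my <- c("March 2016", "January 2016", "February 2016",
        "January 2016", "December 2015", "March 2016")
s  <- c(14, 11, 15, 8, 12, 9)

# Split "January 2016" into a month name and a year
parts <- strsplit(my, " ")
mon   <- match(sapply(parts, `[`, 1), month.name)  # 1..12, independent of locale
yr    <- as.integer(sapply(parts, `[`, 2))

# A single number that increases with time, also across years
key <- yr * 12 + mon

# Factor whose levels follow date order rather than alphabetical order
my_f <- factor(my, levels = unique(my[order(key)]))

boxplot(s ~ my_f, las = 2, outline = FALSE)
```

Because the month is looked up in month.name, this only works for English month names written out in full.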

Convert your dates to class Date, so that boxplot orders the groups chronologically instead of alphabetically:
y <- YR$S
oldloc <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "english")
x <- as.Date(with(YR, paste(MY, Day, sep = "-")), format = "%B-%Y-%d")
Sys.setlocale("LC_TIME", oldloc)
boxplot(y~x)
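I set the locale to English so that R can parse month names such as "January" even when the session uses another language (in German, for example, it would expect "Januar"). You can omit that step if your locale is already English. Note that the locale name "english" works on Windows; on Linux or macOS use something like "C" or "en_US.UTF-8" instead.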
Data used:
YR <- read.table(header=T, text="
MY Day V Time G S W
February-2015 1 G 1821 6 11 71
January-2015 2 G 1600 9 15 1
January-2015 5 G 1700 5 14 64
January-2015 6 F 1805 3 14 4
January-2015 7 G 1716 3 15 45
January-2015 9 F 1910 3 8 38")

Thanks one and all for your responses.
Turns out that there is a much shorter and simpler solution. The library "Rlab" has a built-in boxplot and binning function called "bplot". Here's a code example (MY = month-year field, S = number of sunspots):
library(Rlab)
bplot(MY,S)

Related

Time series - Convert every column of dataframe to time series

I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO =zoo(test[, c("abc1", "def2", "xyz3")], order.by=as.Date(paste0(test$month, "01"), format="%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
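On the original question of naming objects dynamically inside a loop: a common alternative is to keep the series in a named list instead of creating ~50 separate variables. The following is a minimal base R sketch under that assumption. The name ts_list is made up for illustration, and the data frame is rebuilt from the four rows shown in the question.

```r
# Same shape as the data frame in the question (values copied from it)
df <- data.frame(month = c(201201, 201202, 201203, 201204),
                 abc1  = c(1, 2, 4, 6),
                 def2  = c(2, 5, 11, 23),
                 xyz3  = c(4, 7, 4, 40))

# One monthly ts object per column (all columns except month),
# collected in a named list
ts_list <- lapply(df[-1], ts, start = c(2012, 1), frequency = 12)
names(ts_list) <- paste0(names(ts_list), "_ts")

ts_list$abc1_ts   # access a single series by name

# decompose() needs at least two full periods (24 monthly values),
# so it is left commented out for this four-row example:
# decomposed <- lapply(ts_list, decompose)
```

The end argument from the question is left out here, because ts() derives the end from start, frequency and the number of observations.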

R: assign months to day of the year

Here's my data, which has 10 years in one column and the day of the year (1 to 365) in the second column:
dat <- data.frame(year = rep(1980:1989, each = 365), doy= rep(1:365, times = 10))
I am assuming all years are non-leap years i.e. they have 365 days.
I want to create another column month which is basically month of the year the day belongs to.
library(dplyr)
dat %>%
mutate(month = as.integer(ceiling(doy/31)))
However, this solution is wrong because it assigns the wrong month to many days. I am looking for a solution, ideally using dplyr.
We can convert it to a date-time class by using the appropriate format (i.e. %Y %j) and then extract the month with format:
dat$month <- with(dat, format(strptime(paste(year, doy), format = "%Y %j"), '%m'))
Or use $mon to extract the month and add 1
dat$month <- with(dat, strptime(paste(year, doy), format = "%Y %j")$mon + 1)
tail(dat$month)
#[1] 12 12 12 12 12 12
This should give you an integer value for the months (month() is assumed here to come from the lubridate package):
library(lubridate)
dat$month.num <- month(as.Date(paste(dat$year, dat$doy), '%Y %j'))
If you want the month names:
dat$month.names <- month.name[month(as.Date(paste(dat$year, dat$doy), '%Y %j'))]
The result (only showing a few rows):
> dat[29:33,]
year doy month.num month.names
29 1980 29 1 January
30 1980 30 1 January
31 1980 31 1 January
32 1980 32 2 February
33 1980 33 2 February
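One caveat with the approaches above: they parse the day of the year together with the real year, and 1980, 1984 and 1988 are leap years, so day 60 of those years is treated as 29 February rather than 1 March. If every year really should be treated as a 365-day year, as the question assumes, one option is to map the day of the year onto a fixed non-leap reference year. This is a minimal sketch; the reference year 2001 is an arbitrary choice made for illustration.

```r
dat <- data.frame(year = rep(1980:1989, each = 365),
                  doy  = rep(1:365, times = 10))

# 2001 is an arbitrary non-leap reference year (an assumption of this sketch):
# day 1 maps to 2001-01-01 and day 365 to 2001-12-31
ref <- as.Date(dat$doy - 1, origin = "2001-01-01")
dat$month <- as.integer(format(ref, "%m"))

# Days 59 and 60 fall in February and March in every year, including 1980
dat[dat$year == 1980 & dat$doy %in% c(59, 60), ]
```

The same expression can be used inside dplyr's mutate(), since it only depends on the doy column.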

month.abb[] is resulting in incorrect results

I have the following data set. I am trying to split the date_1 field into month and day, and then convert the month number to a month name.
date_1,no_of_births_1
1/1,1482
2/2,1213
3/23,1220
4/4,1319
5/11,1262
6/18,1271
I am using month.abb[] to convert the month number to a name. But instead of giving the right month name for each month number, the result is wrong.
For example, month.abb[2] gives Apr instead of Feb:
date_1 no_of_births_1 V1 V2 month
1 1/1 1482 1 1 Jan
2 2/2 1213 2 2 Apr
3 3/23 1220 3 23 May
4 4/4 1319 4 4 Jun
5 5/11 1262 5 11 Jul
6 6/18 1271 6 18 Aug
Below is the code I am using:
birthday<-read.csv("Birthday_s.csv",header = TRUE)
birthday$date_1<-as.character(birthday$date_1)
#split the data
listx<-sapply(birthday$date_1,function(x) strsplit(x,"/"))
library(base)
#convert to data frame
mat<-as.data.frame(matrix(unlist(listx),ncol = 2, byrow = TRUE))
#combine birthday and mat
birthday2<-cbind(birthday,mat)
#convert month number to month name
birthday2$month<-sapply(birthday2$V1, function(x) month.abb[as.numeric(x)])
When I run your code, I get the correct months. However, your code is more complicated than necessary. Here are two ways to extract month and day from date_1:
First, when you read the data, use stringsAsFactors=FALSE, which prevents strings from getting converted to factors.
birthday <- read.csv("Birthday_s.csv",header = TRUE, stringsAsFactors=FALSE)
Extract month and days using date functions:
library(lubridate)
birthday$month = month(as.POSIXct(birthday$date_1, format="%m/%d"), abbr=TRUE, label=TRUE)
birthday$day = day(as.POSIXct(birthday$date_1, format="%m/%d"))
Extract month and days using Regular Expressions:
birthday$month = month.abb[as.numeric(gsub("([0-9]{1,2}).*", "\\1", birthday$date_1))]
birthday$day = as.numeric(gsub(".*/([0-9]{1,2}$)", "\\1", birthday$date_1))
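One plausible explanation for the symptom in the question, offered here as an assumption since the answer above could not reproduce it: in R versions before 4.0, as.data.frame() turned character columns into factors by default, and as.numeric() on a factor returns the internal level codes, not the numbers printed in the labels. The small sketch below uses a made-up factor f to show the effect.

```r
# Made-up month numbers stored as a factor, as older R versions would do
f <- factor(c("1", "2", "3", "10", "11"))

levels(f)       # sorted as text: "1" "10" "11" "2" "3"
as.numeric(f)   # level codes 1 4 5 2 3, not the month numbers

wrong <- month.abb[as.numeric(f)]                 # "2" is looked up as month 4
right <- month.abb[as.numeric(as.character(f))]   # convert to text first

wrong
right
```

Reading the data with stringsAsFactors=FALSE, as suggested above, avoids the factor conversion in the first place.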

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
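As a small illustration of why this fixes the ordering problem from the question, the sketch below compares the two encodings. The two dates are made up for the example.

```r
# 13 February and 7 February 2016 as proper Date objects
d <- as.Date(c("2016-02-13", "2016-02-07"))

as.numeric(d)   # number of days since 1970-01-01
d[1] > d[2]     # TRUE: 13 February comes after 7 February
2.13 > 2.7      # FALSE: the decimal encoding gets the order wrong

sort(d)         # chronological order
```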

Choose specific date with strptime in r

I have a text file dataset with headers
YEAR MONTH DAY value
which runs hourly from 1/6/2010 to 14/7/2012. I open and plot the data with the following commands:
data=read.table('example.txt',header=T)
time = strptime(paste(data$DAY,data$MONTH,data$YEAR,sep="-"), format="%d-%m-%Y")
plot(time,data$value)
However, when the data are plotted, the x axis only shows 2011 and 2012. How can I keep the year information in the labels but also add ticks for specific months, e.g. March, June and September?
I have made the data available on this link
https://dl.dropbox.com/u/107215263/example.txt
You need to use the function axis.POSIXct to format and place your date labels as you wish:
plot(time,data$value,xaxt="n") #Skip the x-axis here
axis.POSIXct(1, at=pretty(time), format="%B %Y")
To see all possible formats, see ?strptime.
You can of course play with parameter at to place your ticks wherever you want, for instance:
axis.POSIXct(1, at=seq(time[1],time[length(time)],"3 months"),
format="%B %Y")
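To put ticks only at specific months such as March, June and September, one option is to generate all month starts in the data range and keep the ones you want. This is a minimal sketch with made-up data: the vectors time and value below are stand-ins for the columns read from example.txt, and the UTC time zone is an assumption.

```r
# Made-up daily data covering the period from the question
set.seed(1)
time  <- seq(as.POSIXct("2010-06-01", tz = "UTC"),
             as.POSIXct("2012-07-14", tz = "UTC"), by = "day")
value <- cumsum(rnorm(length(time)))

# First day of every month in the range
month_starts <- seq(as.POSIXct("2010-06-01", tz = "UTC"),
                    as.POSIXct("2012-07-01", tz = "UTC"), by = "month")

# Keep only March, June and September
ticks <- month_starts[format(month_starts, "%m") %in% c("03", "06", "09")]

plot(time, value, xaxt = "n")                    # skip the default x axis
axis.POSIXct(1, at = ticks, format = "%b %Y")    # e.g. "Jun 2010"
```

Putting the year into the label format keeps the year visible at every tick.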
While this doesn't answer the question directly, I would suggest using the xts package for any time series analysis. It makes time series analysis very convenient:
require(xts)
DF <- read.table("https://dl.dropbox.com/u/107215263/example.txt", header = TRUE)
head(DF)
## YEAR MONTH DAY value
## 1 2010 6 1 95.3244
## 2 2010 6 2 95.3817
## 3 2010 6 3 100.1968
## 4 2010 6 4 103.8667
## 5 2010 6 5 104.5969
## 6 2010 6 6 107.2666
#Get Index for xts object which we will create in next step
DFINDEX <- ISOdate(DF$YEAR, DF$MONTH, DF$DAY)
#Create xts timeseries
DF.XTS <- .xts(x = DF$value, index = DFINDEX, tzone = "GMT")
head(DF.XTS)
## [,1]
## 2010-06-01 12:00:00 95.3244
## 2010-06-02 12:00:00 95.3817
## 2010-06-03 12:00:00 100.1968
## 2010-06-04 12:00:00 103.8667
## 2010-06-05 12:00:00 104.5969
## 2010-06-06 12:00:00 107.2666
#plot xts
plot(DF.XTS)
