X axis date/time format issues with xts object in R

I have been using R for several years, but could really use some help with this. Time series data is not my norm. For some background, this data comes from I-buttons (temperature loggers) that were planted in different-sized patches in a landscape. My data looks like this in general:
Ibutton_new.csv:
Date Edge1 Edge2 Edge3...
2012-7-16 25 24 24.5
2012-7-16 24 23 23
2012-7-16 23.5 22.5 22.5
2012-7-16 27.5 24.5 24.5
2012-7-16 27 27.5 26.5
2012-7-16 27 26.5 27
2012-7-17 26 25 25
2012-7-17 25 25 25
2012-7-17 24 23 23
2012-7-17 24 23 23
2012-7-17 28 29 27.5
2012-7-17 28 28 28
etc for a year
Step 1: I convert my data into an xts object:
library(zoo)
library(xts)
x<-read.csv("Ibutton_new.csv")
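x$Date <- lubridate::ymd(x$Date) # the sample dates above are year-month-day, so ymd() rather than mdy(); needs the lubridate package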
x.xts <- xts(x[,-1], order.by=x[,1])
class(x.xts)
[1] "xts" "zoo"
str(x.xts)
An ‘xts’ object on 2012-07-16/2013-06-22 containing:
Data: num [1:2048, 1:114] 25 24 23.5 27.5 27 27 26 25 24 28 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:114] "edge_1" "edge_2" "edge_3" "edge_4" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
Step 2: Make a graph
windows()
plot(x.xts,col=color1,main="Soil Temperature (C) Across Whole Site",lwd=2,ylim=c(-10,50),cex.axis=1.5)
Okay, in general I am pretty happy with this, except for the x axis which is just ugly. I don't know why it is putting time stamps with the date values? I would like it just to have tick marks and labels by month. So I tried this:
plot(x.xts,col=color1,main="Soil Temperature (C) Across Whole Site",lwd=2,ylim=c(-10,50),cex.axis=1.5,major.ticks="months",grid.ticks.on="months")
So grid lines and tick marks looked fine, but the labels were still the same. Then I tried this:
plot(x.xts,xaxt="n",col=color1,main="Soil Temperature (C) Across Whole Site",lwd=2,ylim=c(-10,50),cex.axis=1.5)
ticks <- axTicksByTime(x.xts,"months",format.labels="%b-%Y")
axis(1,at = .index(x.xts)[ticks], labels = names(ticks),mgp=c(0,0.5,0))
And hilariously got this:
So I am so close, but not quite there. Any suggestions? Normally I would just import this into powerpoint and edit it. I have an add on that can export high quality pictures, but my computer is in a repair shop right now. Also, I hate feeling like R got the best of me. I am sure there is an easy solution or something silly I did but am not seeing, possibly a formatting issue in step one? It took me forever to be able to even make a proper xts object. Again, I don't normally work with this type of data or packages. Thanks in advance!

Using axis() just adds a formatted axis over the existing plot, similar to lines(), for example.
The solution is to include the formatting you want in the original call to plot() like so:
plot(x.xts,xaxt="n",col=color1,main="Soil Temperature (C) Across Whole Site",lwd=2,ylim=c(-10,50),cex.axis=1.5, format.labels="%b-%Y")
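To try the idea without the original CSV, here is a minimal sketch on simulated data. It assumes xts 0.10 or later, where plot.xts() accepts major.ticks, grid.ticks.on and format.labels. The object toy and its two columns are hypothetical stand-ins for the I-button series, and leaving out xaxt="n" is a choice made for this sketch, not something the answer above states.

```r
library(xts)  # assumption: xts >= 0.10, where plot.xts() accepts format.labels

# Hypothetical stand-in for the I-button data: two simulated daily series.
set.seed(1)
dates <- seq(as.Date("2012-07-16"), as.Date("2013-06-22"), by = "day")
toy <- xts(
  cbind(edge_1 = 20 + cumsum(rnorm(length(dates), sd = 0.5)),
        edge_2 = 21 + cumsum(rnorm(length(dates), sd = 0.5))),
  order.by = dates
)

# Ask for monthly ticks and grid lines, with labels formatted like "Jul-2012".
plot(toy, main = "Soil Temperature (C) Across Whole Site",
     major.ticks = "months", grid.ticks.on = "months",
     format.labels = "%b-%Y")
```

Because the label format is passed in the plot() call itself, no separate axis() call is needed afterwards.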

Related

How to modify number of characters in a vector?

so I have this dataset where age of the respondent was an open-ended question and responses sometimes look as follows:
Age:
23
45
36 years
27
33yo
...
I would like to keep the numeric data without introducing NAs (and then having to filter them out), and I wondered if there is a way to turn it into this:
Age:
23
45
36
27
33
...
...by restricting the number of characters in the vector and converting it to "numeric" afterwards.
I believe there is a simple one-liner for this. I just somehow couldn't find it.
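One possible approach (a sketch, not taken from the thread) is to strip every non-digit character instead of truncating to a fixed width, and then convert. The vector age_raw below is a hypothetical copy of the sample responses, and the sketch assumes each response contains a single whole number.

```r
# Hypothetical copy of the sample responses shown above.
age_raw <- c("23", "45", "36 years", "27", "33yo")

# Remove everything that is not a digit, then convert to numeric.
age_num <- as.numeric(gsub("[^0-9]", "", age_raw))
age_num
```

Responses with no digits at all would become empty strings and therefore NA, so those would still need a look by hand.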

How do I interpret the results of the findCorrelation() function?

I am having trouble with the findCorrelation() function. Here is my input and the output.
>findCorrelation(cor(subset(segdata, select=-c(56))),cutoff=0.9)
[1] 16 17 14 15 30 51 31 25 40
>cor(segdata)[c(16,17,14,15,30,51,31,25,40),c(16,17,14,15,30,51,31,25,40)]
[image: correlation matrix for the selected columns]
I removed column 56 because it is a factor variable.
In the code above I use cutoff=0.9, which I take to mean that only variables whose correlation is greater than or equal to 0.9 are printed.
But in the result image, the last variable (P12002900) has a very low correlation. Since I use cutoff=0.9, variables with low correlations such as P12002900 should not be output.
Why is it printed?
So I tried the Vehicle dataset that is available in R (from the mlbench package).
> library(mlbench)
> library(caret)
> data(Vehicle)
> findCorrelation(cor(subset(Vehicle,select=-c(Class))),cutoff=0.9)
[1] 3 8 11 7 9 2
> cor(subset(Vehicle,select=-c(Class)))[c(3,8,11,7,9,2),c(3,8,11,7,9,2)]
This is the result:
[image: correlation matrix for the selected Vehicle columns]
The last variable (Circ) has a correlation lower than 0.9,
but it is still printed.
Please help me. Thank you for your help!

Using abline() when x-axis is date (ie, time-series data)

I want to add multiple vertical lines to a plot.
Normally you would specify abline(v=x-intercept) but my x-axis is in the form Jan-95 - Dec-09. How would I adapt the abline code to add a vertical line for example in Feb-95?
I have tried abline(v=as.Date("Jan-95")) and other variants of this piece of code.
Following this is it possible to add multiple vertical lines with one piece of code, for example Feb-95, Feb-97 and Jan-98?
An alternative solution could be to alter my plot. I have a column with month information and a column with year information; how do I combine these to get a year-month on the x-axis?
example[25:30,]
Year Month YRM TBC
25 1997 1 Jan-97 136
26 1997 2 Feb-97 157
27 1997 3 Mar-97 163
28 1997 4 Apr-97 152
29 1997 5 May-97 151
30 1997 6 Jun-97 170
The first note: your YRM column is probably a factor, not a datetime object, unless you converted it manually. I assume we do not want to do that and our plot is looking fine with YRM as a factor.
In that case
# a factor on the x-axis is drawn at positions 1, 2, ...; %in% handles several labels at once
vline_month <- function(s) abline(v=which(levels(df$YRM) %in% s))
# keep original order of levels
df$YRM <- factor(df$YRM, levels=unique(df$YRM))
plot(df$YRM, df$TBC)
vline_month(c("Jan-97", "Apr-97"))
Disclaimer: this solution is a quick hack; it is neither universal nor scalable. For accurate representation of datetime objects and extensible tools for them, see packages zoo and xts.
I see two issues:
a) converting your data to a date/POSIX element, and
b) actually plotting vertical lines at specific rows.
For the first, create a proper date string then use strptime().
The second issue is resolved by converting the POSIX date to numeric using as.numeric().
# dates need Y-M-D
example$ymd <- paste(example$Year, '-', example$Month, '-01', sep='')
# convert to a POSIX date-time; as.POSIXct() keeps it as a plain numeric-backed column
example$ymdPX <- as.POSIXct(strptime(example$ymd, format='%Y-%m-%d'))
# may want to define tz, otherwise the system tz is used
# plot your data
plot(example$ymdPX, example$TBC, type='b')
# add vertical lines at first and last record
abline(v=as.numeric(example$ymdPX[1]), lwd=2, col='red')
abline(v=as.numeric(example$ymdPX[nrow(example)]), lwd=2, col='red')
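The question also asks about drawing several vertical lines in one call. A minimal sketch, assuming the six rows shown in the question: abline() accepts a vector for v, and as.Date() can parse labels such as "Feb-97" once a day is pasted on and a format is supplied. The column name date and the vector marks are hypothetical, and parsing %b month names assumes an English locale.

```r
# Hypothetical reconstruction of rows 25-30 from the question.
example <- data.frame(Year = 1997, Month = 1:6,
                      TBC = c(136, 157, 163, 152, 151, 170))

# Build real dates from the Year and Month columns (first day of each month).
example$date <- as.Date(sprintf("%d-%02d-01", example$Year, example$Month))

# Plot against the dates and label the axis as "Jan-97", "Feb-97", ...
plot(example$date, example$TBC, type = "b", xaxt = "n", xlab = "", ylab = "TBC")
axis.Date(1, at = example$date, format = "%b-%y")

# "Feb-97" has no day, so paste one on before parsing (English month names assumed).
marks <- as.Date(paste0("01-", c("Feb-97", "Apr-97")), format = "%d-%b-%y")

# One abline() call draws a line at every date in the vector.
abline(v = marks, lty = 2, col = "red")
```

This also suggests why abline(v=as.Date("Jan-95")) fails: the string has no day and no format is given, so as.Date() cannot interpret it.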

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as an integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
e<-x$compdate[i]
f<-e-29
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),]
b<-a[a$compdate<=e & a$compdate>=f,]
x$caseload[i]<-length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming. I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window))
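As a sanity check on the windowing idea, the sketch below compares a vectorised count with the brute-force definition from the question on a small simulated table. It is a variant, not the answer's exact function: rolling_count is a hypothetical helper that uses findInterval(y, y) instead of seq_along(y) so that cases completed on the same day all get the same count, and ave() stands in for unlist(by(...)). The toy data frame is made up for illustration.

```r
# Hypothetical tie-aware variant: for each date d, count dates in (d - window, d].
# y must be sorted in increasing order.
rolling_count <- function(y, window = 30) {
  findInterval(y, y) - findInterval(y - window, y)
}

# Small simulated table with integer dates, as in the question.
set.seed(42)
toy <- data.frame(
  idnumber = 1:200,
  judge = sample(c("AAA", "BBB", "CCC"), 200, replace = TRUE),
  compdate = sample(15000:15100, 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Sort so that dates are increasing within each judge, then count per judge.
toy <- toy[order(toy$judge, toy$compdate), ]
toy$caseload <- ave(toy$compdate, toy$judge, FUN = rolling_count)

# Brute-force version of the question's loop, for comparison only.
brute <- vapply(seq_len(nrow(toy)), function(i) {
  sum(toy$judge == toy$judge[i] &
        toy$compdate <= toy$compdate[i] &
        toy$compdate >= toy$compdate[i] - 29)
}, integer(1))

# The two approaches should agree row by row.
stopifnot(identical(as.integer(toy$caseload), brute))
```

The brute-force check is quadratic, so it is only practical on a small sample like this one. The findInterval() version needs just a sort and two binary-search passes per judge.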
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes's simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

read.csv appends/modifies column headings with date values

I'm trying to read a csv file into R that has date values in some of the column headings.
As an example, the data file looks something like this:
ID Type 1/1/2001 2/1/2001 3/1/2001 4/1/2011
A Supply 25 35 45 55
B Demand 26 35 41 22
C Supply 25 35 44 85
D Supply 24 39 45 75
D Demand 26 35 41 22
...and my read.csv logic looks like this
dat10 <- read.csv("c:/data.csv",header=TRUE, sep=",",as.is=TRUE)
The read.csv works fine except it modifies the names of the columns with dates as follows:
X1.1.2001 X2.1.2001 X3.1.2001 X4.1.2001
Is there a way to prevent this, or an easy way to correct it afterwards?
Set check.names=FALSE. But be aware that 1/1/2001 et al are syntactically invalid names, therefore they may cause you some headaches.
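A small sketch of the difference, using an inline CSV through the text argument of read.csv so no file is needed. The variable csv_text and its contents are made up for illustration.

```r
# Hypothetical inline CSV with date-like column headings.
csv_text <- "ID,Type,1/1/2001,2/1/2001
A,Supply,25,35
B,Demand,26,35"

# Default behaviour: headings are passed through make.names().
default <- read.csv(text = csv_text, as.is = TRUE)

# With check.names = FALSE the headings are kept exactly as written.
kept <- read.csv(text = csv_text, as.is = TRUE, check.names = FALSE)

names(default)
names(kept)

# Non-syntactic names need [[ ]] or backticks when used in code.
kept[["1/1/2001"]]
```

The last line shows the headache mentioned above: kept$1/1/2001 is not valid R, so such columns have to be accessed with [[ ]] or with backticks.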
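You can always change the column names using the colnames function. For example, this strips the leading "X" that read.csv adds and turns the dots back into slashes:
colnames(dat10) = gsub("\\.", "/", sub("^X([0-9])", "\\1", colnames(dat10)))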
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.
