R: Error making time series stationary using diff()

I have a dataset with monthly data starting from the 1st of every month during a period of time.
Here's head() of my Date column:
> class(gas_data$Date)
[1] "Date"
> head(gas_data$Date)
[1] "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
Trying to make the time series stationary, I'm using diff() from the base package:
> gas_data_diff <- diff(gas_data, differences = 1, lag = 12)
> head(gas_data_diff)
data frame with 0 columns and 6 rows
> names(gas_data_diff)
character(0)
> gas_data_diff %>%
+ ggplot(aes(x=Date, y=Price.Gas)) +
+ geom_line(color="darkorchid4")
Error in FUN(X[[i]], ...) : object 'Price.Gas' not found
As you can see I get an error, and when I inspect the data with head() or look at the column names I get unexpected output.
Why am I getting this error and how could I fix it?
Here's head() of my original data
> head(gas_data)
Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
1 2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2 2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
3 2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
4 2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
5 2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
6 2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39
This is what the non-stationary plot of the original data looks like for gas prices, for instance.

Explanation
As per the help page for diff, the argument x must be
x: a numeric vector or matrix containing the values to be differenced.
In your case, diff returns an empty data.frame because x is a data.frame, which is neither a numeric vector nor a matrix.
Best approach IMO
I don't see much point in passing the date column to diff, so I would follow this approach:
# convert to a matrix (dropping the Date column) and apply diff
diff_mat <- diff(as.matrix(df[, -1]), lag = 1)
# convert back to a data.frame and restore the Date column
# (diff with lag = 1 drops the first row, hence df$Date[-1])
diff_df <- as.data.frame(diff_mat)
diff_df$Date <- df$Date[-1]
# now the plot function should work
Using the given data, to address the comments:
Convert the Date vector to numeric and then convert the data.frame to a matrix inside diff:
df$Date <- as.numeric(df$Date)
diff(as.matrix(df), lag = 1, differences = 2)
# Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth
# [1,] -1 8.47 0.16 0.92 -0.5
# [2,] 1 -1.01 -0.03 -3.68 2.3
# [3,] 0 -3.47 -0.44 -0.48 -1.7
# [4,] -3 -1.77 0.59 2.15 0.3
# Average.Temperature Price.Electricity Price.Gas
# [1,] 0.03 8.73 8.94
# [2,] 8.31 -21.23 -18.88
# [3,] -1.18 7.64 4.04
# [4,] -1.94 6.90 8.89
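To address the original seasonal case directly: a minimal sketch, assuming the full monthly series (more than 12 rows, since diff with lag = 12 needs at least 13 observations) and the column names from the question, that keeps a Date column aligned with a lag-12 difference of Price.Gas:

```r
library(ggplot2)

# diff() with lag = 12 drops the first 12 observations,
# so drop the first 12 dates to keep the columns aligned
gas_data_diff <- data.frame(
  Date      = gas_data$Date[-(1:12)],
  Price.Gas = diff(gas_data$Price.Gas, lag = 12)
)

ggplot(gas_data_diff, aes(x = Date, y = Price.Gas)) +
  geom_line(color = "darkorchid4")
```

The same pattern extends to the other numeric columns by differencing each with the same lag.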
Creating the Data
df <- read.table(text = "Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39",
header = T, sep ="")
df$Date <- as.Date(df$Date)

Using apply and abind with for loop

I am using apply and abind to create a dataframe with the average of all the individual values from three similar data frames. I want to loop this code where the only thing that changes is the name of the instrument I am using (CAI.600, Thermo.1, etc).
This is what I have so far:
FIDs <- c('CAI.600', 'Thermo.1')
for (Instrument in FIDs) {
A.avg <- apply(abind::abind(paste0('FID.Eval.A.1.', Instrument),
paste0('FID.Eval.A.2.', Instrument),
paste0('FID.Eval.A.3.', Instrument),
along = 3), 1:2, mean)
assign(paste0('FID.Eval.A.', Instrument), A.avg)
}
where all the df's look similar to this (same number of rows and columns):
> FID.Eval.A.1.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 84.98 20.90 0.06 254.96 0.01
2 49.98 20.90 0.09 150.09 0.09
3 25.00 20.94 0.09 75.24 0.31
4 85.03 10.00 0.08 251.99 -1.22
5 50.03 10.00 0.09 148.51 -1.06
6 24.99 10.00 0.07 74.00 -1.27
7 84.99 0.10 0.06 246.99 -3.13
8 50.03 0.10 0.14 146.50 -2.39
9 24.96 0.10 0.10 72.97 -2.55
> FID.Eval.A.2.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.45 21.37 0.53 255.43 0.48
2 50.45 21.37 0.56 150.56 0.56
3 25.47 21.41 0.56 75.71 0.78
4 85.50 10.47 0.55 252.46 -0.75
5 50.50 10.47 0.56 148.98 -0.59
6 25.46 10.47 0.54 74.47 -0.80
7 85.46 0.57 0.53 247.46 -2.66
8 50.50 0.57 0.61 146.97 -1.92
9 25.43 0.57 0.57 73.44 -2.08
> FID.Eval.A.3.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.32 21.24 0.40 255.30 0.35
2 50.32 21.24 0.43 150.43 0.43
3 25.34 21.28 0.43 75.58 0.65
4 85.37 10.34 0.42 252.33 -0.88
5 50.37 10.34 0.43 148.85 -0.72
6 25.33 10.34 0.41 74.34 -0.93
7 85.33 0.44 0.40 247.33 -2.79
8 50.37 0.44 0.48 146.84 -2.05
9 25.30 0.44 0.44 73.31 -2.21
I either get an error message stating "along must be between 0 and 2", or, when I adjust along, a warning stating "argument is not numeric or logical: returning NA".
Should I be using something other than a for loop?
When I run abind without using for loop, the end result looks like this:
## Average of repeat tests
FID.Eval.A.CAI.600 <- apply(abind::abind(FID.Eval.A.1.CAI.600,
FID.Eval.A.2.CAI.600,
FID.Eval.A.3.CAI.600,
along = 3), 1:2, mean)
FID.Eval.A.CAI.600 <- as.data.frame(FID.Eval.A.CAI.600)
> FID.Eval.A.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.25 21.17 0.33 255.23 0.28
2 50.25 21.17 0.36 150.36 0.36
3 25.27 21.21 0.36 75.51 0.58
4 85.30 10.27 0.35 252.26 -0.95
5 50.30 10.27 0.36 148.78 -0.79
6 25.26 10.27 0.34 74.27 -1.00
7 85.26 0.37 0.33 247.26 -2.86
8 50.30 0.37 0.41 146.77 -2.12
9 25.23 0.37 0.37 73.24 -2.28
Where 'FID.Eval.A.CAI.600' displays the average for each value from the three df's.
To fix the immediate problem, use get() to return an object by character reference. As it stands, your paste0 calls only return character strings, not the actual objects.
abind::abind(get(paste0('FID.Eval.A.1.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.2.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.3.', Instrument), envir=.GlobalEnv),
along = 3)
In fact, for a more dynamic solution consider mget to return all objects by name pattern without hard-coding each of the three objects.
Also, in R it is best to avoid assign as much as possible. Instead, consider building one list of many objects with functional assignment, and avoid flooding the global environment with many separate objects. Below, sapply iterates to build a named list of average matrices.
FIDs <- c('CAI.600', 'Thermo.1')
mat_list <- sapply(FIDs, function(Instrument) {
FIDs_list <- mget(ls(pattern=Instrument, envir=.GlobalEnv), envir=.GlobalEnv)
FIDs_arry <- do.call(abind::abind, c(FIDs_list, along=3))
return(apply(FIDs_arry, 1:2, mean))
}, simplify = FALSE)
# OUTPUT ITEMS
mat_list$CAI.600
mat_list$Thermo.1
You can even update the names to conform to your original naming scheme:
names(mat_list) <- paste0("FID.Eval.A.", names(mat_list))
mat_list$FID.Eval.A.CAI.600
mat_list$FID.Eval.A.Thermo.1
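For reference, a self-contained toy version of the mget/abind pattern above, with invented two-row data frames standing in for the real replicates:

```r
library(abind)

# Hypothetical same-shaped replicates (two rows, two columns for brevity)
FID.Eval.A.1.CAI.600 <- data.frame(FTIR = c(85, 50), O2 = c(21, 21))
FID.Eval.A.2.CAI.600 <- data.frame(FTIR = c(86, 51), O2 = c(22, 22))
FID.Eval.A.3.CAI.600 <- data.frame(FTIR = c(84, 49), O2 = c(20, 20))

mat_list <- sapply(c("CAI.600"), function(Instrument) {
  # fetch every replicate whose name matches the instrument
  dfs <- mget(ls(pattern = Instrument, envir = .GlobalEnv), envir = .GlobalEnv)
  # stack the 2-d slices along a third dimension, then average cell-wise
  arr <- do.call(abind, c(lapply(dfs, as.matrix), along = 3))
  as.data.frame(apply(arr, 1:2, mean))
}, simplify = FALSE)

mat_list$CAI.600
# element-wise averages of the three replicates (FTIR: 85, 50; O2: 21, 21)
```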

How to select the values in my dataframe which have the logical operator "<" (less than), divide them by two, and convert all to numeric

I have a dataframe with character-type columns. I need to take each value as a string and check whether it contains a comparison operator. If it does, divide the number by two and store the numeric result; if not, just convert the value to a number. At the end I want a dataframe of all numeric values.
This is the dataframe:
1 <0.0001 5.89 34.6 0.044 1.14 <1 1.77 <1 19.2 310 20.2 2.94 1.38 31.9 0.94 0.115 5.2 2.38 38.4 0.078
2 <0.0001 5.77 40.7 0.042 1.25 <1 1.67 <1 20.6 260 19.5 3.14 1.51 30.2 1.04 0.098 27.7 2.54 39.4 0.07
3 <0.0001 6.77 44.3 0.039 1.38 <1 1.62 <1 21.3 180 16.9 3.79 1.65 26.8 1.26 0.076 2.6 2.63 43.4 0.63
To demonstrate, I downloaded your data and truncated it significantly for the purposes of this Q/A.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
U Th Na Ca Fe Rb Sr
<0.7 7.7 <0.1 <0.4 2.5 <0.001 <150
<0.7 6.5 <0.1 <0.4 2.4 <0.001 <150
<0.7 5.8 <0.1 <0.4 2.9 <0.001 <150
<0.7 7.5 <0.1 <0.4 2.8 <0.001 <150
<0.7 7.6 <0.1 <0.4 3.2 <0.001 <150
<0.7 6.8 0.89 <0.4 4 0.0049 <150
")
myfun <- function(x, ineq = c("<=", ">=", "<", ">", "=")) {
ptn <- paste0("^(", paste(ineq, collapse = "|"), ")") # "^(<=|>=|<|>|=)", longer operators first so "<=" is not matched as just "<"
hasineq <- grepl(ptn, x)
x <- gsub(ptn, "", x)
x <- as.numeric(x)
x[hasineq] <- x[hasineq] / 2
x
}
# the use of dat[] here is to preserve its class as 'data.frame'
dat[] <- lapply(dat, myfun)
dat
# U Th Na Ca Fe Rb Sr
# 1 0.35 7.7 0.05 0.2 2.5 0.0005 75
# 2 0.35 6.5 0.05 0.2 2.4 0.0005 75
# 3 0.35 5.8 0.05 0.2 2.9 0.0005 75
# 4 0.35 7.5 0.05 0.2 2.8 0.0005 75
# 5 0.35 7.6 0.05 0.2 3.2 0.0005 75
# 6 0.35 6.8 0.89 0.2 4.0 0.0049 75
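As a quick sanity check of the halving rule on a single character vector (values invented, pattern simplified to single-character operators):

```r
x <- c("<0.7", "7.7", "<0.001")
has_ineq <- grepl("^(<|>|=)", x)           # which entries carry an operator
num <- as.numeric(sub("^(<|>|=)", "", x))  # strip the operator, then convert
num[has_ineq] <- num[has_ineq] / 2         # halve only the flagged entries
num
# 0.35, 7.7, 0.0005 -- operator-carrying entries halved, plain number untouched
```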

How to insert a distance matrix into R and run hierarchical clustering

I know that if I have raw data I can create a distance matrix; however, for this problem I already have a distance matrix and I want to run commands like hclust on it in R. Below is the distance matrix I want in R. I am not sure storing this data in matrix form will work, as I will be unable to run hclust directly on a matrix.
I have tried to create it using the as.dist function, to no avail. My faulty code:
test=as.dist(c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51), diag = FALSE, upper = FALSE)
test
1 2 3 4 5 6 7 8 9
2 2.97
3 4.88 2.97
4 3.86 4.88 0.51
5 0.80 3.86 2.97 0.21
6 4.17 0.80 4.88 1.51 0.80
7 1.96 4.17 3.86 0.51 4.17 0.51
8 0.21 1.96 0.80 2.97 1.96 2.97 0.80
9 1.51 0.21 4.17 4.88 0.21 4.88 4.17 0.21
10 0.51 1.51 1.96 3.86 1.51 3.86 1.96 1.51 0.51
Since you already have the distance values, you don't need dist() to calculate them. The data can be stored in a regular matrix:
test <- matrix(ncol=5,nrow=5)
test[lower.tri(test)] <- c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51)
diag(test) <- 0
> test
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00 NA NA NA NA
[2,] 0.76 0.00 NA NA NA
[3,] 2.97 0.80 0.00 NA NA
[4,] 4.88 4.17 0.21 0.00 NA
[5,] 3.86 1.96 1.51 0.51 0
In order to apply hclust(), this matrix can then be converted to a distance matrix with as.dist():
> test <- as.dist(test, diag = TRUE)
> test
1 2 3 4 5
1 0.00
2 0.76 0.00
3 2.97 0.80 0.00
4 4.88 4.17 0.21 0.00
5 3.86 1.96 1.51 0.51 0.00
> hclust(test)
#
#Call:
#hclust(d = test)
#
#Cluster method : complete
#Number of objects: 5
> plot(hclust(test))
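A possible follow-up sketch: once the hclust object exists, cluster memberships can be extracted with cutree (cutting into 2 groups here is an arbitrary choice):

```r
# rebuild the distance matrix from the values above
test <- matrix(0, ncol = 5, nrow = 5)
test[lower.tri(test)] <- c(.76, 2.97, 4.88, 3.86, .8, 4.17, 1.96, .21, 1.51, .51)

hc <- hclust(as.dist(test))
cutree(hc, k = 2)  # cluster label (1 or 2) for each of the 5 objects
```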

Substituting the results of a calculation

I'm munging data. Specifically, I've opened this pdf http://pubs.acs.org/doi/suppl/10.1021/ja105035r/suppl_file/ja105035r_si_001.pdf and scraped the data from table s4,
1a 1b 1a 1b
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
9(R) 4.16 3.89 4.12b 4.18b
9(S) 4.16 3.92 4.12b 4.18b
10 1.19 0.91 1.21 1.25
pasted it into notepad and saved it as a txt file.
s4 <- read.table("s4.txt", header=TRUE, stringsAsFactors=FALSE)
gives,
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
In order to use the data I need to change it all to numeric and remove the letters. Thanks to this link R regex gsub separate letters and numbers, I can use the following code to get rid of the extraneous letters:
gsub("([[:alpha:]])", "", s4[,3])
What I want to do now, and the point of the question, is to replace the ranges,
"7.22-7.35" "7.22-7.35" "7.22-7.35"
with their means,
"7.29"
Could I use gsub for this? (or would I need to strsplit across the hyphen, combine into a vector and return the mean?).
You need a single regex in strsplit for this task (removing letters and splitting):
s4[] <- lapply(s4, function(x) {
if (is.numeric(x)) x
else sapply(strsplit(as.character(x), "-|[[:alpha:]]"),
function(y) mean(as.numeric(y)))
})
The result:
> s4
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.090 4.750
2 2.47 2.74 2.770 2.800
4 1.14 1.38 1.120 1.020
6 7.43 7.35 7.285 7.305
7 7.38 7.34 7.285 7.305
8 7.23 7.20 7.285 7.305
Here's an approach that seems to work right on the sample data:
df[] <- lapply(df, function(col){
col <- gsub("([[:alpha:]])","", col)
col <- ifelse(grepl("-", col), mean(as.numeric(unlist(strsplit(col[grepl("-", col)], "-")))), col)
as.numeric(col)
})
> df
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
Disclaimer: It only works right if the ranges in each column are all the same (as in the sample data)
Something like this:
mean(as.numeric(unlist(strsplit("7.22-7.35", "-"))))
should work (and corresponds to what you had in mind, I guess).
Or you can do:
eval(parse(text = paste0("mean(c(", gsub("-", ",", "7.22-7.35"), "))")))
but I'm not sure this is simpler...
To apply it to a vector:
vec <- c("7.22-7.35", "7.22-7.35")
1st solution: sapply(vec, function(x) mean(as.numeric(unlist(strsplit(x, "-")))))
2nd solution: sapply(vec, function(x) eval(parse(text = paste0("mean(c(", gsub("-", ",", x), "))"))))
In both cases, you'll get :
7.22-7.35 7.22-7.35
7.285 7.285
Also,
library(gsubfn)
indx <- !sapply(s4, is.numeric)
s4[indx] <- lapply(s4[indx], function(x)
sapply(strapply(x, '([0-9.]+)', ~as.numeric(x)), mean))
s4
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
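One more sketch of the same idea in base R, using regmatches() with gregexpr() to pull every number out of a cell and average them (a cell holding a single number is its own mean):

```r
# extract all runs of digits/dots per element, convert, and average
range2mean <- function(x) {
  sapply(regmatches(x, gregexpr("[0-9.]+", x)),
         function(y) mean(as.numeric(y)))
}

range2mean(c("7.22-7.35a", "5.27"))
# 7.285 and 5.27 -- the trailing letter is ignored by the pattern
```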

Cut Function in R program

Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
19 1.91
20 1.96
21 2.01
22.5 2.06
24 2.11
25.5 2.16
27 2.21
28.5 2.26
30 2.31
31.5 2.36
33 2.41
34.5 2.4223
36 2.4323
So I have data about Time and Velocity. I want to use the cut or which function to separate my data into 6 min intervals; my maximum Time usually goes up to 3000 mins.
So I would want the output to be similar to this:
Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
Time Velocity
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
Time Velocity
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
So far I have read the data using data = read.delim("clipboard").
I decided to use the function which, but I would need to repeat this all the way up to 3000 mins:
dat <- data[which(data$Time >= 0 & data$Time < 6), ]
dat1 <- data[which(data$Time >= 6 & data$Time < 12), ]
etc.
But this wouldn't be convenient if time went up to 3000 mins.
Also, I would want all my results to be contained in one output/variable.
I will assume here that you really don't want to duplicate the values across the bins.
cuts <- cut(data$Time, seq(0, max(data$Time)+6, by=6), right=FALSE)
x <- by(data, cuts, FUN=I)
x
## cuts: [0,6)
## Time Velocity
## 1 0.0 0.00
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## ------------------------------------------------------------------------------------------------------------
## cuts: [6,12)
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## ------------------------------------------------------------------------------------------------------------
## <snip>
## ------------------------------------------------------------------------------------------------------------
## cuts: [36,42)
## Time Velocity
## 28 36 2.4323
I don't think that you want duplicated bounds. Here is a simple solution without using cut (similar to @Mathew's solution).
dat <- transform(dat, index = Time %/% 6)
by(dat,dat$index,FUN=I)
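A self-contained toy version of the %/% binning with split(), using a subset of the question's data:

```r
dat <- data.frame(Time     = c(0, 1.5, 3, 4.5, 6, 7.5, 9),
                  Velocity = c(0, 1.21, 1.26, 1.31, 1.36, 1.41, 1.46))

# integer division assigns each row to one 6-minute bin
split(dat, dat$Time %/% 6)
# $`0` holds Times 0 to 4.5, $`1` holds Times 6 to 9 (no duplicated boundary rows)
```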
If you really need duplicate timestamps at the integral multiples of 6, then you will have to do some data duplication before splitting.
txt <- "Time Velocity\n0 0\n1.5 1.21\n3 1.26\n4.5 1.31\n6 1.36\n7.5 1.41\n9 1.46\n10.5 1.51\n12 1.56\n13 1.61\n14 1.66\n15 1.71\n16 1.76\n17 1.81\n18 1.86\n19 1.91\n20 1.96\n21 2.01\n22.5 2.06\n24 2.11\n25.5 2.16\n27 2.21\n28.5 2.26\n30 2.31\n31.5 2.36\n33 2.41\n34.5 2.4223\n36 2.4323"
DF <- read.table(text = txt, header = TRUE)
# Create duplicate timestamps where the timestamp is a multiple of 6
posinc <- DF[DF$Time%%6 == 0, ]
neginc <- DF[DF$Time%%6 == 0, ]
posinc <- posinc[-1, ]
neginc <- neginc[-1, ]
# Add tiny +ve and -ve increments to these duplicated timestamps
posinc$Time <- posinc$Time + 0.01
neginc$Time <- neginc$Time - 0.01
# Bind original dataframe without 6 sec multiple timestamp with above duplicated timestamps
DF2 <- do.call(rbind, list(DF[!DF$Time%%6 == 0, ], posinc, neginc))
# Order by timestamp
DF2 <- DF2[order(DF2$Time), ]
# Split the dataframe by quotient of timestamp divided by 6
SL <- split(DF2, DF2$Time%/%6)
# Round back up the timestamps of split data to 1 decimal place
RESULT <- lapply(SL, function(x) {
x$Time <- round(x$Time, 1)
return(x)
})
RESULT
## $`0`
## Time Velocity
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## 51 6.0 1.36
##
## $`1`
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## 91 12.0 1.56
##
## $`2`
## Time Velocity
## 9 12 1.56
## 10 13 1.61
## 11 14 1.66
## 12 15 1.71
## 13 16 1.76
## 14 17 1.81
## 151 18 1.86
##
## $`3`
## Time Velocity
## 15 18.0 1.86
## 16 19.0 1.91
## 17 20.0 1.96
## 18 21.0 2.01
## 19 22.5 2.06
## 201 24.0 2.11
##
## $`4`
## Time Velocity
## 20 24.0 2.11
## 21 25.5 2.16
## 22 27.0 2.21
## 23 28.5 2.26
## 241 30.0 2.31
##
## $`5`
## Time Velocity
## 24 30.0 2.3100
## 25 31.5 2.3600
## 26 33.0 2.4100
## 27 34.5 2.4223
## 281 36.0 2.4323
##
## $`6`
## Time Velocity
## 28 36 2.4323
##