Merging zoo objects using do.call in R

I have several CSV files like this:
,timestamp,AirTemperature_House
1,2013-09-01 00:00:00,8.22
2,2013-09-01 01:00:00,6.53
3,2013-09-01 02:00:00,6.67
4,2013-09-01 03:00:00,5.58
5,2013-09-01 04:00:00,4.16
6,2013-09-01 05:00:00,4.76
7,2013-09-01 06:00:00,5.06
8,2013-09-01 07:00:00,5.16
9,2013-09-01 08:00:00,6.83
10,2013-09-01 09:00:00,8.59
11,2013-09-01 10:00:00,10.99
12,2013-09-01 11:00:00,11.08
I grouped them into a list of zoo objects using the following code:
raw_data<-list.files(path = "./AWS_Data_STU/Air_temp/",pattern="Air",full.names = T)
data_stu<-lapply(raw_data,function(x){
ss<-read.csv(x)
ss<-zoo(ss,order.by = ss$timestamp)
})
This gives a list of zoo objects that all look like this one:
str(data_stu[[1]])
‘zoo’ series from 2013-09-01 00:00:00 to 2014-04-30 23:00:00
Data: num [1:5808] 8.22 6.53 6.67 5.58 4.16 4.76 5.06 5.16 6.83 8.59 ...
Index: Factor w/ 5808 levels "2013-09-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10
...
I want to merge the whole list into a single data frame like this:
X1 x2 x3 X4 x5 x6 x7
1 12.95 NA NA NA
2 14.81 14.37 NA NA 12.78 NA
3 15.02 15.11 NA NA 12.61 NA
4 13.91 14.25 NA NA 11.89 NA
5 12.34 13.96 NA NA 10.86 NA
6 14.40 14.47 NA NA 10.40 NA
I used do.call:
do.call(merge.zoo,data_stu )
structure(c(7.66, 7.29, 7.34, 7.15, 6.76, 6.41, 6.25, 6.36, 6.78,
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8
but it returned only NA values.
Any ideas?

The problem is that the indexes of all your zoo objects are factors; you need to convert them to POSIXct. Also, you should not call methods directly: call merge instead of merge.zoo and let R handle method dispatch.
You can also use read.zoo to help with the conversion.
data_stu <- do.call(merge, lapply(raw_data, read.zoo, sep=",", header=TRUE,
FUN=as.POSIXct, colClasses=c("NULL", "character", "numeric")))
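To see the fix in isolation, here is a minimal sketch that uses two small inline data frames in place of the CSV files. The values in b, the helper name to_zoo, and the column names station_a and station_b are made up for illustration. It converts the timestamp to POSIXct before building each zoo object and then merges through the generic:

```r
library(zoo)

# Hypothetical stand-ins for two of the CSV files
a <- data.frame(timestamp = c("2013-09-01 00:00:00", "2013-09-01 01:00:00"),
                AirTemperature_House = c(8.22, 6.53),
                stringsAsFactors = FALSE)
b <- data.frame(timestamp = c("2013-09-01 01:00:00", "2013-09-01 02:00:00"),
                AirTemperature_House = c(7.10, 6.67),
                stringsAsFactors = FALSE)

# Build a univariate zoo series indexed by POSIXct, not by a factor
to_zoo <- function(d) {
  zoo(d$AirTemperature_House, order.by = as.POSIXct(d$timestamp, tz = "UTC"))
}

# merge() dispatches to the zoo method; timestamps missing in one series become NA
merged <- do.call(merge, lapply(list(a, b), to_zoo))
colnames(merged) <- c("station_a", "station_b")  # illustrative names
merged
```

Setting the column names afterwards avoids the long deparsed names that do.call would otherwise leave on the result.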

Related

Interpolate NA values when column ends on NA

I have a column with numeric data with NA and also ending on NA:
df <- data.frame(
Diam_av = c(12.3, 13, 15.5, NA, NA, NA, NA, 13.7, NA, NA, NA, 9.98, 4,0, 8.76, NA, NA, NA)
)
I want to interpolate the missing values. This works fine with zoo's na.approx as long as there are non-NA boundary values to interpolate from, but it fails if, as in my case, one of the boundary values is NA (here, at the end of the column Diam_av):
library(zoo)
library(dplyr)  # provides %>% and mutate()
df %>%
mutate(Diam_intpl = na.approx(Diam_av))
Error: Problem with `mutate()` input `Diam_intpl`.
x Input `Diam_intpl` can't be recycled to size 18.
ℹ Input `Diam_intpl` is `na.approx(Diam_av)`.
ℹ Input `Diam_intpl` must be size 18 or 1, not 15.
Any idea how to exclude or neutralize the trailing NA values?
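Add na.rm = FALSE so that leading and trailing NAs are kept and the result has the same length as the input, which removes the error. Add rule = 2 to fill those trailing NAs with the last non-NA value.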
df %>%
mutate(Diam_intpl = na.approx(Diam_av, na.rm=F),
Diam_intpl2 = na.approx(Diam_av, na.rm=F, rule=2))
Diam_av Diam_intpl Diam_intpl2
1 12.30 12.30 12.30
2 13.00 13.00 13.00
3 15.50 15.50 15.50
4 NA 15.14 15.14
5 NA 14.78 14.78
6 NA 14.42 14.42
7 NA 14.06 14.06
8 13.70 13.70 13.70
9 NA 12.77 12.77
10 NA 11.84 11.84
11 NA 10.91 10.91
12 9.98 9.98 9.98
13 4.00 4.00 4.00
14 0.00 0.00 0.00
15 8.76 8.76 8.76
16 NA NA 8.76
17 NA NA 8.76
18 NA NA 8.76
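As far as I know, na.approx hands rule on to stats::approx, so the difference between the two columns can be reproduced in base R. A minimal sketch on a made-up vector v (the names inside_only and extended are illustrative):

```r
v <- c(1, NA, 3, NA, NA)
i <- seq_along(v)
ok <- !is.na(v)

# rule = 1 (the default): positions outside the observed range stay NA
inside_only <- approx(i[ok], v[ok], xout = i, rule = 1)$y

# rule = 2: positions outside the range take the nearest observed value
extended <- approx(i[ok], v[ok], xout = i, rule = 2)$y

inside_only
extended
```

Only the trailing positions differ: the interior NA is interpolated linearly in both cases.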
If I understand correctly, you can replace the NAs with imputeTS::na_interpolation(), which has many options:
library(imputeTS)
df$interpolated <- na_interpolation(df,option = 'linear')$Diam_av
Diam_av interpolated
1 12.30 12.30
2 13.00 13.00
3 15.50 15.50
4 NA 15.14
5 NA 14.78
6 NA 14.42
7 NA 14.06
8 13.70 13.70
9 NA 12.77
10 NA 11.84
11 NA 10.91
12 9.98 9.98
13 4.00 4.00
14 0.00 0.00
15 8.76 8.76
16 NA 8.76
17 NA 8.76
18 NA 8.76

Quarterly year-to-year changes

I have a quarterly time series. I am trying to apply a function that is supposed to calculate the year-to-year growth and the year-to-year difference, and to multiply one variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes, and it worked.
I modified this function for yoy changes, but it has no effect on my data frame, and no error is raised either.
Do you have any suggestions for how to modify the function, or for another way to apply the yoy changes to a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code produces the following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
I want to apply the following changes to the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the function that I am trying to use on my data frame:
transformation = function(D,abs.diff,percent.change,negative)
{
  TT <- dim(D)[1]
  DData <- D[-1,]
  nms <- c()
  for (i in c(2:dim(D)[2])) {
    # yoy absolute difference change
    if (names(D)[i] %in% abs.diff) {
      DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
      names(DData)[i] = paste('a',names(D)[i],sep='')
    }
    # yoy percent. change
    if (names(D)[i] %in% percent.change) {
      DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
      names(DData)[i] = paste('p',names(D)[i],sep='')
    }
    #CA.deficit
    if (names(D)[i] %in% negative) {
      DData[,i] = (-1)*D[1:TT,i]
    }
  }
  return(DData)
}
This is what I would like to get:
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Group the rows by month (characters 6 and 7 of Date) using ave, and compute the year-over-year percentage change within each group. sapply then loops over the columns.
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
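The function above applies the percentage change to every column. If each column needs its own transformation, as in the question, one possible base R sketch is shown below. The helper names yoy_lag, yoy_diff, yoy_pct and transform_yoy are made up, the data are the first eight rows of the question's table, and the column assignment in the call is only illustrative. It assumes evenly spaced quarterly rows with more than four observations:

```r
# Value from the same quarter one year earlier (lag of 4 rows)
yoy_lag  <- function(x, lag = 4) c(rep(NA, lag), head(x, -lag))
yoy_diff <- function(x, lag = 4) x - yoy_lag(x, lag)
yoy_pct  <- function(x, lag = 4) 100 * (x - yoy_lag(x, lag)) / yoy_lag(x, lag)

transform_yoy <- function(D, abs.diff, percent.change, negative) {
  out <- D
  out[abs.diff]       <- lapply(D[abs.diff], yoy_diff)
  out[percent.change] <- lapply(D[percent.change], yoy_pct)
  out[negative]       <- lapply(D[negative], function(x) -x)
  # Prefix transformed columns as in the question: 'a' = absolute, 'p' = percent
  names(out)[match(abs.diff, names(D))]       <- paste0("a", abs.diff)
  names(out)[match(percent.change, names(D))] <- paste0("p", percent.change)
  out
}

df <- data.frame(
  Date = c("2004-01-01", "2004-04-01", "2004-07-01", "2004-10-01",
           "2005-01-01", "2005-04-01", "2005-07-01", "2005-10-01"),
  B1 = c(3189.30, 3482.05, 3792.03, 4128.66, 4443.62, 4876.54, 5393.01, 5885.01),
  B2 = c(7939.97, 7950.58, 7834.06, 7746.23, 7760.59, 8209.00, 8583.05, 8930.74),
  B4 = c(NA, NA, NA, NA, -3.50, -14.60, -11.60, -10.20)
)

res <- transform_yoy(df, abs.diff = "B1", percent.change = "B2", negative = "B4")
res
```

The result keeps all rows of the input, with NA in the first four rows of the differenced columns, so it can be bound back to the original dates without realignment.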

Replace all duplicated with na

My question is similar to "replace duplicate values with NA in time series data using dplyr", but applied to a time series like the one below:
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 500.75 7.2
6-WQ 2018-12-2 500.75 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 0.134 9.3
25-LR 2018-12-2 0.134 4
73-IU 2018-12-2 225.54 0.7562
73-IU 2018-12-9 28 0.7562
73-IU 2018-12-16 225.54 52.8
library(dplyr)
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))
The above code identifies duplicates and replaces them with NA, but only from the second occurrence onward. My underlying problem is that, in the next step, I want to replace all NAs with a linear trend, since this is a time series. For box_num 6-WQ, there is a large shift right after 20.2, which suggests that the repeated 500.75 is an imputed value, so I would like to replace both occurrences with NA. For box_num 73-IU, the imputed value reappears after a one-week gap, and I would like to replace those occurrences with NA as well.
Expected output:
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 NA 7.2
6-WQ 2018-12-2 NA 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 NA 9.3
25-LR 2018-12-2 NA 4
73-IU 2018-12-2 NA NA
73-IU 2018-12-9 28 NA
73-IU 2018-12-16 NA 52.8
foo = function(x){
replace(x, ave(x, x, FUN = length) > 1, NA)
}
myCols = c("x", "y")
df1[myCols] = lapply(df1[myCols], foo)
df1
# box_num date x y
#1 6-WQ 2018-11-18 20.20 8.0
#2 6-WQ 2018-11-25 NA 7.2
#3 6-WQ 2018-12-2 NA 23.0
#4 25-LR 2018-11-18 374.95 4.3
#5 25-LR 2018-11-25 NA 9.3
#6 25-LR 2018-12-2 NA 4.0
#7 73-IU 2018-12-2 NA NA
#8 73-IU 2018-12-9 28.00 NA
#9 73-IU 2018-12-16 NA 52.8
#DATA
df1 = structure(list(box_num = c("6-WQ", "6-WQ", "6-WQ", "25-LR", "25-LR",
"25-LR", "73-IU", "73-IU", "73-IU"), date = c("2018-11-18", "2018-11-25",
"2018-12-2", "2018-11-18", "2018-11-25", "2018-12-2", "2018-12-2",
"2018-12-9", "2018-12-16"), x = c(20.2, 500.75, 500.75, 374.95,
0.134, 0.134, 225.54, 28, 225.54), y = c(8, 7.2, 23, 4.3, 9.3,
4, 0.7562, 0.7562, 52.8)), class = "data.frame", row.names = c(NA,
-9L))
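Note that foo looks for repeats across the whole column. If a value should only count as duplicated within its own box_num, a grouped variant is possible in base R. The sketch below is an assumption about the intent, and na_all_dups is a made-up helper name:

```r
df1 <- data.frame(
  box_num = rep(c("6-WQ", "25-LR", "73-IU"), each = 3),
  x = c(20.2, 500.75, 500.75, 374.95, 0.134, 0.134, 225.54, 28, 225.54),
  y = c(8, 7.2, 23, 4.3, 9.3, 4, 0.7562, 0.7562, 52.8),
  stringsAsFactors = FALSE
)

# Flag every occurrence of a repeated value, not just the later ones
na_all_dups <- function(v) {
  replace(v, duplicated(v) | duplicated(v, fromLast = TRUE), NA)
}

# ave() applies the helper separately within each box_num group
df1$x <- ave(df1$x, df1$box_num, FUN = na_all_dups)
df1$y <- ave(df1$y, df1$box_num, FUN = na_all_dups)
df1
```

Combining duplicated() with its fromLast = TRUE counterpart is what marks the first occurrence as well as the later ones.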
With tidyverse you can do:
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(ifelse(. %in% subset(rle(sort(.))$values, rle(sort(.))$length > 1), NA, .)))
box_num date x y
<fct> <fct> <dbl> <dbl>
1 6-WQ 2018-11-18 20.2 8.00
2 6-WQ 2018-11-25 NA 7.20
3 6-WQ 2018-12-2 NA 23.0
4 25-LR 2018-11-18 375. 4.30
5 25-LR 2018-11-25 NA 9.30
6 25-LR 2018-12-2 NA 4.00
7 73-IU 2018-12-2 NA NA
8 73-IU 2018-12-9 28.0 NA
9 73-IU 2018-12-16 NA 52.8
First, it sorts the values in "x" and "y" and computes the run length of equal values. Second, it creates a subset for those values that have a run length > 1. Finally, it compares whether the values in "x" and "y" are in the subset, and if so, they get NA.
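The three steps can be traced on a single group in base R. This is a small sketch using the x values of box_num 73-IU; the variable names are illustrative:

```r
x <- c(225.54, 28, 225.54)

# Step 1: sort, then compute run lengths of equal values
runs <- rle(sort(x))

# Step 2: keep the values whose run length is greater than 1
dup_values <- runs$values[runs$lengths > 1]

# Step 3: set every element that appears in that subset to NA
result <- replace(x, x %in% dup_values, NA)
result
```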

How to Drop X in Column names after Merge

I've merged two data frames by a common ID column, and the merge worked fine, but I am getting an X before each column name.
How can I remove the X from each column header?
z<- merge(p, y, by='ID')
head(z)
ID x y V1 X198101 X198102 X198103 X198104 X198105 X198106
1 410320 -122.5417 37.75 NA 119.45 33.15 104.23 5.61 4.85 0
2 410321 -122.5000 37.75 NA 129.49 37.76 114.94 5.28 5.24 0
3 410322 -122.4583 37.75 NA 163.68 42.80 131.22 7.25 6.94 0
4 410323 -122.4167 37.75 NA 141.14 32.26 110.45 7.77 4.62 0
5 410324 -122.3750 37.75 NA 130.87 25.87 102.15 8.38 4.13 0
6 410325 -122.3333 37.75 NA 129.03 25.21 102.37 9.42 4.35 0
Thanks!
It is better for column names not to start with digits. By default, make.names (which data.frame and read.table apply when check.names = TRUE) adds the X prefix to names that start with a digit. To remove it, one option is sub:
names(z) <- sub("^X", "", names(z))
z
# ID x y V1 198101 198102 198103 198104 198105 198106
#1 410320 -122.5417 37.75 NA 119.45 33.15 104.23 5.61 4.85 0
#2 410321 -122.5000 37.75 NA 129.49 37.76 114.94 5.28 5.24 0
#3 410322 -122.4583 37.75 NA 163.68 42.80 131.22 7.25 6.94 0
#4 410323 -122.4167 37.75 NA 141.14 32.26 110.45 7.77 4.62 0
#5 410324 -122.3750 37.75 NA 130.87 25.87 102.15 8.38 4.13 0
#6 410325 -122.3333 37.75 NA 129.03 25.21 102.37 9.42 4.35 0
If we apply make.names
make.names(names(z))
#[1] "ID" "x" "y" "V1" "X198101" "X198102"
#[7] "X198103" "X198104" "X198105" "X198106"
The 'X' prefix is returned. So, in general, it is safer to keep column names with a letter prefix instead of just digits. Also, if we wanted to extract, say, the '198104' column, we would need backticks:
z$198104
#Error: unexpected numeric constant in "z$198104"
z$`198104`
#[1] 5.61 5.28 7.25 7.77 8.38 9.42
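One caveat: sub("^X", ...) also strips a leading X from names that legitimately start with it. A slightly stricter sketch, on a made-up vector of names, removes the X only when a digit follows:

```r
nms <- c("ID", "x", "y", "V1", "X198101", "Xcoord")

# The lookahead (?=[0-9]) needs perl = TRUE; it leaves "Xcoord" untouched
cleaned <- sub("^X(?=[0-9])", "", nms, perl = TRUE)
cleaned
```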
This isn't actually caused by merge; it must come from something earlier in your code. If it happens when you read in the data, try the check.names=FALSE option.
a <- data.frame(a=1:3, b=4:6)
b <- data.frame(a=1:3, c=7:9)
names(b)[2] <- 2485
merge(a,b)
## a b 2485
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
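The effect of check.names can be seen on a tiny inline CSV. This is a sketch with a made-up two-line input; the header names mimic those in the question:

```r
csv <- "ID,198101,198102\n410320,119.45,33.15"

# Default: names are passed through make.names(), so digits get an X prefix
with_check <- read.csv(text = csv)

# check.names = FALSE keeps the header exactly as written in the file
no_check <- read.csv(text = csv, check.names = FALSE)

names(with_check)
names(no_check)
```

With the unmodified names, columns have to be accessed with backticks or by string, for example no_check[["198101"]].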

R - How to change values in one Matrix based on elements in another Matrix

I have the following covariance matrix in R:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
Each column and row corresponds to a football player (e.g., AB-2000). So the cell at (AB-2000, AB-2000) gives the variance of that player's performance, and a cell like (AB-2000, AF-0200) gives the covariance of the two players' performances.
Currently, the matrix shows all covariance values. However, not all covariance values matter. In fact, the only ones that matter are when two players are playing the same game that week (in this case, have the same game ID (GID)).
The following table shows the GID for a PLAYER on certain week:
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
How do I go about keeping only the values in the covariance matrix when the two players have the same GID (for instance, players AB-2000 and AC-0100)?
Thanks for the help!
I think this does what you're asking, if I'm interpreting the question correctly. I've given you a couple solutions, pick your poison. The first relies on a nested for loop which could be slow and further optimized if you knew for sure your matrix was symmetric.
m <- read.table(header=T, stringsAsFactors=F, text="
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
")
p <- read.table(header=T, stringsAsFactors=F, text="
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
")
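m_t2 <- m  # copy of the covariance data frame read in above
names(m_t2) <- row.names(m_t2)  # restore the names mangled by read.table (AB.2000 -> AB-2000)
## Look up each player's GID (match() does not depend on the row order of p):
row_names <- p$GID[match(row.names(m_t2), p$PLAYER)]
col_names <- p$GID[match(names(m_t2), p$PLAYER)]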
for (i in 1:nrow(m_t2)) {
m_t2[i, col_names != row_names[i]] <- NA
}
m_t2 <- as.matrix(m_t2)
Alternatively, this solution relies on the tidyr and dplyr packages, but it should be quite efficient for very large datasets:
# m is the covariance data frame read in above
names(m) <- row.names(m)
m$row_names <- row.names(m)
library(tidyr)
library(dplyr)
d <- m %>%
gather(col_names, "cv", -row_names, convert=T) %>%
left_join(p, by = c("row_names" = "PLAYER")) %>%
mutate(GID_row = GID) %>%
select(-GID) %>%
left_join(p, by=c("col_names" = "PLAYER")) %>%
mutate(GID_col = GID) %>%
mutate(new_cv = ifelse((GID_row == GID_col), cv, NA)) %>%
select(row_names, col_names, new_cv) %>%
spread(col_names, new_cv)
m_t <- as.matrix(d[,-1])
row.names(m_t) <- d[["row_names"]]
The result in either case looks like this:
> m_t
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.50 NA NA 3.65 NA NA
AB-2600 NA 7.18 NA NA NA NA
AB-3500 NA NA 5.4 NA NA NA
AC-0100 3.65 NA NA 4.22 NA NA
AD-0100 NA NA NA NA 5.9 NA
AF-0200 NA NA NA NA NA 4.28
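For a plain numeric matrix, the same filter can also be written without a loop. The following is a minimal base R sketch on a three-player subset of the data; the object names cov_m, games, same_game and filtered are illustrative:

```r
players <- c("AB-2000", "AB-3500", "AC-0100")
cov_m <- matrix(c( 6.50, -1.80, 3.65,
                  -1.79,  5.40,   NA,
                   3.65,    NA, 4.22),
                nrow = 3, byrow = TRUE, dimnames = list(players, players))
games <- data.frame(GID = c(3467, 3463, 3467), PLAYER = players,
                    stringsAsFactors = FALSE)

# GID of each row/column player, in the order used by the matrix
gid <- games$GID[match(rownames(cov_m), games$PLAYER)]

# TRUE where the row player and the column player share a game
same_game <- outer(gid, gid, "==")

filtered <- cov_m
filtered[!same_game] <- NA
filtered
```

The diagonal always survives, because every player shares a GID with themselves.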
