How to get correlations between two variables with lags

I would like to check whether "birds" and "wolfs" are correlated at different lags. Getting the correlation value is easy, but how can I handle the lags (I need the correlation value for lags 1:4)? The output I am looking for is a data table that contains each lag and its correlation value.
df <- read.table(text = " day birds wolfs
0 2 21
1 8 4
2 2 5
3 2 4
4 3 6
5 1 12
6 7 10
7 1 9
8 2 12", header = TRUE)
Desired output (not real results):
Lag CorValue
0 0.9
1 0.8
2 0.7
3 0.9

If you do this:
corLag <- ccf(df$birds, df$wolfs, lag.max = max(df$day))
it will return this:
Autocorrelations of series ‘X’, by lag
-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
-0.028 0.123 -0.045 -0.019 0.145 -0.176 -0.082 -0.126 -0.296 0.757 -0.134 -0.180 0.070 -0.272 0.549 -0.170 -0.117
The first row is the lag and the second is the correlation value. You can check that cor(df$birds, df$wolfs) is indeed equal to -0.296, the value at lag 0.
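To get the lag/correlation table the question asks for, the printed ccf result can be turned into a data frame. The following is a minimal sketch, not the original answerer's code: the name lag_table and the choice to keep only lags 0 to 4 are assumptions. Note that ccf(x, y) at lag k estimates the correlation between x[t+k] and y[t] using the means and variances of the full series, so values at non-zero lags need not match cor() computed on manually shifted vectors.

```r
# Same data as in the question.
df <- read.table(text = "day birds wolfs
0 2 21
1 8 4
2 2 5
3 2 4
4 3 6
5 1 12
6 7 10
7 1 9
8 2 12", header = TRUE)

# plot = FALSE returns the estimates without drawing the correlogram.
cc <- ccf(df$birds, df$wolfs, lag.max = 4, plot = FALSE)

# cc$lag and cc$acf are 3-d arrays; as.vector() flattens them.
lag_table <- data.frame(Lag = as.vector(cc$lag),
                        CorValue = as.vector(cc$acf))

# Keep only lags 0..4 (an assumption about which direction is wanted).
lag_table <- lag_table[lag_table$Lag >= 0, ]
print(lag_table)
```

Dropping the filter on Lag keeps the negative lags as well, which correspond to shifting the series in the other direction.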

Related

Pivoting multiple sets of columns using pivot_longer in R

I want to pivot multiple sets of variables in a data frame. My data looks like this:
require(dplyr)
require(tidyr)
x_1=rnorm(10,0,1)
x_2=rnorm(10,0,1)
x_3=rnorm(10,0,1)
y_1=rnorm(10,0,1)
y_2=rnorm(10,0,1)
aid=rep(1:5,2)
data=data.frame(aid, x_1,x_2,y_1,y_2)
> data
aid x_1 x_2 y_1 y_2
1 1 -0.82305819 0.9366731 0.95419200 2.29544019
2 2 0.64424320 -0.2807793 0.51303834 0.02560463
3 3 -1.11108822 -0.2475625 0.05747951 -0.51218368
4 4 -1.04026895 -0.4138653 0.57751999 0.60942652
5 5 1.29097040 -1.7829966 1.59940532 0.75868562
6 1 -0.57845406 -1.0002074 0.04302291 0.86766265
7 2 0.08996163 -0.7949632 -2.10422124 -0.43432995
8 3 0.14331978 0.4203010 -1.12748270 0.14484670
9 4 -0.25207187 1.5559295 0.23621422 -0.04719046
10 5 -0.25617731 0.6241852 -1.21131110 1.02236458
I want to pivot the x and y variables separately. I did that using the following lines of code.
data2 = data %>% reshape(.,direction = "long",
varying = list(c('x_1','x_2'),
c('y_1','y_2')),
v.names = c("x",'y'))
I need to generalize this to any number of columns. In this example x and y have 2 columns each, but for another data set that may be different. With more columns, it would be tedious to type every name in the varying parameter.
In order to avoid specifying the columns when pivoting, I tried this code:
data1 <- data%>% pivot_longer(!aid, names_to = c("id"), names_pattern = "(.)(.)")
But it gave this error:
Error: `regex` should define 1 groups; 2 found.
Can anyone help me to fix this?
Thank you.

The parentheses around part of a pattern mean that we capture that part as a group. In the code below, the first group captures one or more lower-case letters ([a-z]+). It is followed by a _, which is outside the parentheses and is therefore dropped. The second group captures one or more digits (\\d+). Each group is matched with the corresponding element of names_to: .value means that the captured text names the output column that holds the values, so we get the columns 'x' and 'y', and the second element creates a new column, 'time', that holds the digit suffix of the original column names.
library(tidyr)
pivot_longer(data, cols = -aid, names_to = c(".value", "time"),
names_pattern = "^([a-z]+)_(\\d+)")
Output:
# A tibble: 20 × 4
aid time x y
<int> <chr> <dbl> <dbl>
1 1 1 -0.823 0.954
2 1 2 0.937 2.30
3 2 1 0.644 0.513
4 2 2 -0.281 0.0256
5 3 1 -1.11 0.0575
6 3 2 -0.248 -0.512
7 4 1 -1.04 0.578
8 4 2 -0.414 0.609
9 5 1 1.29 1.60
10 5 2 -1.78 0.759
11 1 1 -0.578 0.0430
12 1 2 -1.00 0.868
13 2 1 0.0900 -2.10
14 2 2 -0.795 -0.434
15 3 1 0.143 -1.13
16 3 2 0.420 0.145
17 4 1 -0.252 0.236
18 4 2 1.56 -0.0472
19 5 1 -0.256 -1.21
20 5 2 0.624 1.02
In the OP's code there are two capture groups, (.) and (.), but only one element in names_to, so it fails. In addition, the pattern does not account for the _ between 'x' or 'y' and the digit. Also, names_pattern is interpreted as a regular expression, so some characters act as metacharacters: . matches any character, not a literal dot.
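To see what the two capture groups pick out of each column name, the same pattern can be tried with base R's regexec() and regmatches(). This is only an illustrative sketch: nms, parts and groups are made-up names, and pivot_longer does this matching internally rather than through these calls.

```r
# Column names as in the question.
nms <- c("x_1", "x_2", "y_1", "y_2")

# regexec() locates the full match and each capture group;
# regmatches() extracts the corresponding substrings.
parts <- regmatches(nms, regexec("^([a-z]+)_(\\d+)", nms))

# One row per column name: full match, then group 1, then group 2.
groups <- do.call(rbind, parts)
colnames(groups) <- c("match", ".value", "time")
print(groups)
```

The ".value" column shows which output column each original name feeds, and "time" shows the suffix that ends up in the new column. Because the pattern does not list the column names, it keeps working when more x_ or y_ columns are added.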

In this case names_sep is a handy alternative to names_pattern as the column names are already separated by _:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(-aid,
names_to =c(".value","time"),
names_sep ="_"
)
aid time x y
<int> <chr> <dbl> <dbl>
1 1 1 1.08 -1.49
2 1 2 0.871 0.449
3 2 1 -1.01 -0.577
4 2 2 1.23 -0.0890
5 3 1 -0.905 -0.289
6 3 2 1.16 -0.380
7 4 1 -0.316 -0.446
8 4 2 0.902 1.05
9 5 1 -0.908 1.36
10 5 2 -0.558 -1.57
11 1 1 -0.383 1.22
12 1 2 0.704 0.000539
13 2 1 0.595 -0.668
14 2 2 -0.461 1.46
15 3 1 2.00 -0.365
16 3 2 -1.14 0.150
17 4 1 -2.13 -0.827
18 4 2 0.642 -0.798
19 5 1 0.397 -0.0143
20 5 2 0.981 1.79

Error in emmfcn(...) : Variable 'CO2' is not in the dataset in r

I want to compare the slopes of different regressions: CO2 changes over time (day) for 8 different nests.
> structure(as1)
# A tibble: 16 x 4
day nest N2O CO2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0.00549 0.206
2 1 2 0.129 0.0343
3 1 3 0.157 0.113
4 1 4 0.0760 0.106
5 2 1 -0.0487 0.214
6 2 2 -0.0561 0.358
7 2 3 -0.0522 0.767
8 2 4 -0.0193 0.188
9 3 1 -0.0757 0.255
10 3 2 -0.237 0.753
11 3 3 -0.117 0.745
12 3 4 0.0345 0.502
13 4 1 0.135 0.325
14 4 2 0.264 0.767
15 4 3 0.0116 0.926
16 4 4 0.0342 0.358
I'm following the instructions given in https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models, specifically the answer with a score of 16.
Instead of the lsmeans library that the answer suggests, I used emmeans, because R encourages users to switch to emmeans. However, I have also tried it with lsmeans and I get the same problem. When I run this:
library(emmeans)
m.interaction <- lm(CO2 ~ day*nest, data = as1)
anova(m.interaction)
# Obtain slopes
m.interaction$coefficients
m.lst <- lstrends(m.interaction, "day", var="CO2", data = as1)
Everything works fine until lstrends, where I get this error:
##Error in emmfcn(...) : Variable 'CO2' is not in the dataset
Does anybody know what might be happening?
Thanks in advance!

Why am I getting "'train' and 'class' have different lengths"

Why am I getting the error
'train' and 'class' have different lengths
even though both of them have the same length?
y_pred=knn(train=training_set[,1:2],
test=Test_set[,-3],
cl=training_set[,3],
k=5)
Their dimensions are given below:
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0

It's because knn expects class (cl) to be a vector, and you are giving it a table with one column (a tibble in your case). The check that knn performs is whether nrow(train) == length(cl). If cl is a table, length(cl) is the number of columns, which is not the answer you are expecting. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
This is a specific gotcha if you are moving from data.frame to data.table, because the default drop behaviour is different:
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6
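The length comparison described above can be reproduced without calling knn at all. The sketch below uses a small made-up data frame in place of the question's training_set (the values and the names cl_table and cl_vector are assumptions), and drop = FALSE to mimic the one-column table that a tibble or data.table returns from [, 3].

```r
# Hypothetical training data shaped like the question's training_set.
training_set <- data.frame(
  Age = c(-1.77, -1.10, -1.00),
  EstimatedSalary = c(-1.47, -0.788, -0.360),
  Purchased = factor(c(0, 0, 1))
)

# A one-column table: length() counts columns, so this is 1.
cl_table <- training_set[, 3, drop = FALSE]
length(cl_table)

# A plain vector (here a factor): length() counts elements.
cl_vector <- training_set[[3]]
length(cl_vector)

# The row-count comparison only holds for the vector form.
nrow(training_set) == length(cl_table)   # FALSE
nrow(training_set) == length(cl_vector)  # TRUE
```

Extracting with [[3]] or $Purchased returns a vector for data frames, tibbles and data.tables alike, so it avoids depending on each class's drop default.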

Replacing conditional values with previous values in r

I have some data on organism survival as a function of time. The data are constructed from the averages of many replicates at each time point, which can yield a forward time step with an increase in survival. Occasionally, this results in a survivorship greater than 1, which is impossible. How can I conditionally change values greater than 1 to the value preceding them in the same column?
Here's what the data looks like:
>df
Generation Treatment time lx
1 0 1 0 1
2 0 1 2 1
3 0 1 4 0.970
4 0 1 6 0.952
5 0 1 8 0.924
6 0 1 10 0.913
7 0 1 12 0.895
8 0 1 14 0.729
9 0 2 0 1
10 0 2 2 1
I've tried mutating the column of interest as such, which still yields values above 1:
df1 <- df %>%
group_by(Generation, Treatment) %>%
mutate(lx_diag = as.numeric(lx/lag(lx, default = first(lx)))) %>% #calculate running survival
mutate(lx_diag = if_else(lx_diag > 1.000000, lag(lx_diag), lx_diag)) #substitute values >1 with previous value
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 1.01
10 12 1 19 0.76 1.04
I expect the results to look something like:
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 0.814
10 12 1 19 0.76 0.814
I know you can conditionally change the values to a specific value (i.e. ifelse with no else), but I haven't found any solutions that can conditionally change a value in a column to the value in the previous row. Any help is appreciated.
EDIT: I realized that mutate and if_else are vectorized. Instead of replacing values in sequence from first to last, as I would have expected, they replace all values at the same time. So in a run of consecutive values >1, some will be left behind. Thus, if you just run the command:
SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1, lag(SurvTot1$lx_diag), SurvTot1$lx_diag)
over again, you can get rid of the remaining values >1. Not the most elegant solution, but it works.

This looks like a very ugly solution to me, but I couldn't think of anything else:
df = data.frame(
"Generation" = rep(12,10),
"Treatent" = rep(1,10),
"Time" = c(seq(0,14,by=2),15,19),
"lx_diag" = c(1,1,1,0.996,0.992,0.968,0.925,0.814,1.04,1.04)
)
update_lag = function(x){
k <<- k+1
x
}
k=1
df %>%
mutate(
lx_diag2 = ifelse(lx_diag <=1,update_lag(lx_diag),lag(lx_diag,n=k))
)

Using the data from @Fino, here is my vectorized solution using base R
vals.to.replace <- which(df$lx_diag > 1)
vals.to.substitute <- sapply(vals.to.replace, function(x) tail( df$lx_diag[which(df$lx_diag[1:x] <= 1)], 1) )
df$lx_diag[vals.to.replace] = vals.to.substitute
df
Generation Treatent Time lx_diag
1 12 1 0 1.000
2 12 1 2 1.000
3 12 1 4 1.000
4 12 1 6 0.996
5 12 1 8 0.992
6 12 1 10 0.968
7 12 1 12 0.925
8 12 1 14 0.814
9 12 1 15 0.814
10 12 1 19 0.814
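The same "carry the last valid value forward" idea can also be written without a per-element lookup. This is an alternative sketch rather than either answerer's method: the function name fill_from_previous and its limit argument are hypothetical, and values with no valid predecessor are left unchanged by assumption.

```r
# Replace each value above `limit` with the most recent value
# at or below `limit` that precedes it in the vector.
fill_from_previous <- function(x, limit = 1) {
  ok <- x <= limit
  # Index of the latest valid position at or before each element
  # (0 when no valid value has been seen yet).
  last_ok <- cummax(ifelse(ok, seq_along(x), 0L))
  out <- x
  has_prev <- last_ok > 0
  out[has_prev] <- x[last_ok[has_prev]]
  out
}

lx_diag <- c(1, 1, 1, 0.996, 0.992, 0.968, 0.925, 0.814, 1.01, 1.04)
fixed <- fill_from_previous(lx_diag)
print(fixed)
```

Because cummax carries the index of the last valid position through a whole run of values >1, consecutive offenders are all replaced in one pass, which is the case the repeated if_else call was working around. For grouped data, the function could be applied per group, for example inside group_by() and mutate().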

Correct data type passed to barplot

My initial goal was to set ylim for data plotted by barplot. When I started to dig deeper, I found several things that I do not understand. Let me explain what I found:
I have a 1-D vector:
> str(vectorName)
num [1:999] 1 1 1 1 1 1 1 1 1 1 ...
> dim(vectorName)
NULL
> length(vectorName)
[1] 999
If I want to count the occurrences of each element of this vector, I do:
> vectorNameTable = table(vectorName)
> vectorNameTable
vectorName
0 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4
563 72 35 22 14 21 14 10 5 3 7 3 6 5 3 1 3
0.425 0.45 0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775 0.8 0.825
1 3 3 5 7 11 3 4 3 11 5 9 5 7 8 5 3
0.85 0.875 0.9 0.925 0.975 1
3 4 2 1 1 108
This is how I display the data in a more elegant way (in RStudio):
> View(vectorNameTable)
Which gives me output like this:
vectorName Freq
1 0 563
2 0.025 72
3 0.05 35
4 0.075 22
5 0.1 14
6 0.125 21
7 0.15 14
8 0.175 10
9 0.2 5
10 0.225 3
11 0.25 7
12 0.275 3
13 0.3 6
14 0.325 5
15 0.35 3
16 0.375 1
17 0.4 3
18 0.425 1
19 0.45 3
20 0.475 3
21 0.5 5
22 0.525 7
23 0.55 11
24 0.575 3
25 0.6 4
26 0.625 3
27 0.65 11
28 0.675 5
29 0.7 9
30 0.725 5
31 0.75 7
32 0.775 8
33 0.8 5
34 0.825 3
35 0.85 3
36 0.875 4
37 0.9 2
38 0.925 1
39 0.975 1
40 1 108
If I want to plot this data I do:
> barplot(vectorNameTable)
Which gives me this plot:
As you can see, 0 occurs more often than the y-axis extends. So what I want is to set the y-axis range using:
barplot(table(vectorNameTable), ylim=c(0,MAX_VALUE_IN_FREQ_COLUMN))
The problem is that I cannot find the largest value in the Freq column. To be more precise, I cannot even access the Freq column. I've tried:
> vectorNameTable[,1]
Error in vectorNameTable[, 1] : incorrect number of dimensions
and several other attempts, but it seems that the only thing I am able to obtain is a whole row:
> vectorNameTable[1]
0
563
> vectorNameTable[2]
0.025
72
Or even the Freq value in given row:
> vectorNameTable[[1]]
[1] 563
> vectorNameTable[[2]]
[1] 72
The one workaround that works is converting the data to a matrix:
vectorNameDF = data.frame(vectorNameTable)
val = vectorNameDF[[1]]
frq = vectorNameDF[[2]]
val = as.numeric(levels(val))
vectorNameMTX = matrix(c(val, frq), nrow=length(val))
Then I can do something like this:
barplot(vectorNameTable, ylim=c(0,max(vectorNameMTX[,2])+50))
Which will return:
But as you can see, it is extreme overkill. Another mysterious thing that I've found is that plotting the graph this way (same as barplot(vectorNameMTX, beside=FALSE)):
> barplot(vectorNameMTX)
Will return this:
The command barplot(vectorNameMTX, beside=TRUE) will return this:
Why is this happening? I mean, what is this "line" on the left? And where is the x-axis? If I do View(vectorNameMTX), it returns a table very similar to View(vectorNameTable). The documentation for barplot says (only the important parts):
Bar Plots
Description
Creates a bar plot with vertical or horizontal bars.
Usage
barplot(height, ...)
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
I'm passing a matrix, but it is not working as expected:
> class(vectorNameMTX)
[1] "matrix"
On the other hand, this type is not mentioned as supported, but it works:
> class(vectorNameTable)
[1] "table"
Why can't I access the columns of vectorNameTable? Why does passing the table object work while passing a matrix does not? What am I missing here, and what is the best way to achieve my goal?
Thank you

The table of a 1-D vector is itself one-dimensional, so there are no columns. You can do something like
> a <- rbinom(1000, 25, 0.5)
> tb <- table(a)
> tb
a
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
8 20 31 71 96 155 141 146 136 94 46 33 15 7 1
> dim(tb)
[1] 15 # 1 dimension of 15
> tb[which.max(tb)]
11
155
So you can feed this maximum to barplot as the upper limit of ylim.
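Putting this together for the original goal, a short sketch might look like the following. The vector v is a small made-up stand-in for vectorName, and the headroom of 1 above the tallest bar is an arbitrary choice.

```r
# Hypothetical stand-in for vectorName from the question.
v <- c(rep(0, 5), rep(0.5, 2), rep(1, 3))
tb <- table(v)

# A one-dimensional table behaves like a named vector of counts.
max_count <- max(tb)                    # largest frequency
max_value <- names(tb)[which.max(tb)]   # value that occurs most often

# Leave some headroom above the tallest bar.
barplot(tb, ylim = c(0, max_count + 1))
```

Since max() works directly on the table, there is no need to convert it to a data frame or matrix first.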
