Repeating linear regression in a large dataset in R

I'm an R newbie working with an annual time series dataset (named "timeseries"). The set has one column for year and another 600 columns with the yearly values for different locations ("L1", "L2", etc.), e.g. similar to the following:
Year L1 L2 L3 L4
1963 0.63 0.23 1.33 1.41
1964 1.15 0.68 0.21 0.4
1965 1.08 1.06 1.14 0.83
1966 1.69 1.85 1.3 0.76
1967 0.77 0.62 0.44 0.96
I'd like to do a linear regression for each site and can use the following for a single site:
timeL1 <- lm(L1 ~ Year, data = timeseries)
summary(timeL1)
But I think there must be a way to automatically repeat this for all the locations. Ideally, I'd like to end up with two vectors of results: one with the coefficients for all the locations and one with the p-values for all the locations. From some searching, I thought the plyr package might work, but I can't figure it out. I'm still learning the basics of R, so any suggestions would be appreciated.

You can do this with one line of code (here df is your timeseries data frame). Note that $coef[1, c(1, 4)] extracts the intercept row of the coefficient table; use [2, c(1, 4)] instead if you want the slope of Year:
apply(df[-1], 2, function(x) summary(lm(x ~ df$Year))$coef[1, c(1, 4)])
L1 L2 L3 L4
Estimate -160.0660000 -382.2870000 136.4690000 106.9820000
Pr(>|t|) 0.6069965 0.3886881 0.7340981 0.7030296

A combination of apply and lapply can accomplish this.
d <- read.table(text="Year L1 L2 L3 L4
1963 0.63 0.23 1.33 1.41
1964 1.15 0.68 0.21 0.4
1965 1.08 1.06 1.14 0.83
1966 1.69 1.85 1.3 0.76
1967 0.77 0.62 0.44 0.96", header=TRUE)
year <- d$Year
d <- d[,-1]
models <- apply(d, 2, function(x) lm(x ~ year))
summaries <- lapply(models, summary)
# coefficients() on each summary is a 2 x 4 matrix: row 2 is the year term,
# columns 1 and 4 hold the estimate and the p-value
coefs <- sapply(summaries, function(x) coefficients(x)[2, 1])
pvals <- sapply(summaries, function(x) coefficients(x)[2, 4])
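With 600 columns it may also be worth knowing that lm() accepts a matrix response and fits one regression per column in a single call. A minimal sketch, reusing the d and year objects from above:

# Fit all locations at once via lm()'s multivariate-response interface
fit <- lm(as.matrix(d) ~ year)
coefs <- coef(fit)["year", ]  # named vector of Year slopes
pvals <- sapply(summary(fit), function(s) coefficients(s)["year", 4])

Here summary() on a multi-response fit returns one summary per column, so the sapply pulls the Year p-value out of each.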

Related

How to make a for loop for training and testing a model iteratively based on the cluster value

I am looking to do a transferability analysis on my clustered data. My goal is to train the model on one cluster ("group") in the data and test on each other group. The end result should be a matrix of model R^2 values.
In other words, I want the for loop to train the model on one group and test on another group until each group in my data frame has been trained on and tested on each of the other groups.
Here is an example of my data:
group              y      x1     x2
Temperate Sierras  37.25  11.32  0.85
Great Plains       6.25   4.67   0.91
Northern Forests   8.91   9.23   0.23
Tropical Forests   21.22  3.53   0.85
Highlands          87.99  7.91   0.34
Highlands          32.26  12.01  0.51
Northern Forests   14.85  5.53   0.77
Great Plains       96.93  4.81   0.37
Great Plains       72.12  9.86   0.43
The end goal is a matrix like the one below of test R^2 values, with the training group on the y-axis and the testing group on the x-axis.
group              Temperate Sierras  Great Plains  Northern Forests  Tropical Forests  Highlands
Temperate Sierras  0.55               0.42          0.85              0.27              0.82
Great Plains       0.22               0.82          0.91              0.28              0.18
Northern Forests   0.92               0.34          0.23              0.55              0.41
Tropical Forests   0.42               0.31          0.85              0.19              0.73
Highlands          0.29               0.67          0.34              0.46              0.55
Here is my code so far:
groups <- unique(df$group) # 5 unique groups in data
for (group in groups) {
  print(paste0(Sys.time(), " now processing: ", group))
  df.train <- df[df$group != group, ]
  df.test <- df[df$group == group, ]
  mod <- ranger(y ~ ., data = df.train)
  pred <- predict(mod, df.test)
  r2.test <- summary(lm(pred$predictions ~ df.test$y))$adj.r.squared
  if (group == groups[1]) {
    results.df <- data.frame(group = group, r2.test = r2.test)
  } else {
    results.df <- rbind(results.df, data.frame(group = group, r2.test = r2.test))
  }
}
results.df
As written, the code picks one testing group and excludes that group from the training data, but it then trains on all of the other groups at once. It seems my issue is in the definition of df.train and df.test in the loop, but I am unsure how to fix it so that the model trains on one group and tests on each of the other groups iteratively.
Any help or ideas would be much appreciated.
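One way to get the full train-by-test matrix is two nested loops: the outer loop fixes the training group, the inner loop scores the fitted model on every group. A minimal sketch, assuming ranger is loaded and df has the group, y, x1, x2 columns shown above (the diagonal here is just the within-group fit):

library(ranger)

groups <- unique(as.character(df$group))
r2.mat <- matrix(NA_real_, length(groups), length(groups),
                 dimnames = list(train = groups, test = groups))

for (train.group in groups) {
  # train on one group only
  mod <- ranger(y ~ x1 + x2, data = df[df$group == train.group, ])
  for (test.group in groups) {
    df.test <- df[df$group == test.group, ]
    pred <- predict(mod, df.test)
    # adjusted R^2 of predicted vs. observed, as in the original code
    r2.mat[train.group, test.group] <-
      summary(lm(pred$predictions ~ df.test$y))$adj.r.squared
  }
}
r2.mat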

Time-series average of cross-sectional correlations

I have a panel dataset looking like this:
head(panel_data)
date symbol close rv rv_plus rv_minus rskew rkurt Mkt.RF SMB HML
1 1999-11-19 a 25.4 19.3 6.76 12.6 -0.791 4.36 -0.11 0.35 -0.5
2 1999-11-22 a 26.8 10.1 6.44 3.69 0.675 5.38 0.02 0.22 -0.92
3 1999-11-23 a 25.2 8.97 2.56 6.41 -1.04 4.00 -1.29 0.08 0.3
4 1999-11-24 a 25.6 5.81 2.86 2.96 -0.505 5.45 0.87 0.08 -0.89
5 1999-11-26 a 25.6 2.78 1.53 1.25 0.617 5.60 0.23 0.92 -0.2
6 1999-11-29 a 26.1 5.07 2.76 2.30 -0.236 7.27 -0.6 0.570 -0.14
where the variable symbol identifies the different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. This means I need to compute the correlation between rskew and rkurt over all stocks at each point in time and then take the time-series average afterwards.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
  group_by(date) %>%
  mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
                                   width = 20,
                                   FUN = cor,
                                   align = "right",
                                   na.rm = TRUE,
                                   fill = NA)) %>%
  ungroup()
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?
[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal time-window length.
###################
# Recreating an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- rep(NaN, 21)
symbol <- c(rep('a', 6), rep('b', 5), rep('c', 4), rep('d', 6))
rskew <- rnorm(21, mean = 1, sd = 1)
rkurt <- rnorm(21, mean = 5, sd = 1)
panel_data <- cbind.data.frame(date = c(date6, date5, date4, date6), symbol = symbol,
                               rskew = rskew, rkurt = rkurt, cor_skew_kurt = cor_skew_kurt)
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')

# Computing cor_skew_kurt and filling the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date)) {
  panel_data[panel_data$date == date, "cor_skew_kurt"] <-
    as.double(cor(panel_data[panel_data$date == date, 'rskew'],
                  panel_data[panel_data$date == date, 'rkurt']))
}
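For the time-series average itself, a grouped summarise avoids the width problem entirely, since each date uses however many stocks it has. A minimal sketch with dplyr, assuming panel_data as built above:

library(dplyr)

# One cross-sectional correlation per date, whatever the number of stocks
cor_by_date <- panel_data %>%
  group_by(date) %>%
  summarise(cor_skew_kurt = cor(rskew, rkurt, use = "complete.obs"))

# Time-series average of the cross-sectional correlations
mean(cor_by_date$cor_skew_kurt, na.rm = TRUE)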

How can I get row-wise max based on condition of specific column in R dataframe?

I'm trying to get the maximum value BY ROW across several columns (climatic water deficit -- def59_z_#) depending on how much time has passed (time since fire -- YEAR.DIFF). Here are the conditions:
If 1 year has passed, select the deficit value for the first year (def59_z_1).
If 2 years: max deficit of first 2 years.
If 3 years: max of deficit of first 3 years.
If 4 years: max of deficit of first 4 years.
If 5 or more years: max of first 5 years.
However, I am unable to extract a row-wise max when I include a condition. There are several existing posts that address row-wise min and max (examples 1 and 2) and sd (example 3) -- but these don't use conditions. I've tried using apply but I haven't been able to find a solution when I have multiple columns involved as well as a conditional requirement.
The following code simply returns 3.5 in the new column def59_z_max15, which is the maximum value that occurs anywhere in the data frame -- except when YEAR.DIFF is 1, in which case def59_z_1 is directly returned. But for all the other conditions, I want 0.98, 0.67, 0.7, 1.55, 1.28 -- values that reflect the row maximum of the specified columns. Link to sample data here. How can I achieve this?
I appreciate any/all suggestions!
data <- data %>%
  mutate(def59_z_max15 =
           ifelse(YEAR.DIFF == 1, def59_z_1,
           ifelse(YEAR.DIFF == 2, max(def59_z_1, def59_z_2),
           ifelse(YEAR.DIFF == 3, max(def59_z_1, def59_z_2, def59_z_3),
           ifelse(YEAR.DIFF == 4, max(def59_z_1, def59_z_2, def59_z_3, def59_z_4),
                  max(def59_z_1, def59_z_2, def59_z_3, def59_z_4, def59_z_5))))))
Throw this function into an apply-family function. It assumes YEAR.DIFF is the first element of each row and def59_z_1 through def59_z_5 are elements 2 to 6:
func <- function(x) {
  first.val <- x[1]
  if (first.val < 5) {
    return(max(x[2:(first.val + 1)]))
  } else {
    return(max(x[2:6]))
  }
}
Your desired output should be obtained by:
apply(data, 1, func) # apply row-wise by setting MARGIN = 1
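If you prefer to avoid a row-wise apply, the same condition can be expressed as a mask over the deficit columns: entries beyond the allowed year are replaced with -Inf, and pmax takes the row maximum of what remains. A minimal sketch under the same column-layout assumption (YEAR.DIFF first, then def59_z_1 ... def59_z_5):

# Columns 2:6 hold def59_z_1 ... def59_z_5
m <- as.matrix(data[, 2:6])
keep <- col(m) <= pmin(data$YEAR.DIFF, 5)  # which years each row may use
data$def59_z_max15 <- do.call(pmax, as.data.frame(ifelse(keep, m, -Inf)))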
An option would be to get the pmax (vectorized row-wise max) for each condition separately in a loop (map): if the value of 'YEAR.DIFF' is 1, select only 'def59_z_1'; for 2, take the max of 'def59_z_1' and 'def59_z_2'; ...; for 5, the max of 'def59_z_1' through 'def59_z_5'. Then coalesce the resulting columns together and replace the remaining NAs with the pmax of all the 'def59_z' columns.
library(tidyverse)
out <- map_dfc(1:5, ~
       df1 %>%
         select(seq_len(.x) + 1) %>%
         transmute(val = na_if((df1[["YEAR.DIFF"]] == .x) *
                                 pmax(!!! rlang::syms(names(.))), 0))) %>%
  transmute(def59_z_max15 = coalesce(!!! rlang::syms(names(.)))) %>%
  bind_cols(df1, .) %>%
  mutate(def59_z_max15 = case_when(is.na(def59_z_max15) ~
                                     pmax(!!! rlang::syms(names(.)[2:6])),
                                   TRUE ~ def59_z_max15))
head(out, 10)
# YEAR.DIFF def59_z_1 def59_z_2 def59_z_3 def59_z_4 def59_z_5 def59_z_max15
#1 5 0.25 -2.11 0.98 -0.07 0.31 0.98
#2 9 0.67 0.65 -0.27 0.52 0.26 0.67
#3 10 0.56 0.33 0.03 0.70 -0.09 0.70
#4 2 -0.34 1.55 -1.11 -0.40 0.94 1.55
#5 4 0.98 0.71 0.41 1.28 -0.14 1.28
#6 3 0.71 -0.17 1.70 -0.57 0.43 1.70
#7 4 -1.39 -1.71 -0.89 0.78 1.22 0.78
#8 4 -1.14 -1.46 -0.72 0.74 1.32 0.74
#9 2 0.71 1.39 1.07 0.65 0.29 1.39
#10 1 0.28 0.82 -0.64 0.45 0.64 0.28
data
df1 <- read.csv("https://raw.githubusercontent.com/CaitLittlef/random/master/data.csv")

Avoid nested for-loops for a tricky operation in R

I have to do an operation that involves two matrices: matrix #1 holds data, and matrix #2 holds coefficients by which to multiply the columns of matrix #1.
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix #1 by its corresponding coefficient from the first column of matrix #2, and sum up all the resulting columns. Then repeat the operation with the coefficients of matrix #2's second column, then its third column, and so on...
The result is then a matrix with 8 columns, each a linear combination of the data in matrix #1.
My attempt uses nested for-loops. It works, but takes about 30 minutes to execute. Is there any way to avoid these loops and reduce the computational effort?
Here is my attempt:
r <- nrow(dat)
n <- ncol(dat)
m <- ncol(lo)
eme <- matrix(NA, r, m)
for (i in 1:m) {
  SC <- matrix(NA, r, n)
  for (j in 1:n) {
    nom <- rownames(lo)
    x <- dat[, colnames(dat) == nom[j]]
    SC[, j] <- x * lo[j, i]
    SC1 <- rowSums(SC)
  }
  eme[, i] <- SC1
}
Thanks for your help
It looks like you are just doing matrix-vector multiplication. In R, use the %*% operator so that all the looping is delegated to a Fortran routine. I think it equates to the following (dat must be converted to a matrix first):
apply(lo, 2, function(x) as.matrix(dat) %*% x)
Your code could also be improved by moving the nom <- assignment outside the loops, since it recalculates the same thing on every iteration. Also, what is the point of computing SC1 on every iteration of the inner loop?
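Since the loop matches coefficients to data columns by name, the whole computation also collapses into a single matrix product. A minimal sketch, assuming every rowname of lo appears among the column names of dat:

# Reorder dat's columns to match lo's rows, then multiply once
eme <- as.matrix(dat[, rownames(lo)]) %*% as.matrix(lo)
dim(eme) # 612 x 8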

Biomass model with density

I am trying to develop a common biomass model for several species with wood density.
Here is my data set:
Species_Name DBH_cm Wood_Density Leaf_Biomass_kg
Aam 10.9 0.55 4.495175666
Aam 8.3 0.55 3.003987585
Aam 18.3 0.55 7.0453234
Akashmoni 26.6 0.68 8.68327883
Akashmoni 18 0.68 5.514198965
Akashmoni 20.6 0.68 7.140993296
Amloki 13.7 0.64 0.418757191
Amloki 14.6 0.64 0.348964326
Amra 19 0.29 0
Arjun 13.3 0.82 0
Bajna 13 0.70 0
Bel 19.6 0.83 0.458638794
Sal 14.40 0.82 0.996750392
Sal 12.20 0.82 0.644956136
Sal 10.00 0.82 0.947928706
Sal 14.20 0.82 0.767434214
Sal 11.50 0.82 0.636970398
Sal 13.20 0.82 0.445111844
Sal 13.30 0.82 0.706039477
Sal 10.70 0.82 0.475809213
I tried to set the missing values (recorded as 0) to NA using:
tree[which(tree$Leaf_Biomass_kg == 0),]$Leaf_Biomass_kg <- NA
My model code is:
library(nlme)
start <- coef(lm(log(Leaf_Biomass_kg) ~ log(DBH_cm) + log(Wood_Density), data = tree))
start[1] <- exp(start[1])
names(start) <- c("a", "b1", "b2")
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm))
It gives:
Error in finiteDiffGrad(model, data, pars) :
  NAs are not allowed in subscripted assignments
Can anyone help me in this regard?
I added na.action = na.exclude to my model, but the problem still exists:
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm),
          na.action = na.exclude)
You have missing values (NA) in your data set. Try na.exclude, or remove the records with NA values; you may also want to impute the missing values. This question has been addressed here:
https://stats.stackexchange.com/questions/11000/how-does-r-handle-missing-values-in-lm
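A minimal sketch of the remove-records route, assuming the zeros were already recoded to NA as in the question, is to fit on complete cases only:

library(nlme)

# Keep only trees with an observed leaf biomass, then rebuild the starting
# values and refit the model on the complete cases
tree_complete <- tree[!is.na(tree$Leaf_Biomass_kg), ]
start <- coef(lm(log(Leaf_Biomass_kg) ~ log(DBH_cm) + log(Wood_Density),
                 data = tree_complete))
start[1] <- exp(start[1])  # back-transform the intercept to the a scale
names(start) <- c("a", "b1", "b2")
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree_complete, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm))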
