I have a data frame named "df" with 41 variables, var1 to var41. If I run this command
pcdtest(plm(var1~ 1 , data = df, model = "pooling"))[[1]]
I can see the test statistic. But I need to apply this test 41 times, so I want to access each variable by column number, i.e. "df[1]" for "var1" up to "df[41]" for "var41":
pcdtest(plm(df[1]~ 1 , data = dfp, model = "pooling"))[[1]]
But it fails. Could you please help me do this? I would like to collect the results in a for loop and then calculate descriptive statistics over all of them. Running the test by hand for each variable is very tedious.
I think you can easily adapt the following code to your data. Since you didn't provide any of your data, I used data that comes with the plm package.
library(plm) # for pcdtest
# example data from plm package
data("Cigar" , package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# example for one column
p_model <- plm(formula = pop~1, data = Cigar.p, model = "pooling")
pcdtest(p_model)[[1]]
# run through multiple models
l_plm_models <- list() # store plm models in this list
l_tests <- list() # store test results in this list
for(i in 3:ncol(Cigar.p)){ # start in the third column, since the first two are state and year
fmla <- as.formula(paste0(names(Cigar.p)[i], " ~ 1")) # build e.g. price ~ 1
l_plm_models[[i]] <- plm(formula = fmla,
data = Cigar.p,
model = "pooling")
l_tests[[i]] <- pcdtest(l_plm_models[[i]])[[1]]
}
testresult <- data.frame("z" = unlist(l_tests), row.names = (colnames(Cigar.p[3:11])))
> testresult
z
price 175.36476
pop 130.45774
pop16 155.29092
cpi 176.21010
ndi 175.51938
sales 99.02973
pimin 175.74600
fact1 176.21010
fact2 176.21010
# example for cipstest
matrix_results <- matrix(NA, nrow = 11, ncol = 2) # use 41 here for your df
l_ctest <- list()
for(i in 3:ncol(Cigar.p)){
l_ctest[[i]] <- cipstest(Cigar.p[, i], lags = 4, type = 'none', model = 'cmg', truncated = FALSE)
matrix_results[i, 1] <- as.numeric(l_ctest[[i]][1])
matrix_results[i, 2] <- as.numeric(l_ctest[[i]][7])
}
res <- data.frame(matrix_results)
names(res) <- c('cips-statistic', 'p-value')
print(res)
Try using as.formula(), for example:
results <- list()
for (i in 1:41){
varName <- paste0('var',i)
frml <- paste0(varName, ' ~ 1')
results[[i]] <-
pcdtest(plm(as.formula(frml) , data = dfp, model = "pooling"))[[1]]
}
You can use reformulate to create the formula and lapply to run the code for all 41 variables:
var <- paste0('var', 1:41)
result <- lapply(var, function(x) pcdtest(plm(reformulate('1', x),
data = df, model = "pooling"))[[1]])
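Here is a small self-contained sketch of the same pattern in base R. It is only an illustration: the toy data frame and the use of lm as a stand-in for plm/pcdtest are my assumptions, chosen so that it runs without any panel data.

```r
# Toy data standing in for the real data frame (hypothetical values)
set.seed(1)
toy <- data.frame(var1 = rnorm(20), var2 = rnorm(20), var3 = rnorm(20))

# Pick the columns by position, then turn each column name into a formula
vars <- names(toy)[1:3]
formulas <- lapply(vars, function(v) reformulate("1", response = v))

# Fit one intercept-only model per formula and keep a single number from each
# (lm is a stand-in here; swap in the plm/pcdtest call for the real test)
intercepts <- vapply(formulas,
                     function(f) unname(coef(lm(f, data = toy))[1]),
                     numeric(1))
names(intercepts) <- vars

# Descriptive statistics over all results
summary(intercepts)
```

Because the results end up in one named numeric vector, functions like summary, mean or sd can be applied to all of them at once.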
Related
I am having 2 sets of raster data and their names are:
ntl_'a number'.tif
pop_'a number'.tif
My goal is to create a function that reads the first pair of rasters (e.g., ntl_1.tif and pop_1.tif), then executes the below code and then repeats the process with the next pair:
library(raster)
library(DescTools)
#create a data.frame of values from the NTL and pop raster data
ntl = raster("path/ntl_1.tif")
vals_ntl <- as.data.frame(values(ntl))
ntl_coords = as.data.frame(xyFromCell(ntl, 1:ncell(ntl)))
combine <- as.data.frame(cbind(ntl_coords,vals_ntl))
pop<-raster("path/pop_1.tif")
pop = resample(pop, ntl, method = 'bilinear')
vals_pop <- as.data.frame(values(pop))
block.data <- as.data.frame(cbind(combine, vals_pop))
names(block.data)[3] <- "ntl"
names(block.data)[4] <- "pop"
block.data <- na.omit(block.data)
block.data = subset(block.data, select = -c(x, y))
# sort by ntl
block.data <-block.data[order(block.data$ntl),]
ntl_vector <- block.data[ , "ntl"]
pop_vector <- block.data[ , "pop"]
#compute gini index
Gini(ntl_vector, pop_vector, unbiased = FALSE)
My issue is with the code inside the function: I do not know how to write it properly (the above code handles a single pair of rasters, while I have hundreds of pairs). Ideally I would get the result (i.e., the Gini coefficient) of every pair in my console or, even better, in a data.frame. The data are here.
library(raster)    # raster(), values(), resample()
library(DescTools) # Gini()
library(purrr)
library(fs)
raster_gini <- function(
.ntl = "ntl_1.tif",
.pop = "pop_1.tif",
.rdgal = TRUE
) {
if(.rdgal) {
ntl = raster(.ntl)
vals_ntl <- as.data.frame(values(ntl))
ntl_coords = as.data.frame(xyFromCell(ntl, 1:ncell(ntl)))
combine <- as.data.frame(cbind(ntl_coords,vals_ntl))
pop<-raster(.pop)
pop = resample(pop, ntl, method = 'bilinear')
vals_pop <- as.data.frame(values(pop))
block.data <- as.data.frame(cbind(combine, vals_pop))
#rename the columns
names(block.data)[3] <- "ntl"
names(block.data)[4] <- "pop"
#remove NA values
block.data <- na.omit(block.data)
#remove the columns x & y
block.data = subset(block.data, select = -c(x, y))
# sort by ntl
block.data <-block.data[order(block.data$ntl),]
ntl_vector <- block.data[ , "ntl"]
pop_vector <- block.data[ , "pop"]
#compute gini index
gini <- Gini(ntl_vector, pop_vector, unbiased = FALSE)
c(ntl = .ntl, pop = .pop, gini = gini)
} else {
c(ntl = .ntl, pop = .pop)
}
}
doc_paths_ntl <- fs::dir_ls("path_to_ntl_raster", glob = "*tif*")
doc_paths_pop <- fs::dir_ls("path_to_pop_raster", glob = "*tif*")
result_df <- purrr::map2_dfr(.x = doc_paths_ntl, .y = doc_paths_pop, .f = raster_gini)
result_df <- result_df |>
dplyr::mutate(ntl = basename(ntl)) |>
dplyr::mutate(pop = basename(pop))
result_df
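Note that pairing the two file lists by position only works if both are sorted the same way. As a hedged sketch (the file names below are made up, and the helper file_id is hypothetical), you could order both lists by the number in the file name before mapping over them:

```r
# Hypothetical file names in arbitrary order
ntl_files <- c("ntl_10.tif", "ntl_2.tif", "ntl_1.tif")
pop_files <- c("pop_1.tif", "pop_2.tif", "pop_10.tif")

# Extract the numeric id from names of the form <prefix>_<number>.tif
file_id <- function(path) {
  as.integer(sub("^.*_([0-9]+)\\.tif$", "\\1", basename(path)))
}

# Sort both lists by id so that element i of one matches element i of the other
ntl_sorted <- ntl_files[order(file_id(ntl_files))]
pop_sorted <- pop_files[order(file_id(pop_files))]

# Stop early if the two folders do not contain the same ids
stopifnot(identical(file_id(ntl_sorted), file_id(pop_sorted)))

pairs <- data.frame(id = file_id(ntl_sorted),
                    ntl = ntl_sorted,
                    pop = pop_sorted,
                    stringsAsFactors = FALSE)
pairs
```

The ntl and pop columns of pairs can then be passed as .x and .y to the mapping call.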
I have multiple datasets with different numbers of variables. I want to apply a correlation function to remove variables whose correlation exceeds 98%. How can I use a loop to apply the correlation function to all datasets at once and store the selected variables in new data frames?
How can I also run lasso regression on multiple datasets using a loop? Thank you
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C<-data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R<-data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E<-data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
# Correlation
library("caret")
library("dplyr")
data.cor <- cor(subset(H, select = -c(X10)))
high.cor <- findCorrelation(data.cor, cutoff=0.98)
remove <- names(H[high.cor])
remove <- c(remove)
myvars <- names(H) %in% remove
var_selected <- H[!myvars]
new_data_H <- var_selected
Here's one way (of several) to do this:
# Correlation
library(caret)
library(dplyr)
set.seed(99)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
# Combine input datasets into a list
inputs <- list(H, C, R, E)
# Empty list to hold results
outputs <- list()
# Loop over each dataset, one at a time
for(df in inputs){
data.cor <- cor(df)
high.cor <- findCorrelation(data.cor, cutoff=0.40)
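# Subset the dataset based on `high.cor`; skip the subsetting when nothing is
# flagged, because df[, -integer(0)] would return zero columns
if(length(high.cor) > 0) df <- df[, -high.cor, drop = FALSE]
# Add the subsetted dataset to the output list of datasets
outputs <- append(outputs, list(df))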
}
# This is the first dataset processed by the loop
outputs[[1]]
# Second...
outputs[[2]]
# Third...
outputs[[3]]
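One thing to watch out for when dropping columns with a negative index vector such as high.cor: if no column is flagged, the index is empty and negative indexing returns no columns at all. A minimal base R sketch of the pitfall and a guard (the toy data and the helper drop_cols are mine, for illustration only):

```r
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Hypothetical helper: drop the columns in `idx`, or keep everything if `idx` is empty
drop_cols <- function(data, idx) {
  if (length(idx) > 0) data[, -idx, drop = FALSE] else data
}

no_drop <- integer(0)

# Pitfall: -integer(0) is still integer(0), which selects zero columns
ncol(toy[, -no_drop, drop = FALSE])

# With the guard, an empty index leaves the data untouched
ncol(drop_cols(toy, no_drop))

# A non-empty index behaves as usual
names(drop_cols(toy, c(1L, 3L)))
```

drop = FALSE also keeps the result a data frame when only one column survives.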
edit: integrating your lasso routine
library(glmnet)
library(caret)
set.seed(99)
## Define data (independent variables)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
inputs <- list(H, C, R, E)
## Define targets (dependent variables)
Y_H <- data.frame(label_1 = replicate(1,sample(20:35, 10, rep = TRUE)))
Y_C <- data.frame(label_2 = replicate(1,sample(15:65, 10, rep = TRUE)))
Y_R <- data.frame(label_3 = replicate(1,sample(25:45, 10, rep = TRUE)))
Y_E <- data.frame(label_4 = replicate(1,sample(21:80, 10, rep = TRUE)))
targets <- list(Y_H, Y_C, Y_R, Y_E)
## Remove correlated independent variables
outputs <- list()
for(df in inputs){
data.cor <- cor(df)
high.cor <- findCorrelation(data.cor, cutoff=0.40)
if(length(high.cor) > 0) df <- df[, -high.cor, drop = FALSE] # keep all columns when nothing is flagged
outputs <- append(outputs, list(df))
}
## Do lasso regression
lasso_cv <- list()
lasso_model <- list()
for(i in seq_along(outputs)){
# each dataset is paired with its own target: outputs[[i]] with targets[[i]]
lasso_cv[[i]] <- cv.glmnet(
as.matrix(outputs[[i]]), as.matrix(targets[[i]]), standardize = TRUE, type.measure = "mse", alpha = 1, nfolds = 3)
# refit at the cross-validated lambda (stored as lambda.min by cv.glmnet)
lasso_model[[i]] <- glmnet(
as.matrix(outputs[[i]]), as.matrix(targets[[i]]), lambda = lasso_cv[[i]]$lambda.min, standardize = TRUE, alpha = 1)
}
Create target variables for each dataframe
Combine all dataframes in a list
Combine all targets in a list
Note: every target variable corresponds to a dataframe
Correlation: delete correlated variables
Performing lasso regression for all lists
Create dataframes
set.seed(99)
H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
Y_H <- data.frame(replicate(1,sample(20:35, 10, rep = TRUE)))
names(Y_H) <- "label_1" # rename the single auto-generated column
Y_C <- data.frame(replicate(1,sample(15:65, 10, rep = TRUE)))
names(Y_C) <- "label_2"
Y_R <- data.frame(replicate(1,sample(25:45, 10, rep = TRUE)))
names(Y_R) <- "label_3"
Y_E <- data.frame(replicate(1,sample(21:80, 10, rep = TRUE)))
names(Y_E) <- "label_4"
inputs <- list(H, C, R, E)
targets <- list(Y_H, Y_C, Y_R, Y_E)
outputs <- list()
for(df in inputs){
data.cor <- cor(df)
high.cor <- findCorrelation(data.cor, cutoff=0.40)
outputs <- append(outputs, list(df[,-high.cor]))
}
library("glmnet")
lasso_cv <- list()
lasso_model <- list()
for(i in outputs){
for(j in targets){
lasso_cv[i] <- cv.glmnet(as.matrix(outputs[[i]]), as.matrix(targets[[j]]),
standardize = TRUE, type.measure="mse", alpha = 1,nfolds = 3)
lasso_model[i] <- glmnet(as.matrix(outputs[[i]]), as.matrix(targets[[j]]),lambda = lasso_cv[i]$lambda_cv, alpha = 1, standardize = TRUE)
}
}
My data is something like this:
ind1 <- rnorm(99)
ind2 <- rnorm(99)
ind3 <- rnorm(99)
ind4 <- rnorm(99)
ind5 <- rnorm(99)
dep <- rnorm(99, mean=ind1)
group <- rep(c("A", "B", "C"), each=33)
df <- data.frame(dep,group, ind1, ind2, ind3, ind4, ind5)
head(df)
dep group ind1 ind2 ind3 ind4 ind5
1 -3.4471651 A -1.61903503 0.8047651 -0.1498872 -0.1421423 0.0003106164
2 1.6097232 A -0.07202246 -1.7419735 -0.4600787 -0.8966089 0.5476866447
3 0.2419204 A 0.37266251 1.1456788 0.2693086 -0.7610068 -0.6588102710
4 0.4157412 A 1.17342107 0.4829935 1.5044331 0.4902491 -0.1845711119
5 -1.1026655 A 0.31428775 0.9488747 0.1806105 0.4187126 -0.4420550339
6 1.9605406 A 1.36719867 1.1116940 0.7026870 0.6450296 -1.0824261091
The next lines of code are intended to run a linear model for every possible combination of independent variables in the data frame.
After selecting the significant models (p < 0.05) and ranking them by AIC, I want to get the following outputs: equation, estimates, R2, adj. R2, p-value, AIC.
I tried it this way, but the output is different from what I wanted. Does anyone have an idea how to get it right?
indvar_list <- lapply(1:5, function(x)
combn(paste0("ind", 1:5), x, , simplify = FALSE))
formulas_list <- rapply(indvar_list, function(x)
as.formula(paste("dep ~", paste(x, collapse="+"))))
mult.lm <- lapply(formulas_list, function(x) glance(lm(as.formula(x),
data = df)))
outputs <- bind_rows(mult.lm, .id = 'index') %>%
filter(p.value < 0.05) %>%
arrange(AIC) %>%
filter(index %in% head(unique(index)))
thanks in advance!
It looks like all you're missing is the equation and the estimates; the rest seems to be exactly what you want. The issue is that you're using glance, which doesn't return them. I typically use summary to get information from a model, but it doesn't include AIC and its output isn't very clean to work with. You can add a second and a third mult.lm list containing the estimates and the coefficient names used in each model, pasted together for simplicity.
mult.lm_est <- lapply(formulas_list, function(x) paste(summary(lm(as.formula(x),
data = df))$coefficients[,1], collapse = ","))
mult.lm_coef <- lapply(formulas_list, function(x) paste(rownames(summary(lm(as.formula(x),
data = df))$coefficients), collapse = ","))
Then you can bind the rows of those lists as you did with your original one, which lets you join them by index with dplyr.
Use base::format to get the equation and broom::tidy to get the estimates (coefficients):
mult.lm <- lapply(formulas_list, function(x) {
mod <- lm(as.formula(x), data = df)
data.frame(model=format(x), tidy(mod), glance(mod),
stringsAsFactors = FALSE, row.names = NULL)
})
# To return a cleaner data frame
mult.lm <- lapply(formulas_list, function(x) {
mod <- lm(as.formula(x), data = df)
tmod <- tidy(mod)
gmod <- glance(mod)
gmod[2:nrow(tmod),] <- ''
data.frame(model=c(format(x), rep('', nrow(tmod)-1)), tmod, gmod,
stringsAsFactors = FALSE, row.names = NULL)
})
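If you prefer not to depend on broom, the same table can be assembled in base R. This is only a sketch under my own assumptions: simulated data with three predictors instead of five, a hypothetical helper summarise_model, and the overall F-test p-value used as the model's p-value.

```r
# Simulated data in the same shape as the question (values are made up)
set.seed(42)
n <- 99
df <- data.frame(ind1 = rnorm(n), ind2 = rnorm(n), ind3 = rnorm(n))
df$dep <- rnorm(n, mean = df$ind1)

# All non-empty combinations of the predictors, as a flat list of character vectors
predictors <- c("ind1", "ind2", "ind3")
combos <- unlist(lapply(seq_along(predictors),
                        function(m) combn(predictors, m, simplify = FALSE)),
                 recursive = FALSE)

# Hypothetical helper: fit one model and return a one-row summary
summarise_model <- function(vars) {
  fml <- reformulate(vars, response = "dep")
  fit <- lm(fml, data = df)
  s <- summary(fit)
  f <- s$fstatistic
  data.frame(
    equation = paste(deparse(fml), collapse = " "),
    estimates = paste(names(coef(fit)), signif(coef(fit), 3),
                      sep = " = ", collapse = ", "),
    r.squared = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    # p-value of the overall F-test
    p.value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    AIC = AIC(fit),
    stringsAsFactors = FALSE
  )
}

model_table <- do.call(rbind, lapply(combos, summarise_model))

# Keep the significant models and rank them by AIC
significant <- model_table[model_table$p.value < 0.05, ]
significant <- significant[order(significant$AIC), ]
significant
```

Each model takes up exactly one row, so filtering and sorting need no joins.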
I am very new to the R language. Here is my code:
apply(DataFrame,2,range)
maxValue <- apply(DataFrame,2,max)
minValue <- apply(DataFrame,2,min)
#not used ...
DataFrame<- as.data.frame(scale(DataFrame, center = minValue,scale = maxValue-minValue))
ind<- sample(1:nrow(DataFrame),2000)
trainDF <- DataFrame[ind,]
testDF <- DataFrame[-ind,]
head(trainDF)
allVars <-colnames(DataFrame)
predictorVars <- allVars[!allVars%in%"TEMP"]
predictorVars <- paste(predictorVars,collapse = "+")
form=as.formula(paste("TEMP~",predictorVars,collapse = "+"))
#form= (paste("TEMP~DEWP+SLP+STP+VISIB+WDSP+MAX+MIN"))
neuralModel <- neuralnet(formula= form, hidden = c(4,2),linear.output = FALSE ,data=trainDF)
plot(neuralModel)
head(testDF)
prediction <- compute(neuralModel, testDF[,1:11])
str(prediction)
prediction <- prediction$net.result*(max(testDF$TEMP)-min(testDF$TEMP))+min(testDF$TEMP)
actualValues <- ((testDF$TEMP)*(max(testDF$TEMP)-min(testDF$TEMP))+min(testDF$TEMP))
When I check prediction[1], or the prediction for any other test row, it always gives me the same value.
I created a data set like this, the reason is I wanted to use a time series based data set
getSymbols("^GSPC")
DF=data.frame(GSPC,DATE=time(GSPC))
PriceChange=(DF$GSPC.Close-DF$GSPC.Open)
DF$Class<-as.factor(ifelse(PriceChange>0,"UP","DOWN"))
DF$year = as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH = as.numeric(format(DF$DATE, format = "%m"))
GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume Class Year Month
1418.03 1429.42 1407.86 1416.60 3429160000 Down 2007 1
1416.60 1421.84 1408.43 1418.34 3004460000 Up 2007 1
Then I replaced the year by a number which can be added on to keep the month index (I guess there is a smarter way of doing this)
DF=data.table(DF)
DF[year==2007,year:=0]
DF[year==2008,year:=12]
DF[year==2009,year:=24]
DF[year==2010,year:=36]
DF[year==2011,year:=48]
DF[year==2012,year:=60]
DF[year==2013,year:=72]
DF[year==2014,year:=84]
DF[year==2015,year:=96]
DF[year==2016,year:=108]
DF$Month_Index=(DF$year+DF$MONTH)
So the data set now has the additional column
Month_Index
01
01
Month_Index=115
Then I used the createTimeSlices from caret to make a growing window prediction.
TimeSlices=createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
for(i in 1:nrow(DF))
{
plsFitTime <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
Prediction=predict(plsFitTime,DF[TimeSlices$test[[i]],])
}
Now I want to save the predictions for each step, together with the proper index and the accuracy. How can I do this?
One of the ways to accomplish this:
library(quantmod)
library(data.table)
library(caret)
getSymbols("^GSPC")
DF <- data.frame(GSPC,DATE=time(GSPC))
PriceChange <- (DF$GSPC.Close-DF$GSPC.Open)
DF$Class <- as.factor(ifelse(PriceChange>0,"UP","DOWN"))
You can create Month indices in the following two ways:
# 1
DF$yearMon <- zoo::as.yearmon(DF$DATE)
DF <- data.table(DF)
DF[, Month_Index:= .GRP, by = yearMon]
# 2
DF$year <- as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH <- as.numeric(format(DF$DATE, format = "%m"))
DF[, Month_Index2 := .GRP, by = .(year, MONTH)]
identical(DF$Month_Index, DF$Month_Index2)
[1] TRUE
Month_Index <- length(unique(DF$Month_Index))
TimeSlices <- createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
Create three empty lists to save your results:
totalSlices <- length(TimeSlices$train)
plsFitTime <- vector("list", totalSlices)
Prediction <- vector("list", totalSlices)
Accuracy <- vector("list", totalSlices)
Save all the results to these lists:
k <- 1:totalSlices
for(i in seq_along(k))
{
plsFitTime[[i]] <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
Prediction[[i]] <- predict(plsFitTime[[i]],
DF[TimeSlices$test[[i]],])
Accuracy[[i]] <- confusionMatrix(Prediction[[i]],
DF[TimeSlices$test[[i]],]$Class)$overall[1]
}
All the models are saved in plsFitTime, predictions in Prediction, and accuracies in Accuracy.
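The pattern itself (preallocate, then fill one slot per slice) does not depend on caret. Here is a minimal base R sketch on a made-up series; the hand-built growing-window indices only imitate the shape of the train/test index lists, and the "model" is just the training mean, so treat all of it as an assumption for illustration:

```r
# Made-up series standing in for the real data
set.seed(1)
y <- cumsum(rnorm(30))

initial_window <- 5L
horizon <- 2L

# Growing-window slices built by hand: train is 1..end, test is the next `horizon` points
starts <- seq_len(length(y) - initial_window - horizon + 1L)
train_idx <- lapply(starts, function(s) seq_len(initial_window + s - 1L))
test_idx <- lapply(starts, function(s) (initial_window + s - 1L) + seq_len(horizon))

# Preallocate one slot per slice
n_slices <- length(train_idx)
predictions <- vector("list", n_slices)
errors <- numeric(n_slices)

for (i in seq_len(n_slices)) {
  # Stand-in "model": forecast the training mean for every test point
  fit_mean <- mean(y[train_idx[[i]]])
  predictions[[i]] <- rep(fit_mean, horizon)
  # Mean absolute error on the test points of this slice
  errors[i] <- mean(abs(y[test_idx[[i]]] - predictions[[i]]))
}

# One row per slice, with its index and its error
results <- data.frame(slice = seq_len(n_slices),
                      train_end = initial_window + starts - 1L,
                      mae = errors)
head(results)
```

Keeping the slice number next to each result makes it easy to trace a prediction back to its training window.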
Update:
A tidier approach would be to use the purrr package.
After creating time slices, you can use:
library(purrr)
customFunction <- function(x, y) {
model <- train(Class~.,
data = DF[x],
method = "pls")
prediction <- predict(model, DF[y])
accuracy <- confusionMatrix(prediction,
DF[y]$Class)$overall[1]
return(list(prediction, accuracy))
}
results <- map2_df(TimeSlices$train, TimeSlices$test, customFunction)
map2_df is a function that takes 2 lists .x and .y as the arguments, applies the function .f on all elements of those lists and returns the results as a dataframe.
You can create the function on the fly (just like lapply), but I created customFunction in the global environment just to keep the code clean.
DF[x] in the function is equivalent to DF[TimeSlices$train[[n]]] and DF[y] is DF[TimeSlices$test[[n]]]
map2_df now does everything that the for loop above did, and returns only the predictions and accuracies for all the models in the form of a dataframe.
class(results)
[1] "tbl_df" "tbl" "data.frame"
dim(results)
[1] 2 109
Each column in results is a list. 109 columns are the results from 109 models.
To access the result of each model (in this case prediction and accuracy) use results$columnName or results[[columnNumber]] .
If you want to store the models as well, just change the return statement in customFunction to include the model: return(list(model, prediction, accuracy))
You can use plyr to gather your results using a list:
results <- plyr::llply(1:length(TimeSlices$train), function(i){
plsFitTime <- train(Class~.,
data = DF[TimeSlices$train[[i]],],
method = "pls")
testData <- DF[TimeSlices$test[[i]],]
Prediction <- predict(plsFitTime, testData)
list(index = i, model = plsFitTime, prediction = Prediction)
})
# The model created for slice no. 3
results[[3]]$model
# ... and its predictions
results[[3]]$prediction
You can add accuracy inside the function passed to llply if you need it.
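As a rough sketch of what that could look like, here is accuracy computed in base R as the share of matching labels. The toy predictions, the actual field and the helper slice_accuracy are made up for illustration; with real data the labels would come from the test slice:

```r
# Hypothetical helper: share of predictions that match the observed class
slice_accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Toy results, one list element per slice (made-up labels)
toy_results <- list(
  list(index = 1, prediction = c("UP", "DOWN", "UP"), actual = c("UP", "UP", "UP")),
  list(index = 2, prediction = c("DOWN", "DOWN"), actual = c("DOWN", "DOWN"))
)

# Add an accuracy element to every slice
toy_results <- lapply(toy_results, function(r) {
  r$accuracy <- slice_accuracy(r$prediction, r$actual)
  r
})

# Collect index and accuracy in a data frame
accuracy_table <- data.frame(
  index = vapply(toy_results, function(r) r$index, numeric(1)),
  accuracy = vapply(toy_results, function(r) r$accuracy, numeric(1))
)
accuracy_table
```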