First of all, thank you very much for your interest and time. My question (using R):
To predict the yvar, I have run a lasso regression which reduced the set of xvariables from 736 to 30.
lasso.mod =glmnet(x,y,alpha=1)
cv.out =cv.glmnet (x,y,alpha=1)
lasso.bestlam =cv.out$lambda.min
tmp_coef = coef(cv.out,s=lasso.bestlam)
varnames = data.frame(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
mylist = list(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
Hence, I have the remaining variable names as a data frame and also as a list.
How is it possible to create a new data frame which has these remaining 30 variables and their observations in it? In other words: How can I get a subset of my original data which does not contain 737 variables but only 31?
I think this should be quite easy, but I have spent more than two hours on it and never got it to work...
Best wishes,
Thomas
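I cannot test this as I do not have the data, but this should do the trick. Note that tmp_coef@i holds zero-based row indices, so add 1 before indexing the names, and drop "(Intercept)", which is not a column of x:
varnames <- setdiff(tmp_coef@Dimnames[[1]][tmp_coef@i + 1], "(Intercept)")
as.data.frame(cbind(x[, varnames], y))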
tmp_coef@i holds zero-based row indices, so the names of the remaining variables are tmp_coef@Dimnames[[1]][tmp_coef@i + 1]. That vector also contains "(Intercept)" as its first item. If you discard it with [-1], you can extract the columns:
x <- as.data.frame(x[, tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]])
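Even simpler, you can use the indices in tmp_coef@i directly. They are zero-based and the intercept occupies row 0, so after dropping the first entry the remaining values are exactly the column numbers of x:
x <- as.data.frame(x[, tmp_coef@i[-1]])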
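For anyone without the original data, here is a minimal sketch of the same column selection on a made-up coefficient vector. The toy matrix x, its column names a to d and the hand-built tmp_coef are assumptions standing in for the real coef(cv.out, s = lasso.bestlam) output; only the Matrix package is used, not glmnet.

```r
library(Matrix)  # recommended package, normally installed with R

set.seed(1)
# Toy predictors (hypothetical): 5 observations, 4 named columns
x <- matrix(rnorm(20), nrow = 5, dimnames = list(NULL, c("a", "b", "c", "d")))

# Hand-built stand-in for the sparse coefficient column returned by coef():
# an intercept plus non-zero coefficients for b and d
tmp_coef <- Matrix(c(0.5, 0, 1.2, 0, -0.7), ncol = 1, sparse = TRUE,
                   dimnames = list(c("(Intercept)", colnames(x)), "s1"))

tmp_coef@i                                   # zero-based rows of the non-zero entries
kept <- rownames(tmp_coef)[tmp_coef@i + 1]   # shift to one-based indexing for R
kept <- setdiff(kept, "(Intercept)")         # the intercept is not a column of x

# Subset the predictors to the selected variables
x_small <- as.data.frame(x[, kept, drop = FALSE])
str(x_small)
```

Selecting by name rather than by position also keeps working if the intercept happens to be exactly zero and is therefore missing from tmp_coef@i.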
I'm new to R and to coding in general...
I have run ANOVAs on multiple columns (16 in total).
For that purpose, the purrr package helped me:
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve the output (p-value, etc.) is to call summary() on each element by hand:
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop:
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work. I don't know, and could not find out, how to deal with $ in a for loop.
Here is a look at the structure of the output object:
[image: structure of output]
I have tried several times with other methods, more or less complicated. Often the examples found online are too simple and do not allow me to adapt them to my data.
Any tips?
Thank you, and sorry for such a newbie question.
Whenever I use a loop for an analysis, I like to store the results in a data.frame, which makes it easy to keep a good overview. Since you did not provide a reproducible example, I used the iris dataset:
data("iris")
# make a data frame to store the results, with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
# one column per value you want to store and one row per ANOVA you want to run
x <- c("number", "Mean_Sq", "p_value") # all values you want to store, as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 # number each ANOVA you want to run, e.g. 3
In the loop you can now extract the results of the ANOVA that you are interested in. I use the mean squares and the p-value as an example, but you can of course add others. Don't forget to add a column for each additional value you want to store.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] # extract the p-value
anova_results$p_value[anova_results$number == i-1] <- p
mean_sq <- summary(my_anova)[[1]][["Mean Sq"]][1] # extract the mean squares (named mean_sq so it does not mask mean())
anova_results$Mean_Sq[anova_results$number == i-1] <- mean_sq
}
View(anova_results)
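As for the $ in the loop from the question: a list element can be reached by position with [[i]], so no $ is needed. Here is a minimal sketch, assuming iris as a stand-in for df_anova_ch and base lapply in place of purrr::map (both return a named list of aov fits):

```r
# Fit one ANOVA per numeric column of iris, grouped by Species
# (iris stands in for df_anova_ch; lapply plays the role of purrr::map)
anova_results <- lapply(iris[1:4], function(col) aov(col ~ iris$Species))

# Inside a loop, index the list with [[i]] rather than $
for (i in seq_along(anova_results)) {
  cat("\n", names(anova_results)[i], "\n")
  print(summary(anova_results[[i]]))  # print() is needed inside a loop
}

# Or collect the first p-value of every fit into a named vector in one step
p_values <- sapply(anova_results, function(fit) summary(fit)[[1]][["Pr(>F)"]][1])
p_values
```

Note that a bare summary(...) inside a for loop prints nothing, which is a second reason the original loop appeared not to work.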
I have a dataset with 61 columns (60 explanatory variables and 1 response variable).
All the explanatory variables are numerical, and the response (Default) is categorical. Some of the explanatory variables have negative values (financial data), so it seems more sensible to standardize than to normalize. However, when standardizing with the "apply" function, I have to remove the response variable first, so I do:
model <- read.table......
modelwithnoresponse <- model
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse,2,mean)
standarddeviations <- apply(modelwithnoresponse,2,sd)
modelSTAN <- scale(modelwithnoresponse,center=means,scale=standarddeviations)
So far so good, the data is standardized. However, now I would like to add the response variable back to "modelSTAN". I've seen some posts on dplyr, merge functions and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of "modelSTAN".
Does anyone have a good solution to this, or maybe another workaround to standardize the data without removing the response variable first?
I'm quite new to R, as I'm a finance student and took R as an elective.
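If you want to add the column model$Default to modelSTAN, note that scale() returns a matrix, so convert it to a data frame first. Then you can do it like this:
# scale() returns a matrix; convert it before adding a non-numeric column
modelSTAN <- as.data.frame(modelSTAN)
# assign the column directly
modelSTAN$Default <- model$Default
# or use cbind for columns (rbind is for rows)
modelSTAN <- cbind(modelSTAN, Default = model$Default)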
However, you don't need to remove it at all. Here's an alternative:
modelSTAN <- model
## get index of the response, named Default in the question
resp <- which(names(modelSTAN) == "Default")
## standardize all the non-response columns
means <- colMeans(modelSTAN[-resp])
sds <- apply(modelSTAN[-resp], 2, sd)
modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
If you're interested in dplyr:
library(dplyr)
modelSTAN <- model %>%
mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
Note that in the dplyr version I didn't bother saving the original means and SDs. You should still do that if you want to back-transform later. By default, scale will use the mean and sd of each column.
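To make the point about saving the means and SDs concrete, here is a minimal base R sketch on made-up data. The toy data frame and its columns x1 and x2 are assumptions; only the response name Default comes from the question.

```r
# Hypothetical toy data: two numeric predictors and a categorical response
model <- data.frame(
  x1 = c(-2, 0, 1, 5),
  x2 = c(10, 20, 30, 40),
  Default = factor(c("no", "yes", "no", "yes"))
)

# Save the column means and SDs of everything except the response
is_resp <- names(model) == "Default"
means <- colMeans(model[!is_resp])
sds <- sapply(model[!is_resp], sd)

# Standardize in place; the response column is never touched
modelSTAN <- model
modelSTAN[!is_resp] <- scale(model[!is_resp], center = means, scale = sds)

# Back-transform with the saved values to check that the original is recovered
restored <- sweep(sweep(as.matrix(modelSTAN[!is_resp]), 2, sds, "*"), 2, means, "+")
all.equal(unname(restored), unname(as.matrix(model[!is_resp])))
```

The same means and sds can later be passed to scale() to standardize new observations on the scale of the training data.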
my question is a follow-up to this question on imputation by group using "mice":
multiple imputation and multigroup SEM in R
The code in the answer works fine as far as the imputation part goes. But afterwards I am left with a list that does contain complete data, but more than one set of it. The sample looks as follows:
'Set up data frame'
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
'Introduce NAs'
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
df
'Impute values by group:'
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
df.clean
As you can see, df.clean is a list of 3, one element per group. But each element already contains a complete data set of the kind I am looking for.
The original answer suggests using rbind() on the data in df.clean, which leaves me with a new data set of 45 observations (3x the original size).
Here is the original code for the last step:
imputed.both <- do.call(args = df.clean, what = rbind)
Which data is the "right" one? And why the last step?
Thanks a bunch!
There's a bug in the code. I have an edited version below that works:
#Set up data frame
set.seed(12345)
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
#Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
# check NAs
colSums(is.na(df))
#Impute values by group:
# the bug was here: mice() must be called on x, not on df
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
imputed.both <- do.call(args = df.clean, what = rbind)
dim(imputed.both)
# returns 15,4
In the code in the question, you have
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
dim(do.call(rbind,df.clean))
#this returns 45,4
The function is specified with "x" but you call df from the global environment. Hence you impute on the complete df.
So to answer your question, if you do this step:
split(df,df$ID)
You split your data frame into a list of data frames, each containing only the rows with ID A, B or C. If you then lapply over this list, you get
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
names(df.clean)
lapply(df.clean,dim)
Each item of the list df.clean contains a subset of the original df, with ID being A, B or C. Now you combine this list into a single data.frame using:
imputed.both <- do.call(rbind,df.clean)
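The effect of the bug can be reproduced without mice. Below is a minimal sketch of the same split, lapply and rbind pattern, where the hypothetical helper fill_mean (a simple mean fill, not a substitute for multiple imputation) stands in for mice::complete(mice(...)) and the toy data are made up:

```r
# Toy data (hypothetical): three groups of five rows, one NA in each group
df <- data.frame(ID = rep(c("A", "B", "C"), each = 5),
                 x1 = c(1, 2, NA, 4, 5, 10, NA, 30, 40, 50, 100, 200, 300, 400, NA))

# Stand-in for mice: replace NAs with the mean of the data frame it is given
fill_mean <- function(d) {
  d$x1[is.na(d$x1)] <- mean(d$x1, na.rm = TRUE)
  d
}

pieces <- split(df, df$ID)                          # list of three 5-row data frames
right <- lapply(pieces, function(x) fill_mean(x))   # uses the group subset x
wrong <- lapply(pieces, function(x) fill_mean(df))  # ignores x, uses all of df

dim(do.call(rbind, right))  # 15 rows: each group imputed once
dim(do.call(rbind, wrong))  # 45 rows: the full df repeated three times
```

In the correct version, the missing value in group A is filled from group A only, which is the whole point of imputing by group.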
I have a data.frame of 373127 obs. of 193 variables. Some variables are factors, and I want to use dummyVars() to split each factor level into its own column. I then want to merge the separate dummy variable columns back into my original data.frame, so I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
dummies.var1 <- dummyVars(~ dat1$factor.var1 -1, data = dat1)
})
Thanks!
You can do the following that will create a new df, trsf, but you could always reassign back to the original df:
library(caret)
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
The real answer is .... Don't do that. It's almost never necessary.
You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by @user30257, it is hard to see why you want to do it. In general, modeling tools in R don't need dummy vars, but deal with factors directly.
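As a small illustration of that last point, here is a sketch on made-up data (the three-level factor x and the response y are assumptions): lm() accepts the factor as it is and builds the indicator columns internally, while model.matrix() produces the explicit dummy columns if you really need them.

```r
set.seed(42)
# Hypothetical data: a three-level factor and a numeric response
df <- data.frame(x = factor(rep(c("A", "B", "C"), each = 4)), y = rnorm(12))

fit <- lm(y ~ x, data = df)   # the factor is passed as-is
colnames(model.matrix(fit))   # the design matrix lm() built internally

# The explicit full dummy matrix, one indicator column per level
dummies <- model.matrix(~ x - 1, data = df)
head(dummies)
```

The internal design matrix drops one level as the reference category, whereas the "- 1" formula keeps an indicator for every level.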
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix and cBind from the Matrix package (recent versions of Matrix deprecate cBind in favor of plain cbind).
This is a follow-up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group-specific covariance matrices for them (based on the variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
>MMatrix = MMatrix2[1:2187,4:10]
This worked fine for calculating an overall covariance matrix with:
>cov(MMatrix, use="pairwise.complete.obs",method="pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
>CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
>cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, but it still only returned NULLs. I read somewhere that "pairwise.complete.obs" can only be used if method = "pearson", but adding that at the end didn't make a difference either. I need covariance matrices of these variables by group, with all the available data included if possible, and I am completely stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
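If the grouping variable lives in the same data frame as the measurements, as in the question, the same result can be sketched with split and lapply. This is only an illustration on the fake data from above; the data frame dat and the column name Group are assumptions about how the real data are laid out.

```r
set.seed(1)
# Fake data in the same shape as above: 1000 rows, 6 variables, random NAs
m <- matrix(runif(6000), ncol = 6,
            dimnames = list(NULL, c("HML", "RML", "FML", "TML", "HFD", "BIB")))
m[sample(6000, 500)] <- NA
dat <- data.frame(Group = gl(4, 250, labels = paste("group", 1:4)), m)

# Split the six variables by Group, then compute one covariance matrix each,
# using every pair of observations that is complete for that pair of variables
vars <- setdiff(names(dat), "Group")
cov_list <- lapply(split(dat[vars], dat$Group), cov,
                   use = "pairwise.complete.obs")

names(cov_list)
round(cov_list[["group 1"]], 3)
```

Excluding the grouping column by name, rather than by position as in CovDataM[, -1], avoids passing a non-numeric column to cov by accident.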
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.