I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward
Define the data set and an empty imputation data set.
For column 1-5, fit a model.
Extract model estimates to be used for imputation.
Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and use the lapply() function. However, this does not work due to how model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class ""gmm"" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
what about using subset() ?
myData$V12 = as.factor(myData$V12)
listofresults= c()
for (i in levels(myData$V12)){
data = subset (myData, myData$V12 == i)
#your analysis here: result saved in myDataImp
listofresults = c(listofresults, myDataImp)
}
not the most elegant, but should work.
Related
I'm new in R and coding in general...
I have computed multiple anova analysis on multiple columns (16 in total).
For that purpose, the method "Purr" helped me :
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work, I dont know and did not find how to deal with $ in for loop
Here you have a look of the structure of the output vector
Structure of output
I have tried several times with others methods, more or less complicated. Often the examples found online are too simple and does not allow me to adapt to my data.
Any tips ?
Thank you and sorry for such an noobie question
Whenever I use a loop for an analysis I like to store the results in a data.frame, it allows to keep a good overview. Since you did not provide a reproducible example I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign numers for each annova you want to run, eg. 3
In the loop you can now extract the results of the anova that you are interested in, I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a coulmn for other values you want to add.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
anova_results$p_value[anova_results$number == i-1] <- p
mean <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean quares
anova_results$Mean_Sq[anova_results$number == i-1] <- mean
}
View(anova_results)
I'm doing a naive-bayes algorithm in R. The main goal is to predict a variable's value. But in this specific task, I'm trying to see which column is better at predicting it.
This is an example of what works (but in the real dataset doing it manually isn't an option):
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mpg )
predictions <- predict(car_model,mtcars_test)
What I'm having trouble with is performing a for loop, in which the model takes one column at a time, and save how good each model did at predicting the values.
I've looked at different ways to input the columns as something I can iterate over, but couldn't make it work.
My minimum reproducible example of my problem is:
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
for (j in 1:ncol(mtcars)) {
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mtcars_train[,j] )
predictions[j] <- predict(car_model,mtcars_test)
}
The problem is how to replace mpg in the first example with something I can loop over. Things I've tried: mtcars_train$mpg , unlist( mtcars_train[,j] ) , colnames .
I really tried googling this, I hope it's not too silly of a question.
Thanks for reading
This might be helpful. If you want to use a for loop, you can use seq_along with the names of your columns you want to loop through in your dataset. You can use reformulate to create a formula, which would you vsLog in your example, as well as the jth item in your column names. In this example, you can store your predict results in a list. Perhaps this might translate to your real dataset.
pred_lst <- list()
mtcars_names <- names(mtcars_train)
for (j in seq_along(mtcars_names)) {
car_model <- naive_bayes(reformulate(mtcars_names[j], "vsLog"), data=mtcars_train)
pred_lst[[j]] <- predict(car_model, mtcars_test)
}
CONTEXT:
I'm working with respondent-level survey data. Each row of my
data frame represents an individual person's survey responses.
My data frame consists of individual-level utility estimates from a Maximum Difference experiment AND
categorical variables indicating in which of several subgroups an individual survey respondent
resides.
Each subgroup variable is a single categorical variable with exactly two levels. However, in my
desired output, I'd like a data frame where each level of each subgroup has its own column.
OBJECTIVE:
I want to create a function that, for each user-defined subgroup, will conduct recursive T Tests over
every Maximum Difference item in the data frame, extract elements of the T Test output, and store the
elements in a data frame
Using T statistic results as an example, the end result should look like this:
Males_T_stat Females_T_stat
MD_item1 2.71 2.5
MD_item2 1.71 1.5
MD_item3 0.71 0.5
CURRENT CODE:
Right now, I'm focused on writing code to iteratively execute the T Tests and store each test's entire
output object in a list. The code I've used to, unsuccessfully, attempt this is below:
Create a test data frame:
dat <- data.frame(
md1 = 1:60,
gender = factor(rep(c("m", "f"), 30)),
generation = factor(rep(c("a", "b"), 30)),
md2 = 61:120
)
Specify the names of my respondent subgroup (i.e., the categorical variables).
groupnames <- c("gender", "generation")
item_vec <- dat %>% select(contains(("md")))
group_vec <- dat[groupnames]
Convert the subgroup name vectors to data frames. this step may be superfluous, but I'm more comfortable working with data frames.
item_vec <- data.frame(item_vec)
group_vec <- data.frame(group_vec)
So far, I've tried using nested for loops to run the T Tests and store each test output in a list. This code partially works; for each subgroup named in "group_vec", the code produces T Test results for the last item in "item_vec" only. However, I want the results for EVERY item in "item_vec", which is where I've currently stalled.
res <- list()
for (i in 1:length(group_vec)) {
res[[i]] <- list(test)
for (j in 1:length(item_vec)) {
test <- (t.test(item_vec[[j]] ~ group_vec[[i]]))
res[i] <- list(test)
}
}
res
Thank you in advance for any help you can provide!
In the nested loop, replace
res[i] <- list(test)
with
res[[i]][[j]] <- list(test)
as the 'j' is loop over the item_vec. If we just assign it to res[[i]] or res[i], for every item_vec in the 'group_vec', it just updates/replace the previous with the next and as there is nothing to update after the last, the last one remains for each 'group_vec'
Also, it may be better to initialize res as
res <- vector('list', length(group_vec))
and then make the changes as in the for loop
for (i in 1:length(group_vec)) {
res[[i]] <- list(test)
for (j in 1:length(item_vec)) {
test <- (t.test(item_vec[[j]] ~ group_vec[[i]]))
res[[i]][[j]] <- list(test)
}
}
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function, so iterate over every feature (aka column), check it's scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately my code seems to convert every feature into a character vector and as consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, income)
I recently started programming in R, and am trying to compute slopes for a data set. This is my code:
slopes<- vector()
gdd.values <- length(unique(data.gdd$GDD))
for (i in 1:gdd.values){
subset.data <- data.gdd[which(data.gdd$GDD==i),]
volume <- apply(subset.data[,4,6],1,prod)
species.richness <- apply(subset.data[,7:59],1,sum)
slopes[i] <- lm(log(species.richness) ~ log(volume))$coefficients[2]
}
When I run it the "slopes" value remains empty. All other values are fine (no other empty sets). Let me know if you find any obvious mistakes. Thanks
Currently, you are iterating across the length of unique values and not unique values themselves. So, as #RobJensen comments, adjust the for loop vector and iteration. Hence, why some or all returned values result in missing as subset.data may contain no rows due to imprecise filter.
However, consider a more streamlined approach using the often underused and overlooked by() to subset dataset by needed grouping factor(s) and bind returned list into a vector:
coeff_list <- by(data.gdd, data.gdd$GDD, FUN=function(df) {
volume <- apply(df[,4,6],1,prod)
species.richness <- apply(df[,7:59],1,sum)
lm(log(species.richness) ~ log(volume))$coefficients[2]
})
slopes <- do.call(c, coeff_list)