I'm new in R and coding in general...
I have computed multiple anova analysis on multiple columns (16 in total).
For that purpose, the method "Purr" helped me :
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work, I dont know and did not find how to deal with $ in for loop
Here you have a look of the structure of the output vector
Structure of output
I have tried several times with others methods, more or less complicated. Often the examples found online are too simple and does not allow me to adapt to my data.
Any tips ?
Thank you and sorry for such an noobie question
Whenever I use a loop for an analysis I like to store the results in a data.frame, it allows to keep a good overview. Since you did not provide a reproducible example I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign numers for each annova you want to run, eg. 3
In the loop you can now extract the results of the anova that you are interested in, I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a coulmn for other values you want to add.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
anova_results$p_value[anova_results$number == i-1] <- p
mean <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean quares
anova_results$Mean_Sq[anova_results$number == i-1] <- mean
}
View(anova_results)
Related
I have collected data from a survey in order to perform a choice based conjoint analysis.
I have preprocessed and clean data with python in order to use them in R.
However, when I apply the function dfidx on the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked if there were duplicates through the pandas function final_df.duplicated().sum() and its out put was 0 meaning that there were no duplicates.
Can please some one help me to understand what I am doing wrong ?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for you time
The function dfidx() is build for data frames "for which observations are defined by two (potentialy nested) indexes" (ref).
I don't think this function is build for more than two idxs. Especially that, in your df, there aren't any duplicates ONLY when considering the combinations of the three columns you mention above (resp.id, ques and position).
One solution to this problem is to "combine" the two columns resp.id and ques into one (called for example resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
I'm doing a naive-bayes algorithm in R. The main goal is to predict a variable's value. But in this specific task, I'm trying to see which column is better at predicting it.
This is an example of what works (but in the real dataset doing it manually isn't an option):
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mpg )
predictions <- predict(car_model,mtcars_test)
What I'm having trouble with is performing a for loop, in which the model takes one column at a time, and save how good each model did at predicting the values.
I've looked at different ways to input the columns as something I can iterate over, but couldn't make it work.
My minimum reproducible example of my problem is:
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
for (j in 1:ncol(mtcars)) {
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mtcars_train[,j] )
predictions[j] <- predict(car_model,mtcars_test)
}
The problem is how to replace mpg in the first example with something I can loop over. Things I've tried: mtcars_train$mpg , unlist( mtcars_train[,j] ) , colnames .
I really tried googling this, I hope it's not too silly of a question.
Thanks for reading
This might be helpful. If you want to use a for loop, you can use seq_along with the names of your columns you want to loop through in your dataset. You can use reformulate to create a formula, which would you vsLog in your example, as well as the jth item in your column names. In this example, you can store your predict results in a list. Perhaps this might translate to your real dataset.
pred_lst <- list()
mtcars_names <- names(mtcars_train)
for (j in seq_along(mtcars_names)) {
car_model <- naive_bayes(reformulate(mtcars_names[j], "vsLog"), data=mtcars_train)
pred_lst[[j]] <- predict(car_model, mtcars_test)
}
I recently started programming in R, and am trying to compute slopes for a data set. This is my code:
slopes<- vector()
gdd.values <- length(unique(data.gdd$GDD))
for (i in 1:gdd.values){
subset.data <- data.gdd[which(data.gdd$GDD==i),]
volume <- apply(subset.data[,4,6],1,prod)
species.richness <- apply(subset.data[,7:59],1,sum)
slopes[i] <- lm(log(species.richness) ~ log(volume))$coefficients[2]
}
When I run it the "slopes" value remains empty. All other values are fine (no other empty sets). Let me know if you find any obvious mistakes. Thanks
Currently, you are iterating across the length of unique values and not unique values themselves. So, as #RobJensen comments, adjust the for loop vector and iteration. Hence, why some or all returned values result in missing as subset.data may contain no rows due to imprecise filter.
However, consider a more streamlined approach using the often underused and overlooked by() to subset dataset by needed grouping factor(s) and bind returned list into a vector:
coeff_list <- by(data.gdd, data.gdd$GDD, FUN=function(df) {
volume <- apply(df[,4,6],1,prod)
species.richness <- apply(df[,7:59],1,sum)
lm(log(species.richness) ~ log(volume))$coefficients[2]
})
slopes <- do.call(c, coeff_list)
I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
`year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}`
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?Wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value #
So with your dataset, perhaps putting this in the bottom of your for loop:
pvalue[j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value
I'm completely new to programming and R, but have a dataset that can only be analyzed with a more powerful statistics program such as R.
I have a large but simple dataset consisting of thousands of different groups with multiple samples that I want to compare against the control group with a mann whitney U test, data structure is pictured below.
Group, Measurements
a 0.14534
cont 0.42574
d 0.36347
c 0.14284
a 0.23593
d 0.36347
cont 0.33514
cont 0.29210
b 0.36345
...
The problem comes from that the nature of the test requires that only two groups are designated. However, as I have more than 1 group it does not work.
This is what I have so far and I as you see it does not work in a repeated fashion and only works if I have two groups in my input file.
data1 = read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)
attach(data1)
testoutput <- wilcox.test(group ~ measurement, mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
write.table(testoutput$p.value, file="mwUtest.tsv", sep="\t")
How do I do write and loop the test properly for it to test all my groups against my designated control group? I assume the sapply or lapply functions are used before the wilcox.test, but I dont know how.
I'm sorry if this simple question has been brought up before, but I could not find any previous question regarding this specific problem.
In R, there's often many solutions for the same problem. Here's how I would solve this.
First, I would split my data and have one dataframe with experiments and one with controls:
experiments <- dat[dat$group!="cont",]
controls <- dat[dat$group=="cont",]
Then I would split my experimental data by group, and feed that to my test together with my control measurements. Note that this construction makes it easy to extract more values from the test: just return a (named) vector.
result <- lapply(split(experiments, experiments$group),function(x){
mytest = wilcox.test(x$measurement,controls$measurement,mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
return(mytest$p.value)
})
Combining to a table is then easy:
output <- do.call(rbind,result)
Data used:
set.seed(123)
nobs=100
dat <- data.frame(group=sample(c(LETTERS[1:6],"cont"),nobs,T),
measurement=runif(nobs),stringsAsFactors=F)