Transfer of columns from one data frame to another under conditional statement - r

I have this code which was mostly written by one of the members here that exports all the graphs I need from my data set under the condition that the trendline coefficient is positive (increasing trendline).
lung <- read.csv("LAC.csv")
attach(lung) #data
age <- lung$Age
mirna <- data.frame(lung)
stuff <- data.frame(matrix(ncol = 500, nrow = 40))
pdf("test.pdf") # exports to pdf all the graphs
lapply(colnames(mirna)[-1],function(col){ #function for plotting
form <- formula(paste(col, "age", sep = "~"))
fit <- lm(form, mirna)
stuff_want <- stuff
if (coef(fit)[2] >0) { #plotting with condition
plot(form, df, xlab = "Age", main= "miRNA expression with increasing age")
abline(fit, col = 4)
}
})
dev.off()
This gives me a PDF file, which I had hoped to use later to check which of the miRNAs in the dataset are required and then isolate those columns manually. However, I severely underestimated the number of miRNAs that meet the condition, and I now face a new conundrum: how to export the data from each column with an increasing trendline into a separate data frame, which I would later save as a .csv file and use for further analysis.
Please keep in mind my knowledge of R is very limited, although I am spending days in Rhelp and books. My idea was to create a separate data frame (stuff_want) to which the columns that satisfy the condition (coef(lm()) > 0) would be transferred. My initial thought was to use the append() function and, under the if condition, write append(stuff_want, mirna, after = length(mirna)), followed by the write.csv() function. The output of this is just an NA-filled .csv file.
Anyone able to explain to me why this is not working?
All the best,
Paulius

So here is one way (similar to @agstudy's comment), using the same made-up data as in my previous answer:
# make up some data
x <- seq(1,10,len=100)
set.seed(1) # for reproducible example
df <- data.frame(x,y1=1+2*x+rnorm(100),
y2=3-4*x+rnorm(100),
y3=2+0.001*x+rnorm(100))
# you start here...
result <- sapply(colnames(df)[-1],function(col){
form <- formula(paste(col,"x",sep="~"))
fit <- lm(form,df)
if(coef(fit)[2] > 0) TRUE else FALSE
})
cols <- names(result)[result]
cols
# [1] "y1" "y3"
This creates a named logical vector, result, whose elements have the same names as your response variables: TRUE if that variable has a positive slope, FALSE otherwise. Then
cols <- names(result)[result]
is a vector of the variable names with slope > 0. Finally, to extract the actual data, index the data frame that holds the measurements (not the empty placeholder stuff, which contains only NA):
stuff_want <- df[,cols] # with your data: mirna[,cols]
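As for why the append() attempt produced only NA: anything modified inside the function passed to lapply() is local to that call and is discarded when it returns, and append() returns a new object rather than changing stuff_want in place, so the file was written from the untouched NA placeholder. Below is a minimal sketch of the whole workflow on made-up data. The column names mir_a and mir_b and the temporary output file are assumptions for illustration only, not taken from the original data set.

```r
# Made-up data: one predictor (age) and two hypothetical miRNA columns
set.seed(1)
age <- seq(20, 80, length.out = 40)
mirna <- data.frame(age,
                    mir_a = 1 + 0.5 * age + rnorm(40),  # increasing with age
                    mir_b = 3 - 0.5 * age + rnorm(40))  # decreasing with age

# Fit one linear model per response column and keep the slope
responses <- setdiff(names(mirna), "age")
slopes <- sapply(responses, function(col) {
  fit <- lm(reformulate("age", response = col), data = mirna)
  unname(coef(fit)[2])
})

# Keep age plus every column whose slope is positive
stuff_want <- mirna[, c("age", names(slopes)[slopes > 0]), drop = FALSE]

# Write the selected columns to a .csv file (a temporary file in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(stuff_want, out_file, row.names = FALSE)
```

Selecting the columns after the loop, from a vector of slopes, avoids having to grow a data frame from inside the plotting function.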

Related

Print multiple Outputs stored in a vector with a Loop

I'm new in R and coding in general...
I have computed multiple ANOVA analyses on multiple columns (16 in total).
For that purpose, the purrr package helped me:
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work. I don't know, and could not find, how to deal with $ in a for loop.
Here is a look at the structure of the output list:
[image: structure of output]
I have tried several times with other methods, more or less complicated. Often the examples found online are too simple and do not allow me to adapt them to my data.
Any tips?
Thank you, and sorry for such a newbie question.
Whenever I use a loop for an analysis, I like to store the results in a data.frame, since that keeps a good overview. Since you did not provide a reproducible example, I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign a number to each anova you want to run, e.g. 3
In the loop you can now extract the results of the ANOVA that you are interested in. I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a column for each other value you want to store.
for (i in 2:4){
  my_anova <- aov(iris[[1]] ~ iris[[i]])
  p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
  anova_results$p_value[anova_results$number == i-1] <- p
  mean_sq <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean squares (avoid naming it "mean", which masks the base function)
  anova_results$Mean_Sq[anova_results$number == i-1] <- mean_sq
}
}
View(anova_results)
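On the original loop itself: $ expects a literal name, so a list element is selected by position or by a name held in a variable with [[i]], and nothing is auto-printed inside a for loop, so the summary has to be wrapped in print(). Here is a small sketch on iris, with Species standing in for the grouping variable (an assumption, since the original data were not posted):

```r
data("iris")

# One ANOVA per numeric column, stored in a named list
# (Species is a stand-in for the asker's grouping variable)
anova_list <- lapply(iris[, 2:4], function(y) aov(y ~ iris$Species))

# Use [[i]] instead of $ and print explicitly inside the loop
for (i in seq_along(anova_list)) {
  cat("\n", names(anova_list)[i], "\n")
  print(summary(anova_list[[i]]))
}

# Or collect just the p values in a named vector
p_values <- sapply(anova_list, function(fit) summary(fit)[[1]][["Pr(>F)"]][1])
```

The same pattern should carry over to a list built with purrr::map(), since that also returns a named list.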

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code using
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 id vars) in order to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = FALSE)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.
You can keep column names by feeding them into lapply. Here's an example with the iris dataset:
library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  df <- data.frame(datatoplot=iris[[columntoplot]]) # one-column data frame
  graphname <- columntoplot
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram(bins = 30) +
    ggtitle(graphname)
  # pass the plot explicitly rather than relying on the last plot created
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})
Inside the function passed to lapply, you create a new data frame comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see the ggsave line). This lets you use the column name as the plot title as well as the file name.
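If a plain for loop is preferred, one option is to loop over the names of the chosen column range rather than over numbers. This is only a sketch on made-up data; the small data frame and the per-column mean are placeholders for whatever the real loop body does.

```r
# Made-up wide data: two id columns followed by ten numeric columns (X1..X10)
set.seed(123)
df <- data.frame(GroupID = 1:6,
                 Category = rep(c("a", "b"), 3),
                 matrix(rnorm(6 * 10), nrow = 6))

col_range <- 3:10          # only these columns are visited
col_means <- numeric(0)    # results, named by column

for (nm in names(df)[col_range]) {
  # nm is the column name; df[[nm]] is the column's data
  col_means[nm] <- mean(df[[nm]])
}

col_means
```

Because the loop variable is the name itself, the results stay labelled and there is no need to reshape the data to long format first.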

R nested for-loop only producing result with last column in dataframe

CONTEXT:
I'm working with respondent-level survey data. Each row of my
data frame represents an individual person's survey responses.
My data frame consists of individual-level utility estimates from a Maximum Difference experiment AND
categorical variables indicating in which of several subgroups an individual survey respondent
resides.
Each subgroup variable is a single categorical variable with exactly two levels. However, in my
desired output, I'd like a data frame where each level of each subgroup has its own column.
OBJECTIVE:
I want to create a function that, for each user-defined subgroup, will conduct recursive T Tests over
every Maximum Difference item in the data frame, extract elements of the T Test output, and store the
elements in a data frame
Using T statistic results as an example, the end result should look like this:
Males_T_stat Females_T_stat
MD_item1 2.71 2.5
MD_item2 1.71 1.5
MD_item3 0.71 0.5
CURRENT CODE:
Right now, I'm focused on writing code to iteratively execute the T Tests and store each test's entire
output object in a list. The code I've used to, unsuccessfully, attempt this is below:
Create a test data frame:
dat <- data.frame(
md1 = 1:60,
gender = factor(rep(c("m", "f"), 30)),
generation = factor(rep(c("a", "b"), 30)),
md2 = 61:120
)
Specify the names of my respondent subgroup (i.e., the categorical variables).
groupnames <- c("gender", "generation")
item_vec <- dat %>% select(contains(("md")))
group_vec <- dat[groupnames]
Convert the subgroup name vectors to data frames. this step may be superfluous, but I'm more comfortable working with data frames.
item_vec <- data.frame(item_vec)
group_vec <- data.frame(group_vec)
So far, I've tried using nested for loops to run the T Tests and store each test output in a list. This code partially works; for each subgroup named in "group_vec", the code produces T Test results for the last item in "item_vec" only. However, I want the results for EVERY item in "item_vec", which is where I've currently stalled.
res <- list()
for (i in 1:length(group_vec)) {
res[[i]] <- list(test)
for (j in 1:length(item_vec)) {
test <- (t.test(item_vec[[j]] ~ group_vec[[i]]))
res[i] <- list(test)
}
}
res
Thank you in advance for any help you can provide!
In the nested loop, replace
res[i] <- list(test)
with
res[[i]][[j]] <- test
because 'j' loops over item_vec. If you assign to res[[i]] or res[i], each item in item_vec overwrites the result of the previous one, so only the result for the last item remains for each variable in group_vec.
Also, the line res[[i]] <- list(test) at the top of the outer loop refers to test before it has been created, so initialize the inner list there instead. It may also be better to initialize res as
res <- vector('list', length(group_vec))
and then make the changes in the for loop:
for (i in seq_along(group_vec)) {
  res[[i]] <- vector('list', length(item_vec)) # one slot per item
  for (j in seq_along(item_vec)) {
    res[[i]][[j]] <- t.test(item_vec[[j]] ~ group_vec[[i]])
  }
}
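To get from the stored tests to a table of T statistics, a nested sapply() over the variable names may be enough. The sketch below uses the test data frame from the question. Note that a two-sample t test yields one statistic per grouping variable, so the columns here are the grouping variables rather than their individual levels; that layout is an assumption about what is wanted.

```r
dat <- data.frame(
  md1 = 1:60,
  gender = factor(rep(c("m", "f"), 30)),
  generation = factor(rep(c("a", "b"), 30)),
  md2 = 61:120
)

groupnames <- c("gender", "generation")
itemnames <- grep("md", names(dat), value = TRUE)

# Rows: MaxDiff items; columns: grouping variables
t_stats <- sapply(groupnames, function(g) {
  sapply(itemnames, function(item) {
    unname(t.test(dat[[item]] ~ dat[[g]])$statistic)
  })
})
t_stats <- as.data.frame(t_stats)
t_stats
```

Other elements of the test object, such as p.value or estimate, can be pulled out the same way by changing what the inner function returns.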

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
  allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
  allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
  allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
  breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
  breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
  unq_breeder_list_males <- unique(breeder_list_males)
  unq_breeder_list_females <- unique(breeder_list_females)
  all_poss_combo <-list()
  for(i in unq_breeder_list_males){
    print(i)
    all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)
  }
  lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
  all_poss_df<-unlist(all_poss_combo, use.names = F)
  all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
  all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
  all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
  all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
  all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')
}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value # extract just the p value
So with your dataset, perhaps put this at the bottom of your for loop. Note that j is a year (2000, 2001, ...), not a row position, so match on the year column of results:
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value
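For reference, here is a stripped-down sketch of the storage pattern alone, with random numbers standing in for the observed and all-possible relatedness values (the real data-wrangling steps are omitted, and the sample sizes are arbitrary). Each p value is written to the row of results whose year matches the loop variable.

```r
set.seed(42)
years <- 2000:2013
results <- data.frame(year = years, pvalue = NA_real_)

for (j in years) {
  # Placeholders for the two relatedness vectors built for year j
  observed <- rnorm(15, mean = 0.1)
  possible <- rnorm(40, mean = 0)

  p <- wilcox.test(observed, possible, alternative = "greater")$p.value

  # Match on the year value, not on the row position
  results$pvalue[results$year == j] <- p
}

results
```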

'R', 'mice', missing variable imputation - how to only do one column in sparse matrix

I have a matrix that is half-sparse. Half of all cells are blank (na) so when I try to run the 'mice' it tries to work on all of them. I'm only interested in a subset.
Question: In the following code, how do I make "mice" only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
for (i in 3:50){
my_data[,i] <- rnorm(n = length(x))
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
my_data[idx_na2,i] <- NA
}
#imputation
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
help document (link)
google of course...
https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s
I think what you are looking for can be done by modifying the parameter "where" of the mice function. The parameter "where" is equal to a matrix (or dataframe) with the same size as the dataset on which you are carrying out the imputation. By default, the "where" parameter is equal to is.na(data): a matrix with cells equal to "TRUE" when the value is missing in your dataset and equal to "FALSE" otherwise. This means that by default, every missing value in your dataset will be imputed. Now if you want to change this and only impute the values in a specific column (in my example column 2) of your dataset you can do this:
# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Set to FALSE every column other than the one you want to impute (let's say column 2)
A[,-2] <- FALSE
# Run the mice function
imputed_data <- mice(data, where = A)
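Applied to more than one target column, the same idea can be written by column name. The sketch below uses a small made-up data frame (the column z and the sizes are assumptions), builds the where matrix in base R, and only calls mice if the package is installed. It may be worth checking the completed data for remaining NA values, because cells whose predictors are themselves still missing might not get imputed.

```r
set.seed(1)
my_data <- data.frame(x = 1:20, y = rnorm(20), y2 = rnorm(20), z = rnorm(20))
my_data$y[sample(20, 5)] <- NA
my_data$y2[sample(20, 5)] <- NA
my_data$z[sample(20, 5)] <- NA

target <- c("y", "y2")   # columns that should be imputed

# TRUE only for missing cells in the target columns
where_mat <- is.na(my_data)
where_mat[, !(colnames(where_mat) %in% target)] <- FALSE

if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(my_data, where = where_mat, printFlag = FALSE)
  completed <- mice::complete(imp)
  print(colSums(is.na(completed)))  # inspect what is still missing
}
```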
Instead of the where argument, a faster way might be to use the method argument. You can set this argument to "" for the columns/variables you want to skip. The downside is that automatic determination of the method will not work. So:
imp <- mice(data,
method = ifelse(colnames(data) == "your_var", "logreg", ""))
But you can get the default method from the documentation:
defaultMethod
... By default, the method uses pmm, predictive mean matching (numeric data); logreg, logistic regression imputation (binary data, factor with 2 levels); polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels); polr, proportional odds model for (ordered, > 2 levels).
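Your question isn't entirely clear to me. Are you saying you wish to operate on only two columns? In that case mice(my_data[,1:2]) will work. Or do you want to use all the data but only fill in missing values for some columns? To do this, I'd record where the values were originally missing, impute everything, and then restore NA in the columns you do not want filled, along the following lines:
isNA <- is.na(my_data) # TRUE where a value was originally missing
est <- mice(my_data)
data2 <- complete(est) # first completed data set
keep_missing <- setdiff(names(my_data), c("y", "y2")) # assumed: impute only y and y2
for (nm in keep_missing) {
  data2[[nm]][isNA[, nm]] <- NA # restore the original holes
}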
For your final question, "can I use mice for rolling data imputation?" I believe the answer is no. But you should double check the documentation.
