Create new sequentially named variables and fill with mean of level - r

Warning: Multi-part question!
I realize parts of this have been answered elsewhere but am struggling to bring them together in a nice parsimonious bit of code....
I have a data frame with a number (24) of numeric columns of interest. For each column, I want to create a new variable in the same data frame (named sensibly) in which the values correspond to the mean of the sex-specific decile for that variable (sex is in a different column, coded 0/1).
New column names from an original column called 'WBC' would be, for example: 'WBC_meandec_women', and 'WBC_meandeac_men'.
I've tried various bits of code to first create new variables, then assign values related to the decile but none work well and can't figure out how to put it together. I just know there is a clever way to put all parts into the same code chunk, I'm just not fluent enough in R to get there...
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),WBC=rnorm(100),RBC=rnorm(100))
Trying to achieve:
goaldata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100),WBC_decmean_women=rep(NA,length(dummydata)),WBC_decmean_men=rep(NA,length(dummydata)),RBC_decmean_women=rep(NA,length(dummydata)),RBC_decmean_men=rep(NA,length(dummydata)))
...but obviously with the correct values instead of NAs, and for a list of about 24 original variables.
Any help greatly appreciated!

Depending on if I understood you right, I'll propose this giant ball of duct tape...
# fake data
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
# a function to calculate decile means
decilemean <- function(x) {
xrank <- rank(x)
xdec <- floor((xrank-1)/length(x)*10)+1
decmeans <- as.numeric(tapply(x,xdec,mean))
xdecmeans <- decmeans[xdec]
return(xdecmeans)
}
# looping thru your data columns and making new columns
newcol <- 5 # the first new column to create
for(j in c(3,4)) { # all of your colums to decilemean-ify
dummydata[,newcol] <- NA
dummydata[dummydata$sex==0,newcol] <- decilemean(dummydata[dummydata$sex==0,j])
names(dummydata)[newcol] <- paste0(names(dummydata)[j],"_decmean_women")
dummydata[,newcol+1] <- NA
dummydata[dummydata$sex==1,newcol+1] <- decilemean(dummydata[dummydata$sex==1,j])
names(dummydata)[newcol+1] <- paste0(names(dummydata)[j],"_decmean_men")
newcol <- newcol+2
}
I'd recommend testing it though ;)

Related

Selecting and arranging column data in R

I have this data set that requires some cleaning up. Is there a way to code in R such that it picks up columns with more than 3 different levels from the data set? Eg column C has the different education level and I would like it to be selected along with column D and F. While column E and G wont be picked up because it doesnt meet the more than 3 level requirement.
At the same time I need one of the columns to be arranged in a specific way? Eg Education, I would like PHD to be at the top. The other levels of education does not need to be in any order
Sorry i am really new to R and I attached a snapshot of a sample data i replicated from the original
All help is greatly appreciated
It is a bit complicated to replicate the data as it is an image, but you could use this function to select those columns of your dataframe that have at least 3 levels.
First I converted to factor those columns you are considering, in this case from column C or 3. Then with the for loop I identify those columns with more than 2 levels, and save the result in a vector and then filter the original data set according to these columns.
select_columns <- function(df){
df <- data.frame(lapply(df[,-c(1,2)], as.factor))
selectColumns <- c()
for (i in 1:length(df)) {
if((length(unique(df[,i])) > 3) ){
selectColumns[i] <- colnames(df[i])
}
}
selectColumns <- na.omit(selectColumns)
return(data %>% select(c(1:2),selectColumns))
}
select_columns(your_data_frame)

Print multiple Outputs stored in a vector with a Loop

I'm new in R and coding in general...
I have computed multiple anova analysis on multiple columns (16 in total).
For that purpose, the method "Purr" helped me :
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work, I dont know and did not find how to deal with $ in for loop
Here you have a look of the structure of the output vector
Structure of output
I have tried several times with others methods, more or less complicated. Often the examples found online are too simple and does not allow me to adapt to my data.
Any tips ?
Thank you and sorry for such an noobie question
Whenever I use a loop for an analysis I like to store the results in a data.frame, it allows to keep a good overview. Since you did not provide a reproducible example I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign numers for each annova you want to run, eg. 3
In the loop you can now extract the results of the anova that you are interested in, I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a coulmn for other values you want to add.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
anova_results$p_value[anova_results$number == i-1] <- p
mean <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean quares
anova_results$Mean_Sq[anova_results$number == i-1] <- mean
}
View(anova_results)

How to loop through a dataframe and count the number of occurrences of different strings and add their counts to other columns

I am trying to create a set of functions that can create tables containing all combinations of different 8-sided gaming dice. My function creates a dataframe with all these combinations with the columns containing characters.
Now I want to do something like this:
Go through each row and every time I find a certain character, add 1 to the relevant new column.
I have tried writing a function to do this and using the apply functions, but I am new to R and have not been able to make this work.
This is the code I use to create the dataframe.
createDiceTable <- function(redDice=0,blackDice=0,blueDice=0){
red <- as.data.frame(c("Blank","Blank","Crit","Crit","Hit","Hit","Accuracy","HitHit"))
black <- as.data.frame(c("Blank","Blank","Hit","Hit","Hit","Hit","HitCrit","HitCrit"))
blue <- as.data.frame(c("Hit","Hit","Hit","Hit","Crit","Crit","Accuracy","Accuracy"))
result <- as.data.frame(expand.grid(c(replicate(redDice,red),replicate(blackDice,black),replicate(blueDice,blue))))
convertDice(result)
}
convert
convertDice <- function(df){
df[,"Damage"] <- NA
df[,"Accuracies"] <- NA
df[,"Crits"] <- NA
return(df)
}
I have tried this code for a function to apply to the dataframe that should count the number of dice having rolled "Hit" to the "Damage"-column
damageFunc <- function(x){if(x=="Hit") append(df["Damage"],1)}
I have tried applying it like this:
sapply(testdice,damageFunc)

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
`year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}`
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?Wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value #
So with your dataset, perhaps putting this in the bottom of your for loop:
pvalue[j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value

Removing rows in data frame using get function

Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though it seems only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It would have worked here since there were some matches to the condition. It will bite you, however, whenever there are not any rows that match and instead of returning nothing as it should, it will return everything, since vec[numeric(0)] == vec. Only use which for "positive" choices.
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with #DWin and #flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, eg if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
ID score
1 1 11
2 2 12
4 4 14
5 5 15

Resources