Using a for loop with ggstatsplot in R

I have a CSV file with roughly 180 columns (called df in my example). So far I have managed to plot the data with the ggstatsplot::ggbetweenstats function. One column, called Group, contains the treatment condition and is plotted on the x-axis. The y-axis variable changes for each plot (in the example below it is Bcells.CD45).
ggstatsplot::ggbetweenstats (df, x = Group, y = Bcells.CD45 , plot.type = "violin")
Next, I tried to use a for loop to replace the y-axis variable for each generated plot.
for (i in names(df) [1:ncol(df)]) { ggstatsplot::ggbetweenstats(df, x = Group, y = i , plot.type = "violin")}
R returns the following error:
can't subset columns that don't exist.
x The column i doesn't exist.
Run rlang::last_error() to see where the error occurred.
I have the impression that either the ggstatsplot package can't handle i as a placeholder for changing column names, or I'm making a mistake in defining i.
Thanks for your help!
Best Martin

Please have a look at the FAQ-vignette on ggstatsplot website, which documents how to use ggstatsplot functions in a for loop:
https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/faq.html
Relevant text:
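The FAQ passage itself is not included here. As a minimal sketch of one common way to write such a loop (not the FAQ's own wording), the string held in i can be converted to a symbol and unquoted with !!, so that ggbetweenstats receives a column name instead of the literal i. The data frame below is a made-up stand-in for the asker's df, and Tcells.CD45 is a hypothetical second column. plot.type is left out to keep the sketch independent of the installed package version.

```r
library(ggstatsplot)

# Hypothetical stand-in for the asker's df: one grouping column, two measurements.
set.seed(1)
df <- data.frame(
  Group       = rep(c("control", "treated"), each = 20),
  Bcells.CD45 = rnorm(40, mean = 10),
  Tcells.CD45 = rnorm(40, mean = 12)
)

# Loop over every column except the grouping column itself.
y_cols <- setdiff(names(df), "Group")
plots <- vector("list", length(y_cols))
names(plots) <- y_cols

for (i in y_cols) {
  # `i` is a string; !!rlang::sym(i) turns it into the column name that
  # ggbetweenstats expects for its unquoted `y` argument.
  plots[[i]] <- ggbetweenstats(data = df, x = Group, y = !!rlang::sym(i))
  print(plots[[i]])  # plots are not auto-printed inside a for loop
}
```

Storing the plots in a named list keeps them available after the loop, for example for saving each one to a file. Note that the original loop also iterated over the Group column itself, which the setdiff() call avoids.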

Related

ggplot within a function does not return a scatterplot with datapoints, instead a plot with dataframe values. How to fix this?

I'm writing a function where I should get 2 ggplots objects returned to me in RStudio based on two different dataframes generated within my function. However, instead I get a plot with all the dataframe values "printed" in it returned and not a normal scatterplot.
I tried:
return(list(df1, df2))
Plots<- list(df1, df2), return(Plots)
View(df1) View(df2)
ggplot without storing it into an object
Just return a single ggplot and not using list() to return two.
Print() instead of return or view.
Every result has the same outcome (picture):
As you can see on the bottom right, I do not get a scatter plot. The console does show output [1] and [[2]], but nothing else. The code itself is working perfectly.
I ran the debugger and got no errors, and above all, when I replaced ggplot with plot(), this DID return the preferred scatterplot. So I assume the problem is not related to the code itself.
However, I am much more familiar with customizations with ggplot than plot(), so if anyone knows how to solve this issue it would be amazing. Provided below I added some sample data and some sample code, although I'm not sure whether that is relevant with this issue.
The code I used within my function to create and return the ggplots is:
MD_filter_trial <- function(dataframe, mz_col, a = 0.00112, b = 0.01953) {
  MZ <- mz_col
  MZR <- trunc(mz_col, digits = 0) # Either floor() or trunc() can be used for this part.
  MD <- as.numeric(MZ - MZR)
  MD.limit <- b + a * mz_col
  dataframe <- dataframe %>%
    dplyr::mutate(MD, MZ, MD.limit) %>%
    dplyr::select(MD, MZ, MD.limit)
  highlight_df <- dataframe %>% filter(MD >= MD.limit) # Notice how this is the exact opposite from the
  MD_plot <- ggplot(data = dataframe, aes(x = MZ, y = MD)) +
    geom_point() +
    geom_point(data = highlight_df, aes(x = MZ, y = MD), color = 'red') + # I added this one, so the data which will be removed will be highlighted in red.
    ggtitle(paste("Unfiltered MD data - ", dataframe))
  filtered <- dataframe %>%
    filter(MD <= MD.limit) # As I understood: Basically all are coordinates. The maxima equation basically gives coordinates
  # Filtered is basically the second dataframe,
  # which subsets datapoints with an Y value (which is the MD), below the linear equation MD...
  MD_plot_2 <- ggplot(data = filtered, aes(x = MZ, y = MD)) +
    geom_point() +
    ggtitle(paste("Filtered MD data - ", dataframe))
  N_Removed_datapoints <- nrow(dataframe) - nrow(filtered)
  print(paste("Number of peaks removed:", N_Removed_datapoints))
  MD_PLOTS <- list(dataframe, filtered, MD_plot, MD_plot_2)
  return(MD_PLOTS)
}
Sample data:
structure(list(mz_col= c(99.0001, 99.0056, 99.0079, 99.0097, 99.0105,
99.0116, 99.0158, 99.0169, 99.019, 99.0196, 99.0207, 99.0215,
99.0239, 99.0252, 99.026, 99.0269, 99.0288, 99.0295, 99.0302,
99.0311, 99.0318, 99.0332, 99.034, 99.0346, 99.0355, 99.0376,
99.039, 99.04, 99.0405, 99.0414, 99.0421, 99.043, 99.0444, 99.0473,
99.048, 99.0517, 99.0536, 99.0547, 99.0556, 99.057, 99.0575,
99.0586, 99.0599, 99.0606, 99.0621, 99.0637, 99.0652, 99.0661,
99.0668, 99.0686, 99.0694, 99.0699, 99.0707, 99.0714, 99.072,
99.075, 99.0762, 99.0794, 99.0808, 99.0836, 99.0888, 99.0901,
99.0911, 99.092, 99.095, 99.0962, 99.1001, 99.1064, 99.1173,
99.4889, 99.5059, 99.5084, 99.5126, 99.5158, 99.5165, 99.5173,
99.5183, 99.526, 99.5266, 99.5315, 99.5345, 99.5358, 99.5402,
99.543, 99.5472, 99.548, 99.5529, 99.5572, 99.5577, 99.9408,
99.9551, 99.9599, 99.9646, 99.9718, 99.9887)), row.names = c(NA,
-95L), class = c("tbl_df", "tbl", "data.frame"))
In your ggtitle calls, perhaps you mean:
ggtitle(paste("Filtered MD data -", deparse(substitute(dataframe))))
Within a function this takes the name of the object passed to the dataframe argument and pastes it into a string, rather than putting the whole dataframe in.
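A small base-R sketch of this behaviour follows. The helper name title_for and the toy data frame peaks are made up for illustration. It also shows one detail that may matter for the function in the question: the name has to be captured before dataframe is reassigned inside the function, because substitute() returns the value rather than the original expression once the argument has been overwritten.

```r
# Hypothetical helper: builds a title from the NAME of the object passed in.
title_for <- function(dataframe, prefix = "Filtered MD data -") {
  # Capture the argument's name first: substitute() returns the unevaluated
  # expression and deparse() turns it into a character string.
  df_name <- deparse(substitute(dataframe))

  # Reassigning `dataframe` afterwards (as the function in the question does)
  # no longer affects the captured name.
  dataframe <- dataframe[dataframe$MD <= 0.005, , drop = FALSE]

  paste(prefix, df_name)
}

# Toy data, for illustration only.
peaks <- data.frame(MZ = c(99.0001, 99.0056), MD = c(0.0001, 0.0056))

good <- title_for(peaks)
good
# [1] "Filtered MD data - peaks"

# Pasting the data frame itself produces one string per column instead,
# which is why the original titles filled the plot with values.
bad <- paste("Filtered MD data -", peaks)
length(bad)
# [1] 2
```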

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

How to check if a column has numeric or categorical levels in R?

I am trying to plot 9 barplots in a 3X3 matrix in R using base-R wrapped inside a for loop. (I am working on a workhorse solution for visualizing every column before I begin working on manipulating data) Below is the code:
library(ISLR);
library(ggplot2);
# load wage data
data(Wage)
par(mfrow=c(3,3))
for(i in 1:(dim(Wage)[2]-2)){
plot(Wage[,i],main = paste0(names(Wage)[i]),las = 2)
}
Unfortunately, this does not work properly for the first 2 columns, because they are numeric and actually need a histogram. I understand that I need to fit an if-else condition somewhere inside the for() statement, but that gives me errors. Below is the output, where the first 2 columns are plotted wrong. (Age and year are actually numeric, and I may need to put them on the x-axis instead of defaulting them to y.)
Could you kindly suggest an edit or hack? I also learnt that I can't use par() when I am wrapping ggplot inside for, so I had to use base R; otherwise ggplot would have been great aesthetically.
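One possible shape for the if-else described above, as a sketch only: test each column with is.numeric() and draw a histogram for numeric columns and a bar plot of level counts otherwise. The helper plot_column is a made-up name, and the built-in iris data stands in for Wage so that the example runs without the ISLR package.

```r
# Draw one panel for a column: histogram if numeric, bar plot of counts otherwise.
# Returns the kind of plot drawn so the choice can be inspected afterwards.
plot_column <- function(x, title) {
  if (is.numeric(x)) {
    hist(x, main = title, xlab = title)
    "histogram"
  } else {
    # table() counts the levels of a factor or character column
    barplot(table(x), main = title, las = 2)
    "barplot"
  }
}

# iris stands in for Wage here: four numeric columns and one factor.
old_par <- par(mfrow = c(2, 3))
kinds <- character(0)
for (nm in names(iris)) {
  kinds[nm] <- plot_column(iris[[nm]], nm)
}
par(old_par)  # restore the previous layout

kinds
```

For the Wage data, the same loop would run over names(Wage) with par(mfrow = c(3, 3)).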

Pairs in R - Re-order variables

I am trying to make a scatter-plot matrix from a data frame (available here: http://statweb.stanford.edu/~tibs/ElemStatLearn/). However, the variables are not in the order I want, and I would like to leave out the variable train.
Dataframe order:
lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa,train
The order I wish:
lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
For the moment, here is my code:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate=as.data.frame.matrix(prostate1)
pairs(prostate, col="purple")
I tried to add the arguments horInd and verInd, but I get the following warnings:
1: "horInd" is not a graphical parameter
2: "verInd" is not a graphical parameter
If anyone could help me, it would really be appreciated.
Try this:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate = as.matrix(prostate1)
prostate.reordered = prostate[, c("lpsa", "lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]
pairs(prostate.reordered, col="purple")
The idea is to select the columns you want, in the order you want, using the column names for selection.
Of course, it would probably be even more efficient not to convert the whole data frame into a matrix, but only the required columns.
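As a sketch of that last point, a data frame can be indexed by column name directly, and pairs() accepts a numeric data frame, so the matrix conversion can be skipped altogether. The built-in mtcars data and the chosen column names stand in for the prostate data here.

```r
# mtcars stands in for the prostate data; the column names are only illustrative.
wanted <- c("mpg", "wt", "hp", "disp")

# Indexing a data frame by a character vector keeps only those columns,
# in exactly the order given; no conversion to a matrix is needed.
cars_reordered <- mtcars[, wanted]

pairs(cars_reordered, col = "purple")
```

Any column not listed in the selection vector (train, in the question) is simply left out.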

Custom function does not work in R 'ddply' function

I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
idx = Concentration_nM==0.01
jnk = mean(CelltiterGLO[idx], na.rm = T)
out = 100*(CelltiterGLO/jnk)
return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis=ddply(df,
.(CTSC, Time_h, ExpType),
transform,
NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double-check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM'==0.01, I do not get 100 but different numbers. However, if I subset my df by the two levels of the variable 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor, but I got similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code is not working on the entire df. I wonder if this is because the two levels of 'ExpType' do not contain the same number of levels for all the other variables, e.g. one of the levels of 'Time_h' is missing for the level "Combinations" of 'ExpType'.
Thanks very much for your help, and I apologize in advance if the answer is already on Stack Overflow and I was not able to find it.
Michele
I (the OP) found out that the function used a variable in its body that was missing from its arguments. Simply adding the variable Concentration_nM to the arguments of the custom function solved the problem.
THANKS
m.
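A sketch of the corrected function follows, with a base-R split/transform/recombine step standing in for ddply so that it runs without plyr. The toy data is made up. In the original, Concentration_nM was not an argument, so it was presumably looked up outside the current group (for example in the global environment), which would explain why the per-group means were off. With ddply, the corresponding call would pass both columns: NormViability = normalize.fun(CelltiterGLO, Concentration_nM).

```r
# Both columns are explicit arguments, so each call only sees the values
# of the group it was given.
normalize.fun <- function(CelltiterGLO, Concentration_nM) {
  idx <- Concentration_nM == 0.01
  ref <- mean(CelltiterGLO[idx], na.rm = TRUE)  # group mean at the reference concentration
  100 * (CelltiterGLO / ref)
}

# Hypothetical toy data: two groups with very different baselines.
df <- data.frame(
  ExpType          = rep(c("Combinations", "DoseResponse"), each = 4),
  Concentration_nM = rep(c(0.01, 0.01, 1, 10), times = 2),
  CelltiterGLO     = c(200, 220, 100, 50, 1000, 900, 400, 100)
)

# Base-R stand-in for the grouped transform: split, transform each piece, recombine.
pieces <- lapply(split(df, df$ExpType), function(d) {
  transform(d, NormViability = normalize.fun(CelltiterGLO, Concentration_nM))
})
df.bis <- do.call(rbind, pieces)

# Check: mean NormViability at the reference concentration, per group.
ref_rows <- df.bis[df.bis$Concentration_nM == 0.01, ]
check <- tapply(ref_rows$NormViability, ref_rows$ExpType, mean)
check
```

Each group's mean at Concentration_nM == 0.01 should come out as 100 (up to floating-point rounding), which is the check described in the question.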
