Create an R function that normalizes data based on input values

Create an R function that normalizes data based on input values - r

I don't make to many complicated functions and typically stick with very basic ones. I have a question, how do I create a function that takes a dataset and normalizes based on desired normalization method and boxplots the output? Currently norm_method is different between the norm methods, was wondering if there is a way to call this in the start of function to pull through the correct method? Below is the code I created, but am stuck how to proceed.
library(reshape2) # for melt
library(cowplot)
demoData;
# target_deoData will need to be changed at some point
TestFunc <- function(demoData) {
# Q3 norm (75th percentile)
target_demoData <- normalize(demoData ,
norm_method = "quant",
desiredQuantile = .75,
toElt = "q_norm")
# Background normalization without spike
target_demoData <- normalize(demoData ,
norm_method = "neg",
fromElt = "exprs",
toElt = "neg_norm")
boxplot(assayDataElement(demoData[,1:10], elt = "q_norm"),
col = "red", main = "Q3",
log = "y", names = 1:10, xlab = "Segment",
ylab = "Counts, Q3 Normalized")
boxplot(assayDataElement(demoData[,1:10], elt = "neg_norm"),
col = "blue", main = "Neg",
log = "y", names = 1:10, xlab = "Segment",
ylab = "Counts, Neg. Normalized")
}

You might want to consider designing your normalize() and assayDataElement() functions to take ..., which provides more flexibility.
In lieu of that, given the examples above, you could make a simple configuration list, and elements of that configuration are passed to your normalize() and assayDataElement() functions, like this:
TestFunc <- function(demoData, method=c("quant", "neg")) {
method = match.arg(method)
method_config = list(
"quant" = list("norm_args" = list("norm_method" = "quant", desired_quantile = 0.75, "toElt" = "q_norm"),
"plot_args" = list("col"="red", main="Q3", ylab = "Counts, Q3 Normalized")),
"neg" = list("norm_args" = list("fromElt" = "exprs", "toElt" = "neg_norm"),
"plot_args" = list("col"="blue", main="Neg", ylab = "Counts, Neg Normalized"))
)
mcn = method_config[[method]][["norm_args"]]
mcp = method_config[[method]][["plot_args"]]
# normalize the data
target_demoData = do.call(normalize, c(list(data = demoData[1:10]), mcn))
# get the plot
boxplot(assayDataElement(
demoData[1:10], elt=mcp[["toElt"]],col = mcp[["col"],main = mcp[["main"]],
log = "y", names = 1:10, xlab = "Segment",ylab = mcp[["ylab"]]
)
}
Again, using this approach is not as flexible as ... (and consider splitting into two functions.. one that returns normalized data, and a second function that generates the plot..

Related

R convert string into argument for a function

I am trying to make plots using a function in which arguments are values of dataframes.
seniorPlot <- function(validityDate, seniorTotal, color){
par(new = T)
plot(validityDate,seniorTotal,
type = "l",
lwd = 2.5,
xlim = c(date_debut$validityDate,date_fin$validityDate),
ylim = c(1,nmax$seniorTotal),
col = color,
xlab = "",
ylab = "",
xaxt = "n",
yaxt = "n",
bty = "n",
)
}
For the purpose, the threee arguments of the function are results several dataframes 'seniorList' and the seniors df named with the senior initials ('AM', 'FB', 'GM'…)
To describe, you can see the senior dataframes below:
seniorList <- data.frame(seniorValidation=c('AM', 'FB', 'GM'),seniorTotal=c(72, 154, 137))
AM <- data.frame(validationDate=c('2022-01-25', '2022-01-26'), color=c('brown','brown')
FB <- data.frame(validationDate=c('2022-01-20', '2022-01-30'), color=c('green','green')
GM <- data.frame(validationDate=c('2022-01-24', '2022-01-28'), color=c('blue','blue')
Obviously, there is more than 3 lines in the seniorList, so I want to do a loop, using the value of the first column of seniorList (ie. the senior name) to call the write dataframe.
And, here is the issue : how can I convert the result of noquote(paste(noquote(seniorList[i,1]),'$validityDate', sep = '')) to the result '2022-01-25' (if i = 1)
for (i in 1:nrow(seniorList)) {
seniorPlot(validityDate = noquote(paste(noquote(seniorList[i,1]),'$validityDate', sep = '')),
seniorTotal = noquote(paste(noquote(seniorList[i,1]),'$seniorTotal', sep = '')),
color = noquote(paste(noquote(seniorList[i,1]),'$color', sep = '')))
}
Thank you for your help, I hope my english is easy to understand.
noquote(paste(noquote(seniorList[i,1]),'$validityDate', sep = '')) give AM$validityDate and not its result

It's unclear what exactly the plot should look like but this shows how use "tidy" data easily with ggplot2.
DF <- rbind(cbind(AM, data.frame(seniorTotal = 72, seniorValidation = "AM")),
cbind(FB, data.frame(seniorTotal = 154, seniorValidation = "GB")),
cbind(GM, data.frame(seniorTotal = 137, seniorValidation = "GM")))
DF$validationDate <- as.Date(DF$validationDate)
#this is how your data should look like:
print(DF)
library(ggplot2)
ggplot(data = DF, aes(x = validationDate, y = seniorTotal, color = seniorValidation)) +
geom_line()

Extension to answered question: R - Defining a function which recognises arguments not as objects, but as being part of the call

I've followed the excellent answer, as described here: https://stackoverflow.com/a/59987272/7493594
But how can I make it work with ggadjustedcurves?
myfun <- function(TimeVar, EventVar, CoxVar, CoxData){
TimeVar <- as.name(TimeVar)
EventVar <- as.name(EventVar)
CoxVar <- as.name(CoxVar)
CoxModel <- eval(bquote(coxph(Surv(.(TimeVar), .(EventVar)) ~.(CoxVar), data = .(CoxData))))
ggadjustedcurves(CoxModel,
variable = CoxVar,
xlab = "Years",
ylab = "Survival",
ggtheme = theme_survminer(),
size = 2, palette = "lancet",
data = CoxData)

I guess you're after something like this?
library(survminer)
library(survival)
myfun <- function(TimeVar, EventVar, CoxVar, CoxData) {
TimeVar <- as.name(TimeVar)
EventVar <- as.name(EventVar)
CoxVar_char <- CoxVar # Need to store `CoxVar` as string
CoxVar <- as.name(CoxVar)
CoxData <- as.name(CoxData)
CoxModel <- eval(bquote(
coxph(Surv(.(TimeVar), .(EventVar)) ~.(CoxVar), data = .(CoxData))))
ggadjustedcurves(
CoxModel,
variable = CoxVar_char, # `variable` needs to be a string
xlab = "Years",
ylab = "Survival",
ggtheme = theme_survminer(),
size = 2, palette = "lancet",
data = eval(CoxData)) # Eval `CoxData` as symbol
}
myfun("stop", "event", "size", "bladder")

How to programmatically determine the column indices of principal components using FactoMineR package?

Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1
and Fig2
Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.
Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of Principal Component Analysis (PCA). From here I can see the variables, name, studLoc and finalMark are the most important variables that can be used for further analysis.
Further Analysis- where I'm stuck at: To derive the contribution of the aforementioned variables name, studLoc, finalMark. I use the principal component variable df.princomp (see above) like df.princomp$quanti.var$contrib[,4]and df.princomp$quali.var$contrib[,2:3].
I've to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to do dynamic column index assignment, such that I do not have to manually code the column index [,2:3] in the list df.princomp?
I've already looked at the following similar questions 1, 2, 3 and 4 but cannot find my solution? Any help or suggestions to solve this problem will be helpful.

Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.
If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)

There are a lot of ways to extract contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke to soon and forgot you've got factors here so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Key factoextra::fviz_contrib and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.

Plotting function gives data must be of vector type, was 'NULL'

So I use the following functions for plotting most of the data I have to plot. I created it thanks to different chunks of code that I have found online. So far I have never encountered any issue with it.
Here is the plotting function first.
library(ggplot2)
library(reshape2)
#' Plot a given mean with error bars
#' #param resultTable The table with all the result to plot
#' #param techniques The name of the techniques in the form of a list/vector
#' #param nbTechs The number of given techniques
#' #param ymin The minimum value for y
#' #param ymax The maximum value for y
#' #param xAxisLabel The label for the x (vertical) axis
#' #param yAxisLable The label for the y (horizontal) axis
#' #return
#'
barChartTime <- function(resultTable, techniques, nbTechs = -1, ymin, ymax, xAxisLabel = "I am the X axis", yAxisLabel = "I am the Y Label"){
#tr <- t(resultTable)
if(nbTechs <= 0){
stop('Please give a positive number of Techniques, nbTechs');
}
tr <- as.data.frame(resultTable)
nbTechs <- nbTechs - 1 ; # seq will generate nb+1
#now need to calculate one number for the width of the interval
tr$CI2 <- tr$upperBound_CI - tr$mean_time
tr$CI1 <- tr$mean_time - tr$lowerBound_CI
#add a technique column
tr$technique <- factor(seq.int(0, nbTechs, 1));
breaks <- c(as.character(tr$technique));
print(tr)
g <- ggplot(tr, aes(x=technique, y=mean_time)) +
geom_bar(stat="identity",fill = I("#CCCCCC")) +
geom_errorbar(aes(ymin=mean_time-CI1, ymax=mean_time+CI2),
width=0, # Width of the error bars
size = 1.1
) +
#labs(title="Overall time per technique") +
labs(x = xAxisLabel, y = yAxisLabel) +
scale_y_continuous(limits = c(ymin,ymax)) +
scale_x_discrete(name="",breaks,techniques)+
coord_flip() +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),axis.title=element_text(size = rel(1.2), colour = "black"),axis.text=element_text(size = rel(1.2), colour = "black"),panel.grid.major = element_line(colour = "#DDDDDD"),panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())+
geom_point(size=4, colour="black") # dots
print(g)
}
Now, here is (a simplified version of the data) data that I am using (and that reproduces the error):
EucliP,AngularP,EucliR,AngularR,EucliSp,AngularSp,EucliSl,AngularSl
31.6536,30.9863,64.394,92.7838,223.478,117.555,44.7374,25.4852
12.3592,40.7639,70.2508,176.55,10.3927,145.909,143.025,126.667
14.572,8.98445,113.599,150.551,47.1545,54.3019,10.7038,47.7004
41.7957,20.9542,55.1732,67.1647,52.364,41.3655,62.7036,75.65
135.868,83.7135,14.0262,69.7183,44.987,35.9599,19.5183,66.0365
33.5359,17.2129,6.95909,47.518,224.561,91.4999,67.1279,31.4079
25.7285,33.6705,17.4725,58.45,43.1709,113.847,28.9496,20.0574
48.4742,127.588,75.0804,89.1176,31.4494,27.9548,38.4563,126.248
31.9831,80.0161,19.9592,145.891,55.2789,142.738,94.5126,136.099
17.4044,52.3866,49.9976,150.891,104.936,77.2849,232.23,35.6963
153.359,151.897,41.8876,46.3893,79.5218,75.2011,68.9786,91.8972
And here is the code that I am using:
data = read.table("*Path_to_file*.csv", header=T, sep=",")
data$EucliPLog = (data$EucliP) #Before here I used to use a log transform that I tried to remove for some testing
data$EucliRLog = (data$EucliR) #Same thing
data$EucliSpLog = (data$EucliSp) #Same thing
data$EucliSlLog = (data$EucliSl) #Same thing
a1 = t.test(data$EucliPLog)$conf.int[1]
a2 = t.test(data$EucliPLog)$conf.int[2]
b1 = t.test(data$EucliRLog)$conf.int[1]
b2 = t.test(data$EucliRLog)$conf.int[2]
c1 = t.test(data$EucliSpLog)$conf.int[1]
c2 = t.test(data$EucliSpLog)$conf.int[2]
d1 = t.test(data$EucliSlLog)$conf.int[1]
d2 = t.test(data$EucliSlLog)$conf.int[2]
analysisData = c()
analysisData$ratio = c("Sl","Sp","R","P")
analysisData$pointEstimate = c(exp(mean(data$EucliSlLog)),exp(mean(data$EucliSpLog)),exp(mean(data$EucliRLog)),exp(mean(data$EucliPLog)))
analysisData$ci.max = c(exp(d2), exp(c2),exp(b2), exp(a2))
analysisData$ci.min = c(exp(d1), exp(c1),exp(b1), exp(a1))
datatoprint <- data.frame(factor(analysisData$ratio),analysisData$pointEstimate, analysisData$ci.max, analysisData$ci.min)
colnames(datatoprint) <- c("technique", "mean_time", "lowerBound_CI", "upperBound_CI ")
barChartTime(datatoprint,analysisData$ratio ,nbTechs = 4, ymin = 0, ymax = 90, "", "Title")
So If I do use the log() that I mention in the comments of the last piece of code, everything works fine and I get my plots displayed. However, I tried removing the log and I get the famous
Error in matrix(value, n, p) :
'data' must be of a vector type, was 'NULL'
I have tried looking for null values in my data but there are none and I do not know where to look at next. Would love to get some help with that.
Thanks in advance
Edit: Here is the result of dput on datatoprint:
structure(list(technique = structure(c(3L, 4L, 2L, 1L), .Label = c("P",
"R", "Sl", "Sp"), class = "factor"), mean_time = c(1.04016257618464e+32,
1.64430609815788e+36, 7.5457775364611e+20, 3.85267453902928e+21
), lowerBound_CI = c(6.64977706609883e+50, 5.00358136618364e+57,
2.03872433045407e+30, 4.93863589006376e+35), `upperBound_CI ` = c(16270292584857.9,
540361462434140, 279286207454.44, 30055062.6409769)), .Names = c("technique",
"mean_time", "lowerBound_CI", "upperBound_CI "), row.names = c(NA,
-4L), class = "data.frame")
And the dput on analysisData:
structure(list(ratio = c("Sl", "Sp", "R", "P"), pointEstimate = c(1.04016257618464e+32,
1.64430609815788e+36, 7.5457775364611e+20, 3.85267453902928e+21
), ci.max = c(6.64977706609883e+50, 5.00358136618364e+57, 2.03872433045407e+30,
4.93863589006376e+35), ci.min = c(16270292584857.9, 540361462434140,
279286207454.44, 30055062.6409769)), .Names = c("ratio", "pointEstimate",
"ci.max", "ci.min"))

Without the log I don't have anything on display because the value are above 10^40++ whereas with the log it's below the upper limit (90).
I don' get the error you get though.

How to return a value if an error-free R function does not do so

My goal is to assign a plot produced by the pyramid package to a list. Later, I will have that plot and others inserted from the list into a document. But the pyramid function appears not to return a value. How can I assign the pyramid plot to an object?
install.packages("pyramid") # functions to draw a population pyramid
library(pyramid)
# create a mock data frame to comparing this plot to a counterpart from plotrix
df <- data.frame(level1 = c(9,9,4,3,34,28), levelsame = c(9,9,4,3,34,28),
title = c("Dir", "Exec. Dir", "Mgr", "Sr. Mgr", "Mgt Princ", "EVP+"))
# assign the plot (hopefully) to an object
empty <- pyramid(df, Laxis = seq(1,35,5), AxisFM = "g", Csize = 0.8, Cgap = .5, Llab = "",
Rlab = "", Clab = "Title", GL = F, Lcol = "blue", Rcol = "blue",
Ldens = -1, main = "Distribution of Levels")
> empty
NULL
Likewise, if I assign the pyramid call to my list, nothing happens. There is no value for the list returned by pyramid.
plotlist2[["pyramid"]] <- pyramid(df, Laxis = seq(1,35,5), AxisFM = "g", Csize = 0.8, Cgap = .5, Llab = "",
Rlab = "", Clab = "Title", GL = F, Lcol = "blue", Rcol = "blue",
Ldens = -1, main = "Distribution of Levels")
> plotlist2[1]
[[1]]
NULL
I fear I am blundering in some obvious mis-understanding, so I welcome being set aright. Thank you.

You can use the recordPlot() function to save the current plot to a variable.
In your case you could do:
#print the plot
pyramid(df, Laxis = seq(1,35,5), AxisFM = "g", Csize = 0.8, Cgap = .5, Llab = "",
Rlab = "", Clab = "Title", GL = F, Lcol = "blue", Rcol = "blue",
Ldens = -1, main = "Distribution of Levels")
#save the current printed plot
pyrPlot<-recordPlot()
#plot it again
pyrPlot
You might have to enable the displaylist using dev.control(displaylist ="enable") for this to work depending on the graphical device you are using

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Create an R function that normalizes data based on input values - r

Related

R convert string into argument for a function

Extension to answered question: R - Defining a function which recognises arguments not as objects, but as being part of the call

How to programmatically determine the column indices of principal components using FactoMineR package?

Plotting function gives data must be of vector type, was 'NULL'

How to return a value if an error-free R function does not do so

Categories

Resources