Using lapply on a diverse list in R - r

I created a list containing sub-lists, each sub-list containing information for one task I want R to do.
df <- as.data.frame(matrix(1:6 , ncol =2 , nrow = 3))
colnames(df) <- c("Col1", "Col2")
myList <- list()
myList[["Dataset1"]] <- list()
myList[["Dataset1"]]["Function"] <- "mean"
myList[["Dataset1"]][["DataFrame"]] <- df
myList[["Dataset2"]] <- list()
myList[["Dataset2"]]["Function"] <- "lm"
myList[["Dataset2"]][["DataFrame"]] <- df*2
Now I want R to apply the Function to the Dataframe and Store the results in a new List. How do I do that best?
So far I had two ideas: I either use lapply to run through the list, each time accessing several items from the sub-list, supplying it to a new function
myResult <- lapply(myList, FUN = myList[["Dataset1"]]["Function"](x) , x = as.matrix(myList[["Dataset1"]][["DataFrame"]]))
But I dont know how to tell R how to cycle correctly through the sub-lists.
Second, I was hoping to be able to send the whole sublist to a function, but I could not get this to run either.
myFunction <- function(x){
TempData <- x[["DataFrame"]]
TempFunction <- x["Function"]
TempResult <- get(TempFunction)(TempData)
return(TempResult)
}
myResult <- lapply(myList, myFunction(x))
If someone could give me an idea how to solve this id be very happy.
Many thanks in advance!

Sounds more complicated than it should be.. And it's not clear how you would apply mean() or lm(). Below is an example where you store the function as an object in the list:
myList[["Dataset1"]] <- list()
myList[["Dataset1"]][["Function"]] <- function(x)mean(as.matrix(x))
myList[["Dataset1"]][["DataFrame"]] <- df
myList[["Dataset2"]] <- list()
myList[["Dataset2"]][["Function"]] <- function(x){lm(Col2~Col1,data=x)}
myList[["Dataset2"]][["DataFrame"]] <- df*2
So you iterate through elements in the list and you get the result of function(dataframe) for each element:
lapply(myList,function(i)i$Function(i$DataFrame))
$Dataset1
[1] 3.5
$Dataset2
Call:
lm(formula = Col2 ~ Col1, data = x)
Coefficients:
(Intercept) Col1
6 1

A first thing you can do is use the match.fun function that allows you to recover a function from a string. Applied to a loop, it returns the following solution :
list<-list()
for (i in 1:length(myList)){
TempFunc <- match.fun(myList[[i]][["Function"]])
Results <- TempFunc(myList[[i]][["DataFrame"]])
list[i]<-Results
}
> list
[[1]]
[1] NA
[[2]]
(Intercept) Col2
-6 1
Actually, the functions don't make sense because the functions are not appropriate.

The function call allows you to call a function by its name given as a character string, and you can evaluate this call with eval. Note though, that neither of your function calls make sense applied to a data frame:
lapply(myList, function(sublist) call(sublist$Function, sublist[["DataFrame"]]))
#> $Dataset1
#> mean(list(Col1 = 1:3, Col2 = 4:6))
#> $Dataset2
#> lm(list(Col1 = c(2, 4, 6), Col2 = c(8, 10, 12)))
So the results are kind of useless for the examples used:
lapply(myList, function(sublist) eval(call(sublist$Function, sublist[["DataFrame"]])))
#> $Dataset1
#> [1] NA
#>
#> $Dataset2
#>
#> Call:
#> lm(formula = structure(list(Col1 = c(2, 4, 6), Col2 = c(8, 10,
#> 12)), class = "data.frame", row.names = c(NA, -3L)))
#>
#> Coefficients:
#> (Intercept) Col2
#> -6 1
#>
#>
#> Warning message:
#> In mean.default(list(Col1 = 1:3, Col2 = 4:6)) :
#> argument is not numeric or logical: returning NA

Related

Passing col_names of a data frame in a function IN R

i want to create a function that calculte the rate of missing values of a data frame's column.
here's my code :
Pourcentage_NA = function(df, col_Name){
res=100*( length(df[col_Name])-length(na.omit(df[col_Name])) ) / length(df[col_Name])
}
when I call it like this : x = Pourcentage_NA( Data , "A" )
It shows that x=0 although I know there's some missing values. Anyone can help me pls ?
I try to change the formula but it keep saying the same thing
You can do:
Pourcentage_NA = function(df, col_Name){
100 * sum(is.na(df[col_Name])) / nrow(df)
}
Testing, we have:
dat <- data.frame(A = c(1, NA, 3, NA), B = c(NA, 2, 3, 4))
Pourcentage_NA(dat, "A")
#> [1] 50
Pourcentage_NA(dat, "B")
#> [1] 25
An alternative would be
Pourcentage_NA <- function(df, col_Name) 100 * colMeans(is.na(df))[col_Name]
Pourcentage_NA(dat, "A")
#> A
#> 50
Pourcentage_NA(dat, "B")
#> B
#> 25
Created on 2022-11-13 with reprex v2.0.2

enumerate in R over dataframe rows

I'm trying to modify a function so that if I put in a dataframe, I get the rownumber and row output.
These functions taken from Zip or enumerate in R? are a good starting point for me:
zip <- function(...) {
mapply(list, ..., SIMPLIFY = FALSE)
}
enumerate <- function(...) {
zip(k=seq_along(..1), ...)
}
I modified enumerate to work as I want when the input is a dataframe:
enumerate2 <- function(...){
mod <- ..1
if(is.data.frame(mod)){
mod = split(mod, seq(nrow(mod)))
}
zip(k = seq_along(mod), ...)
}
So for example:
g = data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
enumerate2(v = g)
This will enumerate the rows of a dataframe, so I can do:
for(i in enumerate2(v = g)){
"rowNumber = %s, rowValues = %s" %>% sprintf(i$k, list(i$v)) %>% print
}
The problem is I get a warning:
Warning message:
In mapply(list, ..., SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
Also, I'd rather the dataframe still be a dataframe so that I can do things like i$v$b to return the value of row i$k column b from the dataframe.
How can I get rid of the warning, and how can I keep the dataframe structure after split?
edit:
example 1 - data frame input
output:
enumerate2(v = data.frame(A = c(1, 2), B = c(3, 4)))
[[1]]
[[1]]$k
[1] 1
[[1]]$v
A B
1 1 3
[[2]]
[[2]]$k
[1] 2
[[2]]$v
A B
1 2 4
example 2 - list input
output:
enumerate2(v = LETTERS[1:2])
[[1]]
[[1]]$k
[1] 1
[[1]]$v
[1] "A"
[[2]]
[[2]]$k
[1] 2
[[2]]$v
[1] "B"

Actively logging assignments in R

I'm trying to set up a way in R to print details of an each assignment while R code is run. So, for example, if the code x <- 1 is run then x has been assigned 1 will automatically be printed.
Is this possible?
I have two thoughts on how this might be done but can't figure out if either is possible.
redefine the = primitive so that it also prints a message
have an assignment trigger another function to run
one possible solution, but requires editing the code would be
# custom assignment function -----------------------------------------------------------------
`%<-%` <- function (lhs, rhs) {
cl <- match.call()
lhs <- substitute(lhs)
env <- parent.frame()
message("Info: `", lhs, "` defined as `", enquote(cl$rhs)[2], "`")
invisible(eval(assign(x = paste(lhs),
value = rhs,
envir = env))
)
}
# some tests ----------------------------------------------------------------------------------
ad %<-% c(1,2,33)
#> Info: `ad` defined as `c(1, 2, 33)`
ac %<-% 22
#> Info: `ac` defined as `22`
ad %<-% 22
#> Info: `ad` defined as `22`
df <- mtcars
df %<-% mtcars
#> Info: `df` defined as `mtcars`
If you don't want to modify files, you can define a modified source() function to replace the assignments with the newly defined %<-% function.
source_loudly <- function(filePath, ...) {
file_con <- file(filePath, open = "r")
txt <- readLines(file_con)
close(file_con)
txt_mod <- gsub(pattern = "<-", replace = "%<-%", x = txt)
source(textConnection(txt_mod), ...)
}
filePath <- "R/bits/example.R" # point to a local file on your pc
source_loudly(filePath = filePath, echo = T)
Created on 2021-03-19 by the reprex package (v1.0.0)
Here's a getter/setter hack that comes close without costing too much. While it does require you to change existing code, it has the benefit that you can change the initial assignment to list instead of tracer and everything continues to work unchanged.
tracer <- local({
.e <- NULL
function(..., name = "unk") {
.e <<- list(...)
.e$.name <<- name
`class<-`(.e, c("tracer", "environment"))
}
})
`[.tracer` <- `[[.tracer` <- `$.tracer` <- function(x, i) {
cat(sprintf("get: %s\n", deparse(substitute(i))))
NextMethod()
}
`[<-.tracer` <- `[[<-.tracer` <- `$<-.tracer` <- function(x, i, value) {
cat(sprintf("set: %s <- %s\n", deparse(substitute(i)),
substr(paste(deparse(substitute(value)), collapse = " "), 1, 80)))
NextMethod()
}
Notes:
deparse tends to split long lines into a vector of strings; this is mitigated here with paste(..., collapse=" ");
... but long literal values (e.g., frames) can be a bit annoying in the logs, so I arbitrarily chose substr(., 1, 80) as a reasonable size to log.
this hints at one problem I'll expand on below: this doesn't tell you which columns have been modified, just that the object has been updated.
Demonstration with "simple" objects:
quux <- tracer(a=1, b=2:3, d=list(pi, "a"), mt=mtcars[1:2,])
quux$a
# get: "a"
# [1] 1
quux$a <- 11
# set: "a" <- 11
quux$b
# get: "b"
# [1] 2 3
quux$b <- 2:5
# set: "b" <- 2:5
quux$b
# get: "b"
# [1] 2 3 4 5
So far, so good. Now onto the list:
quux$d
# get: "d"
# [[1]]
# [1] 3.141593
# [[2]]
# [1] "a"
quux$d[[1]]
# get: "d"
# [1] 3.141593
quux$d[[1]] <- pi^2
# get: "d"
# set: "d" <- list(9.86960440108936, "a")
The latter needs some explanation, notably about the order of operations. The assignment is really `[[<-`(quux$d, 1, pi^2), which is not traced. This adjusts the first element of the list, and then assigns this new list back to quux$d, where our $<-.tracer sees that full-list reassignment.
That is not completely unreasonable for small objects, but it becomes a little more annoying with larger objects:
quux$mt$cyl
# get: "mt"
# [1] 6 6
quux$mt$cyl <- quux$mt$cyl + 5
# get: "mt"
# get: "mt"
# set: "mt" <- structure(list(mpg = c(21, 21), cyl = c(11, 11), disp = c(160, 160), hp = c(110, 110), drat = c(3.9, 3.9), wt = c(2.62, 2.875 ), qsec = c(16.46, 17.02), vs = c(0, 0), am = c(1, 1), gear = c(4, 4), c
quux$mt$cyl
# get: "mt"
# [1] 11 11
Similarly, for an assignment we see both the first "get" step and then the whole-object-reassignment. (It is cutoff because I used substr(., 1, 80).)
Also, note that in both quux$d and quux$mt, the tracer functions never see the sub-element or column being adjusted. Since R orders the operations as it does, our tracer functions cannot reveal what is going on there (easily).
Now, when you're ready to remove this level of activity-logging, just replace your initial call to tracer(.) with list(.), and all operations continue to work but without logging.
quux <- list(a=1, b=2:3, d=list(pi, "a"), mt=mtcars[1:2,])
quux$a
# [1] 1
quux$a <- 11
quux$b
# [1] 2 3
quux$b <- 2:5
quux$b
# [1] 2 3 4 5
quux$d
# [[1]]
# [1] 3.141593
# [[2]]
# [1] "a"
quux$d[[1]]
# [1] 3.141593
quux$d[[1]] <- pi^2
quux$mt$cyl
# [1] 6 6
quux$mt$cyl <- quux$mt$cyl + 5
quux$mt$cyl
# [1] 11 11

Overwriting an object in the global environment after using deparse(substitute()) in a function call

This is my debut post here. So please bear with me if it doesn't live up to the high standards of clarity of more seasoned members.
I have 4 objects (representing 4 years) in my global environment that are lists consisting of 12 data-frames (one for each month in the year). They have a consistent structure, and the column names of the data-frames are all the same. I'm trying to change these column names of the data-frames in all 4 lists in one fell swoop using a function, and then overwrite all 4 objects in my global environment with new objects that have the data-frames with the new column names.
This is my function:
change.name <- function(data){
for (i in 1:length(data)){
names(data[[i]]) <- c("a", "b", "c", "d", "e")
}
assign(deparse(substitute(data)), value = data, envir = globalenv())
}
I use my function:
change.name(my_object1)
It works, except that I get this warning message:
Warning message:
In assign(deparse(substitute(data)), value = data,
envir = globalenv()) : only the first element is used as variable
name
And the object in my global environment is not overwritten. I get a new object with a name like this:
"list(Jan = structure(list(a = c(11, 34, 36, 49, 55, 68, "
I understand that this has to do with the way the function variable is stored in the new environment R creates when running a function (or something along those lines).
My question is simple: How do I remedy this?
You can fix this by using deparse(substitute(data)) before you do anything to data:
# Let's change your function just a bit
change.name <- function(data){
# call deparse(substutite()) *before* you do anything to data
object_name <- deparse(substitute(data))
for (i in 1:length(data)){
names(data[[i]]) <- c("a", "b", "c", "d", "e")
}
assign(object_name, value = data, envir = globalenv())
}
# Create sample data
my_object1 <- lapply(1:12, function(x) {
data.frame(u = 1, v = 2, x = 3, y = 4, z = 5)
})
names(my_object1) <- month.name
change.name(my_object1)
ls()
#> [1] "change.name" "my_object1"
head(my_object1, 2)
#> $January
#> a b c d e
#> 1 1 2 3 4 5
#>
#> $February
#> a b c d e
#> 1 1 2 3 4 5
Created on 2018-12-20 by the reprex package (v0.2.1)
A more idiomatic (and probably safer) way to approach this task might be to simply use lapply and setNames:
my_object1 <- lapply(1:12, function(x) {
data.frame(u = 1, v = 2, x = 3, y = 4, z = 5)
})
names(my_object1) <- month.name
change.name <- function(obj){
lapply(obj,function(x) setNames(x,letters[1:5]))
}
my_object1 <- change.name(my_object1)
Right you are, the problem lies in the way functions behave. Take a look at the following code, it might help
testFun1 <- function (val) {
a <<- val
assign("b",a)
}
testFun2 <- function (val) {
a <<- val
assign("b",a, pos = 1)
}
# environment pretty much empty apart from our functions
ls()
[1] "testFun1" "testFun2"
# run
set.seed(123)
testFun1(runif(1))
# less empty
ls()
[1] "a" "testFun1" "testFun2"
# still not quite it though
testFun2(runif(1))
# now that's better
ls()
[1] "a" "b" "testFun1" "testFun2"
For more information, take a look at the documentation (?assign), especially the pos argument.

Multiple ttests

I want to perform multiple ttests on data in the following format
first column is "id"
with values (for example) 1,1,1,2,2,2
second column is "ratios"
with values 0.2, 0.18, 0.3, 1.5, 1.4, 1.6
for each instance of "id" I want to test all ratio values against all the ratio values in the dataframe
Right now I have this
data <- read.delim("clipboard", stringsAsFactors=FALSE) ##data to test
dist <- as.numeric(readClipboard()) ##distribution to test against
data$Ratio.Mean.H.L <- NA
data$p.value <- NA
for (i in 1:nrow(data))
if (nrow(data) > 1)
{
#welsh t-test
t.test.result <- t.test(data$ratio[i],dist,
alternative = "two.sided",
mu = 0,
paired = FALSE,
var.equal = FALSE,
conf.level = 0.95)
#writes data into the data.frame
data$p.value[i] <- t.test.result$p.value
}
write.table(data, file="C:/R_Temp/t-test.txt", sep = "\t")
I know this does not work, for one I am not sure I am only testing rows that share the same "id". I am also manually entering the distribution to test against, which is all entries in the "ratio" column.
How do I do this correct? and add multiple testing correction (bonferroni)?
I suspect that MattParker's comment is going to be the biggest thing here: you are comparing a single number with a vector, and t.test will complain about that. Since you suggested that you want to perform tests per grouping variable (id), so in base R you probably want to use a function like by (or split). (There are great methods within dplyr and data.table as well.)
Using mtcars as sample data, I'll try to mimic your data:
dat <- mtcars[c("cyl", "mpg")]
colnames(dat) <- c("id", "ratio")
It isn't clear what you mean to use for dist, so I'll use the naïve
dist <- 1:10
Now you can do:
by(dat$ratio, dat$id, function(x) t.test(x, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
If you want/need to deal with more than just ratio at a time, you can alternatively do this:
by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
The results from the call to by are a class "by", which is really just a repackaged list with some extra attributes:
res <- by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
class(res)
# [1] "by"
str(attributes(res))
# List of 4
# $ dim : int 3
# $ dimnames:List of 1
# ..$ dat$id: chr [1:3] "4" "6" "8"
# $ call : language by.data.frame(data = dat, INDICES = dat$id, FUN = function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# $ class : chr "by"
So you can expand/access it however you would a list:
res[[1]]
# [1] 2.660716e-10
as.numeric(res)
# [1] 2.660716e-10 4.826322e-09 2.367184e-07
names(res)
# [1] "4" "6" "8"
(Realize that the different levels of dat$id are the integers 4, 6, and 8, so the names should correspond to your $id.)
Edit:
If you want the results in a data.frame, two options come to mind:
Repeat the p-value for each and every row, resulting in a lot of duplication. I discourage this method for several reasons; if you need it at some point, I suggest using option 2 and then merge.
Produce a data.frame with as many rows as unique id. Something like:
do.call(rbind.data.frame,
by(dat, dat$id, function(x) list(id=x$id[1], pv=t.test(x, dist, paired=F)$p.value)))
# id pv
# 4 4 1.319941e-03
# 6 6 2.877065e-03
# 8 8 6.670216e-05
OK, Sorry for the poorly defined question. I got help elsewhere and will post the script that worked for those that are interested. I want to calculate p-values for ratio changes in a proteomics experiment. To do this I make individual t-tests for all the ratio measurements for any given protein or PTM site.These measurements are compared to the median of all measurments (mu in the t.test function), or to the entire distribution of measurements. In one column I have "id"s which are unique for each entry, in the other column I have "values" (ratios). I will make t-tests comparing all "values" that occur with any given unique "id". for ease of use I paste the table into the script, rather than calling it from a file (it saves me a step).
data <- read.delim("clipboard", stringsAsFactors=FALSE) ##data to test(two columns "id" and "value") Log-transfrom ratios!!
summary(data)
med <- median(data$value)
# function for the id-grouped t-test
calc_id_ttest <- function(d) #col1: id, col2:values
{
colnames(d) <- c("id", "value") # reassign the column names
# calculate the number of values for each id
res_N <- as.data.frame(tapply(d$value, d$id, length))
colnames(res_N) <- "N"
res_N$id <- row.names(res_N)
# calculate the number of values for each id
res_med <- as.data.frame(tapply(d$value, d$id, median))
colnames(res_med) <- "med"
res_med$id <- row.names(res_med)
# calculate the pvalues
res_pval <- as.data.frame(tapply(d$value, d$id, function(x)
{
if(length(x) < 3)
{ # t test requires at least 3 samples
NA
}
else
{
t.test(x, mu=med)$p.value #t.test (Pearson)d$value with other distribution? alternative=less or greater
} #d$value to compare with entire distribution
#mu=med for median of values for 1-sided test
}))
colnames(res_pval) <- "pval" # nominal p value
res_pval$id <- row.names(res_pval)
res_pval$adj.pval <- p.adjust(res_pval$pval, method = "BH") #multiple testing correction also "bonferroni"
res <- Reduce(function(x,y)
{
merge(x,y, by = "id", all = TRUE)
},
list(res_N, res_med, res_pval))
return (res)
}
data_result <- calc_id_ttest(d = data)
write.table(data_result, file="C:/R_Temp/t-test.txt", quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")

Resources