I am using glm() to create a few different models based on the values in a vector I make (h1_lines). I want sapply to return a model for each value in the vector. Instead, my code is currently returning a list of lists where one part of the list is the model. It seems to be returning everything I do inside the sapply function.
train = data.frame(scores=train[,y_col], total=train[,4], history=train[,5], line=train[,6])
h1_lines<- c(65, 70, 75)
models <- sapply(h1_lines, function(x){
temp_set<-train
temp_set$scores<-ifelse(temp_set$scores>x,1,
ifelse(temp_set$scores<x,0,rbinom(dim(temp_set)[1],1,.5)))
mod<-glm(scores ~ total + history + line, data=temp_set, family=binomial)
})
I'd like the code to work so after these lines I can do:
predict(models[1,], test_case)
predict(models[2,], test_case)
predict(models[3,], test_case)
But right now I can't do that, because sapply is returning more than just the model. If I do print(dim(models)), it says models has 30 rows and 3 columns.
EDIT TO ADD QUESTION;
Using the suggestion below, the code works great: I can do predict(models[[1]], test_case) and it works perfectly. How can I return/save the models so I can access them with the key I used to create them? For example, using h1_lines it could be something like the following:
predict(models[[65]], test_case)
predict(models[[key==65]], test_case)
You need to use lapply instead of sapply.
sapply simplifies too much. Try:
lapply(ListOfData, function(X) lm(y~x, X))
sapply(ListOfData, function(X) lm(y~x, X))
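sapply tries to simplify its result into a vector or matrix. Each fitted model is itself a list of components, and every model has the same number of them, so sapply binds the components into a matrix with one column per model; that is the 30 x 3 result you are seeing. If you ever expect each item of the output to have extractable parts (i.e. Item$SubItem), you should use lapply instead.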
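To make the difference concrete, here is a minimal sketch using the built-in mtcars data instead of the asker's train set. The per-cylinder lm fits and the names by_cyl, fit_one, simplified and models are assumptions chosen only for illustration.

```r
# Illustrative data: one data frame per cylinder count (not the asker's data)
by_cyl <- split(mtcars, mtcars$cyl)

# Hypothetical helper that fits one model per data frame
fit_one <- function(d) lm(mpg ~ wt, data = d)

simplified <- sapply(by_cyl, fit_one)  # components bound into a list-matrix
models     <- lapply(by_cyl, fit_one)  # plain list, one model per element

dim(simplified)      # one row per model component, one column per model
class(models[[1]])   # each element is still a complete "lm" object
predict(models[[1]], newdata = data.frame(wt = 3))
```

Only the lapply result can be passed straight to predict(), because each element is an intact model object.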
Update
Answering your next question, you can do either:
names(models) <- h1_lines
names(h1_lines) <- h1_lines ## Before lapply
And call them by
models[["65"]]
Remember to use quotes around the numbers. As a side note, naming list items with numbers is not always the best idea. A workaround could be:
models[[which(h1_lines==65)]]
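A minimal sketch of the "name the vector before lapply" option, using non-numeric names to sidestep the quoting issue. The mtcars data, the cutoff values and the "line_" prefix are assumptions for illustration, and lm on a subset stands in for the asker's glm call.

```r
# Hypothetical cutoffs playing the role of h1_lines
keys <- c(60, 100, 150)
names(keys) <- paste0("line_", keys)   # "line_60", "line_100", "line_150"

# lapply carries the names of its input over to the result
models <- lapply(keys, function(k) {
  lm(mpg ~ wt, data = mtcars[mtcars$hp > k, ])
})

names(models)
coef(models[["line_100"]])
```

Because the names are ordinary strings, models$line_100 works as well as models[["line_100"]].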
Related
I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since lapply would require the <<- assignment operator (a plain <- inside the function only changes a local copy). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible(), otherwise lapply would print its return value on the console. The <<- assigns the vector runif(...) to the newly created column in the original aList.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
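A variant of the same idea, as a sketch: iterating over the list itself rather than over its indices keeps the element names in the result. The name newList is an assumption for illustration.

```r
set.seed(1)
aList <- list(cars = mtcars, iris = iris)

# Each data frame is passed in as df; the modified copy is returned
newList <- lapply(aList, function(df) {
  df$newcol <- runif(nrow(df))
  df
})

names(newList)                    # names carried over from aList
"newcol" %in% names(aList$iris)   # the original list is left untouched
```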
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
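On the get() part of the question: get() looks up a single object by its name, and "a[[1]]$DIM" is an expression rather than a name, so no object called that exists. A minimal sketch that avoids building strings altogether is shown below; the two small data frames and the use of mean are assumptions standing in for the real data and the real function.

```r
# Stand-in for the list of data frames that each have a DIM column
a <- list(data.frame(DIM = 1:3), data.frame(DIM = 4:6))

# Apply a function to the DIM column of every data frame
results <- lapply(a, function(df) mean(df$DIM))

# Or just pull the columns out, using `[[` as the function
dims <- lapply(a, `[[`, "DIM")
```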
my_function <- function(n){}
result = list()
for(i in 0:59){
result[i] = my_function(i)
}
write.csv(result, "result.csv")
New to R, read that for-loops are bad in R, so is there an alternative to what I'm doing? I'm basically trying to call my_function with a parameter that's increasing, and then write the results to a file.
Edit
Sorry I didn't specify that I wanted to use some function of i as a parameter for my_function, 12 + (22*i) for example. Should I create a list of values and then call lapply with that list of values?
for loops are fine in R, but they're more verbose than they need to be in a lot of use cases, especially simple ones. The apply family of functions usually makes a good substitute.
result <- lapply(0:59, my_function)
write.csv(result, "result.csv")
Depending on what your function's output is, you might want sapply rather than lapply.
Edit:
Per your update, you could do it as you say, creating the vector of values first, or you could just do something like:
lapply(12+(22*0:59), my_function)
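Putting the pieces together, here is a minimal sketch under the assumption that my_function returns a single number. The body of my_function, the column names and the temporary output file are all hypothetical.

```r
# Hypothetical stand-in for the real my_function
my_function <- function(n) n^2

params <- 12 + 22 * (0:59)   # the 60 parameter values

# vapply is like sapply, but checks that each result is one number
values <- vapply(params, my_function, numeric(1))

# A data frame keeps each parameter next to its result when written out
result <- data.frame(param = params, value = values)
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
```

If my_function returns something more complex than a single number, lapply is the safer choice and the results would need to be combined before writing.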
Let's say I have a function that accepts a vector of parameters and returns a vector of results (of the same length). And let's say I want to call this function 100 times always with the same parameter - a 100 elements long vector of 1 - ideally getting a list of vectors as a result.
The first thing that came to my mind was to use lapply, specifically to call lapply on a list of vectors. My testing on smaller data proved that it should work and that it returns data in required format. The problem is that I'm unable to generate the list of vectors I need as the argument.
All I found online was how to generate a vector, which doesn't help me much as I already know how to do that. The problem is how to generate a list out of these vectors (using list(rep(1, 100), rep(1, 100), ...) is out of the question, as I'd have to repeat the rep(1, 100) part a hundred times).
The quickest way to do this is to use R's built in replicate function, like so:
replicate(100, rep(1, 100), simplify = FALSE)
where rep(1, 100) gets replaced by the vector you actually want a list of 100 copies of. An equivalent statement would be to use lapply and an anonymous function, like so:
lapply(1:100, function(x){ rep(1, 100) })
Essentially, what this is doing is writing a function that takes its input, throws it away, and outputs your vector of choice. In fact, that's not much different than what replicate does under the hood, according to the documentation:
replicate is a wrapper for the common use of sapply for repeated evaluation of an expression
The only difference from the standard use of replicate is that, by default, replicate returns your list of vectors simplified to an array. But as you can see it's easy enough to force it not to do that by passing simplify = FALSE.
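A minimal sketch of the whole round trip, assuming a hypothetical function f that takes a vector and returns a vector of the same length:

```r
# Hypothetical stand-in for the function being called 100 times
f <- function(v) v * 2

# List of 100 identical input vectors
inputs <- replicate(100, rep(1, 100), simplify = FALSE)

# One result vector per input, returned as a list
outputs <- lapply(inputs, f)

# For comparison: the default simplification gives a matrix instead
simplified <- replicate(100, rep(1, 100))
dim(simplified)
```

Note that replicate re-evaluates its expression on every iteration, so with a random expression such as runif(100) each list element would be different.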
I have a 58-column dataframe, and I need to apply the transformation $\log(x_{i,j}+1)$ to all values in the first 56 columns. What method could I use to go about this most efficiently? I'm assuming there is something that would allow me to do this rather than just using some for loops to run through the entire dataframe.
alexwhan's answer is right for log (and should probably be selected as the correct answer). However, it works so cleanly because log is vectorized. I have experienced the special pain of non-vectorized functions too frequently. When I started with R, and didn't understand the apply family well, I resorted to ugly loops very often. So, for the purposes of those who might stumble onto this question who do not have vectorized functions I provide the following proof of concept.
#Creating sample data
df <- as.data.frame(matrix(runif(56 * 56), 56, 56))
#Writing an ugly non-vectorized function
logplusone <- function(x) {log(x[1] + 1)}
#example code that achieves the desired result, despite the lack of a vectorized function
df[, 1:56] <- as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)}))
#Proof that the results are the same using both methods...
#Note: I used all.equal rather than all so that the values are tested using machine tolerance for mathematical equivalence. This is probably a non-issue for the current example, but might be relevant with some other testing functions.
#should evaluate to true
all.equal(log(df[, 1:56] + 1),as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)})))
You should be able to just refer to the columns you want, and do the operation, ie:
df[,1:56] <- log(df[,1:56]+1)
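As a sketch of the same transformation on a 58-column data frame, the built-in log1p(x) computes log(1 + x) and can be applied column by column with lapply. The random sample data and the names expected and untouched are assumptions for illustration.

```r
set.seed(42)
# Sample data: 10 rows, 58 numeric columns
df <- as.data.frame(matrix(runif(10 * 58), nrow = 10, ncol = 58))

expected  <- log(df[, 1:56] + 1)   # the direct version, kept for comparison
untouched <- df[, 57:58]           # copy of the columns that must not change

# Transform only the first 56 columns
df[1:56] <- lapply(df[1:56], log1p)

all.equal(df[, 1:56], expected)
```

log1p is also more accurate than log(x + 1) when x is very close to zero.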
For each of 100 data sets, I am using lm() to generate 7 different equations and would like to extract and compare the p-values and adjusted R-squared values.
Kindly assume that lm() is in fact the best regression technique possible for this scenario.
In searching the web I've found a number of useful examples for how to create a function that will extract this information and write it elsewhere, however, my code uses paste() to label each of the functions by the data source, and I can't figure out how to include these unique pasted names in the function I create.
Here's a mini-example:
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
for (i in testrun)
{
assign(paste(i,"test",sep=""),lm(temp$LogPre~temp$labels))
assign(paste(i,"test2",sep=""),lm(temp$LogPre~temp$labels2))
}
I would then like to extract the coefficients of each equation
But the following doesn't work:
summary(paste(i,"test",sep="")$coefficients)
and neither does this:
coef(summary(paste(i,"test",sep="")))
Both generate the error: $ operator is invalid for atomic vectors
EVEN THOUGH
summary(XXtest)$coefficients
and
coef(summary(XXtest))
work just fine.
How can I use paste() within summary() to allow me to do this for AAtest, AAtest2, ABtest, ABtest2, etc.
Thanks!
Hard to tell exactly what your purpose is, but some kind of apply loop may do what you want in a simpler way. Perhaps something like this?
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
names(testrun) <- testrun
out <- lapply(testrun, function(i) {
list(test1=lm(temp$LogPre~temp$labels),
test2=lm(temp$LogPre~temp$labels2))
})
Then to get all the p-values for the slopes you could do:
> sapply(out, function(i) sapply(i, function(x) coef(summary(x))[2,4]))
XX
test1 0.02392516
test2 0.02389790
Just using paste results in a character string, not the object with that name. You need to tell R to get the object with that name by using get.
summary(get(paste(i,"test",sep="")))$coefficients
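Since the question also asks for adjusted R-squared values, here is a minimal sketch that keeps the models in a named list and collects both statistics in one table, so neither assign() nor get() is needed. The seed and the names formulas, fits and stats are assumptions for illustration.

```r
set.seed(123)
temp <- data.frame(labels = 1:10, LogPre = rnorm(10))
temp$labels2 <- temp$labels^2

# One named formula per model
formulas <- list(test = LogPre ~ labels, test2 = LogPre ~ labels2)
fits <- lapply(formulas, function(f) lm(f, data = temp))

# One row per model: slope p-value and adjusted R-squared
stats <- t(sapply(fits, function(fit) {
  s <- summary(fit)
  c(p_value = coef(s)[2, 4], adj_r2 = s$adj.r.squared)
}))
stats
```

For 100 data sets, the same pattern could be wrapped in an outer lapply over the list of data sets, giving one such table per data set.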