Apply a function to different dataframes - r

I am trying to run a function over different datasets, and can't seem to get it work. The variable names x and y are the same across datasets, but the dataset (argument z in my custom function) is different.
I have tried lapply but it is not working
Running the function over individual datasets works fine:
resultsmadrid <- customfunction (x=types, y=score, z=madrid)
resultsnavarra <- customfunction (x=types, y=score, z=navarra)
resultsaragon <- customfunction (x=types, y=score, z=aragon)
Trying to do it in one take is not working
regiones <- list(madrid, navarra, aragon) #Creates the list
resultregiones <- lapply(regiones, customfunction(x=types, y=score, z=regiones)) #Applies that to the list (?)
It's not looping the analysis across the dataframes in the list, the error message says there is a missing argument in the function.
I am not clear on how to call each dataframe from the function argument that does that (z, in my case). It seems the name of the comprehensive list object is not the right approach. Thanks for the help!

since types and scores are the same, you need to 'loop' throught the elements of you regiones list. Try it like this:
regiones <- list(madrid, navarra, aragon) #Creates the list
resultregiones <- lapply(regiones,function(X) customfunction(x=types, y=score, z=X))

Related

Output of a function in R change with the number of inputs

I am trying to run a function to download data from the USGS website using dataRetrieval package of R and a function I have created called getstreamflow. The code is the following:
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow <- sapply(siteNumber, function(siteNumber) tryCatch(getstreamflow(siteNumber), error = function(e) message(paste("Error in station ", siteNumber))))
Streamflow <- Filter(NROW,Streamflow) #to delete empty data frames
I got the output I want that it is the one shown in the image below:
However, when I ran the same code but increase the number of stations in the input siteNumber
The output change and instead to produce several dataframes inside of a list. It generates a list for each data frame.
Does someone know why this happens? It is the same function only changes the number of stations in the siteNumber
Based on the image showed in the new data, each element in the list is nested as a list. We can extract the list element (of length 1) with [[1]] and then apply the Filter
out <- Filter(NROW, lapply(Streamflow, function(x) x[[1]]))
As we used NROW, it passed the test for list as well where it returns 1 for length attribute of list and thus all the elements meet the condition TRUE. Also, in the previous step, OP uses sapply and sapply is one function which can sometimes simplify the output. Instead of sapply use lapply (or specify simplify = FALSE)

Store multiple plots into a list using a function

I am trying to store multiple plots produced by ggplot2 into a list.
I am attempting to use the list function suggested in a previous thread, however I am having difficulty creating my own function to meet my needs.
First, I split a dataframe based on a factor into a list with the following code:
heatlist.germ <- split(heatlist.germ, f=as.factor(heatlist.germ$plot))
Afterwhich, I attempt to create a list function that I can later use lapply with.
plot_data_fcn <- function (heatlist.germ) {
ggplot(heatlist.germ[[i]], aes(x=posX, y=posY, fill=germ_bin)) +
geom_tile(aes(fill=germ_bin)) +
geom_text(aes(label=germ_bin)) +
scale_fill_gradient(low = "gray90", high="darkolivegreen4") +
ggtitle(plot) +
scale_x_continuous("Position X", breaks=seq(1,30)) +
scale_y_continuous("Position Y (REVERSED)", breaks=seq(1,20))
}
heatlist.test <- lapply(heatlist.germ[[i]], plot_data_fcn)
Two main things I am trying to accomplish:
Store the 12 ggplots (hence 12 factors of plot) in a list.
Create a title called "Plot [i] Germination".
Any help would be appreciated.
I don't have your data, so I'll simplify the plotting mechanism.
The first problem is that you should not use your [[i]] referencing in your function. Just have your function deal with data as-is, it really doesn't know that its argument is (in another environment) an element with a list. It knows just the object itself.
# a simple plot function
myfunc <- function(x) ggplot(x, aes_string(names(x)[1], names(x)[2])) + geom_point()
# a list of frames, nothing fancy here
datalist <- replicate(3, mtcars, simplify = FALSE)
# just call it ...
myplots <- lapply(datalist, myfunc)
class(myplots[[1]])
# [1] "gg" "ggplot"
When myfunc is called, its argument x is just a data.frame, the function has no idea that x is the first (or second or third) frame in a list of frames.
If you want to include the nth frame with an index indicating which element it is, this is in my view "zipping" data together, so I suggest Map. (You can also use purrr::imap or related tidyverse functions.)
myfunc2 <- function(x, title = "") ggplot(x, aes_string(names(x)[1], names(x)[2])) + geom_point() + labs(title = title)
myplots <- Map(myfunc2, datalist, sprintf("Plot number %s", seq_along(datalist)))
class(myplots[[1]])
# [1] "gg" "ggplot"
To understand how Map relates to lapply, then understand that lapply(datalist, myfunc) is "unrolled" to something like:
myfunc(datalist[[1]])
myfunc(datalist[[2]])
myfunc(datalist[[3]])
With Map, however, it takes one function that must accept one or more arguments in each call. With that, Map accepts as many lists (or vectors) as the function accepts arguments. The two functions are synonomously
lapply(datalist, myfunc) # data first, function second
Map(myfunc, datalist) # function first, data second
and a more complicated call unrolls like thus:
titles <- sprintf("Plot number %d", seq_along(datalist)) # "Plot number 1", ...
Map(myfunc2, datalist, titles)
# equivalent to
myfunc2(datalist[[1]], titles[[1]])
myfunc2(datalist[[2]], titles[[2]])
myfunc2(datalist[[3]], titles[[3]])
It doesn't really matter if each of the arguments is a true list (as in datalist) or a vector (as in titles), as long as they are the same length (or length 1).

get() not working for column in a data frame in a list in R (phew)

I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")}
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop since lapply would require the <<- assignment symbol (<- doesn't work on lapply`). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible() otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the new created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.

Creating a function that names objects created based on a function in R

I have a question about creating a function that creates ggplots. I want to create my own function to graph values in multiple data frames quickly instead of writing a whole ggplot with each argument filled out each time. What I want to do is to input a vector of the names of the data frames, have the function create the graphs and have each saved as a new object with a different name. Example of my idea is…
myfunction <- function(x) {
ggplot(x, aes(x = time, y = result)) +
geom_point()
}
I want to be able to do something like
myfunction(c(testtype1, testtype2, testtype3))
and have the function create objects plot1, plot2, plot3. As of now, I can only do
plot1 <- myfunction(testtype1)
plot2 <- myfunction(testtype2)
plot3 <- myfunction (testtype3)
I don’t want to keep typing that over and over, especially if I have a lot of test types. Is there a way that the function can be modified to use the function to name the objects according to some formula?
With this, you can provide any number of (appropriate) data frames, and the l_my_fun would return a list containing the plots.
l_my_fun <- function(x, ...) {
l <- list(x, ...)
ps <- lapply(l, myfunction)
ps
}
out <- l_my_fun(testtype1, testtype2, testtype3)
For example, now access the second plot as
out[[2]]

Using dplyr for exploratory plots

I regularly used d_ply to produce exploratory plots.
A trivial example:
require(plyr)
plot_species <- function(species_data){
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
d_ply(.data=iris,
.variables="Species",
function(x)plot_species(x))
Which produces three separate plots, one for each species.
I would like to reproduce this behaviour using functions in dplyr.
This seems to require the reassembly of the data.frame within the function called by summarise, which is often impractical.
require(dplyr)
iris_by_species <- group_by(iris,Species)
plot_species <- function(Sepal.Length,Sepal.Width){
species_data <- data.frame(Sepal.Length,Sepal.Width)
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
summarise(iris_by_species, plot_species(Sepal.Length,Sepal.Width))
Can parts of the data.frame be passed to the function called by summarise directly, rather than passing columns?
I believe you can work with do for this task with the same function you used in d_ply. It will print directly to the plotting window, but also saves the plots as a list within the resulting data.frame if you use a named argument (see help page, this is essentially like using dlply). I don't fully grasp all that do can do, but if I don't use a named argument I get an error message but the plots still print to the plotting window (in RStudio).
plot_species <- function(species_data){
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
group_by(iris, Species) %>%
do(plot = plot_species(.))

Resources