How to process a dataframe row by row, passing the columns as args to a function, *as a single call to function* - r

Still quite new to R. It's quite possible my question is due to gaps in my thinking about this problem, but after a few hours of googling, I'm still stuck.
The problem:
I have a dataframe(tibble) that contains 6 rows, and 3 columns.
The columns are Filename, Metadata1, Metadata2.
I want to call a function for each row, as follows:
function(Filename, Metadata1, Metadata2).
In other languages, this would be a simple for loop, but I am completely stuck how to do this in R, both looking at base, and tidyverse ways to do this. All the answers I've come across are variations of calling the function on every element in the dataframe or matrix, whereas I want to effectively pass the whole row to the function, as individual args.
It's probably blindingly obvious, but I would really appreciate some guidance.
EDIT:
I ran across mapply, and it seems to do the job I need, but I have no idea if this is the only or best method. This is what I'm working with currently:
testfunc <- function(a,b,c){
str(a)
str(b)
str(c)
}
discard <- mapply(testfunc, a=files_sorted$file, b=files_sorted$AppID, c=files_sorted$server)

Moments after I posted the last edit to my question, I hit the exact issue that @mrflick mentioned, where my function was not vectorised.
In the end, I did end up using a for loop, this is what I settled on:
overall_data <- tibble()
# transpose() (from purrr) turns the tibble into a list of rows, each a named list
for (a in transpose(files_sorted)) {
  df <- processFile(file = a$list_files, srv = a$server, tap = a$AppID)
  # view(df)
  overall_data <- bind_rows(overall_data, df)
}
files_sorted:
I'm sure I'll learn better ways to tackle this in the future, but I'm leaving this here.
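For readers who want the "one call per row, columns as arguments" pattern without an explicit loop, here is a minimal base R sketch. The data frame contents and the process_row function are hypothetical stand-ins for files_sorted and processFile; only the column names file, AppID, and server come from the question.

```r
# Hypothetical stand-in for files_sorted (values invented for illustration)
files_sorted <- data.frame(
  file   = c("a.csv", "b.csv"),
  AppID  = c(1, 2),
  server = c("s1", "s2"),
  stringsAsFactors = FALSE
)

# Hypothetical per-row function: receives one value per column, returns a data frame
process_row <- function(file, AppID, server) {
  data.frame(file = file, label = paste(server, AppID, sep = "-"),
             stringsAsFactors = FALSE)
}

# Map() calls process_row once per row, matching the named vectors to its arguments
results <- Map(process_row,
               file   = files_sorted$file,
               AppID  = files_sorted$AppID,
               server = files_sorted$server)

# Stack the per-row results into a single data frame
overall_data <- do.call(rbind, results)
```

Map is a thin wrapper around mapply that always returns a list, which makes the final rbind step predictable even when each call returns a data frame.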

Related

Filtering with dplyr not working as expected

How are you?
I have the following problem, which is very strange because the task is very simple.
I want to filter on one of my factor variables in R, but the result is an empty data frame.
My data frame is called "data_2022". If I execute this code:
sum(data_2022$CANALDEVENTA=="WEB")
The result is 2704800, which is the number of rows for which the condition is TRUE.
a= data_2022 %>% filter(CANALDEVENTA=="WEB")
This returns an empty data frame.
I know I am not an expert in R, but I have done this a million times and never had this error before.
Do you have a clue what the problem is?
Sorry I did not make a reproducible example.
Thank you in advance.
You could use the subset function:
a<-subset(data_2022, CANALDEVENTA=="WEB")
If you are using the tidyverse, make sure you are calling dplyr's filter by writing dplyr::filter explicitly, since a filter from another package may be masking it. filter expects a logical expression that is evaluated on the columns of the data frame. Try this code too:
my_names<-c("WEB")
a<-dplyr::filter(data_2022, CANALDEVENTA %in% my_names)
Hope it works.
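One way to check the masking theory is to ask R which package a bare filter resolves to in the current session. The sketch below does that and then filters a small data frame with base R only. The data frame d and the level "TIENDA" are made up for illustration; only the column name CANALDEVENTA and the value "WEB" come from the question.

```r
# Which package does a bare `filter` resolve to right now?
# In a fresh session this is the stats version; with dplyr attached it should be dplyr.
filter_home <- environmentName(environment(filter))
print(filter_home)

# Hypothetical miniature of data_2022
d <- data.frame(CANALDEVENTA = factor(c("WEB", "TIENDA", "WEB")), n = 1:3)

# Base R row filter; which() drops NA comparisons instead of returning NA rows
web_rows <- d[which(d$CANALDEVENTA == "WEB"), , drop = FALSE]
```

If filter_home is not "dplyr" after loading the tidyverse, calling dplyr::filter explicitly, as suggested above, sidesteps the conflict.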

How to automatize listing many elements within a command line in R? [duplicate]

Currently, I have multiple dataframes that share a base name and are numbered in running order (foo1, foo2, foo3, foo4, foo5... etc). I am trying to create a large dataframe containing all the rows of the above dataframes with rbind(). Is there an elegant way to do it which would be the equivalent of rbind(foo1, foo2, foo3, foo4, foo5...)?
I have tried do.call(rbind, paste0("foo",i)) where i=c(1,2,3...) to no avail.
There is a solution mentioned here, which is:
do.matrix <- do.call(rbind, lapply( paste0("variable", 1:10) , get) )
However, the answer mysteriously says "That is the wrong way to handle related items. Better to use a list or dataframe, but you will probably find out why in due course."
Why would that be the wrong way to do this, and what would be the "right" way?
Thanks.
Always try to rigorously capture relations between related instances of data, or related data and methods, or related methods. This generally helps ease aggregate manipulation such as your rbind requirement.
For your case, you should have defined your related data.frames as a single list from the beginning:
foo <- list(data.frame(...), data.frame(...), ... );
And then your requirement could be satisfied thusly:
do.call(rbind, foo );
If it's too late for that, then the solution involving repeated calls to get(), as described in the article to which you linked, can do the job.
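As a rough sketch of the "too late" route, base R's mget() collects several objects by name into a named list in one call, after which the do.call(rbind, ...) step is the same as above. The three small data frames here are invented for illustration.

```r
# Hypothetical numbered data frames, as in the question
foo1 <- data.frame(id = 1:2, v = c("a", "b"))
foo2 <- data.frame(id = 3:4, v = c("c", "d"))
foo3 <- data.frame(id = 5:6, v = c("e", "f"))

# mget() looks up each name and returns a named list of the objects
frames <- mget(paste0("foo", 1:3))

# With everything in a list, binding all rows is a single call
combined <- do.call(rbind, frames)
```

Once the data frames live in a list, the numbered variables are no longer needed, and other aggregate operations (lapply, sapply, and so on) apply to the same list.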

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate this over all the data.frames stored in my list, or over a specific subset of them.
I could use a loop but I was told that
apply() and its variations are better
I tried using several solutions I found via search but somehow it just doesn't work.
I assume I need to use
unlist()
Could you provide an example of how to calculate e.g. a mean for a data structure like mine.
A list with several data.frames containing several columns.
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (and even for the case when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a=data.frame(foo=rnorm(10),
bar=rnorm(10)),
b=data.frame(foo=rnorm(10),
bar=rnorm(10)),
c=data.frame(foo=rnorm(10),
bar=rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) stacks the data.frames into a single one, which is similar to what you had in mind with the unlist function. sapply, as referred to by Roland in his answer, then calls the function mean on each component (column) of that combined data.frame.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
b=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
c=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)))
sapply(do.call("rbind",mylist),function(x) {
if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
The whole do.call('rbind', List) thing can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
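One caveat worth sketching: sapply only simplifies to a matrix when every data.frame has the same number of rows. A variant with lapply and unlist, shown below on made-up data, pools the values first and so gives the grand mean even when the row counts differ. The column name rand follows the example above; the values are invented.

```r
# Hypothetical list of data frames with different numbers of rows
mylist <- list(
  a = data.frame(rand = c(1, 2, 3)),
  b = data.frame(rand = c(4, 5)),
  c = data.frame(rand = 6)
)

# Pull the column out of each data frame, pool the values, then average once
pooled <- unlist(lapply(mylist, function(d) d$rand))
grand_mean <- mean(pooled)
```

Averaging the pooled values weights every row equally, which is what a grand mean over all the data.frames requires.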

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way to return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
Yeah? Well, you can't. return(x+y,x-y) fails with the error "multi-argument returns are not permitted", and plus,minus = foo(10,4) is not valid R syntax.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical: the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note that in your example carYear isn't a data frame; it's a character vector of column names.
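To illustrate the named-list suggestion, here is a minimal sketch that reworks the foo example so it returns a single list whose elements are named. The element names plus and minus are taken from the pseudocode above; everything else is illustrative.

```r
# Return one object: a list whose elements are named
foo <- function(x, y) {
  list(plus = x + y, minus = x - y)
}

res <- foo(10, 4)

# Pull the pieces out by name in the caller
plus  <- res$plus
minus <- res$minus
```

Naming the elements means the caller does not depend on their order, which is the "meaning" the answer refers to.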
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution, perhaps with an additional function, or with the recommended approach (i.e. a list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need, in a way similar to how Python can unpack values returned from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

Cumulative sum for n rows

I have been trying to write a command in R that produces a new vector where each row is the sum of 25 rows from a previous vector.
I've tried writing a function to do this, but it only gives me a result for one data point.
I'll post where I have got to. I realise this is probably a fairly basic question, but it is one I have been struggling with, and any help would be greatly appreciated:
example<-c(1;200)
fun.1<-function(x)
{sum(x[1:25])}
checklist<-sapply(check,FUN=fun.1)
This then supplies me with a vector of length 200 where all values are NA.
Can anybody help at all?
Your example is a bit noisy: c(1;200) has no meaning (you probably want 1:200 there or, if you would like a list of lists, something built with rep), and there is no check variable (it should have been example).
Here's the code that I think you probably need (as far as I was able to understand the question):
x <- rep(list(1:200), 5)
f <- function(y) {y[1:20]}
sapply(x, f)
Next time please be more specific, and try out the code you post as an example before submitting a question.
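Since the question asks for sums over 25 rows, a sketch of two possible readings may help. Both are assumptions about what was intended: a rolling sum over every window of 25 consecutive values, and a sum over non-overlapping blocks of 25. The variable names are illustrative.

```r
x <- 1:200   # the vector the question presumably meant by `example`
n <- 25      # window size from the question

# Reading 1: rolling sum. Element i is sum(x[i:(i + n - 1)]),
# computed from differences of the cumulative sum.
cs <- cumsum(x)
rolling <- cs[n:length(x)] - c(0, cs[seq_len(length(x) - n)])

# Reading 2: non-overlapping blocks of n values.
# matrix() fills column by column, so each column is one block.
# This assumes length(x) is a multiple of n.
block <- colSums(matrix(x, nrow = n))
```

The rolling version returns length(x) - n + 1 values and the block version returns length(x) / n values, so the choice depends on which output length was wanted.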
