Is there a way to apply plyr's count() function to every column individually?

Similar to this question but for R: I want a summary count of every value in each column of a data frame.
Currently, something like plyr::count(df[,1:10]) counts how many times each combination of values across those columns occurs, i.e. it matches whole rows. Instead, I just want a quick way of seeing what values each variable takes and how often. I know this could be written with C-style recursion, but I'm hoping for a more elegant/simpler solution.

You can use lapply:
lapply(df, plyr::count)

Alternatively, to keep everything in base R, you can use table() with stack() to get similar output:
lapply(df, function(x) stack(table(x)))
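For illustration, a minimal sketch on a made-up data frame (df and its columns are invented here):
df <- data.frame(x = c("a", "b", "a"), y = c(1, 1, 2))
lapply(df, plyr::count)
This returns one count table per column; for x it should look something like:
  x freq
1 a    2
2 b    1
The base R variant returns the same counts, just with columns named values and ind.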

Related

Picking out first non missing variable by row in data.table R

I want to extract the first non-missing value in each row of my data.table.
function_non_missing <- function(x) {
  x <- x[!is.na(x)]
  # Then apply some other transformations, such as:
  # x <- x[x != ""]
  # x <- x[x != "some random thing"]
  if (length(x) > 0) {
    x[1]
  } else {
    NA
  }
}
Now I just want to apply this function row by row. I searched previous answers and tried things like:
data <- data[, non_missing_var := function_non_missing(.SD), by = 1:nrow(data)]
I also tried other permutations of the same idea, but nothing seems to work. More generally, can somebody point me to a tutorial on the most efficient ways to apply data.table idioms row by row (in particular, how to use Map and Reduce), with the columns specified in .SDcols as arguments? In practice, what I often want to do is something like:
data <- data[, my_new_var := random_function(.SD), .SDcols = c("var_1", "var_2", "var_3"), by = 1:nrow(data)]
where random_function operates on a vector.
Apparently this works:
data <- data[, non_missing_var := function_non_missing(unlist(.SD)), by = 1:nrow(data)]
Could somebody more familiar with data.table explain why this works, and why I need the unlist()?
I suggest using the apply function instead. Try
apply(data, 1, function_non_missing)
The 1 refers to applying the function row-wise.
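On the unlist() question: with by = 1:nrow(data), .SD is a one-row data.table, i.e. a list of length-one columns, so unlist() is needed to flatten it into the atomic vector that function_non_missing expects. apply() does that coercion for you by converting the table to a matrix first. A minimal sketch on made-up data:
library(data.table)
dt <- data.table(var_1 = c(NA, "x", NA),
                 var_2 = c("a", NA, NA),
                 var_3 = c("b", "y", "z"))
dt[, non_missing_var := apply(dt, 1, function_non_missing)]
which should yield a, x, z for the three rows.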

R approach for iterative querying

This is a question about general approach in R. I'm trying to find my way into the R language, but the data types and loop constructs (apply, sapply, etc.) are still a bit unclear to me.
What is my target:
Query data from an API, with parameters taken from a config list containing multiple parameter sets, and return the data as one aggregated data.frame.
First I want to define a list of multiple vectors (columns):
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
How do I manage such a list of value groups (row by row)? data.frames are column-focused; you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a CSV file, which is really an approach I try to avoid, but I can't find anything more elegant.
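For reference, such a config table can live in the script itself; it just has to be defined column-wise rather than row by row (a minimal sketch):
config <- data.frame(site    = c("google.com", "bing.com"),
                     segment = c("Googleuser", "Binguser"),
                     id      = c(123, 456),
                     stringsAsFactors = FALSE)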
Now I want to query my data, let's say with this function:
query.data <- function(site, segment, id) {
  config <- define_request(site, segment, id)
  result <- query_api(config)
  return(result)
}
This gives me a data.frame as a result, meaning every query returns the same columns. So my result should be one big data.frame, not a list of similar data.frames.
Now, sapply allows one parameter list plus multiple static parameters. mapply works, but it gives me my data in some output I can't handle, or even understand what it is exactly.
In principle the list of data.frames is OK and the data is correct, but it feels cumbersome to me.
Which core concepts of R have I not understood yet? What would be the right approach?
If you have a lapply/sapply solution that returns a list of data.frames with identical columns, you can easily get a single large data.frame with do.call(). do.call() passes each item of a list as an argument to another function, allowing you to write things such as
big.df <- do.call(rbind, list.of.dfs)
which appends the component data.frames into a single large data.frame.
In general, do.call(rbind, something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something is some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/data.frame.
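Applied to this question (a sketch, assuming query.data() returns one data.frame per call and a config data.frame like the one sketched above): the key is SIMPLIFY = FALSE, which keeps mapply's result as a plain list of data.frames instead of the hard-to-read simplified output.
list.of.dfs <- mapply(query.data,
                      config$site, config$segment, config$id,
                      SIMPLIFY = FALSE)
big.df <- do.call(rbind, list.of.dfs)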

Multiple inputs to a function

I might just be missing a very obvious solution, but here's my question:
I have a function that takes a few inputs. I'm generating these inputs by getting values from a dataframe. What's the cleanest way to input my values into the function?
Let's say I have the function
sampleFunction <- function(input1, input2, input3) {
  return((input1 + input2) - input3)
}
And an input that consists of a few columns of a row of a dataframe
sampleInput <- c(1,2,3)
I'd like to input the three values of my sampleInput into my sampleFunction. Is there a cleaner way than just doing
sampleFunction(sampleInput[1], sampleInput[2], sampleInput[3])
?
I would consider using the package data.table.
It doesn't directly answer the question of how to pass in a vector, but it does address the broader contextual question of applying the function to the rows of a table without much typing. You could do it with a data.frame, but your code would still look similar to what you're trying to avoid.
If you made your data.frame a data.table, your code would look like:
library(data.table)
sampleFunction <- function(input1, input2, input3) {
  return((input1 + input2) - input3)
}
mydt[, sampleFunction(colA, colB, colC)]   # do it for all rows
mydt[1, sampleFunction(colA, colB, colC)]  # do it for just row 1
You could then add that value as a column, return it independently, etc.
Try
do.call(sampleFunction, sampleInput)
P.S. sampleInput must be a list. If it is a vector, use as.list():
do.call(sampleFunction, as.list(sampleInput))
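For the sample input above, this gives:
do.call(sampleFunction, as.list(sampleInput))
[1] 0
since (1 + 2) - 3 is 0.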

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean of a in this one data.frame.
However, I do not know how to calculate over all the data.frames stored in my list, or over specific ones.
I could use a loop, but I was told that apply() and its variants are better.
I tried several solutions I found via search, but somehow it just doesn't work. I assume I need to use unlist().
Could you provide an example of how to calculate e.g. a mean for a data structure like mine: a list of several data.frames, each containing several columns?
Update:
I'm sorry for the confusion. I wanted the grand mean of a specific column across all data.frames.
Thanks to Thomas for providing a working solution for calculating the grand mean of a specific column across all data.frames, and to psychometriko for a useful solution for calculating means over all columns in all data.frames (even when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a = data.frame(foo = rnorm(10), bar = rnorm(10)),
               b = data.frame(foo = rnorm(10), bar = rnorm(10)),
               c = data.frame(foo = rnorm(10), bar = rnorm(10)))
sapply(do.call("rbind", mylist), mean)
       foo        bar 
 0.1163340 -0.1696556 
Note: do.call("rbind", mylist) returns something similar to what you referred to above with the unlist function, and sapply, as noted in Roland's answer, then calls mean on each component (column) of the data.frame that results from the do.call.
Edit: In response to the question of how to deal with non-numeric data.frame components, the solution below admittedly isn't very elegant, and I'm sure better ones exist, but here's the first thing I could think of:
set.seed(42)
mylist <- list(a = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)),
               b = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)),
               c = data.frame(rand = rnorm(10),
                              lets = sample(LETTERS, 10, replace = TRUE)))
sapply(do.call("rbind", mylist), function(x) {
  if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
The whole do.call("rbind", mylist) approach can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
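If you want to check the timing claim on your own machine, a rough sketch (the list size here is arbitrary):
mylist_big <- replicate(1000, data.frame(rand = rnorm(10)), simplify = FALSE)
system.time(sapply(do.call(rbind, mylist_big), mean))
system.time(mean(sapply(mylist_big, function(X) X$rand)))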

Sort a data.frame by multiple columns whose names are contained in a single object?

I want to sort a data.frame by multiple columns, ideally using base R without any external packages (though if necessary, so be it). Having read How to sort a dataframe by column(s)?, I know I can accomplish this with the order() function as long as I either:
Know the explicit names of each of the columns.
Have a separate object representing each individual column by which to sort.
But what if I only have one vector containing multiple column names, of length that's unknown in advance?
Say the vector is called sortnames.
data[order(data[, sortnames]), ] won't work, because order() treats that as a single sorting argument.
data[order(data[, sortnames[1]], data[, sortnames[2]], ...), ] will work if and only if I specify the exact correct number of sortname values, which I won't know in advance.
Things I've looked at but not been totally happy with:
eval(parse(text=paste("data[with(data, order(", paste(sortnames, collapse=","), ")), ]"))). Maybe this is fine, but I've seen plenty of hate for using eval(), so asking for alternatives seemed worthwhile.
I may be able to use the Deducer library to do this with sortData(), but like I said, I'd rather avoid using external packages.
If I'm being too stubborn about not using external packages, let me know. I'll get over it. All ideas appreciated in advance!
You can use do.call:
data <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
sortnames <- c("a", "b")
data[do.call("order", data[sortnames]), ]
This trick is useful whenever you want to pass multiple arguments to a function and those arguments are already collected in a named list.
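The same idea generalizes beyond sorting; do.call() turns any list into an argument list, so for example
do.call(mean, list(1:10, na.rm = TRUE))
is equivalent to mean(1:10, na.rm = TRUE) and returns 5.5.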
