I might just be missing a very obvious solution, but here's my question:
I have a function that takes a few inputs. I'm generating these inputs by getting values from a dataframe. What's the cleanest way to input my values into the function?
Let's say I have the function
sampleFunction<-function(input1, input2, input3){
return((input1+input2)-input3)
}
And an input that consists of a few columns of a row of a dataframe
sampleInput <- c(1,2,3)
I'd like to input the three values of my sampleInput into my sampleFunction. Is there a cleaner way than just doing
sampleFunction(sampleInput[1], sampleInput[2], sampleInput[3])
?
I would consider using the package data.table.
It doesn't directly answer the question of how to pass in a vector, but it does address the broader question of applying the function to the rows of a table with little typing. You could do it with a data.frame, but your code would still look similar to what you're trying to avoid.
If you made your data.frame a data.table then your code would look like:
library(data.table)
sampleFunction<-function(input1, input2, input3){
return((input1+input2)-input3)
}
mydt[,sampleFunction(colA,colB,colC)] # do for all rows
mydt[1,sampleFunction(colA,colB,colC)] # do it for just row 1
You could then add that value as a column, return it independently, etc.
Try:
do.call(sampleFunction, sampleInput)
P.S. sampleInput must be a list. If it is a vector, use as.list:
do.call(sampleFunction, as.list(sampleInput))
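Since the inputs come from a row of a data frame, here is a minimal sketch of the same do.call idea applied to a row directly. The data frame df and its column names colA, colB and colC are made up for illustration. unname() is used because do.call matches named list elements to argument names, and these column names do not match input1, input2, input3.

```r
sampleFunction <- function(input1, input2, input3) {
  (input1 + input2) - input3
}

# Hypothetical data frame standing in for the real one
df <- data.frame(colA = c(1, 4), colB = c(2, 5), colC = c(3, 6))
cols <- c("colA", "colB", "colC")

# One row: a one-row data frame is a list of columns, so as.list() is enough
row1_args <- unname(as.list(df[1, cols]))
res_row1 <- do.call(sampleFunction, row1_args)   # (1 + 2) - 3

# All rows at once: pass whole columns, the arithmetic is vectorized
res_all <- do.call(sampleFunction, unname(as.list(df[cols])))
```

Passing whole columns only works here because sampleFunction uses vectorized arithmetic; a function that expects scalars would need something like mapply instead.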
I am writing this post to ask for some advice for looping code to rename columns by index.
I have a data set that has scale item columns positioned next to each other. Unfortunately, they are oddly named.
I want to re-name each column in this format: SimRac1, SimRac2, SimRac3.... and so on. I know the location of the columns (Columns number 30 to 37). I know these scale items are ordered in such a way that they can be named and numbered in increased order from left to right.
The code I currently have works, but is not efficient. There are other scales, in different locations, that also need to be renamed in a similar fashion. This would result in dozens of code rows.
See below code.
names(Total)[30] <- "SimRac1"
names(Total)[31] <- "SimRac2"
names(Total)[32] <- "SimRac3"
names(Total)[33] <- "SimRac4"
names(Total)[34] <- "SimRac5"
names(Total)[35] <- "SimRac6"
names(Total)[36] <- "SimRac7"
names(Total)[37] <- "SimRac8"
I want to loop this code so that I only have a chunk of code that does the work.
I was thinking perhaps a "for loop" would help.
Hence, the below code
for (i in Total[,30:37]){
names(Total)[i] <- "SimRac(1:8)"
}
This, unfortunately does not work. This chunk of code runs without error, but it doesn't do anything.
Please advise.
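In the OP's code, "SimRac(1:8)" is a constant string, so every assignment would use the same name. In addition, for (i in Total[,30:37]) iterates over the columns themselves, so i holds each column's values rather than the positions 30 to 37. To get dynamic names, use paste0.
We do not need a loop here. We can use a vectorized function to create the names, then assign them to a subset of names(Total):
names(Total)[30:37] <- paste0('SimRac', 1:8)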
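Because the question mentions other scales in other positions, the one-liner can be wrapped in a small helper. This is only a sketch: rename_scale, the stand-in Total with 40 columns, and the second scale "OtherScale" at columns 10 to 12 are all assumptions for illustration. A corrected loop is shown as well, for comparison with the attempt in the question.

```r
# Hypothetical stand-in for Total: 40 columns with default names V1..V40
Total <- as.data.frame(matrix(0, nrow = 2, ncol = 40))

# Hypothetical helper: rename the columns at positions `cols`
# to prefix1, prefix2, ... from left to right
rename_scale <- function(df, cols, prefix) {
  names(df)[cols] <- paste0(prefix, seq_along(cols))
  df
}

Total <- rename_scale(Total, 30:37, "SimRac")
Total <- rename_scale(Total, 10:12, "OtherScale")  # made-up second scale

# Loop version: iterate over the column positions, not the columns
Total2 <- as.data.frame(matrix(0, nrow = 2, ncol = 40))
for (i in 30:37) {
  names(Total2)[i] <- paste0("SimRac", i - 29)
}
```

Using seq_along(cols) rather than a hard-coded 1:8 keeps the numbering in step with however many columns the scale has.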
Similar to this question but for R. I want to get a summary count of every variable in each column of a data frame.
Currently, doing something like plyr::count(df[,1:10]) checks how many times every variable in a row matches. Instead, I just want a quick way of printing out what all my variables even are. I know this can be done with C-style recursion, but I'm hoping for a more elegant/simpler solution.
You can use lapply:
lapply(df, plyr::count)
Alternatively, keeping everything in base R, you can use table with stack to get similar output:
lapply(df, function(x) stack(table(x)))
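As a small illustration of the per-column idea, here is a sketch on a made-up data frame (the columns colour and size are assumptions). It uses table on each column, and as.data.frame as another base R way to turn each table into a value/count data frame.

```r
# Hypothetical data frame for illustration
df <- data.frame(colour = c("red", "blue", "red"),
                 size   = c("S", "S", "M"))

# One frequency table per column
counts <- lapply(df, table)

# Same counts as data frames with a value column and a Freq column
count_frames <- lapply(df, function(x) as.data.frame(table(x)))
```

counts$colour then holds the counts for each distinct value in colour, and likewise for every other column.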
I want to extract the first non-missing variable in my data.table for each row.
function_non_missing <- function(x){
  x <- x[!is.na(x)]
  # Then apply some other transformations such as
  # x <- x[x != ""]
  # x <- x[x != "some random thing"]
  if (length(x) > 0){
    x[1]
  } else {
    NA
  }
}
Now I just want to apply this function row by row. I searched for previous answers and then tried things like:
data<-data[,non_missing_var:=function_non_missing(.SD),by=1:nrow(data)]
I also tried other permutations of the same idea, but nothing seems to work. More generally, can somebody point me towards a tutorial on the most efficient ways to apply data.table ideas (in particular how to use Map and Reduce) row by row, using as arguments the columns specified in .SDcols? In practice, what I often want to do is something like:
data<-data[,my_new_var:=random_function(.SD),.SDcols=c("var_1","var_2","var_3"),by=1:nrow(data)]
and random_function is operating on a vector.
Apparently this will work:
data<-data[,non_missing_var:=function_non_missing(unlist(.SD)),by=1:nrow(data)]
Could somebody more familiar with data.table comment on why this works and why I need unlist?
I suggest using the apply function instead. Try
apply(data, 1, function_non_missing)
The 1 refers to applying the function row-wise.
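On the unlist question: inside a data.table call with by, .SD is itself a data.table, that is, a list of columns, not an atomic vector, so a function written for vectors needs the row flattened first. The sketch below shows the same distinction in base R with a one-row data frame; the data and the column names var_1 to var_3 are made up for illustration.

```r
function_non_missing <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# Hypothetical data for illustration
df <- data.frame(var_1 = c(NA, 2, NA),
                 var_2 = c(5, NA, NA),
                 var_3 = c(7, 8, NA))

one_row <- df[1, ]
row_is_list <- is.list(one_row)   # a one-row data frame is still a list of columns

row_vec <- unlist(one_row)        # flattened to a named atomic vector
first_row1 <- function_non_missing(row_vec)

# apply() hands each row to the function as a vector, so no unlist is needed
first_all <- apply(df, 1, function_non_missing)
```

One caveat with apply: it converts the data to a matrix first, so if the columns have mixed types every value is coerced to a common type (typically character).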
Ok, I'm stuck in a dumbness loop. I've read thru the helpful ideas at How to sort a dataframe by column(s)?, but need one more hint. I'd like a function that takes a matrix with an arbitrary number of columns, and sorts by all columns in sequence. E.g., for a matrix foo with N columns, it does the equivalent of foo[order(foo[,1],foo[,2],...foo[,N]),]. I am happy to use a with or by construction, and if necessary define the colnames of my matrix, but I can't figure out how to automate the collection of arguments to order (or to with).
Or, I should say, I could build the entire bloody string with paste and then call it, but I'm sure there's a more straightforward way.
The most elegant (for certain values of "elegant") way would be to turn it into a data frame, and use do.call:
foo[do.call(order, as.data.frame(foo)), ]
This works because a data frame is just a list of variables with some associated attributes, and can be passed to functions expecting a list.
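A minimal sketch of this as a reusable function follows. The name sort_by_all_columns and the example matrix are assumptions for illustration. Instead of as.data.frame, it builds an unnamed list of columns with lapply, which avoids any clash between column names and order's own arguments such as decreasing.

```r
# Hypothetical helper: sort a matrix by column 1, then column 2, and so on
sort_by_all_columns <- function(m) {
  keys <- lapply(seq_len(ncol(m)), function(j) m[, j])  # unnamed list of columns
  m[do.call(order, keys), , drop = FALSE]
}

# Rows are (2,1), (1,2), (2,0), (1,1)
foo <- matrix(c(2, 1, 2, 1,
                1, 2, 0, 1), ncol = 2)

sorted <- sort_by_all_columns(foo)
# Rows are now (1,1), (1,2), (2,0), (2,1)
```

drop = FALSE keeps the result a matrix even when foo has a single row.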
I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate it over all the data.frames stored in my list, or how to do so for specific data.frames.
I could use a loop, but I was told that apply() and its variations are better.
I tried using several solutions I found via search, but somehow it just doesn't work.
I assume I need to use unlist().
Could you provide an example of how to calculate e.g. a mean for a data structure like mine: a list with several data.frames, each containing several columns?
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (& even for the case when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a=data.frame(foo=rnorm(10), bar=rnorm(10)),
               b=data.frame(foo=rnorm(10), bar=rnorm(10)),
               c=data.frame(foo=rnorm(10), bar=rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) stacks the data.frames into a single data.frame, which is similar to what you had in mind with the unlist function. sapply, as referred to by Roland in his answer, then just calls the function mean on each component (column) of that combined data.frame.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a=data.frame(rand=rnorm(10), lets=sample(LETTERS,10,replace=TRUE)),
               b=data.frame(rand=rnorm(10), lets=sample(LETTERS,10,replace=TRUE)),
               c=data.frame(rand=rnorm(10), lets=sample(LETTERS,10,replace=TRUE)))
sapply(do.call("rbind",mylist),function(x) {
if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
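This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, the if has no else branch, so the function returns NULL for that component, which is why lets shows NULL in the output.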
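One possible alternative, sketched here under made-up data, is to drop the non-numeric columns before averaging, so the result is a plain named numeric vector with no NULL entries. Filter(is.numeric, ...) is a base R idiom that keeps only the columns for which is.numeric is TRUE; the small deterministic data frames below replace the random ones purely so the result is easy to check by hand.

```r
# Hypothetical, deterministic data for illustration
mylist <- list(a = data.frame(rand = c(1, 2), lets = c("A", "B")),
               b = data.frame(rand = c(3, 4), lets = c("C", "D")))

combined <- do.call("rbind", mylist)        # stack all data frames
numeric_cols <- Filter(is.numeric, combined) # keep numeric columns only
means <- sapply(numeric_cols, mean)          # named vector, here rand = 2.5
```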
The whole do.call('rbind', List) thing can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
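Note that mean(sapply(...)) relies on sapply simplifying its result to a matrix, which happens when every data.frame has the same number of rows. With differing row counts sapply returns a list, which mean cannot average. A minimal sketch of a variant that handles that case, using made-up data, is to flatten with unlist first. It also shows that the grand mean is not the same as the mean of the per-data.frame means when sizes differ.

```r
# Hypothetical data with unequal row counts
mylist <- list(a = data.frame(rand = c(1, 2, 3)),
               b = data.frame(rand = c(4, 5)))

# Grand mean over all values of rand: (1 + 2 + 3 + 4 + 5) / 5
grand_mean <- mean(unlist(lapply(mylist, function(X) X$rand)))

# Mean of the per-data.frame means: (2 + 4.5) / 2, a different quantity
mean_of_means <- mean(sapply(mylist, function(X) mean(X$rand)))
```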