Picking out first non missing variable by row in data.table R

I want to extract the first non-missing value in each row of my data.table.
function_non_missing<-function(x){
x<-x[!is.na(x)]
#Then apply some other transformations such as
#x<-x[x!=""]
#x<-x[x!="some random thing"]
if (length(x)>0){
x[1]
} else{
NA
}
}
Now I just want to apply this function row by row. I searched for previous answers and then tried things like:
data<-data[,non_missing_var:=function_non_missing(.SD),by=1:nrow(data)]
I also tried other permutations of the same idea, but nothing seems to work. More generally, can somebody point me to a tutorial on the most efficient ways to apply functions row by row in data.table (in particular, how to use Map and Reduce), using the columns specified in .SDcols as arguments? In practice, what I often want to do is something like:
data<-data[,my_new_var:=random_function(.SD),.SDcols=c("var_1","var_2","var_3"),by=1:nrow(data)]
and random_function is operating on a vector.

Apparently this will work:
data<-data[,non_missing_var:=function_non_missing(unlist(.SD)),by=1:nrow(data)]
Could somebody more familiar with data.table comment on why this works and why I need to use unlist?
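A possible explanation, sketched under assumptions: with by=1:nrow(data), .SD is a one-row data.table, i.e. a list of columns rather than a plain vector, so vector-style code such as x[!is.na(x)] does not see the row as a vector. unlist(.SD) flattens that row into an atomic vector (coercing mixed column types to a common type). The toy table dt, its columns, and the helper first_non_missing below are made up for illustration; the helper is a simplified stand-in for function_non_missing that returns an NA of the same type as the input when nothing is left.

```r
library(data.table)

# Simplified stand-in for function_non_missing (hypothetical name).
# Indexing an empty vector with [1] yields NA of the same type,
# so every group returns a value of a consistent type.
first_non_missing <- function(x) {
  x <- x[!is.na(x)]
  unname(x)[1]
}

# Toy data, invented for this example
dt <- data.table(a = c(NA, 2, NA), b = c(1, NA, NA), c = c(5, 6, NA))

# A single row of a data.table is still a list of columns ...
is.list(dt[1])
# ... while unlist() turns it into an atomic vector
is.atomic(unlist(dt[1]))

# Row-wise application: one group per row, .SD flattened with unlist()
dt[, non_missing_var := first_non_missing(unlist(.SD)), by = seq_len(nrow(dt))]
print(dt)
```

If only some columns should be considered, .SDcols can be added to the same call.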

I suggest using the apply function instead. Try
apply(data, 1, function_non_missing)
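Here 1 means the function is applied row-wise. Note that apply first converts data to a matrix, so columns of mixed types are coerced to a common type (typically character).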

Related

Assign a Value based on the numbers in a separate columns in R

I kind of already know the possible solution, but I don't know exactly how to go about it, so please give me a bit of grace here.
I have a dataset of YouTube trends. I want to read the values from two columns (likes and dislikes) and, based on their contents, make an entry in a new column. If the likes are higher than the dislikes, the video should be labelled 'positive'; if it has more dislikes, it should be 'negative'.
I'm mainly unsure how to go about this because most of the previous questions are based on one column rather than two. I know some mentioned using cut, but would it still work the same?
All help is appreciated, thanks.
You can use a simple ifelse :
df$new_col <- ifelse(df$likes > df$dislikes, 'positive', 'negative')
This can also be written without ifelse as :
df$new_col <- c('negative', 'positive')[as.integer(df$likes > df$dislikes) + 1]
You can use Vectorize to create a vectorized version of a function. vfunc <- Vectorize(func) will allow you to call df$newcol <- vfunc(df$likes, df$dislikes) if your function takes two arguments and then return the result for each row in a vector that's assigned to a new column.
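As a rough illustration of the Vectorize approach, here is a minimal sketch. The toy data frame and the names label_video and vlabel are assumptions made for the example; only Vectorize, ifelse, and the likes/dislikes columns come from the answers above.

```r
# Toy data, invented for this example
df <- data.frame(likes = c(10, 3, 7), dislikes = c(2, 8, 7))

# A scalar function: if() only handles one pair of values at a time
label_video <- function(likes, dislikes) {
  if (likes > dislikes) "positive" else "negative"
}

# Vectorize() wraps it so it can be called on whole columns
vlabel <- Vectorize(label_video)
df$new_col <- vlabel(df$likes, df$dislikes)

# Same labels as the ifelse() version; ties count as 'negative' in both
df$check <- ifelse(df$likes > df$dislikes, "positive", "negative")
print(df)
```

For a simple comparison like this one, ifelse is usually the more direct choice; Vectorize is mainly useful when the per-row logic is too involved to express with vectorized operators.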

Is there a way to apply plyr's count() function to every column individually?

Similar to this question but for R. I want to get a summary count of every variable in each column of a data frame.
Currently, doing something like plyr::count(df[,1:10]) counts how many times each combination of values across those columns occurs. Instead, I just want a quick way of printing out which values each column contains and how often each occurs. I know this can be done with C-style recursion, but I'm hoping for a more elegant/simpler solution.
You can use lapply:
lapply(df, plyr::count)
Alternatively, keeping everything in base R you can use table with stack to get similar output
lapply(df, function(x) stack(table(x)))
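To make the base R variant concrete, here is a small sketch on made-up data (the data frame and its columns colour and size are assumptions for illustration). Each list element is a data frame with one row per distinct value in that column.

```r
# Toy data, invented for this example
df <- data.frame(colour = c("red", "blue", "red"),
                 size   = c("S", "S", "M"))

# One frequency table per column; stack() turns each table into a
# two-column data frame: 'values' (the count) and 'ind' (the value counted)
counts <- lapply(df, function(x) stack(table(x)))

print(counts$colour)
print(counts$size)
```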

Selecting rows of a dataframe fulfilling a specific condition in R

First of all, I have to say that this is my first post. Although I looked for the answer using the search tool, I may have passed over the right topic without realizing it, so sorry in advance if that is the case.
Having said that, my problem is the following:
I have a data table composed of several columns.
I have to select the rows that fulfil one specific condition, e.g. which(DT_$var>value, arr.ind = T)) or which(DT_$var>value && DT_$var2>value2, arr.ind = T))
I have to keep these rows in a new data frame.
My approach was the following one but it is not working, probably because I did not understand the loops correctly:
while (i in nrow(DT)) {
if(DT$var[i]>value){
DT_aux[i]=DT[i]
i<-i+1
}
}
Error in if (DT$value[i] > 45) { : argument is of length zero
I hope that you can help me
There is a very good chance that you want to use dplyr and its filter function. It would work like this:
library(dplyr)
DT %>% filter(var > value & var2 > value2)
You don't need to use DT$var and DT$var2 here; dplyr knows what you mean when you refer to variables. Note the element-wise & rather than &&: && expects single logical values, not whole columns.
You can, of course, do the same with base R, but this kind of work is exactly what dplyr was made for, so sticking with base R, in this case, is just masochism.
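For reference, a minimal base R sketch of the same row selection, with no loop. The toy data and the thresholds value and value2 are assumptions for illustration; the column names follow the question.

```r
# Toy data and thresholds, invented for this example
DT <- data.frame(var = c(10, 50, 60, 70), var2 = c(1, 5, 2, 9))
value  <- 45
value2 <- 3

# Logical indexing: the comparisons are evaluated for every row at once,
# and & combines them element-wise
DT_aux <- DT[DT$var > value & DT$var2 > value2, ]

# which() gives the matching row numbers if those are needed instead
keep <- which(DT$var > value & DT$var2 > value2)

print(DT_aux)
```

The original loop fails partly because while (i in nrow(DT)) is not valid loop syntax and i is never initialised; with vectorized comparisons no explicit index is needed.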

Multiple inputs to a function

I might just be missing a very obvious solution, but here's my question:
I have a function that takes a few inputs. I'm generating these inputs by getting values from a dataframe. What's the cleanest way to input my values into the function?
Let's say I have the function
sampleFunction<-function(input1, input2, input3){
return((input1+input2)-input3)
}
And an input that consists of a few columns of a row of a dataframe
sampleInput <- c(1,2,3)
I'd like to input the three values of my sampleInput into my sampleFunction. Is there a cleaner way than just doing
sampleFunction(sampleInput[1], sampleInput[2], sampleInput[3])
?
I would consider using the package data.table.
It doesn't directly answer the question of how to pass in a vector, however it does help address the greater contextual question of using the function for the rows in a table in a typing-efficient manner. You could do it in data.frame but your code would still look similar to what you're trying to avoid.
If you made your data.frame a data.table then your code would look like:
library(data.table)
sampleFunction<-function(input1, input2, input3){
return((input1+input2)-input3)
}
mydt[,sampleFunction(colA,colB,colC)] # do for all rows
mydt[1,sampleFunction(colA,colB,colC)] # do it for just row 1
You could then add that value as a column, return it independently etc
Try
do.call(sampleFunction, sampleInput)
P.S. sampleInput must be a list. If it is a vector, use as.list:
do.call(sampleFunction, as.list(sampleInput))
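A short sketch of how do.call could be used here, including the case where the inputs come from a data frame row. The data frame df and its column names are assumptions for illustration; sampleFunction and sampleInput are taken from the question.

```r
sampleFunction <- function(input1, input2, input3) {
  return((input1 + input2) - input3)
}

# A plain vector has to be converted to a list first
sampleInput <- c(1, 2, 3)
from_vector <- do.call(sampleFunction, as.list(sampleInput))

# Toy data frame, invented for this example
df <- data.frame(colA = c(1, 4), colB = c(2, 5), colC = c(3, 6))

# One row: drop the column names so the values are matched by position
# (otherwise do.call would try to match colA, colB, colC as argument names)
from_row <- do.call(sampleFunction, unname(as.list(df[1, ])))

# All rows at once: each argument receives a whole column
from_all_rows <- do.call(sampleFunction, unname(as.list(df)))

print(from_vector)
print(from_row)
print(from_all_rows)
```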

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate it over all the data.frames stored in my list, or for specific data.frames.
I could use a loop, but I was told that apply() and its variations are better.
I tried several solutions I found via search, but somehow it just doesn't work. I assume I need to use unlist().
Could you provide an example of how to calculate e.g. a mean for a data structure like mine: a list with several data.frames containing several columns?
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes, and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (and even for the case when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a=data.frame(foo=rnorm(10),
bar=rnorm(10)),
b=data.frame(foo=rnorm(10),
bar=rnorm(10)),
c=data.frame(foo=rnorm(10),
bar=rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) returns something similar to what you referred to above with the unlist function, and then sapply, as referred to by Roland in his answer, just calls the function mean on each component (column) of the data.frame that results from the above do.call function.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
b=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
c=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)))
sapply(do.call("rbind",mylist),function(x) {
if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
The whole do.call('rbind', List) approach can be quite slow and prone to mishaps. If there is only one column you need the mean for, a faster way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
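One caveat worth sketching: mean(sapply(...)) relies on sapply simplifying its result to a matrix, which only happens when every data.frame has the same number of rows. A variant with lapply and unlist, shown below on made-up data with unequal row counts, avoids that assumption; the names grand_mean and per_frame_means are hypothetical.

```r
set.seed(42)

# Toy list with data frames of different lengths, invented for this example
mylist <- list(a = data.frame(rand = rnorm(10)),
               b = data.frame(rand = rnorm(5)),
               c = data.frame(rand = rnorm(8)))

# Grand mean of one column across all data frames:
# pull the column out of each data frame, flatten, then average
grand_mean <- mean(unlist(lapply(mylist, function(d) d$rand)))

# Mean of that column within each data frame separately
per_frame_means <- sapply(mylist, function(d) mean(d$rand))

print(grand_mean)
print(per_frame_means)
```

Note that the grand mean weights every observation equally, so it generally differs from the mean of the per-data-frame means when the row counts differ.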
