Selecting rows of a data frame fulfilling a specific condition in R - r

First of all, this is my first post. I searched for an answer before asking, but I may have overlooked the right topic without realizing it, so apologies in advance if that is the case.
My problem is the following:
I have a data table composed of several columns.
I have to select the rows that fulfill a specific condition, e.g. which(DT_$var>value, arr.ind = T) or which(DT_$var>value & DT_$var2>value2, arr.ind = T).
I have to keep these rows in a new data frame.
My approach was the following, but it is not working, probably because I did not understand loops correctly:
while (i in nrow(DT)) {
if(DT$var[i]>value){
DT_aux[i]=DT[i]
i<-i+1
}
}
Error in if (DT$value[i] > 45) { : argument is of length zero
I hope that you can help me

There is a very good chance that you want to use dplyr and its filter function. It would work like this:
library(dplyr)
DT %>% filter(var > value & var2 > value2)
You don't need to use DT$var and DT$var2 here; dplyr knows what you mean when you refer to variables.
You can, of course, do the same with base R, but this kind of work is exactly what dplyr was made for, so sticking with base R, in this case, is just masochism.
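For comparison, here is a minimal base R sketch of the same row selection. The data, column names and thresholds are made up for illustration. Note the single &, which compares element by element; && is meant for single TRUE/FALSE values.

```r
# Hypothetical example data and thresholds (not from the question)
DT <- data.frame(var = c(10, 50, 60, 30), var2 = c(1, 5, 2, 9))
value <- 45
value2 <- 3

# Build a logical vector with one TRUE/FALSE per row
keep <- DT$var > value & DT$var2 > value2

# Use it to index the rows; the result is a new data frame
DT_aux <- DT[keep, ]

# subset() expresses the same selection without repeating DT$
DT_aux2 <- subset(DT, var > value & var2 > value2)
```

No loop is needed in either form, because the comparison is evaluated for all rows at once.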

Related

Assign a value based on the numbers in separate columns in R

I kind of already know the possible solution, but I don't know exactly how to go about it, so please bear with me.
I have a dataset of YouTube trends. I want to read the values from two columns (likes and dislikes) and, based on their contents, make an entry in a new column. If the likes are higher than the dislikes, I want the video to be labeled 'positive', and if it has more dislikes it should be 'negative'.
I'm mainly unsure how to go about this since most of the previous questions are based on one column rather than two. I know some mentioned using cut, but would it still work the same?
All help is appreciated, thanks.
You can use a simple ifelse :
df$new_col <- ifelse(df$likes > df$dislikes, 'positive', 'negative')
This can also be written without ifelse as :
df$new_col <- c('negative', 'positive')[as.integer(df$likes > df$dislikes) + 1]
You can use Vectorize to create a vectorized version of a function. If your function takes two arguments, vfunc <- Vectorize(func) lets you call df$newcol <- vfunc(df$likes, df$dislikes); it returns the result for each row as a vector, which is then assigned to the new column.
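A small sketch of the Vectorize approach, assuming a hypothetical scalar function label_video and made-up data. Ties fall into 'negative' here, exactly as they do with the ifelse version above.

```r
# Hypothetical example data
df <- data.frame(likes = c(100, 5, 20), dislikes = c(10, 50, 20))

# A scalar function: works on one pair of numbers at a time
label_video <- function(likes, dislikes) {
  if (likes > dislikes) "positive" else "negative"
}

# Vectorize wraps it so it can be called on whole columns
vlabel <- Vectorize(label_video)
df$new_col <- vlabel(df$likes, df$dislikes)

# The ifelse version gives the same labels
df$new_col2 <- ifelse(df$likes > df$dislikes, "positive", "negative")
```

For a comparison this simple, ifelse is usually the shorter option; Vectorize is more useful when the per-row logic already exists as a scalar function.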

Addressing columns based on only parts of the name in order to simplify lines

This is my first question here and I am not very experienced, but I hope it is easy enough to answer, since I only want to know whether what I describe in the title is possible.
I have multiple data frames taken from online capacity tests that participants did.
For all items I have response, score, and duration variables, among others.
Now I want to delete the rows where all response variables are NA. I can't just use a command that deletes rows where everything is NA, but there are also too many columns to do it by hand. I also want to keep the data frame together while doing it, so that the complete rows are really dropped; just extracting all response variables doesn't sound like a good option.
However, apart from a 3-digit number based on the specific item, the names of the response variables are basically the same.
So instead of writing a very long, impractical line that mentions all response variables and drops the row if they all contain NA, is there a way to use only part of a variable name, for example the end of the name, so that R checks the condition for all variables ending that way?
simplified e.g: instead of
newdf <- olddf[!(olddf$item123response != NA & olddf$item131response != NA & etc),]
Can I just do something like newdf <- olddf[!(olddf$xxxresponse != NA),] ?
I tried to google an answer, but I didn't know how to phrase my question effectively.
Thanks in advance!
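Try this. It selects the columns whose names contain 'response' and keeps the rows that have no NA in any of them: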
newdf <- olddf[complete.cases(olddf[, grep('response', names(olddf))]), ]
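Note that complete.cases drops a row as soon as any of the selected columns is NA. If the goal is to drop only the rows where all response columns are NA, as the question describes, a sketch along these lines may be closer. The data and column names are invented for illustration.

```r
# Hypothetical example data with two response columns and one score column
olddf <- data.frame(id = 1:4,
                    item123response = c(1, NA, NA, 2),
                    item131response = c(NA, NA, 3, 4),
                    item123score = c(0, 1, 1, 0))

# Positions of all columns whose names end in "response"
resp_cols <- grep("response$", names(olddf))

# Count the NAs per row within those columns only
na_per_row <- rowSums(is.na(olddf[, resp_cols, drop = FALSE]))

# A row is dropped only if every response column is NA
all_na <- na_per_row == length(resp_cols)
newdf <- olddf[!all_na, ]
```

With this data only the second row is removed; the complete.cases version would keep just the fourth row.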

Is there a way to apply plyr's count() function to every column individually?

Similar to this question but for R. I want to get a summary count of every variable in each column of a data frame.
Currently, doing something like plyr::count(df[,1:10]) counts how many times each combination of values across a row occurs. Instead, I just want a quick way of printing out which values each of my columns contains. I know this can be done with C-style recursion, but I'm hoping for a more elegant/simpler solution.
You can use lapply:
lapply(df, plyr::count)
Alternatively, keeping everything in base R you can use table with stack to get similar output
lapply(df, function(x) stack(table(x)))
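As a rough illustration of the per-column idea in base R only, with made-up data: lapply calls the function once per column, so each column gets its own count table.

```r
# Hypothetical example data
df <- data.frame(colour = c("red", "blue", "red"),
                 size = c("S", "S", "M"),
                 stringsAsFactors = FALSE)

# One frequency table per column
counts <- lapply(df, table)

# The same counts as small data frames (columns "x" and "Freq")
counts_df <- lapply(df, function(x) as.data.frame(table(x)))

counts$colour
```

The result is a named list, so counts$size or counts_df$size gives the summary for a single column.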

ifelse with no else

Basically in SAS I could just do an if statement without an else. For example:
if species='setosa' then species='regular';
there is no need for else.
How to do it in R? This is my script below which does not work:
attach(iris)
iris2 <- iris
iris2$Species <- ifelse(iris2$Species=='setosa',iris2$Species <- 'regular',iris2$Species <- iris2$Species)
table(iris2$Species)
A couple options. The best is to just do the replacement, this is nice and clean:
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
ifelse returns a vector, so the way to use it in cases like this is to replace the column with a new one created by ifelse. Don't do assignment inside ifelse!
iris2$Species <- ifelse(iris2$Species=='setosa', 'regular', iris2$Species)
But there's rarely need to use ifelse if the else is "stay the same" - the direct replacement of the subset (the first line of code in this answer) is better.
New factor levels
Okay, so the code posted above doesn't actually work - this is because iris$Species is a factor (categorical) variable, and 'regular' isn't one of the categories. The easiest way to deal with this is to coerce the variable to character before editing:
iris2$Species <- as.character(iris2$Species)
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
Other methods work as well, (editing the factor levels directly or re-factoring and specifying new labels), but that's not the focus of your question so I'll consider it out of scope for the answer.
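For completeness, a brief sketch of what goes wrong with the factor and of the "edit the levels directly" route mentioned above. It uses the built-in iris data; the variable name bad is just for illustration.

```r
# Assigning a value that is not an existing level produces NA (with a warning)
bad <- iris$Species
suppressWarnings(bad[bad == "setosa"] <- "regular")
sum(is.na(bad))  # the 50 setosa rows are now NA

# Renaming the level itself keeps the column a factor
iris2 <- iris
levels(iris2$Species)[levels(iris2$Species) == "setosa"] <- "regular"
table(iris2$Species)
```

Renaming the level changes every row that had that level at once, so no row-wise test is needed.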
Also, as I said in the comments, don't use attach. If you're not careful with it you can end up with your columns out of sync creating annoying bugs. (In the code you post, you're not using it anyway - the rest runs just as well if you delete the attach line.)
I would recommend looking at the base R documentation for help with this. You can find the documentation of if, else, and ifelse here. For use of if and else, refer to ?Control.
Regular control flow in code is done with the basic if and else statements, as in most languages. ifelse() is used for vectorized operations--it will return the same shape as your vector based on the test. Regular if and else expressions do not necessarily have those properties.
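A tiny sketch of the difference, with made-up values: if makes one decision from a single TRUE/FALSE, while ifelse makes one decision per element.

```r
x <- c(1, 3, 5)

# if/else: the condition is a single logical value, the result is a single value
size_label <- if (length(x) > 2) "long" else "short"

# ifelse: the test is a vector, the result has one entry per element of x
elem_label <- ifelse(x > 2, "big", "small")
```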

Picking out first non missing variable by row in data.table R

I want to extract the first non-missing value in my data.table for each row.
function_non_missing <- function(x) {
  x <- x[!is.na(x)]
  # Then apply some other transformations such as
  # x <- x[x != ""]
  # x <- x[x != "some random thing"]
  if (length(x) > 0) {
    x[1]
  } else {
    NA
  }
}
Now I just want to apply this function row by row. I searched for previous answers and then tried things like:
data<-data[,non_missing_var:=function_non_missing(.SD),by=1:nrow(data)]
I also tried other permutations of the same idea, but nothing seems to work. More generally, can somebody point me towards a tutorial on the most efficient ways to apply data.table ideas (in particular, how to use Map and Reduce) row by row, using the columns specified in .SDcols as arguments? In practice, what I often want to do is something like:
data<-data[,my_new_var:=random_function(.SD),.SDcols=c("var_1","var_2","var_3"),by=1:nrow(data)]
and random_function is operating on a vector.
Apparently this will work:
data<-data[,non_missing_var:=function_non_missing(unlist(.SD)),by=1:nrow(data)]
Could somebody more familiar with data.table comment on why this works and why I need unlist?
I suggest using the apply function instead. Try
apply(data, 1, function_non_missing)
The 1 refers to applying the function row-wise.
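A sketch of both points, using a plain data frame with invented columns instead of a data.table so that it runs in base R. The assumption here is that a single row behaves like .SD does per group: it is still a list of columns, not a vector, which is presumably why unlist is needed before a vector function can use it.

```r
# Hypothetical example data
df <- data.frame(var_1 = c(NA, "a", NA),
                 var_2 = c("b", NA, NA),
                 var_3 = c("c", "d", NA),
                 stringsAsFactors = FALSE)

first_non_missing <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# A single row is still a list of columns
one_row <- df[1, ]
is.list(one_row)

# unlist flattens it into an ordinary vector the function can work on
row_vec <- unlist(one_row)
first_row_value <- first_non_missing(row_vec)

# apply converts the data to a matrix and passes each row as a vector
res <- apply(df, 1, first_non_missing)
```

Keep in mind that both unlist and apply coerce mixed column types to a common type (often character), so this is safest when the columns share a type.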
