Using compound inequalities within ifelse() in R - r

I am trying to create a new factor variable that is conditional based on a numeric variable within my dataframe. my ifelse argument works perfectly when I supply it a single inequality, but for the two categories that require a compound inequality, it doesn't work. I'm familiar with logic and conditions in other languages, and I'm thinking just my syntax is off?
Also, there seems to be times when you can use '&' and times to use '&&'. I'm used to always using '&&', so what is the difference here?
data$bin<-as.factor(ifelse(data$internet<=2.3225,"one",
ifelse(data$internet>2.3225 && data$internet<=4.425,"two",
ifelse(data$internet>4.425 && data$internet<=6.5275,"three",
ifelse(data$internet>6.5275,"four",NA)))))

&& doesn't vectorize. Unless you want to compare a single element to a single element, you should use &.

Related

Can we use apply function along with some user defined function or if/while loops in R to conditionally work it on selective rows?

I know that while and if functions in R are not vectorised. while and if functions help us selectively work on some rows based on some condition. I also know that the apply function in R is used to apply over the columns and hence it operates on all rows of columns that we wish to put apply on. Can I use apply() along with user defined functions and/or with while/if loop to conditionally use it over some rows rather than all rows as apply function usually does.
Note :- This core issue here is to bypass the drawback on non-vectorization of while/if loops in R.
You can supply user defined functions to apply using an argument FUN = function(x) user_defined_function(x) {}. And apply is "vectorized" in sense that as argument it accept vectors, not scalars (but its implementation is heavily using for and if loops, type apply without arguments in your console). So for and apply are of the same perfomance.
However you can break the execution of user defined function throwing exception with stop and wrapping in tryCatch it is a non-recommended technique (it influences environements, call stacks, scopes etc., make debugging difficult and lead to errors which are difficult to identify).
Better to use for and if and very often it is the most easiest and effective way (to write a recursive function, taking in consideration that (tail) recursion is not really optimized for R, or fully refactor your algorithm quite difficult and time consuming).

Cleaning data which to use (Grep) or (str_extract_all)

I need to extract from the dataset all the elements that mention "mean" and "std" which is standard deviation.
example of how it is written in feat, the column 2, the variables.
Goal> I am trying to extract only the elements that have this written.
"tBodyAcc-mean()-Z"
"tBodyAcc-std()-X"
feat<-read.table("features.txt")
I assumed that using
grep("mean"&"std",feat[,2])
would work
But does not work, I have this error:
"operations are possible only for numeric, logical or complex types"
I found someone who has used this:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
It worked fine but I do not understand the meaning of the backlashes.
I don't understand what it exactly means and I don't want to use it.
What you need is an alternation operator | in a regex pattern. grep allows using literal values (when fixed=TRUE is used) or a regular expression (by default).
Now, you found:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
The -(mean|std)\(\) regex matches a -, then either mean or std (since (...) is a grouping construct that allows enumerating alternatives inside a bigger expression), then ( and then ) (these must be escaped with a \ literal symbol - that is why it is doubled in the R code).
If you think the expression is an overkill, and you only want to find entries with either std or mean as substrings, you can use a simpler
meansd<-grep("mean|std",feat[,2])
Here, no grouping construct is necessary since you only have two alternatives in the expression.

How to grep two words in string data?

So I have a data frame where the one of the columns is of type character, consisting of strings. I want to find those rows where "foo" and "bar" both occur but bar can also occur before foo. Basically like an AND operator for regular expressions. How shall I do that?
You may try
rowIndx <- grepl('foo', df$yourcol) & grepl('bar', df$yourcol)
rowIndx returns a logical TRUE/FALSE which can be used for subsetting the col. (comments from #Konrad Rudolph). If you need the numeric index, just wrap it with which i.e. which(rowIndx)
Regular expressions are bad at logical operations. Your particular case, however, can be trivially implemented by the following expression:
(foo.*bar)|(bar.*foo)
However, this is a very inefficient regex and I strongly advise against using it. In practice, you’d use akrun’s solution from the comment: grep for them individually and intersect the result (or do a logical grepl and & the results, which is semantically exchangeable).

find indexes in R by not using `which`

Is there a faster way to search for indices rather than which %in% R.
I am having a statement which I need to execute but its taking a lot of time.
statement:
total_authors<-paper_author$author_id[which(paper_author$paper_id%in%paper_author$paper_id[which(paper_author$author_id%in%data_authors[i])])]
How can this be done in a faster manner?
Don't call which. R accepts logical vectors as indices, so the call is superfluous.
In light of sgibb's comment, you can keep which if you are sure that you will also get at least one match. (If there are no matches, then which returns an empty vector and you get everything instead of nothing. See Unexpected behavior using -which() in R when the search term is not found.)
Secondly, the code looks a little cleaner if you use with.
Thirdly, I think you want a single index with & rather than a double index.
total_authors <- with(
paper_author,
author_id[paper_id %in% paper_id & author_id %in% data_authors[i]
)

Remove values from a dataset based on a vector of those values

I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you have use subset in this way in your code - it will wonder if grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possiblity your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.

Resources