An elegant way to count number of negative elements in a vector? - r

I have a data vector with 1024 values and need to count the number of negative entries. Is there an elegant way to do this without looping and checking if an element is <0 and incrementing a counter?

You want to read 'An Introduction to R'. Your answer here is simply
sum( x < 0 )
which works thanks to vectorisation. The x < 0 expression returns a vector of booleans over which sum() can operate (by converting the booleans to standard 0/1 values).
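As a minimal sketch of the steps involved, here is the same idea on a small made-up vector (the values are illustrative and not from the question). It also shows that a missing value makes the plain sum NA unless na.rm = TRUE is passed.

```r
# Illustrative vector; the values are made up for this sketch.
x <- c(3, -1, 0, -7, NA, 2)

neg <- x < 0                      # logical vector: FALSE TRUE FALSE TRUE NA FALSE
n_neg <- sum(neg, na.rm = TRUE)   # TRUE counts as 1, FALSE as 0, NA is skipped
n_neg_raw <- sum(neg)             # NA, because one comparison is NA
```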

Steve Lianoglou gives a good answer to a related question, "How to identify the rows in my dataframe with a negative value in any column?"
Let me just replicate his code with one small addition (4th point).
Imagine you had a data.frame like this:
df <- data.frame(a = 1:10, b = c(1:3,-4, 5:10), c = c(-1, 2:10))
This will return you a boolean vector of which rows have negative values:
has.neg <- apply(df, 1, function(row) any(row < 0))
Here are the indexes for negative numbers:
which(has.neg)
Here is a count of rows with negative numbers:
length(which(has.neg))
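A possible vectorised alternative to apply(), sketched on the same data frame: comparing the whole data frame with 0 gives a logical matrix, and rowSums() then counts the negatives per row. The names neg_per_row, has_neg, n_rows_neg and total_neg are only illustrative.

```r
df <- data.frame(a = 1:10, b = c(1:3, -4, 5:10), c = c(-1, 2:10))

neg_per_row <- rowSums(df < 0)   # number of negative values in each row
has_neg <- neg_per_row > 0       # TRUE for rows with at least one negative value

n_rows_neg <- sum(has_neg)       # number of rows containing a negative value
total_neg <- sum(df < 0)         # number of negative cells in the whole data frame
```

This assumes all columns are numeric; a data frame with character or factor columns would need those columns dropped first.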

The solutions above need a small tweak to work on a column of a data frame.
The command below counts the negative entries, or the entries matching any other logical condition.
Suppose you have a data frame:
df <- data.frame(x=c(2,5,-10,NA,7), y=c(81,-1001,-1,NA,-991))
To count the negative records in x:
nrow(df[which(df$x < 0), ])
Wrapping the condition in which() drops the NA in x, which would otherwise be kept as an all-NA row and counted as a match.
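Because this data frame contains NA, it is worth checking how each counting style treats missing values. The sketch below compares plain logical subsetting with summing the condition; the variable names are illustrative.

```r
df <- data.frame(x = c(2, 5, -10, NA, 7), y = c(81, -1001, -1, NA, -991))

# Plain logical subsetting keeps an all-NA row wherever the condition is NA,
# so the NA in x is counted as if it were negative.
n_subset <- nrow(df[df$x < 0, ])

# Summing the condition and dropping NAs counts only the real negatives.
n_x <- sum(df$x < 0, na.rm = TRUE)

# The same count for every column at once.
n_all <- colSums(df < 0, na.rm = TRUE)
```

Here n_subset is 2 although x holds a single negative value, while n_x is 1.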

Related

How to subtract a value from specific values in a column on R

I am working on a data frame column that should hold hours of sleep per night. However, difftime() has returned the hours of sleep as negative values for some rows and the hours awake as positive values for others. I want to subtract 24 from only the values above 0, so I have done:
data$Sleep.time <- with(data = data,
difftime(Bed.time, Waking.up.time, units = "hours"))
data$Sleep.time <- as.numeric(data$Sleep.time)
data$subtract <- (24)
data$Sleep.time <- if (data$Sleep.time>0) {data$Sleep.time - data$subtract}
This just takes 24 away from all of the values, so the values that were already negative are now completely wrong. I'm not quite sure how to use the if function so that this works properly. Any help would be great!
if is not vectorized, i.e. it expects a logical expression of length 1, and the 'Sleep.time' column has more than one element. We can either use ifelse, or create a logical index and use it to subtract and assign:
i1 <- data$Sleep.time> 0
data$Sleep.time[i1] <- data$Sleep.time[i1] - data$subtract[i1]
You could try using ifelse, something like this:
data$Sleep.time <- ifelse(data$Sleep.time > 0, data$Sleep.time - 24, data$Sleep.time)
Syntax: ifelse(condition, value if true, value if false). When the condition is a vector, it returns a vector of the same length.
Hope it helps. This is vectorized, so it is much faster than a loop.
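To see both suggestions side by side, here is a small sketch on a made-up vector standing in for data$Sleep.time (the numbers are invented for illustration). Both approaches change only the positive entries.

```r
# Made-up values standing in for data$Sleep.time (hours).
sleep <- c(-8, 16.5, -7.25, 15, 0)

# Option 1: ifelse() builds a new vector of the same length.
fixed_ifelse <- ifelse(sleep > 0, sleep - 24, sleep)

# Option 2: a logical index modifies only the selected elements.
fixed_index <- sleep
pos <- sleep > 0
fixed_index[pos] <- fixed_index[pos] - 24
```

Both results are -8, -7.5, -7.25, -9 and 0: the values that were already negative (and the zero) are left untouched.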

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function similar to this imaginary one: selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the number of rows to select before each value, and after with the number of rows to select after it.
afterafter would make it possible to select from one row offset to another after the value, e.g. after=5, afterafter=10 (and the same in the other direction with beforebefore).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats each row index that matches your criterion 7 times (two rows before, the row itself, and four after; the L marks the argument as an integer). Adding the offsets -2 through 4 to these repeated indices gives the rows to select, which are then replaced.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
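The rep() idea can be wrapped into something close to the imaginary selector() from the question. The following is only a sketch: select_window is a hypothetical helper (not part of base R or any package), it covers just the before and after arguments, and it assumes that later values may overwrite earlier ones where windows overlap.

```r
# Hypothetical helper: mark `before` rows before and `after` rows after each
# occurrence of every value in `values`. Not part of base R or any package.
select_window <- function(y, values, before = 2L, after = 4L) {
  out <- numeric(length(y))
  offsets <- seq(-before, after)
  for (v in values) {
    hits <- which(y == v)
    # Repeat each hit once per offset, then shift by the offsets (recycled).
    idx <- rep(hits, each = length(offsets)) + offsets
    idx <- idx[idx >= 1 & idx <= length(y)]   # drop indices outside the vector
    out[idx] <- v
  }
  out
}

# Same layout of critical values as in the question's example data.
y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
selection <- select_window(y, values = c(1, 2), before = 2, after = 4)
```

The bounds check matters when a critical value sits near the start or end of the data, where the raw offsets would produce zero, negative, or out-of-range indices.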

Is it possible to create a countif like function in R using ranges?

I've already read this question with an approach to counting entries in R:
how to realize countifs function (excel) in R
I'm looking for a similar approach, except that I want to count data that is within a given range.
For example, let's say I have this dataset:
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
Following the approach on the linked question, we would develop something like this:
count <- data$values == 1.5
sum(count)
Problem is, I want to be able to include in the count anything that varies 0.2 from 1.5 - that is, all possible number from 1.3 to 1.7.
Is there a way to do so?
sum(data$values>=1.3 & data$values<=1.7)
As the explanation in the question you linked to points out, when you just write out a boolean condition, it generates a vector of TRUEs and FALSEs the same length as your original vector. TRUE equals 1 and FALSE equals 0, so summing across it gives you a count. So it simply becomes a matter of writing your condition as a boolean expression. In the case of more than one condition, you connect them with & (and) or | (or), much the same way that you could in Excel (only in Excel you have to use AND() or OR()).
(For a more general solution, you can use dplyr::between. It is also supposed to be faster since it is implemented in C++. In this case, it would be sum(between(data$values, 1.3, 1.7)).)
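A short base R sketch of the compound conditions described above, using the values from the question. The names in_range and out_range are only illustrative.

```r
values <- c(1, 1.2, 1.5, 1.7, 1.7, 2)

# & requires both conditions to hold for an element to count.
in_range <- values >= 1.3 & values <= 1.7
n_in <- sum(in_range)

# | counts an element if either condition holds.
out_range <- values < 1.3 | values > 1.7
n_out <- sum(out_range)

# mean() of a logical vector gives the proportion of TRUEs.
share_in <- mean(in_range)
```

With this data, three values fall inside the range and three fall outside it.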
As @doviod writes, you can use a compound logical condition.
My approach is different: I wrote a function that takes the vector and describes the range by its center point value and a distance delta.
After a suggestion by @doviod (in the comments), I have set a default of delta = 0, so that if only value is passed, the function returns a count of the cases where the values equal the value the user provides.
countif <- function(x, value, delta = 0)
sum(value - delta <= x & x <= value + delta)
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
countif(data$values, 1.5, 0.2)
#[1] 3
which identifies the location of all values in your vector that satisfy your criterion, and length subsequently counts the 'hits'.
length( which(data$values>=1.3 & data$values<=1.7) )
[1] 3

Make vector with 2 elements with equal chance in R

I want to create an R vector of length 200 made up of two repeated elements.
Each element can be either 'x' or 'y', with equal chance.
Is there a function in R for this task?
Any help is appreciated.
A possible way to do it is to use rbinom. Step by step: first generate a vector of 0s and 1s, then change it into x and y:
vec = ifelse(rbinom(200, 1, 0.5)==0,"x","y")
We need a little bit more information to be helpful, but if you want a vector of 200 values, 100 x's and 100 y's, then just do this:
t <- rep(c('X','Y'), 100)
If you want this in a random order:
t <- sample(t)
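Another option, sketched here, is to draw directly from the two labels with sample() and replace = TRUE, so each element is 'x' or 'y' with equal probability. Unlike the rep() version, the counts will usually not be exactly 100 each. The seed is arbitrary and only there to make the sketch reproducible.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Each of the 200 draws is "x" or "y" with probability 0.5.
vec <- sample(c("x", "y"), size = 200, replace = TRUE)

# Tabulate how many of each label were drawn.
counts <- table(vec)
```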

Retrieving minimum non-numeric value

This might be too simple a question, but I'm still familiarising myself with R syntax.
I have a data frame with 2 columns and 3 rows:
The first column is a numeric vector from 1 to 3.
The second column is a character vector with values: best, good, worse.
Which function should I be using in order to obtain the minimum non-numeric value (i.e. "worse")?
Another solution would be to use an ordered factor for the character variable. This way min will know what to do:
dat <- data.frame(a=1:3, b=c("worst","good","best"))
dat$b <- ordered(dat$b, levels=c("worst","good","best"))
min(dat$b)
Result:
> min(dat$b)
[1] worst
Levels: worst < good < best
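Building on the ordered factor, a small sketch of how the lowest level could be pulled out as plain text, and how to find the row it belongs to. The names lowest, highest and lowest_row are illustrative.

```r
dat <- data.frame(a = 1:3, b = c("worst", "good", "best"))
dat$b <- ordered(dat$b, levels = c("worst", "good", "best"))

# min() and max() respect the level order; as.character() drops the factor.
lowest <- as.character(min(dat$b))
highest <- as.character(max(dat$b))

# The integer codes follow the level order, so which.min() finds the row.
lowest_row <- dat[which.min(as.integer(dat$b)), ]
```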
