I am very new to using R and am trying to get my head around different commands. I have this simple code:
setwd("C:/Research")
tempdata=read.csv("temperature_humidity.csv")
Thour=tempdata$t
RHhour=tempdata$RH
weather=data.frame(cbind(hour,Thour,RHhour))
head(weather)
if (Thour>25) {
y=0 else {
y=3
}
x=Thour+y*2
x
I simply want the code to read Thour (temperature) from the CSV file; if it is higher than 25, use y=0 in the formula, and if it's lower than 25, use y=3.
I tried ifelse but that doesn't work either.
Thanks for your help.
I've said this too many times today already, but avoid ifelse as much as possible (it's very inefficient and unnecessary in most cases); try this instead:
c(3, 0)[(Thour >= 25) + 1]
The comparison Thour >= 25 returns a logical vector of TRUE/FALSE values, which is coerced to 0/1 when 1 is added. The result is a vector of 1s and 2s, which are used as indexes into c(3, 0).
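A quick illustration with some made-up values:
Thour <- c(10, 30, 25)
(Thour >= 25) + 1           # 1 2 2  (FALSE/TRUE coerced to 0/1, then shifted by 1)
c(3, 0)[(Thour >= 25) + 1]  # 3 0 0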
Or an even better solution (posted by @BondedDust in the comments) would be:
3*(Thour <= 25)
Here the comparison again returns a logical vector of TRUE/FALSE values, which is coerced to 0/1 when multiplied by 3.
Benchmark comparison:
Thour <- sample(1:100000)
require(microbenchmark)
microbenchmark(ifel = {ifelse(Thour > 25, 0, 3)}, Bool = {3*(Thour <= 25)})
Unit: microseconds
expr min lq median uq max neval
ifel 38633.499 41643.768 41786.978 55153.050 59169.69 100
Bool 901.135 1848.091 1972.434 2010.841 20754.74 100
This should work for you. Just replace what you're naming Thour with the appropriate code.
Thour <- sample(1:100, 1)
Thour
# [1] 8
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 3
And:
Thour <- sample(1:100, 1)
Thour
# [1] 37
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 0
You may need to change the logical operator (>=) to match your exact circumstance, since it's unclear which end of the range, if either, you want to be inclusive.
R has a very flexible syntax. So you can write this in many ways:
# ifelse() function
y <- ifelse(Thour > 25, 0, 3)
# more ifelse()
y <- 3 * ifelse(Thour > 25, 0, 1)
# The simpler way:
y <- 3 * (Thour > 25)
By the way, use <- instead of = for assignment... it's the "preferred" style.
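Plugged back into the question's script, a minimal corrected sketch (assuming the column names t and RH from the question's CSV) could look like:
setwd("C:/Research")
tempdata <- read.csv("temperature_humidity.csv")
Thour <- tempdata$t  # temperature column, named as in the question

# if () only examines a single TRUE/FALSE, so it can't branch per row;
# a vectorized expression handles the whole column at once
y <- 3 * (Thour <= 25)  # 0 when Thour > 25, 3 otherwise
x <- Thour + y * 2
x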
Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?
When the number of resulting pairs (number of columns) using combn is too large, I get the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
This post states that the size limit of a matrix is roughly one billion rows and two columns.
Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array Greedy Algorithm problem on HackerRank, and R outputs are only counted as correct if they're given using cat:
minimumAbsoluteDifference <- function(arr) {
  combos <- combn(arr, 2)  # one pair of elements per column; this matrix is what blows up
  cat(min(abs(combos[1,] - combos[2,])))
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
# This fails
inputFail <- rpois(10e4, 1)
minimumAbsoluteDifference(inputFail)
#Error in matrix(r, nrow = len.r, ncol = count) :
# invalid 'ncol' value (too large or NA)
TL;DR
No need for combn or the like, simply:
min(abs(diff(sort(v))))
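For the small example from the question this gives the same answer as the combn version:
v <- c(3, -7, 0)
min(abs(diff(sort(v))))  # 3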
The Nitty Gritty
Finding the difference between every possible pair of elements is O(n^2). So when we get to vectors of length 1e5, the task is burdensome both computationally and memory-wise.
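To put a number on it:
choose(1e5, 2)  # 4999950000: combn would need a matrix with ~5 billion columns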
We need a different approach.
How about sorting and taking the difference only with its neighbor?
By first sorting, for any element v[j], the smallest such difference involving v[j] is min(|v[j] - v[j-1]|, |v[j] - v[j+1]|). For example, given the sorted vector v:
v = -9 -8 -6 -4 -2 3 8
The smallest difference involving -2 must come from one of its neighbors:
|-2 - 3| = 5
|-4 - (-2)| = 2
There is no need to check any other elements.
This is easily implemented in base R as follows:
getAbsMin <- function(v) min(abs(diff(sort(v))))
I'm not going to use rpois: with any reasonably sized vector, duplicates will be produced, which trivially gives 0 as the answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by @RuiBarradas):
set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
  runif(1e3, -100, 100)
})
randInts100 <- lapply(1:100, function(x) {
  sample(-(1e9):(1e9), 1e3)
})
head(sapply(randInts100, getAbsMin))
[1] 586 3860 2243 2511 5186 3047
identical(sapply(randInts100, minimumAbsoluteDifference2),
sapply(randInts100, getAbsMin))
[1] TRUE
options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226
identical(sapply(randUnif100, minimumAbsoluteDifference2),
sapply(randUnif100, getAbsMin))
[1] TRUE
It's very fast as well:
library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
b = minimumAbsoluteDifference2(randInts100[[50]]),
times = 25, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
a 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 25
b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468 25
Even for very large vectors, the result is near-instantaneous:
set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)
system.time(print(getAbsMin(largeTest)))
[1] 3
user system elapsed
0.083 0.003 0.087
Something like this?
minimumAbsoluteDifference2 <- function(x){
  stopifnot(length(x) >= 2)
  n <- length(x)
  inx <- rep(TRUE, n)  # TRUE marks the elements not yet compared against
  m <- NULL            # running minimum
  for(i in seq_along(x)[-n]){
    inx[i] <- FALSE
    curr <- abs(x[i] - x[which(inx)])  # |x[i] - each later element|
    m <- min(c(m, curr))
  }
  m
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
minimumAbsoluteDifference2(input0) #returns 3
set.seed(2020)
input1 <- rpois(1e3, 1)
minimumAbsoluteDifference(input1) #returns 0
minimumAbsoluteDifference2(input1) #returns 0
inputFail <- rpois(1e5, 1)
minimumAbsoluteDifference(inputFail) # This fails
minimumAbsoluteDifference2(inputFail) # This does not fail
(It is still O(n^2) in time, but it only ever holds one row of differences in memory, so it avoids the giant combn matrix.)
Actually a really nice problem, for which I came up with a solution (see below) that is, however, not beautiful:
Assume you have a vector x and a matrix A that contains the start of an interval in the first column and the end of the interval in the second.
How can I get the elements of x that fall into the intervals given by A?
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
Expected output:
[1] 4 15
You could use the following information if it helps to increase performance:
Both the vector and the rows of the matrix are ordered, and the intervals don't overlap. All intervals have the same length. All numbers are integers, but they can be huge.
Not wanting to be lazy, I came up with the following solution, which is too slow for long vectors and large matrices:
x <- c(4, 7, 15) # Define input vector
A <- cbind(c(3, 9, 14), c(5, 11, 16)) # Define matrix with intervals
b <- vector()
for (i in 1:nrow(A)) {
  b <- c(b, A[i, 1]:A[i, 2])
}
x[x %in% b]
I know that loops in R can be slow, but I did not know how to write the operation without one (maybe there is a way with apply).
We can use sapply to loop over each element of x and test whether it lies within any of the intervals in A.
x[sapply(x, function(i) any(i > A[, 1] & i < A[,2]))]
#[1] 4 15
If length(x) and nrow(A) happen to be the same, then we don't even need the sapply loop and can use the comparison directly (note that this checks each element of x only against the corresponding row of A):
x[x > A[, 1] & x < A[,2]]
#[1] 4 15
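Since the question guarantees that x and the rows of A are sorted and that the intervals don't overlap, a findInterval-based sketch (my own variation; note it treats the interval endpoints as inclusive, unlike the strict comparisons above) can locate each element without comparing it to every interval:
starts <- A[, 1]
ends   <- A[, 2]
idx <- findInterval(x, starts)            # index of the last interval start <= x
hit <- idx > 0 & x <= ends[pmax(idx, 1)]  # pmax() guards against ends[0]
x[hit]
#[1]  4 15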
Here is a method that does not use an explicit loop or an apply function. outer is sometimes much faster.
x[rowSums(outer(x, A[,1], `>=`) & outer(x, A[,2], `<=`)) > 0]
[1] 4 15
This answer is late, but today I had the same problem to solve, and my answer may be helpful for future readers. My solution was the following:
f3 <- function(x, A) {
  Reduce(f = "|",
         x = lapply(1:NROW(A), function(k) x > A[k, 1] & x < A[k, 2]),
         init = logical(length(x)))
}
This function returns a logical vector of the same length as x, indicating whether the corresponding value in x can be found in the intervals or not. If I want the elements themselves, I simply have to write
x[f3(x,A)]
I did some benchmarks, and my function seems to work very well, also when testing with larger data.
Let's define the other solutions suggested in this post:
f1 <- function(x, A) {
  sapply(x, function(i) any(i > A[, 1] & i < A[, 2]))
}
f2 <- function(x, A) {
  rowSums(outer(x, A[, 1], `>`) & outer(x, A[, 2], `<`)) > 0
}
Now they are also returning a logical vector.
The benchmarks on my machine are following:
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A))
#Unit: microseconds
# expr min lq mean median uq max neval
#f1(x, A) 21.5 23.20 25.023 24.30 25.40 61.8 100
#f2(x, A) 18.8 21.20 23.606 22.75 23.70 75.4 100
#f3(x, A) 13.9 15.85 18.682 18.30 19.15 52.2 100
It seems like there is no big difference, but the following example makes it more obvious:
x <- seq(1,100,length.out = 1e6)
A <- cbind(20:70,(20:70)+0.5)
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A), times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#f1(x, A) 4176.172 4227.6709 4419.6010 4484.2946 4539.9668 4569.7412 10
#f2(x, A) 1418.498 1511.5647 1633.4659 1571.0249 1703.6651 1987.8895 10
#f3(x, A) 614.556 643.4138 704.3383 672.5385 770.7751 873.1291 10
That the functions all return the same result can be checked e.g. via:
all(f1(x,A)==f3(x,A))
As this is my first time asking a question on SO, I apologize in advance for any improper formatting.
I am very new to R and am trying to create a function that will return the row value of a data frame column once a running total in another column has met or exceeded a given value (the row that the running sum begins in is also an argument).
For example, given the following data frame, if given a starting parameter of x=3 and stop parameter of y=17, the function should return 5 (the x value of the row where the sum of y >= 17).
X Y
1 5
2 10
3 5
4 10
5 5
6 10
7 5
8 10
The function as I've currently written it returns the correct answer, but I have to believe there is a much more 'R-ish' way to accomplish this than using loops and incrementing temporary variables. I would like to learn the right way rather than form bad habits that I will have to correct later.
A very simplified version of the function:
myFunction <- function(DataFrame, StartRow, Total){
  df <- DataFrame[DataFrame[[1]] >= StartRow, ]
  i <- 0
  j <- 0
  while (j < Total) {
    i <- i + 1
    j <- sum(df[[2]][1:i])
  }
  x <- df[[1]][i]
  return(x)
}
All the solutions posted so far compute the cumulative sum of the entire Y variable, which can be inefficient in cases where the data frame is really large but the index is near the beginning. In this case, a solution with Rcpp could be more efficient:
library(Rcpp)
get_min_cum2 = cppFunction("
  int gmc2(NumericVector X, NumericVector Y, int start, int total) {
    double running = 0.0;
    for (int idx = 0; idx < Y.size(); ++idx) {
      if (X[idx] >= start) {
        running += Y[idx];
        if (running >= total) {
          return X[idx];
        }
      }
    }
    return -1; // Running total never exceeds limit
  }")
Comparison with microbenchmark:
get_min_cum <- function(start, total)
  with(dat[dat$X >= start, ], X[min(which(cumsum(Y) > total))])
get_min_dt <- function(start, total)
  dt[X >= start, X[cumsum(Y) >= total][1]]
set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29
library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
# expr min lq median uq max neval
# get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554 100
# get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531 100
# get_min_cum2(dat$X, dat$Y, 3, 17) 1.157059 1.646184 2.30323 4.628371 256.2487 100
In this case, it's about 100x faster to use the Rcpp solution than other approaches.
Try this, for example; it uses cumsum and vectorized logical subsetting:
get_min_cum <- function(start, total)
  with(dat[dat$X >= start, ], X[min(which(cumsum(Y) > total))])
get_min_cum(3,17)
5
Here you go (using data.table because of ease of syntax):
library(data.table)
dt = data.table(df)
dt[X >= 3, X[cumsum(Y) >= 17][1]]
#[1] 5
Well, here's one way:
i <- 3
j <- 17
min(df[i:nrow(df),]$X[cumsum(df$Y[i:nrow(df)])>j])
# [1] 5
This takes df$X for rows i:nrow(df) and indexes that based on cumsum(df$Y) > j, starting also at row i. This returns all df$X for which the cumsum > j. min(...) then returns the smallest value.
# Indicator trick: zero out Y in the rows before X reaches 3, take the
# cumulative sum, and return the first index where it passes 17
with(df, which(cumsum((X >= 3) * Y) >= 17)[1])
# [1] 5
I have a column in a data table with NaN values. Something like:
my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
setkey(my.dt, x)
I can use the J() function to find all instances where the x column is equal to .2
> my.dt[J(.2)]
x y
1: 0.2 10
2: 0.2 12
But if I try to do the same thing with NaN it doesn't work.
> my.dt[J(NaN)]
x y
1: NaN NA
I would expect:
x y
1: NaN 2
2: NaN 4
3: NaN 6
What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like my.dt[J(NaN), x := 0]
Update: this was fixed a while back, in v1.9.2. From NEWS:
NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.
require(data.table) ## 1.9.2+
my.dt[J(NaN)]
# x y
# 1: NaN 2
# 2: NaN 4
# 3: NaN 6
This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NAs in data.table keys.
The main idea is outlined in the FAQ: NAs are treated as FALSE.
Please feel free to chime in on the conversation on the mailing list. There was a conversation started by @Arun:
http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html
Also, you can read more in the answers and comments to any of the following questions on SO:
subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
In the meantime, your best bet is to use is.na.
While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds.
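Applied to the example data from the question, that looks like this (a sketch; note that is.na() is TRUE for NaN as well, which is exactly what we want here):
my.dt[is.na(x)]          # the three NaN rows
my.dt[is.na(x), x := 0]  # replace NaN with zero, as asked at the end of the question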
library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
expr median
my.dt[.(1)] 1.309948
my.dt[is.na(ID)] 3.444689 <~~ Not bad
my.dt[ID == 1] 4.005093
my.dt[!(!(!(ID)))] 10.038134
### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
Here's a fast workaround that relies a lot on what's actually happening internally (making the code a bit fragile, imo). Because, for sorting purposes, NaN behaves like a very, very negative number, it will always be at the front of your data.table after setkey. We can use that property to isolate those entries like so:
# this will give the index of the first element that is *not* NaN
my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]
# this is equivalent to my.dt[is.nan(x)], but much faster
my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]
Here's a benchmark for Ricardo's sample data:
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
# NOTE: I have to use integer max here - because this example has integers
# instead of doubles, so I'll just add simple helper function (that would
# likely need to be extended for other cases, but I'm just dealing with the ones here)
minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax
library(microbenchmark)
microbenchmark(normalJ = my.dt[J(1)],
naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
#Unit: milliseconds
# expr min lq median uq max neval
# normalJ 1.645442 1.864812 2.120577 2.863497 5.431828 100
# naJ 1.465806 1.689350 2.030425 2.600720 10.436934 100
In my tests the following minN function also covers character and logical vectors:
minN = function(x) {
  if (is.integer(x)) {
    -.Machine$integer.max
  } else if (is.numeric(x)) {
    -.Machine$double.xmax
  } else if (is.character(x)) {
    ""
  } else if (is.logical(x)) {
    FALSE
  } else {
    NA
  }
}
And you will want to add mult = 'first', e.g.:
my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]
See if this is helpful. (Note that !is.finite() is also TRUE for NA and ±Inf, so this picks out exactly the NaN rows only when those other values are absent.)
my.dt[!is.finite(x),]
x y
1: NaN 2
2: NaN 4
3: NaN 6
How can I check whether an integer vector is "sequential", i.e. that the difference between subsequent elements is exactly one? I feel like I am missing something like "is.sequential".
Here's my own function:
is.sequential <- function(x){
  all(diff(x) == rep(1, length(x) - 1))
}
There's no need for rep, since the 1 will be recycled:
Edited to also allow 5:2 as true:
is.sequential <- function(x){
  all(abs(diff(x)) == 1)
}
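A quick check of the edited version:
is.sequential(2:5)  # TRUE
is.sequential(5:2)  # TRUE (descending steps of -1 are now allowed)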
To allow for different step sizes:
is.sequential <- function(x){
  all(diff(x) == diff(x)[1])
}
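And a couple of checks for this generalized version:
is.sequential(seq(2, 14, by = 3))  # TRUE: constant step of 3
is.sequential(c(1, 2, 4))          # FALSE: the step size changes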
So, @Iselzer has a fine answer. There are still some corner cases, though: rounding errors and the starting value. Here's a version that allows rounding errors but checks that the first value is (almost) an integer.
is.sequential <- function(x, eps = 1e-8) {
  if (length(x) && isTRUE(abs(x[1] - floor(x[1])) < eps)) {
    all(abs(diff(x) - 1) < eps)
  } else {
    FALSE
  }
}
is.sequential(2:5) # TRUE
is.sequential(5:2) # FALSE
# Handle rounding errors?
x <- ((1:10)^0.5)^2
is.sequential(x) # TRUE
# Does the sequence need to start on an integer?
x <- c(1.5, 2.5, 3.5, 4.5)
is.sequential(x) # FALSE
# Is an empty vector a sequence?
is.sequential(numeric(0)) # FALSE
# What about NAs?
is.sequential(c(NA, 1)) # FALSE
This question is quite old by now, but in certain circumstances it is actually quite useful to know whether a vector is sequential.
Both of the earlier answers are quite good, but as mentioned by Tommy, the accepted answer has some flaws. It seems natural that a 'sequence' is any sequence of numbers which are equally spaced. This would include negative sequences, sequences with a starting value different from 0 or 1, and so forth.
A very versatile and safe implementation is given below, which accounts for:
negative values (-3 to 1) and negative directions (3 to 1)
sequences with non-integer steps (3.5, 3.6, 3.7, ...)
wrong input types such as infinite values, NA and NaN values, data.frames, etc.
is.sequence <- function(x, ...)
  UseMethod("is.sequence", x)
is.sequence.default <- function(x, ...) {
  FALSE
}
is.sequence.numeric <- function(x, tol = sqrt(.Machine$double.eps), ...) {
  if (anyNA(x) || any(is.infinite(x)) || length(x) <= 1 || diff(x[1:2]) == 0)
    return(FALSE)
  diff(range(diff(x))) <= tol
}
is.sequence.integer <- function(x, ...) {
  is.sequence.numeric(x, ...)
}
n <- 1236
#Test:
is.sequence(seq(-3, 5, length.out = n))
# TRUE
is.sequence(seq(5, -3, length.out = n))
# TRUE
is.sequence(seq(3.5, 2.5 + n, length.out = n))
# TRUE
is.sequence(LETTERS[1:7])
# FALSE (handled by the default method)
Basically the implementation checks if the max and min of the differences are exactly equal.
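To see what that check does, here is a made-up example:
x <- seq(2, 3, length.out = 5)  # steps of 0.25
diff(x)               # 0.25 0.25 0.25 0.25
diff(range(diff(x)))  # 0, which is <= tol, so x counts as a sequence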
While using the S3 class methods makes the implementation slightly more complicated it simplifies checks for wrong input types, and allows for implementations for other classes. For example this makes it simple to extend this method to say Date objects, which would require one to consider if a sequence of only weekdays (or work days) is also a sequence.
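For instance, a hypothetical Date method (a sketch; is.sequence.Date is my own illustration, not part of the benchmark below) could simply delegate to the numeric method on the underlying day counts:
is.sequence.Date <- function(x, ...) {
  # Dates are stored as days since the epoch, so consecutive calendar
  # days form a numeric sequence with step 1
  is.sequence.numeric(as.numeric(x), ...)
}
is.sequence(as.Date("2020-01-01") + 0:6)  # TRUE: seven consecutive days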
Speed comparison
This implementation is very safe, but the S3 method dispatch adds some overhead. For short vectors the benefit is the versatility of the implementation, and it is only around 15% slower at worst. For longer vectors it is, however, slightly faster, as shown in the microbenchmark below.
Note that the median time is the better basis for comparison, as the garbage collector may add unpredictable time to the benchmark.
ss <- seq(1, 1e6)
microbenchmark::microbenchmark(is.sequential(ss),
is.sequence(ss), #Integer calls numeric, adding a bit of overhead
is.sequence.numeric(ss))
# Unit: milliseconds
# expr min lq mean median uq max neval
# is.sequential(ss) 19.47332 20.02534 21.58227 20.45541 21.23700 66.07200 100
# is.sequence(ss) 16.09662 16.65412 20.52511 17.05360 18.23958 61.23029 100
# is.sequence.numeric(ss) 16.00751 16.72907 19.08717 17.01962 17.66150 55.90792 100