Sum every nth points - r

I have a vector and I need to sum every n numbers and return the results. This is the way I plan on doing it currently. Any better way to do this?
v = 1:100
n = 10
sidx = seq.int(from=1, to=length(v), by=n)
eidx = c((sidx-1)[2:length(sidx)], length(v))
thesum = sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
This gives:
thesum
[1] 55 155 255 355 455 555 655 755 855 955

unname(tapply(v, (seq_along(v)-1) %/% n, sum))
# [1] 55 155 255 355 455 555 655 755 855 955
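To see why the integer-division index works, and that it also copes with a trailing chunk shorter than n, here is a small sketch. The length-92 input and the names grp and sums are illustrative assumptions, not part of the original answer.

```r
v <- 1:92
n <- 10

# Zero-based positions divided by n give one group id per block of n elements
grp <- (seq_along(v) - 1) %/% n
table(grp)   # nine groups of 10 elements and a final group of 2

# Sum within each group; unname() drops the group labels
sums <- unname(tapply(v, grp, sum))
sums
```

The last element of sums covers only 91 and 92, so no padding or recycling is involved.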

UPDATE:
If you want to sum every n consecutive numbers use colSums
If you want to sum every nth number use rowSums
As per Josh's comment, this only works if n divides length(v) evenly.
rowSums(matrix(v, nrow=n))
 [1] 460 470 480 490 500 510 520 530 540 550
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955

Update
The old version doesn't work. Here is a new answer that uses rep to create the grouping factor, with no need for cut:
n <- 5
vv <- sample(1:1000,100)
seqs <- seq_along(vv)
tapply(vv,rep(seqs,each=n)[seqs],FUN=sum)
You can use tapply
tapply(1:100,cut(1:100,10),FUN=sum)
or to get a list
by(1:100,cut(1:100,10),FUN=sum)
EDIT
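If you have 1:92, replace the cut call with this one, which starts the breaks at 0 and adds a final break at 92 so that the trailing values 91 and 92 form their own group:
cut(1:92, c(seq(0, 90, 10), 92))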

One way is to convert your vector to a matrix, then take the column sums:
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955
Just be careful: this implicitly assumes that your input vector can in fact be reshaped to a matrix. If it can't, R will recycle elements of your vector to complete the matrix.
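One possible way to sidestep the recycling issue is to pad the vector with NA up to the next multiple of n before reshaping. This is only a sketch under that assumption; the name padded and the length-92 input are made up for illustration.

```r
v <- 1:92
n <- 10

# Assigning a longer length pads the vector with NA at the end
padded <- v
length(padded) <- ceiling(length(v) / n) * n

# na.rm = TRUE makes the NA padding contribute nothing to the last column
sums <- colSums(matrix(padded, nrow = n), na.rm = TRUE)
sums
```

Note that na.rm = TRUE also hides any NA values already present in v, which may or may not be what you want.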

v <- 1:100
n <- 10
cutpoints <- seq( 1 , length( v ) , by = n )
categories <- findInterval( 1:length( v ) , cutpoints )
tapply( v , categories , sum )

I will add one more way of doing it, without any function from the apply family:
v <- 1:100
n <- 10
diff(c(0, cumsum(v)[slice.index(v, 1)%%n == 0]))
## [1] 55 155 255 355 455 555 655 755 855 955

Here are some of the main variants offered so far
f0 <- function(v, n) {
    sidx = seq.int(from=1, to=length(v), by=n)
    eidx = c((sidx-1)[2:length(sidx)], length(v))
    sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
}
f1 <- function(v, n, na.rm=TRUE) { # 'tapply'
    unname(tapply(v, (seq_along(v)-1) %/% n, sum, na.rm=na.rm))
}
f2 <- function(v, n, na.rm=TRUE) { # 'matrix'
    nv <- length(v)
    if (nv %% n)
        v[ceiling(nv / n) * n] <- NA
    colSums(matrix(v, n), na.rm=na.rm)
}
f3 <- function(v, n) { # 'cumsum'
    nv = length(v)
    i <- c(seq_len(nv %/% n) * n, if (nv %% n) nv else NULL)
    diff(c(0L, cumsum(v)[i]))
}
Basic test cases might be
v = list(1:4, 1:5, c(NA, 2:4), integer())
n = 2
f0 fails with the final test, but this could probably be fixed
> f0(integer(), n)
Error in sidx[i]:eidx[i] : NA/NaN argument
The cumsum approach f3 is subject to rounding error, and the presence of an NA early in v 'poisons' later results
> f3(c(NA, 2:4), n)
[1] NA NA
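The test cases above can be run across several variants at once with lapply. The sketch below repeats the definitions of f1 and f2 so that it is self-contained; the names tests, res1 and res2 are illustrative.

```r
f1 <- function(v, n, na.rm = TRUE) {   # 'tapply' variant
    unname(tapply(v, (seq_along(v) - 1) %/% n, sum, na.rm = na.rm))
}
f2 <- function(v, n, na.rm = TRUE) {   # 'matrix' variant
    nv <- length(v)
    if (nv %% n)
        v[ceiling(nv / n) * n] <- NA   # pad up to a multiple of n
    colSums(matrix(v, n), na.rm = na.rm)
}

tests <- list(1:4, 1:5, c(NA, 2:4), integer())
n <- 2

# Apply each variant to every test vector
res1 <- lapply(tests, f1, n = n)
res2 <- lapply(tests, f2, n = n)

# Compare the two variants on the non-empty inputs
for (k in 1:3) print(all(res1[[k]] == res2[[k]]))
```

With na.rm = TRUE, the NA in the third test vector is simply skipped, so it no longer affects the later chunks the way it does in f3.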
In terms of performance, the original solution is not bad
> library(rbenchmark)
> cols <- c("test", "elapsed", "relative")
> v <- 1:100; n <- 10
> benchmark(f0(v, n), f1(v, n), f2(v, n), f3(v, n),
+ columns=cols)
test elapsed relative
1 f0(v, n) 0.012 3.00
2 f1(v, n) 0.065 16.25
3 f2(v, n) 0.004 1.00
4 f3(v, n) 0.004 1.00
but the matrix solution f2 seems to be both fast and flexible (e.g., adjusting the handling of that trailing chunk of fewer than n elements)
> v <- runif(1e6); n <- 10
> benchmark(f0(v, n), f2(v, n), f3(v, n), columns=cols, replications=10)
test elapsed relative
1 f0(v, n) 5.804 34.141
2 f2(v, n) 0.170 1.000
3 f3(v, n) 0.251 1.476

One way is to use rollapply from zoo:
rollapply(v, width=n, FUN=sum, by=n)
# [1] 55 155 255 355 455 555 655 755 855 955
And in case length(v) is not a multiple of n:
v <- 1:92
rollapply(v, width=n, FUN=sum, by=n, partial=T, align="left")
# [1] 55 155 255 355 455 555 655 755 855 183

A little late to the party, but I don't see a rowsum() answer yet. rowsum() is proven more efficient than tapply() and I think it would also be very efficient relative to a few of the other responses as well.
rowsum(v, rep(seq_len(length(v)/n), each=n))[,1]
# 1 2 3 4 5 6 7 8 9 10
# 55 155 255 355 455 555 655 755 855 955
Using @Josh O'Brien's grouping technique would likely improve efficiency even more.
rowsum(v, (seq_along(v)-1) %/% n)[,1]
# 0 1 2 3 4 5 6 7 8 9
# 55 155 255 355 455 555 655 755 855 955
Simply wrap in unname() to drop the group names.

Related

How to convert decimal (base 10) numbers to ternary (base 3)

I was wondering if there is a way to convert decimal numbers to ternary, given that there is a function intToBits for converting to binary.
I actually need to convert a character string like
> S0 <- c("Hello Stac")
to base 3. I thought to first convert it to decimal with
> S01 <- utf8ToInt(S0)
> S01
## [1] 72 101 108 108 111 32 83 116 97 99
then convert the result to base 3. I want to obtain something like this:
> S1
## [1] 2200 10202 11000 11010 11022 1012 10002 11022 10121 10200
For practice, I guess you can try to write your own converter function like below
f <- function(x, base = 3) {
    q <- c()
    # Collect digits from least to most significant
    while (x) {
        q <- c(x %% base, q)
        x <- x %/% base
    }
    # as.numeric(paste0(q, collapse = ""))
    sum(q * 10^(rev(seq_along(q) - 1)))
}
or with recursion
f <- function(x, base = 3) {
    # Pass base along so that non-default bases also work
    ifelse(x < base, x, f(x %/% base, base) * 10 + x %% base)
}
then you can run
> sapply(utf8ToInt(S0),f)
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
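Both functions above return the digits packed into a decimal number. If the result is meant to be read as base-3 text, a character representation may be safer. The following is only a sketch of that idea; to_base_chr is a hypothetical helper name, and it assumes a single non-negative integer input and a base between 2 and 10.

```r
to_base_chr <- function(x, base = 3) {
    stopifnot(length(x) == 1, x >= 0, base >= 2, base <= 10)
    if (x == 0) return("0")
    digits <- character(0)
    # Peel off digits from least to most significant
    while (x > 0) {
        digits <- c(x %% base, digits)
        x <- x %/% base
    }
    paste(digits, collapse = "")
}

codes <- utf8ToInt("Hello Stac")
res <- vapply(codes, to_base_chr, character(1), base = 3)
res

# Round trip: base R's strtoi() parses the strings back in base 3
all(strtoi(res, base = 3) == codes)
```

Keeping the digits as strings avoids any loss of precision when the digit string is long, at the cost of needing strtoi() or similar to do arithmetic on the result.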
Nice programming exercise. I have vectorized @ThomasIsCoding's answer to avoid expensive loops over strings and characters within strings. The idea is to loop over digits instead, since Unicode code points do not exceed 21 digits in any base, whereas the total number of characters in a character vector can be orders of magnitude greater.
The function below takes as arguments a character vector x, a base b (from 2 to 10), and a logical flag double. It returns a list res such that res[[i]] is an nchar(x[i])-length vector giving the base-b representation of x[i]. The list elements are double vectors or character vectors depending on double.
utf8ToBase <- function(x, b = 10, double = TRUE) {
    ## Do some basic checks
    stopifnot(is.character(x), !anyNA(x),
              is.numeric(b), length(b) == 1L,
              b %% 1 == 0, b >= 2, b <= 10)
    ## Require UTF-8 encoding
    x <- enc2utf8(x)
    ## Operate on concatenation to avoid loop over strings
    xx <- paste(x, collapse = "")
    ixx <- utf8ToInt(xx)
    ## Handle trivial case early
    if (length(ixx) == 0L) {
        el <- if (double) base::double(0L) else character(0L)
        res <- rep.int(list(el), length(x))
        names(res) <- names(x)
        return(res)
    }
    ## Use common field width determined from greatest integer
    width <- as.integer(floor(1 + log(max(ixx, 1), base = b)))
    res <- rep.int(strrep("0", width), length(ixx))
    ## Loop over digits
    pos <- 1L
    pow <- b^(width - 1L)
    while (pos <= width) {
        quo <- ixx %/% pow
        substr(res, pos, pos) <- as.character(quo)
        ixx <- ixx - pow * quo
        pos <- pos + 1L
        pow <- pow %/% b
    }
    ## Discard leading zeros
    if (double) {
        res <- as.double(res)
        if (b == 2 && any(res > 0x1p+53)) {
            warning("binary result not guaranteed due to loss of precision")
        }
    } else {
        res <- sub("^0+", "", res)
    }
    ## Return list
    res <- split(res, rep.int(gl(length(x), 1L), nchar(x)))
    names(res) <- names(x)
    res
}
x <- c(foo = "Hello Stack Overflow!", bar = "Hello world!")
utf8ToBase(x, 2)
$foo
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1010011 1110100 1100001 1100011 1101011 100000
[13] 1001111 1110110 1100101 1110010 1100110 1101100
[19] 1101111 1110111 100001
$bar
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1110111 1101111 1110010 1101100 1100100 100001
utf8ToBase(x, 3)
$foo
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
[11] 10222 1012 2221 11101 10202 11020 10210 11000 11010 11102
[21] 1020
$bar
[1] 2200 10202 11000 11000 11010 1012 11102 11010 11020 11000
[11] 10201 1020
utf8ToBase(x, 10)
$foo
[1] 72 101 108 108 111 32 83 116 97 99 107 32 79 118 101
[16] 114 102 108 111 119 33
$bar
[1] 72 101 108 108 111 32 119 111 114 108 100 33
Some caveats:
For efficiency, the function concatenates the strings in x rather than looping over them. It throws an error if the concatenation would exceed 2^31-1 bytes, which is the maximum string size allowed by R.
x <- strrep(letters[1:2], 0x1p+30)
log2(sum(nchar(x))) # 31
utf8ToBase(x, 3)
Error in paste(x, collapse = "") : result would exceed 2^31-1 bytes
The largest Unicode code point is 0x10FFFF. The binary representation of this number exceeds 2^53 when interpreted as decimal, so it cannot be stored in a double vector without loss of precision:
x <- sub("^0+", "", paste(rev(as.integer(intToBits(0x10FFFF))), collapse = ""))
x
## [1] "100001111111111111111"
sprintf("%.0f", as.double(x))
## [1] "100001111111111114752"
As a defensive measure, the function warns if 2^53 is exceeded when b = 2 and double = TRUE.
utf8ToBase("\U10FFFF", b = 2, double = TRUE)
[[1]]
[1] 1.000011e+20
Warning message:
In utf8ToBase("\U{10ffff}", b = 2, double = TRUE) :
binary result not guaranteed due to loss of precision
utf8ToBase("\U10FFFF", b = 2, double = FALSE)
[[1]]
[1] "100001111111111111111"
You can use cwhmisc::int2B:
library(cwhmisc)
int2B(utf8ToInt(S0), 3)[[1]] |> as.numeric()
# [1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200

Replacing NA with mean using loop in R

I have to solve this problem using loop in R (I am aware that you can do it much more easily without loops, but it is for school...).
So I have vector with NAs like this:
trades<-sample(1:500,150,T)
trades<-trades[order(trades)]
trades[sample(10:140,25)]<-NA
and I have to create a FOR loop that will replace NAs with mean from 2 numbers before the NA and 2 numbers that come after the NA.
This I am able to do, with loop like this:
for (i in 1:length(trades)) {
    if (is.na(trades[i])==T) {
        trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]), na.rm = T)
    }
}
But there is another part to the homework. If there is NA within the 2 previous or 2 following numbers, then you have to replace the NA with mean from 4 previous numbers and 4 following numbers (I presume with removing the NAs). But I just am not able to crack it... I have the best results with this loop:
for (i in 1:length(trades)) {
    if (is.na(trades[i])==T && is.na(trades[c(i-1:2)]==T || is.na(trades[c(i+1:2)]==T))) {
        trades[i] <- mean(c(trades[c(i-1:4)], trades[c(i+1:4)]), na.rm = T)
    }else if (is.na(trades[i])==T){
        trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]))
    }
}
But it still misses some NAs.
Thank you for your help in advance.
We can use na.approx from zoo
library(zoo)
na.approx(trades)
Here is another solution using a loop. I shortened some of the code by using lead and lag from dplyr. First, two recursive functions calculate the sums of the following and preceding values. Then conditional statements determine whether any of those neighbours are missing. Lastly, we fill each missing value using either the output of the recursive functions (divided by 4) or the mean of the previous and following 4 values (with NAs removed). I would note that this is not the way I would normally approach this problem, but I tried it with a loop as requested.
library(dplyr)
# Sum of the n values before each position
r.lag <- function(x, n){
    if (n == 1) return(lag(x = x, n = 1))
    else return( lag(x = x, n = n) + r.lag(x = x, n = n-1))
}
# Sum of the n values after each position
r.lead <- function(x, n){
    if (n == 1) return(lead(x = x, n = 1))
    else return( lead(x = x, n = n) + r.lead(x = x, n = n-1))
}
lead.vec <- r.lead(trades, 2)
lag.vec <- r.lag(trades, 2)
output <- numeric(length(trades))
for (i in seq_along(trades)) {
    if (!is.na(trades[[i]])) {
        # Keep observed values as they are
        output[[i]] <- trades[[i]]
    } else if (!is.na(lead.vec[[i]]) && !is.na(lag.vec[[i]])) {
        # All four neighbours are present: average them
        output[[i]] <- (lead.vec[[i]] + lag.vec[[i]])/4
    } else {
        # Otherwise average the 4 previous and 4 following values, ignoring NAs
        output[[i]] <- mean(
            c(trades[[i-4]], trades[[i-3]], trades[[i-2]], trades[[i-1]],
              trades[[i+4]], trades[[i+3]], trades[[i+2]], trades[[i+1]]),
            na.rm = TRUE
        )
    }
}
tibble(
original = trades,
filled = output
)
#> # A tibble: 150 x 2
#> original filled
#> <int> <dbl>
#> 1 7 7
#> 2 7 7
#> 3 12 12
#> 4 18 18
#> 5 30 30
#> 6 31 31
#> 7 36 36
#> 8 NA 40
#> 9 43 43
#> 10 50 50
#> # … with 140 more rows
So it seems that posting to StackOverflow helped me solve the problem.
trades<-sample(1:500,25,T)
trades<-trades[order(trades)]
trades[sample(1:25,5)]<-NA
which gives us:
[1] NA 20 24 30 NA 77 188 217 238 252 264 273 296 NA 326 346 362 368 NA NA 432 451 465 465 490
and if you run this loop:
for (i in 1:length(trades)) {
    if (is.na(trades[i])== T) {
        test1 <- c(trades[c(i+1:2)])
        if (any(is.na(test1))==T) {
            test2 <- c(trades[abs(c(i-1:4))], trades[c(i+1:4)])
            trades[i] <- round(mean(test2, na.rm = T),0)
        }else {
            test3 <- c(trades[abs(c(i-1:2))], trades[c(i+1:2)])
            trades[i] <- round(mean(test3, na.rm = T),0)
        }
    }
}
it changes the NAs to this:
[1] 22 20 24 30 80 77 188 217 238 252 264 273 296 310 326 346 362 368 387 410 432 451 465 465 490
So it works pretty much as expected.
Thank you for all your help.
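For comparison, the two rules from the question (use 2 neighbours on each side, widening to 4 when any of those is NA) could also be sketched with explicitly clamped indices, so that positions near either end of the vector need no special tricks. The function name fill_na_window is hypothetical, and this version reads neighbours from the original vector rather than from values already filled in, which is a design assumption rather than something the homework specifies.

```r
fill_na_window <- function(x) {
    n <- length(x)
    out <- x
    for (i in which(is.na(x))) {
        # Two neighbours on each side, clamped to the valid index range
        near <- setdiff(max(1, i - 2):min(n, i + 2), i)
        if (anyNA(x[near])) {
            # A neighbour is missing: widen the window to four on each side
            near <- setdiff(max(1, i - 4):min(n, i + 4), i)
        }
        out[i] <- mean(x[near], na.rm = TRUE)
    }
    out
}

fill_na_window(c(10, 20, NA, 40, 50))
fill_na_window(c(1, 2, 3, 4, NA, NA, 7, 8, 9, 10))
```

If every value in the widened window is NA, mean() returns NaN, so a real solution may need an extra check for that case.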

How to combine the n arguments in c() [R]?

I have generated a random matrix d, then performed some matrix operations.
Finally, I need to store the result in vector B. Code is below
set.seed(42)
n <- 3
m <- 4
d <- matrix(sample(0:255, n*m, replace=T), nrow = n, ncol = m)
# some matrix operation
B <-c(d[1,], d[2,], d[3,])
> d
[,1] [,2] [,3] [,4]
[1,] 234 212 188 180
[2,] 239 164 34 117
[3,] 73 132 168 184
> B
[1] 234 212 188 180 239 164 34 117 73 132 168 184
>
Could someone please explain how to rewrite the last line using a function, in order to combine the n arguments into one vector?
I have tried
B <- sapply(1:n, FUN=function(i) B<-c(d[i,]))
Thanks!
This function should do it (overkill, since c(t(d)) as suggested by @joran works fine):
vectorizeByRow <- function(IN) {
    OUT <- rep(NA_real_, length(IN))
    nc <- ncol(IN)
    nr <- nrow(IN)
    a <- seq(1, length(IN), nc)
    b <- a + nc - 1
    for (n in 1:length(a)) {
        OUT[a[n]:b[n]] <- IN[n,]
    }
    OUT
}
Use:
vectorizeByRow(d)
Produces:
[1] 234 212 188 180 239 164 34 117 73 132
[11] 168 184
This is from the HandyStuff package. Disclaimer: I am the author.
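As a quick check of the c(t(d)) suggestion, the sketch below compares it with the manual row-by-row concatenation from the question and with an lapply-based version that works for any number of rows. The names B_manual, B_t and B_rows are illustrative.

```r
set.seed(42)
n <- 3
m <- 4
d <- matrix(sample(0:255, n * m, replace = TRUE), nrow = n, ncol = m)

# Manual concatenation, as in the question
B_manual <- c(d[1, ], d[2, ], d[3, ])

# Transposing makes column-major flattening walk along the rows of d
B_t <- c(t(d))

# Generic version: collect each row, then flatten the list
B_rows <- unlist(lapply(seq_len(nrow(d)), function(i) d[i, ]))

identical(B_manual, B_t)
identical(B_manual, B_rows)
```

The transpose version is the shortest, because R stores matrices column by column and c() simply drops the dimensions.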

Number of unique values within a range in data-frame

From a data-frame, I want to extract the number of unique values (of X) within a certain range of Y (e.g. for every 0-100, 101-200, 201-300, etc. up to 3000).
Example df
X Y
169 183
546 64
154 148
593 203
60 243
568 370
85 894
168 169
154 148
83 897
…
A time consuming way would be to run the following code for each range:
junk<-subset(df, Y > 0 & Y < 100)
length(unique(junk$record.no))
But I have to ask the experts - there must be a better way?
You can do it with by() and cut():
data <- data.frame(X=ceiling(rnorm(10000, 500, 10)), Y=runif(10000, 0, 3000))
data$Groups <- cut(data$Y, seq(0, 3000, 100)) # Create a categorical variable for each range
by(data$X, data$Groups, function(x) length(unique(x)))
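The same idea can be tried on the ten rows shown in the question. This is a sketch that assumes the columns are really named X and Y; it uses tapply instead of by so that the result is a plain named vector, and the names bins and counts are illustrative.

```r
df <- data.frame(
    X = c(169, 546, 154, 593, 60, 568, 85, 168, 154, 83),
    Y = c(183, 64, 148, 203, 243, 370, 894, 169, 148, 897)
)

# One bin per 100 units of Y: (0,100], (100,200], ..., up to 3000
bins <- cut(df$Y, breaks = seq(0, 3000, by = 100))

# Number of distinct X values per bin; bins with no rows come back as NA
counts <- tapply(df$X, bins, function(x) length(unique(x)))

# Treat empty bins as zero and show only the non-empty ones
counts[is.na(counts)] <- 0
counts[counts > 0]
```

Note that cut() uses right-closed intervals by default, so a Y of exactly 100 falls into (0,100], not (100,200].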
This seems valid:
aggregate(DF$X, list(cut(DF$Y, seq(0, 1000, 100))), function(x) unique(x))
# Group.1 x #or length(unique(x))
#1 (0,100] 546
#2 (100,200] 169, 154, 168
#3 (200,300] 593, 60
#4 (300,400] 568
#5 (800,900] 85, 83
You can run a for loop based on the range you want and the size of the dataframe and then count the number of levels by converting to factor:
range <- 100 #based on example
loops <- nrow(df)/range
lvlMatrix <- matrix(nrow=0,ncol=2,dimnames=list(NULL,c("range","unique values")))
for(a in 1:loops){
    sub <- df[((a-1)*range):(range*a),]
    lvls<-nlevels(factor(sub$X))
    lvlMatrix <- rbind(lvlMatrix,cbind(paste(as.character((a-1)*range),"-",as.character(range*a),sep=""),lvls))
}

creating functions to calculate the technical error and the coefficient of variation of the error

I have this equation (which can be accessed through this link):
I would like to create two functions using R. The first one should implement the first equation provided.
The second function is to create a mathematical formula in which the first function is substituted. Here is the formula:
(http://i43.tinypic.com/b6vq5j.jpg)
This is the head of my data: (data_1)
  sex age seca1 chad1  DL alog1 dig1 scifirst1 crimetech1
1   F  20  1754  1750 175    95   95       432        429
2   F  19  1594  1596 158    56   55       420        417
3   F  20  1556  1558 156    74   72       435        437
4   F  18  1648  1640 167    67   65       431        434
5   F  19  1780  1780 178    99   67       433        431
6   F  19  1610  1620 165    56   54       423        425
After doing this as @janos suggested:
f1 <- function(x, y) {sqrt(sum((x - y) ^ 2) / 2 / length(x))}
Now, as I need to run f1 on data_1$alog1 vs data_1$dig1, here's what I did:
f1(data_1$alog1, data_1$dig1)
which gives: 4.3
Next, I tried to implement the 2nd formula like this:
f2 <- function(x, y){(f1 / ((x + y) / 2)) * 100}
but then, when I run it on data_1$alog1 vs data_1$dig1 to calculate the coefficient of variation of the error for these data I get:
> f2(data_1$alog1, data_1$dig1)
Error in f1/((x + y)/2) : non-numeric argument to binary operator
Could anyone please comment on the steps performed to create the first function, the second function, and the way I run the functions on "alog1 vs dig1"?
Thanks all!!
If I understood correctly, here you go:
f1 <- function(x, y) {
sqrt(sum((x - y) ^ 2) / 2 / length(x))
}
f1(1:3, 4:6)
This will output:
[1] 2.12132
The function assumes that x and y are both vectors of the same length.
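The error in your f2 arises because it uses f1 as if it were a number; f1 is a function, so it has to be called as f1(x, y). With that fix and some simplification, the 2nd function becomes: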
f2 <- function(x, y) {
200 * f1(x, y) / (x + y)
}
f2(1:3, 3:5)
To check that two vectors have the same length, you can use the length function. It can also be useful to halt execution if this assumption fails, like this:
stopifnot(length(x) == length(y))
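Putting these pieces together, a version with the length check built in might look like the sketch below. The names tem and cv_error are hypothetical, the data are the six alog1 and dig1 values from the head of data_1, and cv_error simply mirrors the f2 above, since the original formula image is not available here. Whether the intended formula divides by the per-pair mean or by an overall mean is therefore an assumption.

```r
# Technical error: sqrt of the summed squared differences over 2n
tem <- function(x, y) {
    stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
    sqrt(sum((x - y)^2) / (2 * length(x)))
}

# Error relative to the mean of each pair, in percent (same form as f2)
cv_error <- function(x, y) {
    200 * tem(x, y) / (x + y)
}

alog1 <- c(95, 56, 74, 67, 99, 56)
dig1  <- c(95, 55, 72, 65, 67, 54)

tem(alog1, dig1)        # a single number
cv_error(alog1, dig1)   # one value per pair of measurements

# Mismatched lengths now stop with an error instead of silently recycling
mismatch <- tryCatch(tem(1:3, 1:2), error = function(e) "length mismatch")
mismatch
```

Because tem() is called with parentheses inside cv_error(), the "non-numeric argument to binary operator" error from the question does not occur.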
