Applying a function over certain values in a vector (R)

I would like to know if there is some elegant solution to this problem:
Let's say I have a vector of values
a <- c(1,2,3,3.1,3.2,5,6,7,7.1,7.2,9)
and I want to apply some function (e.g. mean) only to values fulfilling a certain condition, which in this case is that consecutive values differ by less than 0.5.
So the values that should be averaged are (3,3.1,3.2) and (7,7.1,7.2) and the function should return vector
b <- c(1,2,3.1,5,6,7.1,9)
Edit: One approach I've tried (not sure if it's right) is to binarize the differences of the vector a (1 meaning the difference between consecutive values is < 0.5; 0 meaning it is >= 0.5), which gives the vector
bin <- c(0,0,1,1,0,0,0,1,1,0)
but I don't know how to apply mean to the separate groups of ones. So the main problem for me is to distinguish the groups of needed values and apply the mean to them separately. Any ideas?
I am new here so if anything is unclear, please let me know. Thank you in advance.

This doesn't qualify as elegant, but I think that it works in the case you provide. I use rle (base R) to identify runs where diffs are less than 0.5.
a <- c(1, 2, 3, 3.1, 3.2, 5, 6, 7, 7.1, 7.2, 9)
crit <- diff(a) < 0.5
crit <- c(head(crit, 1), crit) | c(crit, tail(crit, 1))
run <- rle(crit)
aa <- split(a, rep(seq(length(run$lengths)), times=run$lengths))
myFun <- function(crit, val) {
  if (crit) {
    mean(val)
  } else {
    val
  }
}
unlist(mapply(FUN=myFun, crit=run$values, val=aa, USE.NAMES=FALSE))
Yields:
> unlist(mapply(FUN=myFun, crit=run$values, val=aa, USE.NAMES=FALSE))
[1] 1.0 2.0 3.1 5.0 6.0 7.1 9.0
Maybe someone can build a cleaner solution from this.
Update: OP points out that this fails on a sequence like {3, 3.1, 3.2, 7, 7.1, 7.2} because the code above lumps this into one run and averages across the whole sequence. Here's a more robust solution.
a <- c(1, 2, 3, 3.1, 3.2, 7, 7.1, 7.2, 10)
run <- unclass(rle(diff(a) < 0.5))
len <- run$lengths
val <- run$values
pos <- seq_along(len)
last <- pos == max(pos)
len <- len + val - c(0, head(val, -1)) + (last * !val)
prevLen <- c(0, head(cumsum(len), -1))
myFun <- function(l, v, pl, x) {
  if (l == 0) {
    NULL
  } else {
    seg <- seq(l) + pl
    if (v == TRUE) {
      mean(x[seg])
    } else {
      x[seg]
    }
  }
}
unlist(mapply(FUN=myFun, l=len, v=val, pl=prevLen, MoreArgs=list(x=a)))
Now whenever it comes across a small-difference run (i.e., val == TRUE) it adds one more element to that run's length (len + val), because a run of m small differences spans m + 1 values of a. The run that follows then gives up one element (- c(0, head(val, -1))), since its left endpoint was already claimed by the preceding run, and the last run receives the one extra element that a has over diff(a) unless it is itself a small-difference run (last * !val).
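As a quick sanity check (worked out by hand rather than run), the adjusted bookkeeping for the example vector comes out as follows, which is why the middle run contributes nothing and the two triples collapse to their means:
len     # 2 3 0 3 1 -- the run separating the two small-difference runs is emptied
prevLen # 0 2 5 5 8
# so the mapply call above should return 1.0 2.0 3.1 7.1 10.0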

Maybe I overcomplicated the problem:
a <- c(1,2,3,3.1,3.2,5,6,7,7.1,7.2,9)
thr <- 0.5
## create a correct binary vector
d <- diff(a)
d <- c(d[1], d)
rd <- abs(diff(rev(a)))
rd <- c(rd[1], rd)
dc <- d < thr | rd < thr
# [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## use rle to count continuous values
r <- rle(dc)
r
# Run Length Encoding
# lengths: int [1:5] 2 3 2 3 1
# values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
## create grouping vector
groups <- double(length(a))
groups[!dc] <- seq(sum(!dc))
groups[dc] <- sum(!dc)+rep(seq(sum(r$values)), r$lengths[r$values])
groups
# [1] 1 2 6 6 6 3 4 7 7 7 5
## create mean for each group
m <- tapply(a, groups, FUN=mean)
m
# 1 2 3 4 5 6 7
# 1.0 2.0 5.0 6.0 9.0 3.1 7.1
## restore the original order
m[order(unique(groups))] <- m
m
# 1 2 3 4 5 6 7
# 1.0 2.0 3.1 5.0 6.0 7.1 9.0

Another possibility based on ave
# find id on which mean should be calculated
id1 <- which(diff(a) < 0.5)
id2 <- sort(union(id1, id1 + 1))
id2
# [1] 3 4 5 8 9 10
# group the id
grp <- cumsum(c(1, diff(id2)) - 1)
grp
# [1] 0 0 0 2 2 2
# calculate mean per group and insert into original vector
a[id2] <- ave(a[id2], grp)
a
# [1] 1.0 2.0 3.1 3.1 3.1 5.0 6.0 7.1 7.1 7.1 9.0
# remove duplicated means, i.e. remove index of duplicated values of grp
a[-id2[as.logical(ave(grp, grp, FUN = function(x) duplicated(x)))]]
# [1] 1.0 2.0 3.1 5.0 6.0 7.1 9.0

Related

R - Calculate rolling mean of previous k non-NA values

I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:
tmp.df <- data.frame(
x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
The first 5 elements of y are NAs because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.
My attempt is this:
roll_mean_previous_k <- function(x, k){
  require(dplyr)
  res <- NA
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}
to be used as follows (using the slide_dbl function from the slider package):
library(dplyr)
tmp.df %>%
mutate(
y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
)
which gives the desired output. However, I'm wondering if there's a ready-made and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they seem to work on a fixed rolling window with the option of dealing with NA values (e.g. ignoring them). In my case, I want to "extend" my window to include k non-NA values.
Any thoughts/suggestions are welcome.
EDIT - SIMULATION RESULTS
Thank you to all contributors. First of all, I had not mentioned that my datasets are much larger and are processed often, so any performance improvements are most welcome. I therefore ran the following simulation to check execution times before deciding which answer to accept. Note that some of the answers needed small tweaks to return the desired output; if you feel that your solution is misrepresented (and therefore looks less efficient than intended), feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below to remove the need for if-else checks on the length of the lagged, non-NA vector.
So here's the simulation code:
library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)
set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA
# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  runner(
    x = z,
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    }
  )
}
# Based on my initial answer (but simplified), also mentioned by GoGonzo
f_slider <- function(z, k){
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    .before = Inf
  )
}
# Based on helios' answer. Returns the correct results but with a warning.
f_helios <- function(z, k){
  reduced_vec <- na.omit(z)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(z))[k] + 1
  repeater <- which(is.na(z)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(z))]
  final <- as.numeric(rep(NA, length(z)))
  index <- start:length(z)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[(start):length(final)] <- na.locf(final)
  final
}
# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  rollapplyr(
    z,
    seq_along(z),
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    k
  )
}
# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  map_dbl(
    seq_along(z),
    ~ ifelse(
      length(tail(na.omit(z[1:(.x - 1)]), k)) == k,
      mean(tail(na.omit(z[1:(.x - 1)]), k)),
      NA
    )
  )
}
# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ),
    FUN = identical,
    f_slider(test_vector, 10)
  )
)
# Run benchmarking #
microbenchmark(
# f_helios(test_vector, 10),
f_purrr(test_vector, 10),
f_runner(test_vector, 10),
f_slider(test_vector, 10),
f_zoo(test_vector, 10)
)
And the results:
Unit: milliseconds
expr min lq mean median uq max neval cld
f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613 100 c
f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485 98.7239 100 b
f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250 32.1847 100 a
f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793 100 d
Based on the above, and unless the code can be further improved, it seems that the slider and runner solutions are the fastest. Any final suggestions are more than welcome.
Many thanks for your time!!
With runner it will be something like the mean of a 3-element tail window of non-NA values. You can achieve the same result with slider.
library(runner)
tmp.df <- data.frame(
x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x,
  f = function(x) {
    mean(tail(x[!is.na(x)], 3))
  }
)
# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x,
  function(x) {
    mean(tail(x[!is.na(x)], 3))
  },
  .before = Inf
)
tmp.df
# x y_runner y_slider
# 1 NA NaN NaN
# 2 1 1.0 1.0
# 3 2 1.5 1.5
# 4 NA 1.5 1.5
# 5 3 2.0 2.0
# 6 4 3.0 3.0
# 7 5 4.0 4.0
# 8 NA 4.0 4.0
# 9 NA 4.0 4.0
# 10 NA 4.0 4.0
# 11 6 5.0 5.0
# 12 7 6.0 6.0
# 13 NA 6.0 6.0
rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollapply and rollapplyr (right aligned) and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or list for width -- see ?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.
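As a tiny illustration of the per-element width idea (a toy example, not from the original answer):
rollapplyr(1:5, seq_along(1:5), sum)
# [1]  1  3  6 10 15   -- element i is aggregated over a window of width i, i.e. a cumulative sum here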
Operation Create a Mean function which takes a vector, removes the last element and all NA's and takes the last k elements of what is left extending it to k elements with NA's as needed. Finally take the mean of that. We use rollapplyr to apply that to x with a width of seq_along(x).
Performance improvements. With this small data the following likely don't make much difference but if you have larger data you could try these which might improve the speed:
replace na.omit with na_rm from the collapse package
replace the second argument of rollapplyr with the code shown here.
The idea here is that the sum of the lengths of the k+1 longest runs of NA plus k+1 forms a bound on the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster than the code in the question when I tried it on a problem with 1300 rows (formed from 100 copies of the data in the question), and does not add much extra code.
pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))
replace the second argument of rollapplyr with w where w is shown here. The idea here is to use findInterval to find the element k non-NA's back, which provides an even tighter bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried with the same 1300 rows, at the expense of adding 2 more lines of code (a rough sketch combining the pieces follows below).
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
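Putting the second improvement together with na_rm, a rough, untested sketch (assuming the collapse package is installed; k = 3 and x = tmp.df$x as in the question) might look like:
library(zoo)
library(collapse)   # provides na_rm(), as suggested above
k <- 3
x <- tmp.df$x
# Mean2 is just the Mean function from below, with na.omit swapped for na_rm
Mean2 <- function(x, k) mean(`length<-`(tail(na_rm(head(x, -1)), k), k))
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
rollapplyr(x, w, Mean2, k = k)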
Code. With the data in the question the code below (not using the above improvements) ran slightly faster (not a lot) than the code in the question based on my benchmarking and it is only two lines of code.
library(dplyr)
library(zoo)
Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))
giving:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and the na.locf methods from the zoo package.
rm(list = ls())
library("zoo")
library("rbenchmark")
library("dplyr")
x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
# your sample (extended)
tmp.df <- data.frame(
x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
)
# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  reduced_vec = na.omit(x)
  unique_means = rollapply(reduced_vec, width = k, mean)
  start = which(!is.na(x))[k] + 1
  repeater = which(is.na(x)) + 1
  repeater_cut = repeater[(repeater > start - 1) & (repeater <= length(x))]
  final <- as.numeric(rep(NA, length(x)))
  index = start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[(start):length(final)] <- na.locf(final)
  final
}
# old implementation
roll_mean_previous_k <- function(x, k){
res <- NA
lagged_vector <- dplyr::lag(x)
lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
previous_k_values <- tail(lagged_vector_without_na, k)
if (length(previous_k_values) >= k) res <- mean(previous_k_values)
res
}
# wrapper function for the benchmarking below
roll_mean_benchmark = function(){
res = tmp.df %>%
mutate(
y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
)
return(res)
}
# some benchmarking
benchmark(roll_mean_k_efficient(x = x, k=3),
roll_mean_benchmark(),
columns=c('test','elapsed','replications'),
replications = 100)
Furthermore, I extended your example vector x to get some more reliable benchmark results via the benchmark function from the rbenchmark package.
In my case the benchmark runtimes that are printed after running the code are:
test elapsed replications
2 roll_mean_benchmark() 4.463 100
1 roll_mean_k_efficient(x = x, k = 3) 0.039 100
Without using zoo. In tidyverse fashion, you can also do it using purrr::map
tmp.df %>% mutate(y = map(seq_along(x), ~ ifelse(length(tail(na.omit(tmp.df$x[1:(.x -1)]), 3)) ==3,
mean(tail(na.omit(tmp.df$x[1:(.x -1)]), 3)),
NA)))
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
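A small follow-up note: map() returns a list, so y above is a list-column; swapping in map_dbl() (as the f_purrr wrapper in the question's benchmark does) gives a plain numeric column. A sketch of that variant:
tmp.df %>%
  mutate(y = purrr::map_dbl(seq_along(x),
                            ~ ifelse(length(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)) == 3,
                                     mean(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)),
                                     NA)))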

Create a vector from another selecting some of its values at random but in order and with a minimum distance between selected ones?

I have a vector of numbers and I would like to select some of them at random but in order. How could I do it?
For example:
vector <- runif(10, min=0, max=101)
vector
[1] 35.956732 67.608039 20.099881 23.184217 9.157408 34.105185 97.459770 25.805254 74.537667 18.865662
Which code can I use to create a new vector containing, for example, four out of the 10 values with the requirement that those four values are in the same order than the original vector? That is, the vector can not be 9.157408 67.608039 74.537667 97.459770 but 67.608039 9.157408 97.459770 74.537667.
Any help would be great. Thanks in advance.
Second part (updated)
What if I want a certain number of steps among consecutive selected values?
That is, if I have for instance this vector:
[1] 2.1 3.4 1.6 8.9 2.3 5.4 6.4 1.3 10.8 3.7 13.4 2.4 5.4 6.8
How can I select 3 out of those 14 values with the additional condition that there have to be at least 3 non-selected values between two selected ones? For example, a selected vector could be 2.1 5.4 6.8 but it couldn't be 1.6 5.4 10.8.
We can sample 4 elements from the vector, then match to get the indices, sort them, and subset the vector so the selected values keep their original order
v1 <- sample(vector, 4)
vector[sort(match(v1, vector))]
If we need to sample 1 element every 4, we could use rollapply by specifying the width and by
library(zoo)
rollapply(v2, 4, by = 4, FUN = function(x) sample(x, 1))
#[1] 1.6 1.3 2.4
Or use a loop
out <- c()
flag <- TRUE
i <- 1
while (flag) {
  if ((i + 4) > length(v2)) {
    flag <- FALSE
    break
  }
  i1 <- i:(i + 2)
  tmp <- sample(i1, 1)
  out <- c(out, tmp)
  i <- tmp + 3
}
out
#[1] 3 7 11
data
v2 <- c(2.1, 3.4, 1.6, 8.9, 2.3, 5.4, 6.4, 1.3, 10.8, 3.7, 13.4, 2.4,
5.4, 6.8)
Is this what you're looking for? Simply use the sort function to put in order.
vector <- runif(10, min=0, max=101)
n <- 5
i <- sort(sample(seq_along(vector),n))
vector[i]
One option is to use the createDataPartition() function from the caret package, e.g.
library(caret)
vector <- runif(10, min=0, max=101)
vector
#>[1] 49.12759 37.39169 99.31837 39.22023 23.15373 62.95305 13.79056 97.71442
#>[9] 52.02225 16.47010
sampling_index <- createDataPartition(y = vector, times = 1,
p = 0.3, list = FALSE)
vector[sampling_index]
#>[1] 49.12759 39.22023 23.15373 97.71442
Try sample like
vector[sort(sample(length(vector),4))]
or
vector[head(which(sample(c(TRUE,FALSE),length(vector),replace = TRUE)),4)]
Update
If you have constraints regarding the minimum spacing between random indices, you can try the code below:
non-optimized method
f1 <- function(vec, n, min_spacing = 4) {
  idx <- seq_along(vec)
  repeat {
    k <- sort(sample(idx, n))
    if (all(diff(k) >= min_spacing)) break
  }
  vec[k]
}
optimized method
f2 <- function(vec, n, min_spacing = 4) {
  u <- unname(tapply(vec, ceiling(seq_along(vec) / min_spacing), sample, size = 1))
  head(u[seq(1, length(u), by = 2)], n)
}
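For example, a hypothetical call on the 14-value vector from the second part of the question (v2 as defined in the data block of the first answer); no output is shown since the draws are random:
f1(v2, n = 3)  # rejection sampling: redraws until the selected indices are at least 4 apart
f2(v2, n = 3)  # one draw per block of min_spacing values, keeping every other block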

Replace values in vector based on previous and next correct value?

If I have a vector like
x = c(1, 2, -1, -2, 5, 6, 7, -1, -2, -3, 8, 9)
I would like, for each negative value, to look ahead and count how many negative values there are ahead, including the current value. So the result would look like
y = c(0, 0, 2, 1, 0, 0, 0, 3, 2, 1, 0, 0)
My ultimate goal is to use these results to create replacement values for the negative values based on the mean of the closest positive values. In this case, I want the result to be:
result = {1,2,(2+5)/2,(2+5)/2,5,6,7,(7+8)/2,(7+8)/2,(7+8)/2,8,9}
Now, I can do this using mutate and lead with different offsets but there must be a simpler way?
Here's another way inspired by #Khashaa's answer:
approx(replace(x, x < 0, NA), xout = seq_along(x),
method = "constant", f = 0.5, rule = 2)$y
# [1] 1.0 2.0 3.5 3.5 5.0 6.0 7.0 7.5 7.5 7.5 8.0 9.0
How it works in pseudocode. Let X = replace(x, x < 0, NA).
X = 1 2 NA NA 5 6 7 NA NA NA 8 9
We are interpolating the function X(i) where i takes values in xout = 1..12, using data
{[i, X(i)] : X(i) not NA, i = 1..12}
= {[i, X(i)] : i = 1, 2, 5, 6, 7, 11, 12}
Since we already have data on X(i) where it is not NA, we only need to fill intervals of NAs, that is i = 3, 4, 8, 9, 10.
method = "constant" fills each interval of NAs, i = 3-4 & i = 8-10, with a single value computed based on the two nearest observations
f = 0.5 weights the two observations equally
rule = 2 fills missing intervals at the start or end of the vector based on the nearest observation (not applicable for this example)
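As a small hedged check of that boundary behaviour (a made-up three-element example, reasoned from the rules above rather than run): with x2 <- c(-1, 2, 3), the leading value becomes NA and rule = 2 extends the nearest observation to the left:
x2 <- c(-1, 2, 3)
approx(replace(x2, x2 < 0, NA), xout = seq_along(x2),
       method = "constant", f = 0.5, rule = 2)$y
# expected: [1] 2 2 3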
For full documentation, see ?approx.
(Thanks to #thothal for explanation and correction in comments.)
1) na.locf0 Replace the negative values with NA and then use na.locf0 both forwards and backwards taking the average of the two.
library(zoo)
x.na <- replace(x, x < 0, NA)
(na.locf0(x.na) + na.locf0(x.na, fromLast = TRUE)) / 2
## [1] 1.0 2.0 3.5 3.5 5.0 6.0 7.0 7.5 7.5 7.5 8.0 9.0
2) cummax/cummin If the non-negative numbers are non-decreasing as in the question's sample data then this would work using only base R. We take cummax forward and cummin backward and average the two (except on the backward pass we replace negative values with Inf).
( cummax(x) + rev(cummin(rev(replace(x, x < 0, Inf)))) ) / 2
## [1] 1.0 2.0 3.5 3.5 5.0 6.0 7.0 7.5 7.5 7.5 8.0 9.0
To throw in a pure base solution:
rl <- rle(x < 0)
unlist(mapply(rl$length, rl$values, FUN = function(l, v) if (v) seq(l, 1) else rep(0, l)))
# [1] 0 0 2 1 0 0 0 3 2 1 0 0
Instead of finding the number of consecutive negative integers, I have tried to achieve your final goal directly. The code finds each series of negative numbers, replaces it with the mean of the adjacent positive numbers, and moves ahead. I have added comments for better understanding, hope it helps. Cheers!
x = c(1, 2, -1, -2, 5, 6, 7, -1, -2, -3, 8, 9)
i <- 1  # iterator
while (i < length(x)) {
  if (x[i] > 0 & x[i+1] < 0) {  # check for the 1st negative number
    temp <- i + 1
    while (x[temp] < 0) {  # check for all consecutive negative numbers
      temp <- temp + 1
    }
    mean <- (x[temp] + x[i]) / 2  # find the mean of the nearest positive numbers
    i <- i + 1
    while (i < temp) {  # replace all negatives with the mean
      x[i] <- mean
      i <- i + 1
    }
  }
  i <- i + 1
}
print(x)
ave(x, with(rle(sign(x)), rep(seq_along(values), lengths)),
FUN = function(x) rev(seq_along(x)) * (x < 0))
# [1] 0 0 2 1 0 0 0 3 2 1 0 0
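To unpack the grouping step (a small check using the same x as above):
g <- with(rle(sign(x)), rep(seq_along(values), lengths))
g
# [1] 1 1 2 2 3 3 3 4 4 4 5 5  -- one id per run of same-signed values
# ave() then applies the countdown function within each of these runs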
If you just wanted to replace the values
indsn = which(x < 0)
indsp = which(x > 0)
replace(x = x,
list = x < 0,
values = sapply(indsn, function(n){
i = indsp[tail(which(indsp < n), 1)]
j = indsp[head(which(indsp > n), 1)]
mean(x[c(i, j)])
}))
# [1] 1.0 2.0 3.5 3.5 5.0 6.0 7.0 7.5 7.5 7.5 8.0 9.0

R: Round the contents within each row so that the row total is equal to a number I specify

I have 170 rows of numbers with decimals that need to be rounded to whole numbers. However, each row's total has to equal a number I specify.
As a very basic illustration, let’s say I have a matrix (1x4) with cell contents (1.2, 3.4, 7.7, 5.3). But let’s say that these numbers represent individuals so I need to round them to whole numbers, such that the group populations are equal to a total population of 18 individuals. If I simply round the matrix contents, which gives me (1, 3, 8, 5), my total population is 17 and I need it to equal 18 (see R commands below).
m <- c(1.2, 3.4, 7.7, 5.3)
m.2 <- round(m)
m.2
[1] 1 3 8 5
sum(m.2)
[1] 17
After the numbers are rounded, I need R to then choose the next number that was closest to rounding up (i.e. 3.4) and round it to 4 instead of 3.
This would give me a matrix of (1, 4, 8, 5) = 18.
Dr. John Fox had helped me out with a simple recursive function to solve the problem:
Round <- function(x, target){
  r.x <- round(x)
  diff.x <- round(x) - x
  if ((s <- sum(r.x)) == target) return(r.x)
  else if (s > target) {
    select <- seq(along = x)[diff.x > 0]
    which <- which.max(diff.x[select])
    x[select[which]] <- r.x[select[which]] - 1
    Round(x, target)
  }
  else {
    select <- seq(along = x)[diff.x < 0]
    which <- which.min(diff.x[select])
    x[select[which]] <- r.x[select[which]] + 1
    Round(x, target)
  }
}
This is very useful for individual rows. But I have 170 rows in my dataset. So that means repeating a process like this (see below) 170 times:
paste(STATA[['b']], collapse=", ")
B <- c(46.8310012817383, 19.9720001220703, 265.837005615234, 95.0400009155273, 6.88700008392334, 190.768997192383, 22.7269992828369, 764.453002929688, 53.0299987792969, 333.329010009766, 55.0960006713867, 84.0210037231445, 28.2369995117188, 2207.27099609375, 86.7760009765625, 50045.46875, 103.304000854492, 413.217987060547, 4.13199996948242, 2.75500011444092, 183.88200378418, 65.4260025024414, 0.689000010490417, 2248.59204101562, 0, 1.37699997425079, 16.5289993286133, 4.13199996948242, 4.13199996948242, 2.75500011444092, 4.13199996948242, 1.37699997425079, 0, 39.9440002441406, 2.75500011444092, 28.2369995117188, 0, 0, 5.51000022888184, 0, 48.8969993591309, 17.9060001373291, 485.531005859375, 1.37699997425079, 59.9169998168945, 221.759994506836, 28.2369995117188, 4.13199996948242, 65.4260025024414, 11.0190000534058, 38.5670013427734, 3.44300007820129, 8.95300006866455, 2.75500011444092, 23.4160003662109, 4.13199996948242, 50.5750015258789, 11.7080001831055, 19.2830009460449, 48.8969993591309, 0, 13.7740001678467, 92.9739990234375)
varB <- (Round(B, 58701))
ROUND2012$varB <- varB
^In this case, I had used the transpose of my dataset in Excel because I found it easier to attach columns to datasets in R as compared to attaching rows. But ideally I wouldn't have to do this and rows would be my territories and columns are group identity population data. Here, 'b' is the name of the column I am calling and 58701 is the population total that the numbers need to add up to after they are rounded.
In short, I'm looking for a function that is helpful for an entire dataset as opposed to individual rows. Ideally I'd be able to call the columns with the numbers to be rounded as well as call the column with the population totals that I need the rounded numbers to equal to.
Updated Info
As a more illustrative example. Let's say I have two racial groups in my population.
B
race1 race2 total
place1 1.2 2.1 3.4
place2 3.4 3.6 7.0
place3 7.7 0.8 8.5
place4 5.3 1.4 6.7
I need these numbers to equal my total registered voters population. The totals are 3.4, 7.0, 8.5, 6.7, but I need the contents within each place row to be rounded such that my place(1-4) totals are 4.0, 7.0, 8.0, and 7.0. So that means for place1, I need the contents to be rounded so that 1.2 becomes 2.0 and 2.1 becomes 2.0. Equals 4.0, my registered voter population. For place2, the total is already at 7 so we're okay. For place3 7.7 would become 7.0 and 0.8 would become 1, giving me 8 in total. Finally for place4, I would need 5.3 to be rounded to 5 and 1.4 to be rounded to 2.0, giving me 7 in total. What I want is:
B
race1 race2 total
place1 2.0 2.0 4.0
place2 3.0 4.0 7.0
place3 7.0 1.0 8.0
place4 5.0 2.0 7.0
Currently the round function pasted above allows me to call one series of numbers at a time, and manually entering in what total they need to be rounded to. But I am looking for a function that could do this all simultaneously. I want to call all the race columns to be rounded, and call a column containing all the necessary population totals.
(note: in practice I had taken the transpose of the matrix in excel and re-imported it back into R because, as a fairly new R user, I found that attaching new columns to the dataset was easier than attaching new rows. But I absolutely do not need to do that step and, indeed, would prefer not to.)
There are several ways you could do this, but taking my comment from above:
Round <- function(x, target) {
  r.x <- round(x)
  diff.x <- round(x) - x
  if ((s <- sum(r.x)) == target) {
    return(r.x)
  } else if (s > target) {
    select <- seq(along = x)[diff.x > 0]
    which <- which.max(diff.x[select])
    x[select[which]] <- r.x[select[which]] - 1
    Round(x, target)
  } else {
    select <- seq(along = x)[diff.x < 0]
    which <- which.min(diff.x[select])
    x[select[which]] <- r.x[select[which]] + 1
    Round(x, target)
  }
}
dat <- read.table(header = TRUE, row.names = paste0('place', 1:4),
text="race1 race2 total
1.2 2.1 3.4
3.4 3.6 7.0
7.7 0.8 8.5
5.3 1.4 6.7")
totals <- c(4.0, 7.0, 8.0, 7.0)
The two examples below simply perform Round on each row, using a 1-1 mapping from the two race columns of dat to the corresponding value in totals.
lapply returns a list, so to transform the output back into a matrix/data frame, we rbind everything back together.
do.call(rbind, lapply(1:nrow(dat), function(x) Round(dat[x, -3], totals[x])))
# race1 race2
# place1 2 2
# place2 3 4
# place3 7 1
# place4 5 2
the output of apply is the transpose of what you want, so we t() the result
dat[3] <- totals
t(apply(dat, 1, function(x) Round(x[1:2], x[3])))
# race1 race2
# place1 2 2
# place2 3 4
# place3 7 1
# place4 5 2
Alternatively, you could probably come up with something more clever using Map/mapply, or Vectorize the Round function, to avoid these explicit loops, but it doesn't seem like your data is very large.
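As a rough sketch of that Map idea (untested; same dat and totals as above, with each row unlisted so Round sees a plain numeric vector):
rows <- lapply(seq_len(nrow(dat)), function(i) unlist(dat[i, 1:2]))
res <- Map(Round, rows, totals)
do.call(rbind, res)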
I came up with a relatively straightforward but lazy method to solve your problem. The basic idea is: 1. check how many additional numbers need to be rounded up a second time; 2. dynamically work out which numbers should preferentially receive that second rounding.
I used the dataset "B" you quoted above, which has a rounded sum of 58701, and I set the designated output total to 58711.
raw <- B
round <- round(B)
data <- data.frame(raw, round)
calc_sum = sum(data$round)
desig_sum = 58711
data$residual = abs(data$raw - data$round)
data$above = ifelse(data$round > data$raw, 1, 0)
data$round2 = 0
data1 <- data[order(data$residual),]
if (calc_sum < desig_sum) {
  diff = desig_sum - calc_sum
  count = 0
  while (count < diff) {
    for (i in 1:nrow(data1)) {
      data_tmp <- subset(data1, round2 == 0 & above == 0)
      # find which number is next in line for its second rounding
      if (data1[i, 4] == 0 & data1[i, 3] == max(data_tmp$residual)) {
        data1[i, 5] = data1[i, 2] + 1
        count = count + 1
      } else {
        count = count
      }
    }
  }
}
data2 <- data1[order(as.numeric(rownames(data1))),]
# Reverting back to the original order
data2$output = 0
for (i in 1:nrow(data2)) {
  if (data2[i, 5] != 0) {
    data2[i, 6] = data2[i, 5]   # use the second-rounded value where one was assigned
  } else {
    data2[i, 6] = data2[i, 2]   # otherwise keep the plain rounded value (column 2, not the raw value)
  }
}
data_final = data2[,6]
I have not yet written the code for the case where calc_sum > desig_sum, but it should not differ much from the above.
Also, if there are not enough numbers to round to your designated total (for example, in the case above, desig_sum = 5), the code won't work.
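For completeness, a hedged, untested sketch of what the symmetric calc_sum > desig_sum branch might look like, mirroring the loop above (round the already-rounded-up values with the largest residuals back down; an extra round2 == 0 guard keeps a row from being adjusted twice):
if (calc_sum > desig_sum) {
  diff = calc_sum - desig_sum
  count = 0
  while (count < diff) {
    for (i in 1:nrow(data1)) {
      data_tmp <- subset(data1, round2 == 0 & above == 1)
      # pick the rounded-up value with the largest remaining residual and round it down instead
      if (data1[i, 4] == 1 & data1[i, 5] == 0 & data1[i, 3] == max(data_tmp$residual)) {
        data1[i, 5] = data1[i, 2] - 1
        count = count + 1
      }
    }
  }
}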
An alternative way to round values so that the total equals a given number, which also works for the case shown in the follow-up question.
You can choose whether the adjustment is made on:
closest numbers
largest numbers
randomly distributed
and also choose the number of decimal places.
#Round to given total
#x..numeric vector
#target..sum of rounded x, if not given target = round(sum(x), digits)
#digits..number of decimal places
#closest..Make adjustment by changing closest number
#ref..reference level to calculate probability of adjustment, if ref==NA the probability of an adjustment is equal for all values of x
#random..should the adjustment be made by random sampling (TRUE) or deterministically (FALSE)
roundt <- function(x, target = NA, digits = 0, closest = TRUE, ref = 0, random = FALSE) {
  if (is.na(target)) {target <- round(sum(x), digits)}
  if (all(x == 0)) {
    if (target == 0) {return(x)}
    x <- x + 1
  }
  xr <- round(x, digits)
  if (target == sum(xr)) {return(xr)}
  if (is.na(ref)) {
    wgt <- rep(1/length(x), length(x))
  } else {
    if (closest) {
      tt <- (x - xr) * sign(target - sum(xr)) + 10^-digits / 2
      wgt <- tt / sum(tt)
    } else {
      wgt <- abs(x - ref) / sum(abs(x - ref))
    }
  }
  if (random) {
    adj <- table(sample(factor(1:length(x)), size = abs(target - sum(xr)) * 10^digits,
                        replace = TRUE, prob = wgt)) * sign(target - sum(xr)) * 10^-digits
  } else {
    adj <- diff(c(0, round(cumsum((target - sum(xr)) * wgt), digits)))
  }
  xr + adj
}
dat <- read.table(text="
race1 race2 total
1.2 2.1 4
3.4 3.6 7
7.7 0.8 8
5.3 1.4 7
3.4 3.6 5
7.7 0.8 12
-5 5 1
0 0 3
0 0 0
", header=T)
apply(dat, 1, function(x) roundt(x[1:2], x[3])) #Default round to target
apply(dat[1:6,], 1, function(x) roundt(x[1:2]*x[3]/sum(x[1:2]))) #Preadjust to target by multiplication
apply(dat, 1, function(x) roundt(x[1:2] + (x[3]-sum(x[1:2]))/2)) #Preadjust to target by addition
apply(dat, 1, function(x) roundt(x[1:2], x[3], cl=F)) #Prefer adjustment on large numbers
apply(dat, 1, function(x) roundt(x[1:2], x[3], ref=NA)) #Give all values the same probability of adjustment
apply(dat, 1, function(x) roundt(x[1:2], x[3], dig=1)) #Use one digit
apply(dat, 1, function(x) roundt(x[1:2], x[3], dig=1, random=TRUE)) #Make the adjustment by random sampling

Pairwise Operations in R

I need to calculate pairwise, consecutive correlations for each of these date variables (there are 246 in my dataset):
Company 2009/08/21 2009/08/24 2009/08/25
A -0.0019531250 -0.0054602184 -6.274510e-03
AA -0.0063291139 -0.0266457680 -1.750199e-02
AAPL 0.0084023598 -0.0055294118 -1.770643e-04 ...
...
So that I can find cor(col1,col2), cor(col2,col3), but nothing for cor(col1,col3). I realize that if I wanted all combinations I could use the combn function, but I can't figure out how to do it for my circumstances without something inefficient like a for loop.
Approach 1
you could do:
lapply(1:(ncol(dat)-1), function(i) cor(dat[, i], dat[, i+1],
use="pairwise.complete.obs"))
Example
A dataframe with 10 variables will give you 9 consecutive correlations, i.e. columns 1-2, 2-3, 3-4 etc, if that is what you want.
dat <- replicate(10, rnorm(10))
lapply(1:(ncol(dat)-1), function(i)
cor(dat[, i], dat[, i+1], use="pairwise.complete.obs"))
Approach 2 (very concise)
Using the iris dataset as well:
dat <- iris[, 1:4]
diag(cor(dat, use="pairwise.complete.obs")[, -1])
[1] -0.1175698 -0.4284401 0.9628654
As you pointed out, combn is the way to go. Assume your data.frame is called dat then for consecutive columns, try this:
ind <- combn(ncol(dat), 2)
consecutive <- ind[ , apply(ind, 2, diff)==1]
lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]]))
Consider this simple example:
> data(iris)
> dat <- iris[, 1:4]
> # changing colnames to see whether result is for consecutive columns
> colnames(dat) <- 1:ncol(dat)
> head(dat) # this is how the data looks like
1 2 3 4
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
>
> ind <- combn(ncol(dat), 2)
> consecutive <- ind[ , apply(ind, 2, diff)==1]
> lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])) # output: cor matrix
[[1]]
1 2
1 1.0000000 -0.1175698
2 -0.1175698 1.0000000
[[2]]
2 3
2 1.0000000 -0.4284401
3 -0.4284401 1.0000000
[[3]]
3 4
3 1.0000000 0.9628654
4 0.9628654 1.0000000
If you want just the correlation, use sapply
> sapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])[2,1])
[1] -0.1175698 -0.4284401 0.9628654
Usually, loops in R should be avoided, but I think they sometimes have an undeserved stigma. In this case, the loop is easier for me to read than "cooler" functions. It's also fairly efficient. Any call like cor(mydata) calculates n^2 correlations, while the for loop only calculates n correlations.
x = matrix( rnorm(246*20000), nrow=246 )
out = numeric(245)
system.time( { for( i in 1:245 )
out[i] = cor(x[,i],x[,i+1]) } )
# 0.022 Seconds
system.time( diag(cor(x, use="pairwise.complete.obs")[, -1]) )
# Goes for 2 minutes and then crashes my R session
First, I'll assume your data is stored in df.
Here's what I'd do. First create a function that, for any given column number, calculates the correlation between that column and the next one, like this
cor.neighbour <- function(i) {
  j <- i + 1
  cr <- cor(df[, i], df[, j])
  # returning a data frame here will make sense when you see the results from lapply
  result <- data.frame(
    x = names(df)[i],
    y = names(df)[j],
    cor = cr,
    stringsAsFactors = FALSE
  )
  return(result)
}
Then to apply it to your whole data, I would first create a vector i of all the column numbers I want to use (which, by the way, is all but the last column), then use lapply to process through them
i <- 1:(ncol(df) - 1)
cor.pairs <- lapply(i, cor.neighbour)
# change the list into a data frame (melt() here comes from the reshape2 package)
cor.pairs <- melt(cor.pairs, id = names(cor.pairs[[1]]))
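A small aside: since each element of cor.pairs is a one-row data frame, base rbind would also stack them without needing the reshape step:
cor.pairs <- do.call(rbind, cor.pairs)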
