How to get a vector that identifies which interval each element belongs to in R

I need to sort my vector values into custom intervals and then identify which element belongs to which interval.
For example if a vector is:
x <- c(1,4,12,13,18,24)
and the intervals are:
interval.vector <- c(1,7,13,19,25)
1st interval: 1 - 7
2nd interval: 7 - 13
3rd interval: 13 - 19
4th interval: 19 - 25
...how do I combine x and interval.vector to get this:
element: 1 4 12 13 18 24
interval: 1 1 2 2 3 4

You can also use cut.
x <- c(1,4,12,13,18,24)
interval.vector <- c(1,7,13,19,25)
x.cut <- cut(x, breaks = interval.vector, include.lowest = TRUE)
data.frame(x, x.cut, group = as.numeric(x.cut))
   x   x.cut group
1  1   [1,7]     1
2  4   [1,7]     1
3 12  (7,13]     2
4 13  (7,13]     2
5 18 (13,19]     3
6 24 (19,25]     4
Another option is the very efficient findInterval function, but I'm not sure how robust this solution is for different variations of x (adding 1L to the breaks assumes that x holds integer values):
findInterval(x, interval.vector + 1L, all.inside = TRUE)
## [1] 1 1 2 2 3 4
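If x may contain non-integer values, shifting the breaks by 1 no longer lines up with the cut() result. A minimal sketch of an alternative, assuming R 3.3.0 or later (where findInterval gained the left.open argument): with left.open = TRUE the intervals are open on the left and closed on the right, and rightmost.closed = TRUE then closes the leftmost interval, which mirrors cut(..., include.lowest = TRUE).

```r
x <- c(1, 4, 12, 13, 18, 24)
interval.vector <- c(1, 7, 13, 19, 25)

# Intervals are (lower, upper]; the first one also includes its lower bound.
idx <- findInterval(x, interval.vector,
                    left.open = TRUE, rightmost.closed = TRUE)
idx
## [1] 1 1 2 2 3 4

# A non-integer value falls into the same interval that cut() would report
idx_frac <- findInterval(7.5, interval.vector,
                         left.open = TRUE, rightmost.closed = TRUE)
idx_frac
## [1] 2
```

Values outside the range of interval.vector get 0 or length(interval.vector) here, whereas cut() returns NA for them.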

Related

R: Creating bins by a factor when number of observations not divisible by number of bins?

I have a data set in which I have a number of DVs for each level of a factor. The number of DVs per factor level is not consistent. I would like to create quantile bins (quartiles, for example), such that for each level of the factor the smallest 25% of values are assigned to bin 1, the next-smallest 25% to bin 2, etc.
I have found a package with a NEAR perfect solution: schoRsch, in which the function ntiles creates bins based on levels of the factor, like so:
library(schoRsch)
#{
dv <- c(5, 2, 10, 15, 3, 7, 20, 44, 18)
factor <- c(1,1,2,2,2,2,3,3,3)
tmpdata <- data.frame(cbind(dv,factor))
tmpdata$factor <- as.factor(tmpdata$factor)
head(tmpdata)
tmpdata$bins <- ntiles(tmpdata, dv = "dv", bins=2, factors = "factor")
tmpdata
#}
the output looks like:
  dv factor bins
1  5      1    2
2  2      1    1
3 10      2    2
4 15      2    2
5  3      2    1
6  7      2    1
7 20      3    2
8 44      3    2
9 18      3    1
My problem occurs when the number of DVs for a particular factor level is not divisible by the number of bins. In the example above, factor level 3 has 3 observations, and when sorting into two bins the first bin gets one observation and the second gets 2. However, I would like the first bin to get priority when assigning a leftover DV, then the second, and so on. In my actual data set, for instance, I have a factor level with 79 associated DVs and 5 bins. So I would want 16 observations in each of bins 1-4, and then 15 in bin 5. However, this method gives me 16 observations in bins 1 and 3-5, and 15 in bin 2.
Is there any way to specify my desired order of binning here? Or is there an alternative method that allows me to bin on the basis of a factor or, more helpfully, multiple factors?
Thank-you!
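Something like this? Note that foo assigns bins by position, so it assumes the values are already sorted within each group.
foo = function(x, bins) {
  len = length(x)
  n1 = ceiling(len/bins)           # size of every bin except the last
  n2 = len - n1 * (bins - 1)       # size of the last bin
  c(rep(1:(bins - 1), each = n1), rep(bins, n2))
}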
table(foo(1:79, 5))
# 1 2 3 4 5
#16 16 16 16 15
library(dplyr)
tmpdata %>% group_by(factor) %>% mutate(bin = foo(dv, 2))
## A tibble: 9 x 3
## Groups: factor [3]
# dv factor bin
# <dbl> <fct> <dbl>
#1 5 1 1
#2 2 1 2
#3 10 2 1
#4 15 2 1
#5 3 2 2
#6 7 2 2
#7 20 3 1
#8 44 3 1
#9 18 3 2
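The function above labels rows by position and puts the whole remainder into the last bin. A minimal sketch of a variant that bins by the rank of the values, so the smallest values land in bin 1, and hands any leftover observations to the lowest-numbered bins first. The helper name bin_by_rank is hypothetical, and base R's ave() is used for the per-group step only to keep the example free of package dependencies.

```r
# Hypothetical helper: assign values to `bins` bins by rank; leftover
# observations go to the lowest-numbered bins first.
bin_by_rank <- function(x, bins) {
  n <- length(x)
  # base size per bin, plus one extra for the first (n %% bins) bins
  sizes <- n %/% bins + (seq_len(bins) <= n %% bins)
  labels <- rep(seq_len(bins), times = sizes)
  # look up each value's bin through its rank (ties keep their original order)
  labels[rank(x, ties.method = "first")]
}

table(bin_by_rank(1:79, 5))
#  1  2  3  4  5
# 16 16 16 16 15

# Per-group binning on the example data
dv <- c(5, 2, 10, 15, 3, 7, 20, 44, 18)
grp <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
bin_id <- ave(dv, grp, FUN = function(v) bin_by_rank(v, 2))
bin_id
# [1] 2 1 2 2 1 1 1 2 1
```

For several grouping factors, ave() accepts more than one grouping vector, for example ave(dv, grp1, grp2, FUN = ...).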

calculating column sum for certain row

I am trying to calculate, for each row, a rolling sum of a column over 5 rows, in R, using the following code:
df <- data.frame(count=1:10)
for (loop in (1:nrow(df)))
{df[loop,"acc_sum"] <- sum(df[max(1,loop-5):loop,"count"])}
But I don't like the explicit loop here, how can I modify it? Thanks.
According to your question, your desired result is:
df
# count acc_sum
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
# 7 7 27
# 8 8 33
# 9 9 39
# 10 10 45
This can be done like this:
df <- data.frame(count=1:10)
library(zoo)
df$acc_sum <- rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
To obtain this result, we reverse the order of df$count, sum the elements (partial = TRUE and align = "left" are important here), and reverse the result to get the vector needed.
rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
# [1] 1 3 6 10 15 21 27 33 39 45
Note that this sums 6 elements, not 5. According to the code in your question, this gives the same output. If you just want to sum 5 rows, just replace the 6 with a 5.
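The same trailing sum can also be sketched in base R with cumsum(), which avoids both the loop and the zoo dependency. The helper name rolling_sum is hypothetical. The idea is that the sum of the last k elements equals the running total minus the running total k rows earlier.

```r
# Hypothetical helper: sum of the current element and the k - 1 before it.
rolling_sum <- function(x, k) {
  cs <- cumsum(x)
  # running total k rows back (0 for the first k rows)
  lagged <- c(rep(0, k), cs)[seq_along(cs)]
  cs - lagged
}

df <- data.frame(count = 1:10)
df$acc_sum <- rolling_sum(df$count, 6)
df$acc_sum
# [1]  1  3  6 10 15 21 27 33 39 45
```

As with the rollapply version, replace the 6 with a 5 to sum only 5 rows.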

R: Split weighted column into equal-sized buckets

I would like to use something like ggplot2's cut_number to split a column into buckets with approximately the same number of observations, where my dataset is in a compact form in which each row has a weight (number of observations).
Example data frame:
df <- data.frame(
x=c(18,17,18.5,20,20.5,24,24.4,18.3,31,34,39,20,19,34,23),
weight=c(1,10,3,6,19,20,34,66,2,3,1,6,9,15,21)
)
If there were one observation of x per row, I would simply use df$bucket <- cut_number(df$x,3) to segment x into 3 buckets with approximately the same number of observations. But how do I take into account the fact that each row is weighted with some number of observations? I'd like to avoid splitting each row into weight rows since the original dataframe already has millions of rows.
Based on the comments, I think this may be the interval set you are seeking. Apologies for the general un-R-ness of it:
dfTest <- data.frame(x=1:6, weight=c(1,1,1,1,4,1))
f <- function(df, n) {
  # target number of observations per bucket
  interval <- round(sum(df$weight) / n)
  buckets <- vector(mode = "integer", length = nrow(df))
  bucketNum <- 1
  count <- 0
  for (i in seq_len(nrow(df))) {
    count <- count + df$weight[i]
    buckets[i] <- bucketNum
    # start a new bucket once the running weight reaches the target
    if (count >= interval) {
      bucketNum <- bucketNum + 1
      count <- 0
    }
  }
  return(buckets)
}
Running this function buckets items as follows:
dfTest$bucket <- f(dfTest, 3)
# x weight bucket
# 1 1 1 1
# 2 2 1 1
# 3 3 1 1
# 4 4 1 2
# 5 5 4 2
# 6 6 1 3
For your example:
df$bucket <- f(df, 3)
# x weight bucket
# 1 18.0 1 1
# 2 17.0 10 1
# 3 18.5 3 1
# 4 20.0 6 1
# 5 20.5 19 1
# 6 24.0 20 1
# 7 24.4 34 1
# 8 18.3 66 2
# 9 31.0 2 2
# 10 34.0 3 2
# 11 39.0 1 2
# 12 20.0 6 3
# 13 19.0 9 3
# 14 34.0 15 3
# 15 23.0 21 3
Here's another approach, based on my assumption that you have weight1 + weight2 + ... observations in total. Furthermore, each row (each 'unique' observation) can only be in one bucket. The approach uses sorting and the cumulative sum of the weights to create the buckets.
#sort data
df <- df[order(df$x),]
#calculate cumulative weights (this is why we sort)
df$cumulative_weight <- cumsum(df$weight)
#create bucket by cumulative weight
n_buckets <- 3
df$bucket <- cut(df$cumulative_weight, n_buckets)
#check: calculate total number of observations per bucket
aggregate(weight~bucket,FUN=sum, data=df)
#        bucket weight
# 1 (9.79,78.7]     77
# 2  (78.7,147]     64
# 3   (147,216]     75
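The cut() call above splits the range of the cumulative weights into equal-width pieces. A minimal sketch of a variant that places the breaks at equal shares of the total weight and returns the bucket for each row in the original row order, so the data frame does not need to be re-sorted. The helper name weighted_buckets is hypothetical, and using the midpoint of each row's cumulative-weight span is an assumption about how a row that straddles a break should be assigned.

```r
# Hypothetical helper: bucket rows by x so that each bucket holds roughly
# 1/n of the total weight. Assumes all weights are positive.
weighted_buckets <- function(x, w, n) {
  ord <- order(x)
  cum_w <- cumsum(w[ord])
  # midpoint of the span of cumulative weight covered by each row
  mid <- cum_w - w[ord] / 2
  breaks <- sum(w) * seq(0, 1, length.out = n + 1)
  bucket_sorted <- cut(mid, breaks = breaks, labels = FALSE,
                       include.lowest = TRUE)
  # map the buckets back to the original row order
  bucket <- integer(length(x))
  bucket[ord] <- bucket_sorted
  bucket
}

df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)
df$bucket <- weighted_buckets(df$x, df$weight, 3)
per_bucket <- tapply(df$weight, df$bucket, sum)
per_bucket
#  1  2  3
# 77 64 75
```

Rows with equal x can still end up in different buckets when they straddle a break. If that matters, aggregate the weights by x first.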

How to replace the rows in data

Hello, I have a table with 5 columns. One of the columns, X, is:
x <- c(1,1,1,1,1,1,2,2,2,3)
How can I change the order of the numbers in vector X, for example putting the 3s first, the 1s second and the 2s third? The output should look like:
x <- c(3,1,1,1,1,1,1,2,2,2)
And reorder not only the values in column X but also the other columns in the corresponding rows.
To clarify the question:
X(old version) -> X(new version)
1 2
2 3
3 1
So, If X=1 make it X=2
If X=2 make it X=3
If X=3 make it X=1
And if, for example, we change X=1 to X=2, all the rows that had X=1 should become rows with X=2.
I have two vectors:
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
The desired output:
x z
1 30
2 10
2 10
2 10
2 10
2 10
2 10
3 20
3 20
3 20
You could
x1 <-c(2,3,1)[x]
x[order(x1)]
# [1] 3 1 1 1 1 1 1 2 2 2
or
x[order(chartr(old="123",new="231",x))]
#[1] 3 1 1 1 1 1 1 2 2 2
Update
If you have many columns.
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
set.seed(14)
y <- matrix(sample(25,10*3,replace=TRUE),ncol=3)
m1 <- as.data.frame(cbind(x,z,y))
x1 <- c(2,3,1)[m1$x]
x1
# [1] 2 2 2 2 2 2 3 3 3 1
res <- cbind(x=c(2,3,1)[m1$x[order(x1)]],subset(m1[order(x1),], select=-x))
res
# x z V3 V4 V5
#10 1 30 10 15 2
#1 2 10 7 23 9
#2 2 10 16 5 11
#3 2 10 24 12 16
#4 2 10 14 22 18
#5 2 10 25 22 19
#6 2 10 13 19 16
#7 3 20 24 9 10
#8 3 20 11 17 14
#9 3 20 13 22 18
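For the two-column example from the question, the recode-then-reorder idea can be sketched more compactly. This assumes the values of x are exactly 1, 2 and 3, so that they can be used directly as an index into the lookup vector. The names df and lookup are illustrative only.

```r
df <- data.frame(
  x = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  z = c(10, 10, 10, 10, 10, 10, 20, 20, 20, 30)
)

# lookup[old value] gives the new value: 1 -> 2, 2 -> 3, 3 -> 1
lookup <- c(2, 3, 1)
df$x <- lookup[df$x]

# reorder whole rows by the recoded x, then reset the row names
res <- df[order(df$x), ]
rownames(res) <- NULL
res
```

This produces the desired output from the question: one row with x = 1 and z = 30, six rows with x = 2 and z = 10, and three rows with x = 3 and z = 20.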
If I'm understanding correctly, it sounds as though you want to define your own order for sorting something. Is that right? Two ways you could do that:
Option #1: Make another column in your data.frame and assign values in the order you'd like. If you wanted the threes to come first, the ones to come second and the twos to come third, you'd do this:
Data$y <- rep(NA, nrow(Data))
Data$y[Data$x == 3] <- 1
Data$y[Data$x == 1] <- 2
Data$y[Data$x == 2] <- 3
Then you can sort on y and your data.frame will have the order you want.
Option #2: If the numbers you list in x are levels in a factor, you could do this using plyr:
library(plyr)
Data$x <- revalue(Data$x, c("3" = "1", "1" = "2", "2" = "3"))
Personally, I think that the 2nd option would be rather confusing, but if you are using "1", "2", and "3" to refer to levels in a factor, that is one quick way to change things.

Is there any way to bind data to data.frame by some index?

#Say I have a situation like this
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
#And I've written a function to calculate the interval
interval <- function(data) {
  interval = c(Inf)
  for (i in seq(1, length(data$time))) {
    intv = data$time[i]-data$time[i-1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return (data)
}
#But when I want to get intervals by user_id and bind them to the data.frame,
#I can't find a proper way
#Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
To that we can prepend Inf to the differences via c(Inf, diff(x)).
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame with a time component that is itself a matrix. We need to convert that matrix to a vector, relying upon the fact that in R, matrices are stored column by column. Finally, we add an interval column to the input data as per your original version of the function. Note that this relies on the rows being ordered as in your example (each block of rows cycling through the user_ids in the same order) and on every user having the same number of rows.
interval <- function(x) {
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  diffs <- as.numeric(diffs$time)
  x <- within(x, interval <- diffs)
  x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
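The per-user differences can also be sketched with base R's ave(), which writes each group's result back into the original row positions. This does not depend on how the rows are ordered across users, and it works when users have different numbers of visits. It does assume that time is increasing within each user.

```r
visit_log <- data.frame(user_id = rep(1:5, 3), time = 1:15)

# For each user_id, the first visit gets Inf and each later visit gets the
# gap to that user's previous visit.
visit_log$interval <- ave(
  as.numeric(visit_log$time), visit_log$user_id,
  FUN = function(t) c(Inf, diff(t))
)
visit_log$interval
#  [1] Inf Inf Inf Inf Inf   5   5   5   5   5   5   5   5   5   5
```

The as.numeric() call is there because time is an integer vector in this example while Inf is a double.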
