Given two sorted vectors, how can you get, for each value in one, the index of the closest value in the other?
For example, given:
a = 1:20
b = seq(from=1, to=20, by=5)
how can I efficiently get the vector
c = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
which, for each value in a, gives the index of the largest value in b that is less than or equal to it? The solution needs to work for unpredictable (though sorted) contents of a and b, and needs to be fast when a and b are large.
You can use findInterval, which constructs a sequence of intervals given by breakpoints in b and returns the interval indices in which the elements of a are located (see also ?findInterval for additional arguments, such as behavior at interval boundaries).
a = 1:20
b = seq(from = 1, to = 20, by = 5)
findInterval(a, b)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
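One boundary case worth noting: values of a below the first breakpoint get index 0, and values past the last breakpoint get the last index. A quick check with the same b as above:
findInterval(c(0, 1, 20, 25), b)
#> [1] 0 1 4 4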
We can use cut, shifting the breakpoints down by 1 so that each right-closed interval (b[i] - 1, b[i+1] - 1] picks up the values from b[i] up to, but not including, b[i+1]:
as.integer(cut(a, breaks = unique(c(b-1, Inf)), labels = seq_along(b)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
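If the b - 1 shift feels indirect, an equivalent sketch keeps the original breakpoints and asks cut for left-closed intervals instead (this assumes, as above, that no element of a falls below b[1]):
as.integer(cut(a, breaks = c(b, Inf), right = FALSE))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4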
I have a data.frame like this:
A B C
4 8 2
1 3 5
5 7 6
It could have more columns and rows.
What I'd like to know is, for each column, how many times it holds the lowest value in its row (in my example the result should be 2 for A and 1 for C).
d = data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))
row_mins = apply(d, 1, min)
# alternatively, slightly more efficient
row_mins = do.call(pmin, d)
colSums(d == row_mins)
# a b c
# 2 0 1
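One caveat: if a row's minimum is tied across several columns, each tied column is counted. A small illustration with made-up data (d2 here is hypothetical):
d2 = data.frame(a = c(1, 2), b = c(1, 3), c = c(5, 2))
colSums(d2 == do.call(pmin, d2))
# a b c
# 2 1 1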
The following works as expected:
m <- matrix(c(1, 2, 3,
              1, 2, 4,
              2, 1, 4,
              2, 1, 4,
              2, 3, 4,
              2, 3, 6,
              3, 2, 3,
              3, 2, 2), byrow = TRUE, ncol = 3)
df <- data.frame(m)
aggdf <- aggregate(df$X3, list(df$X1, df$X2), FUN=sum)
colnames(aggdf) <- c("A", "B", "value")
and results in:
A B value
1 2 1 8
2 1 2 7
3 3 2 5
4 2 3 10
But I would like to treat rows 1/2 and 3/4 as equal, not caring whether observation A is 1 and B is 2 or vice versa.
I also do not care about how the aggregation is sorting A/B in the final data.frame, so both of the following results would be fine:
A B value
1 2 1 15
2 3 2 15
A B value
1 1 2 15
2 2 3 15
How can that be achieved?
You need to get them in a consistent order. For just 2 columns, pmin and pmax work nicely:
df$A = with(df, pmin(X1, X2))
df$B = with(df, pmax(X1, X2))
aggregate(df$X3, df[c("A", "B")], FUN = sum)
# A B x
# 1 1 2 15
# 2 2 3 15
For more columns, use sort, as akrun recommends:
df[1:2] <- t(apply(df[1:2], 1, sort))
By changing 1:2 to all the key columns, this generalizes up easily.
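As a sketch of the generalized version on the same data (the key columns X1 and X2 are sorted within each row, then aggregated as before):
df[1:2] <- t(apply(df[1:2], 1, sort))
aggregate(df$X3, df[1:2], FUN = sum)
#   X1 X2  x
# 1  1  2 15
# 2  2  3 15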
For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion of users carried over between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
  length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
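To line this up with the desired output frame, one sketch is to prepend NA for the first period, which has no prior period to retain from:
data.frame(period = as.integer(names(splt)),
           retention = c(NA, mapply(carryover, splt[-length(splt)], splt[-1])))
  period retention
1      1        NA
2      2 0.5000000
3      3 0.6666667
4      4 0.5000000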
Here is an attempt using dplyr, though it also reaches into the full data with some base R subsetting inside the summarise:
test %>%
  group_by(period) %>%
  summarise(retention = length(intersect(user_id, test$user_id[test$period == (period + 1)])) / n_distinct(user_id)) %>%
  mutate(retention = lag(retention))
The lag shifts each period's carryover down one row, so every period reports retention from the previous period (NA for period 1). This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold the unique IDs in each period
uniques = list()
for (i in 1:max(df$period)) {
  uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates (NA for period 1, which has no prior period)
retentions = rep(NA, times = max(df$period))
for (j in 2:max(df$period)) {
  retentions[j] = mean(uniques[[j - 1]] %in% uniques[[j]])
}
Basically, %in% produces a logical vector indicating whether each element of its first argument appears in the second. Taking the mean of that vector gives the proportion.
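For instance:
c(1, 5) %in% c(2, 3, 1)
# [1]  TRUE FALSE
mean(c(1, 5) %in% c(2, 3, 1))
# [1] 0.5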
I have a bunch of observations
x = c(1, 2, 4, 1, 6, 7, 11, 11, 12, 13, 14)
that I want to turn into the group:
y = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
That is, I want the first 5 integers (1 to 5) to constitute one group, the next 5 integers (6 to 10) the next group, and so on.
Is there a straightforward way to accomplish this without a loop?
Clarification: I need to programmatically create the groups from the input vector (x).
We can use integer division with %/% to create the group. Subtracting 1 from x first keeps exact multiples of 5 (such as 5 itself) in the lower group:
(x - 1) %/% 5 + 1
#[1] 1 1 1 1 2 2 3 3 3 3 3
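A quick boundary check shows why the - 1 shift matters: without it, exact multiples of 5 would land in the next group:
(c(4, 5, 6) - 1) %/% 5 + 1
#[1] 1 1 2
c(4, 5, 6) %/% 5 + 1
#[1] 1 2 2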
You can use ceiling to create the groups:
ceiling(x/5)
# [1] 1 1 1 1 2 2 3 3 3 3 3
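ceiling handles the group boundaries natively, since ceiling(5/5) is 1:
ceiling(c(4, 5, 6) / 5)
# [1] 1 1 2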
I am relatively new to R and I have a problem with a data frame.
I have a very long data frame (df1) with coordinates x, y and a value z, and a shorter data frame (df2) with the same columns but fewer rows. I want to replace the z values in df1 wherever its x and y coordinates match a row of df2.
x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
y = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
z = c(8, 5, 3, 1, 2, 6, 8, 5, 3, 2, 8, 4, 4, 6, 2, 1)
df1 = data.frame(x, y, z)
x1 = c(1, 3, 4)
y1 = c(2, 1, 4)
z1 = c(58, 37, 23)
df2 = data.frame(x1, y1, z1)
names(df2) <- c("x", "y", "z")
I thought that I might use the ifelse function:
df1$znew<-ifelse((df1[,1]== df2[,1])&(df1[,2]==df2[,2]), df2[,3], df1[,3])
But the two objects do not have the same dimensions.
I have tried loops that compare x and y row by row and then decide which z to use, but I can't make them work.
In the end I would like a data frame with a new z variable so I can compare the values and confirm that they really changed. My final data frame would look like:
znew = c(8,58,3,1,2,6,8,5,37,2,8,4,4,6,2,23)
I really appreciate any help, and I am sorry if somebody else has posted a similar question; I have been trying to figure this out all day and can't find an example that suits my case.
Since the two data frames share their column names (thanks to the names(df2) line above), you can do this with merge:
tmp <- merge(df1,df2,all.x = TRUE,by = c('x','y'))
tmp$z.x[!is.na(tmp$z.y)] <- tmp$z.y[!is.na(tmp$z.y)]
> tmp
   x y z.x z.y
1  1 1   8  NA
2  1 2  58  58
3  1 3   3  NA
4  1 4   1  NA
5  2 1   2  NA
6  2 2   6  NA
7  2 3   8  NA
8  2 4   5  NA
9  3 1  37  37
10 3 2   2  NA
11 3 3   8  NA
12 3 4   4  NA
13 4 1   4  NA
14 4 2   6  NA
15 4 3   2  NA
16 4 4  23  23
Then just remove the extra column and rename the columns.
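For that last step, a minimal sketch in base R (using the z.x/z.y names produced by the merge above):
tmp$z.y <- NULL
names(tmp)[names(tmp) == "z.x"] <- "z"
The z column of tmp then matches the znew vector in the question.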