From a dataframe with values per min/max range to a value per key - R

I have a dataframe with values defined per bucket (see df1 below).
I have another dataframe with keys that fall within those buckets, for which I want to look up a value from the bucketed dataframe (see df2 below).
I would like to end up with the result df3 below.
df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
> df2
  KEY
1   2
2   5
3   9
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     8
EDIT: Extended the example.
> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     3
4  14  18     5
> df2
  KEY
1   2
2   5
3   9
4  18
5   3
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     3
4  18     5
5   3     3

This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  group_by(VALUE, id = row_number()) %>%            # for each value and row id
  nest() %>%                                        # nest rest of columns
  mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>%   # create a sequence of keys
  unnest(KEY) %>%                                   # unnest those keys
  right_join(df2, by = "KEY") %>%                   # join the other dataset
  select(KEY, VALUE)
# # A tibble: 5 x 2
#     KEY VALUE
#   <dbl> <dbl>
# 1  2.00  3.00
# 2  5.00 56.0
# 3  9.00  3.00
# 4 18.0   5.00
# 5  3.00  3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(K = map(data, ~data.frame(VALUE = .$VALUE,
                                   KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by = "KEY") %>%
  select(KEY, VALUE)

A very good and well-thought-out solution from @AntioniosK.
Here's a base R solution implemented as a general lookup function that takes a key dataframe and a bucket dataframe, defined as in the question, as arguments. The lookup ranges need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
lookup = function(keydf, bucketdf){
  keydf$rowid = 1:nrow(keydf)
  T = merge(bucketdf, keydf)                 # Cartesian join of every key with every bucket
  T = T[T$KEY >= T$MIN & T$KEY <= T$MAX, ]   # keep rows where the key falls inside the bucket
  T = merge(T, keydf, all.y = TRUE)          # recover keys that matched no bucket (VALUE becomes NA)
  T[order(T$rowid), c("rowid", "KEY", "VALUE")]
}
The first merge is a Cartesian join of all rows in the key dataframe with all rows in the bucket list. Such joins can be inefficient if the real tables are large, since joining x key rows to y bucket rows produces x*y rows; I doubt this would be a problem here unless x or y run into thousands of rows.
The second merge recovers any key values which are not matched to rows in the bucket list.
Using the example data as listed in @AntoniosK's post:
> lookup(df2, df1)
  rowid KEY VALUE
2     1   2     3
4     2   5    56
5     3   9     3
1     4  18     5
3     5   3     3
Using key and bucket exemplars that test edge cases (where the key equals the min or the max of a range), a key value not in the bucket list (the value 50 in df2A), and a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1, 4, 8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
> df4
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
4  20  25    10
5  30  40    12
6  22  24    23
> df2A
  KEY
1   3
2   6
3  22
4  30
5  50
> lookup(df2A, df4)
  rowid KEY VALUE
1     1   3     3
2     2   6    56
3     3  22    10
4     3  22    23
5     4  30    12
6     5  50    NA
As shown above, the lookup returns two values for the key value 22, which falls in two overlapping ranges, and NA for key values that are not in the bucket list.
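As an aside (not part of the answers above), the same lookup can be expressed without the intermediate Cartesian join by using a data.table non-equi join; a minimal sketch, assuming the df4/df2A data above and that data.table is installed:
library(data.table)

dt_buckets <- as.data.table(df4)
dt_keys    <- as.data.table(df2A)

# Non-equi join: for each KEY, find bucket rows with MIN <= KEY and MAX >= KEY.
# nomatch = NA keeps keys (such as 50) that fall in no bucket.
dt_buckets[dt_keys, on = .(MIN <= KEY, MAX >= KEY),
           .(KEY = i.KEY, VALUE = VALUE), nomatch = NA]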

Related

Checking if columns in dataframe are "paired"

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
  A     B
  1   5.5
  1   5.5
  2   201
  9    18
  9    18
  2   201
  9    18
...   ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        1
2     2        1
3     9        1
If we now alter df so that this is no longer true:
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        2
2     2        1
3     9        1
Observe the increased count for group 1. As you have more than 10,000 rows, what remains is to check whether there is at least one group with n_unique > 1, for instance with filter(n_unique > 1), as sketched below.
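For example, building on the pipeline above (a sketch; the columns are paired exactly when this returns zero rows):
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n()) %>%
  filter(n_unique > 1)   # any remaining rows are values of A with more than one B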
If you run this, you will see how many unique values of B there are for each value of A (using df as defined above):
tapply(df$B, df$A, function(x) length(unique(x)))
So if the maximum of this vector is 1, then no value of A has more than one corresponding value of B.
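Wrapped as a single logical check (a sketch using the same df; is_paired is a name introduced here for illustration):
is_paired <- max(tapply(df$B, df$A, function(x) length(unique(x)))) == 1
is_paired   # TRUE if every value of A corresponds to exactly one value of B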

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)
I'm trying to sum up the values of b within the groups of a. The trick is that, for each row, I want the sum of b over the step_window rows starting at the current row.
This is the output I'm looking for:
data.frame(
  a = rep(1:2, each = 5),
  step_window = c(2, 3, 1, 2, 4,
                  1, 2, 3, 2, 1),
  b = 1:10,
  sum_b_step_window = c(3, 9, 3, 9, 5,
                        6, 15, 27, 19, 10)
)
I tried to do this using RcppRoll, but I get the error "Expecting a single value":
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)

df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
                                             ~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
#        a     b step_window sum_b_step_window
#    <int> <int>       <dbl>             <dbl>
#  1     1     1           2                 3
#  2     1     2           3                 9
#  3     1     3           1                 3
#  4     1     4           2                 9
#  5     1     5           4                 5
#  6     2     6           1                 6
#  7     2     7           2                15
#  8     2     8           3                27
#  9     2     9           2                19
# 10     2    10           1                10
1) rollapply
rollapply in zoo supports vector widths. partial = TRUE says that if the window extends past the end of the data, then just the values within the data are used. (Another possibility would be fill = NA, in which case it would fill with NAs if there were not enough data left.) align = "left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)

df %>%
  group_by(a) %>%
  mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
  ungroup
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)

sqldf("select A.*, sum(B.b) as sum
       from df A
       left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
                     and A.a = B.a
       group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)

df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = hop_vec(b, row_number(), step_window + row_number() - 1, sum)) %>%
  ungroup()
It is flexible with respect to different window sizes.
Output:
# A tibble: 10 x 4
       a     b step_window sum_b_step_window
   <int> <int>       <dbl>             <int>
 1     1     1           2                 3
 2     1     2           3                 9
 3     1     3           1                 3
 4     1     4           2                 9
 5     1     5           4                 5
 6     2     6           1                 6
 7     2     7           2                15
 8     2     8           3                27
 9     2     9           2                19
10     2    10           1                10
slider is a tidyverse-adjacent package, only a couple of months old at the time of writing, dedicated to sliding-window functions. Have a look here for more info: page, vignette.
hop is the engine of slider. With this solution we are passing different .starts and .stops values to sum the values of b within the a groups.
With _vec you're asking hop to return a vector: a double in this case.
row_number() is a dplyr function that returns the row number within each group, thus allowing you to slide along the rows. A small standalone sketch of hop_vec follows below.
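To see the .starts/.stops mechanism in isolation (a sketch, assuming slider is loaded; the vector and indices are made up for illustration):
# Sum over the index windows 1:2, 2:4 and 3:3 of a plain vector.
hop_vec(c(10, 20, 30, 40), .starts = c(1, 2, 3), .stops = c(2, 4, 3), sum)
# [1] 30 90 30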
A data.table solution using cumulative sums:
library(data.table)

setDT(df)
df[, sum_b_step_window := {
  cs <- c(0, cumsum(b))                                     # cumulative sums of b with a leading 0
  cs[pmin(.N + 1, 1:.N + step_window)] - cs[pmax(1, 1:.N)]  # each window sum is a difference of two cumulative sums, capped at the group end
}, by = a]

weighted mean between two specific rows

I have a dataframe of this form
id value
1 10
2 25
5 30
7 15
9 30
10 50
I would like to transform it in the following way
id   value
1       10
2       25
5       30
9       30
7+10    43
where the observation with id "7+10" is the weighted mean of the previous observations for ids 7 and 10, with weights 0.2 and 0.8; in other words, 43 = 0.2*15 + 0.8*50. I tried to use the function aggregate to do this, but it does not work. What can I use to perform operations between specific rows?
Thank you for your help.
Since it is a lot easier to work with columns than with rows, you can transform your data from the long to the wide format with the package tidyr (part of the tidyverse), make your transformations, and then go back to the long format again with tidyr:
library(tidyverse)

dat <- tibble(
  id = c(1, 2, 5, 7, 9, 10),
  value = c(10, 25, 30, 15, 30, 50)
)

dat %>%
  spread(id, value) %>%
  mutate(`7 + 10` = 0.2 * `7` + 0.8 * `10`) %>%
  select(-`7`, -`10`) %>%
  gather("id", "value", everything())
  id     value
  <chr>  <dbl>
1 1         10
2 2         25
3 5         30
4 9         30
5 7 + 10    43
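A base R alternative (a sketch using the same dat, not from the original answer) computes the weighted mean directly with weighted.mean() and binds it to the untouched rows; note that the weights must be supplied in the same order as the ids they refer to:
dat2 <- dat
dat2$id <- as.character(dat2$id)   # make id character so the combined label binds cleanly
combined <- data.frame(
  id = "7 + 10",
  value = weighted.mean(dat2$value[dat2$id %in% c("7", "10")], w = c(0.2, 0.8)),
  stringsAsFactors = FALSE
)
rbind(dat2[!dat2$id %in% c("7", "10"), ], combined)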

Splitting a range provided in a row into several smaller ranges in multiple rows

I am somewhat new to programming and I have been struggling to get my desired output, as explained below.
Suppose I have a table like the one below:
My input (a table shown as an image in the original question)
which includes the range of coordinates (Start_MP & End_MP) of a specific segment (defined by Segment_ID) and the length of the segment (the difference between the start and the end of the range).
What I need to do is split all those ranges that have a length of more than 2 into ranges of two or less. To make it more clear, I need my output to be like the table below:
My desired output (shown as an image in the original question)
I would appreciate it if you could let me know how I can handle that with R / R packages.
Function tidyr::expand is the right option to expand rows into the OP's desired output.
The approach is to first use expand to generate the desired number of rows and then use left_join to join those rows back to the original data.frame (a standalone look at the expand step is sketched after the result below).
# Data
df <- data.frame(Segment_ID = c(1101, 1102, 1103), Start_MP = c(1, 5, 20),
                 End_MP = c(2, 10, 30), Segment_Length = c(1, 5, 10))
library(tidyverse)

df %>%
  group_by(Segment_ID) %>%
  expand(Segment_ID, Segment_Sequence_Number =
           seq(from = Start_MP, to = End_MP, by = 2)) %>%
  left_join(df, by = "Segment_ID") %>%
  mutate(Start_MP = Segment_Sequence_Number) %>%
  group_by(Segment_ID) %>%
  mutate(End_MP_Calc = lead(Start_MP)) %>%
  mutate(End_MP = coalesce(End_MP_Calc, End_MP)) %>%
  filter(Start_MP != End_MP) %>%
  mutate(Segment_Length = End_MP - Start_MP) %>%
  group_by(Segment_ID) %>%
  mutate(Segment_Sequence_Number = row_number()) %>%
  select(-End_MP_Calc) %>%
  as.data.frame()
# Result
#   Segment_ID Segment_Sequence_Number Start_MP End_MP Segment_Length
# 1       1101                       1        1      2              1
# 2       1102                       1        5      7              2
# 3       1102                       2        7      9              2
# 4       1102                       3        9     10              1
# 5       1103                       1       20     22              2
# 6       1103                       2       22     24              2
# 7       1103                       3       24     26              2
# 8       1103                       4       26     28              2
# 9       1103                       5       28     30              2
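To see what the expand step alone produces before the join (a sketch re-running just the first part of the pipeline above on the same df):
df %>%
  group_by(Segment_ID) %>%
  expand(Segment_ID, Segment_Sequence_Number =
           seq(from = Start_MP, to = End_MP, by = 2))
# One row per 2-unit breakpoint, e.g. 5, 7, 9 for segment 1102; the left_join and
# lead()/coalesce() steps then turn these breakpoints into Start_MP/End_MP pairs.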

Making a new column by subtracting values based on a key in R?

I have a data table like this
ID    DAYS FREQUENCY
"ads"   20         3
"jwa"   45         2
"mno"    4         1
"ads"   13         3
"jwa"   60         2
"ads"   18         3
I want to add a column that, within each ID, subtracts the closest smaller DAYS value from the current DAYS value.
My new table would look like this:
ID    DAYS FREQUENCY DAYS_DIFF
"ads"   20         3         2 (because 20 - 18)
"jwa"   45         2        NA (because there is no smaller DAYS value for that id)
"mno"    4         1        NA
"ads"   13         3        NA
"jwa"   60         2        15
"ads"   18         3         5
Bonus: Is there a way to use the merge function?
Here's an answer using dplyr:
require(dplyr)

mydata %>%
  mutate(row.order = row_number()) %>% # row numbers added to preserve original row order
  group_by(ID) %>%
  arrange(DAYS) %>%
  mutate(lag = lag(DAYS)) %>%
  mutate(days.diff = DAYS - lag) %>%
  ungroup() %>%
  arrange(row.order) %>%
  select(ID, DAYS, FREQUENCY, days.diff)
Output:
      ID  DAYS FREQUENCY days.diff
  <fctr> <int>     <int>     <int>
1    ads    20         3         2
2    jwa    45         2        NA
3    mno     4         1        NA
4    ads    13         3        NA
5    jwa    60         2        15
6    ads    18         3         5
You can do this using dplyr and a quick loop:
library(dplyr)
# Rowwise data.frame creation because I'm too lazy not to copy-paste the example data
df <- tibble::frame_data(
  ~ID,   ~DAYS, ~FREQUENCY,
  "ads",    20,          3,
  "jwa",    45,          2,
  "mno",     4,          1,
  "ads",    13,          3,
  "jwa",    60,          2,
  "ads",    18,          3
)
# Subtract from each number in a numeric vector the number that follows it
rolling_subtraction <- function(x) {
  out <- vector('numeric', length(x))
  for (i in seq_along(out)) {
    out[[i]] <- x[i] - x[i + 1] # x[i + 1] is NA if the index is out of bounds
  }
  out
}
# Arrange data.frame in order of ID / DAYS and apply the rolling subtraction
df %>%
  arrange(ID, desc(DAYS)) %>%
  group_by(ID) %>%
  mutate(days_diff = rolling_subtraction(DAYS))
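As a quick check of the helper on a small vector (hypothetical usage, not from the original answer):
rolling_subtraction(c(20, 18, 13))
# [1]  2  5 NA   # the last element has nothing after it, hence NA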
