Inexact join based on a greater-or-equal condition in R

I have some values in df:
# A tibble: 7 × 1
var1
<dbl>
1 0
2 10
3 20
4 210
5 230
6 266
7 267
that I would like to compare to a second dataframe called
value_lookup
# A tibble: 4 × 2
var1 value
<dbl> <dbl>
1 0 0
2 200 10
3 230 20
4 260 30
In particular I would like to make a join based on >=, meaning that a value that is greater than or equal to the number in var1 gets the corresponding value. E.g. take the number 210 of the original dataframe. Since it is >= 200 and < 230 it would get a value of 10.
Here is the expected output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.
value_lookup <- tibble(var1 = c(0, 200,230,260),
value = c(0,10,20,30))
df <- tibble(var1 = c(0,10,20,210,230,266,267))
library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
x = df,
y = value_lookup ,
by = "var1",
match_fun = list(`>=`)
)
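For reference, the ">=" lookup described above can also be sketched in base R without any packages (names follow the question's data): for each value, take the lookup row with the largest var1 that is still less than or equal to it.

```r
# Base-R sketch of the ">=" lookup: for each value in df$var1, pick the
# lookup row with the largest var1 that is still <= that value.
value_lookup <- data.frame(var1 = c(0, 200, 230, 260),
                           value = c(0, 10, 20, 30))
df <- data.frame(var1 = c(0, 10, 20, 210, 230, 266, 267))

df$value <- sapply(df$var1, function(v) {
  value_lookup$value[max(which(value_lookup$var1 <= v))]
})
df$value
#> [1]  0  0  0 10 20 30 30
```

This is a linear scan per row, so for large data the findInterval and rolling-join answers below it are the faster choices.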

Another option is findInterval:
df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = T, which would look for the same or closest value preceding var1 in your df:
library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = T]

You can use cut:
df$value <- value_lookup$value[cut(df$var1,
c(value_lookup$var1, Inf),
right=F)]
# # A tibble: 7 x 2
# var1 value
# <dbl> <dbl>
# 1 0 0
# 2 10 0
# 3 20 0
# 4 210 10
# 5 230 20
# 6 266 30
# 7 267 30

Related

Removing all the observations except for observations from day 10 or day 20

I want to remove all the observations except for observations from day 10 or day 20 from data(ChickWeight). But I want to use logical operators in R: and ("&") and or ("|"). Below is my code, but I get an error:
ChickWeight %>% slice(10|20)
We could concatenate (c) the indexes as a vector and use - to remove those observations in slice, since slice requires a numeric index:
library(dplyr)
ChickWeight %>%
slice(-c(10, 20))
With filter, it expects a logical vector
ChickWeight %>%
filter(!row_number() %in% c(10, 20))
If this is based on the 'Time' column, use either of the options below:
ChickWeight %>%
slice(-which(Time %in% c(10, 20)))
ChickWeight %>%
filter(! Time %in% c(10, 20))
Here is another option using filter:
ChickWeight %>%
filter(row_number() != 10 &
row_number() != 20)
# A tibble: 576 × 4
weight Time Chick Diet
<dbl> <dbl> <ord> <fct>
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 199 20 1 1
You can use subset,
ChickWeight |> subset(Time == 10 | Time == 20)
or with (same result)
ChickWeight[with(ChickWeight, Time == 10 | Time == 20), ]
# weight Time Chick Diet
# 6 93 10 1 1
# 11 199 20 1 1
# 18 103 10 2 1
# 23 209 20 2 1
# 30 99 10 3 1
# 35 198 20 3 1
# ...
or likewise a sequence if you aim for row numbers.
ChickWeight |> subset({m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20})
ChickWeight[{m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20}, ]
# weight Time Chick Diet
# 10 171 18 1 1
# 20 138 14 2 1

Use dplyr (I think) to manipulate a dataset

I am given a data set called ChickWeight. This has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0.
I first cleaned the data set and kept only the chicks that were recorded for all 12 weigh-ins:
library(datasets)
library(dplyr)
Frequency <- plyr::count(ChickWeight$Chick) # plyr's count works on a vector and returns columns x and freq
colnames(Frequency)[colnames(Frequency)=="x"] <- "Chick"
a <- inner_join(ChickWeight, Frequency, by='Chick')
complete <- a[(a$freq == 12),]
head(complete,3)
This data set is in the library(datasets) of r, called ChickWeight.
You can try:
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(wdiff = weight - first(weight))
# A tibble: 540 x 5
# Groups: Chick [45]
weight Time Chick Diet wdiff
<dbl> <dbl> <ord> <fct> <dbl>
1 42 0 1 1 0
2 51 2 1 1 9
3 59 4 1 1 17
4 64 6 1 1 22
5 76 8 1 1 34
6 93 10 1 1 51
7 106 12 1 1 64
8 125 14 1 1 83
9 149 16 1 1 107
10 171 18 1 1 129
# ... with 530 more rows

Adding a column to a data set under a certain condition of weights

I am given a data set called ChickWeight. This has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0. The data set is in library(datasets), so you should have it.
library(dplyr)
weightgain <- ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(weightgain = weight - first(weight))
I have this code, but this code just subtracts each weight by 42 which is the weight at time 0 for chick 1. I need each chick to be subtracted by its own weight at time 0 so that the weightgain column is correct.
We could do
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
mutate(weightgain = weight - weight[Time == 0])
#Or mutate(weightgain = weight - first(weight))
# A tibble: 578 x 5
# Groups: Chick [50]
# weight Time Chick Diet weightgain
# <dbl> <dbl> <ord> <fct> <dbl>
# 1 42 0 1 1 0
# 2 51 2 1 1 9
# 3 59 4 1 1 17
# 4 64 6 1 1 22
# 5 76 8 1 1 34
# 6 93 10 1 1 51
# 7 106 12 1 1 64
# 8 125 14 1 1 83
# 9 149 16 1 1 107
#10 171 18 1 1 129
# … with 568 more rows
Or using base R ave
with(ChickWeight, ave(weight, Chick, FUN = function(x) x - x[1]))
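The ave call above returns a plain vector aligned with the original rows; a minimal sketch of attaching it as a column (ChickWeight ships with base R's datasets package):

```r
library(datasets)  # provides ChickWeight

# ave applies the function within each Chick and returns a vector in the
# original row order, so it can be assigned directly as a new column.
ChickWeight$weightgain <- with(ChickWeight,
  ave(weight, Chick, FUN = function(x) x - x[1]))

head(ChickWeight$weightgain)
#> [1]  0  9 17 22 34 51
```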

Better way of binning data in a group in a data frame by equal intervals

I have a dataframe that is characterized by many different IDs. For every ID there are multiple events, characterized by the cumulative time between events (hours) and the duration of each event (seconds). So, it would look something like:
Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)
> test
Id cumulative_time duration
1 1 0.00 188.0
2 1 3.58 124.0
3 1 8.88 706.0
4 1 11.19 53.0
5 1 21.86 669.0
6 1 29.54 1506.2
7 2 0.00 335.0
8 2 5.00 349.0
9 2 14.00 395.0
10 2 19.00 385.0
11 2 23.00 175.0
I would like to group by the ID and then restructure each group by sampling in cumulative windows of, say, 10 hours, and within each window sum the durations that occurred in that 10-hour interval. The bins should span, say, 0 to 30 hours; thus there would be 3 bins.
I looked at the cut function and managed to make a hack of it within a dataframe; even as a new R user I know it isn't pretty:
test_cut = test %>%
mutate(bin_durations = cut(test$cumulative_time,breaks = c(0,10,20,30),labels = c("10","20","30"),include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
mutate(total_duration = sum(duration)) %>%
select(Id,bin_durations,total_duration) %>%
distinct()
which gives the output:
test_cut
Id bin_durations total_duration
1 1 10 1018.0
2 1 20 53.0
3 1 30 2175.2
4 2 10 684.0
5 2 20 780.0
6 2 30 175.0
Ultimately I want the interval window and number of bins to be arbitrary: if I have a span of 5000 hours and want to bin in 1-hour samples, I would use breaks = seq(0, 5000, 1) for the bins and labels = as.character(seq(1, 5000, 1)).
This will also be applied to a very large data frame, so computational speed is somewhat desired.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
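The cut interaction speculated about above can indeed be sketched in base R (using tapply rather than split, with the same breaks as the question's code; this is an editorial sketch, not one of the posted answers):

```r
# Base-R sketch: bin each row with cut, then sum duration within Id x bin.
Id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
cumulative_time <- c(0, 3.58, 8.88, 11.19, 21.86, 29.54, 0, 5, 14, 19, 23)
duration <- c(188, 124, 706, 53, 669, 1506.2, 335, 349, 395, 385, 175)
test <- data.frame(Id, cumulative_time, duration)

bins <- cut(test$cumulative_time, breaks = c(0, 10, 20, 30),
            labels = c("10", "20", "30"), include.lowest = TRUE)

# One cell per Id x bin combination, summing the durations in that cell
tapply(test$duration, list(test$Id, bins), sum)
```

The result is a matrix with one row per Id and one column per bin, matching the totals shown in the question's test_cut output.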
Update
After testing, I find that even my current implementation isn't quite what I'd like. If I say:
n=3
test_cut = test %>%
mutate(bin_durations = cut(test$cumulative_time,breaks=seq(0,30,n),labels = as.character(seq(n,30,n)),include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
mutate(total_duration = sum(duration)) %>%
select(Id,bin_durations,total_duration) %>%
distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 24 669
6 1 30 1506.
7 2 3 335
8 2 6 349
9 2 15 395
10 2 21 385
11 2 24 175
Where there are no occurrences in a bin I should get 0 in the duration column, rather than an omission.
Thus, it should look like:
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 15 0
6 1 18 0
7 1 21 0
8 1 24 669
9 1 27 0
10 1 30 1506.
11 2 3 335
12 2 6 349
13 2 9 0
14 2 12 0
15 2 15 395
16 2 18 0
17 2 21 385
18 2 24 175
19 2 27 0
20 2 30 0
Here is one idea via integer division (%/%)
library(tidyverse)
test %>%
group_by(Id, grp = cumulative_time %/% 10) %>%
summarise(total_duration = sum(duration))
which gives,
# A tibble: 6 x 3
# Groups: Id [?]
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 1018
2 1 1 53
3 1 2 2175.
4 2 0 684
5 2 1 780
6 2 2 175
To address your updated issue, we can use complete in order to add the missing rows. So, for the same example, binning in hours of 3,
test %>%
group_by(Id, grp = cumulative_time %/% 3) %>%
summarise(total_duration = sum(duration)) %>%
ungroup() %>%
complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives,
# A tibble: 20 x 3
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 188
2 1 1 124
3 1 2 706
4 1 3 53
5 1 4 0
6 1 5 0
7 1 6 0
8 1 7 669
9 1 8 0
10 1 9 1506.
11 2 0 335
12 2 1 349
13 2 2 0
14 2 3 0
15 2 4 395
16 2 5 0
17 2 6 385
18 2 7 175
19 2 8 0
20 2 9 0
We could make these changes:
- test$cumulative_time can simply be cumulative_time
- breaks can be factored out and then used in the cut as shown
- the second mutate can be changed to summarize, in which case the select and distinct are not needed
- it is always a good idea to close any group_by with a matching ungroup, or in the case of summarize we can use .groups = "drop"
- add complete to insert 0 for levels not present
Implementing these changes we have:
library(dplyr)
library(tidyr)
breaks <- seq(0, 40, 10)
test %>%
mutate(bin_durations = cut(cumulative_time, breaks = breaks,
labels = breaks[-1], include.lowest = TRUE)) %>%
group_by(Id,bin_durations) %>%
summarize(total_duration = sum(duration), .groups = "drop") %>%
complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 10 1018
2 1 20 53
3 1 30 2175.
4 1 40 0
5 2 10 684
6 2 20 780
7 2 30 175
8 2 40 0
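For the arbitrary bin width the question asks about (e.g. 1-hour bins over 5000 hours), the integer-division idea generalizes with a width parameter; here is a base-R sketch using aggregate. Note the boundary convention is an assumption: a time exactly on a bin edge falls into the next bin here, which differs slightly from cut with include.lowest.

```r
Id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
cumulative_time <- c(0, 3.58, 8.88, 11.19, 21.86, 29.54, 0, 5, 14, 19, 23)
duration <- c(188, 124, 706, 53, 669, 1506.2, 335, 349, 395, 385, 175)
test <- data.frame(Id, cumulative_time, duration)

width <- 10  # bin width in hours; set to 1 for 1-hour bins, etc.

# Label each row with the upper edge of its bin, then sum per Id and bin.
test$bin <- (test$cumulative_time %/% width + 1) * width
aggregate(duration ~ Id + bin, data = test, FUN = sum)
```

With width = 10 this reproduces the 10/20/30 totals from the cut-based output; only non-empty bins appear, so complete (or merging against a full grid) is still needed for the zero rows.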

R - Loop to subtract each successive value from first in a series

I am tinkering around with this loop (im new to writing loops but trying to learn).
I am aiming, when x == 1, on the first match to store the value of z, and then subtract each successive z value from that first value. If x == 0 it should do nothing (not sure if I have to tell the code to do nothing when x == 0?).
This is my dummy data:
x <- c(0,0,1,1,1,0,1,1,1,1,0,0,0)
z <- c(10,34,56,43,56,98,78,98,23,21,45,65,78)
df <- data.frame(x,z)
for (i in 1:nrow(df)) {
  if (df$x[i] == 1) {
    first_price <- df$z[i]
    df$output <- first_price - df$z
  }
}
I have my if (df$x[i] == 1).
Then I want to save the first price, so first_price <- df$z[i].
The i in here, that means the first in the series, right?
Then for my output I wanted to subtract each successive price from the first price. If I fix the first price with [i], is this the correct way? And if I leave df$z, would that then take the next price each time in the loop and subtract it from first_price?
Heres a visual:
Progress:
> for (i in 1:nrow(df)) {
+ if (df$x[i] == 1) {
+ first_price <- df$z[1]
+ df$output <- first_price - df$z
+ }
+ }
> df$output
[1]   0 -24 -46 -33 -46 -88 -68 -88 -13 -11 -35 -55 -68
If I add [1], which assigns the first element of df$z, this actually fixes the first element and then subtracts each successive value. Now it needs to be rule-based, so that this only happens when df$x == 1.
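A minimal base-R fix of the loop, on the assumption that "first price" should reset at the start of each new run of x == 1 (the rleid-based answer that follows reaches the same result without a loop):

```r
x <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0)
z <- c(10, 34, 56, 43, 56, 98, 78, 98, 23, 21, 45, 65, 78)
df <- data.frame(x, z)

df$output <- 0          # x == 0 rows stay 0
first_price <- NA
for (i in seq_len(nrow(df))) {
  if (df$x[i] == 1) {
    if (is.na(first_price)) first_price <- df$z[i]  # first match of this run
    df$output[i] <- first_price - df$z[i]
  } else {
    first_price <- NA   # run of 1s ended; reset for the next run
  }
}
df$output
# result: 0 0 0 13 0 0 0 -20 55 57 0 0 0
```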
This should work for you
library(dplyr)
library(data.table)
ans <- df %>%
mutate(originalrow = row_number()) %>% # original row position
group_by(rleid(x)) %>%
mutate(ans = first(z) - z) %>%
filter(x==1)
# # A tibble: 7 x 5
# # Groups: rleid(x) [2]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 1 56 3 2 0
# 2 1 43 4 2 13
# 3 1 56 5 2 0
# 4 1 78 7 4 0
# 5 1 98 8 4 -20
# 6 1 23 9 4 55
# 7 1 21 10 4 57
vans <- ans$ans
# [1] 0 13 0 0 -20 55 57
EDIT
To keep all rows, outputting 0 where x == 0:
ans <- df %>%
mutate(originalrow = row_number()) %>%
group_by(rleid(x)) %>%
mutate(ans = ifelse(x==0, 0, first(z) - z))
# # A tibble: 13 x 5
# # Groups: rleid(x) [5]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 0 10 1 1 0
# 2 0 34 2 1 0
# 3 1 56 3 2 0
# 4 1 43 4 2 13
# 5 1 56 5 2 0
# 6 0 98 6 3 0
# 7 1 78 7 4 0
# 8 1 98 8 4 -20
# 9 1 23 9 4 55
# 10 1 21 10 4 57
# 11 0 45 11 5 0
# 12 0 65 12 5 0
# 13 0 78 13 5 0
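The rleid grouping used above can also be reproduced in base R for readers without data.table; a sketch using rle and ave (same data, same ans column as the EDIT output):

```r
x <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0)
z <- c(10, 34, 56, 43, 56, 98, 78, 98, 23, 21, 45, 65, 78)
df <- data.frame(x, z)

# Build a run-length group id, equivalent to data.table::rleid(x)
r <- rle(df$x)
grp <- rep(seq_along(r$lengths), r$lengths)

# Within each run, subtract each z from the run's first z; multiplying by
# x zeroes out the x == 0 runs.
df$ans <- ave(df$z, grp, FUN = function(v) v[1] - v) * df$x
df$ans
# 0 0 0 13 0 0 0 -20 55 57 0 0 0
```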
