R: Average rows together within a group to match a second dataframe

This question is an extension of one of my previous questions. I have two long-form dataframes, small and big, that have the same groups (id) and a combination of numeric and character variables. In big, the number of rows per group is greater compared to the number of rows per group in small. My goal is to average rows together in big so that the number of rows per group matches the number of rows per group in small as closely as possible.
I have created a reprex below, which gets close but not as close as I think is possible. I believe the issue is that each group in big may need its own sum_ref value (i.e., how many rows of big should be averaged together to form one row), but I am unsure how to implement that. Any advice is appreciated.
set.seed(123)
library(tidyverse)
id <- c(rep("101", 10), rep("102", 21), rep("103", 15))
color <- c(rep("red", 10), rep("blue", 21), rep("green", 15))
time <- c(1:10, 1:21, 1:15)
V1 <- sample(1:3, 10+21+15, replace=TRUE)
V2 <- sample(1:3, 10+21+15, replace=TRUE)
V3 <- sample(1:3, 10+21+15, replace=TRUE)
small <- data.frame(id, color, time, V1, V2, V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())
id <- c(rep("101", 32), rep("102", 45), rep("103", 27))
color <- c(rep("red", 32), rep("blue", 45), rep("green", 27))
time <- c(1:32, 1:45, 1:27)
V1 <- sample(1:3, 32+45+27, replace=TRUE)
V2 <- sample(1:3, 32+45+27, replace=TRUE)
V3 <- sample(1:3, 32+45+27, replace=TRUE)
big <- data.frame(id, color, time, V1, V2, V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())
rm(V1,V2,V3,color,id,time)
small_size <- nrow(small)
big_size <- nrow(big)
sum_ref <- big_size/small_size
# `new` should have the same number of rows as `small`
# also for each ID, the number of rows in `small` should equal the number of rows in `new`
new <- big %>%
  group_by(id, color, new_time = as.integer(gl(n(), sum_ref, n()))) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 53
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 16

You are correct: "each group may need its own sum_ref value". My solution to that is to create a variable that stores the size of each group in small:
small_size <- small %>% group_by(id, color) %>% summarise(size = n())
Then, for each group in big, we create the column that says which rows should be averaged together. In your code you did that using as.integer(gl(n(), sum_ref, n())), but because sum_ref is a decimal number, that doesn't ensure the column runs from 1 to the size of the corresponding small group, so I made a new version:
seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()
This makes a sequence that goes from 1 to the small group size stored in small_size, using cur_group_id() to access the correct entry of that table. The sequence has length n() (the big group size) and contains only integers because of %>% trunc() (which does the same job as your as.integer()). There might be a better way to do this, as with my method the last value only appears once (see the ntile() sketch after the output below). But regardless of how you build this vector, the essence of the answer is how to apply a different transformation to each group via small_size$size[cur_group_id()].
new <- big %>%
  group_by(id, color) %>%
  mutate(new_time = seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()) %>%
  group_by(new_time, .add = TRUE) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 46
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 10
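For what it's worth, a hedged sketch of one possible "better way" for the binning step (my addition, not part of the answer above): dplyr::ntile() splits each big group into contiguous chunks of as even a size as possible, so the last chunk is not left with a single row:
new_ntile <- big %>%
  group_by(id, color) %>%
  # cut the row positions of each big group into small_size$size[cur_group_id()] near-equal bins
  mutate(new_time = ntile(row_number(), small_size$size[cur_group_id()])) %>%
  group_by(new_time, .add = TRUE) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
# new_ntile has the same per-group row counts as `small`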

Related

I'm trying to find a way to check if any of the values match the sequentially remaining values in a group

library(tidyverse)
#Create a data frame (N1 and N2 have 1000 values, so DATE and ID are recycled and the result has 1000 rows and 4 columns)
set.seed(42)
df <- data.frame(DATE = as.Date("2021-01-01") + 0:(50 - 1),
                 ID = 1:50,
                 N1 = round(rnorm(1000, mean = 1150, sd = 4)),
                 N2 = round(rnorm(1000, mean = 1150, sd = 4)))
#Arrange DF and create a Match variable
df <- df %>%
  arrange(DATE, ID) %>%
  group_by(DATE, ID) %>%
  mutate(Match_N = case_when(N1 == N2 ~ N1,
                             TRUE ~ 0)) %>%
  ungroup()
In R I have this data structure. When a Match_N occurs, I want to check whether the same number appears in N1 or N2 in the rows below the match, within the same group.
In my df, the number 1151 matches and appears on row 15 within the same group (DATE, ID).
The matched number 1146 does not appear within the same group (DATE, ID).
So I found a solution for my problem.
I'm sure there is a better way.
#Create Match_true variable
df <- df %>%
  mutate(Match_true = case_when(Match_N == N1 ~ 1,
                                TRUE ~ 0))
#Create Match_group with cumsum
df <- df %>%
  group_by(DATE, ID) %>%
  mutate(Match_group = ID + cumsum(Match_true)) %>%
  ungroup()
#Find the match for N1 or N2 in the newly created group
df <- df %>%
  group_by(DATE, ID, Match_group) %>%
  # does Match_N reappear in the following N1 or N2 values of the group?
  mutate(Search = Match_N %in% c(lead(N1, order_by = Match_group),
                                 lead(N2, order_by = Match_group))) %>%
  ungroup()
The cumulative sum is not ideal for the question as asked, because if there are multiple matches within one ID it will only search until the next match appears. That is enough for my use case but might not be the best way for others (e.g. row 116).
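To check against all of the remaining rows of a group (rather than only up to the next match), here is a hedged sketch I am adding for illustration; it assumes the tidyverse (purrr) is loaded and is not part of the original self-answer:
df <- df %>%
  group_by(DATE, ID) %>%
  mutate(Search_all = map_lgl(row_number(), function(i) {
    # TRUE when this row's Match_N is non-zero and reappears in N1 or N2 further down the group
    Match_N[i] != 0 && Match_N[i] %in% c(N1[-seq_len(i)], N2[-seq_len(i)])
  })) %>%
  ungroup()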

How to count the number of times a value appears in a 160Million by 2 dataframe - memory issues

I have a data frame that has 160M rows and 2 columns (material name and price). I want to determine the frequency at which each price occurs.
For example, the price $10 was given 100 different times. I'd like to sort the values from the largest number of occurrences to the smallest (for example, $100 was given 1000 times).
There are 2,484,557 unique prices, so a "table" is not the most useful solution.
My issue is that I'm running into memory problems.
Any suggestions how I can accomplish this?
Here's a 2 GB data frame with 160M rows and about 3M unique prices:
set.seed(42)
n = 160E6
fake_data <- data.frame(material = sample(LETTERS, n, replace = TRUE),
                        price = sample(1:3E6, n, replace = TRUE))
I like dplyr syntax, but for large data with many groups, data.table and collapse offer much better performance.
We could use dtplyr to translate dplyr code to data.table. This takes 22 seconds on my machine, with the result showing how many times each price appears in the data.
library(dplyr)
library(dtplyr)
fake_data %>%
  lazy_dt() %>%
  count(price, sort = TRUE)
Result
Source: local data table [3,000,000 x 2]
Call: `_DT2`[, .(n = .N), keyby = .(price)][order(desc(n))]
price n
<int> <int>
1 2586972 97
2 2843789 95
3 753207 92
4 809482 92
5 1735845 92
6 809659 90
# … with 2,999,994 more rows
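For reference, a hedged sketch of the same count written directly in data.table (my addition; the translated call printed above is essentially this):
library(data.table)
fake_dt <- as.data.table(fake_data)                           # one-time conversion
price_counts <- fake_dt[, .(n = .N), by = price][order(-n)]   # count per price, sort descending
head(price_counts)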
If you need higher performance and don't mind a heuristic, you could also sample your data to make it 10% or 1% as big; if any placeholder values occur frequently in the whole data, they are also likely to be frequent in a random sample.
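As a hedged sketch of that sampling heuristic (illustrative only), counting prices on a 1% sample:
fake_data %>%
  slice_sample(prop = 0.01) %>%   # keep roughly 1% of the rows
  count(price, sort = TRUE)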
I'd probably create price intervals, e.g. $0-50, $51-100, $101-150, etc.
EDIT: a more comprehensive solution
library(tidyverse)
df <- letters %>%
  expand_grid(., .) %>%
  rename(v1 = `....1`,
         v2 = `....2`) %>%
  mutate(name = paste0(v1, v2)) %>%
  select(name) %>%
  bind_rows(., ., ., .)
df
n <- nrow(df)
df <- df %>%
  mutate(price = rnorm(n = n, mean = 1000, sd = 200))
df %>%
  ggplot(aes(x = price)) +
  geom_histogram()
df <- df %>%
  mutate(price_grp = case_when(price <= 500 ~ "$0-500",
                               price > 500 & price <= 1000 ~ "$501-1000",
                               price > 1000 & price <= 1500 ~ "$1001-1500",
                               price > 1500 ~ "+ $1500"))
df %>%
  group_by(price_grp) %>%
  summarize(occurences = n()) %>%
  arrange(desc(occurences))
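A hedged alternative sketch for the binning step (my addition, not from the answer above): base cut() builds the same intervals in one call instead of a case_when() ladder:
df %>%
  mutate(price_grp = cut(price,
                         breaks = c(-Inf, 500, 1000, 1500, Inf),
                         labels = c("$0-500", "$501-1000", "$1001-1500", "+ $1500"))) %>%
  count(price_grp, sort = TRUE)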

sample multiple different sample sizes using crossing and sample_n to create single df

I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size= argument at a time; however, I would like to draw samples of every size from 2 up to the number of rows in the df. Unfortunately, the code I have compiled below does not do the job. The needed output would be a dataframe with an id column identifying each sample, or a list split by the id column from crossing().
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))
data_sampled_by_stratum <- df %>%
  group_by(Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  group_by(id) %>%
  sample_n(size = c(2:15)) %>%
  group_by(CLUSTER_ID, Date) %>% filter(n() > 2)
If you had a column with different sites you could do this.
data_sampled_by_stratum <- data_grouped_by_stratum %>%
  group_by(siteid, Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  sample_n(rbinom(1, sum(siteid == i), (1 - s)^2))
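Since the snippet above leaves siteid, i and s undefined, here is a hedged sketch of one way to get what the question asks for (one sample of every size from 2 to nrow(df), stacked into a single data frame with an id column); this is my illustration, not the original poster's method:
library(dplyr)
library(purrr)
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))
data_sampled_by_size <- map_dfr(
  set_names(2:nrow(df)),        # the names "2".."15" become the id column
  ~ slice_sample(df, n = .x),   # one random sample of each size
  .id = "id"
)
# split(data_sampled_by_size, data_sampled_by_size$id) would give the list form instead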

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like those caret::groupKFold() produces). However, the catch is that I want to restrict the test points to, say, times greater than 15 days (days 16-20), while using the remainder of the data as the training set.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)
The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.
An example of what each CV fold would comprise:
TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)
TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)
TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)
TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)
TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)
Albeit in the format that the caret::groupKFold function outputs, so that the folds could be fed into the caret::train function:
CVFolds <- caret::groupKFold(df$Time)
CVFolds
Thanks in advance!
For customized folds I find that built-in functions are usually not flexible enough, so I usually produce them using the tidyverse. One approach to your problem would be:
library(tidyverse)
df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>%         #filter Time as per your need
  split(.$Time) %>%             #split df to a list by Time
  map(~ .x %>% select(id))      #select row numbers for each list element
An example with two rows for each time:
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id)) -> test_folds
test_folds
#output
$`16`
id
1 31
2 32
$`17`
id
3 33
4 34
$`18`
id
5 35
6 36
$`19`
id
7 37
8 38
$`20`
id
9 39
10 40
With an unequal number of rows per time:
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id))
$`2`
id
1 6
2 7
3 8
$`3`
id
4 9
5 10
Now you can define these hold out folds inside trainControl with the argument indexOut.
EDIT: to get output similar to caret::groupKFold, one can:
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
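To close the loop, a hedged sketch (my addition, not part of the original answer) of how such folds could be handed to caret::train via trainControl(), using the original 20-row df and Time > 15; index holds the training rows of each fold and indexOut the held-out rows:
library(caret)
library(tidyverse)
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)
fold_ids <- df %>%                     # held-out row numbers, one list element per Time > 15
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time) %>%
  map(~ unname(unlist(select(.x, id))))
ctrl <- trainControl(method = "cv",
                     index = map(fold_ids, ~ setdiff(seq_len(nrow(df)), .x)),
                     indexOut = fold_ids)
fit <- train(Effect ~ Time, data = df, method = "lm", trControl = ctrl)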

Hot deck imputation in dplyr

I'm trying to do a hot deck imputation in R with the dplyr package. I have non-finite values that I would like to replace with a random value drawn from within the same group.
myData <- data.frame(value = sample(c(Inf, NaN, 1:8), 100, replace = TRUE),
                     group = sample(letters[1:4], 100, replace = TRUE))
value group
1 4 c
2 6 d
3 Inf c
4 8 c
5 7 a
6 2 b
This code runs but also samples the Inf and NaN values.
myData <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = sample(value, n(), replace = TRUE))
table(is.finite(myData$imputedvalue), is.infinite(myData$imputedvalue))
FALSE TRUE
FALSE 16 7
TRUE 77 0
This code doesn't run.
myData <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = ifelse(is.finite(value), value,
                               sample(value, n(), replace = TRUE)))
Error in n() : This function should not be called directly
I feel like there should be a filter() command of some sort, but I don't really see how this should work...
Here is an approach that involves splitting the dataset up first:
# filter non-infinite records
myDataOK <- myData %>%
  filter(value %>% is.finite)
# how many replacements are needed?
# sample these, a la #eddi
myDataimputed <- myData %>%
  group_by(group) %>%
  summarise(n_inf = sum(!is.finite(value))) %>%
  group_by(group) %>%
  do(sample_n(filter(myDataOK, group == .$group), size = .$n_inf, replace = TRUE))
## and combine!
myData2 <- rbind(myDataOK, myDataimputed)
## here are some various checks:
## same size as original?
nrow(myData2) == nrow(myData)
## all infinites replaced?
with(myData2, table(is.finite(value), is.infinite(value)))
## should be no *decreases* after shuffling.
## value x group combinations might increase but should never decrease
check1 <- myDataOK %>%
  group_by(group, value) %>%
  tally %>%
  arrange(group, value)
check2 <- myData2 %>%
  group_by(group, value) %>%
  tally %>%
  arrange(group, value)
if(any((check2$n - check1$n) < 0)) stop("something went wrong!")
## finally, the increases in group size should equal the number of missing values
Ninf <- myData %>%
  group_by(group) %>%
  summarise(n_inf = sum(!is.finite(value)))
if(any(tally(check2)$n - tally(check1)$n - Ninf$n_inf != 0))
  stop("group sizes changed!")
