How to partition into equal sum subsets in R?

I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))

Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
  mutate(
    prop = X1 / sum(X1),
    cprop = cumsum(prop),
    bins = cut(cprop, breaks = n_groups - 1)
  ) %>%
  group_by(bins) %>%
  summarize(
    group_n = n(),
    group_sum = sum(X1)
  )
# # A tibble: 9 × 3
#   bins           group_n group_sum
#   <fct>            <int>     <int>
# 1 (-0.001,0.111]     322     54959
# 2 (0.111,0.222]      141     54867
# 3 (0.222,0.333]      111     55186
# 4 (0.333,0.444]       92     55074
# 5 (0.444,0.556]       80     54976
# 6 (0.556,0.667]       71     54574
# 7 (0.667,0.778]       66     55531
# 8 (0.778,0.889]       60     54731
# 9 (0.889,1]           57     55397
This could of course be simplified--you don't need to keep around the extra columns, just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) will add the bins column to the original data (and no other columns), and the group_by() %>% summarize() is just to diagnose the result.
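For comparison, here is a minimal base R sketch of the same cumulative-proportion idea. The helper name equal_sum_groups is made up for illustration, and it assumes non-negative values with a positive total. It passes explicit breakpoints to cut() so that the result has exactly k groups.

```r
# Hypothetical helper (not from the answer above): assign each value to one
# of k groups whose sums are roughly equal, based on the sorted cumulative sum.
equal_sum_groups <- function(x, k) {
  ord <- order(x)                       # positions of x in ascending order
  csum <- cumsum(as.numeric(x[ord]))    # running total over the sorted values
  cprop <- csum / csum[length(csum)]    # cumulative proportion, ends at exactly 1
  grp_sorted <- cut(cprop,
                    breaks = seq(0, 1, length.out = k + 1),
                    labels = FALSE, include.lowest = TRUE)
  grp <- integer(length(x))
  grp[ord] <- grp_sorted                # map group ids back to the original order
  grp
}

set.seed(47)
x <- sample(0:1000, 1000, replace = TRUE)
g <- equal_sum_groups(x, 10)
group_sums <- tapply(x, g, sum)
group_sums
```

Each cut point can be off by at most one element, so the group sums should differ from each other by less than twice the largest value in x.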


Finding the exact match in the values in the categorical variables

I want to find exact matches across all three columns (rg1, rg2, rg3). Below is my data frame.
For instance, the first row has the combination (70, 71, 72). If this same combination appears in any of the remaining rows for the other user IDs, then keep only those users and delete the rest.
To describe it further: the first row has (70, 71, 72), and if, say, row 10 had the same values in columns B, C, D, then I just want to display row 1 and row 10 (using R).
I tried clustering with k-modes, but I'm not getting the expected results. The current code groups all the rgs, but it seems to consider only the single rg that appears most frequently in the data frame (above is my data frame) and ranks them accordingly.
Can someone please guide me on this? Is there a better way to do this?
kmodes <- klaR::kmodes(mapped_df, modes= 5, iter.max = 10, weighted = FALSE)
#Add these clusters to the main dataframe
final <- mapped_df %>%
  mutate(cluster = kmodes$cluster)
You can sort across the columns, then look for duplicates.
library(tibble)
set.seed(1234)
df <- tibble(Userids = 1:20,
             rg_1 = sample(1:20, 20, TRUE),
             rg_2 = sample(1:20, 20, TRUE),
             rg_3 = sample(1:20, 20, TRUE))
df[4, -1] <- rev(df[15, -1])
# sort across the columns
df_sorted <- t(apply(df[-1], 1, sort))
# return the duplicated rows
df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]
This will give you a data frame with all the duplicated values. Once you have the sorted data frame, it should be easy enough to find what you need.
  Userids  rg_1  rg_2  rg_3
    <int> <int> <int> <int>
1       4    16    17     6
2      15     6    17    16
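If the goal is specifically to keep the rows that share the first row's combination, whatever the column order, one possible sketch is to build an order-independent key per row and compare it with the key of row 1. The data below is invented for illustration and the column names rg1, rg2, rg3 are assumed from the question.

```r
# Illustrative data (not the asker's): rows 3 and 5 hold the same values
# as row 1, only in a different column order.
df <- data.frame(Userids = 1:5,
                 rg1 = c(70, 1, 72, 5, 71),
                 rg2 = c(71, 2, 70, 6, 70),
                 rg3 = c(72, 3, 71, 7, 72))

# Sort each row's values and paste them into a single key string.
key <- apply(df[c("rg1", "rg2", "rg3")], 1,
             function(r) paste(sort(r), collapse = "_"))

# Keep the rows whose key equals the key of the first row.
matches <- df[key == key[1], ]
matches
```

Dropping the sort() call would make the comparison order-sensitive instead.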
I still do not understand precisely what you are looking for. Besides, it is always recommended to include the data frame you are referring to.
I could suggest a solution that uses a threshold value. For each row, if any of the differences (rg1-rg2, rg1-rg3, and rg2-rg3) is higher than the threshold, the row will not be considered.
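library(dplyr)
threshold <- 5
# transmute() evaluates the rg columns inside mapped_df and keeps only the differences
index <- mapped_df %>%
  transmute(g1_g2 = abs(rg1 - rg2),
            g1_g3 = abs(rg1 - rg3),
            g2_g3 = abs(rg2 - rg3)) %>%
  apply(1, function(x, threshold) all(x <= threshold),
        threshold = threshold)
# index is a logical vector over rows, so subset rows rather than columns
mapped_df[index, ]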
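The same threshold filter can also be sketched in base R with pmax(), which takes the largest of the three pairwise differences row by row. The mapped_df below is invented for illustration, since the original data frame was not posted.

```r
# Illustrative stand-in for mapped_df (the real data was not shown).
mapped_df <- data.frame(Userids = 1:4,
                        rg1 = c(70, 10, 33, 5),
                        rg2 = c(71, 20, 30, 5),
                        rg3 = c(72, 12, 35, 11))
threshold <- 5

# Largest pairwise absolute difference within each row.
max_diff <- with(mapped_df,
                 pmax(abs(rg1 - rg2), abs(rg1 - rg3), abs(rg2 - rg3)))

# Keep the rows where every pairwise difference is within the threshold.
kept <- mapped_df[max_diff <= threshold, ]
kept
```

Here rows 1 and 3 are kept, because their largest differences (2 and 5) do not exceed the threshold, while rows 2 and 4 are dropped.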
Maybe you're (just) after some filtering?
library(tidyverse)
data <- tibble(Userids = 1:10,
               rg1 = c(70,1:8,70),
               rg2 = c(71,11:18,71),
               rg3 = c(72,21:28,72))
data |>
  filter(rg1 == 70,
         rg2 == 71,
         rg3 == 72)
data |>
  filter(rg1 == rg1[row_number()==1],
         rg2 == rg2[row_number()==1],
         rg3 == rg3[row_number()==1])
Output:
# A tibble: 2 × 4
Userids rg1 rg2 rg3
<int> <dbl> <dbl> <dbl>
1 1 70 71 72
2 10 70 71 72
Or combine them for ease:
data |>
  unite(rg, starts_with("rg")) |>
  filter(rg == rg[row_number()==1])
Output:
# A tibble: 2 × 2
Userids rg
<int> <chr>
1 1 70_71_72
2 10 70_71_72

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values; in this data, the count for age group 1-2 is 4. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group ages and aggregate the corresponding values?
I know this way:
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly appreciated.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second, a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
  mutate(grp = case_when(
    age %in% 1:2 ~ "1:2",
    age %in% 3:4 ~ "3:4",
    TRUE ~ NA_character_
  )) %>%
  group_by(grp) %>%
  tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = F)
df %>%
  count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
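As a small variation, cut() also accepts a labels argument, so the groups can carry readable names instead of interval notation. This is only a sketch using the same example ages; the label strings are my own choice.

```r
# Same example ages as above.
df <- data.frame(age = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4))

# Intervals are (0,2] and (2,4]; give them readable labels.
df$grp <- cut(df$age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# Count the rows in each labelled group.
counts <- table(df$grp)
counts
```

Ages outside the breaks would become NA, so widen the breaks if the real data has other ages.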

How to create conditionally new groups when summarizing group means in R

I have data for which I want to summarize group means. I then would like to re-group some of the smaller groups (matching a certain n < x condition) into a group called "others". I found a way to do this. But it feels like there are more efficient solutions out there. I wonder how a data.table approach would solve the problem.
Here is an example using tibble and dplyr.
# preps
library(tibble)
library(dplyr)
set.seed(7)
# generate 4 groups with more observations
tbl_1 <- tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
                score = sample(0:10, size = 600, replace = TRUE))
# generate 3 groups with fewer observations
tbl_2 <- tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
                score = sample(0:10, size = 150, replace = TRUE))
# put them into one data frame
tbl <- rbind(tbl_1, tbl_2)
# aggregate the mean scores and count the observations for each group
tbl_agg1 <- tbl %>%
  group_by(group) %>%
  summarize(MeanScore = mean(score),
            n = n())
So far so easy.
Next I want to only show groups with more than 100 observations. All other groups should be merged into one group called "others".
# First, calculate summary stats for groups with n < 100
tbl_agg2 <- tbl_agg1 %>%
  filter(n<100) %>%
  summarize(MeanScore = weighted.mean(MeanScore, n),
            sumN = sum(n))
Note: There was a mistake in the calculation above, which is now corrected (@Frank: thanks for spotting it!)
# Second, delete groups with n < 100 from the aggregate table and add a row containing the summary statistics calculated above instead
tbl_agg1 <- tbl_agg1 %>%
  filter(n>100) %>%
  add_row(group = "others", MeanScore = tbl_agg2[["MeanScore"]], n = tbl_agg2[["sumN"]])
tbl_agg1 basically shows what I want it to show, but I wonder if there is a smoother, more efficient way to do this. At the same time I wonder how a data.table approach would deal with the problem at hand.
I welcome any suggestions.
Your calculation for the "other" group is wrong, I guess... should be...
tbl_agg1 %>% {bind_rows(
  filter(., n >= 100),  # >= so that a group with exactly 100 rows is not dropped
  filter(., n < 100) %>%
    summarize(group = "other", MeanScore = weighted.mean(MeanScore, n), n = sum(n))
)}
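A tiny check of why the merged group needs weighted.mean() rather than a plain mean of the group means. The numbers are made up for illustration.

```r
# Two small made-up groups of different sizes.
scores_e <- c(2, 4, 6)   # mean 4, n = 3
scores_f <- c(10)        # mean 10, n = 1

group_means <- c(mean(scores_e), mean(scores_f))
group_sizes <- c(length(scores_e), length(scores_f))

pooled   <- mean(c(scores_e, scores_f))              # mean over all raw scores
weighted <- weighted.mean(group_means, group_sizes)  # matches the pooled mean
naive    <- mean(group_means)                        # ignores the group sizes

c(pooled = pooled, weighted = weighted, naive = naive)
```

The weighted mean reproduces the pooled mean of the raw scores, while the unweighted mean of means overstates the influence of the smaller group.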
However, you could keep things a lot simpler from the start by using a different grouping variable:
tbl %>%
  group_by(group) %>%
  group_by(g = replace(group, n() < 100, "other")) %>%
  summarise(n = n(), m = mean(score))
# A tibble: 5 x 3
g n m
<chr> <int> <dbl>
1 a 136 4.79
2 b 188 4.49
3 c 160 5.32
4 d 116 4.78
5 other 150 5.42
Or with data.table
library(data.table)
DT = data.table(tbl)
DT[, n := .N, by=group]
DT[, .(.N, m = mean(score)), keyby=.(g = replace(group, n < 100, "other"))]
g N m
1: a 136 4.786765
2: b 188 4.489362
3: c 160 5.325000
4: d 116 4.784483
5: other 150 5.420000
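For completeness, the same relabel-then-aggregate idea can be sketched in base R with ave() and tapply(). The data here is a smaller made-up example, not the tbl from the question, and the cutoff of 100 is taken from the question.

```r
set.seed(7)
# Made-up data: two large groups (a, b) and two small ones (e, f).
tbl <- data.frame(
  group = c(rep(c("a", "b"), times = c(120, 150)),
            rep(c("e", "f"), times = c(30, 40))),
  score = sample(0:10, 340, replace = TRUE),
  stringsAsFactors = FALSE
)

# Size of each row's group, repeated for every row.
n_per_group <- ave(seq_along(tbl$group), tbl$group, FUN = length)

# Relabel the rows of small groups, then aggregate on the new label.
tbl$g <- ifelse(n_per_group < 100, "other", tbl$group)
n <- tapply(tbl$score, tbl$g, length)
m <- tapply(tbl$score, tbl$g, mean)
res <- data.frame(g = names(n), n = as.vector(n), m = as.vector(m))
res
```

Because the small groups are relabelled before the mean is taken, the "other" row is automatically the pooled mean and no separate weighted mean is needed.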

Aggregating data from value and count attributes

In R, I have a large list of large data frames consisting of two columns, value and count. The function I use in the previous step returns the value of the observation in value; the corresponding column count shows how many times this specific value has been observed. The following code produces one data frame as an example; however, the data frames in the list all have different values and value ranges:
d <- as.data.frame(
  cbind(
    value = runif(n = 1856, min = 921, max = 4187),
    count = runif(n = 1856, min = 0, max = 20000)
  )
)
Now I would like to aggregate the data to be able to create viewable visualizations. This aggregation should be applied to all data frames in a list, each of which has a different value range. I am looking for a function that cuts the data into new values and counts, a little bit like a histogram function. So, for example, for all data with a value from 0 to 100, the counts should be summed (and so on, in a defined interval, with a clean interval border starting point like 0).
My first try was to create a simple value vector, where each value is repeated the number of times given by the count field. The next step would have been applying the hist() function without plotting to obtain the aggregated values and counts, which can be defined in hist()'s arguments. However, this produces vectors that are too large (several GB each) for R to handle. I appreciate any solutions or hints!
I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data frames and then generate a new column containing the result of applying the binfunction to each data frame by using map from the purrr package.
library(tidyverse)
d1 <- d2 <- tibble(
  value = runif(n = 1856, min = 921, max = 4187),
  count = runif(n = 1856, min = 0, max = 20000)
)
d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))
binfunction <- function(data) {
  data %>% mutate(bin = value - (value %% 100)) %>%
    group_by(bin) %>%
    mutate(sum = sum(count)) %>%
    select(bin, sum)
}
d_binned <- d %>%
  mutate(binned = map(data, binfunction)) %>%
  select(-data) %>%
  # name the list-column explicitly; unnest() without columns is deprecated in tidyr >= 1.0
  unnest(binned) %>%
  group_by(name, bin) %>%
  slice(1L)
d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#>
#> # A tibble: 66 x 3
#> name bin sum
#> <chr> <dbl> <dbl>
#> 1 d1 900 495123.8
#> 2 d1 1000 683108.6
#> 3 d1 1100 546524.4
#> 4 d1 1200 447077.5
#> 5 d1 1300 604759.2
#> 6 d1 1400 506225.4
#> 7 d1 1500 499666.5
#> 8 d1 1600 541305.9
#> 9 d1 1700 514080.9
#> 10 d1 1800 586892.9
#> # ... with 56 more rows
d_binned %>%
  ggplot(aes(x = bin, y = sum, fill = name)) +
  geom_col() +
  facet_wrap(~name)
See this comment for my inspiration for the binning. It bins the data in groups of 100, so e.g. bin 1100 represents 1100 to <1200 etc. I imagine you can adapt the binfunction to your needs.
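A base R sketch of the same binning, in case the tidyverse is not available. The function name bin_counts and the tiny data frame are made up for illustration; floor(value / width) * width gives the same lower bin edge as value - (value %% 100) for a width of 100.

```r
# Hypothetical helper: sum the counts within fixed-width bins of value.
bin_counts <- function(d, width = 100) {
  d$bin <- floor(d$value / width) * width   # lower edge of each bin
  aggregate(count ~ bin, data = d, FUN = sum)
}

# Small made-up example so the result is easy to check by hand.
d <- data.frame(value = c(5, 50, 99.9, 100, 150, 250),
                count = c(1, 2, 3, 4, 5, 6))
binned <- bin_counts(d)
binned

# Apply the same function to every data frame in a list.
binned_list <- lapply(list(d1 = d, d2 = d), bin_counts)
```

This works on the value/count pairs directly, so the values never have to be expanded into one long vector.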

Iteratively rbind 10% of the data from data frame and plotting

I have three data frames, each with one column but a different number of rows: 100, 100, and 1000 for df1, df2, and df3 respectively. I want to rbind them iteratively and repeatedly calculate measures like the mean for small chunks of the data, taking 10% of the data each time. That is, in the first iteration I need 10 rows from df1, 10 from df2, and 100 from df3, and for this set I need the mean; the process should continue 10 times. I also need to plot the chunks over the iterations, showing the mean on the y-axis against the iteration, and get an overall mean with this procedure. Any suggestions?
df1<- data.frame(A=c(1:100))
df2<- data.frame(A=c(1:100))
df3<- data.frame(A=c(1:1000))
library(dplyr)
for i in (1:10)
{ df[i]<- rbind_list(df1,df2,df3)
mean=mean(df$A)}
You're making things complicated by trying to keep separate data frames. Add a "group" column---call it "iteration" if you prefer---and get your data in one data frame:
df1$group = rep(1:10, each = nrow(df1) / 10)
df2$group = rep(1:10, each = nrow(df2) / 10)
df3$group = rep(1:10, each = nrow(df3) / 10)
df = rbind(df1, df2, df3)
means = group_by(df, group) %>% summarize(means = mean(A))
means
# Source: local data frame [10 x 2]
#
# group means
# 1 1 43
# 2 2 128
# 3 3 213
# 4 4 298
# 5 5 383
# 6 6 468
# 7 7 553
# 8 8 638
# 9 9 723
# 10 10 808
Your overall mean is mean(df$A). You can plot with with(means, plot(group, means)).
Edits:
If the groups don't come out exactly, here's how I'd assign the group column. Make sure your dplyr is up to date: this uses the .id argument of bind_rows(), which was new this month in version 0.4.3.
library(dplyr)
# dplyr >= 0.4.3
df = bind_rows(df1, df2, df3, .id = "id")
df = df %>% group_by(id) %>%
  mutate(group = (0:(n() - 1)) %/% (n() / 10) + 1)
The id column tells you which data frame the row came from, and the group column splits it into 10 groups. The rest of the code from above should work just fine.
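To see what the group formula does when the row count is not a multiple of 10, here is a quick check with an assumed n of 25. The variable names are only for illustration.

```r
# Assumed row count that does not divide evenly into 10 chunks.
n <- 25

# Same formula as in the mutate() above, with n() replaced by a plain number.
group <- (0:(n - 1)) %/% (n / 10) + 1

# How many rows land in each of the groups.
sizes <- table(group)
sizes
```

With 25 rows the groups alternate between 3 and 2 rows, so every row is assigned and the chunk sizes differ by at most one.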
