Random sample by group and filtering on the basis of the result - R

I have a dataframe that is generated by the following code
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
                c("ids", "months", "calWeek", "value"))
and looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
I would like to randomly sample a calendar week from each id-month group (the months are not calendar months), and then keep all rows of each id-month group up to and including the sampled calendar week.
An example output: suppose the sampling returned cal week 201743 for the group id=2, month=88 and cal week 201709 for the group id=1, month=5; then the final output should be
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
I tried to work with dplyr's sample_n function (which gives me the random calendar week by id-month group), but then I do not know how to get all calendar weeks prior to that date. Can you help me with this? If possible, I would like to work with dplyr.
Please let me know in case you need further information.
Many thanks

library(dplyr)
set.seed(1) # when sampling, set a seed for reproducibility
sampled <- dat %>% group_by(ids) %>% do(., sample_n(., 1))
sampled_day <- sampled$calWeek
dat %>%
  group_by(ids) %>%
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)
# You can also filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups:   ids [2]
    ids months calWeek value
  <dbl>  <dbl>   <dbl> <dbl>
1  1.00   5.00  201708  5.00
2  2.00  88.0   201741 99.0
3  2.00  88.0   201742 100
Note that this depends on the row order, so make sure to arrange by calendar week first. You'll also need to think about ties. I have edited my previous answer and simply filtered with <=. A row-order-independent variant is sketched below.
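For reference, a sketch of a row-order-independent variant, assuming dplyr >= 1.0 (where slice_sample() supersedes sample_n()): sample one week per id-month group, join it back, and keep everything up to and including it.
library(dplyr)
set.seed(1)
# sample one calendar week per id-month group
sampled <- dat %>%
  group_by(ids, months) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(ids, months, sampled_week = calWeek)
# keep all rows up to and including each group's sampled week
dat %>%
  inner_join(sampled, by = c("ids", "months")) %>%
  filter(calWeek <= sampled_week) %>%
  select(-sampled_week)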

That should do the trick:
sample_and_get_below <- function(df, when, size) {
  res <- filter(df, calWeek == when) %>%
    sample_n(size)
  filter(df, calWeek > when) %>%
    rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
  ids months calWeek value
1   2     88  201741    99
2   2     88  201742   100
3   2     88  201743  1001
4   2     88  201744  1002


Merging two data frames without duplicating metric values

I have two data frames that I want to merge by leader values, so that I can see the total runs and walks for each group. Each leader can have multiple members on their team, but the problem is that when I merge them, the metrics get duplicated onto the newly added rows.
Here is an example of the two data sets that I have:
Data set 1:
+-------------+-----------+------------+-------------+
| leader name | leader id | total runs | total walks |
+-------------+-----------+------------+-------------+
| ab | 11 | 4 | 9 |
| tg | 47 | 8 | 3 |
+-------------+-----------+------------+-------------+
Data set 2:
+-------------+-----------+--------------+-----------+
| leader name | leader id | member name | member id |
+-------------+-----------+--------------+-----------+
| ab | 11 | gfh | 589 |
| ab | 11 | tyu | 739 |
| tg | 47 | rtf | 745 |
| tg | 47 | jke | 996 |
+-------------+-----------+--------------+-----------+
I want to merge the two datasets so that they become like this:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | | |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | | |
+-------------+-----------+--------------+------------+------------+-------------+
But right now I keep getting:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | 4 | 9 |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | 8 | 3 |
+-------------+-----------+--------------+------------+------------+-------------+
It doesn't matter whether they're blanks, NAs, or 0s, as long as the values aren't duplicated. Is there a way to achieve this?
We can do a replace on those 'total' columns after a left_join
library(dplyr)
left_join(df2, df1) %>%
  group_by(leadername) %>%
  mutate_at(vars(starts_with('total')), ~ replace(., row_number() > 1, NA))
# A tibble: 4 x 6
# Groups:   leadername [2]
#  leadername leaderid membername memberid totalruns totalwalks
#  <chr>         <dbl> <chr>         <dbl>     <dbl>      <dbl>
#1 ab               11 gfh             589         4          9
#2 ab               11 tyu             739        NA         NA
#3 tg               47 rtf             745         8          3
#4 tg               47 jke             996        NA         NA
Or without using the group_by:
left_join(df2, df1) %>%
  mutate_at(vars(starts_with('total')), ~
    replace(., duplicated(leadername), NA))
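In dplyr >= 1.0, mutate_at() is superseded by across(); under that assumption, an equivalent sketch:
left_join(df2, df1) %>%
  group_by(leadername) %>%
  # blank out the 'total' columns on every row after the first of each leader
  mutate(across(starts_with('total'), ~ replace(.x, row_number() > 1, NA))) %>%
  ungroup()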
Or a base R option is
out <- merge(df2, df1, all.x = TRUE)
i1 <- duplicated(out$leadername)
out[i1, c("totalruns", "totalwalks")] <- NA
out
#  leadername leaderid membername memberid totalruns totalwalks
#1         ab       11        gfh      589         4          9
#2         ab       11        tyu      739        NA         NA
#3         tg       47        rtf      745         8          3
#4         tg       47        jke      996        NA         NA
data
df1 <- structure(list(leadername = c("ab", "tg"), leaderid = c(11, 47),
                      totalruns = c(4, 8), totalwalks = c(9, 3)),
                 class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(leadername = c("ab", "ab", "tg", "tg"),
                      leaderid = c(11, 11, 47, 47),
                      membername = c("gfh", "tyu", "rtf", "jke"),
                      memberid = c(589, 739, 745, 996)),
                 class = "data.frame", row.names = c(NA, -4L))

How to read and process columns with sub columns from an excel/.csv/any file?

I tried reading an Excel file where I need to read the sub-columns too, but I haven't found a way to resolve this.
The Excel file contains data as,
| Sl No. | Sales 1 | Sales 2 | % Change |
| | 1 Qtr | % Qtr | 2 Qtr | % Qtr | |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
After importing I can see the data as
| Sl No. |Sales 1| ...3 |Sales 2 | ...5 | % Change |
| NA | 1 Qtr | % Qtr | 2 Qtr | % Qtr | NA |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
I tried several ways to merge "Sales 1 & ...3" and "Sales 2 & ...5" while keeping 1 Qtr, % Qtr, 2 Qtr, % Qtr as sub-columns, but was unable to do so.
I need it to be like,
| Sl No. | Sales 1 | Sales 2 | % Change |
| | 1 Qtr | % Qtr | 2 Qtr | % Qtr | |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
Unfortunately, R doesn't allow multiple header rows for a data frame. So probably the easiest thing you can do using base R is to combine the column names with the first row and then get rid of that row.
library(openxlsx)
x <- read.xlsx("your_file.xlsx")
#   Sl.No Sales.1    X3 Sales.2    X5 %Change
# 1    NA   1 Qtr  %Qtr   2 Qtr  %Qtr      NA
# 2     1     134    67     175    74    12.5
colnames(x) <- paste0(colnames(x), ifelse(is.na(x[1, ]), "", paste0(" - ", x[1, ])))
x <- x[-1, ]
#   Sl.No Sales.1 - 1 Qtr X3 - %Qtr Sales.2 - 2 Qtr X5 - %Qtr %Change
# 2     1             134        67             175        74    12.5
colnames(x)
# [1] "Sl.No" "Sales.1 - 1 Qtr" "X3 - %Qtr" "Sales.2 - 2 Qtr" "X5 - %Qtr" "%Change"

dplyr - group last n row values

I have a dataframe as below
+--------+-----------+-----+
| make | model | cnt |
+--------+-----------+-----+
| toyota | camry | 10 |
| toyota | corolla | 4 |
| honda | city | 8 |
| honda | accord | 13 |
| jeep | compass | 3 |
| jeep | wrangler | 5 |
| jeep | renegade | 1 |
| accura | x1 | 2 |
| accura | x3 | 1 |
+--------+-----------+-----+
I need to aggregate this dataframe by Make to get the total volume and share, which I do as follows.
df <- data.frame(Make = c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
                 Model = c('camry','corolla','city','accord','compass','wrangler','renegade','x1','x3'),
                 Cnt = c(10, 4, 8, 13, 3, 5, 1, 2, 1))
dfc <- df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share = volume / sum(volume) * 100.0) %>%
  arrange(desc(volume))
The above operation gives me the share and volume aggregated by Make as below.
+--------+--------+-----------+
| make | volume | share |
+--------+--------+-----------+
| honda | 21 | 44.680851 |
| toyota | 14 | 29.787234 |
| jeep | 9 | 19.148936 |
| accura | 3 | 6.382979 |
+--------+--------+-----------+
I need to group everything except the first two rows into an "others" group and aggregate the volume and share, so that the dataframe looks like this:
+--------+--------+-----------+
| make | volume | share |
+--------+--------+-----------+
| honda | 21 | 44.680851 |
| toyota | 14 | 29.787234 |
| others | 12 | 25.53191 |
+--------+--------+-----------+
library(dplyr)
# example data
df <- data.frame(Make = c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
                 Model = c('camry','corolla','city','accord','compass','wrangler','renegade','x1','x3'),
                 Cnt = c(10, 4, 8, 13, 3, 5, 1, 2, 1),
                 stringsAsFactors = FALSE)
# number of rows to keep before lumping the rest into "others"
row_threshold <- 2
df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share = volume / sum(volume) * 100.0) %>%
  arrange(desc(volume)) %>%
  group_by(Make_upd = ifelse(row_number() > row_threshold, "others", Make)) %>%
  summarise(volume = sum(volume),
            share = sum(share))
# # A tibble: 3 x 3
#   Make_upd volume    share
#      <chr>  <dbl>    <dbl>
# 1    honda     21 44.68085
# 2   others     12 25.53191
# 3   toyota     14 29.78723
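A possible shortcut, assuming forcats >= 0.4 is available: fct_lump_n() lumps all but the n largest levels (here weighted by volume) into an "others" level, replacing the row_number() trick:
library(forcats)
dfc %>%
  mutate(Make = fct_lump_n(Make, n = 2, w = volume, other_level = "others")) %>%
  group_by(Make) %>%
  summarise(volume = sum(volume),
            share = sum(share))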

compare aggregate value across groups

With the df below, I need to
compute the median of the variable metric for each team (tm1, tm2, tm3) on a per locid, day, hour combination basis,
and then keep only those locid, day, hour observations which have the same metric median across the teams tm1, tm2, and tm3.
set.seed(100)
df <- data.frame(
  locid = sample(c(1111, 1122, 1133), 20, replace = TRUE),
  day = sample(1:3, 20, replace = TRUE),
  hour = sample(1:4, 20, replace = TRUE),
  team = sample(c("tm1", "tm2", "tm3"), 20, replace = TRUE),
  metric = sample(1:5, 20, replace = TRUE)
)
My attempt:
df_medians <- df %>%
  group_by(locid, day, hour, team) %>%
  summarise(metric_median = median(metric))
This gives the median per team for each locid + day + hour. I now need to find the locid + day + hour combos that give the same median value across teams tm1, tm2, tm3.
df_medians %>% group_by(locid, day, hour) %>% summarise(??what here??)
I was trying with dplyr, but a base R solution is fine.
As a simpler example, consider the data below, which has measurements from two different locations for two teams.
+-------+------+-------+-------+---------+
| locid | day | hour | team | metric |
+-------+------+-------+-------+---------+
| 1111 | 1 | 1 | tm1 | 3 |
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm1 | 1 |
| 1111 | 1 | 1 | tm2 | 1 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1111 | 1 | 1 | tm2 | 3 |
| 1122 | 1 | 1 | tm1 | 3 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm1 | 1 |
| 1122 | 1 | 1 | tm2 | 1 |
| 1122 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+---------+
step 1 - compute median by group
+-------+------+-------+-------+-------------+
| locid | day | hour | team | metric_med |
+-------+------+-------+-------+-------------+
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+-------------+
Step 2 - compare medians across each group (locid + day + hour): only (1111, 1, 1) has the same metric_med across the teams tm1 and tm2
+-------+------+-------+-------------+
| locid | day | hour | metric_med |
+-------+------+-------+-------------+
| 1111 | 1 | 1 | 2 |
+-------+------+-------+-------------+
One way to do it is to spread the team medians into one row per locid, day, and hour combination, and then compare them. This solution scales well to more than two groups and more complicated conditions.
library(dplyr)
library(tidyr)
df %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric)) %>%
  spread(team, median) %>%
  filter(tm1 == tm2)
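In tidyr >= 1.0, spread() is superseded by pivot_wider(); assuming tidyr >= 1.0 and dplyr >= 1.0, an equivalent sketch extended to all three teams (combinations where a team has no observations come out as NA and are dropped by the filter):
df %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric), .groups = "drop") %>%
  pivot_wider(names_from = team, values_from = median) %>%
  filter(tm1 == tm2 & tm2 == tm3)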
Another possible solution is to arrange the summarized results by locid, day, and hour, and then compare each median to its lag. This solution only works for two groups in team.
df %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric)) %>%
  arrange(locid, day, hour) %>%
  filter(median == lag(median))
Let's re-cast 'all equal' to mean "zero variance or a single observation". Thus:
df %>%
  # per locid, day, hour, team
  group_by(locid, day, hour, team) %>%
  # compute the median
  summarize(team_median = median(metric)) %>%
  # ungroup before specifying the new grouping
  ungroup %>%
  # for each locid, day, hour
  group_by(locid, day, hour) %>%
  # find the medians that were the same for all teams;
  # 'the same' here is taken to mean no variance,
  # or having a single observation.
  # Note that while TRUE | NA yields TRUE, FALSE | NA yields NA.
  # As a guard against team medians that are NA, I add a coalesce() wrapper:
  # I've decided that missing team medians represent non-cases, YMMV.
  summarize(all_equal = coalesce(n() == 1 | var(team_median) == 0, FALSE)) %>%
  filter(all_equal == TRUE) %>%
  select(-all_equal)
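An equivalent, arguably simpler sketch: "all medians equal" is the same as "exactly one distinct median per locid, day, hour" (like the variance version, this counts single-team combinations as cases):
df %>%
  group_by(locid, day, hour, team) %>%
  summarize(team_median = median(metric)) %>%
  ungroup() %>%
  group_by(locid, day, hour) %>%
  summarize(metric_med = first(team_median),
            n_medians = n_distinct(team_median)) %>%
  filter(n_medians == 1) %>%
  select(-n_medians)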

R ddply sum value from next row

I want to sum the Val column of each row with that of the next row.
> df
+----+------+--------+------+
| id | Val | Factor | Col |
+----+------+--------+------+
| 1 | 15 | 1 | 7 |
| 3 | 20 | 1 | 4 |
| 2 | 35 | 2 | 8 |
| 7 | 35 | 1 | 12 |
| 5 | 40 | 1 | 11 |
| 6 | 45 | 2 | 13 |
| 4 | 55 | 1 | 4 |
| 8 | 60 | 1 | 7 |
| 9 | 15 | 2 | 12 |
..........
I would like the mean of the sums Row$Val + nextRow$Val, matched on their id and Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried
> ddply(df, .(Factor), summarize,
        max(Val),
        sum(Val),
        mean(Val + df[df$id == id + 1 & df$Col == Col]$Val)
  )
which fails with
> "longer object length is not a multiple of shorter object length"
You can build a vector of values with
sapply(df$id, function(x) {
  mean(c(
    subset(df, id == x, select = Val, drop = TRUE),
    subset(df, id == x + 1, select = Val, drop = TRUE)
  ))
})
You could simplify, but I tried to make it as readable as possible.
You can use rollapply from the zoo package. Since you want the mean of only two consecutive rows, you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 17.5 27.5 35.0 37.5 42.5 50.0 57.5 37.5
You can do something like this with the dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
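A grouped sketch closer to the original ddply call, keeping the pairwise sums and then averaging per Factor (the names pair_sum and mean_pair are mine):
df %>%
  arrange(id) %>%
  # sum each row's Val with the next row's Val (NA for the last row)
  mutate(pair_sum = Val + lead(Val)) %>%
  group_by(Factor) %>%
  summarise(max_val = max(Val),
            sum_val = sum(Val),
            mean_pair = mean(pair_sum, na.rm = TRUE))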
