Select sample from a grouping variable depending on another grouping in R

I have the following data frame with 1,000 rows (10 cities, each with 100 rows). I would like to randomly select 10 names per city, sampled by Year, and the 10 sampled names should be spread across years rather than drawn from a single one, i.e. the 10 names for City 1 should not all come from 1996, for instance.
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
...
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
Desired Final Sample Data
City Year name
1 1 1996 b
2 1 1998 c
3 1 2001 d
...
11 2 1997 g
12 2 1999 h
13 2 2005 b
...
21 3 1998 a
22 3 2010 c
23 3 2005 d
Sample Data
df1 <- data.frame(City = rep(1:10, each = 100),
Year = rep(1996:2015, each = 5),
name = rep(letters[1:25], 40))
I am failing to randomly select the 10 sample names by Year for all 10 cities, without repeating years unless the city has fewer than 10 years. How can I do this?
The final sample should have 10 names for each city, and years should not repeat unless there are fewer than 10 of them in that city.
Thank you.

First group by City and use sample_n to sample a sub-data frame of at most 10 rows per city.
Then group by City and Year, and sample one element of name per group. Don't forget to set the RNG seed to make the result reproducible.
library(dplyr)
set.seed(2020)
df1 %>%
  group_by(City) %>%
  sample_n(min(n(), 10)) %>%
  ungroup() %>%
  group_by(City, Year) %>%
  summarise(name = sample(name, 1))
#`summarise()` regrouping output by 'City' (override with `.groups` argument)
## A tibble: 4 x 3
## Groups: City [2]
# City Year name
# <int> <int> <chr>
#1 1 1996 b
#2 1 1997 e
#3 2 1996 f
#4 2 1997 h
Data
df1 <- read.table(text = "
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
", header = TRUE)
Edit
Instead of reinventing the wheel, use the strata function from the sampling package to get row indices into the data set, then filter the corresponding rows.
library(dplyr)
library(sampling)
set.seed(2020)
df1 %>%
  mutate(row = row_number()) %>%
  filter(row %in% strata(df1, stratanames = c('City', 'Year'), size = rep(1, 1000), method = 'srswor')$ID_unit) %>%
  select(-row) %>%
  group_by(City) %>%
  sample_n(10) %>%
  arrange(City, Year)
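A minimal sketch of another way to get close to the stated goal, assuming dplyr 1.0.0 or later (for slice_sample) and the question's df1. It draws one random name per City and Year first, then keeps at most 10 of those rows per city, so no year repeats within a city. This is not the answer author's method, and when a city has fewer than 10 years it returns fewer than 10 rows for that city rather than repeating years.

```r
library(dplyr)

set.seed(2020)

# Sample data from the question: 10 cities, 20 years each, 5 rows per year
df1 <- data.frame(City = rep(1:10, each = 100),
                  Year = rep(1996:2015, each = 5),
                  name = rep(letters[1:25], 40))

sampled <- df1 %>%
  group_by(City, Year) %>%
  slice_sample(n = 1) %>%    # one random name per City-Year pair
  group_by(City) %>%
  slice_sample(n = 10) %>%   # at most 10 distinct years per city
  ungroup() %>%
  arrange(City, Year)

sampled
```

Because each City-Year pair contributes a single row before the second draw, the per-city sample cannot contain the same year twice.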

Related

How to compute cumulative and one specific column in R?

I have data about sales by year and by product, let's say like this:
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)
Firstly, I need to calculate a "Share" column, which represents the share of each product within each year.
After that, I compute the cumulative share.
In the third step, I need to identify the products that accumulate up to 70% of total sales in the last year (2012 in this case), keep only these products in the whole data frame, add a ranking column (based on the last year), and summarise all the remaining products as the category "Other".
This is a fairly complex data wrangling task, but can be achieved using dplyr:
library(dplyr)
df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  group_by(Year) %>%
  mutate(Share = 100 * Sale / sum(Sale),
         Year_order = order(order(-Share))) %>%
  arrange(Year, Year_order, .by_group = TRUE) %>%
  mutate(Cumm.Share = cumsum(Share)) %>%
  ungroup() %>%
  mutate(below_70 = Model %in% Model[Year == max(Year) & Cumm.Share < 70]) %>%
  mutate(Model = ifelse(below_70, Model, 'Other')) %>%
  group_by(Year, Model) %>%
  summarize(Sale = sum(Sale), Share = sum(Share), .groups = 'keep') %>%
  group_by(Year) %>%
  mutate(pseudoShare = ifelse(Model == 'Other', 0, Share)) %>%
  arrange(Year, -pseudoShare, .by_group = TRUE) %>%
  ungroup() %>%
  mutate(Rank = match(Model, Model[Year == max(Year)])) %>%
  select(-pseudoShare)
#> # A tibble: 9 x 5
#> Year Model Sale Share Rank
#> <dbl> <chr> <dbl> <dbl> <int>
#> 1 2010 a 30 19.4 2
#> 2 2010 c 23 14.8 1
#> 3 2010 Other 102 65.8 3
#> 4 2011 c 19 10.2 1
#> 5 2011 a 11 5.88 2
#> 6 2011 Other 157 84.0 3
#> 7 2012 c 89 44.7 1
#> 8 2012 a 33 16.6 2
#> 9 2012 Other 77 38.7 3
Note that in the output this code has kept groups a and c, rather than c and d, as in your expected output. This is because a and d have the same value in the final year (16.6), and therefore either could be chosen.
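The tie can be checked by hand. The following is a small base R sketch (the variable names are illustrative, not from the answer) that recomputes the 2012 shares from the question's data and accumulates them from largest to smallest.

```r
# 2012 sales per model, taken from the question's data
sale_2012 <- c(a = 33, b = 32, c = 89, d = 33, e = 12)

# Share of each model within 2012, in percent
share <- 100 * sale_2012 / sum(sale_2012)

# Rank from largest to smallest share and accumulate
ranked <- sort(share, decreasing = TRUE)
cum_share <- cumsum(ranked)

round(cum_share, 1)

# Number of models whose cumulative share stays below 70%
sum(cum_share < 70)
```

Model c alone accounts for about 44.7%, and adding either a or d (16.6% each) brings the total to about 61.3%. Adding the other one would exceed 70%, so only two models are kept and the choice between a and d depends on how the tie is broken.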
Created on 2022-04-21 by the reprex package (v2.0.1)
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale, stringsAsFactors=F)
years <- unique(df$Year)
shares <- c()
cumshares <- c()
for (year in years) {
  extract <- df[df$Year == year, ]
  sale <- as.numeric(extract$Sale)
  share <- 100 * sale / sum(sale)
  shares <- append(shares, share)
  cumshare <- rev(cumsum(rev(share)))
  cumshares <- append(cumshares, cumshare)
}
df$Share <- shares
df$Cumm.Share <- cumshares
df
gives
> df
Year Model Sale Share Cumm.Share
1 2010 a 30 19.354839 100.000000
2 2010 b 45 29.032258 80.645161
3 2010 c 23 14.838710 51.612903
4 2010 d 33 21.290323 36.774194
5 2010 e 24 15.483871 15.483871
6 2011 a 11 5.882353 100.000000
7 2011 b 56 29.946524 94.117647
8 2011 c 19 10.160428 64.171123
9 2011 d 45 24.064171 54.010695
10 2011 e 56 29.946524 29.946524
11 2012 a 33 16.582915 100.000000
12 2012 b 32 16.080402 83.417085
13 2012 c 89 44.723618 67.336683
14 2012 d 33 16.582915 22.613065
15 2012 e 12 6.030151 6.030151
I don't understand what you mean by step 3. How do you decide which products to keep?

R data frame - fill missing values with condition on another column

In R, I have the following data frame:
Id Year Age
1 2000 25
1 2001 NA
1 2002 NA
2 2000 NA
2 2001 30
2 2002 NA
Each Id has at least one row with age filled.
I would like to fill the missing "Age" values with the correct age for each ID.
Expected result:
Id Year Age
1 2000 25
1 2001 25
1 2002 25
2 2000 30
2 2001 30
2 2002 30
I've tried using 'fill':
df %>% fill(age)
But not getting the expected results.
Is there a simple way to do this?
The comments were close; you just have to add the .direction argument:
df %>% group_by(Id) %>% fill(Age, .direction="downup")
# A tibble: 6 x 3
# Groups: Id [2]
Id Year Age
<int> <int> <int>
1 1 2000 25
2 1 2001 25
3 1 2002 25
4 2 2000 30
5 2 2001 30
6 2 2002 30
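A self-contained sketch of the same call, assuming the data frame from the question is rebuilt as below (the construction of df is an assumption, since the question only shows it as a table):

```r
library(dplyr)
library(tidyr)

# Data frame rebuilt from the question's table
df <- data.frame(Id = rep(1:2, each = 3),
                 Year = rep(2000:2002, times = 2),
                 Age = c(25, NA, NA, NA, 30, NA))

filled <- df %>%
  group_by(Id) %>%                          # fill within each Id only
  fill(Age, .direction = "downup") %>%      # fill down first, then up
  ungroup()

filled
```

Grouping matters here: without group_by(Id), the value 25 from Id 1 would be carried down into the first row of Id 2.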
Assuming this is your dataframe
df<-data.frame(id=c(1,1,1,2,2,2),year=c(2000,2001,2002,2000,2001,2002),age=c(25,NA,NA,NA,30,NA))
With the zoo package, you can try
library(zoo)
df<-df[order(df$id,df$age),]
df$age<-na.locf(df$age)
Please see the solution below with the tidyverse library.
library(tidyverse)
dt <- data.frame(Id = rep(1:2, each = 3),
                 Year = rep(2000:2002, times = 2),
                 Age = c(25, NA, NA, 30, NA, NA))
dt %>% group_by(Id) %>% arrange(Id, Age) %>% fill(Age)
In the code you provided, you didn't use group_by. It is also important to arrange by Id and Age, because fill only fills downward by default and arrange puts NA values last. See, for example, the following data frame, and compare the result with and without arrange:
dt <- data.frame(Id = rep(1:2, each = 3),
                 Year = rep(2000:2002, times = 2),
                 Age = c(NA, 25, NA, NA, 30, NA))
dt %>% group_by(Id) %>% fill(Age) # only fills partially
dt %>% group_by(Id) %>% arrange(Id, Age) %>% fill(Age) # does the right job

And/or conditional filtering with single factor levels that meet multiple conditions

Consider this data frame:
data <- data.frame(ID = rep(letters[1:4], each= 4),
Year = c('1990','1990','1990','1990',
'1990','1990','2000', '2000',
'1990','1990','1990','1990',
'1990','1990','2000', '2000'))
We have 4 unique IDs and 2 Years. ID == a and ID == c only have observations in 1990, while ID == b and ID == d have observations for both years. We want to keep only the cases where an ID has observations for both years, so the expected result would look like this:
ID Year
b 1990
b 1990
b 2000
b 2000
d 1990
d 1990
d 2000
d 2000
Using dplyr's syntax, we can't group_by(ID) and filter using & like this:
data %>%
  group_by(ID) %>%
  filter(Year == '1990' & Year == '2000')
because both conditions refer to levels of the same factor (Year), so no single row can satisfy both.
So how can we do this using dplyr's syntax?
We can do it this way:
data %>%
  group_by(ID) %>%
  mutate(unique_ind = n_distinct(Year)) %>%
  filter(unique_ind == 2) %>%
  ungroup() %>%
  select(-unique_ind)
Output:
ID Year
1 b 1990
2 b 1990
3 b 2000
4 b 2000
5 d 1990
6 d 1990
7 d 2000
8 d 2000
We could construct the logical vector in filter
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(n_distinct(Year) > 1) %>%
  ungroup()
# A tibble: 8 x 2
ID Year
<chr> <chr>
1 b 1990
2 b 1990
3 b 2000
4 b 2000
5 d 1990
6 d 1990
7 d 2000
8 d 2000
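Both answers count distinct years. If the requirement is instead that two specific years are present, a hedged variant (not from either answer) is to test membership per group with %in% and all(), using the question's data:

```r
library(dplyr)

data <- data.frame(ID = rep(letters[1:4], each = 4),
                   Year = c('1990', '1990', '1990', '1990',
                            '1990', '1990', '2000', '2000',
                            '1990', '1990', '1990', '1990',
                            '1990', '1990', '2000', '2000'))

both_years <- data %>%
  group_by(ID) %>%
  # Keep the whole group only if every required year occurs in it
  filter(all(c('1990', '2000') %in% Year)) %>%
  ungroup()

both_years
```

The condition inside filter evaluates to a single TRUE or FALSE per group, so each ID is kept or dropped as a whole. Unlike n_distinct(Year) > 1, this still works when the data contain additional years.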

table() function in r - is there a better way with e.g., dplyr?

I am trying to create a basic table with frequencies of a categorical variable (Relationship_type) grouped by another variable (Country), preferably using dplyr library (or anything else that is easier to export as a .csv file than table()).
head(d)
Country Relationship_type
1 Algeria 2
2 Bulgaria 1
3 USA 2
4 Algeria 3
5 Germany 2
6 USA 1
I want this to look like an output from basic table(d$Country, d$Relationship_type) function:
2 3 4
Algeria 141 47 137
Australia 128 27 103
Austria 97 5 17
Belgium 172 16 71
Brazil 104 6 70
CHILE 54 4 46
Tried several combinations of tally(), group_by, count() etc. but can't figure it out.
Could you please help a bit??
All the best,
R_beginner
Another approach is to use tables::tabular() as follows.
textData <- "id Country Relationship_type
1 Algeria 2
2 Bulgaria 1
3 USA 2
4 Algeria 3
5 Germany 2
6 USA 1
7 Algeria 1
8 Bulgaria 3
9 USA 2
10 Algeria 2
11 Germany 1
12 USA 3"
df <- read.table(text=textData,header=TRUE)
library(tables)
tabular(Factor(Country) ~ Factor(Relationship_type),data=df)
...and the output:
Relationship_type
Country 1 2 3
Algeria 1 2 1
Bulgaria 1 0 1
Germany 1 1 0
USA 1 2 1
Still another approach is to recast the output from table() as a data frame, and pivot it wider with tidyr::pivot_wider().
# another approach: recast table output as data.frame
tableData <- data.frame(table(df$Country,df$Relationship_type))
library(dplyr)
library(tidyr)
tableData %>%
pivot_wider(id_cols = Var1,
names_from = Var2,
values_from = Freq)
...and the output:
> tableData %>%
+ pivot_wider(id_cols = Var1,
+ names_from = Var2,
+ values_from = Freq)
# A tibble: 4 x 4
Var1 `1` `2` `3`
<fct> <int> <int> <int>
1 Algeria 1 2 1
2 Bulgaria 1 0 1
3 Germany 1 1 0
4 USA 1 2 1
If we add a dplyr::rename() to the pipeline, we can rename the Var1 column to Country.
tableData %>%
pivot_wider(id_cols = Var1,
names_from = Var2,
values_from = Freq) %>%
rename(Country = Var1)
As usual, there are many ways in R to accomplish this task. Depending on the reason why the desired output is a CSV file, there are a variety of approaches that could fit the requirements. If the ultimate goal is to create presentation quality tables, then it's worth a look at this summary of packages that create presentation quality tables: How gt fits with other packages that create display tables.
You could indeed use count:
d %>% count(Country, Relationship_type)
It's not really better than table but it saves the $ use at least.
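To get from count() to the wide layout that table() prints, and on to a CSV file, one possible sketch is shown below. It assumes tidyr 1.1.0 or later (for names_sort), uses the six rows shown in the question as stand-in data, and writes to a hypothetical file name in the session's temporary directory.

```r
library(dplyr)
library(tidyr)

# Stand-in data: the six rows printed by head(d) in the question
d <- data.frame(Country = c("Algeria", "Bulgaria", "USA",
                            "Algeria", "Germany", "USA"),
                Relationship_type = c(2, 1, 2, 3, 2, 1))

wide <- d %>%
  count(Country, Relationship_type) %>%        # long table of frequencies
  pivot_wider(names_from = Relationship_type,  # one column per type
              values_from = n,
              values_fill = 0,                 # absent combinations become 0
              names_sort = TRUE)

# Hypothetical output path, chosen only for illustration
out_file <- file.path(tempdir(), "relationship_counts.csv")
write.csv(wide, out_file, row.names = FALSE)

wide
```

The result is an ordinary data frame with one row per country, so it exports to CSV without the reshaping that a table object needs.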

How do I join on a filter instead of a key?

I have a dataframe with a list of actions per year, like so -
print(df)
id actions year
b 2 1995
c 156 1997
e 53 1996
f 109 1994
I'd like to make a list of who had the most actions up to each year, so that it looks like this -
print(output)
asofyear id actions rank
1994 f 109 1
1995 f 109 1
1995 b 2 2
1996 f 109 1
1996 e 53 2
1996 b 2 3
1997 c 156 1
1997 f 109 2
1997 e 53 3
1997 b 2 4
How do I do this join so that it ranks every id whose year is less than or equal to the asofyear? One idea I was thinking about was something like this -
asOfYear <- seq(as.Date("1995-01-01"),as.Date("2000-01-01"),by="year")
asOfYear %>%
left_join(df, by = (asOfYear <= year)) %>%
arrange(asofyear, actions) %>%
group_by(asofyear) %>%
mutate(rank = row_number())
I don't know how to make that join key work though, but I would like to do this with dplyr.
Base R method using lapply: assuming you have unique years in df, we can filter the data frame for each year, order the rows by the actions column, and add a rank column giving the row number.
do.call(rbind, lapply(sort(df$year), function(x) {
  temp <- df[df$year <= x, ]
  transform(temp[order(temp$actions, decreasing = TRUE), ],
            year = x,
            rank = seq_len(nrow(temp)))
}))
# id actions year rank
#4 f 109 1994 1
#41 f 109 1995 1
#1 b 2 1995 2
#42 f 109 1996 1
#3 e 53 1996 2
#11 b 2 1996 3
#2 c 156 1997 1
#43 f 109 1997 2
#31 e 53 1997 3
#12 b 2 1997 4
If we want to do it using tidyverse tools, we can do
library(dplyr)
library(purrr)
map_dfr(sort(df$year), function(x)
  df %>%
    filter(year <= x) %>%
    arrange(desc(actions)) %>%
    mutate(year = x,
           rank = row_number()))
# id actions year rank
#1 f 109 1994 1
#2 f 109 1995 1
#3 b 2 1995 2
#4 f 109 1996 1
#5 e 53 1996 2
#6 b 2 1996 3
#7 c 156 1997 1
#8 f 109 1997 2
#9 e 53 1997 3
#10 b 2 1997 4
How the tidyverse approach works:
Any map* function loops over each element passed to it (here, each year). map_dfr expects the output of each operation to be a data frame (the df in map_dfr) and row-binds all the data frame outputs together (the r in map_dfr); there is also map_dfc, which column-binds the outputs.
For every year, it filters df to the rows whose year is less than or equal to the current value (x), then arranges them in descending (desc) order of actions. mutate then creates two columns: year, which replaces the existing year column with the current value x, and rank, which gives an incremental row number for every row in the data frame.
To understand the operation in detail, I would advise you to run the steps manually for each year.
So for first year 1994 it gives output as
df %>%
filter(year <= 1994) %>%
arrange(desc(actions)) %>%
mutate(year = 1994,
rank = row_number())
# id actions year rank
#1 f 109 1994 1
For 1995 it gives output
df %>%
filter(year <= 1995) %>%
arrange(desc(actions)) %>%
mutate(year = 1995,
rank = row_number())
# id actions year rank
#1 f 109 1995 1
#2 b 2 1995 2
and so on; this iterates for every year. For every year you get such a data frame, and the outputs are row-bound together into the final result.
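For a version that is literally a join on a condition, a sketch using join_by() is given below. It assumes dplyr 1.1.0 or later, where inequality conditions in joins are supported; the helper table as_of and the result name ranked are illustrative names, not part of the question or the answer above.

```r
library(dplyr)

df <- data.frame(id = c("b", "c", "e", "f"),
                 actions = c(2, 156, 53, 109),
                 year = c(1995, 1997, 1996, 1994))

# One row per "as of" year (illustrative helper table)
as_of <- data.frame(asofyear = sort(unique(df$year)))

ranked <- as_of %>%
  # Non-equi join: keep every df row whose year is <= the as-of year
  inner_join(df, by = join_by(asofyear >= year)) %>%
  group_by(asofyear) %>%
  arrange(desc(actions), .by_group = TRUE) %>%
  mutate(rank = row_number()) %>%
  ungroup() %>%
  select(asofyear, id, actions, rank)

ranked
```

In join_by(), the left-hand side of the comparison refers to the first table (as_of) and the right-hand side to the second (df), so asofyear >= year expresses the filter the question asks for.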
