Summarizing a dataset in R to calculate unique values

In the following data frame I want to count the unique household_id - individual_id combinations and calculate the average weight and total duration after summarizing by the country, state and date columns.
household_id 100 has two unique individuals (1 and 2) and household_id 101 has three unique individuals (1, 2, 3), so the total number of unique individuals after summarizing is 5.
I want the average weight to be calculated over these 5 unique individuals, i.e. (100 + 50 + 200 + 200 + 200)/5 = 150.
Final dataset:
What I did is:
data %>% group_by(country, state, date) %>%
  summarise(total_unique = n_distinct(household_id, individual_id),
            Tot_Duration = sum(duration))
But I am not able to calculate Average_weights.
Any help is highly appreciated.
Sample Dataset
library(dplyr)
data <- data.frame(country = c("US","US","US","US","US","US","IND","IND"),
state = c("TX","TX","TX","TX","TX","TX","AP","AP"),
date = c(20220601,20220601,20220601,20220601,20220601,20220601,20220601,20220601),
household_id = c(100,100,100,101,101,101,102,102),
individual_id=c(1,2,1,1,2,3,1,1),
weights = c(100,50,100,200,200,200,100,100),
duration = c(10,20,30,40,50,60,70,80))
EDIT
Apologies for not posting the right dataset at first, which I only realized later.
Two updates to the dataset:
Different individuals may have the same weight, as in household_id 101
A duration column has been added
With solution 1, distinct() will not work, and with solution 2, unique() will not work. Please suggest an alternative.
I have updated the sample dataset.

Building on your code, you could add an extra statement in summarise:
library(tidyverse)
data %>%
  group_by(country, state, date) %>%
  summarise(total_unique = n_distinct(household_id, individual_id),
            Average_weights = sum(unique(weights), na.rm = TRUE) / total_unique)
Output
country state date total_unique Average_weights
<chr> <chr> <dbl> <int> <dbl>
1 IND AP 20220601 1 100
2 US TX 20220601 5 210

You may try
library(dplyr)
data %>%
group_by(country, state, date) %>%
distinct() %>%
summarize(total_unique = n(),
average_Weights = sum(weights)/total_unique)
country state date total_unique average_Weights
<chr> <chr> <dbl> <int> <dbl>
1 IND AP 20220601 1 100
2 US TX 20220601 5 210
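The asker's edit points out that unique(weights) and distinct() on whole rows break down once different individuals share a weight and a duration column is present. A minimal sketch that handles both cases, assuming the first row seen for each household_id/individual_id pair carries that individual's weight (the `is_first` helper column is a hypothetical name, not part of the original data):

```r
library(dplyr)

# Sample data as given in the edited question
data <- data.frame(
  country       = c("US", "US", "US", "US", "US", "US", "IND", "IND"),
  state         = c("TX", "TX", "TX", "TX", "TX", "TX", "AP", "AP"),
  date          = rep(20220601, 8),
  household_id  = c(100, 100, 100, 101, 101, 101, 102, 102),
  individual_id = c(1, 2, 1, 1, 2, 3, 1, 1),
  weights       = c(100, 50, 100, 200, 200, 200, 100, 100),
  duration      = c(10, 20, 30, 40, 50, 60, 70, 80)
)

result <- data %>%
  group_by(country, state, date) %>%
  # flag the first row of each household/individual pair within the group
  mutate(is_first = !duplicated(paste(household_id, individual_id, sep = "_"))) %>%
  summarise(
    total_unique    = sum(is_first),          # number of distinct individuals
    Average_weights = mean(weights[is_first]), # one weight per individual
    Tot_Duration    = sum(duration),          # duration uses every row
    .groups = "drop"
  )

result
```

On the sample data this gives 5 unique individuals with an average weight of 150 and a total duration of 210 for US/TX, which matches the figure the question asks for.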

Related

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to use fractions or group by percentage. I have therefore tried to assign firms to groups based on their row positions after arranging Firm_size in descending order. The problem with doing so is that I don't take year into consideration, which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. the top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%".
Further I make a new dataframe "testtest" where I calculate the different measures, before using the stargazer to present it. This works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of Firms in each Group_size, and thus returns the equivalent of doing this with the original data
test %>% group_by(Year) %>% summarize(mean(Expenditures))
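To see why the weighting by N_Firms matters, here is a small sketch on made-up numbers (the `toy` data frame and its values are hypothetical, chosen only so the arithmetic is easy to follow):

```r
library(dplyr)

# Hypothetical data: one year, group A has three firms, group B has one
toy <- data.frame(
  Year         = 2000,
  Group_Size   = c("A", "A", "A", "B"),
  Expenditures = c(10, 20, 30, 100)
)

by_group <- toy %>%
  group_by(Year, Group_Size) %>%
  summarise(Mean_Exp = mean(Expenditures), N_Firms = n(),
            .groups = "drop_last") %>%   # result stays grouped by Year
  mutate(YrMean = sum(Mean_Exp * N_Firms / sum(N_Firms)))

by_group

# Mean computed directly from the raw rows, for comparison
overall <- mean(toy$Expenditures)
```

The group means are 20 and 100, so their unweighted average would be 60, while the weighted version (20*3 + 100*1)/4 = 40 agrees with the mean of the four raw values.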
Input Data:
library(dplyr)
set.seed(123)
test = data.frame(
  Firm = replicate(2000, sample(letters,1)),
  Year = sample(2000:2020, 2000, replace=TRUE),
  Firm_size= ceiling(runif(2000,2000,5000)),
  Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
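An alternative sketch that avoids the list column and rowwise() altogether, which may also sidestep the "List object cannot be coerced to type double" error mentioned in the question's edit: compute the quantile breaks inside a grouped mutate() and pass them straight to cut(). The toy data and the `size_labels` name below are assumptions for illustration, not the asker's real data.

```r
library(dplyr)

# Hypothetical toy data: 20 firms per year with distinct sizes
test <- data.frame(
  Firm         = rep(paste0("firm", 1:20), times = 2),
  Year         = rep(c(2000, 2001), each = 20),
  Firm_size    = rep(seq(100, 2000, by = 100), times = 2),
  Expenditures = rep(seq(1000, 20000, by = 1000), times = 2)
)

size_labels <- c("Lowest 10%", "10%-25%", "25%-50%",
                 "50%-75%", "75%-90%", "Highest 10%")

result <- test %>%
  group_by(Year) %>%
  # quantile() is evaluated per year because of the grouping
  mutate(gs = cut(
    Firm_size,
    breaks = c(-Inf, quantile(Firm_size, probs = c(.1, .25, .5, .75, .9)), Inf),
    labels = size_labels
  )) %>%
  group_by(Year, gs) %>%
  summarise(Mean    = mean(Expenditures),
            Median  = median(Expenditures),
            Std.dev = sd(Expenditures),
            N       = n(),
            .groups = "drop")

result
```

Note that cut() stops with an error if two quantile breaks coincide within a year, which can happen when many firms share the same Firm_size; in that case the breaks would need to be de-duplicated or a rank-based grouping used instead.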

Difference between the two highest numbers in a column in R

I have a data frame like this:
NUM_TURNO CODIGO_MUNICIPIO SIGLA_PARTIDO SHARE
1 1 81825 PPB 38.713318
2 1 81825 PMDB 61.286682
3 1 09717 PMDB 48.025900
4 1 09717 PL 1.279217
5 1 09717 PFL 50.694883
6 1 61921 PMDB 51.793868
This is a data.frame of elections in Brazil. Grouping by NUM_TURNO and CODIGO_MUNICIPIO, I want to compare the SHARE of the first and second most voted politicians in each city and round (1 or 2) and create a new column.
Where am I having trouble? I don't know how to calculate the difference between only the two largest SHARE values.
For the first case, for example, I want something that gives me the difference between 61.286682 and 38.713318, i.e. 22.573364, and so on.
Something like this:
df %>%
group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
mutate(Diff = <highest SHARE> - <second-highest SHARE>)  # pseudocode
You can also use top_n from dplyr with grouping and summarizing. Keep in mind that in the data you provided, you will get an error in summarize if you use diff with a single value, hence the use of ifelse.
df %>%
group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
top_n(2, SHARE) %>%
summarize(Diff = ifelse(n() == 1, NA, diff(SHARE)))
# A tibble: 3 x 3
# Groups: NUM_TURNO [?]
NUM_TURNO CODIGO_MUNICIPIO Diff
<dbl> <dbl> <dbl>
1 1 9717 2.67
2 1 61921 NA
3 1 81825 22.6
You could arrange your data frame by SHARE and then slice the first two values. Then you could use summarise to get the diff between the values for every group:
library(dplyr)
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = -diff(SHARE))
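Since the question asks for a new column rather than a summary, a sketch using mutate() may also be useful. It sorts SHARE within each group and subtracts the second value from the first; indexing a length-one vector at position 2 yields NA, so single-candidate cities need no special handling. The data below are the six rows from the question, with CODIGO_MUNICIPIO stored as character (an assumption, made to keep the leading zero).

```r
library(dplyr)

df <- data.frame(
  NUM_TURNO        = c(1, 1, 1, 1, 1, 1),
  CODIGO_MUNICIPIO = c("81825", "81825", "09717", "09717", "09717", "61921"),
  SIGLA_PARTIDO    = c("PPB", "PMDB", "PMDB", "PL", "PFL", "PMDB"),
  SHARE            = c(38.713318, 61.286682, 48.025900,
                       1.279217, 50.694883, 51.793868)
)

result <- df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  mutate(Diff = {
    top <- sort(SHARE, decreasing = TRUE)
    top[1] - top[2]   # NA when the group has a single row
  }) %>%
  ungroup()

result
```

Every row of a city then carries the same Diff, always non-negative, regardless of the order in which the rows appear.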

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
library(tibble)
df <- tibble(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr)
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
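With more recent versions of tidyr, pivot_longer() and pivot_wider() can do the reshaping in both directions. The following is a sketch under that assumption (tidyr 1.0 or later); the intermediate names `long`, `less` and `result` are hypothetical.

```r
library(dplyr)
library(tidyr)

df <- tibble(group         = c("A", "B"),
             no.highschool = c(20, 10),
             high.school   = c(70, 40),
             college       = c(10, 40),
             graduate      = c(0, 10))

# Long format: one row per group and education level, in column order
long <- df %>%
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup()

# Back to wide: one lessthan_* column per education level
less <- long %>%
  select(group, education, lessthan) %>%
  pivot_wider(names_from = education, values_from = lessthan,
              names_prefix = "lessthan_")

result <- left_join(df, less, by = "group")
result
```

The cumulative sum relies on pivot_longer() keeping the education levels in their original column order within each group, so the columns of df need to be ordered from lowest to highest level.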

Calling specific cells in the same column (using dplyr?)

I have a dataframe with character and numeric data. I would like to use dplyr to create a summary grouped by time points and trials generating the following:
averages
standard deviations
variation
ratio between time points
(etc etc)
I feel like all of this could be done in the dplyr pipe, but I am struggling to make a ratio of averages between time points within trials.
I fully admit that I may be carrying around a hammer looking for nails, so please feel free to recommend solutions that utilize other packages or functions, but ideally I'd like simple/straight forward code for ease of use by multiple collaborators.
library(dplyr)
# creating an example DF
num <- runif(100, 50, 3200)
smpl <- 1:100
df <- data.frame( num, smpl)
df$time <- "time1"
df$time[seq(2,100,2)] <- "time2"
df$trial <- "a"
df$trial[26:50] <- "b"
df$trial[51:75] <- "c"
df$trial[75:100] <- "d"
# using the magic of pipelines to calculate useful things
df1 <- df %>%
group_by(time, trial) %>%
summarise(avg = mean(num),
var = var(num),
stdev = sd(num))
I'd love to get [the ratio time2/time1 of the avg for each trial] included in this block above, but I don't know how to call "avg" specifically by "time1" vs "time2" within the pipe.
From here on, nothing does quite what I'm hoping for...
df1 <- df1[with(df1,order(trial,time)),]
# this better resembles my actual DF structure,
# so reordering it will make some of my next attempts to solve this make more sense
I tried to use the fact that 'every other line' is different (this is not ideal because each df will have a different number of rows, so I will either introduce NAs or have to keep changing these numbers, or write a function to change them)
tm2 <- data.frame(x=df1$avg[seq(2,4,2)])
tm1 <- data.frame(x=df1$avg[seq(1,3,2)])
so minimally, this is the ratio I'd like included in the df, but tied to the avg & trial columns:
tm2/tm1
It doesn't matter to me 'which' time row this ratio ends up in, so long as it is consistent across all the trials (so if a column of ratios has "blank" for every "time1" and "value" for every "time2", that's fine).
# I added in a separate column to allow 'match' later
tm1$time <- "time1"
tm2$time <- "time1" # to keep them all 'in row'
df1$avg_tm1 <- tm1$x[match(df1$time, tm1$time)]
df1$avg_tm2 <- tm2$x[match(df1$time, tm2$time)]
but this fails to match by 'trial' as well, since that info is lost in this new tm1 df; this really makes me think it should all be done in dplyr the first time...
Then I tried to create a new column in the tm1 df with the ratio
tm2$ratio <-tm2$x/tm1$x
and add in the ratio values only if the avg matches
df1$ratio <- tm2$ratio[match(tm2$x, df1$avg)]
This might work, but when I extract the avg values, it rounds, so the numbers do not match exactly. I'm also cautious about this because if I process ridiculous amounts of data, there's a higher and higher chance that two random averages will be similar enough to misplace these ratios.
I tried several other things that completely failed, so let's pretend that something worked and entered the ratio into the df1 as separate columns
Then any further calculations or annotations are straight forward:
df2 <- df1 %>%
mutate(ratio = avg_tm2/avg_tm1,
lost = 1- ratio,
word = paste0(round(lost*100),"%"))
But I am still stuck on 'how' to call specific cells inside the pipe or which other tools/packages to use to calculate deltas or ratios between cells in the same column.
Thanks in advance
We could group by 'trial' and mutate to create the 'ratio' column
df1 %>%
group_by(trial) %>%
mutate(ratio = last(avg)/first(avg))
# A tibble: 8 x 6
# Groups: trial [4]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 time1 a 1815. 715630. 846. 0.795
#2 time1 b 2012. 1299823. 1140. 0.686
#3 time1 c 1505. 878168. 937. 1.09
#4 time1 d 1387. 902364. 950. 1.17
#5 time2 a 1444. 998943. 999. 0.795
#6 time2 b 1380. 720135. 849. 0.686
#7 time2 c 1641. 1205778. 1098. 1.09
#8 time2 d 1619. 582418. 763. 1.17
NOTE: We used set.seed(2) for creating the dataset
Work out a separate data.frame:
library(dplyr)
library(tidyr)  # provides spread()
set.seed(2)
# your code above to generate df1
df2 <- select(df1, time, trial, avg) %>%
  spread(time, avg) %>%
  mutate(ratio = time2/time1)
df2
# # A tibble: 4 × 4
# trial time1 time2 ratio
# <chr> <dbl> <dbl> <dbl>
# 1 a 1815.203 1443.731 0.7953555
# 2 b 2012.436 1379.981 0.6857266
# 3 c 1505.474 1641.439 1.0903135
# 4 d 1386.876 1619.341 1.1676176
and now you can merge the relevant column onto the original frame:
left_join(df1, select(df2, trial, ratio), by="trial")
# Source: local data frame [8 x 6]
# Groups: time [?]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 time1 a 1815.203 715630.4 845.9494 0.7953555
# 2 time1 b 2012.436 1299823.3 1140.0979 0.6857266
# 3 time1 c 1505.474 878168.3 937.1063 1.0903135
# 4 time1 d 1386.876 902363.7 949.9282 1.1676176
# 5 time2 a 1443.731 998943.3 999.4715 0.7953555
# 6 time2 b 1379.981 720134.6 848.6074 0.6857266
# 7 time2 c 1641.439 1205778.0 1098.0792 1.0903135
# 8 time2 d 1619.341 582417.5 763.1629 1.1676176
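The ratio can also be computed inside the original pipe by selecting avg by its time label rather than by position. This is a sketch on freshly generated data (the trial assignment via rep() and the column name `variance` are choices made here for illustration), so the numbers will differ from the outputs above.

```r
library(dplyr)

set.seed(2)
df <- data.frame(
  num   = runif(100, 50, 3200),
  smpl  = 1:100,
  time  = rep(c("time1", "time2"), times = 50),
  trial = rep(c("a", "b", "c", "d"), each = 25)
)

df1 <- df %>%
  group_by(trial, time) %>%
  summarise(avg      = mean(num),
            variance = var(num),
            stdev    = sd(num),
            .groups  = "drop_last") %>%   # still grouped by trial
  # pick the averages by label, so row order within a trial does not matter
  mutate(ratio = avg[time == "time2"] / avg[time == "time1"],
         lost  = 1 - ratio,
         word  = paste0(round(lost * 100), "%")) %>%
  ungroup()

df1
```

Selecting by label assumes every trial has exactly one time1 and one time2 row after summarising; a trial missing one of the two would make mutate() fail rather than silently produce a wrong ratio.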

How to aggregate using ddply when not all elements of a variable exist in R

I am having trouble using combinations of ddply and merge to aggregate some variables. The data frame that I am using is really large, so I am putting an example below:
data_sample <- cbind.data.frame(c(123,123,123,321,321,134,145,000),
c('j', 'f','j','f','f','o','j','f'),
c(seq(110,180, by = 10)))
colnames(data_sample) <- c('Person','Expense_Type','Expense_Value')
I want to calculate, for each person, the percentage of the value of expense of type j on the person's total expense.
library(plyr)
# column names must match those set by colnames() above
data_sample2 <- ddply(data_sample, c('Person'), transform, total = sum(Expense_Value))
data_sample2 <- ddply(data_sample2, c('Person','Expense_Type'), transform, empresa = sum(Expense_Value))
This is what I've done to list total expenses by type, but the problem is that not all individuals have expenses of type j, so their percentage should be 0, and I don't know how to keep only one line per person with the percentage of total expenses of type j.
I might have not made myself clear.
Thank you!
We can use the by function:
by(data_sample, data_sample$Person, FUN = function(dat){
sum(dat[dat$Expense_Type == 'j',]$Expense_Value) / sum(dat$Expense_Value)
})
We could also make use of the dplyr package:
library(dplyr)
data_sample %>%
group_by(Person) %>%
summarise(Percent_J = sum(ifelse(Expense_Type == 'j', Expense_Value, 0)) / sum(Expense_Value))
# A tibble: 5 × 2
Person Percent_J
<dbl> <dbl>
1 0 0.0000000
2 123 0.6666667
3 134 0.0000000
4 145 1.0000000
5 321 0.0000000
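Since the question asks about ddply specifically, the same calculation can be sketched with plyr by passing summarise instead of transform, which returns one row per person. Subsetting Expense_Value by type gives an empty vector for people with no type j expenses, and the sum of an empty vector is 0.

```r
library(plyr)

data_sample <- data.frame(
  Person        = c(123, 123, 123, 321, 321, 134, 145, 0),
  Expense_Type  = c("j", "f", "j", "f", "f", "o", "j", "f"),
  Expense_Value = seq(110, 180, by = 10)
)

# One row per Person: share of type "j" expenses in the person's total
result <- ddply(data_sample, "Person", summarise,
                Percent_J = sum(Expense_Value[Expense_Type == "j"]) /
                            sum(Expense_Value))
result
```

Loading plyr and dplyr in the same session is known to cause masking between functions such as summarise and mutate, so it is usually safer to stick to one of the two packages in a given script.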
