I have a dataframe as below
+--------+-----------+-----+
| make | model | cnt |
+--------+-----------+-----+
| toyota | camry | 10 |
| toyota | corolla | 4 |
| honda | city | 8 |
| honda | accord | 13 |
| jeep | compass | 3 |
| jeep | wrangler | 5 |
| jeep | renegade | 1 |
| accura | x1 | 2 |
| accura | x3 | 1 |
+--------+-----------+-----+
I need to aggregate this dataframe by Make so as to get the total volume and share - I do this as follows.
library(dplyr)
df <- data.frame(Make=c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
Model=c('camry','corolla','city','accord','compass', 'wrangler','renegade','x1', 'x3'),
Cnt=c(10, 4, 8, 13, 3, 5, 1, 2, 1))
dfc <- df %>%
group_by(Make) %>%
summarise(volume = sum(Cnt)) %>%
mutate(share=volume/sum(volume)*100.0) %>%
arrange(desc(volume))
The above operation gives me the share and volume aggregated by Make as below.
+--------+--------+-----------+
| make | volume | share |
+--------+--------+-----------+
| honda | 21 | 44.680851 |
| toyota | 14 | 29.787234 |
| jeep | 9 | 19.148936 |
| accura | 3 | 6.382979 |
+--------+--------+-----------+
I need to collapse everything except the first two rows into a single group called others, summing the volume and share, so that the dataframe looks like the one below.
+--------+--------+-----------+
| make | volume | share |
+--------+--------+-----------+
| honda | 21 | 44.680851 |
| toyota | 14 | 29.787234 |
| others | 12 | 25.53191 |
+--------+--------+-----------+
library(dplyr)
# example data
df <- data.frame(Make=c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
Model=c('camry','corolla','city','accord','compass', 'wrangler','renegade','x1', 'x3'),
Cnt=c(10, 4, 8, 13, 3, 5, 1, 2, 1), stringsAsFactors = FALSE)
# number of top rows to keep as they are; all later rows are pooled into "others"
row_threshold <- 2
df %>%
group_by(Make) %>%
summarise(volume = sum(Cnt)) %>%
mutate(share=volume/sum(volume)*100.0) %>%
arrange(desc(volume)) %>%
group_by(Make_upd = ifelse(row_number() > row_threshold, "others", Make)) %>%
summarise(volume = sum(volume),
share = sum(share))
# # A tibble: 3 x 3
# Make_upd volume share
# <chr> <dbl> <dbl>
# 1 honda 21 44.68085
# 2 others 12 25.53191
# 3 toyota 14 29.78723
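The same top-N-plus-others idea can also be sketched in base R, which additionally keeps the rows in descending order with "others" last. This is only a minimal sketch: the helper name `top_n_with_others` and its `n` argument are made up for illustration and are not part of dplyr or of the answer above.

```r
# Example data (same counts as in the question; Model is not needed here)
df <- data.frame(
  Make = c("toyota", "toyota", "honda", "honda", "jeep", "jeep", "jeep", "accura", "accura"),
  Cnt  = c(10, 4, 8, 13, 3, 5, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n makes with the largest totals, pool the rest
top_n_with_others <- function(data, n = 2) {
  totals <- tapply(data$Cnt, data$Make, sum)
  totals <- totals[order(totals, decreasing = TRUE)]
  keep   <- head(names(totals), n)

  # Relabel every make outside the top n as "others"
  grp <- ifelse(data$Make %in% keep, as.character(data$Make), "others")

  # Factor levels fix the output order: top makes first, "others" last
  vol <- tapply(data$Cnt, factor(grp, levels = c(keep, "others")), sum)
  vol <- vol[!is.na(vol)]  # drops "others" when nothing was pooled

  data.frame(
    make   = names(vol),
    volume = as.numeric(vol),
    share  = as.numeric(vol) / sum(vol) * 100,
    row.names = NULL
  )
}

res <- top_n_with_others(df, n = 2)
print(res)
```

Because the shares are recomputed from the pooled volumes, they always sum to 100 regardless of how many rows end up in "others".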
Suppose I have dataframe 'y'
WR<-c("S",'J',"T")
B<-c("b1","b2","b3")
wgt<-c(0.3,2,3)
y<-data.frame(WR,B,wgt)
I want to make a column-percentage crosstab with B as rows and with WR plus a total of WR as columns, using the expss package
library(expss)
y %>% tab_cols(total(),WR) %>% # Columns
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_cells(mrset(B))%>% # Row
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
Result, where the Base value in the #Total column does not match up:
# #Total WR|J WR|S WR|T
# Base 1.000000 1 1.0 1
# Projection 5.300000 2 0.3 3
# b1 5.660377 NA 100.0 NA
# b2 37.735849 100 NA NA
# b3 56.603774 NA NA 100
I think I found the solution: declare the row variable with tab_cells() before calling tab_stat_valid_n()
y %>% tab_cols(total(),WR) %>% # Columns
tab_cells(mrset(B))%>% # Row
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
| | | #Total | WR | | |
| | | | J | S | T |
| -- | ---------- | ------ | --- | ----- | --- |
| B | Base | 3.0 | 1 | 1.0 | 1 |
| | Projection | 5.3 | 2 | 0.3 | 3 |
| b1 | | 5.7 | | 100.0 | |
| b2 | | 37.7 | 100 | | |
| b3 | | 56.6 | | | 100 |
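As a sanity check on the #Total column of the corrected table, the same numbers can be reproduced in base R without expss. This is only a sketch of the arithmetic (unweighted base, weighted base, weighted column percentages); the variable names `base_n`, `projection` and `total_cpct` are chosen here for illustration.

```r
y <- data.frame(
  WR  = c("S", "J", "T"),
  B   = c("b1", "b2", "b3"),
  wgt = c(0.3, 2, 3)
)

# Unweighted base: number of valid cases in the whole sample
base_n <- nrow(y)            # 3

# Weighted base ("Projection"): sum of the weights
projection <- sum(y$wgt)     # 5.3

# Weighted column percentages of B within the total column
total_cpct <- tapply(y$wgt, y$B, sum) / projection * 100
print(round(total_cpct, 1))  # b1 5.7, b2 37.7, b3 56.6
```

The unweighted base of 3 is what the first attempt failed to show in the #Total column.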
I am trying to create a function that generates a frequency table (showing count, valid percent, and percent) for each item in a list of banner variables.
I want to export the tables to xlsx files.
For example, for the variable "gear", I want to calculate the table for the banner below.
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
banner <- list(df$all,df$small,df$large)
data <- df
var <- "gear"
var1 <- rlang::parse_expr(var)
expss::var_lab(data[[var]])
#tab1 <- expss::fre(data[[var1]])
table1 <- expss::fre(data[[var1]],
stat_lab = getOption("expss.fre_stat_lab", c("Count N", "Valid percent", "Percent",
"Responses, %", "Cumulative responses, %")))
table1
the output table should be like below
You need to write a custom function around fre:
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
my_fre <- function(curr_var) setNames(expss::fre(curr_var)[, 1:3],
c("row_labels", "Count N", "Valid percent"))
cross_fun_df(df, gear, list(all, small, large), fun = my_fre)
# | | Total | | Small | | Large | |
# | | Count N | Valid percent | Count N | Valid percent | Count N | Valid percent |
# | ------ | ------- | ------------- | ------- | ------------- | ------- | ------------- |
# | 3 | 15 | 46.88 | 3 | 21.43 | | |
# | 4 | 12 | 37.50 | 10 | 71.43 | 8 | 61.54 |
# | 5 | 5 | 15.62 | 1 | 7.14 | 5 | 38.46 |
# | #Total | 32 | 100.00 | 14 | 100.00 | 13 | 100.00 |
# | <NA> | 0 | | 0 | | 0 | |
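To cross-check the counts and valid percents in the table above without expss, the same frequencies can be computed per banner subset in base R. This is a rough sketch rather than a replacement for cross_fun_df: the banner is represented here as a named list of logical filters, and the names `banner` and `freq_by_banner` are assumptions made for this example.

```r
# Banner columns expressed as logical row filters on mtcars
banner <- list(
  Total = rep(TRUE, nrow(mtcars)),
  Small = mtcars$vs == 1,
  Large = mtcars$am == 1
)

# A factor keeps all gear levels, so empty cells show up as a count of 0
gear <- factor(mtcars$gear)

freq_by_banner <- lapply(banner, function(keep) {
  counts <- table(gear[keep])
  data.frame(
    gear          = names(counts),
    count         = as.integer(counts),
    valid_percent = as.numeric(prop.table(counts)) * 100
  )
})

print(freq_by_banner$Large)
```

Each list element is a plain data frame, so it can be passed to whichever xlsx writer is already in use.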
I'm doing exploratory analysis of survey data and the dataframe is a haven labelled dataset, that is, each variable already has value labels and variable labels.
I want to store frequency tables in a list and then name each list element after its variable label. I'm using the expss package. The problem is that the first column header of each output table contains this description: values2labels(df$var). How can this description be dropped from the table?
Reproducible example:
# Dataframe
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
agegroup= c(1, 3, 1, 2, 3, 3, 2, 2, 2, 1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
library(expss)
# Variable labels
var_lab(df$sex) <-"Sex"
var_lab(df$agegroup) <-"Age group"
# Value labels
val_lab(df$sex) <- make_labels("1 Male
2 Female")
val_lab(df$agegroup) <- make_labels("1 1-29
2 30-49
3 50 and more")
# Save variable labels
var_labels1 <- var_lab(df$sex)
var_labels2 <- var_lab(df$agegroup)
# Drop variable labels
var_lab(df$sex) <- NULL
var_lab(df$agegroup) <- NULL
# Save frequencies
f1 <- fre(values2labels(df$sex))
f2 <- fre(values2labels(df$agegroup))
# Note: I use the function 'values2labels' from the 'expss' package to display the value
# labels instead of the values of the variable. In this example, since I created the value
# labels manually, I don't need that function, but when I import haven labelled data, I need it to
# display value labels by default.
# Add frequencies on list
my_list <- list(f1, f2)
# Name lists elements as variable labels
names(my_list) <- c(var_labels1,
var_labels2)
In the following output, how can I get rid of the first column name on both tables: values2labels(df$sex) and values2labels(df$agegroup) ?
$Sex
| values2labels(df$sex) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| --------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| Female | 6 | 60 | 60 | 60 | 60 |
| Male | 4 | 40 | 40 | 40 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
$`Age group`
| values2labels(df$agegroup) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| -------------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| 1-29 | 3 | 30 | 30 | 30 | 30 |
| 30-49 | 4 | 40 | 40 | 40 | 70 |
| 50 and more | 3 | 30 | 30 | 30 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
You need to set var_lab to an empty string instead of NULL:
library(expss)
a = 1:2
val_lab(a) = c("One" = 1, "Two" = 2)
var_lab(a) = ""
fre(values2labels(a))
# | | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
# | ------ | ----- | ------------- | ------- | ------------ | ----------------------- |
# | One | 1 | 50 | 50 | 50 | 50 |
# | Two | 1 | 50 | 50 | 50 | 100 |
# | #Total | 2 | 100 | 100 | 100 | |
# | <NA> | 0 | | 0 | | |
I have two data frames and I want to merge them by leader, so that I can see the total runs and walks for each group. Each leader can have multiple members on their team, and the problem is that when I merge the data frames, the metrics get duplicated onto every member row.
Here is an example of the two data sets that I have:
Data set 1:
+-------------+-----------+------------+-------------+
| leader name | leader id | total runs | total walks |
+-------------+-----------+------------+-------------+
| ab | 11 | 4 | 9 |
| tg | 47 | 8 | 3 |
+-------------+-----------+------------+-------------+
Data set 2:
+-------------+-----------+--------------+-----------+
| leader name | leader id | member name | member id |
+-------------+-----------+--------------+-----------+
| ab | 11 | gfh | 589 |
| ab | 11 | tyu | 739 |
| tg | 47 | rtf | 745 |
| tg | 47 | jke | 996 |
+-------------+-----------+--------------+-----------+
I want to merge the two datasets so that they become like this:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | | |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | | |
+-------------+-----------+--------------+------------+------------+-------------+
But right now I keep getting:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | 4 | 9 |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | 8 | 3 |
+-------------+-----------+--------------+------------+------------+-------------+
It doesn't matter if they're blank, NA's or 0's, as long as the values aren't duplicating. Is there a way to achieve this?
We can do a replace on those 'total' columns after a left_join
library(dplyr)
left_join(df2, df1 ) %>%
group_by(leadername) %>%
mutate_at(vars(starts_with('total')), ~ replace(., row_number() > 1, NA))
# A tibble: 4 x 6
# Groups: leadername [2]
# leadername leaderid membername memberid totalruns totalwalks
# <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
Or without using the group_by
left_join(df2, df1 ) %>%
mutate_at(vars(starts_with('total')), ~
replace(., duplicated(leadername), NA))
Or a base R option is
out <- merge(df2, df1, all.x = TRUE)
i1 <- duplicated(out$leadername)
out[i1, c("totalruns", "totalwalks")] <- NA
out
# leadername leaderid membername memberid totalruns totalwalks
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
data
df1 <- structure(list(leadername = c("ab", "tg"), leaderid = c(11, 47
), totalruns = c(4, 8), totalwalks = c(9, 3)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(leadername = c("ab", "ab", "tg", "tg"), leaderid = c(11,
11, 47, 47), membername = c("gfh", "tyu", "rtf", "jke"), memberid = c(589,
739, 745, 996)), class = "data.frame", row.names = c(NA, -4L))
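One way to confirm that the blanking step removed the duplication is to check that the column totals after the merge equal the totals in the leader table. The sketch below rebuilds the example data so that it runs on its own; picking the columns with a `^total` pattern and keying the duplicate check on `leaderid` are assumptions made for this illustration.

```r
df1 <- data.frame(
  leadername = c("ab", "tg"),
  leaderid   = c(11, 47),
  totalruns  = c(4, 8),
  totalwalks = c(9, 3)
)
df2 <- data.frame(
  leadername = c("ab", "ab", "tg", "tg"),
  leaderid   = c(11, 11, 47, 47),
  membername = c("gfh", "tyu", "rtf", "jke"),
  memberid   = c(589, 739, 745, 996)
)

# One-to-many merge: each leader row is repeated once per member
out <- merge(df2, df1, by = c("leadername", "leaderid"), all.x = TRUE)

# Blank the metrics on every row after the first one of each leader
total_cols <- grep("^total", names(out), value = TRUE)
out[duplicated(out$leaderid), total_cols] <- NA

# Column sums now agree with the leader-level table
print(colSums(out[total_cols], na.rm = TRUE))
print(colSums(df1[total_cols]))
```

Keeping NA rather than 0 makes it explicit that the value was not measured per member, while sums with `na.rm = TRUE` still come out right.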
I have a dataframe that is generated by the following code
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
c("ids", "months", "calWeek", "value"))
and looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
I would like to randomly sample one calendar week from each id-month group (the months are not calendar months). Then, within each group, I would like to keep all rows up to and including the sampled week.
An example output could be: suppose the sampling returned cal week 201743 for the group id=2 and month=88 and 201709 for the group id=1 and month=5; then the final output should be
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
I tried to work with dplyr's sample_n function (which gives me the random calendar week by id-month group), but I do not know how to get all calendar weeks prior to that week. Can you help me with this? If possible, I would like to work with dplyr.
Please let me know in case you need further information.
Many thanks
require(dplyr)
set.seed(1) # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1))
sampled_day <- sampled$calWeek
dat %>% group_by(ids) %>%
mutate(max_day = which(calWeek %in% sampled_day)) %>%
filter(row_number() <= max_day)
#You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups: ids [2]
#     ids months calWeek value
#   <dbl>  <dbl>   <dbl> <dbl>
# 1  1.00   5.00  201708  5.00
# 2  2.00  88.0   201741  99.0
# 3  2.00  88.0   201742   100
This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=
That should do the trick:
sample_and_get_below <- function(df, when, size){
res <- filter(df, calWeek == when) %>%
sample_n(size)
filter(df, calWeek > when) %>%
rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
ids months calWeek value
1 2 88 201741 99
2 2 88 201742 100
3 2 88 201743 1001
4 2 88 201744 1002
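For comparison, here is a base R sketch of what the question asks for: sample one calendar week per id-month group and keep every row of that group up to and including the sampled week. The function name `keep_up_to_sampled_week` is made up for this example, and it assumes each calendar week occurs at most once per group.

```r
dat <- data.frame(
  ids     = c(1, 1, 1, 2, 2, 2, 2),
  months  = c(5, 5, 5, 88, 88, 88, 88),
  calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744),
  value   = c(5, 6, 3, 99, 100, 1001, 1002)
)

# Hypothetical helper: per id-month group, draw one week and keep rows up to it
keep_up_to_sampled_week <- function(data) {
  groups <- split(data, list(data$ids, data$months), drop = TRUE)
  kept <- lapply(groups, function(g) {
    g <- g[order(g$calWeek), ]                  # do not rely on input row order
    # sample.int() avoids the sample(x, 1) pitfall for groups with one row
    cutoff <- g$calWeek[sample.int(nrow(g), 1)]
    g[g$calWeek <= cutoff, ]
  })
  out <- do.call(rbind, kept)
  rownames(out) <- NULL
  out
}

set.seed(1)  # make the draw reproducible
result <- keep_up_to_sampled_week(dat)
print(result)
```

Sorting inside each group removes the dependence on row order mentioned above, and comparing with `<=` keeps the sampled week itself in the output.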