Efficiently summarizing and transforming a table of data using tidyverse functions - r

I have a relatively large data file that looks like (a), and need to create a structure like (b). That is, I need to calculate the sum of Amount times Coefficient for each ID and each year.
I quickly hacked something together using nested for loops, but that's of course terribly inefficient:
library(tidyverse)
data <- tibble(
  id = c("A", "B", "C", "A", "A", "B", "C"),
  year = c(2002, 2002, 2004, 2002, 2003, 2003, 2005),
  amount = c(1000, 1500, 1000, 500, 1000, 1000, 500),
  coef = rep(0.5, 7)
)
years <- sort(unique(data$year))
ids <- unique(data$id)
# name the columns up front; as.tibble() is deprecated in favour of as_tibble()
result <- matrix(0, length(ids), length(years),
                 dimnames = list(NULL, as.character(years))) %>%
  as_tibble()
for (i in seq_along(ids)) {
  for (j in seq_along(years)) {
    d <- filter(data, id == ids[i] & year == years[j])
    if (nrow(d) != 0) {
      result[i, j] <- sum(d$amount * d$coef)
    }
  }
}
result <- add_column(result, ID = ids, .before = 1)
I was wondering how one could solve this efficiently using map(), group_by() or any other tidyverse functions.
Thanks in advance for helpful suggestions.

Here's one way that seems to work. I'm sure there are others.
library(tidyverse)
id <- c("A", "B", "C", "A", "A", "B", "C")
year <- c(2002,2002,2004,2002,2003,2003,2005)
amount <- c(1000,1500,1000,500,1000,1000,500)
coef <- rep(0.5,7)
data <- tibble(id, year, amount, coef)
table <- data %>%
  group_by(id, year) %>%
  mutate(prod = amount * coef) %>%
  summarize(sumprod = sum(prod)) %>%
  spread(year, sumprod) %>%
  replace(is.na(.), 0)

Thanks for the hint, this really is just one line:
result <- data %>% group_by(id, year) %>% summarise(S=sum(amount*coef)) %>% spread(year, S)
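For readers on newer package versions, the same reshaping can be sketched with pivot_wider(), which supersedes spread(). This is an illustrative variant, not part of the original answers; it assumes tidyr >= 1.1 (for values_fill with a single value) and dplyr >= 1.0 (for .groups), and reuses the example data from the question.

```r
library(dplyr)
library(tidyr)

# example data from the question
data <- tibble(
  id = c("A", "B", "C", "A", "A", "B", "C"),
  year = c(2002, 2002, 2004, 2002, 2003, 2003, 2005),
  amount = c(1000, 1500, 1000, 500, 1000, 1000, 500),
  coef = rep(0.5, 7)
)

result <- data %>%
  group_by(id, year) %>%
  # one row per id/year with the weighted sum
  summarise(S = sum(amount * coef), .groups = "drop") %>%
  # one column per year; id/year pairs without data become 0 instead of NA
  pivot_wider(names_from = year, values_from = S, values_fill = 0)
```

Filling with values_fill replaces the separate replace(is.na(.), 0) step used in the earlier answer.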

Related

How can I calculate the sum for specific cells?

I want to sum up the Population and householders for NH_AmIn, NH_PI, NH_Other, NH_More as a new row for each county. How can I do that?
Here is a dplyr approach using dummy data; you would have to expand on it. It filters to the focal races, groups by county, sums the population of the filtered rows within each group, and appends the result to the initial data.
library(dplyr)
set.seed(1)
# demo data
df <- data.frame(county=rep(c("A","B"), each=4), race=c("a", "b", "c", "d"), population=sample(2000:15000, size=8))
# sum by county for the subset of races
df %>%
  filter(race %in% c("c", "d")) %>%
  group_by(county) %>%
  summarise(race = "total", population = sum(population)) %>%
  rbind(df)
For your data, assuming df is the name of your data.frame, the solution is:
df %>%
  filter(Race %in% c("NH_AmIn", "NH_PI", "NH_Other", "NH_More")) %>%
  group_by(County) %>%
  summarise(Race = "total", Population = sum(Population), Householder = sum(Householder)) %>%
  rbind(df)

Summarise multiple functions at once using tidyeval in dplyr 1.0

Say we have a data frame,
library(tidyverse)
library(rlang)
df <- tibble(id = rep(c(1:2), 10),
grade = sample(c("A", "B", "C"), 20, replace = TRUE))
We would like to get the proportion of each grade, grouped by id:
df %>%
  group_by(id) %>%
  summarise(
    n = n(),
    mu_A = mean(grade == "A"),
    mu_B = mean(grade == "B"),
    mu_C = mean(grade == "C")
  )
I am handling a case where there are multiple conditions (many grades in this case) and would like to make my code more robust. How can we simplify this using tidyevaluation in dplyr 1.0?
I am talking about the idea of generating multiple column names by passing all grades at once, without breaking the flow of piping in dplyr, something like
# how to get the mean of A, B, C all at once?
mu_{grade} := mean(grade == {grade})
I actually found the answer to my own question from a post that I wrote 2 years ago...
I am just going to post the code right below hoping to help anybody that comes across the same problem.
make_expr <- function(x) {
  x %>%
    map(~ parse_expr(str_glue("mean(grade == '{.x}')")))
}
# generate multiple expressions
grades <- c("A", "B", "C")
exprs <- grades %>% make_expr() %>% set_names(paste0("mu_", grades))
# we can 'top up' something extra by adding a named element
exprs <- c(n = parse_expr("n()"), exprs)
# use the big bang operator `!!!` to splice the expressions into summarise()
df %>% group_by(id) %>% summarise(!!!exprs)
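The same idea can be sketched without building code from strings, by creating the expressions with rlang::expr() and unquoting each grade with `!!`. This is an alternative sketch under the same setup as the question, not the original poster's code; the seed is only there to make the toy data reproducible.

```r
library(dplyr)
library(purrr)
library(rlang)

set.seed(1)
df <- tibble(
  id = rep(1:2, 10),
  grade = sample(c("A", "B", "C"), 20, replace = TRUE)
)

grades <- c("A", "B", "C")

# one quoted expression per grade, e.g. mean(grade == "A")
exprs <- grades %>%
  map(~ expr(mean(grade == !!.x))) %>%
  set_names(paste0("mu_", grades))

# splice the named expressions into summarise() next to an ordinary argument
res <- df %>%
  group_by(id) %>%
  summarise(n = n(), !!!exprs)
```

Avoiding parse_expr() on a glued string means grade values that contain quotes or other special characters do not need escaping.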

Filtering a dataframe based on a list and then saving each dataframe based on said filter. Is creating a function or for loop the way to go?

I'm doing topic modeling on different categories within my dataset and before I do that I need to split the data into different dataframes based on the category so that I can then cast each one of them to a document-term matrix. From the little I know about for loops I have the following. The part I get stuck is that I need an output for each item in the list.
category = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
for (i in category) {
  # Subset to test topic model
  someDataFrame = anotherDataFrame %>%
    filter(colVariable == i) %>% # here is the column of interest in the dataframe
    select(ID, Word) %>%
    group_by(ID, Word) %>%
    count()
  newDataFrame_i = someDataFrame %>% # here's where I'd like to export to individual dataframes
    cast_dtm(ID, Word, n) # in order to do topic modeling, you have to build a document-term matrix
}
As I said, I'm expecting a dataframe for each item in the list. However, I keep getting: Error in (function (cl, name, valueClass) : assignment of an object of class “numeric” is not valid for @‘Dim’ in an object of class “dgTMatrix”; is(value, "integer") is not TRUE.
I've done this with a single hard-coded value (say "a") and got the result I was looking for, so I know my for loop is off.
Solution:
filter_and_cast <- function(df, category) {
  df %>%
    filter(colVariable == category) %>% # use the function argument, not the loop variable i
    select(ID, Word) %>%
    group_by(ID, Word) %>%
    count() %>%
    ungroup() %>%
    cast_dtm(ID, Word, n)
}
for (i in category) {
  cast = paste("filterCast", i, sep = "_")
  try(assign(cast, filter_and_cast(aDataFrame, i)))
}
Thanks to the contributors, I was finally able to solve my issue.
I think you can solve the problem with the assign() function, which creates an object from a name and a value.
Something like:
ObjectName = paste("newDataFrame", i, sep = "_")
assign(ObjectName, newDataFrame_i)
You just need to save the intermediate outputs into an object, such as a list:
category = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
out <- list()
for (i in category) {
  # Subset to test topic model
  someDataFrame = anotherDataFrame %>%
    filter(colVariable == i) %>% # here is the column of interest in the dataframe
    select(ID, Word) %>%
    group_by(ID, Word) %>%
    count()
  out[[i]] = someDataFrame %>% # one list element per category
    cast_dtm(ID, Word, n) # in order to do topic modeling, you have to build a document-term matrix
}
Or, in a more R-esque style:
filter_and_cast <- function(df, category) {
  df %>%
    filter(colVariable == category) %>% # here is the column of interest in the dataframe
    select(ID, Word) %>%
    group_by(ID, Word) %>%
    count() %>%
    ungroup() %>%
    cast_dtm(ID, Word, n)
}
Then you could do something like
map(category, filter_and_cast, df = anotherDataFrame )
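To keep track of which result belongs to which category, the list can be named before mapping. The following is a minimal sketch on made-up data: the toy tibble and the helper count_words() are hypothetical, and the cast_dtm() step is left out so the example runs without tidytext.

```r
library(dplyr)
library(purrr)

# hypothetical stand-in for anotherDataFrame
toy <- tibble(
  colVariable = c("a", "a", "b", "b", "b"),
  ID = c(1, 1, 2, 3, 3),
  Word = c("x", "y", "x", "x", "x")
)

# hypothetical helper: per-category word counts (the input to a cast_dtm() step)
count_words <- function(df, category) {
  df %>%
    filter(colVariable == category) %>%
    count(ID, Word) # count() on ungrouped data returns an ungrouped result
}

categories <- unique(toy$colVariable)

# set_names() names each element after its category, and map() keeps those names
out <- categories %>%
  set_names() %>%
  map(count_words, df = toy)
```

Individual results are then available as out$a, out[["b"]], and so on, which avoids creating one global object per category with assign().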

dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. Now, if there is a NA in one variable but the other variable match, I'd still like to see those rows grouped, with the NA taking on the value of the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
And if I wanted to group by variable_A and variable_B, rows 1 and 4 normally wouldn't group, but I'd like them to, with the NA overridden to "a". How can I achieve this? The code below doesn't do the job.
df2 <- df %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C))
You can group by B first and fill in the missing A values from the first non-missing A in each group, leaving non-missing values untouched. Then proceed with what you wanted to do:
df_filled = df %>%
  group_by(variable_B) %>%
  mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A)))) %>% # only NAs are replaced
  ungroup()
df_filled %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C))
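You could do the missing value imputation using base R as follows:
ii <- which(is.na(df$variable_A)) # rows with a missing A
donors <- df[!is.na(df$variable_A), ] # rows that can supply a value
df_filled <- df
# for each missing row, take A from the first donor with the same B
df_filled$variable_A[ii] <- donors$variable_A[match(df$variable_B[ii], donors$variable_B)]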
Then group and summarize as planned with dplyr
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))
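Another way to sketch the imputation is tidyr::fill() within each variable_B group. This is an illustrative alternative, assuming tidyr >= 1.0 for the "downup" direction; note that fill() copies the nearest non-missing neighbour within the group (the row above first, then the row below), which is not necessarily the first non-missing value of the group.

```r
library(dplyr)
library(tidyr)

# example data from the question
df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

totals <- df %>%
  group_by(variable_B) %>%
  # only NA entries are filled; existing values such as "f" stay as they are
  fill(variable_A, .direction = "downup") %>%
  ungroup() %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")
```

With this data, row 4 takes "a" from row 1, so the ("a", "c") group sums rows 1 and 4, while row 5 keeps its own ("f", "c") group.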

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df using experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do the step 4?
freq_count = df %>% dplyr::group_by(experiment, count) %>% summarize(freq=n()) %>% na.omit() %>% mutate(freq_per=freq/sum(freq)*100)
Thank you very much.
There may be a more concise approach, but I would suggest collapsing your count into a new column using mutate() and ifelse() and then summarising:
freq_count %>%
  mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
  group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (`.add` in dplyr >= 1.0, `add` in older versions)
  summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
  select(-collapsed_count) # dropped to match your df_1.
Also, just FYI: for step 2 you might consider the count() function if you're keen to save some keystrokes. tibble() or data.frame() are also likely better options than calling cbind.data.frame() explicitly to create a data frame.
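Putting both suggestions together, the whole pipeline might look like the sketch below. It is one possible arrangement rather than the answerer's exact code: it collapses count with pmin() before counting, so steps 2 to 4 happen in a single pass, and it builds the example data with data.frame().

```r
library(dplyr)

# example data from the question
df <- data.frame(
  experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
  count = c(1, 2, 3, 4, 5, 1, 2, 1)
)

res <- df %>%
  mutate(count = pmin(count, 3)) %>%           # treat every count >= 3 as 3
  count(experiment, count, name = "freq") %>%  # frequency per experiment and collapsed count
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>%  # percent within each experiment
  ungroup()
```

Drop the count column with select(-count) if the result should match df_1 exactly.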

Resources