How can I calculate the sum for specific cells? - r

I want to sum up the Population and householders for NH_AmIn, NH_PI, NH_Other, NH_More as a new row for each county. How can I do that?

Here is a dplyr approach using dummy data; you will need to adapt it to your own columns. It filters to the focal races, groups by county, sums the population within each group, and appends the totals to the original data.
library(dplyr)
set.seed(1)
# demo data
df <- data.frame(county=rep(c("A","B"), each=4), race=c("a", "b", "c", "d"), population=sample(2000:15000, size=8))
# sum by county for the subset of races
df %>%
  filter(race %in% c("c", "d")) %>%
  group_by(county) %>%
  summarise("race"="total", "population"=sum(population)) %>%
  rbind(df)
For your data, assuming your data frame is called df, the equivalent is:
df %>%
  filter(Race %in% c("NH_AmIn", "NH_PI", "NH_Other", "NH_More")) %>%
  group_by(County) %>%
  summarise("Race"="total", "Population"=sum(Population), "Householder"=sum(Householder)) %>%
  rbind(df)
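As a self-contained sketch of the same idea, the block below builds a small invented data frame (the county names and all numbers are made up for illustration) with the column names from the question. It swaps rbind() for bind_rows(), which matches columns by name and fills any columns missing from the totals with NA, so it should still work if the real data has extra columns.

```r
library(dplyr)

# Hypothetical data: two counties, six race categories each (values invented)
df <- data.frame(
  County      = rep(c("X", "Y"), each = 6),
  Race        = rep(c("NH_White", "NH_Black", "NH_AmIn",
                      "NH_PI", "NH_Other", "NH_More"), times = 2),
  Population  = c(500, 300, 10, 20, 30, 40, 800, 200, 1, 2, 3, 4),
  Householder = c(200, 100, 5, 6, 7, 8, 300, 90, 1, 1, 1, 1)
)

focal <- c("NH_AmIn", "NH_PI", "NH_Other", "NH_More")

# One "total" row per county, summing only the focal races
totals <- df %>%
  filter(Race %in% focal) %>%
  group_by(County) %>%
  summarise(
    Race        = "total",
    Population  = sum(Population),
    Householder = sum(Householder),
    .groups     = "drop"
  )

# Append the totals and keep each county's rows together
result <- bind_rows(df, totals) %>%
  arrange(County)
```

With these invented numbers, county X gets a total row with Population 100 and Householder 26.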

Related

Mutate new column over a large list of tibbles

So I have used the following code to split the below dataframe (df1) into multiple dataframes/tibbles based on the filters so that I can work out the percentile rank of each metric.
df1:
name  group  metric    value
A     A      distance  10569
B     A      distance  12939
C     A      distance  11532
A     A      psv-99    29.30
B     A      psv-99    30.89
C     A      psv-99    28.90
split <- lapply(unique(df1$metric), function(x){
  df1 %>% filter(group == "A" & metric == x)
})
This then gives me a large list of tibbles. I want to now mutate a new column for each tibble to work out the percentile rank of the value column which I can do using the following code:
df2 <- split[[1]] %>% mutate(percentile = percent_rank(value))
I could do this for each metric then row_bind them together, but that seems very messy. Could anyone suggest a better way of doing this?
No need to split the data here. You can use group_by to do the calculation for each metric separately.
library(dplyr)
df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%
  mutate(percentile = percent_rank(value))
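To make this concrete, here is a minimal sketch that rebuilds df1 from the table in the question and applies the grouped mutate. The result name `ranked` is only illustrative.

```r
library(dplyr)

# df1 rebuilt from the table in the question
df1 <- data.frame(
  name   = rep(c("A", "B", "C"), times = 2),
  group  = "A",
  metric = rep(c("distance", "psv-99"), each = 3),
  value  = c(10569, 12939, 11532, 29.30, 30.89, 28.90)
)

# percent_rank() is computed separately within each metric
ranked <- df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%
  mutate(percentile = percent_rank(value)) %>%
  ungroup()
```

Within each metric the smallest value gets 0, the largest gets 1, and the middle one gets 0.5, because percent_rank() rescales the ranks to the range 0 to 1.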
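We can also use base R's subset() and ave() for the grouping (percent_rank itself still comes from dplyr):
df_a <- subset(df1, group == 'A')
df_a$percentile <- with(df_a, ave(value, metric, FUN = percent_rank))
Another option is to nest the data per group and metric, which also needs purrr for map() and tidyr for unnest():
library(purrr)
library(tidyr)
df1 %>%
  group_nest(group, metric) %>%
  mutate(percentile = map(data, ~percent_rank(.x$value))) %>%
  unnest(cols = c("data", "percentile"))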

Sampling from a subset of a dataframe where the subset is conditional on a value from another dataframe in R

I have two data frames in R. One contains a row for each individual person and the area they live in. E.g.
df1 = data.frame(Person_ID = seq(1,10,1), Area = c("A","A","A","B","B","C","D","A","D","C"))
The other data frame contains demographic information for each Area.
E.g. for gender:
df2 = data.frame(Area = c("A","A","B","B","C","C","D","D"), gender = c("M","F","M","F","M","F","M","F"), probability = c(0.4,0.6,0.55,0.45,0.6,0.4,0.5,0.5))
In df1 I want to create a gender column where for each row of df1 I sample a gender from the appropriate subset of df2.
For example, for row 1 of df1 I would sample a gender from df2 %>% filter(Area == "A")
The question is how do I do this for all rows without a for loop as in practice df1 could have up to 5 million rows?
Try the following (map() comes from purrr, so it needs to be loaded as well):
library(dplyr)
library(tidyr)
library(purrr)
out <- df1 %>%
  nest(data = -Area) %>%
  left_join(df2, by = 'Area') %>%
  group_by(Area) %>%
  summarise(data = map(data, ~.x %>%
    mutate(gender = sample(gender, n(),
                           prob = probability, replace = TRUE)))) %>%
  distinct(Area, .keep_all = TRUE) %>%
  unnest(data)
We first nest df1 and join it with df2 by Area. For each Area, we sample gender values according to the probabilities in df2, then unnest to get back a long data frame.
The example df1 has too few rows to verify the result, but with more rows the sampled proportions should approach the probabilities in df2.
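One way to check that claim is sketched below. Note that it uses a simpler formulation than the nested code above: a grouped mutate() with a small helper, `sample_gender()`, which is a hypothetical name introduced here. The larger `big_df1` with 100,000 rows is also invented for the check.

```r
library(dplyr)

set.seed(42)

df2 <- data.frame(
  Area        = c("A", "A", "B", "B", "C", "C", "D", "D"),
  gender      = c("M", "F", "M", "F", "M", "F", "M", "F"),
  probability = c(0.4, 0.6, 0.55, 0.45, 0.6, 0.4, 0.5, 0.5)
)

# Hypothetical larger df1: 100,000 people spread over the four areas
n_people <- 100000
big_df1 <- data.frame(
  Person_ID = seq_len(n_people),
  Area      = sample(c("A", "B", "C", "D"), n_people, replace = TRUE)
)

# Draw n genders using the probabilities listed in df2 for one area
sample_gender <- function(area, n) {
  probs <- df2[df2$Area == area, ]
  sample(probs$gender, n, replace = TRUE, prob = probs$probability)
}

# One vectorised draw per area instead of one draw per row
out <- big_df1 %>%
  group_by(Area) %>%
  mutate(gender = sample_gender(Area[1], n())) %>%
  ungroup()

# Observed share of "M" per area, to compare against df2$probability
check <- out %>%
  group_by(Area) %>%
  summarise(share_M = mean(gender == "M"))
```

The `share_M` column should land close to 0.4, 0.55, 0.6 and 0.5 for areas A to D, with small deviations from sampling noise.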

Efficiently summarizing and transforming a table of data using tidyverse functions

I have a relatively large data file that looks like (a), and need to create a structure like (b). That is, I need to calculate the sum of amount times coefficient for each ID and each year.
I quickly hacked something together using nested for loops, but that's of course terribly inefficient:
library(tidyverse)
data <- tibble(
id=c("A", "B", "C", "A", "A", "B", "C"),
year=c(2002,2002,2004,2002,2003,2003,2005),
amount=c(1000,1500,1000,500,1000,1000,500),
coef=rep(0.5,7)
)
years <- sort(unique(data$year))
ids <- unique(data$id)
result <- matrix(0,length(ids),length(years)) %>%
as.tibble() %>% setNames(., years)
for (i in seq_along(ids)){
for (j in seq_along(years)){
d <- filter(data, id==ids[i] & year== years[j])
if (nrow(d)!=0){
result[i,j] <- sum(d$amount*d$coef)
}
}
}
result <- add_column(result, ID=ids, .before = 1)
I was wondering how one could solve this efficiently using map(), group_by() or any other tidyverse functions.
Thanks in advance for helpful suggestions.
Here's one way that seems to work. I'm sure there are others.
library(tidyverse)
id <- c("A", "B", "C", "A", "A", "B", "C")
year <- c(2002,2002,2004,2002,2003,2003,2005)
amount <- c(1000,1500,1000,500,1000,1000,500)
coef <- rep(0.5,7)
data <- tibble(id, year, amount, coef)
table <- data %>%
group_by(., id, year) %>%
mutate(prod = amount*coef)%>%
summarize(., sumprod = sum(prod)) %>%
spread(., year, sumprod) %>%
replace(is.na(.), 0)
Thanks for the hint, this really is just one line:
result <- data %>% group_by(id, year) %>% summarise(S=sum(amount*coef)) %>% spread(year, S)
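For newer versions of tidyr, where spread() is superseded, the same reshaping can be sketched with pivot_wider(). This is a variant of the one-liner rather than the posters' own code; `values_fill = 0` plays the role of the replace(is.na(.), 0) step, and the arrange() calls are only there to give a predictable column and row order.

```r
library(dplyr)
library(tidyr)

data <- tibble(
  id     = c("A", "B", "C", "A", "A", "B", "C"),
  year   = c(2002, 2002, 2004, 2002, 2003, 2003, 2005),
  amount = c(1000, 1500, 1000, 500, 1000, 1000, 500),
  coef   = rep(0.5, 7)
)

result <- data %>%
  group_by(id, year) %>%
  summarise(S = sum(amount * coef), .groups = "drop") %>%
  arrange(year) %>%  # columns are created in order of first appearance
  pivot_wider(names_from = year, values_from = S, values_fill = 0) %>%
  arrange(id)
```

For this data, A and B each have 750 in 2002 and 500 in 2003, while C has 500 in 2004 and 250 in 2005; every other cell is 0.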

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df using experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do the step 4?
freq_count = df %>% dplyr::group_by(experiment, count) %>% summarize(freq=n()) %>% na.omit() %>% mutate(freq_per=freq/sum(freq)*100)
Thank you very much.
There may be a more concise approach but I would suggest collapsing your count in a new column using mutate() and ifelse() and then summarising:
freq_count %>%
  mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
  group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var; the argument was `add` before dplyr 1.0.0
  summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
  select(-collapsed_count) # dropped to match your df_1.
Also, just fyi, for step 2 you might consider the count() function if you're keen to save some keystrokes. Also tibble() or data.frame() are likely better options than calling the dataframe method of cbind explicitly to create a data frame.
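Putting both suggestions together, a minimal sketch might collapse the counts first and then let count() do the tallying. This reorders the steps compared with the answer above (collapse, then count, then percentages), so treat it as one possible variant rather than the answer's exact method.

```r
library(dplyr)

df <- data.frame(
  experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
  count      = c(1, 2, 3, 4, 5, 1, 2, 1)
)

df_1 <- df %>%
  mutate(count = pmin(count, 3)) %>%           # collapse every count >= 3 into 3
  count(experiment, count, name = "freq") %>%  # rows per experiment/count pair
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>%
  ungroup() %>%
  select(-count)
```

For the example data this gives freq values 1, 1, 3 for experiment A and 2, 1 for experiment B, matching the desired output in the question.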

overlapping groups in dplyr

I'm trying to calculate "rolling" summary statistics based on a grouping factor. Is there a nice way to process by (overlapping) groups, based on (say) an ordered factor?
As an example, say I want to calculate the sum of val by groups
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
val = rnorm(7))
For groups based on grp, it's easy:
df %>% group_by(grp) %>% summarise(total = sum(val))
# result:
grp total
1 a 1.6388
2 b 0.7421
3 c 1.1707
However, what I want to do is calculate "rolling" sums for successive groups ("a" & "b", then "b" & "c", etc.). The desired output would be something like this:
grp1 grp2 total
1 a b 1.6388
2 b c 0.7421
I'm having trouble doing this in dplyr. In particular, I can't seem to figure out how to get "overlapping" groups - the "b" rows in the above example should end up in two output groups.
Try lag:
df %>%
  group_by(grp) %>%
  arrange(grp) %>%
  summarise(total = sum(val)) %>%
  mutate(grp1 = lag(grp), grp2 = grp, total = total + lag(total)) %>%
  select(grp1, grp2, total) %>%
  na.omit()
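To see what the lag approach produces, here is a small sketch with fixed values in place of rnorm() so the totals are easy to follow. The values and the result name `pair_totals` are invented for illustration, and filter() is used instead of na.omit to drop the first row, which has no predecessor.

```r
library(dplyr)

# Fixed values instead of rnorm(), so the sums can be checked by hand
df <- data.frame(
  grp = c("a", "a", "b", "b", "c", "c", "c"),
  val = c(1, 2, 3, 4, 5, 6, 7)
)

pair_totals <- df %>%
  group_by(grp) %>%
  summarise(total = sum(val)) %>%   # per-group totals: a = 3, b = 7, c = 18
  mutate(
    grp1  = lag(grp),               # previous group
    grp2  = grp,                    # current group
    total = total + lag(total)      # sum of the two adjacent groups
  ) %>%
  filter(!is.na(grp1)) %>%          # the first group has no predecessor
  select(grp1, grp2, total)
```

Here the pair a and b totals 10 and the pair b and c totals 25. The approach relies on the summarised rows being in the desired group order, and it only covers windows of two adjacent groups.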
