Using dplyr collapse rows taking condition from another numeric column - r

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df by experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate the sum of freq_per for all observations with count >= 3
I have the following code. How do I do step 4?
freq_count = df %>% dplyr::group_by(experiment, count) %>% summarize(freq=n()) %>% na.omit() %>% mutate(freq_per=freq/sum(freq)*100)
Thank you very much.

There may be a more concise approach, but I would suggest collapsing count into a new column using mutate() and ifelse(), and then summarising:
freq_count %>%
  mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
  group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (`.add` replaces `add` as of dplyr 1.0.0)
  summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
  select(-collapsed_count) # dropped to match your df_1.
Also, just FYI: for step 2 you might consider the count() function if you're keen to save some keystrokes. And tibble() or data.frame() are likely better options for creating a data frame than calling the data frame method of cbind explicitly.
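As a rough sketch of the count() suggestion, the whole pipeline could be written as follows. Note that this variant collapses count before counting, which is a different order of steps from the answer above. It assumes dplyr >= 0.8 (for the `name` argument of count()), and the object name `collapsed` is only illustrative.

```r
library(dplyr)

# The question's example data, built with data.frame() instead of cbind.data.frame()
df <- data.frame(
  experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
  count = c(1, 2, 3, 4, 5, 1, 2, 1)
)

collapsed <- df %>%
  mutate(count = pmin(count, 3)) %>%            # treat every count >= 3 as 3
  count(experiment, count, name = "freq") %>%   # rows per experiment/count pair
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>% # percent within each experiment
  ungroup()

collapsed
```

Here the count column is kept, so the row for count 3 stands for "3 or more"; drop it with select(-count) if the output should match df_1 exactly.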

Related

How to use R to merge redundant information?

It's hard to describe what I mean, so here is an example. I have the following data frame:
A 1013574 1014475
A 1014005 1014475
A 1014005 1014435
I want to merge these rows into A 1013574 1014475. Is there a function that can help me achieve this?
My desired output is to have one row for each ID (in my case the value "A"), where the second column contains the smallest value and the third column the highest value for that ID.
This is an updated answer. I think this is what you want. I added extra rows so you can see how it works with more than one ID.
library(dplyr)
df <- tibble(a = c("A", "A", "A", "B", "B", "B"),
             v1 = as.numeric(c(1013574, 1014005, 1014005, 1014005, 1014305, 1044005)),
             v2 = as.numeric(c(1014475, 1014475, 1014435, 1014435, 1014435, 1314435)))
df_new <- df %>%
  group_by(a) %>%
  mutate(v1 = min(v1), v2 = max(v2)) %>%
  distinct()
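An alternative that may be worth considering is summarise(), which returns one row per group directly, so no distinct() step is needed. This is only a sketch on the same example data; the name `df_summary` is made up for illustration.

```r
library(dplyr)

df <- tibble(
  a  = c("A", "A", "A", "B", "B", "B"),
  v1 = c(1013574, 1014005, 1014005, 1014005, 1014305, 1044005),
  v2 = c(1014475, 1014475, 1014435, 1014435, 1014435, 1314435)
)

# One row per ID: smallest v1 and largest v2
df_summary <- df %>%
  group_by(a) %>%
  summarise(v1 = min(v1), v2 = max(v2))

df_summary
```

Unlike the mutate() plus distinct() approach, this drops any columns that are not summarised, which may or may not be what you want.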

Efficiently summarizing and transforming a table of data using tidyverse functions

I have a relatively large data file that looks like (a), and I need to create a structure like (b). That is, I need to calculate the sum of amount times coefficient for each ID and each year.
I quickly hacked something together using nested for loops, but that's of course terribly inefficient:
library(tidyverse)
data <- tibble(
id=c("A", "B", "C", "A", "A", "B", "C"),
year=c(2002,2002,2004,2002,2003,2003,2005),
amount=c(1000,1500,1000,500,1000,1000,500),
coef=rep(0.5,7)
)
years <- sort(unique(data$year))
ids <- unique(data$id)
result <- matrix(0,length(ids),length(years)) %>%
as_tibble() %>% setNames(., years) # as_tibble() replaces the deprecated as.tibble()
for (i in seq_along(ids)){
for (j in seq_along(years)){
d <- filter(data, id==ids[i] & year== years[j])
if (nrow(d)!=0){
result[i,j] <- sum(d$amount*d$coef)
}
}
}
result <- add_column(result, ID=ids, .before = 1)
I was wondering how one could solve this efficiently using map(), group_by() or any other tidyverse functions.
Thanks in advance for helpful suggestions.
Here's one way that seems to work. I'm sure there are others.
library(tidyverse)
id <- c("A", "B", "C", "A", "A", "B", "C")
year <- c(2002,2002,2004,2002,2003,2003,2005)
amount <- c(1000,1500,1000,500,1000,1000,500)
coef <- rep(0.5,7)
data <- tibble(id, year, amount, coef)
table <- data %>%
group_by(., id, year) %>%
mutate(prod = amount*coef)%>%
summarize(., sumprod = sum(prod)) %>%
spread(., year, sumprod) %>%
replace(is.na(.), 0)
Thanks for the hint, this really is just one line:
result <- data %>% group_by(id, year) %>% summarise(S=sum(amount*coef)) %>% spread(year, S)
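In current tidyr, spread() is superseded by pivot_wider(), so the same one-liner might be written as below. This is a sketch assuming dplyr >= 1.0 (for `.groups`) and tidyr >= 1.1 (for a scalar `values_fill`); it also fills the missing id/year combinations with 0, like the replace() step in the longer answer.

```r
library(dplyr)
library(tidyr)

data <- tibble(
  id     = c("A", "B", "C", "A", "A", "B", "C"),
  year   = c(2002, 2002, 2004, 2002, 2003, 2003, 2005),
  amount = c(1000, 1500, 1000, 500, 1000, 1000, 500),
  coef   = rep(0.5, 7)
)

result <- data %>%
  group_by(id, year) %>%
  summarise(S = sum(amount * coef), .groups = "drop") %>% # one row per id/year
  pivot_wider(names_from = year, values_from = S, values_fill = 0)

result
```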

dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. If there is an NA in one variable but the other variable matches, I'd still like those rows to be grouped, with the NA taking on the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't end up in the same group, but I'd like them to, with the NA overridden by "a". How can I achieve this? The code below doesn't do the job.
df2 <- df %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You can group by B first, and then fill in the missing A values. Then proceed with what you wanted to do:
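df_filled = df %>%
  group_by(variable_B) %>%
  # coalesce() replaces only the NA entries; non-missing values such as "f" are kept
  mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A))))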
df_filled %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You could do the missing value imputation using base R as follows:
donor <- df[!is.na(df$variable_A), ] # rows with a known variable_A
ii <- which(is.na(df$variable_A)) # rows to fill
df_filled <- df
df_filled$variable_A[ii] <- donor$variable_A[match(df$variable_B[ii], donor$variable_B)]
Then group and summarize as planned with dplyr:
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))

Drop unused levels from a factor after filtering data frame using dplyr

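I used dplyr functions to create a new data set that contains only the names with fewer than 4 rows.
# stringsAsFactors = TRUE makes name a factor (no longer the default since R 4.0.0)
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9, stringsAsFactors = TRUE)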
aa = df %>%
group_by(name) %>%
filter(n() < 4)
But when I type
table(aa$name)
I get,
a b c
3 2 0
I would like to have my output as follow
a b
3 2
How can I completely separate the new data frame aa from df?
To complete your answer and KoenV's comment: you can write your solution in one line, or apply the function factor(), which also removes the unused levels:
table(droplevels(aa$name))
table(factor(aa$name))
or because you are using dplyr add droplevels at the end:
aa <- df %>%
group_by(name) %>%
filter(n() < 4) %>%
droplevels()
table(aa$name)
# Without using table
df %>%
group_by(name) %>%
summarise(count = n()) %>%
filter(count < 4)
# Or drop the unused levels in a copy of aa
aaNew <- droplevels(aa)
table(aaNew$name)

overlapping groups in dplyr

I'm trying to calculate "rolling" summary statistics based on a grouping factor. Is there a nice way to process by (overlapping) groups, based on (say) an ordered factor?
As an example, say I want to calculate the sum of val by groups
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
val = rnorm(7))
For groups based on grp, it's easy:
df %>% group_by(grp) %>% summarise(total = sum(val))
# result:
grp total
1 a 1.6388
2 b 0.7421
3 c 1.1707
However, what I want to do is calculate "rolling" sums for successive groups ("a" & "b", then "b" & "c", etc.). The desired output would be something like this:
grp1 grp2 total
1 a b 1.6388
2 b c 0.7421
I'm having trouble doing this in dplyr. In particular, I can't seem to figure out how to get "overlapping" groups - the "b" rows in the above example should end up in two output groups.
Try lag:
df %>%
group_by(grp) %>%
arrange(grp) %>%
summarise(total = sum(val)) %>%
mutate(grp1 = lag(grp), grp2 = grp, total = total + lag(total)) %>%
select(grp1, grp2, total) %>%
na.omit
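To make the lag() approach easier to check by hand, here is a small sketch with fixed values in place of rnorm(). The values 1 to 7 are made up for illustration, and filter() is used instead of na.omit to drop the first row, which has no predecessor.

```r
library(dplyr)

df <- data.frame(
  grp = c("a", "a", "b", "b", "c", "c", "c"),
  val = c(1, 2, 3, 4, 5, 6, 7)
)

rolling <- df %>%
  group_by(grp) %>%
  summarise(total = sum(val)) %>%          # per-group sums: a = 3, b = 7, c = 18
  mutate(grp1 = lag(grp),                  # previous group label
         grp2 = grp,
         total = total + lag(total)) %>%   # sum of this group and the previous one
  filter(!is.na(grp1)) %>%                 # the first group has no predecessor
  select(grp1, grp2, total)

rolling
```

This only covers windows of two consecutive groups; wider windows would need further lag() terms or a dedicated rolling function.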
