This question is a follow-up to this thread.
I'd like to perform three actions on a disk.frame:
1. Count the distinct values of the field id grouped by two columns (key_a and key_b).
2. Count the distinct values of the field id grouped by the first of the two columns (key_a).
3. Add a column with the distinct values for the first column divided by the distinct values across both columns.
This is my code:
my_df <-
  data.frame(
    key_a = rep(letters, 384),
    key_b = rep(rev(letters), 384),
    id = sample(1:10^6, 9984)
  )
my_df %>%
  select(key_a, key_b, id) %>%
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
My data is stored as a disk.frame, not a data frame, and it has 100M rows and 8 columns.
I'm following the two-step instructions described in this documentation.
I'm concerned that the collect will crash my machine, since it brings everything into RAM.
Do I have to use collect in order to use dplyr group-bys in disk.frame?
You should always use srckeep to load only those columns you need into memory.
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>% # no need if you use srckeep
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
collect will only bring the results of computing chunk_group_by and chunk_summarize into RAM. It shouldn't crash your machine.
You do have to use collect, just as you would with other systems like Spark.
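As a rough back-of-the-envelope check (the pair and chunk counts below are made-up assumptions, not figures from the question):

# collect() only pulls the per-chunk summaries into RAM:
# at most one row per (key_a, key_b) pair per chunk, not one row per record.
n_pairs  <- 26 * 26 # hypothetical upper bound on distinct (key_a, key_b) pairs
n_chunks <- 60      # hypothetical number of chunks in the disk.frame
n_pairs * n_chunks  # ~40,000 summary rows, tiny compared to 100M input rows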
But if you are computing n_distinct, that can be done in one stage anyway:
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect
If you are really concerned about RAM usage, you can reduce the number of workers to 1:
setup_disk.frame(workers = 1) # run with a single worker to keep RAM usage low

my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect

setup_disk.frame() # restore the default worker setup
Related
I want my data frame to return unique rows based on two logical conditions joined by OR, not AND.
But when I ran this: df %>% group_by(sex) %>% distinct(state, education) %>% summarise(n=n()), I got deduplicated rows based on the two conditions joined by AND, not OR.
Is there a way to get something like df %>% group_by(sex) %>% distinct(state | education) %>% summarise(n=n()), so that the rows are deduplicated based on OR rather than AND?
Thank you.
You can use tidyr::pivot_longer and then distinct afterwards:
df %>%
  pivot_longer(c(state, education), names_to = "type", values_to = "value") %>%
  group_by(sex) %>%
  distinct(value) %>%
  summarise(n = n())
In this case, pivot_longer simply puts state and education into one column called value.
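For illustration, a minimal self-contained sketch; the example df below is made up, not taken from the question:

library(dplyr)
library(tidyr)

# hypothetical data: each person has one state and one education level
df <- tibble(
  sex       = c("F", "F", "M", "M"),
  state     = c("NY", "NY", "CA", "TX"),
  education = c("BA", "MA", "BA", "BA")
)

df %>%
  pivot_longer(c(state, education), names_to = "type", values_to = "value") %>%
  group_by(sex) %>%
  distinct(value) %>%
  summarise(n = n())
# n counts, per sex, each value that appears in either state OR education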
How can I combine these two pieces of code into one instead of having them separate? They work well, but I would like to have them in a single call, i.e. get the total production and the average hours by task and department.
df %>%
  group_by(task, department) %>%
  summarise(across(.cols = c(production),
                   .fns = sum,
                   na.rm = TRUE))

df %>%
  group_by(task, department) %>%
  summarise(across(.cols = c(hours),
                   .fns = mean,
                   na.rm = TRUE))
df %>%
  group_by(task, department) %>%
  summarise(mean = mean(hours, na.rm = TRUE), sum = sum(production, na.rm = TRUE))
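If you prefer to keep the across() style from your original code, both aggregations also fit into a single summarise(); a sketch, assuming dplyr >= 1.0 and the same df:

df %>%
  group_by(task, department) %>%
  summarise(
    across(production, ~ sum(.x, na.rm = TRUE), .names = "{.col}_sum"),
    across(hours, ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"),
    .groups = "drop"
  )
# one row per task/department, with columns production_sum and hours_mean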
I have a huge data set with readings every 30 seconds. First I take the mean to get hourly data, then sum it for daily data, and sum that again for monthly data. I need to assign the result of the mutate pipeline to a new data set / variable called mE_131 for plotting the monthly values. I'm new to this, please help!
library(dplyr)
library(ggplot2)
attach(data)
# filter 131 and 132 into separate data sets
data_131 <- data %>%
  select(time, Column3, m_Pm) %>%
  filter(Column3 == "131")

data_132 <- data %>%
  select(time, Column3, m_Pm) %>%
  filter(Column3 == "132")
data_131%>%
mutate(datehour= format(time,"%Y-%m-%d %H"), date1= format(time,"%Y-%m-%d"), month=format(time,"%Y-%m")) %>%
group_by(datehour) %>% mutate(hourlyP=mean(m_Pm)) %>% distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>% mutate(dailyP=sum(hourlyP)) %>% distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>% summarise(monthlyP=sum(dailyP))
If your goal is to compare monthly data between Column3 == 131 and Column3 == 132, then you don't necessarily need to create a separate dataset for each of them, although I will show you how to do that at the end.
First, let's create the required summary for both 131 and 132:
data <- data %>%
  filter(Column3 == "131" | Column3 == "132") %>% # keep only the required data
  mutate(datehour = format(time, "%Y-%m-%d %H"),  # calculate the required time keys
         date1    = format(time, "%Y-%m-%d"),
         month    = format(time, "%Y-%m")) %>%
  group_by(Column3, datehour) %>%
  mutate(hourlyP = mean(m_Pm)) %>%
  distinct(Column3, datehour, .keep_all = TRUE) %>%
  group_by(Column3, date1) %>%
  mutate(dailyP = sum(hourlyP)) %>%
  distinct(Column3, date1, .keep_all = TRUE) %>%
  group_by(Column3, month) %>%
  summarise(monthlyP = sum(dailyP))
Note: I have written each step on a separate line to improve readability, but it is basically the same as your code shown above, except that Column3 is kept in every group_by so the 131 and 132 series are summarised separately.
Now, let's do the plotting:
data %>%
  ggplot(aes(x = month, y = monthlyP, fill = Column3)) +
  geom_col(position = "dodge") # dodged bars, similar to the plot in your example
If you insist on having a separate dataset for each value in Column3, then you can simply use the assignment operator <- to create a new data frame, as follows:
mE_131 <- data_131 %>%
mutate(datehour= format(time,"%Y-%m-%d %H"),
date1= format(time,"%Y-%m-%d"),
month=format(time,"%Y-%m")) %>%
group_by(datehour) %>%
mutate(hourlyP=mean(m_Pm)) %>%
distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>%
mutate(dailyP=sum(hourlyP)) %>%
distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>%
summarise(monthlyP=sum(dailyP))
Then do the same thing to create mE_132. However, I don't recommend this because it would be harder to plot them.
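If you do go that route anyway, one way to still compare the two series in a single plot is to label and stack the monthly summaries; a sketch, assuming mE_131 and mE_132 were both created as above:

# tag each summary with its series, stack them, and plot side by side
bind_rows(
  mE_131 %>% mutate(Column3 = "131"),
  mE_132 %>% mutate(Column3 = "132")
) %>%
  ggplot(aes(x = month, y = monthlyP, fill = Column3)) +
  geom_col(position = "dodge")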
I would like to create a set of columns with the paper counts for each year, which means applying multiple conditions inside summarise in dplyr.
This is my code:
words_list <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>% # unnest_tokens() comes from the tidytext package
  filter(between(year, 1990, 2017)) %>%
  group_by(word) %>%
  summarise(papers_count = n()) %>%
  arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count'. I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc.) based on each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
Does anybody have any suggestions, please?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)

words_list_by_year <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>%
  filter(between(year, 1990, 2017)) %>%
  group_by(year, word) %>%
  summarise(papers_count = n()) %>%
  spread(year, papers_count, fill = 0)
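As a side note, spread() is superseded in current tidyr; the same reshaping can be written with pivot_wider(), which also lets you prefix the new column names. A sketch under the same assumptions about data:

words_list_by_year <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>%
  filter(between(year, 1990, 2017)) %>%
  count(word, year, name = "papers_count") %>% # count() is shorthand for group_by() + summarise(n())
  pivot_wider(names_from = year,
              values_from = papers_count,
              names_prefix = "papers_count_",
              values_fill = 0)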
I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem is that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into the dplyr chain, which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),pin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66666))
Current code:
library(dplyr)
df %>%
  group_by(site, purchase) %>%
  summarize(bin = sum(purchase == purchase)) %>%
  group_by(site) %>%
  mutate(bin_per = (bin / sum(bin) * 100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
We can start from the output of the first code block (df1). After grouping by a constant 'site' value of 'ALLSITES' together with 'purchase', get the sum of 'bin' and then 'bin_per', and finally row-bind the two datasets with bind_rows:
df1 %>%
  ungroup() %>%
  group_by(site = 'ALLSITES', purchase) %>%
  summarise(bin = sum(bin)) %>%
  ungroup() %>%
  mutate(bin_per = 100 * (bin / sum(bin))) %>%
  bind_rows(df1, .)
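For completeness, a sketch of the full pipeline; here df1 is assumed to be the result of the question's first code block, n() replaces sum(purchase == purchase) since both just count the rows in each group, and the .groups argument assumes dplyr >= 1.0:

library(dplyr)

# per-site counts and percentages (the question's first pipeline)
df1 <- df %>%
  group_by(site, purchase) %>%
  summarise(bin = n(), .groups = "drop_last") %>%
  mutate(bin_per = bin / sum(bin) * 100)

# append the ALLSITES rows computed across every site
df1 %>%
  ungroup() %>%
  group_by(site = "ALLSITES", purchase) %>%
  summarise(bin = sum(bin), .groups = "drop") %>%
  mutate(bin_per = bin / sum(bin) * 100) %>%
  bind_rows(df1, .)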