Renaming a Summarised Column inside Redshift dplyr operations

I am using dplyr to do certain operations in Redshift so I don't load the data in memory.
data <- tbl(conn, "customers") %>%
filter(age >= 18)
subset <- data %>%
filter(eye_color != "brown") %>%
group_by(gender, method, age, region) %>%
summarise(sum(purchases)) %>% # will create a column called sum(purchases)
full_join(data, by=c("region", "age", "method"))
Right now, when I look at the resulting dataframe, I see a column called sum(purchases). I want to rename it to purchases, so that the join creates the columns purchases.x and purchases.y.
Most of the renaming advice I've read so far deals with dataframes that are in memory rather than dataframes that are lazily evaluated with dbplyr. I have tried rename, rename_ and rename_at, as well as different variations of select. I have also tried the strategies laid out here and here, but no luck.
Is there a way to rename the sum(purchases) column? The only other option I have is to load the dataframe into memory at a certain step:
data <- tbl(conn, "customers") %>%
filter(age >= 18)
subset <- data %>%
filter(eye_color != "brown") %>%
group_by(gender, method, age, region) %>%
summarise(sum(purchases))
loaded <- as.data.frame(subset)
# do some join here but in memory and not in Redshift
# full_join(data, by=c("region", "age", "method"))

You can assign names in summarise. I don't have your data so I can't triple-check, but I've used this in my own code before when calling summarise(n()). Something like...
summarise(your_column_name = sum(purchases))
You can also give it a column name containing spaces; you just have to wrap the name in backticks:
summarise(`your column name` = sum(purchases))
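As a minimal sketch of the naming approach above, here is the same pipeline shape run on a small in-memory data frame. The `customers` data below is made up for illustration, and the assumption is that a named summarise behaves the same way on a lazy dbplyr table; that part is not verified here against Redshift.

```r
library(dplyr)

# Made-up stand-in for the remote "customers" table (assumption for illustration)
customers <- data.frame(
  region    = c("north", "north", "south"),
  age       = c(30, 30, 45),
  method    = c("card", "card", "cash"),
  purchases = c(10, 5, 7)
)

# Naming the summary column up front avoids a column called sum(purchases)
totals <- customers %>%
  group_by(region, age, method) %>%
  summarise(purchases = sum(purchases)) %>%
  ungroup()

# Both inputs now have a `purchases` column, so the join adds .x / .y suffixes
joined <- full_join(totals, customers, by = c("region", "age", "method"))
names(joined)
```

Because both sides of the join carry a column named purchases, dplyr disambiguates them with its default suffixes.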

Related

Creating Groups based on Column Position

Good afternoon!
I think this is a pretty straightforward question, but I think I am missing a couple of steps. I would like to create groups based on column position.
I am working with a dataframe / tibble that is 33 rows long and 66 columns wide. However, every sequence of 6 columns should really be separated into its own sub-dataframe / tibble.
The sequence of the number columns is arbitrary to the dataframe. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
tibble() %>%
group_by(across(seq(1,2, length.out = 11))) %>%
nest()
However, that method generates errors. Something similar applies when working just within nest() as well.
Using mtcars, would like to create groups using a sequence for every 3 columns, or some other number.
Would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc... while retaining the information for the rows in each column.
Also considered something with pivot_longer...
mtcars %>%
tibble() %>%
pivot_longer(cols = seq(1,3, by = 1))
...but that did not generate defined groups, or continue the sequencing along all columns of the dataframe.
Hope one of you can help me with this! Would make certain tasks for work much easier.
PS - A plus if you can keep the workflow to tidyverse centric code :)
You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)
list_of_dataframes <- mtcars %>%
tibble() %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
group_by(row) %>%
mutate(group = ceiling(row_number()/ 3)) %>%
ungroup() %>%
group_split(group) %>%
map(
~select(.x, row, name, value) %>%
pivot_wider()
)
EDIT
Here, based on comments from the question asker, we will avoid pivoting the data. Instead, we map the groups across the dataframe.
list_of_dataframes <- map(seq(1, ncol(mtcars), by = 3),
~mtcars %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+2, ncol(mtcars))))))
We can then wrap this in a function to make it a little easier to use and change group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group){
map(seq(1, ncol(.data), by = ncols_per_group),
~.data %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+ncols_per_group-1, ncol(.data))))))
}
list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
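For comparison, a base R sketch of the same column-chunking idea, not part of the answer above: `split.default()` treats the data frame as a list of columns and splits it by a grouping vector built from the column positions. The names `cols_per_group`, `group_id` and `column_groups` are made up for this example.

```r
# Number of columns per group (3, as in the mtcars example)
cols_per_group <- 3

# Group index for each column position: 1 1 1 2 2 2 3 3 3 4 4 for mtcars
group_id <- ceiling(seq_along(mtcars) / cols_per_group)

# split.default() splits the columns (not the rows) of the data frame
column_groups <- split.default(mtcars, group_id)

# Each element is a data frame holding up to 3 columns and all rows
lapply(column_groups, names)
```

The last group simply holds whatever columns are left over, just as the min() call does in the function above.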

How to modify a dataframe nested inside a list without re-assignment

I have a list object that contains several elements. One element is a data frame that I wish to modify: I want to perform some operations such as column renaming and mutating new columns.
Although one simple way is to extract the nested data frame, modify it, and finally re-assign the output to the original parent list, I'd like to avoid such a solution because it requires intermediate assignment.
Example
Data. Let's build a list of several data objects
library(tibble)
my_list <- lst(letters, mtcars, co2, uspop, iris)
Task.
I want to modify my_list$mtcars to:
rename the cyl column
compute a new column that takes a square root of values in mpg column
I want to modify my_list$iris to:
select columns that start with sepal
rename them to lowercase
and ultimately I expect to get back a list object that is identical to the original my_list, except for the changes I made for mtcars and iris.
My attempt. Right now, the only way I know to achieve this involves re-assignment:
library(dplyr)
my_list$mtcars <-
my_list$mtcars %>%
rename("Number of cylinders" = cyl) %>%
mutate(sqrt_of_mpg = sqrt(mpg))
my_list$iris <-
my_list$iris %>%
select(starts_with("Sepal")) %>%
rename_with(tolower)
My question: Given my_list, how could I point to a nested element by its name, specify which actions should happen to modify it, and get back the parent my_list with just those modifications?
I imagine some sort of a pipe that looks like this (just to get my general idea)
## DEMO ##
my_list %>%
update_element(which = "mtcars", what = rename, mutate) %>%
update_element(which = "iris", what = select, rename)
Thanks!
You can try purrr's modify_at function
library(tidyverse)
my_list %>%
modify_at("mtcars", ~rename(.,"Number of cylinders" = cyl) %>%
mutate(sqrt_of_mpg = sqrt(mpg))) %>%
modify_at("iris", ~select(., starts_with("Sepal")) %>%
rename_with(tolower))
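If you want something closer to the pipe imagined in the question, the modify_at call can be wrapped in a small helper. This is only a sketch: `update_element` is the hypothetical name from the question's demo, not an existing purrr function, and it takes a single function or formula rather than a list of verbs.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: apply `.f` to the element(s) named in `which`
# and return the whole list with everything else untouched
update_element <- function(.list, which, .f) {
  modify_at(.list, which, .f)
}

my_list <- list(letters = letters, mtcars = mtcars, iris = iris)

result <- my_list %>%
  update_element("mtcars", ~ .x %>%
    rename(`Number of cylinders` = cyl) %>%
    mutate(sqrt_of_mpg = sqrt(mpg))) %>%
  update_element("iris", ~ .x %>%
    select(starts_with("Sepal")) %>%
    rename_with(tolower))

names(result$iris)
```

No intermediate assignment to my_list$mtcars or my_list$iris is needed; the helper returns the full parent list each time.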
You can also use imap, which passes the name along with the data for each iteration, although this is further from your general idea.
library(dplyr)
my_list <- purrr::imap(my_list, ~{
if(.y == 'mtcars')
.x %>% rename("Number of cylinders" = cyl) %>%mutate(sqrt_of_mpg = sqrt(mpg))
else if(.y == 'iris')
.x %>% select(starts_with("Sepal")) %>% rename_with(tolower)
else .x
})

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

Trying to perform the basic Summarise() function but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
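To make the difference between the three calls concrete, here is a small sketch on made-up data (the `readings` data frame and its values are assumptions for illustration, not the asker's files):

```r
library(dplyr)

# Made-up data: ID 1 has three rows but only two distinct dates
readings <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03")
)

# n() takes no arguments and counts the rows in each group
rows_per_id <- readings %>%
  group_by(ID) %>%
  summarise(complete_cases = n())

# count() is shorthand for group_by() + summarise(n = n())
rows_per_id_short <- readings %>% count(ID)

# n_distinct() counts the unique values of a column in each group
dates_per_id <- readings %>%
  group_by(ID) %>%
  summarise(distinct_dates = n_distinct(Date))
```

Here ID 1 gets a row count of 3 but a distinct-date count of 2, which is the distinction the answer draws.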
Of note, you could have written your import code with purrr::map, whose variants are more predictable than the *apply family because the suffix specifies the return type. It could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr binds the results by row into a single data.frame; map_dfc does the same by column.

How to loop through date variable names and sum by group?

I have some time series data where there are a few region variables and the rest of the variable names are all dates. I am trying to loop through the entire list of date variables and sum each of them, but I am unsure how to do it using dplyr syntax. This is what I have so far:
library(dplyr)
library(lubridate)
library(data.table)
library(curl)
# county level
covid_jhu <- as.data.frame(fread(paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")))
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
filter(Admin2 != "") %>%
mutate(FIPS = substr(as.character(UID), 4, 8))
jhu_state <- covid_jhu %>%
group_by(Province_State) %>%
mutate(`1/22/20` = sum(`1/22/20`))
I can't seem to figure out the loop here even though I seem to be able to get it right for 1 variable.
Here is a potential method to perform the desired grouping. The key is to convert the wide data frame from the source into a long format.
library(dplyr)
library(tidyr)
# county level
covid_jhu <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
filter(Admin2 != "") %>%
mutate(FIPS = substr(as.character(UID), 4, 8))
#convert from wide to long
long_covid_jhu<-pivot_longer(covid_jhu, cols=starts_with("X"), names_to = "Date")
long_covid_jhu$Date <- as.Date(long_covid_jhu$Date, format="X%m.%d.%y")
#grouping by state
long_covid_jhu %>%
group_by(Province_State) %>% summarize(TotalCases=sum(value))
#grouping by date
long_covid_jhu %>%
group_by(Date) %>% summarize(TotalCases=sum(value))
#grouping by state & date
long_covid_jhu %>%
group_by(Province_State, Date) %>% summarize(TotalCases=sum(value))
You may also want to look at functions like
group_by_all,
group_by_ (this takes a variable name as a string rather than a hard-coded column name, so you can keep passing column names as input in a loop).
Similarly, there are mutate_ and summarise_ functions as well. Note that the underscore-suffixed verbs are deprecated in current dplyr releases.
As I understand the question, reading a little about these should solve your problem.
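As an alternative to the scoped and underscore-suffixed verbs mentioned above, dplyr 1.0.0 and later provide across(), which can sum every date column per group without reshaping or looping. The sketch below uses a tiny made-up table (`cases`) in place of the JHU file, and the regular expression for date-like column names is an assumption about how the columns are named.

```r
library(dplyr)

# Made-up wide table: one region column, the rest are date-named columns.
# check.names = FALSE keeps names like "1/22/20" intact.
cases <- data.frame(
  Province_State = c("A", "A", "B"),
  `1/22/20` = c(1, 2, 3),
  `1/23/20` = c(4, 5, 6),
  check.names = FALSE
)

# Sum every column whose name looks like m/d/yy within each state
state_totals <- cases %>%
  group_by(Province_State) %>%
  summarise(across(matches("^[0-9]+/[0-9]+/[0-9]+$"), sum))

state_totals
```

Using summarise rather than mutate collapses the result to one row per state, which is usually what a state-level total calls for.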

Loop through aggregation using dplyr

I've tried to find the answer to this probably-obvious question. I have multiple predictor variables that I need to loop through in order to get a summary of another column for each predictor. The data frame will change with every iteration, so I need code that works for multiple different data frames. Here are the places I've looked so far:
R- producing a summary calculation for each column that is dependent on aggregations at a factor level
Multiple data frame handling
Using the mtcars dataset, this is what I've tried:
#get mtcars data from graphics package
install.packages("graphics")
library(graphics)
data <- mtcars
#loop through names
variable <- list(colnames(data))
for(i in variable){
data1 <- data %>%
group_by(i)
summarise('number' = mean(mpg))
}
However, I get the following error:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `i` is unknown
Not sure where to go next.
There are multiple issues in the code:
1) 'variable' is unnecessarily created as a list
2) because of 1, the loop runs once over the whole vector of names instead of once per column name
3) group_by_at can be used in place of group_by for string inputs
4) the chain operator (%>%) is missing between group_by and summarise
5) the output should be stored in a list, otherwise it is overwritten on each iteration because we assign to the same object 'data1'
The code below makes these corrections:
variable <- colnames(data) #is a `vector` now
data1 <- list() # initialize as a `list`
for(i in variable){
data1[[i]] <- data %>%
group_by_at(i) %>% #changed to `group_by_at`
summarise(number = mean(mpg))
}
Alternatively, this can be done in tidyverse syntax with purrr::map, which returns the output as a list of tibbles and avoids initializing the list and assigning into it:
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg)))
If we need to bind the list elements, use bind_rows. However, because the first column name differs in each element, this also creates one column per grouping variable, filled with NA where it does not apply:
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg))) %>%
purrr::set_names(variable) %>%
bind_rows(., .id = 'variable')
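In dplyr 1.0.0 and later, group_by_at is superseded; a hedged sketch of the same loop using across() with all_of() for string column names follows. The `predictors` vector is limited to two columns here purely to keep the example short.

```r
library(dplyr)
library(purrr)

# Column names to group by, supplied as strings
predictors <- c("cyl", "gear")

# set_names() names the vector by its own values, so the result is a named list
summaries <- map(set_names(predictors), ~ mtcars %>%
  group_by(across(all_of(.x))) %>%
  summarise(number = mean(mpg)))

summaries$cyl
```

Each list element is a tibble with one row per level of the grouping column and the mean mpg in `number`.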
