How to mutate for loop in dplyr - r

I want to create multiple lag variables for a column in a data frame, for a range of values. I have code below that successfully does what I want, but it is not scalable for what I need (hundreds of iterations):
Lake_Lag <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
  group_by(StationID, Test) %>%
  arrange(StationID, Test, VisitDate) %>%
  mutate(lag.Result1 = dplyr::lag(Result, n = 1, default = NA)) %>%
  mutate(lag.Result5 = dplyr::lag(Result, n = 5, default = NA)) %>%
  mutate(lag.Result10 = dplyr::lag(Result, n = 10, default = NA)) %>%
  mutate(lag.Result15 = dplyr::lag(Result, n = 15, default = NA)) %>%
  mutate(lag.Result20 = dplyr::lag(Result, n = 20, default = NA))
I would like to be able to use a list c(1,5,10,15,20) or a range 1:150 to create lagging variables for my data frame.

Here's an approach that makes use of some 'tidy eval helpers' included in dplyr that come from the rlang package.
The basic idea is to create a new column in mutate() whose name is based on a string supplied by a for-loop.
library(dplyr)

grouped_data <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
  group_by(StationID, Test) %>%
  arrange(StationID, Test, VisitDate)

for (lag_size in c(1, 5, 10, 15, 20)) {
  new_col_name <- paste0("lag_result_", lag_size)
  grouped_data <- grouped_data %>%
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}
The !!sym(new_col_name) := construct is a dynamic way of writing lag_result_1 =, lag_result_5 =, etc. when using functions like mutate() or summarize() from the dplyr package.
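If you prefer to avoid the explicit loop, the same pattern can be wrapped in a small helper and folded over an arbitrary vector of lag sizes with purrr::reduce(). This is only a minimal sketch, assuming the grouped_data object from above and that purrr is installed; add_lag is a hypothetical helper name:
library(dplyr)
library(purrr)

# hypothetical helper: adds one lag column named after its lag size
add_lag <- function(data, lag_size) {
  new_col_name <- paste0("lag_result_", lag_size)
  data %>%
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}

# fold the helper over any vector of lag sizes, e.g. 1:150
grouped_data <- reduce(c(1, 5, 10, 15, 20), add_lag, .init = grouped_data)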

We can use shift from data.table, which can take multiple values for n. According to ?shift:
n - Non-negative integer vector denoting the offset to lead or lag the input by. To create multiple lead/lag vectors, provide multiple values to n
Convert the 'data.frame' to 'data.table' (setDT), order by 'StationID', 'Test', 'VisitDate' in i, and, grouped by 'StationID' and 'Test', get the lag (the default type of shift is "lag") of 'Result' with n as a vector of values, then assign (:=) the output to a vector of column names (created with paste0).
library(data.table)
i1 <- c(1, 5, 10, 15, 20)
setDT(Lake_Champlain_long.term_monitoring_1992_2016)[order(StationID, Test, VisitDate),
  paste0("lag.Result", i1) := shift(Result, n = i1),
  by = .(StationID, Test)][]
NOTE: this should be a much more efficient solution.
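To see why a single := call is enough, here is a small self-contained illustration; the toy data.table and the lag sizes 1 and 3 are made up for demonstration:
library(data.table)

# shift() returns a list of vectors when n has several values,
# so one := call can create all the lag columns at once
dt <- data.table(Result = 1:8)
dt[, paste0("lag.Result", c(1, 3)) := shift(Result, n = c(1, 3))]
dt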

Related

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
  mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
  mutate(ones = 1)

joined = combinations_from_2_letters %>%
  inner_join(combinations_from_4_letters, by = "ones") %>%
  # comparison goes here
  mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
  group_by(comb2) %>%
  summarise(freq = sum(within))
You'll probably need to modify it to match your exact column names and comparison condition; a fuller sketch using the default column names follows after the key ideas below.
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair
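Here is a more complete sketch of the same idea, assuming a recent dplyr and the default V1 to V4 / V1 to V2 column names that as.data.frame() gives the gtools output; comb4, comb2 and pair_counts are just shorthand names for this illustration:
library(dplyr)
library(gtools)

v1 <- LETTERS[1:10]

comb4 <- as.data.frame(combinations(n = 10, r = 4, v = v1), stringsAsFactors = FALSE)
comb4$group <- rep(1:15, each = 14)
comb2 <- as.data.frame(combinations(n = 10, r = 2, v = v1), stringsAsFactors = FALSE)

# dummy key for a complete cross-join, then test whether both letters of
# the pair occur among the four letters of each combination
pair_counts <- comb2 %>%
  mutate(ones = 1) %>%
  inner_join(mutate(comb4, ones = 1), by = "ones",
             suffix = c(".pair", ".comb")) %>%
  rowwise() %>%
  mutate(within = all(c(V1.pair, V2.pair) %in% c(V1.comb, V2.comb, V3, V4))) %>%
  ungroup() %>%
  group_by(V1.pair, V2.pair, group) %>%
  summarise(freq = sum(within), .groups = "drop")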

A particular syntactic construct in R

In
Orders1=Orders[Datecreated<floor_date(send_Date,unit='week',week_start = 7)-weeks(PrevWeek),
.(Previous_Sales=sum(Sales)),
by=.(Category,send_Date=floor_date(send_Date,unit='week',week_start = 7))]
What does the . in .(Previous_Sales=sum(Sales)) mean? This is some syntactic nuance with which I am not familiar.
Also, what does by=.(Category,s....
Can someone help?
Here the . is an alias for list() in data.table. It is creating a summarised output column:
.(Previous_Sales=sum(Sales))
Or with list
list(Previous_Sales=sum(Sales))
In dplyr, similar syntax would be
summarise(Previous_Sales = sum(Sales))
and for creating a column/modifying an existing column use
mutate(Previous_Sales = sum(Sales))
With data.table, updating/creating a column is done with :=
Previous_Sales := sum(Sales)
Similarly, the by also would be a list of column names
by = list(Category, send_Date = floor_date(send_Date, unit = 'week', week_start = 7))
which we can also use
by = .(Category, send_Date = floor_date(send_Date, unit = 'week', week_start = 7))
In the context of data.table, the syntax is consistent in the order
dt[i, j, by]
where i is where we specify the row condition for subsetting, j is where we apply functions on one or more columns, and by gives the grouping columns. Using a simple example with iris:
as.data.table(iris)[Sepal.Length < 5, .(Sum = sum(Sepal.Width)), by = Species]
Here the i is Sepal.Length < 5: it selects only the rows meeting that condition, so only the 'Sepal.Width' values in those rows are summed, and because the by option is provided, the sum of 'Sepal.Width' is computed for each 'Species', resulting in a 3-row output (there are 3 unique 'Species'). We can also do this without the i option by doing the subsetting in j itself:
as.data.table(iris)[, .(Sum = sum(Sepal.Width[Sepal.Length < 5])), by = Species]
For summarisation, both of these are fine, but if we do an assignment (:=), the results would differ:
as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]
This creates a column 'Sum' and fills in the sum values only where Sepal.Length < 5; all other rows will be NA. If we do the second option
as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]
there won't be any NA elements, because the subsetting is done inside j to create a single sum value for each 'Species'.
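A quick way to see the difference, using only iris and data.table:
library(data.table)

# compare the two := variants on iris
dt1 <- as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]
dt2 <- as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]

anyNA(dt1$Sum)  # TRUE: rows with Sepal.Length >= 5 are left as NA
anyNA(dt2$Sum)  # FALSE: every row gets its group's sum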

Applying mutate to multiple columns and rows in dplyr

A pretty simple question but has me dumbfounded.
I have a table and am trying to round each column to 2 decimal places using mutate_all (or another dplyr function). I know this can be done with certain apply functions, but I like the dplyr/tidyverse framework.
DF = data.frame(A = seq(from = 1, to = 2, by = 0.0255),
B = seq(from = 3, to = 4, by = 0.0255))
Rounded.DF = DF%>%
mutate_all(funs(round(digits = 2)))
This does not work however and just gives me a 2 in every column. Thoughts?
You need a "dot" in the round function. The dot is a placeholder for where mutate_all should place each column that you are trying to manipulate.
Rounded.DF = DF%>%
mutate_all(funs(round(., digits = 2)))
To make it more intuitive you can write the exact same thing as a custom function and then reference that function inside the mutate_all:
round_2_dgts <- function(x) {round(x, digits = 2)}
Rounded.DF = DF%>%
mutate_all(funs(round_2_dgts))
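As an aside, funs() is deprecated in recent dplyr releases; assuming dplyr >= 1.0, the same rounding can be written with across():
library(dplyr)

# round every column of DF to 2 decimal places using across()
Rounded.DF <- DF %>%
  mutate(across(everything(), ~ round(.x, digits = 2)))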

Mutate to Create Minimum in Each Row

I have a question relating to creating a minimum value in a new column in dplyr using the mutate function based off two other columns.
The following code repeats the same value for each row in the new column. Is there a way to create an independent minimum for each row in the new column? I wish to avoid using loops or the apply family due to speed and would like to stick with dplyr if possible. Here's the code:
a = data.frame(runif(5,0,5))
b = data.frame(runif(5,0,5))
c = data.frame(runif(5,0,5))
y = cbind(a,b,c)
colnames(y) = c("a","b","c")
y = mutate(y, d = min(y$b, y$c))
y
The new column "d" is simply a repeat of the same number. Any suggestions on how to fix it so that it's the minimum of "b" and "c" in each row?
Thank you for your help.
We can use pmin
y$d <- with(y, pmin(b, c))
Or
transform(y, d = pmin(b,c))
Or with dplyr
library(dplyr)
y %>%
mutate(d = pmin(b,c))
min collapses its input to a single value rather than working by row; if we still want to use min, an option would be
y %>%
rowwise %>%
mutate(d = min(unlist(c(b,c))))
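To make the difference concrete, here is a tiny made-up example showing that min() collapses everything to one value while pmin() works element by element:
# toy data (made up) to contrast min() and pmin()
z <- data.frame(b = c(3, 1, 4), c = c(2, 5, 0))
min(z$b, z$c)   # one value for the whole data: 0
pmin(z$b, z$c)  # one value per row: 2 1 0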
You could make the min function apply by rows rather than columns by using the apply function and setting the margin argument to MARGIN = 1. Your rowwise min function would look like this:
apply(y, MARGIN = 1, FUN = function(x) min(x))
Then, in order to make the rowwise min function only apply to columns b and c, you could use the select function within mutate, like this:
y %>% mutate(b.c.min =
y %>%
select(one_of("b", "c")) %>%
apply(MARGIN = 1, FUN = function(x) min(x)))

dplyr show all rows and columns for small data.frame inside a tbl_df

How do I force dplyr to show all columns and rows of a rather small data.frame? The ddf object below, for example:
df = data.frame(a = rnorm(100),
                b = c(rep('x', 50), rep('y', 50)),
                c = sample(1:20, 100, replace = T),
                d = sample(letters, 100, replace = T),
                e = sample(LETTERS, 100, replace = T),
                f = sample("asdasdasdasdfasdfasdfasdfasdfasdfasdfasd asdfasdfsdfsd", 100, replace = T))
ddf = tbl_df(df)
If you want to still use dplyr and print your whole data frame, just run
print.data.frame(ddf)
ddf
Ah, I was getting frustrated with dplyr, which is why I could not see it. The solution is simple: as.data.frame(ddf). That converts the dplyr-backed data frame back to a generic data.frame.
You can use the print function and set the n parameter to control the number of rows to show. For example, the following commands will show 20 rows.
print(ddf, n = 20)
You can also use the typical dplyr pipe syntax.
ddf %>% print(n = 20)
If you want to show all rows, you can use n = Inf (infinity).
print(ddf, n = Inf)
ddf %>% print(n = Inf)
From the docs:
You can control the default appearance with options:
options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.
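Putting the print() arguments and these options together (assuming the ddf object from above):
# print all rows and columns of ddf once, without changing global options
print(ddf, n = Inf, width = Inf)

# or set session-wide tibble defaults
options(tibble.print_max = Inf, tibble.width = Inf)
ddf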
