For Loop across several data frames or tibbles - r

I have 5 tibbles for successive years 2016 to 2020. I am doing the same thing to each of the sets of tibbles so I want to use a for-loop rather than copying and pasting the same code 5 times. I have named the tibbles in the following way with the final number indicating the year of the data:
alpha_20
beta_20
gamma_20
delta_20
epsilon_20
My thought was to do this:
for (i in 16:20) {
alpha_a_[i]<-alpha_[i]%>%
mutate(NEWVAR=1+OLDVAR)%>%
select(NEWVAR, VAR2, VAR3)
beta_a_[i]<-beta_[i]%>%
group_by(PIN)%>%
summarize(sum(VAR1))
# and so on for all 5 tibbles
}
But I think I am not calling the tibble correctly because the code breaks at the first mutate. I can't seem to figure out how to instruct it to take the tibbles ending in "16" and then the tibbles ending in "17" and so on.

There are a couple of things going on here. First, to refer to a tibble by its name, you need to call get() on the name as a string. Try typing "alpha_20" vs. get("alpha_20") at the console: the first is just a string, the second returns the tibble. However, alpha_[i] as you have coded it won't generate the string you want. To build the name, use paste0("alpha_", i), and to fetch the tibble, wrap that in get(): get(paste0("alpha_", i)).
That's all just to get the tibble you want. To edit/save it within the for loop, look into the assign() function (see Change variable name in for loop using R). So all in all, your code will look something like this:
> require(tidyverse)
> alpha_20 <- data.frame(x = 1:5, y = 6:10)
> alpha_20
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
>
> for (i in 20) {
+ assign(paste0('alpha_', i),
+ get(paste0('alpha_', i)) %>%
+ mutate(z = 11:15))
+ }
> alpha_20
x y z
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
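To make the difference between a name and the object it refers to concrete, here is a minimal sketch built on the same toy alpha_20. The variable nm and the derived column z are illustrative names only, not part of the question.

```r
alpha_20 <- data.frame(x = 1:5, y = 6:10)
i <- 20
nm <- paste0("alpha_", i)   # builds the string "alpha_20"

is.character(nm)            # TRUE: nm is only the name
is.data.frame(get(nm))      # TRUE: get() looks the name up and returns the object

# assign() is the mirror image of get(): it binds a value to a name given as a string
assign(paste0("alpha_a_", i), transform(get(nm), z = x + y))
alpha_a_20$z
```

The pair get()/assign() is what lets the loop index i appear inside an object name.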

You can try a combination of get, assign and paste.
for (i in 16:20) {
alpha <- get(paste("alpha_", i, sep = "")) %>%
mutate(NEWVAR = 1 + OLDVAR) %>%
select(NEWVAR, VAR2, VAR3)
assign(paste("alpha_a_", i, sep = ""), alpha)
beta <- get(paste("beta_", i, sep = "")) %>%
group_by(PIN) %>%
summarize(sum(VAR1))
assign(paste("beta_a_", i, sep = ""), beta)
# and so on for all 5 tibbles
}
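An alternative that avoids get() and assign() altogether is to keep the yearly tibbles in a named list and apply one function to every element. The sketch below is only an illustration under assumptions: the two small data frames stand in for the real alpha_16 and alpha_17, and base R is used inside the function where the question uses mutate() and select().

```r
# Hypothetical stand-ins for the question's yearly tibbles
alpha_16 <- data.frame(OLDVAR = 1:3, VAR2 = 4:6, VAR3 = 7:9)
alpha_17 <- data.frame(OLDVAR = 11:13, VAR2 = 14:16, VAR3 = 17:19)

years <- 16:17

# mget() fetches several objects by name and returns them as a named list
alpha_list <- mget(paste0("alpha_", years))

# Apply the same transformation to every year
alpha_a <- lapply(alpha_list, function(df) {
  df$NEWVAR <- 1 + df$OLDVAR
  df[, c("NEWVAR", "VAR2", "VAR3")]
})
names(alpha_a) <- paste0("alpha_a_", years)

alpha_a$alpha_a_17
```

With this layout, adding another year means extending years rather than editing the loop body, and the results stay together in one object.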

Related

Using mutate with a stored list of formulas over specified columns

This is a follow-up to my previous question here, which @ronak_shah was kind enough to answer. I apologize as some of this information may be redundant to anyone who saw that post, but I figured it best to post a new question rather than modify the previous version.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep on getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help, it would be much appreciated!
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
b = 6:10,
c = 11:15)
## function
func1 <- function(x, y) {x + y}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
"y_val" = c(3, 4),
stringsAsFactors = FALSE) %>%
mutate(func = paste0("func1(", col, ",", y_val, ")"))
## desired end result
dat %>%
mutate(a_new = func1(a, 3),
c_new = func1(c, 4))
## attempting to generate new columns based on @ronak_shah's answer to my previous
## question but fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
dat[col_mod$col], col_mod$func)
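We can use pmap from purrr to loop over the rows of col_mod. Inside the lambda, ..1 is the column name from 'col', ..2 is the value from 'y_val', and ..3 is the function name from 'func'. For each row, transmute applies the function to the column and assigns (:=) the result to a new column whose name is built with str_c (or paste); bind_cols then attaches the new columns to the original dataset. Note that 'func' is first reset to the bare function name 'func1' so that match.fun can find it.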
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
transmute(!! str_c(..1, "_new") :=
match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to evaluate the stored call as it is, i.e. without changing the func column so that it remains func1(a,3) and func1(c,4), use parse_expr and eval:
pmap(col_mod, ~ dat %>%
transmute(!! str_c(..1, "_new") :=
eval(rlang::parse_expr(..3)))) %>%
bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map
dat[paste0(col_mod$col, '_new')] <- do.call(Map, c(f =
function(x, y, z) eval(parse(text = z), envir = dat), unname(col_mod)))
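The error in the question arises because match.fun() looks for a function whose name is literally the string "func1(a,3)". One way around it, sketched here in base R under the assumption that col_mod can store the bare function name rather than a full call, is to keep the function name and its arguments in separate columns and let Map() combine them.

```r
dat <- data.frame(a = 1:5, b = 6:10, c = 11:15)
func1 <- function(x, y) x + y

# Assumed layout: one row per new column, with the function stored by name only
col_mod <- data.frame(col = c("a", "c"),
                      y_val = c(3, 4),
                      func = "func1",
                      stringsAsFactors = FALSE)

# For each row: look up the function by name, then call it on the column and y_val
new_cols <- Map(function(col, y, f) match.fun(f)(dat[[col]], y),
                col_mod$col, col_mod$y_val, col_mod$func)

dat[paste0(col_mod$col, "_new")] <- unname(new_cols)
dat
```

Storing names and arguments as data keeps eval(parse()) out of the picture, at the cost of only supporting calls of the form f(column, y_val).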

Add new column to data frames in list

I have a set of data frames named df_1968, df_1969, df_1970, ..., df_2016 collected in a list called my_list.
I want to add a new column to each of these data frames which is simply the current year (1968 in df_1968 and so on). I've managed to do it by looping through the data frames, but I am looking for a neater solution. I've tried the following:
# Function to extract year from name of data frames
substrRight <- function(y, n) {
substr(y, nchar(y) - n + 1, nchar(y))
}
# Add variable "year" equal to 1968 in df_1968 and so on
my_list <- lapply(my_list, function(x) cbind(x, year <- as.numeric(substrRight(names(x), 4 ))))
However this throws the error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing numbers of rows: 18878, 7
I can see that the way I assign the value to the variable probably does not make sense but can't wrap my head around how to do it instead. Help appreciated.
Note that the substrRight function seems to be working perfectly fine and that
as.numeric(substrRight(names(x), 4 ))
yields the vector of years 1968-2016
This works in Base-R
years <- sub(".*([0-9]{4}$)","\\1",names(my_list))
new_list <- lapply(1:length(years), function(x) cbind(my_list[[x]],year=years[x]))
names(new_list) <- names(my_list)
with this self-made example data
df_1968 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1969 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1970 = data.frame(a=c(1,2,3),b=c(4,5,6))
my_list <- list(df_1968,df_1969,df_1970)
names(my_list) <- c("df_1968","df_1969","df_1970")
I get this output
> new_list
$df_1968
a b year
1 1 4 1968
2 2 5 1968
3 3 6 1968
$df_1969
a b year
1 1 4 1969
2 2 5 1969
3 3 6 1969
$df_1970
a b year
1 1 4 1970
2 2 5 1970
3 3 6 1970
The following function will loop through a named list of data frames and create a column year from the last 4 characters of the list's names.
I have simplified the function substrRight a bit. Since it's the last characters that are needed, it uses substring, with no need for a last character position.
substrRight <- function(y, n) {
substring(y, nchar(y) - n + 1)
}
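# sapply(..., simplify = FALSE) returns a list and keeps the names,
# whereas lapply() over names(my_list) would return an unnamed list
my_list <- sapply(names(my_list), function(x){
  my_list[[x]][["year"]] <- as.numeric(substrRight(x, 4))
  my_list[[x]]
}, simplify = FALSE)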
Data creation code.
my_list <- lapply(1968:1970, function(i) data.frame(a = 1:5, b = letters[1:5]))
names(my_list) <- paste("df", 1968:1970, sep = "_")
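The error in the question is most likely explained by the fact that, inside lapply(my_list, function(x) ...), x is a single data frame, so names(x) returns its column names (presumably 7 of them) rather than the name of the list element. A minimal sketch of an alternative is to iterate over the data frames and their years in parallel with Map(); the example data below are made up.

```r
# Made-up example data: a named list of small data frames
my_list <- lapply(1968:1970, function(i) data.frame(a = 1:3, b = 4:6))
names(my_list) <- paste0("df_", 1968:1970)

# Extract the year from each list name by dropping everything up to the underscore
years <- as.numeric(sub(".*_", "", names(my_list)))

# Map() walks over data frames and years together and keeps the list names
my_list <- Map(function(df, yr) {
  df$year <- yr
  df
}, my_list, years)

my_list$df_1969
```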

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. However, the column names I need to pass to distinct will change, so I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the amount of times the data is filtered. (and, adds an extra column. I could live with that, though...) So, how to obtain a similar result, as if I had used a,b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen to c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquotes(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
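For the case where gen is a character vector read from colnames(), a shorter route may be to convert the strings to symbols with rlang::syms() and splice them into distinct() with !!!. This also suggests why eval(parse(text = gen)) misbehaved: parse() produces two expressions, eval() evaluates both and returns only the last one (b), so distinct() deduplicates on b alone and adds the computed expression as an extra column. The sketch below assumes dplyr and rlang are installed and adds a base R equivalent for comparison.

```r
library(dplyr)

data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# Column names arrive as strings, e.g. read from another data frame
gen <- colnames(data)[1:2]

# syms() turns each string into a symbol; !!! splices them in as separate arguments
filtered_data <- data %>% distinct(!!!rlang::syms(gen), .keep_all = TRUE)
dim(filtered_data)

# Base R equivalent: keep the first row of each combination of the gen columns
base_filtered <- data[!duplicated(data[gen]), ]
dim(base_filtered)
```

Both calls keep the first row for each distinct combination of a and b, which matches the result of writing distinct(a, b, .keep_all = TRUE) by hand.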

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop, that creates a tibble at the end of each iteration, tbl. Loop uses different date each time, date.
Assume:
tbl <- tibble(colA=1:5,colB=5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
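Putting the list variation into the context of a loop might look like the sketch below. It rests on assumptions: the dates vector and the placeholder tbl are made up, a plain data.frame stands in for the tibble, and the suffix uses format(date, "%Y%m") (for example "201702") instead of year and month name so that the result does not depend on the locale.

```r
# Made-up loop dates; in the real code each iteration computes its own tbl
dates <- as.Date(c("2017-02-28", "2017-03-31"))

outputs <- list()
# Iterate over positions: for (d in dates) would strip the Date class
for (k in seq_along(dates)) {
  tbl <- data.frame(colA = 1:5, colB = 5:9)   # placeholder for the real result
  ym <- format(dates[k], "%Y%m")              # e.g. "201702"
  outputs[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
}

names(outputs)
names(outputs$output201702)
```

Each iteration adds one renamed data frame to outputs, so nothing needs to be created in the global environment by name.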

Sorting a column in descending order in R excluding the first row

I have a dataframe with 5 columns and a very large dataset. I want to sort by column 3. How do you sort everything after the first row? (When calling this function I want to end it with nrows)
Example output:
Original:
4
7
9
6
8
New:
4
9
8
7
6
Thanks!
If I'm correctly understanding what you want to do, this approach should work:
z <- data.frame(x1 = seq(10), x2 = rep(c(2,3), 5), x3 = seq(14, 23))
zsub <- z[2:nrow(z),]
zsub <- zsub[order(-zsub[,3]),]
znew <- rbind(z[1,], zsub)
Basically, snip off the rows you want to sort, sort them in descending order on column 3, then reattach the first row.
And here's a piped version using dplyr, so you don't clutter the workspace with extra objects:
library(dplyr)
z <- z %>%
slice(2:nrow(z)) %>%
arrange(-x3) %>%
rbind(slice(z, 1), .)
You might try this single line of code to modify the third column in your data frame df as described (decreasing = TRUE is needed because sort() is ascending by default):
df[,3] <- c(df[1,3],sort(df[-1,3], decreasing = TRUE))
df$x[-1] <- df$x[-1][order(df$x[-1], decreasing=T)]
# x
# 1 4
# 2 9
# 3 8
# 4 7
# 5 6
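If whole rows should move together rather than only the values in one column, the same idea can be expressed as a row index. This is a minimal base R sketch; the data frame and its column names (id, g, x) are invented for illustration, with x in the third position holding the values from the question.

```r
# Invented example: x is the third column and holds the question's values
df <- data.frame(id = 1:5, g = letters[1:5], x = c(4, 7, 9, 6, 8))

# Row 1 stays first; the remaining rows are ordered by column 3, descending.
# order() works on rows 2..n, so 1 is added to map back to original row numbers.
idx <- c(1, 1 + order(df[-1, 3], decreasing = TRUE))
sorted <- df[idx, ]

sorted$x
sorted$id
```

Indexing with df[idx, ] reorders every column consistently, so the other four columns stay attached to their original values in column 3.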
