mutate list of dataframes with another list - r

I've got a list of 17 dataframes and a list of 17 dates. They are ordered and correspond to each other; in other words, list_of_dfs[[1]] corresponds to dates[[1]] and so forth. The list of dates, below, contains Date objects created with lubridate::ymd.
> dates
[1] "2004-10-10" "2005-10-10" "2006-10-10" "2007-10-10" "2008-10-10" "2009-10-10" "2010-10-10" "2011-10-10" "2012-10-10" "2013-10-10" "2014-10-10" "2015-10-10" "2016-10-10" "2017-10-10"
[15] "2018-10-10" "2019-10-10" "2020-10-10"
I would like to mutate a subset of variables in each dataframe such that I am subtracting the subset from the corresponding object in dates. For example, I could do the following for the first item.
list_of_dfs[[1]] <- list_of_dfs[[1]] %>%
  mutate_at(.vars = vars(contains('string')),
            .funs = funs(dates[[1]] - .))
Is there a way to incorporate the above into a map or lapply-like command that will allow me to iterate through dates?
My closest approximation would be something like
list_of_dfs <- list_of_dfs %>%
  map(., function(x) mutate_at(x,
                               .vars = vars(contains('string')),
                               .funs = funs(dates - .)))
but .funs can't take a list object as shown above.

We can use map2, as we are doing the subtraction with the corresponding elements of the 'dates' list:
library(dplyr)
library(purrr)
list_of_dfs2 <- map2(list_of_dfs, dates, ~ {
  date <- .y
  .x %>%
    mutate_at(vars(contains('string')), ~ date - as.Date(.))
})
In the devel version of dplyr, across can be used along with mutate
list_of_dfs2 <- map2(list_of_dfs, dates, ~ {
  date <- .y
  .x %>%
    mutate(across(contains('string'), ~ date - as.Date(.x)))
})
data
list_of_dfs <- list(data.frame(string1 = Sys.Date() - 1:6, string2 = Sys.Date()),
                    data.frame(string1 = Sys.Date() - 1:6, string2 = Sys.Date()))
dates <- Sys.Date() + 1:2
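For completeness, the same pairing can be done in base R with Map; a minimal sketch, assuming every column matched by 'string' holds Date values:
# Map pairs each data frame with its corresponding date
list_of_dfs2 <- Map(function(dat, d) {
  cols <- grep('string', names(dat), value = TRUE)
  dat[cols] <- lapply(dat[cols], function(x) d - as.Date(x))
  dat
}, list_of_dfs, dates)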

Related

R - Regular Expressions (Regex) with a list of Data Frames (only first match)

So, I'm the happy owner of a list of 17,246 data frames and need to extract 3 pieces of data from each of them:
To whom the job was given.
The standard code that describes what kind of job it is (Ex. "00" inside this "12-00.07").
The date on which it was assigned.
Each data frame contains data about just one worker.
But the data is entered differently in each file: it always starts with "Worker:" + a name or number identification.
So I can find that data with a regular expression targeting "Worker:".
I can also target the first string that matches a date pattern: "dd/dd/dd".
The desired output is a df with 3 columns (“Worker”, “Code”, “Date”) and then unite all dfs into one.
In order to achieve this end, I find myself with three problems:
a) The information is presented in no order (I cannot subset specific rows).
b) The intended worker and code are substrings inside other characters.
c) More than one date is present in each df and I only want the first match. All other dates are misleading.
The input is this:
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
The desired outcome is this:
library(dplyr)
n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")
n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")
df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)
PDF_list <- list(df1, df2)
one_df <- bind_rows(PDF_list)
So far, I've managed to write this poor excuse for code. It doesn't select the substrings and it cheats to get the desired date:
library(tidyverse)
library(tidyr)
library(stringr)
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
for(num in 1:length(PDF_list)){
  worker <- filter(PDF_list[[num]], grepl("Worker:\\s*?(\\w.+)", text))
  code <- filter(PDF_list[[num]], grepl("-(\\d{2}).+", text))
  date <- filter(PDF_list[[num]], grepl("^\\d{2}/\\d{2}.+", text))
  if(nrow(date) > 1){
    date <- date[1,1]
  }
  t_list <- cbind(worker, code, date)
  names(t_list) <- c("name", "code", "date")
  PDF_list[[num]] <- t_list
}
rm(worker, code, date, t_list)
one_df <- bind_rows(PDF_list)
View(one_df)
Any help? Thanks!
A method using tidyverse:
Loop over the list with map, arranging the rows of each data frame so that the row containing 'Worker:' becomes the top row.
Bind the list elements into a single dataset with the _dfr suffix of map, while creating a grouping index by specifying .id.
Group by the 'grp' column.
Use summarise to create a summarised output with the first 'date' matching the pattern two digits, /, two digits, /, two digits, from the start (^) to the end ($) of the string elements in the 'text' column.
The first element becomes 'name' after removing the substring 'Worker:' and any spaces with str_remove.
Similarly, we extract the 'code' by capturing the second group of digits from elements made up of digits separated by - or ..
library(dplyr)
library(stringr)
library(purrr)
PDF_list %>%
  map_dfr(~ .x %>%
            arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>%
  group_by(grp) %>%
  summarise(date = first(text[str_detect(text, "^\\d{2}/\\d{2}/\\d{2}$")]),
            name = str_remove(first(text), "Worker:\\s*"),
            code = str_replace(text[str_detect(text, '^\\d+-(\\d+)[.-]\\d+$')],
                               "^\\d+-(\\d+)[.-]\\d+$", "\\1"),
            .groups = 'drop') %>%
  select(name, code, date)
output
# A tibble: 4 x 3
  name   code  date
  <chr>  <chr> <chr>
1 Joseph 00    06/01/21
2 Joseph 11    06/01/21
3 3rd    08    05/01/21
4 3rd    57    05/01/21
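For comparison, a rough base R sketch of the same extraction, using the same regexes as above (the real 17,246 files may need more defensive patterns):
# per data frame: first 'Worker:' line, all code lines, first date line
res <- lapply(PDF_list, function(dat) {
  txt <- dat$text
  name <- sub("Worker:\\s*", "", grep("Worker:", txt, value = TRUE)[1])
  code <- sub("^\\d+-(\\d+)[.-]\\d+$", "\\1",
              grep("^\\d+-\\d+[.-]\\d+$", txt, value = TRUE))
  date <- grep("^\\d{2}/\\d{2}/\\d{2}$", txt, value = TRUE)[1]
  data.frame(name = name, code = code, date = date) # name/date recycle per code
})
one_df <- do.call(rbind, res)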

How to apply functions over a list of list of dataframes?

I've got a list state_list which contains 4 lists (wa, tex, cin and ohi), all of which contain around 60 dataframes. I want to apply the same functions to these dataframes. For example, I want to add a new column with a mean, like this:
library(dplyr)
df # example df from one of the lists
df %>% group_by(x) %>% mutate(mean_value = mean(value))
How can I do this?
We can use a nested map to loop over the list
library(purrr)
library(dplyr)
out <- map(state_list, ~ map(.x, ~ .x %>%
                                    group_by(x) %>%
                                    mutate(mean_value = mean(value))))
Or using base R
out <- lapply(state_list, function(lst1) lapply(lst1,
function(dat) transform(dat, mean_value = ave(value, x))))
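A minimal check with made-up data (the state names and the x/value columns here are placeholders matching the question's description):
library(dplyr)
library(purrr)
set.seed(42)
make_df <- function() data.frame(x = rep(c("g1", "g2"), each = 3),
                                 value = rnorm(6))
# two states with two data frames each, standing in for 4 x ~60
state_list <- list(wa = list(make_df(), make_df()),
                   tex = list(make_df(), make_df()))
out <- map(state_list, ~ map(.x, ~ .x %>%
                                    group_by(x) %>%
                                    mutate(mean_value = mean(value))))
str(out, max.level = 2) # 2 lists of 2 grouped tibbles, each with mean_value added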

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only with split. It should be faster than filtering each unique group with ==:
with(df, split(id, group))
Or with tidyverse we can pull the column after group_split. group_split returns a list of data.frames/tibbles and could be slower than the split-only method above, but here we can make some performance improvements by removing the group column (keep = FALSE) and then pulling the 'id' column from each list element to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>%
  map(~ .x %>%
        pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap the split inside with
df %>%
with(., split(id, group))
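To check the speed claims on your own data rather than take them on faith, a benchmarking sketch; microbenchmark is just one choice of timing tool:
library(dplyr)
library(purrr)
library(microbenchmark)
# compare the three approaches on the same df
microbenchmark(
  filter_loop = lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id),
  base_split  = with(df, split(id, group)),
  group_split = df %>% group_split(group, keep = FALSE) %>% map(~ pull(.x, id)),
  times = 100
)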

Return list of columns containing data outside a predetermined range in r

In order to filter a data.frame for only the columns of interest, I need to find the columns in this data.frame containing data outside a specific range.
Let the data.frame be
df<-data.frame(x1=c(1,5,9),x2=c(10,20,30),x3=c(20,100,1000))
ranges<-data.frame(y1=c(3,8),y2=c(10,20), y3=c(15,1250))
As an output I'd like the colnames: "x1", "x2".
I tried the following, but the code works only if "ranges" contains all the numbers as specified below, and matches whenever a number is found. That's unfortunately not what I need.
ranges<-c(15:300,10:20)
df.l<-colnames(df)[sapply(df,function(x) any(x %in% ranges))]
Any ideas?
Thanks!
If 'ranges' is a data.frame or list, one option is
names(which(unlist(Map(function(x, y) any(!(x >= y[1] & x <= y[2])), df, ranges))))
#[1] "x1" "x2"
Or use the reverse logic
names(which(unlist(Map(function(x, y) any(x < y[1]| x > y[2]), df, ranges))))
Or in tidyverse,
library(purrr)
library(dplyr)
library(tibble)
map2(df, ranges, ~ between(.x, .y[1], .y[2]) %>% `!` %>% any) %>%
  enframe %>%
  unnest(cols = value) %>%
  filter(value) %>%
  pull(name)
#[1] "x1" "x2"
data
ranges <- data.frame(y1 = c(3, 8), y2 = c(10, 20), y3 = c(15, 1250))
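Another base R angle, sketched with vapply, assuming the columns of ranges line up positionally with the columns of df:
# TRUE for each column that has at least one value outside its range
keep <- vapply(seq_along(df),
               function(i) any(df[[i]] < ranges[[i]][1] | df[[i]] > ranges[[i]][2]),
               logical(1))
names(df)[keep]
#[1] "x1" "x2"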

Add multiple columns with mutate using column-based conditions, without using explicit column name + POSIX

I have a dataframe: one column is POSIXct timestamps, the rest is data.
I need to remove selectively some data from a group of columns and add these "new" columns to the original dataframe.
I can "easily" do it in base R (I am an old-style user). I'd like to do it more compactly with mutate_at or with other function... although I am having several issues.
A solution homemade with base R could be
df <- data.frame("date" = seq.POSIXt(as.POSIXct(format(Sys.time(),"%F %T"),tz="UTC"),length.out=20,by="min"), "a.1" = rnorm(20,0,3), "a.2" = rnorm(20,1,2), "b.1"= rnorm(20,1,4), "b.2"= rnorm(20,3,4))
df1 <- lapply(df[,grep("^a",names(df))], function(x) replace(x, which(x > 0 & x < 0.2), NA))
df1 <- data.frame(matrix(unlist(df1), nrow = nrow(df), byrow = F)) ## convert to data.frame
names(df1) <- grep("^a",names(df),value=T) ## rename columns
df1 <- cbind.data.frame("date"=df$date, df1) ## add date
Can anyone help me in setting up something working with dplyr + transmute?
So far I come up with something like:
df %>%
  select(starts_with("a.")) %>%
  transmute(
    case_when(
      . > 0.2 ~ NA,
    )
  ) %>%
  cbind.data.frame(df)
But I am quite stuck, since I can't combine transmute with case_when: all the examples I found use the column names explicitly in case_when, but I can't, since I won't know the names of the columns in advance. I will only know the initial letters of the columns that I need to transmute.
Thanks,
Alex
We can use transmute_at if the intention is to return only those columns specified in the vars
library(dplyr)
df %>%
  transmute_at(vars(starts_with('a')), ~ case_when(. > 0.2 ~ NA_real_, TRUE ~ .)) %>%
  bind_cols(df %>% select(date), .)
If we need all the columns to return, but only change the columns of interest in vars, then we need mutate_at instead of transmute_at
df %>%
  mutate_at(vars(starts_with('a')), ~ case_when(. > 0.2 ~ NA_real_, TRUE ~ .)) %>%
  select(date, starts_with('a')) # only needed if we are selecting a subset of columns
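On current dplyr (>= 1.0.0), where mutate_at is superseded, the same idea can be sketched with across; if_else keeps the replacement type-stable:
library(dplyr)
# replace values > 0.2 with NA in every column starting with 'a'
df %>%
  mutate(across(starts_with('a'), ~ if_else(.x > 0.2, NA_real_, .x))) %>%
  select(date, starts_with('a'))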
