Using sapply in conjunction with dplyr's mutate_at

I am trying to clean up some data stored in multiple data frames, applying the same function to each. In this example I want to use mutate_at from dplyr to convert every column whose name contains 'date' to Date format.
I have a list of tables in my environment such as:
table_list <- c('table_1','table_2','table_3')
The objective is to overwrite each of the tables named in table_list with its corrected version. Instead, I can only get the results stored in a large list.
I have so far created a basic function as follows:
fix_dates <- function(df_name) {
  get(df_name) %>%
    mutate_at(vars(contains('date')),
              # funs() is soft-deprecated since dplyr 0.8;
              # newer code would use list(~ as.Date(., origin = "1899-12-30"))
              funs(as.Date(., origin = "1899-12-30")))
}
The fix_dates() function works perfectly well if I feed it one element at a time, for example fix_dates('table_1').
However, if I use sapply, e.g. results <- sapply(table_list, fix_dates), then results is a list containing all the corrected tables at their respective indexes. What I would like instead is the effect of table_1 <- fix_dates('table_1') for each element of table_list.
Is it possible to have sapply store the results in place instead?

There's probably a more elegant way to do this, but I think this gets you where you want to go:
# use lapply to get a list with the transformed versions of the data frames
# named in table_list
new_tables <- lapply(table_list, function(x) {
  mutate_at(get(x), vars(contains("date")),
            funs(as.Date(., origin = "1899-12-30")))
})
# assign the original df names to the elements of that list
names(new_tables) <- table_list
# use list2env to put those list elements in the current environment, overwriting
# the ones that were there before
list2env(new_tables, envir = environment())
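A minimal, self-contained sketch of the same pattern with invented toy tables (base R only; a plain as.Date() conversion stands in for the mutate_at() step, and the column names are made up for illustration):

```r
# Two toy tables whose numeric columns hold Excel-style day offsets
table_1 <- data.frame(id = 1:2, start_date = c(43466, 43831))
table_2 <- data.frame(id = 3:4, end_date   = c(43500, 43900))
table_list <- c("table_1", "table_2")

# Convert every column whose name contains "date" to Date
fix_dates <- function(df) {
  date_cols <- grepl("date", names(df))
  df[date_cols] <- lapply(df[date_cols], as.Date, origin = "1899-12-30")
  df
}

# mget() returns a *named* list, so the names come along for free
new_tables <- lapply(mget(table_list), fix_dates)

# Overwrite the originals in the current environment
list2env(new_tables, envir = environment())
class(table_1$start_date)  # "Date"
```

Note that mget() already names the list after the objects it fetched, so the separate names<- step is only needed when you looped over the names with lapply(table_list, ...) directly.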

Related

Adding rows to dataframe using lapply

I am trying to extract data from a PDF and then enter it as a row in a data frame. I have figured out how to extract the data I want, but I haven't worked out the last two parts yet. I've set up a basic function to try with lapply, and it gives me a 1-row, 39-observation data frame with the information I want as properly formatted character values:
filenames <- list.files("C:/Users/.../inputfolder", pattern="*.pdf")
function01 <- function(x) {
  df1 <- pdf_text(x) |>
    str_squish() |>
    mgsub() |>
    etc
}
master_list <- lapply(filenames, function01)
mdf <- as.data.frame(do.call(rbind, master_list))
So right now this works for one PDF, and I'm not quite sure how to make it apply to all files in the folder properly and add the data to the rows of mdf.
You can use purrr::map_dfr.
This function calls function01 on every element of filenames, then binds the resulting data frames together by row, so each iteration contributes its row(s) to the output:
library(purrr)
master_list <- map_dfr(filenames, function01)
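A self-contained sketch of the row-binding behaviour, using a hypothetical stand-in for function01 (the actual pdf_text extraction is not reproduced here, and fake_function01 is invented purely for illustration):

```r
# Stand-in for function01: returns a 1-row data frame per "file"
fake_function01 <- function(x) data.frame(file = x, n_chars = nchar(x))

filenames <- c("a.pdf", "bb.pdf")

# Base-R equivalent of map_dfr: apply, then row-bind
mdf <- do.call(rbind, lapply(filenames, fake_function01))
nrow(mdf)  # 2
```

map_dfr(filenames, fake_function01) would produce the same data frame; the purrr version is just the apply-then-rbind pattern in one call.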

Automatically name the elements of a list after importing using lapply

I have a list of dataframes which I imported using
setwd("C:path")
fnames <- list.files()
csv <- lapply(fnames, read.csv, header = T, sep=";")
I will need to do this multiple times, creating more lists, and I would like to keep all the data frames available separately (i.e. I don't want or need to combine them); I simply used the above code to import them all quickly. But accessing them now is a little cumbersome and unintuitive (to me at least). Rather than having to use [[1]] to access the first element, is there a way to amend the first bit of code so that I can name the elements of the list, for example based on a Date variable present in each of the data frames in the list? The dates are stored as chr in the format "dd-mm-yyyy", so I could potentially just name the data frames using the dd-mm part of the Date variable.
You can extract the required part of the first value in the Date column of each data frame and assign the results as the names of the list:
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
Or extract the data using regex.
names(csv) <- sapply(csv, function(x) sub("(\\w+-\\w+).*", "\\1", x$Date[1]))
We can use
names(csv) <- sapply(csv, function(x) substring(x$Date[1], 1, 5))
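A runnable sketch with invented toy data frames in place of the imported csv files, showing how the first five characters of Date ("dd-mm") become the list names:

```r
# Toy stand-ins for the data frames read in with lapply(fnames, read.csv, ...)
csv <- list(
  data.frame(Date = c("01-02-2020", "02-02-2020"), x = 1:2),
  data.frame(Date = "15-03-2020", x = 3)
)

# Name each element after the dd-mm part of its first Date value
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
names(csv)  # "01-02" "15-03"

# Elements can now be accessed by name instead of [[1]]
csv[["01-02"]]
```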

Is there a way to extract a data frame from a list, and assign the data frame to an object with a dynamic name?

I have a list containing many named data frames. I am trying to find a way to extract each data frame from this list. Ultimately, the goal is to assign each data frame in the list to an object according to the name that it has in the list, allowing me to reference the data frames directly instead of through the list (eg. dataframe instead of LIST[[dataframe]])
Here is an example similar to what I am working with.
library(googlesheets4)
install.packages("dplyr")
library(dplyr)
gs4_deauth()
TABLES <- list("Test1", "Test2")
readTable <- function(TABLES) {
  TABLES <- range_read(as_sheets_id("SHEET ID"), sheet = TABLES)
  TABLES <- as.data.frame(TABLES)
  TABLES <- TABLES %>%
    transmute(Column1 = as.character(Column1), Column2 = as.character(Column2))
  return(TABLES)
}
LIST <- lapply(TABLES, readTable)
names(LIST) <- TABLES
I know that this could be done manually, but I'm trying to find a way to automate this process. Any advice would be helpful. Thanks in advance.
If named_dfs is a named list where each element is a data frame, you can use the assign function to achieve your goal:
Map(assign, names(named_dfs), named_dfs, pos = 1)
For each name, it assigns (equivalent to the <- operator) the corresponding data frame object; pos = 1 targets the global environment.
Map(function(x, y) assign(x, y, envir = globalenv()), names(named_dfs), named_dfs)
Should also work.
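A self-contained sketch of the Map-plus-assign pattern with two invented toy data frames (the names df_a and df_b are made up for illustration):

```r
# A named list standing in for LIST from the question
named_dfs <- list(df_a = data.frame(x = 1),
                  df_b = data.frame(x = 2))

# Walk name/value pairs and create one global object per list element;
# invisible() just suppresses the list Map returns
invisible(Map(function(nm, df) assign(nm, df, envir = globalenv()),
              names(named_dfs), named_dfs))

exists("df_a")  # TRUE
```

list2env(named_dfs, envir = globalenv()) does the same thing in one call, as in the first answer above.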

Loop to create one dataframe from multiple URLs

I have a character vector with multiple URLs that each host a csv of crime data for a certain year. Is there an easy way to create a loop that will read.csv and rbind all the dataframes without having to run read.csv 8 times over? The vector of URLs is below
urls <- c('https://opendata.arcgis.com/datasets/73cd2f2858714cd1a7e2859f8e6e4de4_33.csv',
'https://opendata.arcgis.com/datasets/fdacfbdda7654e06a161352247d3a2f0_34.csv',
'https://opendata.arcgis.com/datasets/9d5485ffae914c5f97047a7dd86e115b_35.csv',
'https://opendata.arcgis.com/datasets/010ac88c55b1409bb67c9270c8fc18b5_11.csv',
'https://opendata.arcgis.com/datasets/5fa2e43557f7484d89aac9e1e76158c9_10.csv',
'https://opendata.arcgis.com/datasets/6eaf3e9713de44d3aa103622d51053b5_9.csv',
'https://opendata.arcgis.com/datasets/35034fcb3b36499c84c94c069ab1a966_27.csv',
'https://opendata.arcgis.com/datasets/bda20763840448b58f8383bae800a843_26.csv'
)
The function map_dfr from the purrr package does exactly what you want. It applies a function to every element of an input (in this case urls) and binds the results together by row.
library(tidyverse)
map_dfr(urls, read_csv)
I used read_csv() instead of read.csv() out of personal preference but both will work.
In base R:
result <- lapply(urls, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, result)
I usually take this approach, as I want to keep all the csv files separately available in case I later need to do further analysis on each of them. Otherwise you don't need a for loop.
for (i in seq_along(urls)) assign(paste0("mycsv-", i), read.csv(url(urls[i]), header = TRUE))
df.list <- mget(ls(pattern = "^mycsv-"))
# use plyr if the data frames have different column names and you need to
# know which row comes from which csv file
library(plyr)
df <- ldply(df.list)  # you can remove the first column if you wish
# alternative in base R instead of plyr: if they have the same column names
# and you only want rbind, you can do this:
df <- do.call("rbind", df.list)
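A base-R sketch of the same row-binding with invented toy data frames, including a source column mirroring what plyr::ldply's id column gives you:

```r
# Toy stand-ins for the data frames read from the urls
df.list <- list(`mycsv-1` = data.frame(a = 1:2),
                `mycsv-2` = data.frame(a = 3))

# cbind a src column onto each data frame, then row-bind them all
df <- do.call(rbind, Map(cbind, src = names(df.list), df.list))
df$src  # "mycsv-1" "mycsv-1" "mycsv-2"
```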

subset multiple data tables using lapply

I have multiple data tables and all have a common column called ID. I have a vector vec that contains a set of ID values.
I would like to use lapply to subset all the data tables using vec.
I understand how to use lapply to subset the data tables, but my question is how to assign the subsetted results back to the original data tables.
Here is what I tried :
tables<-c("dt1","dt2","dt3","dt4")
lapply(mget(tables), function(x) x[ID %in% vec, ])
The above gives subsets of all the data tables, but how do I assign them back to dt1, dt2, dt3 and dt4?
I would keep the datasets in the list rather than updating the dataset objects in the global environment, as most operations can be done within the list (including reading the files and writing output files). But if you insist, we can use list2env, which will update the original dataset objects with the subset datasets:
lst <- lapply(mget(tables), function(x) x[ID %in% vec, ])
list2env(lst, envir=.GlobalEnv)
You could also just name the datasets in the list:
tables <- c("dt1","dt2","dt3","dt4")
dflist <- lapply(mget(tables), function(x) x[ID %in% vec, ])
dflist <- setNames(dflist, tables)
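A runnable sketch of the subset-and-overwrite pattern with invented toy tables. Plain data frames stand in for the data.tables here, so the subset uses x$ID rather than data.table's unquoted x[ID %in% vec]:

```r
# Toy stand-ins for the four data.tables
dt1 <- data.frame(ID = 1:4, val = letters[1:4])
dt2 <- data.frame(ID = 3:6, val = letters[3:6])
tables <- c("dt1", "dt2")
vec <- c(2, 3)

# Subset each table (mget returns a named list, so names are preserved)
lst <- lapply(mget(tables), function(x) x[x$ID %in% vec, ])

# Overwrite the original objects with their subsets
list2env(lst, envir = .GlobalEnv)
nrow(dt1)  # 2
```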
