Adding rows to dataframe using lapply - r

I am trying to extract data from a PDF and then enter it as a row in a dataframe. I have figured out how to extract the data I want, but I haven't been able to figure out the last two parts yet. I've set up a basic function to try with lapply, and it gives me a 1 row, 39 observation dataframe with the information I want, properly formatted as characters:
filenames <- list.files("C:/Users/.../inputfolder", pattern="*.pdf")
function01 <- function(x) {
  df1 <- pdf_text(x) |>
    str_squish() |>
    mgsub() |>
    etc
}
master_list <- lapply(filenames, function01)
mdf <- as.data.frame(do.call(rbind, master_list))
So right now this works for one PDF, and I'm not quite sure how to make it apply to all files in the folder and add the data to the rows of mdf.

You can use purrr::map_dfr.
This function calls function01 on each element of filenames, then returns the combined output as a single data.frame, with a row for every iteration:
library(purrr)
mdf <- map_dfr(filenames, function01)
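If you would rather stay in base R, here is a minimal sketch of the same idea using lapply and do.call(rbind, ...). It assumes each call returns a one-row data frame with the same columns. The function extract_row is a hypothetical stand-in for the PDF-parsing function (it reads small text files created in a temporary folder, so the example runs without any PDFs). Note that list.files() returns bare file names unless full.names = TRUE, so the sketch sets it.

```r
# Hypothetical stand-in for the PDF-parsing function: reads a text file
# and returns a one-row data frame.
extract_row <- function(path) {
  lines <- readLines(path)
  data.frame(
    file = basename(path),
    first_line = lines[1],
    n_lines = length(lines),
    stringsAsFactors = FALSE
  )
}

# Create a temporary input folder with three small files (illustration only).
in_dir <- tempfile("input_")
dir.create(in_dir)
for (i in 1:3) {
  writeLines(c(paste("report", i), "more text"),
             file.path(in_dir, paste0("doc", i, ".txt")))
}

# full.names = TRUE returns paths that can be opened from any working directory.
filenames <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)

# One data frame per file, then stack them row-wise.
master_list <- lapply(filenames, extract_row)
mdf <- do.call(rbind, master_list)
```

Here mdf ends up with one row per input file, in the order returned by list.files().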

Related

When binding together a list of csv tables with purrr as a dataframe, I want to include a tag column conditional on a string in each csv

I am binding together participant raw data from 33 csv files with the following code:
filenames <- list.files(pattern = '*.csv', recursive = TRUE)
result <- purrr::map_df(filenames, read.csv, .id = 'id')
This works great. Now I need to include a tag per participant (csv) in the final dataframe to make clear which of several randomized conditions they were in.
I want to make it conditional on the first word in the first column of each .csv, as each participant got one of several randomized sequences of words.
I thought of something with ifelse() but am not sure how to include it in the code above. I am a total R noob, any help is appreciated!
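I think this should achieve what you're looking for. Read the files into a list first (the tag has to be added before the tables are combined), then bind them:
tables <- lapply(filenames, read.csv)
tables <- lapply(tables, function(x) { x$tag <- x[1, 1]; x })
result <- do.call(rbind, tables)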
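To make the tag conditional with ifelse(), as the question suggests, something along these lines might work. This is only a sketch: the file contents, the word "apple", and the labels "sequence_A" / "sequence_B" are made up for illustration, and tag_condition is a hypothetical helper name.

```r
# Two small example files in a temporary folder (illustration only).
in_dir <- tempfile("csv_")
dir.create(in_dir)
write.csv(data.frame(word = c("apple", "tree"), rt = c(510, 480)),
          file.path(in_dir, "p01.csv"), row.names = FALSE)
write.csv(data.frame(word = c("river", "stone"), rt = c(450, 530)),
          file.path(in_dir, "p02.csv"), row.names = FALSE)

filenames <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into its own data frame and keep the file name as list name.
tables <- lapply(filenames, read.csv, stringsAsFactors = FALSE)
names(tables) <- basename(filenames)

# Hypothetical helper: derive the condition from the first word in column 1.
tag_condition <- function(x) {
  first_word <- x[1, 1]
  x$condition <- ifelse(first_word == "apple", "sequence_A", "sequence_B")
  x
}
tagged <- lapply(tables, tag_condition)

# Add an id column per file, then stack everything into one data frame.
tagged <- Map(function(x, id) { x$id <- id; x }, tagged, names(tagged))
result <- do.call(rbind, tagged)
rownames(result) <- NULL
```

With more than two conditions, a named lookup vector or nested ifelse() calls could replace the single ifelse().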

Function to update dataframe name stored in variable

I need to convert some dates to character format for a project I am working on. To make the code cleaner, I want to write a function that takes the name of the dataframe (and possibly the column name, though in this example it doesn't change, so it can be hard-coded) and does the formatting, rather than repeating the full line for each dataframe whose column I am formatting.
Is this possible to do? I have done a lot of googling and can't seem to find an answer.
kpidataRM$Period <- format(kpidataRM$Period, "%b-%y")
kpidataAFM$Period <- format(kpidataAFM$Period, "%b-%y")
kpidataNATIONAL$Period <- format(kpidataNATIONAL$Period, "%b-%y")
kpidataHOD$Period <- format(kpidataHOD$Period, "%b-%y")
To answer your specific question, you could create a very simple function like this:
# This function takes a dataframe (df) as input and formats the predefined column (Period)
new_function <- function(df) {
  df$Period <- format(df$Period, "%b-%y")
  return(df)
}
and then run
df1 <- new_function(df1)
df2 <- new_function(df2)
for each of your dataframes (in your example df1 would be kpidataRM for instance). If you would like to include the column as a variable as well in your function you can write it like this:
# This function takes a dataframe (df) and a column name as a string (col) and formats that column.
new_function2 <- function(df, col) {
  df[[col]] <- format(df[[col]], "%b-%y")
  return(df)
}
However, I would say that this is not the best approach in this case, as you only seem to want to format a set of columns from a set of dataframes in a specific way. What I would propose instead, exactly as Roland suggested, is to make a list of dataframes and iterate through each element. A simple example would look like this:
# Push all your dataframes in a list (dflist)
dflist <- list(df1,df2)
# Apply in this list a function that changes the column format (lapply)
dflist <- lapply(dflist, function(x) { x[["Period"]] <- format(x[["Period"]], "%b-%y"); x })
Hope this works for you.
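A small self-contained sketch of the list approach, using made-up data in place of the real kpidata tables. The helper name format_period is hypothetical. Naming the list elements keeps it clear which dataframe is which after lapply.

```r
# Made-up stand-ins for two of the real dataframes.
kpidataRM <- data.frame(Period = as.Date(c("2021-01-01", "2021-02-01")),
                        value = c(1, 2))
kpidataAFM <- data.frame(Period = as.Date(c("2021-03-01", "2021-04-01")),
                         value = c(3, 4))

# Hypothetical helper: format one Date column as month-year text.
format_period <- function(df, col = "Period") {
  df[[col]] <- format(df[[col]], "%b-%y")
  df
}

# A named list keeps track of which element came from which dataframe.
dflist <- list(kpidataRM = kpidataRM, kpidataAFM = kpidataAFM)
dflist <- lapply(dflist, format_period)
```

Each formatted dataframe can then be accessed by name, for example dflist$kpidataRM. The month abbreviation produced by "%b" depends on the session locale.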

Loop to create one dataframe from multiple URLs

I have a character vector with multiple URLs that each host a csv of crime data for a certain year. Is there an easy way to create a loop that will read.csv and rbind all the dataframes without having to run read.csv 8 times over? The vector of URLs is below
urls <- c('https://opendata.arcgis.com/datasets/73cd2f2858714cd1a7e2859f8e6e4de4_33.csv',
'https://opendata.arcgis.com/datasets/fdacfbdda7654e06a161352247d3a2f0_34.csv',
'https://opendata.arcgis.com/datasets/9d5485ffae914c5f97047a7dd86e115b_35.csv',
'https://opendata.arcgis.com/datasets/010ac88c55b1409bb67c9270c8fc18b5_11.csv',
'https://opendata.arcgis.com/datasets/5fa2e43557f7484d89aac9e1e76158c9_10.csv',
'https://opendata.arcgis.com/datasets/6eaf3e9713de44d3aa103622d51053b5_9.csv',
'https://opendata.arcgis.com/datasets/35034fcb3b36499c84c94c069ab1a966_27.csv',
'https://opendata.arcgis.com/datasets/bda20763840448b58f8383bae800a843_26.csv'
)
The function map_dfr from the purrr package does exactly what you want. It applies a function to every element of an input (in this case urls) and binds together the result by row.
library(tidyverse)
map_dfr(urls, read_csv)
I used read_csv() instead of read.csv() out of personal preference but both will work.
In base R:
result <- lapply(urls, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, result)
I usually take the approach below because I want to keep each csv as a separate object in case I later need to analyse them individually. Otherwise, you don't need a for-loop.
for (i in seq_along(urls)) assign(paste0("mycsv_", i), read.csv(url(urls[i]), header = TRUE))
df.list <- mget(ls(pattern = "^mycsv_"))
# Use plyr if the column names differ or you need to know which csv each row comes from
library(plyr)
df <- ldply(df.list) # you can remove the first column (.id) if you wish
# Alternative solution in base R instead of using plyr:
# if all files have the same column names and you only want to rbind them, you can do this
df <- do.call("rbind", df.list)
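If the goal is to keep the individual tables around and also know which file each row came from, a named list can do both without assign(). The sketch below uses small local CSV files in a temporary folder instead of the real URLs, so the data and file names are made up. With the actual urls vector, the same lapply call should work on the URLs directly.

```r
# Create three small CSV files locally (stand-ins for the real downloads).
in_dir <- tempfile("crime_")
dir.create(in_dir)
for (yr in 2017:2019) {
  write.csv(data.frame(offense = c("theft", "burglary"), count = c(yr - 2000, 5)),
            file.path(in_dir, paste0("crime_", yr, ".csv")), row.names = FALSE)
}
paths <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# One data frame per file, kept separately in a named list.
df.list <- lapply(paths, read.csv, stringsAsFactors = FALSE)
names(df.list) <- basename(paths)

# Record the source file in each table, then stack them.
df.list <- Map(function(d, src) { d$source <- src; d }, df.list, names(df.list))
df <- do.call(rbind, df.list)
rownames(df) <- NULL
```

Individual tables remain available as, for example, df.list[["crime_2018.csv"]].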

sapply use in conjunction with dplyr's mutate_at

I am trying to clean up some data stored in multiple data frames by applying the same function repeatedly. In this example I am trying to use mutate_at from dplyr to convert to Date format all columns whose names contain 'date'.
I have a list of tables in my environment such as:
table_list <- c('table_1','table_2','table_3')
My objective is to overwrite each of the tables whose name is listed in table_list with its corrected version. Instead, I can only get the results stored in a large list.
I have so far created a basic function as follows:
fix_dates <- function(df_name) {
  get(df_name) %>%
    mutate_at(vars(contains('date')),
              funs(as.Date(., origin = "1899-12-30")))
}
The fix_dates() function works perfectly fine if I feed it one element at a time with for example fix_dates('table_1').
However, if I use sapply, as in results <- sapply(table_list, fix_dates), the results list holds all the tables from table_list at their respective indexes. What I would like instead is the equivalent of table_1 <- fix_dates('table_1') for each element of table_list.
Is it possible to have sapply store the results in-place instead?
There's probably a more elegant way to do this, but I think this gets you where you want to go:
# use lapply to get a list with the transformed versions of the data frames
# named in table_list
new_tables <- lapply(table_list, function(x) {
  mutate_at(get(x), vars(contains("date")), ~ as.Date(., origin = "1899-12-30"))
})
# assign the original df names to the elements of that list
names(new_tables) <- table_list
# use list2env to put those list elements in the current environment, overwriting
# the ones that were there before
list2env(new_tables, envir = environment())
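For comparison, here is a dplyr-free sketch of the same in-place overwrite, using mget() to fetch the tables by name and list2env() to write them back. The example tables and the helper name fix_dates_base are made up. The numeric values are assumed to be Excel-style serial dates, which is what the origin "1899-12-30" suggests.

```r
# Made-up tables with Excel-style serial dates.
table_1 <- data.frame(id = 1:2, start_date = c(44197, 44198))
table_2 <- data.frame(id = 3:4, end_date = c(44200, 44201),
                      note = c("a", "b"), stringsAsFactors = FALSE)
table_list <- c("table_1", "table_2")

# Hypothetical helper: convert every column whose name contains "date".
fix_dates_base <- function(df) {
  date_cols <- grep("date", names(df), value = TRUE)
  df[date_cols] <- lapply(df[date_cols], as.Date, origin = "1899-12-30")
  df
}

# Fetch the tables by name, fix them, and overwrite the originals.
env <- environment()
new_tables <- lapply(mget(table_list, envir = env), fix_dates_base)
invisible(list2env(new_tables, envir = env))
```

Because mget() returns a named list, the names are already in place for list2env() and no separate names<- step is needed.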

How to loop over several lists to make dataframes in R?

I need to create a loop that converts several lists into dataframes and then writes each dataframe as a csv. That is, I want to (i) run a loop that unlists each of my lists and converts it into a data.frame, and (ii) write each result as CSV.
I ran the following script, which works for one of my lists, but I need to do the same for many of them.
Script to convert a nested list (e.g., list1) into a data frame and write it as CSV:
data <- as.data.frame(t(do.call(rbind, unlist(list1, recursive = FALSE))))
write.csv(data, "list1.csv")
Please note that "list1" is just one of my lists, used here as an example. I wrote a line (done <- ls(pattern="list")) to get a vector with the names of all the lists loaded in the R environment, so I need to apply steps (i) and (ii) to every name in the "done" vector. Is it clearer now?
I would really appreciate it if you could help me create the loop.
for (i in seq_along(done)) {
  list_name <- done[i]
  # get() fetches the list object whose name is stored in list_name
  data <- as.data.frame(t(do.call(rbind, unlist(get(list_name), recursive = FALSE))))
  write.csv(data, paste0(list_name, ".csv"))
}
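fun <- function(x) {
  name <- paste0("list", x)
  # get() fetches the list object by name; unlist() on the name alone would only see a string
  data <- as.data.frame(t(do.call(rbind, unlist(get(name), recursive = FALSE))))
  write.csv(data, paste0(name, ".csv"))
}
invisible(lapply(1:n, fun))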
I believe this is the most efficient way.
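Here is a self-contained sketch of the whole round trip, with two made-up nested lists and output written to a temporary folder rather than the working directory. The names to_df, out_dir, and the anchored pattern "^list[0-9]+$" are assumptions for the example. The pattern is stricter than "list" so that unrelated objects with "list" in their name are not picked up.

```r
# Made-up nested lists standing in for the real ones.
list1 <- list(a = list(x = 1:3, y = 4:6), b = list(z = 7:9))
list2 <- list(a = list(x = 10:12), b = list(z = 13:15))

# Hypothetical helper wrapping the conversion from the question.
to_df <- function(lst) {
  as.data.frame(t(do.call(rbind, unlist(lst, recursive = FALSE))))
}

# Names of all objects called list<number> in the current environment.
env <- environment()
done <- ls(envir = env, pattern = "^list[0-9]+$")

# Write one CSV per list into a temporary output folder.
out_dir <- tempfile("lists_")
dir.create(out_dir)
for (list_name in done) {
  data <- to_df(get(list_name, envir = env))
  write.csv(data, file.path(out_dir, paste0(list_name, ".csv")), row.names = FALSE)
}
```

Each inner vector becomes one column, named after its position in the nested list (for example a.x), so the inner vectors of a given list need to have equal length for rbind to line up cleanly.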
