Loop to create one dataframe from multiple URLs

I have a character vector with multiple URLs that each host a csv of crime data for a certain year. Is there an easy way to create a loop that will read.csv and rbind all the dataframes without having to run read.csv 8 times over? The vector of URLs is below
urls <- c('https://opendata.arcgis.com/datasets/73cd2f2858714cd1a7e2859f8e6e4de4_33.csv',
'https://opendata.arcgis.com/datasets/fdacfbdda7654e06a161352247d3a2f0_34.csv',
'https://opendata.arcgis.com/datasets/9d5485ffae914c5f97047a7dd86e115b_35.csv',
'https://opendata.arcgis.com/datasets/010ac88c55b1409bb67c9270c8fc18b5_11.csv',
'https://opendata.arcgis.com/datasets/5fa2e43557f7484d89aac9e1e76158c9_10.csv',
'https://opendata.arcgis.com/datasets/6eaf3e9713de44d3aa103622d51053b5_9.csv',
'https://opendata.arcgis.com/datasets/35034fcb3b36499c84c94c069ab1a966_27.csv',
'https://opendata.arcgis.com/datasets/bda20763840448b58f8383bae800a843_26.csv'
)

The function map_dfr() from the purrr package does exactly what you want: it applies a function to every element of an input (here, urls) and binds the results together by row.
library(tidyverse)
map_dfr(urls, read_csv)
I used read_csv() instead of read.csv() out of personal preference, but both will work.
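To try the pattern without hitting the live URLs, here is a minimal offline sketch: two throwaway CSVs (made-up data, not the real crime files) stand in for the links, since read_csv() accepts local paths and URLs alike.

```r
library(purrr)
library(readr)

# Two toy CSVs standing in for the opendata.arcgis.com links above.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write_csv(data.frame(year = 2019, offenses = 10), f1)
write_csv(data.frame(year = 2020, offenses = 12), f2)

# One call reads every file and stacks the results by row.
combined <- map_dfr(c(f1, f2), read_csv)
nrow(combined)  # 2
```

Swapping c(f1, f2) for the urls vector gives the real combined dataframe.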

In base R:
result <- lapply(urls, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, result)
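The same base R pattern, made self-contained with two throwaway CSV files (invented data) standing in for the crime-data URLs:

```r
# Write two small CSVs to temporary files.
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(offense = c("theft", "assault"), count = c(5, 3)),
          paths[1], row.names = FALSE)
write.csv(data.frame(offense = "burglary", count = 7),
          paths[2], row.names = FALSE)

# Read each file into a list element, then stack them by row.
result <- lapply(paths, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, result)
nrow(result)  # 3: two rows from the first file plus one from the second
```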

I usually take this approach because I want to keep all the csv files as separate objects, in case I need to do further analysis on any of them later. Otherwise, you don't need a for-loop.
for (i in seq_along(urls)) assign(paste0("mycsv-", i), read.csv(url(urls[i]), header = TRUE))
df.list <- mget(ls(pattern = "^mycsv-"))
# use plyr if the data frames have different column names and you need to
# know which rows came from which csv file
library(plyr)
df <- ldply(df.list) # the first column holds the source name; drop it if you wish
# alternative in base R instead of plyr:
# if they have the same column names and you only want to rbind, do this:
df <- do.call("rbind", df.list)
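If you don't actually need the individual objects sitting in the global environment, you can get the same named list in one step with setNames() and skip assign()/mget() entirely. A sketch with toy files (invented data):

```r
# Two throwaway CSVs standing in for the real downloads.
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(x = 1:2), paths[1], row.names = FALSE)
write.csv(data.frame(x = 3:4), paths[2], row.names = FALSE)

# Build the named list directly instead of assign()/mget().
df.list <- setNames(lapply(paths, read.csv),
                    paste0("mycsv-", seq_along(paths)))
df <- do.call("rbind", df.list)
nrow(df)  # 4, with row names prefixed by the list names
```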

Adding rows to dataframe using lapply

I am trying to extract data from a pdf and then enter it into a row of a dataframe. I have figured out how to extract the data I want, but I'm not able to figure out the last two parts yet. I've set up a basic function to try with lapply, and it gives me a 1-row, 39-column dataframe with the information I want as characters, properly formatted:
filenames <- list.files("C:/Users/.../inputfolder", pattern="*.pdf")
function01 <- function(x) {
  # pdf_text() is from pdftools, str_squish() from stringr
  df1 <- pdf_text(x) |>
    str_squish()
  # ... further cleaning steps (mgsub(), etc.) go here ...
  df1
}
master_list <- lapply(filenames, function01)
mdf <- as.data.frame(do.call(rbind, master_list))
So right now this works for one pdf, and I'm not quite sure how to make it apply to all files in the folder properly and add the data to the rows of mdf.
You can use purrr::map_dfr().
This function calls your function on each element in turn, then returns the output as a single data.frame with a row for every iteration:
library(purrr)
master_list <- map_dfr(filenames, function01)

Automatically name the elements of a list after importing using lapply

I have a list of dataframes which I imported using
setwd("C:path")
fnames <- list.files()
csv <- lapply(fnames, read.csv, header = T, sep=";")
I will need to do this multiple times, creating more lists, and I would like to keep all the dataframes available separately (i.e. I don't want or need to combine them); I simply used the above code to import them all quickly. But accessing them now is a little cumbersome and not intuitive (to me, at least). Rather than having to use [[1]] to access the first element, is there a way I could amend the code above so that I can name the elements in the list, for example based on Date, which is a variable in each of the dataframes in the list? The dates are stored as chr in the format "dd-mm-yyyy", so I could potentially just name the dataframes using the dd-mm part of the Date variable.
You can extract the required data from the 1st value in the Date column of each dataframe and assign the results as the names of the list.
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
Or extract the data using regex.
names(csv) <- sapply(csv, function(x) sub("(\\w+-\\w+).*", "\\1", x$Date[1]))
We can use
names(csv) <- sapply(csv, function(x) substring(x$Date[1], 1, 5))
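A self-contained sketch of the substr() approach, with two toy CSVs (made-up data) whose Date column is a character in "dd-mm-yyyy" form, mirroring the setup in the question:

```r
# Two throwaway CSVs with a character Date column.
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(Date = "01-05-2020", x = 1), paths[1], row.names = FALSE)
write.csv(data.frame(Date = "15-06-2020", x = 2), paths[2], row.names = FALSE)

# Read them into a list, then name each element from its first Date value.
csv <- lapply(paths, read.csv, header = TRUE, stringsAsFactors = FALSE)
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
names(csv)  # "01-05" "15-06"
```

After naming, elements are reachable as csv[["01-05"]] instead of csv[[1]].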

sapply use in conjunction with dplyr's mutate_at

I am trying to clean up some data stored in multiple data frames using the same function repeatedly. In this example I am trying to leverage mutate_at from dplyr to convert all columns whose names contain 'date' to Date format.
I have a list of tables in my environment such as:
table_list <- c('table_1','table_2','table_3')
The objective is to overwrite each of the tables named in table_list with its corrected version. Instead, I can only get the results stored in a large list.
I have so far created a basic function as follows:
fix_dates <- function(df_name){
  get(df_name) %>%
    mutate_at(vars(contains('date')),
              ~ as.Date(., origin = "1899-12-30"))
}
The fix_dates() function works perfectly fine if I feed it one element at a time with for example fix_dates('table_1').
However, if I use sapply, e.g. results <- sapply(table_list, fix_dates), then results is a list containing all the tables from table_list at their respective indexes. What I would like instead is the effect of table_1 <- fix_dates('table_1') for each element of table_list.
Is it possible to have sapply store the results in-place instead?
There's probably a more elegant way to do this, but I think this gets you where you want to go:
# use lapply to get a list with the transformed versions of the data frames
# named in table_list
new_tables <- lapply(table_list, function(x) {
  mutate_at(get(x), vars(contains("date")),
            ~ as.Date(., origin = "1899-12-30"))
})
# assign the original df names to the elements of that list
names(new_tables) <- table_list
# use list2env to put those list elements in the current environment, overwriting
# the ones that were there before
list2env(new_tables, envir = environment())
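Here is the whole flow end to end, made self-contained with two toy tables (invented names and values) whose numeric *_date columns hold day counts from the 1899-12-30 spreadsheet epoch assumed above:

```r
library(dplyr)

# Toy tables standing in for the real ones in the environment.
table_1 <- data.frame(start_date = c(43466, 43831), value = 1:2)
table_2 <- data.frame(end_date = 43891, value = 3)
table_list <- c("table_1", "table_2")

# Transform every table named in table_list...
new_tables <- lapply(table_list, function(x) {
  mutate_at(get(x), vars(contains("date")),
            ~ as.Date(., origin = "1899-12-30"))
})
names(new_tables) <- table_list

# ...then write the fixed versions back over the originals.
list2env(new_tables, envir = environment())

class(table_1$start_date)  # "Date"
```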

How to indicate row.names=1 using fread() in data.table?

I want to consider the first column in my .csv file as a sequence of rownames. Usually I used to do the following:
read.csv("example_file.csv", row.names=1)
But I want to do this with the fread() function in the data.table R package, as it runs very quickly.
X <- as.matrix(fread("bigmatrix.csv"), rownames = 1)
Why not save the row names in a column:
df <- data.frame(x=rnorm(1000))
df$row_name = row.names(df)
fwrite(df,file="example_file.csv")
Then you can load the saved CSV.
df <- fread(file="example_file.csv")
From a small search I've done, data.table never uses row names. Since data.tables inherit from data.frames, the row names attribute is still there, but it is never used.
However, you can probably use the approach from a similar post and later turn that rowname column into your actual row names, though it might not be efficient.
Just one more function call: convert to a dataframe, then set the row names from the first column (which fread names V1):
a <- fread(file="example_file.csv") %>% as.data.frame()
row.names(a) <- a$V1
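A self-contained round trip of the as.matrix(..., rownames =) idiom from the accepted answer, using a tiny made-up matrix written out with fwrite() so it can be run without bigmatrix.csv:

```r
library(data.table)

# A small matrix with row names, written out with the names as a column.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b")))
f <- tempfile(fileext = ".csv")
fwrite(as.data.frame(m), f, row.names = TRUE)

# fread() drops nothing; as.matrix() promotes column 1 back to row names.
X <- as.matrix(fread(f), rownames = 1)
rownames(X)  # "r1" "r2"
```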

Extracting outputs of different length from lapply

Could anyone help me sort out this problem? I am using lapply in the following code, provided by @Arun:
out <- lapply(1:length(f1), function(f.idx) {
df1 <- read.delim(f1[f.idx], header = T)
df2 <- read.delim(f2[f.idx], header = T)
df3 <- read.delim(f3[f.idx], header = T)
idx.v <- get_idx(df1)
result <- get_result(idx.v, df2, df3)
})
Now, out is a list of 110 results. These outputs are of different lengths, so I cannot use as.data.frame(do.call(rbind, out)).
Is there a way to save each one as a separate file in a loop-like manner, or do I have to do it manually (e.g., out[[1]], out[[2]], etc.)?
What are you trying to achieve with do.call(rbind, ...)? That is for combining data in a list, and I'm not sure why you would wrap it in as.data.frame(). If you have a list of data.frames with the same columns but differing numbers of rows, and you essentially want to "stack" those data.frames on top of each other, then the following should give you one big data.frame, which you can then save as a single object:
do.call(rbind, out)
It sounds like you have different data.frames in a list named "out" and are trying to save your data.frames as individual files in your working directory. If that is the case, try something like:
lapply(names(out),
function(x) write.csv(out[[x]],
file = paste(x, ".csv", sep = "")))
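The write-out loop above, made self-contained with a toy named list of unequal-length data frames (invented names), saved into a temp directory rather than the working directory:

```r
# A named list with elements of different lengths, as in the question.
out <- list(res_a = data.frame(x = 1:3),
            res_b = data.frame(x = 1:5))
outdir <- tempdir()

# One CSV per list element, named after the element.
invisible(lapply(names(out),
                 function(x) write.csv(out[[x]],
                                       file = file.path(outdir,
                                                        paste(x, ".csv", sep = "")))))
sort(list.files(outdir, pattern = "^res_"))  # "res_a.csv" "res_b.csv"
```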
If the names of the data.frames in the list are not unique, you might need to take a different approach.
If you link to your earlier related questions, that might be better than simply mentioning who shared the code with you.