I am trying to import many CSV files from an EPA website. The naming of those CSV files is sensible and consistent. Any suggestions on how I can use a loop to automate importing the CSV files and naming them as data frames in R?
Right now I'm doing it manually by swapping out the month name in each line of code as illustrated below:
library(tidyverse)
#Download 2013 data
jan_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_jan2013.csv")%>%
add_column("month"="jan","year"=2013)
feb_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_feb2013.csv")%>%
add_column("month"="feb","year"=2013)
mar_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_mar2013.csv")%>%
add_column("month"="mar","year"=2013)
apr_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_apr2013.csv")%>%
add_column("month"="apr","year"=2013)
may_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_may2013.csv")%>%
add_column("month"="may","year"=2013)
jun_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_june2013.csv")%>%
add_column("month"="jun","year"=2013)
jul_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_july2013.csv")%>%
add_column("month"="jul","year"=2013)
aug_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_aug2013.csv")%>%
add_column("month"="aug","year"=2013)
sep_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_sept2013.csv")%>%
add_column("month"="sep","year"=2013)
oct_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_oct2013.csv")%>%
add_column("month"="oct","year"=2013)
nov_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_nov2013.csv")%>%
add_column("month"="nov","year"=2013)
dec_13<-read.csv("https://www.epa.gov/sites/default/files/2017-10/rindata_dec2013.csv")%>%
add_column("month"="dec","year"=2013)
I'd like to set something up where all 12 months are imported, the added column is modified appropriately and the resulting df is named appropriately, by month.
Thanks for the help!
Read all of the csvs using a vector of months and string concatenation, then set their names, enframe, add a year column, and unnest:
months <- c("jan", "feb", "mar", "apr", "may", "june", "july", "aug", "sept", "oct", "nov", "dec")
dfs <- lapply(months, function(x) read.csv(paste0("https://www.epa.gov/sites/default/files/2017-10/rindata_", x, "2013.csv"))) %>%
setNames(months) %>%
enframe(name = "month") %>%
add_column(year = 2013) %>%
unnest(value)
Let me know if this works!
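If separate, named data frames are preferred (as in the question), a minimal sketch along these lines might help. It assumes the file names use the spellings "june", "july" and "sept" seen in the URLs above, while the labels should be "jun", "jul" and "sep". The helper read_one and its reader argument are hypothetical names introduced here for illustration, not part of any package.

```r
# Desired labels (names) mapped to the spellings used in the file names (values)
file_months <- c(jan = "jan", feb = "feb", mar = "mar", apr = "apr",
                 may = "may", jun = "june", jul = "july", aug = "aug",
                 sep = "sept", oct = "oct", nov = "nov", dec = "dec")
year <- 2013

# Build one URL per month and name each one like "jan_13"
urls <- paste0("https://www.epa.gov/sites/default/files/2017-10/rindata_",
               file_months, year, ".csv")
names(urls) <- paste0(names(file_months), "_", year %% 100)

# Hypothetical helper: read one file and tag it with month and year.
# `reader` defaults to read.csv but can be swapped out for testing.
read_one <- function(url, month, year, reader = read.csv) {
  df <- reader(url)
  df$month <- month
  df$year <- year
  df
}

# Not run here because it downloads twelve files:
# dfs <- Map(read_one, urls, names(file_months), year)
# dfs$jun_13
```

Keeping the results in one named list (dfs$jan_13, dfs$feb_13, ...) is usually easier to work with than twelve separate objects in the global environment.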
I have a list, clean_data_2009, containing 12 monthly data frames, each named wireless_mmm_YY, where mmm abbreviates the calendar month and YY is the year 2009 abbreviated as 09.
I want to drop the first row in each of the 12 data frames and then turn the new first row into the variable names. The command below works for one data frame, but I want to write a loop instead.
clean_data_2009$wireless_jan_09 <- clean_data_2009$wireless_jan_09[-1,] %>% row_to_names(row_number = 1)
I have written a loop that uses paste to build the text of the command I want R to run, but R does not evaluate the pasted text as code and gives me an error. I tried to fix it with get, but I still run into the error shown below:
month <- c("jan", "feb", "mar", "april", "may", "june", "july", "aug", "sep", "oct", "nov", "dec")
year <- c("09") # "2010", "2011"
list_dt <- c("clean_data_2009$wireless")
rows2del <- c("[-1, ]")
for (y in year) {
for (m in month) {
print(paste(y,m,sep = "_") )
print(paste(list_dt,m,y,sep = "_"))
print(paste(paste(list_dt,m,y,sep = "_"),rows2del, sep=""))
get(paste(list_dt,m,y,sep = "_")) <- get(paste(paste(list_dt,m,y,sep = "_"),rows2del, sep="")) %>% row_to_names(row_number = 1)
}
}
Error:
[1] "09_jan"
[1] "clean_data_2009$wireless_jan_09"
[1] "clean_data_2009$wireless_jan_09[-1, ]"
Error in get(paste(paste(list_dt, m, y, sep = "_"), rows2del, sep = "")) :
object 'clean_data_2009$wireless_jan_09[-1, ]' not found
This alternative approach might help. If you already have your frames in a list, you can just loop over them, use indexing to drop the first row, and set the names in a single setNames() call for each frame:
lapply(clean_data_2009, \(d) setNames(d[-1,],d[1,]))
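Note that the one-liner uses the original first row as the header, whereas the command in the question removes the first row and then promotes the row after it. If the second behaviour is what is wanted, a base R sketch could look like the following. The sample data and the helper name drop_then_promote are made up for illustration, and the result has to be assigned back to the list to take effect.

```r
# Made-up stand-in for the list in the question (one element only)
clean_data_2009 <- list(
  wireless_jan_09 = data.frame(
    V1 = c("junk", "id", "1", "2"),
    V2 = c("junk", "value", "10", "20"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: drop row 1, then use the next row as column names
drop_then_promote <- function(d) {
  d <- d[-1, , drop = FALSE]                # drop the original first row
  names(d) <- as.character(unlist(d[1, ]))  # new first row becomes the header
  d <- d[-1, , drop = FALSE]                # remove the header row from the data
  rownames(d) <- NULL                       # renumber the remaining rows
  d
}

# Apply to every data frame and store the result back in the list
clean_data_2009 <- lapply(clean_data_2009, drop_then_promote)
clean_data_2009$wireless_jan_09
```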
I currently have several files in a folder. They contain daily stock updates and look like this:
Onhand Harian 12 Juli 2019.xlsx
Onhand Harian 13 Juli 2019.xlsx
Onhand Harian 14 Juli 2019.xlsx... and so on.
I would like to read ONLY the latest Excel file, using the date in the file name. How can this be done? Thanks in advance.
I would do something like:
library(stringr)
library(tidyverse)
x <- c("Onhand Harian 12 Juli 2019.xlsx",
"Onhand Harian 13 Juli 2019.xlsx",
"Onhand Harian 14 Juli 2019.xlsx")
lookup <- set_names(seq_len(12),
c("Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
"August", "September", "Oktober", "November", "Dezember"))
enframe(x, name = NULL, value = "txt") %>%
mutate(txt_extract = str_extract(txt, "\\d{1,2} \\D{3,9} \\d{4}")) %>% # September is longest ..
separate(txt_extract, c("d", "m", "y"), remove = FALSE) %>%
mutate(m = sprintf("%02d", lookup[m]),
d = sprintf("%02d", as.integer(d))) %>%
mutate(date = as.Date(str_c(y, m, d), format = "%Y%m%d")) %>%
filter(date == max(date)) %>%
pull(txt)
# "Onhand Harian 14 Juli 2019.xlsx"
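The lookup above uses German month names, which happens to work for "Juli". If the file names are actually Indonesian (an assumption based on "Onhand Harian"), a base R variant with an Indonesian lookup might be closer to what is needed. The third file name below is made up to show a month change.

```r
x <- c("Onhand Harian 12 Juli 2019.xlsx",
       "Onhand Harian 13 Juli 2019.xlsx",
       "Onhand Harian 2 Agustus 2019.xlsx")  # made-up extra file

# Assumption: month names in the file names are Indonesian
id_months <- c("Januari", "Februari", "Maret", "April", "Mei", "Juni",
               "Juli", "Agustus", "September", "Oktober", "November",
               "Desember")

# Capture day, month name and year from each file name
parts <- regmatches(x, regexec("([0-9]{1,2}) ([A-Za-z]+) ([0-9]{4})", x))

# Rebuild each date as an ISO string, independent of the system locale
iso <- vapply(parts, function(p) {
  sprintf("%s-%02d-%02d", p[4], match(p[3], id_months), as.integer(p[2]))
}, character(1))
file_dates <- as.Date(iso)

# Pick the file with the most recent date
latest <- x[which.max(as.numeric(file_dates))]
latest
```

Because the month is looked up by position rather than parsed with "%B", the result does not depend on the locale R is running in.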
If all of your files share the same name prefix, you can do
#List all the file names in the folder
file_names <- list.files("/path/to/folder/", full.names = TRUE)
#Remove all unwanted characters and keep only the date
#Convert the date string to actual Date object
#Sort them and take the latest file
file_to_read <- file_names[order(as.Date(sub("Onhand Harian ", "",
sub("\\.xlsx$", "", basename(file_names))), "%d %B %Y"), decreasing = TRUE)[1]]
Alternatively, if your files are generated every day, you can also select the latest one based on its creation or modification time using file.info(). Details in the post.
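A minimal sketch of the file.info() route, assuming the modification time reflects the order in which the files were produced (copying or re-saving files can break that assumption). The temporary folder and the time stamps are made up so the example runs on its own.

```r
# Self-contained demo folder; replace with the real folder path
folder <- file.path(tempdir(), "onhand_demo")
dir.create(folder, showWarnings = FALSE)
old_file <- file.path(folder, "Onhand Harian 13 Juli 2019.xlsx")
new_file <- file.path(folder, "Onhand Harian 14 Juli 2019.xlsx")
file.create(old_file, new_file)

# Made-up modification times so the demo is deterministic
Sys.setFileTime(old_file, as.POSIXct("2019-07-13 08:00:00", tz = "UTC"))
Sys.setFileTime(new_file, as.POSIXct("2019-07-14 08:00:00", tz = "UTC"))

# List the Excel files and keep the most recently modified one
file_names <- list.files(folder, pattern = "\\.xlsx$", full.names = TRUE)
mtimes <- file.info(file_names)$mtime
latest_file <- file_names[which.max(as.numeric(mtimes))]
basename(latest_file)
```

The selected path can then be passed to whichever Excel reader is already in use.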
As an example, suppose I have this data:
key <- data.frame(num=c(1,2,3,4,5), month=c("January", "Feb", "March", "Apr", "May"))
data <- c(4,2,5,3)
I want to create a new vector, data2 using the mapping of num to month contained in key. I can do this manually using case_when by doing lots of if statements at once:
library(dplyr)
data2<-case_when(
data==1 ~ "January",
data==2 ~ "Feb",
data==3 ~ "March",
data==4 ~ "Apr",
data==5 ~ "May"
)
However, say that I want to automate this process (maybe I actually have thousands of if statements) and utilize the mapping contained in key. Is this, or something like it, possible?
Here is a failed attempt at code:
data2 <- case_when(data=key$num ~ key$month)
What I am going for is a vector called data2 with these elements: c("Apr","Feb","May","March"). How can I do this?
You can use match and base R indexing (also, set stringsAsFactors=FALSE when you initialize the data.frame, as I did below):
key <- data.frame(num=c(1,2,3,4,5), month=c("January", "Feb", "March", "Apr", "May"), stringsAsFactors = FALSE)
data2 <- key$month[match(data, key$num)]
data2
#[1] "Apr" "Feb" "May" "March"
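An equivalent way to express the same mapping is a named vector used as a lookup table. This is only a sketch. The extra value 9 is made up to show what happens when a value has no entry in key: it comes back as NA, just as it would with match.

```r
key <- data.frame(num = c(1, 2, 3, 4, 5),
                  month = c("January", "Feb", "March", "Apr", "May"),
                  stringsAsFactors = FALSE)
data <- c(4, 2, 5, 3, 9)  # 9 is a made-up value that is not in key

# Named vector: names are the keys (as text), values are the months
lookup <- setNames(key$month, key$num)

# Index by name; unname() drops the key labels from the result
data2 <- unname(lookup[as.character(data)])
data2
```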
Why do I get "warning longer object length is not a multiple of shorter object length"?
Forgive me for asking this again, but I am unable to figure out why I am getting this error message, even after combing through Stack Overflow. The answer at the above link says:
"memb only has a length of 10. I'm guessing the length of dih_y2$MemberID isn't a multiple of 10. When using == it will spit out a warning if it isn't a multiple to let you know that it's probably not doing what you're expecting it is doing."
I am getting the same error message from the following code, but I am not sure which "objects" are of different length in my example or how to fix it! Essentially, I am trying to separate my dates into months for analysis. Please help if you can. Thank you.
library(ggplot2)
library(dplyr)
library(statsr)
piccolos2 <- piccolos2 %>%
mutate(SERPDate = as.Date(piccolosRankings$SERPDate, format='%m/%d/%Y'))
piccolos2 <- piccolos2 %>%
mutate(Month = ifelse(as.numeric(SERPDate) %in% 0017-04-01:0017-04-30, "April",
ifelse(as.numeric(SERPDate) %in% 0017-05-01:0017-05-31, "May",
ifelse(as.numeric(SERPDate) %in% 0017-06-01:0017-06-30, "June",
ifelse(as.numeric(SERPDate) %in% 0017-07-01:0017-07-31, "July", "August")))))
piccolos2 <- piccolos2 %>%
mutate(Month = ifelse(as.numeric(SERPDate) %in% as.Date("0017-04-01"):as.Date("0017-04-30"), "April",
ifelse(as.numeric(SERPDate) %in% as.Date("0017-05-01"):as.Date("0017-05-31"), "May",
ifelse(as.numeric(SERPDate) %in% as.Date("0017-06-01"):as.Date("0017-06-30"), "June",
ifelse(as.numeric(SERPDate) %in% as.Date("0017-07-01"):as.Date("0017-07-31"), "July", "August")))))
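One likely explanation for the warning in the first version: in R, both `:` and `%in%` bind more tightly than binary `-`, and 0017-04-01 is not a date literal, so `x %in% 0017-04-01:0017-04-30` is evaluated as `(x %in% 17) - 4 - (1:17) - 4 - 30`. That subtracts a length-17 vector from one as long as the data, which produces the recycling warning when the two lengths do not divide evenly. The sketch below shows this and one way to get month names without nested ifelse calls. The sample dates are made up, and the `%y` format is an assumption based on the dates landing in year 0017, which typically happens when two-digit years are parsed with `%Y`.

```r
# Demonstration of the precedence issue with a small made-up vector
x <- c(1, 2, 3)
parsed <- suppressWarnings(x %in% 0017-04-01:0017-04-30)
length(parsed)  # 17, not length(x): the expression is arithmetic, not a date range

# Made-up sample standing in for piccolosRankings$SERPDate
serp_raw <- c("4/1/17", "5/15/17", "6/30/17", "7/4/17", "8/9/17")

# %y reads a two-digit year; %Y would treat "17" as a four-digit year
serp_date <- as.Date(serp_raw, format = "%m/%d/%y")

# month.name is built into R and does not depend on the locale
serp_month <- month.name[as.integer(format(serp_date, "%m"))]
serp_month
```

Inside the pipeline from the question this would become something like mutate(Month = month.name[as.integer(format(SERPDate, "%m"))]), using the SERPDate column already in piccolos2 rather than piccolosRankings$SERPDate.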
I am trying to run a loop with months as the input. Specifically, I want to count the number of times a given topic appears on a specific date. The way I am trying to do so is as follows:
for (i in c("January", "February", "March", "April",
"May", "June", "July", "August", "September", "October", "November", "December")){
print(length(which(data$Date == "i 2005" & data$Maxtopic == 3)))
}
Nevertheless, I get 0 as output for all the dates. Any ideas why?
Cheers,
Try data$Date == sprintf("%s 2005", i). Your attempt searches for the literal string "i 2005".
However, the table function was designed for this. Use gsub to remove the year:
table(gsub(" 2005", "", data[data$Maxtopic == 3, "Date"], fixed = TRUE))
PS: Next time please provide a reproducible example to enable development and testing of solutions.
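For reference, here is a small self-contained sketch of the sprintf fix. The data frame is made up, since the real data from the question is not shown, and month.name is R's built-in vector of English month names.

```r
# Made-up data; the column names follow the question
data <- data.frame(
  Date = c("January 2005", "January 2005", "March 2005",
           "March 2005", "May 2005"),
  Maxtopic = c(3, 3, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Count rows per month where Maxtopic is 3, building "Month 2005" with sprintf
counts <- vapply(month.name, function(m) {
  sum(data$Date == sprintf("%s 2005", m) & data$Maxtopic == 3)
}, integer(1))

# Show only the months that actually occur
counts[counts > 0]
```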