Loop through two data frames and concatenate contents to string - r

I am trying to create a large list of file URLs by concatenating various pieces together. (Say, ~40 file URLs which represent multiple data types for each of the 50 states.) Eventually, I will download and then unzip/unrar these files. (I have working code for that portion of it.)
I'm very much an R noob, so please bear with me, here.
I have a set of data frames:
states - list of 50 state abbreviations
partial_url - a partial URL for the 50 states
url_parts - a list of each of the remaining URL pieces (5 file types to download)
year
filetype
I need a URL that looks like this:
http://partial_url/state_urlpart_2017_file.csv.gz
I was able to build the partial_url data frame with the following:
for (i in seq_along(states)) {
url_part1 <- as.data.frame(paste0(url,states[[i]],"/",dir,"/"))
}
I was hoping that some kind of nested loop might work to do the rest, like so:
for (i in 1:partial_url){
for (j in 1:url_parts){
for(k in 1:states){
url_part2 <- as.data.frame(paste0(partial_url[[i]],"/",url_parts[[j]],states[[k]],year,filetype))
}}}
Can anyone suggest how to proceed with the final step?

As I understand it, everything the OP needs can be handled by paste0 itself. paste0 is vectorised, so the for loop shown by the OP is not needed. The data in my example are stored as vectors, but each could just as well be a column of a data.frame.
For example:
states <- c("Alabama", "Colorado", "Georgia")
partial_url <- c("URL_1", "URL_2", "URL_3")
url_parts <- c("PART_1", "PART_2", "PART_3")
year <- 2017
fileType <- "xls"
# paste0 now lists all the URLs
paste0(partial_url,"/",url_parts,states,year,fileType)
#[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls"
#[3] "URL_3/PART_3Georgia2017xls"
EDIT: multiple fileType values, based on feedback from @Onyambu
We can use rep(fileType, each = length(states)) to support multiple file types.
The solution then looks like this:
fileType <- c("xls", "doc")
paste0(partial_url,"/",url_parts,states,year,rep(fileType,each = length(states)))
[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls" "URL_3/PART_3Georgia2017xls"
[4] "URL_1/PART_1Alabama2017doc" "URL_2/PART_2Colorado2017doc" "URL_3/PART_3Georgia2017doc"
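Note that paste0 pairs its arguments element by element (recycling shorter ones), so on its own it does not produce every state/file-type combination. A minimal sketch of one way to get the full cross product in base R, assuming a hypothetical base URL (http://example.com) and made-up piece names:

```r
# Hypothetical inputs; replace with the real state codes and URL pieces
states    <- c("AL", "CO", "GA")
url_parts <- c("PART_1", "PART_2")
base_url  <- "http://example.com"   # assumption, not the real host
year      <- 2017

# expand.grid() returns one row per state/url_part combination
combos <- expand.grid(state = states, url_part = url_parts,
                      stringsAsFactors = FALSE)

# paste0() is vectorised over the rows of combos
urls <- paste0(base_url, "/", combos$state, "/",
               combos$state, "_", combos$url_part, "_", year, "_file.csv.gz")
print(urls)
```

The first column of expand.grid varies fastest, so all states for PART_1 come before any state for PART_2.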

Here is a tidyverse solution with some simple example data. The approach is to use complete to give yourself a data frame with all possible combinations of your variables. This works because if you make each variable a factor, complete will include all possible factor levels even if they don't appear. This makes it easy to combine your five url parts even though they appear to have different nrow (e.g. 50 states but only 5 file types). unite allows you to join together columns as strings, so we call it three times to include the right separators, and then finally add the http:// with mutate.
Re: your for loop, I find nested for loop logic hard to work through in the first place. As written it has at least two issues: you have 1:partial_url instead of 1:length(partial_url) (and similarly for the other loops), and you overwrite the same object on every pass of the loop. I prefer to avoid loops unless the problem absolutely requires them.
library(tidyverse)
states <- tibble(state = c("AK", "AZ", "AR", "CA", "NY"))
partial_url <- tibble(part = c("part1", "part2"))
url_parts <- tibble(urlpart = c("urlpart1", "urlpart2"))
year <- tibble(year = 2007:2010)
filetype <- tibble(filetype = c("csv", "txt", "tar"))
urls <- bind_cols(
states = states[[1]] %>% factor() %>% head(2),
partial_url = partial_url[[1]] %>% factor() %>% head(2),
url_parts = url_parts[[1]] %>% factor() %>% head(2),
year = year[[1]] %>% factor() %>% head(2),
filetype = filetype[[1]] %>% factor() %>% head(2)
) %>%
complete(states, partial_url, url_parts, year, filetype) %>%
unite("middle", states, url_parts, year, sep = "_") %>%
unite("end", middle, filetype, sep = ".") %>%
unite("url", partial_url, end, sep = "/") %>%
mutate(url = str_c("http://", url))
print(urls)
# A tibble: 160 x 1
url
<chr>
1 http://part1/AK_urlpart1_2007.csv
2 http://part1/AK_urlpart1_2007.txt
3 http://part1/AK_urlpart1_2008.csv
4 http://part1/AK_urlpart1_2008.txt
5 http://part1/AK_urlpart1_2009.csv
6 http://part1/AK_urlpart1_2009.txt
7 http://part1/AK_urlpart1_2010.csv
8 http://part1/AK_urlpart1_2010.txt
9 http://part1/AK_urlpart2_2007.csv
10 http://part1/AK_urlpart2_2007.txt
# ... with 150 more rows
Created on 2018-02-22 by the reprex package (v0.2.0).
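For comparison, the two loop issues mentioned in this answer (looping over 1:partial_url and overwriting the result on every pass) could be fixed roughly as follows. This is only a sketch with made-up inputs; the host name and file suffix are assumptions:

```r
# Made-up inputs standing in for the real partial URLs and URL pieces
partial_url <- c("http://host/AK", "http://host/AZ")
url_parts   <- c("urlpart1", "urlpart2", "urlpart3")

# Pre-allocate the result instead of overwriting one object
urls <- character(length(partial_url) * length(url_parts))
n <- 0
for (i in seq_along(partial_url)) {      # seq_along(), not 1:partial_url
  for (j in seq_along(url_parts)) {
    n <- n + 1
    urls[n] <- paste0(partial_url[i], "/", url_parts[j], "_2017.csv.gz")
  }
}
print(urls)
```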

Related

Iteratively skip the last rows in CSV files when using read_csv

I have a number of CSV files exported from our database, say site1_observations.csv, site2_observations.csv, site3_observations.csv etc. Each CSV looks like below (site1 for example):
Column A                | Column B | Column C
# Team: all teams       |          |
# Observation type: xyz |          |
Site ID                 | Reason   | Quantity
a                       | xyz      | 1
b                       | abc      | 2
Total quantity          |          | 3
We need to skip the top 2 rows and the last row of each CSV before combining them into one master dataset for further analysis. I know I can use the skip = argument to skip the first few lines of a CSV, but read_csv() doesn't seem to have a simple argument for skipping the last lines, so I have been using n_max = as a workaround. So far the import has been done manually. I want to move from the manual process to a programmatic one using purrr::map(), but I couldn't work out how to efficiently skip the last few lines.
library(tidyverse)
observations_skip_head <- 2
# Approach 1: manual ----
site1_rawdata <- read_csv("/data/site1_observations.csv",
skip = observations_skip_head,
n_max = nrow(read_csv("/data/site1_observations.csv",
skip = observations_skip_head))-1)
# site2_rawdata
# site3_rawdata
# [etc]
# all_sites_rawdata <- bind_rows(site1_rawdata, site2_rawdata, site3_rawdata, [etc])
I have tried purrr::map() and I believe I am almost there, except for the n_max = part: I am not sure how to handle it inside map() (or whether there is another effective way to drop the last line of each CSV). How can this be done with purrr?
observations_csv_paths_chr <- paste0("data/site", 1:3,"_observations.csv")
# Approach 2: programmatically import csv files with purrr ----
all_sites_rawdata <- observations_csv_paths_chr %>%
map(~ read_csv(., skip = observations_skip_head,
n_max = nrow(read_csv("/data/site1_observations.csv",
skip = observations_skip_head))-1)) %>%
set_names(observations_csv_paths_chr)
I know this post uses a custom function and fread. But for my education I want to understand how to achieve this goal using the purrr approach (if it's doable).
You could try something like this?
library(tidyverse)
csv_files <- paste0("data/site", 1:3, "_observations.csv")
csv_files |>
map(
~ .x |>
read_lines() |>
tail(-2) |> # drop the 2 comment lines at the top
head(-1) |> # ...and the trailing total line
paste(collapse = '\n') |>
read_csv()
)
# Base R alternative: drop the unwanted lines, then parse the remaining text
manual_csv <- function(x) {
txt <- readLines(x)
txt <- txt[-c(2, 3, length(txt))] # indices of the lines you want to delete
read.csv(text = paste0(txt, collapse = "\n"))
}
test <- manual_csv('D:/jaechang/pool/final.csv')
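Both answers share one idea: read the raw lines, drop the unwanted ones from each end, and parse what is left. A minimal base R sketch of that idea as a reusable helper; read_trimmed and its arguments are hypothetical names, and the temporary file only mimics the layout described in the question:

```r
# Hypothetical helper: drop skip_head lines from the top and skip_tail
# lines from the bottom, then parse the remaining text as CSV
read_trimmed <- function(path, skip_head = 2, skip_tail = 1) {
  lines <- readLines(path)
  idx   <- seq_along(lines)
  keep  <- lines[idx > skip_head & idx <= length(lines) - skip_tail]
  read.csv(text = paste(keep, collapse = "\n"), stringsAsFactors = FALSE)
}

# Made-up file that mimics the export layout from the question
tmp <- tempfile(fileext = ".csv")
writeLines(c("# Team: all teams",
             "# Observation type: xyz",
             "Site ID,Reason,Quantity",
             "a,xyz,1",
             "b,abc,2",
             "Total quantity,,3"), tmp)

site <- read_trimmed(tmp)
print(site)

# Applying it to several files and stacking the results
all_sites <- do.call(rbind, lapply(c(tmp, tmp), read_trimmed))
```

Because the number of trailing lines is computed per file inside the helper, the same call works with purrr::map() in place of lapply().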

Gather multiple data sets from an URL/FTP site and merge them into a single dataframe for wrangling

Okay R community. I have a myriad of code pieces going on here from data.table, dplyr, base, etc. My goal is to download a group of files from NOAA into a single data frame for wrangling. Currently my code is ugly, to say the least, and of course not working. I should end up with all of the 1950 data set, then right below it the 1951 data, etc.
library(data.table)
library(XML)
library(RCurl)
library(tidyverse)
#hard code website addresses
noaa.url <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
noaa.ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
#set fixed name of files for download
details.str <- "StormEvents_details-ftp_*"
fatalities.str <- "StormEvents_fatalities-ftp_"
locations.str <- "StormEvents_locations-ftp_"
#test function to download file using manual operation
index.storm <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
storm.1950 <- fread(index.storm )
storm.1951 <- fread("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz")
#test append
storm.append <- rbind(storm.1950, storm.1951)
#create a list of colnames
detail.colnames <- colnames(storm.1950)
#-------------------------------Begin Website Scrape-------------------------------------
#create a directory from the NOAA website. Must use the FTP directory. Will get 404 error if using the http site
dir_list <-
read.table(
textConnection(
getURLContent(noaa.ftp)
),
sep = "",
strip.white = TRUE)
#subset the data we want
dir_list <- dir_list %>%
select("V9","V5")
#fix column names
colnames(dir_list) <- c("FileName", "FileSize")
#create new table for loop through list with complete website directory. This will get just the storm details we want
details.dir <- dir_list %>%
select(1) %>%
filter(str_detect(FileName,"details")) %>%
mutate(FileName = paste0(noaa.url,FileName))
#how many rows to get. could use this in counter for loop if needed
total.count <- count(details.dir)
total.count
#subset just first 5 rows
details.dirsub <- head(details.dir,5)
details.dirsub
#very basic loop and apply a list. Note: files get larger as years go on.
for (x in details.dirsub) {
something = details.dirsub$FileName
#print(something)
storm.append = lapply(something, fread) #lapply is creating a join not an append
#storm.append = rbindlist(fread(something)) #does not work
return(storm.append)
}
#expand the list into a dataframe for wrangling
storm.full <- as.data.frame(do.call(cbind, storm.append))
# try to set colnames if use sapply instead of lapply
#colnames(storm.full)
#setnames(storm.full, detail.colnames)
#filter by GEORGIA -- can not filter because lapply is creating joins instead of append. tried rbindlist() but errors.
storm.georgia <- storm.full %>%
filter(STATE == "GEORGIA")
If I understand correctly, the OP wants to read all data files whose file names include the string "details" from a certain FTP directory and combine them into one large data.frame, but keep only the rows which are related to GEORGIA.
This is what I would do using my favourite tools:
library(data.table)
library(RCurl)
library(magrittr)
noaa_ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
state_to_select <- "GEORGIA"
storm.georgia <-
getURL(noaa_ftp, dirlistonly = TRUE) %>%
fread(header = FALSE) %>%
.[V1 %like% "details.*csv.gz", V1] %>%
lapply(function(x) {
url <- file.path(noaa_ftp, x)
cat(url, "\n")
fread(url)[STATE == state_to_select]
}) %>%
rbindlist()
This filters each file directly after reading in order to reduce memory allocation. The result consists of one data.table with nearly 50k rows and 51 columns:
tables()
NAME NROW NCOL MB COLS KEY
1: storm.georgia 48,257 51 28 BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,...
Total: 28MB
Please note that there are inconsistencies in the data files, as can be seen from the last lines of output:
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz'
Content type 'unknown' length 385707 bytes (376 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz'
Content type 'unknown' length 255646 bytes (249 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz'
Content type 'unknown' length 298130 bytes (291 KB)
Warning messages:
1: In fread(url) :
Stopped early on line 33957. Expected 51 fields but found 65. Consider fill=TRUE and comment.char=. First discarded non-empty line:
<<199712,7,400,199712,8,1800,2071900,5623952,"UTAH",49,1997,"December","Winter
Storm","Z",1,"NW CACHE","SLC","07-DEC-97 04:00:00","MST","08-DEC-97
18:00:00","20","0","0","0","200K",,,,,,,,,,,,,,,,,,,,,,,,"A strong
trough moved through northern Utah on the 7th. The cold moist airmass
remained unstable into the 8th in a strong northwest flow.
Lake-enhanced snowbands developed along the Wasatch Front on the 8th
as well. Criteria snow fell across much of the state and locally
strong winds occurred >>
2: In fread(url) :
Found and resolved improper quoting out-of-sample. First healed line 23934:
<<199703,14,1543,199703,14,1543,2059347,5589290,"FLORIDA",12,1997,"March","Waterspout","C",87,"MONROE","MFL","14-MAR-97
15:43:00","EST","14-MAR-97
15:43:00","0","0","0","0",,,,,,,,,,,,,,,,,"S OF "7 MILE" BRIDGE",,,"S
OF "7 MILE" BRIDGE",,,,,"TWO WATERSPOUTS OBSERVED BY SKYWARN
SPOTTER...LOST IN RAIN SHAFT.",,"PDC">>. If the fields are not quoted
(e.g. field separator does not appear within any field), try quote=""
to avoid this warning.
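Setting the download aside, the core of this answer is to filter each yearly table as soon as it is read and only then stack the pieces row-wise (the question's do.call(cbind, ...) glues them side by side instead). A minimal base R sketch of that pattern, using small made-up data frames in place of the NOAA files and rbind in place of rbindlist:

```r
# Made-up stand-ins for two yearly files; column names are illustrative
yearly <- list(
  data.frame(YEAR = 1950, STATE = c("GEORGIA", "UTAH"),    EVENT_ID = c(1, 2)),
  data.frame(YEAR = 1951, STATE = c("FLORIDA", "GEORGIA"), EVENT_ID = c(3, 4))
)
state_to_select <- "GEORGIA"

# Filter each table first, so only the needed rows are kept in memory
filtered <- lapply(yearly, function(df) {
  df[df$STATE == state_to_select, , drop = FALSE]
})

# Append row-wise: rbind stacks the tables, cbind would join them by column
storm_georgia <- do.call(rbind, filtered)
rownames(storm_georgia) <- NULL
print(storm_georgia)
```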

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program, and the output looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows per compound is constant, but it may differ between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
j <- for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
library(tidyverse)
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
The steps above result in a data frame with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, one row per line).
From there on, we can either keep the future individual tables inside the parent table as a so-called list column (Variant A), or split column 'raw' and map over the pieces (Variant B, credits to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)
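The grouping step at the heart of this pipeline (tag each line with the most recent "Compound" header, then parse each group) can also be sketched in base R. The example below works on an in-memory character vector with made-up values rather than on the real files, and adds the missing "Index" header in the same spirit as the gsub above:

```r
# Made-up report lines, already stripped of the two title lines
lines <- c("Compound 1: One",
           "\tName\tID\tResult",
           "1\tA1234\tQC\t25.2",
           "2\tA4567\tQC\t26.8",
           "",
           "Compound 2: Two",
           "\tName\tID\tResult",
           "1\tA1234\tQC\t51.1",
           "2\tA4567\tQC\t48.6")

lines     <- lines[lines != ""]             # drop blank lines
is_header <- grepl("^Compound", lines)      # marks the "Compound X: ..." lines
block_id  <- cumsum(is_header)              # running compound counter per line

# One character vector of table lines per compound
blocks <- split(lines[!is_header], block_id[!is_header])
names(blocks) <- sub("^Compound [0-9]+: *", "", lines[is_header])

# Parse each block; the first header cell is empty, so name it "Index"
tables <- lapply(blocks, function(b) {
  b[1] <- sub("^\t", "Index\t", b[1])
  read.table(text = paste(b, collapse = "\n"), sep = "\t",
             header = TRUE, stringsAsFactors = FALSE)
})
print(tables)
```

To combine several files, the same steps could be run per file and the resulting tables stacked by compound name.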

How do I create / name dataframes in a for loop in R?

So I'm currently trying to scrape precinct results by county from JSON files on Virginia's Secretary of State. I got code working that gets the data from a URL and creates a dataframe named after the county. To speed up the process, I tried to put the code inside a for loop that iterates through Virginia's counties (which I'm sourcing from a 2020 election by county CSV already on my computer that I constructed from this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), constructs the URL for the county JSON file (since the format's consistent), and saves it to a dataframe. My current code doesn't save the dataframes though, so only the last county remains.
This is the code:
library(dplyr)
library(tidyverse)
library(jsonlite)
va <- filter(biden_margin, biden_margin$state_po == "VA")
#i put this line here because the spreadsheet uses spaces to separate "X" and "city" but the URL uses an underline
va$county_name <- gsub(" ", "_", va$county_name)
#i put this line here because the URLs have "county" in the name, but the spreadsheet doesn't; however the spreadsheet does have "city" for the independent cities, like the URLs (and the independent cities are the observations with FIPS above 51199)
va$county_name <- if_else(va$county_fips > 51199, va$county_name, paste0(va$county_name, "_COUNTY"))
#i did this as a list but i realize this might be a bad idea
governor_data <- vector(mode = "list", length = nrow(va))
for (i in nrow(va)) {
precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/", va$county_name[i], "/Governor.json")
name <- paste0(va$county_name[i], "_governor_2021")
java_source <- stream_in(file(precincts))
df <- as.data.frame(java_source$Precincts)
df$county <- java_source$Locality$LocalityName
df <- unnest(df, cols = c(Candidates))
df <- subset(df, select = -c(PoliticalParty, BallotOrder))
df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
#tried append before this, got the same result
governor_data[i] <- assign(name, df)
}
Any thoughts?
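One likely cause is visible in the loop header: for (i in nrow(va)) iterates over the single value nrow(va), so only the last county is ever processed. A minimal sketch of the usual pattern, assuming a hypothetical fetch_county() helper that stands in for the download-and-reshape steps, and a made-up va table:

```r
# Made-up stand-in for the filtered county table
va <- data.frame(county_name = c("ACCOMACK_COUNTY", "ALBEMARLE_COUNTY",
                                 "ALEXANDRIA_CITY"),
                 stringsAsFactors = FALSE)

# Hypothetical placeholder for the stream_in()/unnest()/pivot_wider() steps
fetch_county <- function(county) {
  data.frame(county = county, n_chars = nchar(county))
}

governor_data <- vector(mode = "list", length = nrow(va))
for (i in seq_len(nrow(va))) {            # seq_len() visits every row
  governor_data[[i]] <- fetch_county(va$county_name[i])
}

# Name the list elements instead of creating separate objects with assign()
names(governor_data) <- paste0(va$county_name, "_governor_2021")
print(names(governor_data))
```

Each county's data frame is then available as, for example, governor_data[["ALEXANDRIA_CITY_governor_2021"]].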

R, creating variables on the fly in a list using assign statement

I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th here) is assigned to a variable aot500. I want the function to return this variable for each combination of product and site. For example, I need the returned list to contain elements named aot500.AF, aot500.CM, aot500.RB. I am having trouble with the return statement: there is no error, but there is nothing in dat_in, whereas I expect it to have dat_in$aot500.AF etc. What is wrong with the return statement? Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
There is almost never a need to use assign(). We can solve the problem in two steps: read the files into a list, then give the list names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
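To see what the naming step produces without the data files, here is a small sketch that replaces the file reading with placeholder vectors; only the expand.grid and names logic mirrors the answer:

```r
prod <- c("inv", "tot")
site <- c("AF", "CM", "RB")

# One row per product/site combination; the first column varies fastest
prod_site <- expand.grid(prod = prod, site = site, stringsAsFactors = FALSE)

# Placeholder for the read step: element i is just the vector 1..i
res <- lapply(seq_len(nrow(prod_site)), function(i) seq_len(i))

# Name each element after its product/site combination
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")

print(names(res))
print(res[["aot500.tot.CM"]])
```

Elements can then be looked up by name, e.g. res[["aot500.tot.CM"]] or res$aot500.inv.AF.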
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data. I used various sample lengths so that the number of observations differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)
