ReadLines using multiple sources in R - r

I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I am able to use it to read individual pages but I'd like to have it so that I can just run a function that will go out and readLines() based on a csv with urls.
My knowledge of looping and function properties isn't great, but here are the pieces of my code that I'm trying to incorporate:
Here is how I build my matrix of urls which I can add to and/or turn into a csv and have a function read it that way.
MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
Here's the function (riddled with problems) I started writing:
Scrape <- function(x){
for (i in x){
URLS <- i
headers <- readLines(URLS, n=2)
bod <- readLines(URLS)
bodclipped <- bod[-c(1,2,3)]
Totes <- c(headers, bodclipped)
write(Totes, file = "[Directory]/ScrapeTest.txt")
return(head(Totes))
}
}
The idea is that I would run Scrape(urls), which would generate a combined file from the 3 urls in my "urls" matrix/csv, with the Census' built-in headers removed from all files except the first one (headers vs. bodclipped).
I've tried doing lapply() to "urls" with readLines but that only generates text based on the last url and not all three, and they still have the headers for each text file which I could just remove and then reattach at the end.
Any help would be appreciated!

As all of these documents are csv files with 38 columns, you can combine them very easily using:
MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)
What happens here and how is this looping?
The lapply function creates a list with 3 (= length(urls)) entries and populates each one with read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list of 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
The header rows seem to be broken somehow, which is why I use skip = 3, header = FALSE; this is equivalent to your bod[-c(1,2,3)].
If all the scraped data fits into memory, you can combine it this way and at the end write it to a file using:
write.csv(dat, "[Directory]/ScrapeTest.txt")
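A minimal, self-contained sketch of the same pattern, assuming three small local files stand in for the Census URLs (the file contents and the names tmp_dir, paths, and read_body are made up for illustration). Each file has three header lines that are skipped before the bodies are stacked:

```r
# Hypothetical stand-ins for the Census files: three header lines, then data rows.
tmp_dir <- tempfile("scrape_demo")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, sprintf("ne000%dy.txt", 1:3))
for (i in seq_along(paths)) {
  writeLines(
    c("Survey,Place", "Date,Name", "----,----", sprintf("%d,Town%d", 201500 + i, i)),
    paths[i]
  )
}

# Read one file, dropping its three header lines.
read_body <- function(path) {
  read.csv(path, skip = 3, header = FALSE, stringsAsFactors = FALSE)
}

raw_dat <- lapply(paths, read_body)   # list of 3 data frames
dat <- do.call(rbind, raw_dat)        # one data frame, rows stacked in file order

# Write the combined table once, after all files have been read.
out_file <- file.path(tmp_dir, "ScrapeTest.csv")
write.csv(dat, out_file, row.names = FALSE)
```

Writing once after the loop, rather than inside it, avoids the problem in the question where each iteration overwrote the previous output.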

Related

Loop over a large number of CSV files with the same statements in R?

I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_4 <- read.csv(file = 'values_04.csv', header=FALSE) # read CSV in as its own DF
values_4$howMany3s <- apply(values_4, 1, function(x) length(which(x==3))) # compute number of 3's
values_4$exactly4 <- apply(values_4[50], 1, function(x) length(which(x==4))) # show 1/0 on each column that has exactly four 3's
values_4 # print new matrix
I am then continuously copy and pasting this code and changing the "4" to a 5, 6, etc and noting the values. This seems wildly inefficient to me but I'm not experienced enough at R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
  files,
  function(f) {
    values <- read.csv(f, header=FALSE)
    apply(values, 1, function(x) length(which(x==3)))
  }
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values_04.csv"]]
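Since the answer above is untested, here is a small runnable sketch of the same idea, assuming a temporary folder with two made-up CSV matrices (demo_dir, count_threes, and the _counts.csv naming are illustrative choices, not part of the question). It counts the 3s per row and writes one result file per input:

```r
# Hypothetical demo folder with two small CSV matrices (no header row).
demo_dir <- tempfile("values_demo")
dir.create(demo_dir)
write.table(matrix(c(3, 3, 1,
                     3, 2, 1), nrow = 2, byrow = TRUE),
            file.path(demo_dir, "values_04.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(c(1, 2, 3,
                     3, 3, 3), nrow = 2, byrow = TRUE),
            file.path(demo_dir, "values_05.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# List the inputs before any output files are created in the same folder.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Count the 3s in each row of one file.
count_threes <- function(path) {
  values <- read.csv(path, header = FALSE)
  rowSums(values == 3)
}

counts <- lapply(files, count_threes)
names(counts) <- basename(files)

# Write one result file per input, e.g. values_04_counts.csv.
for (f in names(counts)) {
  out <- file.path(demo_dir, sub("\\.csv$", "_counts.csv", f))
  write.csv(data.frame(howMany3s = counts[[f]]), out, row.names = FALSE)
}
```

rowSums(values == 3) is used here as a vectorised alternative to the apply() call; both give one count per row.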

Create list of files in directory, apply (lapply) a custom function to each, and cbind results to new file

rewrote in attempt to simplify my problem statement.
I am using R V1.3.959 and relatively new to R overall. I have a custom excel form, which means the objects are in various cells in excel and the variable is also in some cell. I have over 1000 of these forms as product specs. I read in only 1 file and created a function called tidy.form to pull data out and then cbind into new file as below.
read_customer_file = "C:/Users/..../FABRIC TECHNICAL SUBMISSION AGREEMENT J123abd.xlsx"
product_tech <- read_excel(read_customer_file, sheet = "Form") %>% clean_names()
#function for make form tidy
form.extract <- function(tidy.form) {
#extract the object / data point looking for but with entire column
fabric.supplier.name <- product_tech[c( 0,5)]
#extract the specific row in the column with the data point desired
fabric.supplier.name <- slice(fabric.supplier.name, 3,0)
#rename column to correct variable
colnames(fabric.supplier.name)[colnames(fabric.supplier.name) == "x5"] <- "fabric.supplier.name"
combine <- cbind(date, fabric.supplier.name, address)
return(combine)
}
Now I need a way to read in all of the xlsx files from a directory and do the same thing for each.
I figured out how to read the file names in through:
files <- list.files(path="C:/Users/me/productspecfolder", pattern="*.xlsx", full.names=TRUE, recursive=FALSE)
However I am stuck at how to loop / lapply through my list.files and apply the function tidy.form to each.
Any help would be so much appreciated!
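One common pattern for this is to have the function take a single file path as its argument, read that file inside the function, and return a one-row data frame; lapply() then runs it over files and do.call(rbind, ...) stacks the rows. The sketch below is only an illustration under stated assumptions: small CSV files stand in for the xlsx forms so it runs without them, the supplier name is assumed to sit in row 3, column 5, and extract_form is a hypothetical name. With the real forms, the read.csv() call would be replaced by readxl::read_excel(path, sheet = "Form").

```r
# Hypothetical stand-in forms: the supplier name sits in row 3, column 5.
form_dir <- tempfile("forms_demo")
dir.create(form_dir)
for (supplier in c("Acme", "Bolt")) {
  form <- matrix("", nrow = 4, ncol = 5)
  form[3, 5] <- supplier
  write.table(form, file.path(form_dir, paste0(supplier, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

files <- list.files(form_dir, pattern = "\\.csv$", full.names = TRUE)

# Takes ONE file path, reads it, and returns a one-row data frame.
extract_form <- function(path) {
  form <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
  data.frame(
    source_file = basename(path),
    fabric.supplier.name = form[3, 5],
    stringsAsFactors = FALSE
  )
}

# Apply the function to every file, then stack the rows.
all_forms <- do.call(rbind, lapply(files, extract_form))
```

The key difference from the function in the question is that nothing inside extract_form refers to an object created outside it, so each call works on its own file.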

Load multiple .csv files in R, edit them and save as new .csv files named by a list of character strings

I am pretty new to R and programming, so I do apologize if this question has been asked elsewhere.
I'm trying to load multiple .csv files, edit them and save again. But cannot find out how to manage more than one .csv file and also name new files based on a list of character strings.
So I have .csv file and can do:
species_name<-'ace_neg'
{species<-read.csv('species_data/ace_neg.csv')
species_1_2<-species[,1:2]
species_1_2$species<-species_name
species_3_2_1<-species_1_2[,c(3,1,2)]
write.csv(species_3_2_1, file='ace_neg.csv',row.names=FALSE)}
But I would like to run this code for all .csv files in the folder and add text to a new column based on .csv file name.
So I can load all .csv files and make a list of character strings for use as a new column text and as new file names.
NDOP_files <- list.files(path="species_data", pattern="*.csv$", full.names=TRUE, recursive=FALSE)
short_names<- substr(NDOP_files, 14,20)
Then I tried:
lapply(NDOP_files, function(x){
species<-read.csv(x)
species_1_2<-species[,1:2]
species_1_2$species<-'name' #don't know how to insert first character string of short_names instead of 'name', than second character string from short_names for second csv. file etc.
Then continue in the code to change an order of columns
species_3_2_1<-species_1_2[,c(3,1,2)]
And then write all new modified csv. files and name them again by the list of short_names.
I'm sorry if the text is somewhat confusing.
Any help or suggestions would be great.
You are actually quite close, and using lapply() is a really good idea.
As you state, the issue is that it only takes one list as an argument, but you want to work with two. mapply() is a function in base R that you can feed multiple lists into and cycle through in parallel. lapply() and mapply() are both designed to create/manipulate objects in R, but you want to write files and are not interested in the output within R. The purrr package has the walk*() functions, which are useful when you want to cycle through lists and are only interested in creating side effects (in your case, saving files).
purrr::walk2() takes two lists, so you can provide the data and the file names at the same time.
library(purrr)
First I create some example data (I’m basically already using the same concept here as I will below):
test_data <- map(1:5, ~ data.frame(
a = sample(1:5, 3),
b = sample(1:5, 3),
c = sample(1:5, 3)
))
walk2(test_data,
paste0("species_data/", 1:5, "test.csv"),
~ write.csv(.x, .y))
Instead of getting the file paths and then stripping away the path
to get the file names, I just call list.files(), once with full.names = TRUE and once with full.names = FALSE.
NDOP_filepaths <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = TRUE,
recursive = FALSE
)
NDOP_filenames <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = FALSE,
recursive = FALSE
)
Now I feed the two lists into purrr::walk2(). Using the ~ before the curly brackets, I can define the anonymous function a bit more elegantly and then use .x and .y to refer to the entries of the first and the second list.
walk2(NDOP_filepaths,
      NDOP_filenames,
      ~ {
        species <- read.csv(.x)
        species <- species[, 1:2]
        # file name without the .csv extension
        species$species <- sub("\\.csv$", "", .y)
        write.csv(species, .x)
      })
Learn more about purrr at purrr.tidyverse.org.
Alternatively, you could just extract the file name in the loop and stick to lapply() or use purrr::map()/purrr::walk(), like this:
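lapply(NDOP_filepaths,
       function(x) {
         species <- read.csv(x)
         species <- species[, 1:2]
         # strip the folder and the extension to get the species name
         species$species <- gsub("species_data/|\\.csv$", "", x)
         # write to the working directory under the bare file name
         write.csv(species, gsub("species_data/", "", x))
       })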
NDOP_files <- list.files(path="species_data", pattern="*.csv$",
full.names=TRUE, recursive=FALSE)
# Get name of each file (without the extension)
# basename() removes all of the path up to and including the last path separator
# file_path_sans_ext() removes the .csv extension
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))
Then, I would write a function that takes in 1 csv file and does some manipulation to the file and outputs out a data frame. Since you have a list of csv files from using list.files, you can use the map function in the purrr package to apply your function to each csv file.
doSomething <- function(NDOP_file){
  # your code here to manipulate NDOP_file to your liking
  return(NDOP_file)
}
NDOP_files <- map(NDOP_files, ~doSomething(.x))
Lastly, you can manipulate the file names when you write the new csv files using csvFileNames and a custom function you write to change the file name to your liking. Essentially, use the same architecture of defining your custom function and using map to apply to each of your files.
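The first answer mentions mapply() without showing it. A base-R sketch of that route, assuming a throwaway input folder with two made-up species files and a separate output folder (in_dir, out_dir, and process_species are hypothetical names): it walks over the paths and the short names in parallel, adds the name as a column, reorders the columns as in the question, and writes each file under its original name.

```r
# Hypothetical input: two small species files in a temporary folder.
in_dir <- tempfile("species_data")
out_dir <- tempfile("species_out")
dir.create(in_dir)
dir.create(out_dir)
for (nm in c("ace_neg", "bet_pen")) {
  write.csv(data.frame(x = 1:2, y = 3:4, extra = 5:6),
            file.path(in_dir, paste0(nm, ".csv")), row.names = FALSE)
}

paths <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
short_names <- tools::file_path_sans_ext(basename(paths))

# Read one file, keep the first two columns, tag rows with the species name,
# move that column to the front, and write the result under the same file name.
process_species <- function(path, species_name) {
  species <- read.csv(path)[, 1:2]
  species$species <- species_name
  species <- species[, c(3, 1, 2)]
  write.csv(species, file.path(out_dir, basename(path)), row.names = FALSE)
  invisible(species)
}

# mapply() steps through both vectors in parallel; SIMPLIFY = FALSE keeps a list.
results <- mapply(process_species, paths, short_names, SIMPLIFY = FALSE)
```

Deriving short_names with basename() and file_path_sans_ext() avoids the fixed character positions used by substr() in the question, which break as soon as a file name has a different length.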

Reading separate text files and saving them in a single variable as separate dataframes

I have multiple text files (tab-delimited) generated from the same software. I initially used a loop with assign function to create variables dynamically and store them separately with the read.table function. This resulted in too many variables and was obviously time-consuming to apply operations on separate files.
I came across the lapply and fread method shown in the code below.
I don't need to merge them and they need to be separate data frames so I can compare values in the files. Using the lapply function, this was possible but the file names were not retained in any way. I found the following code from How to import multiple .csv files at once? that helped me with it. It has multiple lines and I was wondering whether there is a one-line solution for this.
foo <- function(fname){
fread(fname, skip = 5, header = TRUE, sep = "\t") %>%
mutate(fn = fname)
}
all <- lapply(files, FUN = foo)
Alternatively, how do I access the specific iteration in lapply?
We can use setNames
all <- setNames(lapply(files, foo), files)
We can also make a general function that will set the names as the files are imported:
import_with_names <- function(files){
loaded <- list()
for (fname in files){
loaded[[fname]] <- fread(fname, skip = 5, header = TRUE, sep = "\t")
}
return(loaded)
}
all <- import_with_names(files)
You can then call them by using all[[file_name]]
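On the side question of accessing the current iteration inside lapply(): one common approach is to iterate over positions with seq_along() rather than over the values themselves. A sketch assuming two made-up tab-delimited files, with base read.delim() used in place of fread() so it runs without extra packages (demo_dir and all_runs are hypothetical names):

```r
# Hypothetical tab-delimited files in a temporary folder.
demo_dir <- tempfile("tab_demo")
dir.create(demo_dir)
files <- file.path(demo_dir, c("run_a.txt", "run_b.txt"))
write.table(data.frame(id = 1:2, value = c(0.5, 1.5)), files[1],
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 1:3, value = c(2, 4, 6)), files[2],
            sep = "\t", row.names = FALSE, quote = FALSE)

# Iterate over positions so both the index i and the file name are available.
all_runs <- lapply(seq_along(files), function(i) {
  df <- read.delim(files[i])
  df$fn <- basename(files[i])   # file name, as in the question
  df$iteration <- i             # position of this file in `files`
  df
})
names(all_runs) <- basename(files)
```

Each element stays a separate data frame, and all_runs[["run_b.txt"]] retrieves one of them by file name.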

Select multiple rows in multiple DFs with loop in R

I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
  # Read in a csv file from the files vector
  df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
  # Add a column telling us the name of the csv file that the data came from
  df$SourceFile = file
  # Select only the rows we want and return them
  df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
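One thing to watch with fixed row numbers: indexing a data frame with row positions beyond nrow(df) returns all-NA rows rather than an error, so a shorter questionnaire file would silently add empty rows. A small sketch of a guarded selection, assuming two made-up data frames in place of the real files (wanted_rows and select_rows are hypothetical names):

```r
# Hypothetical data frames standing in for the questionnaire files.
df.list <- list(
  subject_1 = data.frame(item = 1:10, score = 11:20),
  subject_2 = data.frame(item = 1:4,  score = 21:24)
)
wanted_rows <- c(3:5, 8:9)

select_rows <- function(df, source) {
  # Keep only the wanted rows that actually exist in this data frame.
  keep <- intersect(wanted_rows, seq_len(nrow(df)))
  out <- df[keep, , drop = FALSE]
  out$SourceFile <- rep(source, nrow(out))
  out
}

# Map() passes each data frame together with its name.
selected <- Map(select_rows, df.list, names(df.list))
allDFs <- do.call(rbind, selected)
```

Here subject_2 has only four rows, so it contributes rows 3 and 4 and nothing else.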
