Reading numerous HTML tables into R

I'm trying to pull HTML data tables into a single data frame, and I'm looking for an elegant solution. There are 255 tables, and the URLs vary by two variables: year and aldermanic district. I know there must be a way to use for loops or something similar, but I'm stumped.
I have successfully imported the data by reading each table in with a separate line of code, but this results in one line per table, and again, there are 255 tables.
library(XML)    # readHTMLTable()
library(dplyr)  # bind_rows()
data <- bind_rows(readHTMLTable("http://assessments.milwaukee.gov/SalesData/2018_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2017_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2016_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2015_RVS_Dist14.htm", skip.rows=1),
                  # ...and so on, one readHTMLTable call per table
Ideally, I could use a for loop or something similar so I wouldn't have to hand-code a readHTMLTable call for each table.

You could try creating a vector containing all the URLs you want to scrape, and then iterating over them with a for loop:
url1 <- "http://assessments.milwaukee.gov/SalesData/"
url2 <- "_RVS_Dist"
years <- 2015:2018
dist <- 1:15
# Build every year/district combination; note each URL must end in ".htm"
urls <- apply(expand.grid(paste0(url1, years), paste0(url2, dist, ".htm")),
              1, paste, collapse = "")
data <- NULL
for (url in urls) {
  df <- readHTMLTable(url, skip.rows = 1)[[1]]  # first (and only) table on the page
  data <- rbind(data, df)
}
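Not every year/district combination is guaranteed to exist on the server, and readHTMLTable() throws an error on a missing page. A minimal sketch of guarding the call with tryCatch(), so one bad URL doesn't abort the whole loop:
data <- NULL
for (url in urls) {
  tbl <- tryCatch(readHTMLTable(url, skip.rows = 1)[[1]],
                  error = function(e) NULL)  # skip pages that fail to load
  if (!is.null(tbl)) data <- rbind(data, tbl)
}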

We can use map_dfr from the purrr package (part of the tidyverse) to apply the readHTMLTable function across the URLs. The key is to identify the part that differs between URLs. In this case only the year (2015:2018) changes, so we can construct each URL with paste0. map_dfr automatically combines all the data frames into one combined data frame. dat is the final output.
library(tidyverse)
library(XML)
dat <- map_dfr(2015:2018,
               ~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
                                     .x,
                                     "_RVS_Dist14.htm"),
                              skip.rows = 1)[[1]])
Update
Here is the way to expand the combinations of years and district numbers, and then download the data with map2_dfr.
url <- expand.grid(Year = 2002:2018, Number = 1:15)  # 17 years x 15 districts = 255 tables
dat <- map2_dfr(url$Year, url$Number,
                ~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
                                      .x,
                                      "_RVS_Dist",
                                      .y,
                                      ".htm"),
                               skip.rows = 1)[[1]])

Related

What is the easiest way to import SeatGeek API Data (JSON format) into a flat data frame?

Within R, I've attempted to query the SeatGeek API for a list of venues using jsonlite, but I keep running into issues with the nested format of the JSON data I'm downloading (i.e. rather than 'location' holding a single value, it holds a list of 'lat' and 'lon' variables).
What's the easiest way to create a flat data frame with one value per cell (rather than a list of multiple values in a single cell)?
Currently, I'm returning a data frame where each cell contains a list holding the desired value, rather than the value itself (typically each list holds a single value, but in some cases more than one, e.g. location, which contains both latitude and longitude).
library(httr)
library(jsonlite)
# Assemble the query URL from its parts
perpage <- "per_page="
pagenumber <- "page="
pp <- 5000
pn <- 0
ven <- paste("https://api.seatgeek.com/2/venues?", "country=US&", perpage, (pp), "&", pagenumber, (pn+1), "&client_id=NTM2MzE3fDE1NzM4NTExMTAuNzU&client_secret=77264dfa5a0bc99095279fa7b01c223ff994437433c214c8b9a08e6de10fddd6", sep = "")
ven
vpg <- GET("https://api.seatgeek.com/2/venues?country=US&per_page=5000&page=1&client_id=NTM2MzE3fDE1NzM4NTExMTAuNzU&client_secret=77264dfa5a0bc99095279fa7b01c223ff994437433c214c8b9a08e6de10fddd6")
vpgc <- content(vpg)
vpgcv <- vpgc$venues
# Attempt to flatten each venue: replace NULLs with NA, then transpose into
# a one-row data frame (this is the step that produces the list columns)
json_file <- sapply(vpgcv, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
  as.data.frame(t(x))
})
venues.dataframe <- as.data.frame(t(json_file))
Any help at a more efficient approach to pull this nested data would be greatly appreciated!
Reduce() with bind_rows() works:
json_file <- Reduce(dplyr::bind_rows, lapply(vpgcv, unlist))
EDIT:
Use bind_rows() instead of rbind(), because rbind() does not match columns by name when combining rows.
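If you are open to re-parsing the raw response, jsonlite can also do the flattening at parse time, turning nested objects such as location into ordinary columns (location.lat, location.lon). A sketch, assuming vpg is the GET() response from above:
library(httr)
library(jsonlite)
raw_txt <- content(vpg, as = "text", encoding = "UTF-8")  # raw JSON text
venues_df <- fromJSON(raw_txt, flatten = TRUE)$venues     # nested fields become columns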

How to clean multiple Excel files at one time in R?

I have more than one hundred Excel files that need cleaning, all with the same data structure. The code listed below is what I use to clean a single Excel file. The file names all follow the pattern 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## extract the blocks between the divider rows
in_between <- function(df, start, end){
  return(df[start:end,])
}
dividers <- which(df$Product_Models == 'SKU')
df <- lapply(1:(length(dividers)-1),
             function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
df <- do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame.
Can you guys help me write code that cleans all the Excel files at one time, so I don't need to run this script a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
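To build file_names itself, list.files() with full.names = TRUE returns ready-to-use paths. A minimal sketch, assuming (hypothetically) that the workbooks sit in a quotes/ subfolder:
# 'quotes' is a placeholder; point this at wherever your .xlsx files live
file_names <- list.files("quotes", pattern = "\\.xlsx$", full.names = TRUE)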
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
                       X__1 = "New.1",
                       X__5 = "New.5",
                       X__9 = "New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
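A sketch of that two-step read, assuming (hypothetically) that the project name sits in cell B1 and the real header row is row 3 of each sheet; adjust range and skip to match your files:
# Grab just the cell that holds the project name
name_cell <- read_excel(file, sheet = "EQuote", range = "B1", col_names = FALSE)
project_name <- gsub(".*:", "", name_cell[[1]])
# Then skip the title rows so read_excel picks up the true header row
df <- read_excel(file, sheet = "EQuote", skip = 2)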

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each data frame. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "\\.shp$")  # match files ending in .shp
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names; I just want to extract the entry in a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. That seems to imply the fault may be in cbind vs. rbind... not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
  # Add a column with the year
  dat$Year = substr(x, 8, 11)
  return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
       function(x) {
         dat = get(x)
         dat$year = sub('.*_(.*)$', '\\1', x)
         return(dat)
       })
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
  names(x) <- c("a", "b")
  return(x)
})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2 <- do.call(rbind, lapply(ls(pattern='points_'),
                                 function(x) {
                                   dat = get(x)
                                   dat$year = substr(x, 8, 11)  # characters 8-11 of "points_2013.shp"
                                   dat
                                 }))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again and use a cbind, so I can subtract two columns from each other.
I use the following code:
for (i in names.of.objects){
  temp <- get(i)
  # do transformations on temp
  assign(i, temp)
}
This works, but it is definitely not performant, since it passes the whole data set by value twice per iteration (once through get() and once through assign()).
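Keeping the data frames in a named list avoids the get()/assign() round trip entirely. A sketch, assuming the same points_YYYY.shp naming as above:
# Read every shapefile into a named list instead of separate global objects
shps <- dir(getwd(), "\\.shp$")
points_list <- setNames(lapply(shps, readShapePoints), shps)
# Add the year column; the modified copies stay in the list
points_list <- lapply(names(points_list), function(nm) {
  dat <- points_list[[nm]]
  dat$year <- substr(nm, 8, 11)  # characters 8-11 of "points_2013.shp"
  dat
})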

R - Subset a Dataframe with a Programmatically built Formula

I'm working with a large data frame pulled from a data lake, which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file, which I read in and generate all possible combinations of. I want to loop through each of those combinations and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'), 6),
                            transport = rep(c('road','air','sea')),
                            category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country), 6),
                 transport = rep(unique(data_settings$transport), 10),
                 category = rep(c('A','B'), 15),
                 values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
  sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
  for(j in 2:ncol(data_settings)){
    col <- names(data_settings)[j]
    val <- as.character(data_settings[i,j])
    sub_string <- paste0(sub_string, ' & ', col, " == ", "'", val, "'")
  }
  df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
With no by argument, merge() joins on all shared column names, so it returns exactly the rows of df that match a row of data_settings, with no formula building required.
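If you would rather keep only the matching rows of df without pulling in any extra columns from data_settings, dplyr's semi_join() expresses the same filter. A hedged sketch, using the column names from the example data:
library(dplyr)
# Keep rows of df that match some row of data_settings on all shared columns
df_sub <- semi_join(df, data_settings,
                    by = c("country", "transport", "category"))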
I worked with my previous approach a bit more today, without using subset, filter, etc., and put together the following, which does what I want well enough by filtering column by column against each row of the data_settings frame.
for(i in 1:nrow(data_settings)){
  df_sub <- df
  for(j in 1:ncol(data_settings)){
    col <- names(data_settings)[j]
    val <- as.character(data_settings[i,j])
    df_col <- grep(col, names(df))
    df_sub <- df_sub[df_sub[,df_col] == val,]
  }
  # Run further analysis here...
}
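One caution on the loop above: grep() does partial matching, so a settings column named, say, cat would also match a data column named category. match() gives the exact-name version:
# Exact column lookup instead of partial-matching grep()
df_col <- match(col, names(df))
df_sub <- df_sub[df_sub[[df_col]] == val, ]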

R for loop - appending results outside the loop

I'm taking an introductory R programming course on Coursera. The first assignment has us evaluating a list of hundreds of csv files in a specified directory ("./specdata/"). Each csv file, in turn, contains hundreds of records of atmospheric pollutant samples: a date, a sulfate measurement, a nitrate measurement, and an ID that identifies the sampling location.
The assignment asks us to create a function that takes the pollutant and an id or range of ids for the sampling location, and returns the sample mean for the supplied arguments.
My code (below) uses a for loop driven by the id argument to read only the files of interest (this seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of each csv file to a variable, make that variable a data frame, and use na.omit to remove the records with missing values. Then I use rbind to append the result of each iteration to the variable. When I print the data frame variable within the loop, I can see the entire list, subgrouped by id. But when I print the variable outside the loop, I only see the records from the last element of the id vector.
I would like to build a consolidated set of all records matching the id argument within the loop, then pass that consolidated set outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? Seems like it could work. Any help would be most appreciated. I searched StackOverflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "\\.csv$")
  x <- paste(directory, x, sep = "")
  id1 <- id[1]
  id2 <- id[length(id)]
  for (i in id1:id2) {
    df <- read.csv(x[i], header = TRUE)
    df <- data.frame(df)
    df <- na.omit(df)
    df <- rbind(df)
    print(df)
  }
  # would like a consolidated set of records here to do more stuff,
  # e.g. filter on pollutant and calculate the mean
}
You can just define the data frame outside the for loop and append to it. You can also skip some of the intermediate steps... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "\\.csv$")
  x <- paste(directory, x, sep = "")
  df_final <- data.frame()
  for (i in id) {
    df <- read.csv(x[i], header = TRUE)
    df <- na.omit(df)
    df_final <- rbind(df_final, df)  # append this file's records
    print(df)
  }
  # df_final now holds the consolidated records; filter on pollutant
  # and calculate the mean here
  return(df_final)
}
By only calling df <- rbind(df) you are effectively overwriting df every time. You can fix this by doing something like this:
df <- data.frame()        # empty data frame
for(i in 1:10) {          # for all your csv files
  x <- mean(rnorm(10))    # some new information
  df <- rbind(df, x)      # bind the old data frame and the new value
}
By the way, if you know beforehand how big df will be, then growing it with rbind is not the proper way to do it; preallocating, or building a list and binding once, is faster.
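The usual idiom when the final size isn't known is to collect the pieces in a list (cheap to grow) and bind once at the end. A sketch in the shape of the assignment, assuming x and id as defined in the function above:
# Read each file into a list element, then bind all the pieces at once
pieces <- lapply(id, function(i) {
  df <- read.csv(x[i], header = TRUE)
  na.omit(df)  # drop records with missing values
})
df_final <- do.call(rbind, pieces)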
