I am trying to import calendar dates in R.
I found a website with dates that I imported with XML.
library('XML')
u="http://www.timeanddate.com/calendar/custom.html?year=2015&country=5&typ=0&display=2&cols=0&hol=0&cdt=1&holm=1&df=1"
tables = readHTMLTable(u)
Get rid of some unnecessary elements
tables = tables[-1]
tables = tables[-1]
tables = tables[-13]
Generate list names
names(tables) <- paste('month', 1:12, sep = '')
Add a Month column to each table, with a solution proposed here
mtables = mapply(cbind, tables, 'Month'= 1:12, SIMPLIFY=F)
Then, when I try to rbind my list:
do.call('rbind', mtables)
I get an error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Could you help me solve this error?
rbind normally takes two or more data frames with matching column names.
Here is a code snippet using rbind.
Hope this helps.
Cheers,
Oliver
vehicles1 <- unique(grep("Vehicles", SCC$EI.Sector, ignore.case = TRUE, value = TRUE))
vehicles <- SCC[SCC$EI.Sector %in% vehicles1, ]["SCC"]
# Select observations relating to Baltimore MD
vehiclesBaltimore <- NEI[NEI$SCC %in% vehicles$SCC & NEI$fips == "24510",]
# Select observations relating to Los Angeles County CA
vehiclesLosAngelesCounty <- NEI[NEI$SCC %in% vehicles$SCC & NEI$fips == "06037",]
# Merge observations of Baltimore and Los Angeles County
vehiclesCompare <- rbind(vehiclesBaltimore, vehiclesLosAngelesCounty)
The issue was actually in the header. I had to use
`tables = readHTMLTable(u, header = F)`
instead of
`tables = readHTMLTable(u, header = T)`
so that every table in the list gets the same column names.
Thanks
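To illustrate why the column names matter, here is a minimal sketch with two tiny made-up data frames standing in for two of the scraped month tables (the values and the names `jan`, `feb`, `day`, `weekday` are assumptions for illustration only). rbind on data frames matches columns by name, so differing names fail, and giving every table the same names lets the bind go through:

```r
# Two made-up tables standing in for scraped months; note the differing column names.
jan <- data.frame(V1 = c("1", "2"), V2 = c("Thu", "Fri"), stringsAsFactors = FALSE)
feb <- data.frame(X1 = c("1", "2"), X2 = c("Sun", "Mon"), stringsAsFactors = FALSE)

# rbind() on data frames matches columns by name, so this raises an error;
# tryCatch() captures the message instead of stopping the script.
res <- tryCatch(rbind(jan, feb), error = function(e) conditionMessage(e))
print(res)

# Give every table the same column names, then add the month and bind.
tables <- list(month1 = jan, month2 = feb)
tables <- lapply(tables, setNames, c("day", "weekday"))
mtables <- Map(cbind, tables, Month = seq_along(tables))
combined <- do.call(rbind, mtables)
print(combined)
```

With header = F, readHTMLTable assigns default column names to every table, which has the same effect as the setNames step here.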
Related
So I'm currently trying to scrape precinct results by county from JSON files on Virginia's Secretary of State. I got code working that gets the data from a URL and creates a dataframe named after the county. To speed up the process, I tried to put the code inside a for loop that iterates through Virginia's counties (which I'm sourcing from a 2020 election by county CSV already on my computer that I constructed from this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), constructs the URL for the county JSON file (since the format's consistent), and saves it to a dataframe. My current code doesn't save the dataframes though, so only the last county remains.
This is the code:
library(dplyr)
library(tidyverse)
library(jsonlite)
va <- filter(biden_margin, biden_margin$state_po == "VA")
#i put this line here because the spreadsheet uses spaces to separate "X" and "city" but the URL uses an underline
va$county_name <- gsub(" ", "_", va$county_name)
#i put this line here because the URLs have "county" in the name, but the spreadsheet doesn't; however the spreadsheet does have "city" for the independent cities, like the URLs (and the independent cities are the observations with FIPS above 51199)
va$county_name <- if_else(va$county_fips > 51199, va$county_name, paste0(va$county_name, "_COUNTY"))
#i did this as a list but i realize this might be a bad idea
governor_data <- vector(mode = "list", length = nrow(va))
for (i in nrow(va)) {
precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/", va$county_name[i], "/Governor.json")
name <- paste0(va$county_name[i], "_governor_2021")
java_source <- stream_in(file(precincts))
df <- as.data.frame(java_source$Precincts)
df$county <- java_source$Locality$LocalityName
df <- unnest(df, cols = c(Candidates))
df <- subset(df, select = -c(PoliticalParty, BallotOrder))
df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
#tried append before this, got the same result
governor_data[i] <- assign(name, df)
}
Any thoughts?
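One likely cause, sketched below under assumptions: `for (i in nrow(va))` iterates over the single number nrow(va), so the body runs once for the last row only, and `governor_data[i] <- ...` with single brackets is not the usual way to store a data frame in a list slot. The sketch uses a hypothetical `fetch_county` helper in place of the download and reshaping steps, and three illustrative locality names, so it runs without network access:

```r
# Hypothetical stand-in for the download-and-reshape step in the loop above;
# it only builds a one-row data frame so the sketch runs offline.
fetch_county <- function(county) {
  data.frame(county = county, name_length = nchar(county), stringsAsFactors = FALSE)
}

# Illustrative locality names, formatted the way the URLs expect them.
county_names <- c("ACCOMACK_COUNTY", "ALBEMARLE_COUNTY", "ALEXANDRIA_CITY")

governor_data <- vector(mode = "list", length = length(county_names))
names(governor_data) <- paste0(county_names, "_governor_2021")

# seq_along() yields 1, 2, 3; a bare length or nrow() would yield only the last index.
for (i in seq_along(county_names)) {
  # Double brackets store the whole data frame as one list element.
  governor_data[[i]] <- fetch_county(county_names[i])
}

# Optionally stack everything into a single data frame.
all_counties <- do.call(rbind, governor_data)
print(all_counties)
```

Keeping the results in a named list avoids assign() entirely; each county is then available as governor_data[["ACCOMACK_COUNTY_governor_2021"]] and so on.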
I'm basically trying to call an API to retrieve weather information from a government website.
library(data.table)
library(jsonlite)
library(httr)
base<-"https://api.data.gov.sg/v1/environment/rainfall"
date1<-"2020-01-25"
call1<-paste(base,"?","date","=",date1,sep="")
get_rainfall<-GET(call1)
get_rainfall_text<-content(get_rainfall,"text")
get_rainfall_json <- fromJSON(get_rainfall_text, flatten = TRUE)
get_rainfall_df <- as.data.frame(get_rainfall_json)
I'm getting an error
"Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 52, 287, 1"
I'm not sure how to resolve this. I'm trying to format the retrieved data into a data frame so I can make sense of the readings.
Your "get_rainfall_json" object comes back as a "list". Trying to turn this into a data frame is where you are getting the error. If you specify the "items" object within the list, your error is resolved! (The outcome of this looks like it has some more embedded data within objects... So you'll have to parse through that into a format you're interested in.)
get_rainfall_df <- as.data.frame(get_rainfall_json$items)
Update
To unpack the nested data, here is one way you could do it: loop through each row, extract the list stored in that row, turn it into a data frame, and append it to "df". You are then left with one final df with all the data in one place.
library(data.table)
library(jsonlite)
library(httr)
library(dplyr)
base <- "https://api.data.gov.sg/v1/environment/rainfall"
date1 <- "2020-01-25"
call1 <- paste(base, "?", "date", "=", date1, sep = "")
get_rainfall <- GET(call1)
get_rainfall_text <- content(get_rainfall,"text")
get_rainfall_json <- fromJSON(get_rainfall_text, flatten = TRUE)
get_rainfall_df <- as.data.table(get_rainfall_json$items)
df <- data.frame()
for (row in 1:nrow(get_rainfall_df)) {
  new_date <- get_rainfall_df[row, ]$readings[[1]]
  colnames(new_date) <- c("stationid", "value")
  date <- get_rainfall_df[row, ]$timestamp
  new_date$date <- date
  df <- bind_rows(df, new_date)
}
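For reference, the same unpacking can be sketched without growing a data frame inside a loop. This is only an illustration in base R: the `items` object below is a made-up stand-in for get_rainfall_json$items (two timestamps, each with a small data frame of readings; station ids and values are invented), and the first lines reproduce the original error on a toy list:

```r
# A list whose elements have different lengths cannot become one data frame directly.
toy <- list(a = 1:2, b = 1:3, c = 1)
msg <- tryCatch(as.data.frame(toy), error = function(e) conditionMessage(e))
print(msg)

# Made-up stand-in for get_rainfall_json$items: one row per timestamp,
# with a list column holding a data frame of station readings.
items <- data.frame(
  timestamp = c("2020-01-25T00:05:00+08:00", "2020-01-25T00:10:00+08:00"),
  stringsAsFactors = FALSE
)
items$readings <- list(
  data.frame(station_id = c("S77", "S109"), value = c(0, 0.2)),
  data.frame(station_id = c("S77", "S109"), value = c(0.4, 0))
)

# Tag each nested data frame with its timestamp, then stack them in one call.
per_row <- Map(function(readings, timestamp) {
  names(readings) <- c("stationid", "value")
  readings$date <- timestamp
  readings
}, items$readings, items$timestamp)
df <- do.call(rbind, per_row)
print(df)
```

Building the pieces first and binding once tends to be faster than calling bind_rows on every iteration, although for a single day of readings the difference is small.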
There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.
I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.
I would appreciate any pointers on resolving this problem.
library(rvest)
d1 <- read_html(
"http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
)
TaxCountry <- d1 %>%
html_nodes('.countryNameQC') %>%
html_text()
The data is dynamically loaded and the DOM is altered when JavaScript runs in the browser. This doesn't happen with rvest, which only sees the raw HTML.
The following selectors, in the browser, would have isolated your nodes:
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear
But those classes are not even present in what rvest returns.
The data of interest is actually stored in several nodes, all of which have ids with a common prefix of dspQCLinks.
So, you can gather all those nodes using the css attribute = value syntax with the starts-with operator (^):
html_nodes(page, "[id^=dspQCLinks]")
Then extract the text and combine into one string
paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')
Now each row in your table is delimited by "!,", so we can split on that to generate the rows:
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
An example row would then look like:
"Albania#/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income#15"
If we split each row on the #, the data we want is at indices 1 and 3:
arr = strsplit(i, '#')[[1]]
country <- arr[1]
tax <- arr[3]
Thanks to @Brian's feedback I have removed the loop I had to build the dataframe and replaced it with, to quote @Brian,
str_split_fixed(info, "#", 3) [which] gives you a character matrix, which can be directly coerced to a dataframe.
df <- data.frame(str_split_fixed(info, "#", 3))
You then remove the empty rows at the bottom of the df.
df <- df[df$Country != "",]
Sample of df:
R:
library(rvest)
library(stringr)
library(magrittr)
page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
df <- data.frame(str_split_fixed(info, "#", 3))
colnames(df) <- c("Country","Link","Tax")
df <- subset(df, select = c("Country","Tax"))
df <- df[df$Country != "",]
View(df)
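If you want to check the splitting logic without hitting the site, here is a small sketch in base R. It assumes the rows have the shape of the Albania example above; the second row is made up, and an empty string is included to mimic the empty piece left over after splitting on "!,":

```r
# Rows shaped like the example above; "Exampleland" is a made-up entry and
# the empty string mimics the trailing piece left after splitting on "!,".
info <- c(
  "Albania#/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income#15",
  "Exampleland#/uk/taxsummaries/wwts.nsf/ID/Exampleland#20",
  ""
)

# Drop empty rows first so every remaining row splits into exactly three fields.
info <- info[info != ""]

# strsplit() returns one character vector per row; rbind them into a matrix.
parts <- do.call(rbind, strsplit(info, "#", fixed = TRUE))

# Keep country (field 1) and tax rate (field 3).
df <- data.frame(Country = parts[, 1], Tax = parts[, 3], stringsAsFactors = FALSE)
print(df)
```

This does the same job as str_split_fixed for well-formed rows; str_split_fixed is more forgiving when a row has fewer than three fields, because it pads with empty strings.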
Python:
I did this first in Python as it was quicker for me:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')
text = ''
for i in soup.select('[id^=dspQCLinks]'):
    text += i.text
rows = text.split('!,')
countries = []
tax_info = []
for row in rows:
    if row:
        items = row.split('#')
        countries.append(items[0])
        tax_info.append(items[2])
df = pd.DataFrame(list(zip(countries, tax_info)))
print(df)
Reading:
str_split_fixed
My code below scrapes data from IMDB across multiple pages. However, when I try to combine the data into one data frame, I get an error about differing numbers of rows for gross and metascore. How would I go about inserting NA values in those empty places so the vectors are equal in length? (Note: I had to remove some links because I need a certain amount of rep to post more links.)
urls <- c("https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=51&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=101&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=151&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=201&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=251&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=301&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=351&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=401&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=451&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=501&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=551&ref_=adv_nxt",
"https://www.imdb.com/search/title?
)
results_list <- list()
for(.page in seq_along(urls)){
  webpage <- read_html(urls[[.page]])
  titlehtml <- html_nodes(webpage,'.lister-item-header a')
  title <- html_text(titlehtml)
  runtimehtml <- html_nodes(webpage,'.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min","",runtime)
  ratinghtml <- html_nodes(webpage,'.ratings-imdb-rating strong')
  rating<- html_text(ratinghtml)
  voteshtml <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
  votes <- html_text(voteshtml)
  votes<-gsub(",","",votes)#removing commas
  metascorehtml <- html_nodes(webpage,'.metascore')
  metascore <- html_text(metascorehtml)
  metascore<-gsub(" ","",metascore)#removing extra space in metascore
  grosshtml <- html_nodes(webpage,'.ghost~ .text-muted+ span')
  gross <- html_text(grosshtml)
  gross<-gsub("M","",gross)#removing '$' and 'M' signs
  gross<-substring(gross,2,6)
  results_list[[.page]] <- data.frame(Title = title,
                                      Runtime = as.numeric(runtime),
                                      Rating = as.numeric(rating),
                                      Metascore = as.numeric(metascore),
                                      Votes = as.numeric(votes),
                                      Gross_Earning_in_Mil = as.numeric(unlist(gross))
  )
}
final_results <- plyr::ldply(results_list)
Error in data.frame(Title = title, Runtime = as.numeric(runtime), Rating = as.numeric(rating), :
arguments imply differing number of rows: 50, 49, 48
You need to know where your data is missing, so you need to know which items belong together. Right now you just have separate vectors of values, so you don't know which belong together.
Looking at the page, it looks like they are neatly organized into "lister-item-content" nodes, so the clean thing to do is first extract those nodes, and only then pull out more info from each unit separately. Something like this works for me:
items <- html_nodes(webpage,'.lister-item-content')
gross <- sapply(items, function(i) {html_text(html_node(i, '.ghost~ .text-muted+ span'))})
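Because html_node (singular) returns exactly one result per item, this inserts NA at every place where an element of 'items' does not contain the node you're looking for, so all vectors keep the same length.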
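Here is a small offline sketch of that idea. The HTML is made up (two items, the second without a metascore) and only borrows the class names used above, so treat it as an illustration rather than the real page structure:

```r
library(rvest)

# Made-up page with two items; the second one has no .metascore node.
page <- read_html('
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film A</a></h3>
  <span class="metascore">74</span>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film B</a></h3>
</div>')

# First isolate the units, then query inside each unit.
items <- html_nodes(page, ".lister-item-content")

# html_node() gives one result per item, with a missing node where nothing matches,
# and html_text() turns a missing node into NA.
title <- html_text(html_node(items, ".lister-item-header a"))
metascore <- html_text(html_node(items, ".metascore"))

movies <- data.frame(Title = title, Metascore = as.numeric(metascore))
print(movies)
```

Since every column now has one entry per item, data.frame no longer complains about differing numbers of rows.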
How can I ignore a data set if some column names don't exist in it?
I have a list of weather data from a stream but I think certain key weather conditions don't exist and therefore I have this error below with rbind:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
My code:
weatherDf <- data.frame()
for(i in weatherData) {
# Get the airport code.
airport <- i$airport
# Get the date.
date <- as.POSIXct(as.numeric(as.character(i$timestamp))/1000, origin="1970-01-01", tz="UTC-1")
# Get the data in dailysummary only.
dailySummary <- i$dailysummary
weatherDf <- rbind(weatherDf, ldply(
list(dailySummary),
function(x) c(airport, format(as.Date(date), "%Y-%m-%d"), x[["meanwindspdi"]], x[["meanwdird"]], x[["meantempm"]], x[["humidity"]])
))
}
So how can I make sure these key conditions below exist in the data:
meanwindspdi
meanwdird
meantempm
humidity
If any of them does not exist, then ignore the whole record. Is it possible?
EDIT:
The content of weatherData is in jsfiddle (I can't post it here as it is too long and I dunno where is the best place to show the data publicly for R...)
EDIT 2:
I get some error when I try to export the data into a txt:
> write.table(weatherData,"/home/teelou/Desktop/data/data.txt",sep="\t",row.names=FALSE)
Error in data.frame(date = list(pretty = "January 1, 1970", year = "1970", :
arguments imply differing number of rows: 1, 0
What does it mean? It seems that there are some errors in the data...
EDIT 3:
I have exported my entire data in .RData to my google drive:
https://drive.google.com/file/d/0B_w5RSQMxtRSbjdQYWJMX3pfWXM/view?usp=sharing
If you use RStudio, then you can just import the data.
EDIT 4:
target_names <- c("meanwindspdi", "meanwdird", "meantempm", "humidity")
# If it has data then loop it.
if (!is.null(weatherData)) {
# Initialize a data frame.
weatherDf <- data.frame()
for(i in weatherData) {
if (!all(target_names %in% names(i)))
next
# Get the airport code.
airport <- i$airport
# Get the date.
date <- as.POSIXct(as.numeric(as.character(i$timestamp))/1000, origin="1970-01-01", tz="UTC-1")
# Get the data in dailysummary only.
dailySummary <- i$dailysummary
weatherDf <- rbind(weatherDf, ldply(
list(dailySummary),
function(x) c(airport, format(as.Date(date), "%Y-%m-%d"), x[["meanwindspdi"]], x[["meanwdird"]], x[["meantempm"]], x[["humidity"]])
))
}
# Rename column names.
colnames(weatherDf) <- c("airport", "key_date", "ws", "wd", "tempi", 'humidity')
# Convert certain weatherDf columns to numeric.
columns <-c("ws", "wd", "tempi", "humidity")
weatherDf[, columns] <- lapply(columns, function(x) as.numeric(weatherDf[[x]]))
}
Inspect the weatherDf:
> View(weatherDf)
Error in .subset2(x, i, exact = exact) : subscript out of bounds
You can use next to skip the current iteration of the loop and go to the next iteration:
target_names <- c("meanwindspdi", "meanwdird", "meantempm", "humidity")
for(i in weatherData) {
  if (!all(target_names %in% names(i)))
    next
  # continue with loop...
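As a fuller illustration of the skip-with-next idea, here is a self-contained sketch on made-up records. It assumes, based on the question's code, that the four keys live inside each element's dailysummary rather than at the top level of the element; the airport codes and values are invented:

```r
# Made-up records shaped like the elements of weatherData described in the question.
weatherData <- list(
  list(airport = "AAA", timestamp = "1420070400000",
       dailysummary = list(meanwindspdi = "5", meanwdird = "180",
                           meantempm = "7", humidity = "80")),
  list(airport = "BBB", timestamp = "1420156800000",
       dailysummary = list(meantempm = "9"))  # three of the four keys are missing
)

target_names <- c("meanwindspdi", "meanwdird", "meantempm", "humidity")
rows <- list()

for (i in weatherData) {
  dailySummary <- i$dailysummary
  # Assumption: the keys are checked inside dailysummary, not on the element itself.
  if (!all(target_names %in% names(dailySummary))) next

  date <- as.POSIXct(as.numeric(i$timestamp) / 1000, origin = "1970-01-01", tz = "UTC")
  rows[[length(rows) + 1]] <- data.frame(
    airport  = i$airport,
    key_date = format(date, "%Y-%m-%d"),
    ws       = as.numeric(dailySummary$meanwindspdi),
    wd       = as.numeric(dailySummary$meanwdird),
    tempi    = as.numeric(dailySummary$meantempm),
    humidity = as.numeric(dailySummary$humidity),
    stringsAsFactors = FALSE
  )
}

# Bind once at the end; incomplete records never reach this point.
weatherDf <- do.call(rbind, rows)
print(weatherDf)
```

If every record is skipped, rows stays empty and do.call(rbind, rows) returns NULL, so it may be worth checking length(rows) before using weatherDf.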