I'm very new to R, but I'm hoping this will be simple. I scraped an HTML table into R and it looks like this:
`'data.frame': 238 obs. of 6 variables:
$ Facility Name : chr "Affinity Healthcare Center" "Alameda Care Center" "Alcott Rehabilitation Hospital" "Alden Terrace Convalescent Hospital" ...
$ City : chr "Paramount" "Burbank" "Los Angeles" "Los Angeles" ...
$ State : chr " CA" " CA" " CA" " CA" ...
$ Confirmed Staff : chr "26" "36" "14" "27" ...
$ Confirmed Residents: chr "29" "49" "26" "85" ...
$ Total Deaths : int 26 36 14 27 19 3 1 7 16 3 ...`
I want Confirmed Staff, Confirmed Residents and Total Deaths to be integers so I can do some math on them and sort, order, etc.
I tried this for one variable and it seemed to work ok:
`tbls_ls4$`Total Deaths` <- as.integer(tbls_ls4$`Total Deaths`)`
But I'd like to apply it to all three variables and I'm not sure how to do that.
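There are several ways to do that. With data.table:
library(data.table)
setDT(df)
cols <- c("Confirmed Staff", "Confirmed Residents", "Total Deaths")
# wrap cols in parentheses so that := assigns to the columns named in cols
# instead of creating a single new column literally called "cols"
df[, (cols) := lapply(.SD, as.integer), .SDcols = cols]
or, if you prefer base R (on a plain data.frame, without calling setDT() first):
df[cols] <- lapply(df[cols], as.integer)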
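You could also use mutate_at from dplyr (loaded as part of the tidyverse) to convert certain columns from character to integer. Because these column names contain spaces, they have to be wrapped in backticks.
library(tidyverse)
df <- df %>%
  # Indicate the column names you wish to convert inside vars
  mutate_at(vars(`Confirmed Staff`, `Confirmed Residents`, `Total Deaths`),
            # Indicate the function to apply to each column
            as.integer)
# In dplyr >= 1.0.0, mutate(across(c(...), as.integer)) is the preferred spelling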
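As a quick check of the base R approach, here is a minimal sketch on a toy data frame built from the first two rows of the str() output in the question. The name `facilities`, the `trimws()` call for the padded State values, and the sort at the end are my own additions for illustration, not part of the answers above.

```r
# Toy data: first two rows from the str() output in the question
facilities <- data.frame(
  `Facility Name` = c("Affinity Healthcare Center", "Alameda Care Center"),
  State = c(" CA", " CA"),
  `Confirmed Staff` = c("26", "36"),
  `Confirmed Residents` = c("29", "49"),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Convert the selected character columns to integer in one step
cols <- c("Confirmed Staff", "Confirmed Residents")
facilities[cols] <- lapply(facilities[cols], as.integer)

# The scraped State values carry a leading space; strip it
facilities$State <- trimws(facilities$State)

# Now numeric sorting works: most confirmed residents first
facilities <- facilities[order(-facilities$`Confirmed Residents`), ]
str(facilities)
```

If a scraped column contains anything other than digits (for example "1,234" or "N/A"), as.integer() returns NA with a warning, so it is worth checking for NAs after the conversion.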
I was hoping somebody could help me with a problem I'm having with an exercise on the DataCamp "Building Web Applications with Shiny in R" course, specifically with transforming one of the datasets they use in the exercise.
I've imported their dataset (RDS) using the readRDS function and it looks like this:
$ id : int 10259 25693 20130 22213 13162 6602 42779 3735 16903 12734 ...
$ cuisine : chr "greek" "southern_us" "filipino" "indian" ...
$ ingredients:List of 39774
..$ : chr "romaine lettuce" "black olives" "grape tomatoes" "garlic" ...
..$ : chr "plain flour" "ground pepper" "salt" "tomatoes" ...
..$ : chr "eggs" "pepper" "salt" "mayonaise" ...
..$ : chr "water" "vegetable oil" "wheat" "salt"
..$ : chr "black pepper" "shallots" "cornflour" "cayenne pepper" ...
..$ : chr "plain flour" "sugar" "butter" "eggs" ...
..$ : chr "olive oil" "salt" "medium shrimp" "pepper" ...
..$ : chr "sugar" "pistachio nuts" "white almond bark" "flour" ...
..$ : chr "olive oil" "purple onion" "fresh pineapple" "pork" ...
..$ : chr "chopped tomatoes" "fresh basil" "garlic" "extra-virgin olive oil" ...
In their tutorial, they have a dataset that's been transformed so that there are three columns, id, cuisine and ingredients, but ingredients only has one ingredient (meaning there are multiple rows for the same id).
Usually when I have to do something like this, I use the tidyr function gather(), but that won't work here because it gathers multiple columns rather than splitting up a column that contains character vectors of varying length. I also tried the separate() function, but it requires you to specify which columns to separate the vectors into, which I can't do because they all vary in length.
If somebody could give me an idea as to how I'd go about transforming the above dataframe so that it's longform, I'd be very grateful.
Many thanks!
Sounds like you are looking for spread: https://tidyr.tidyverse.org/reference/spread.html. This effectively does the opposite of gather.
It should also be mentioned that gather and spread are superseded and no longer being developed, having been replaced by their arguably more explicit counterparts pivot_longer and pivot_wider: https://tidyr.tidyverse.org/reference/pivot_longer.html and https://tidyr.tidyverse.org/reference/pivot_wider.html. DataCamp may not have updated their courses to reflect this, however.
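For the list-column in the question specifically (one character vector of ingredients per row), another option is to repeat each id and cuisine once per ingredient. Below is a minimal base R sketch of that idea on toy data shaped like the str() output; the names `recipes` and `long` are made up for illustration. As far as I know, `tidyr::unnest(recipes, cols = ingredients)` gives the same long shape in one call, but I have not run it against the DataCamp file.

```r
# Toy data shaped like the question's data: one list element per recipe
recipes <- data.frame(
  id = c(10259L, 25693L),
  cuisine = c("greek", "southern_us"),
  stringsAsFactors = FALSE
)
recipes$ingredients <- list(
  c("romaine lettuce", "black olives"),
  c("plain flour", "ground pepper", "salt")
)

# Number of ingredients in each recipe
n <- lengths(recipes$ingredients)

# Repeat id and cuisine once per ingredient, then flatten the list column
long <- data.frame(
  id = rep(recipes$id, times = n),
  cuisine = rep(recipes$cuisine, times = n),
  ingredient = unlist(recipes$ingredients),
  stringsAsFactors = FALSE
)
print(long)
```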
I was wondering if there is a way to automatically pull the Russell 3000 holdings from the iShares website in R using read_html from the rvest package?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table on the bottom, not just top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but I can't figure out the path I need to have R automatically pull from the iShares website (and I haven't found other reputable websites with all ~3000 holdings). Here is the code I used for the S&P 500:
library(rvest)
library(dplyr)
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_node("table.wikitable") %>%
  html_table() %>%
  select('Symbol', 'Security', 'GICS Sector', 'GICS Sub Industry') %>%
  as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here), the following is prohibited:
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you make sure you are abiding by those terms before using their data. For educational purposes, here is how the data could be obtained:
First you need to get to the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the network traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all). After locating this, it is just cleaning the data. Example:
library(jsonlite)
# Locate the raw data by searching the Network traffic:
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
# > Large list (10.4 Mb)
# use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
# > Large list (50677 elements, 6.9Mb)
y1 <- y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Update: in case you are unable to clean the data, here you go:
testdf<- data.frame(matrix(unlist(y), nrow=50677, byrow=T),stringsAsFactors=FALSE)
#Where we want to break the DF at (every nth row)
breaks <- 17
#number of rows in full DF
nbr.row <- nrow(testdf)
repeats<- rep(1:ceiling(nbr.row/breaks),each=breaks)[1:nbr.row]
#split DF from clean-up
newDF <- split(testdf,repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
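If you would rather end up with one row per holding instead of a list of 17-row data frames, one option is to fill a matrix 17 values at a time. This is only a sketch on a toy stand-in for `y` (two fake records, with placeholder values in positions 4 to 17); the column names `ticker`, `name` and `asset_class` are my guesses based on the values shown above, not names returned by the site.

```r
# Toy stand-in for the flattened list `y`: 2 records of 17 fields each
y <- as.list(c(
  "MSFT", "MICROSOFT CORP", "Equity", as.character(4:17),
  "AAPL", "APPLE INC", "Equity", as.character(4:17)
))

fields_per_record <- 17

# Fill row by row so that each run of 17 values becomes one row
holdings <- as.data.frame(
  matrix(unlist(y), ncol = fields_per_record, byrow = TRUE),
  stringsAsFactors = FALSE
)

# Keep the first three fields and give them readable (assumed) names
holdings <- holdings[, 1:3]
names(holdings) <- c("ticker", "name", "asset_class")
print(holdings)
```

Note that unlist() coerces everything to character here, so numeric fields would need converting back afterwards.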
I am trying to scrape a table from http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary into RStudio.
Here's the code
library(rvest)
url <- 'http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary'
webpage <- read_html(url)
candidate_info <- html_nodes(webpage, xpath = '//*[@id="main"]/div/div[2]/div[2]/table')
candidate_info <- html_table(candidate_info)
head(candidate_info)
But I am getting no output. Can you suggest what I am doing wrong?
That site has some very broken HTML. But, it's workable.
I find it better to target nodes in a slightly less fragile way. The XPath below finds it by content of the table.
html_table() croaks (or took forever and I didn't want to wait) so I ended up building the table "manually".
library(rvest)
# helper to clean column names
mcga <- function(x) { make.unique(gsub("(^_|_$)", "", gsub("_+", "_", gsub("[[:punct:][:space:]]+", "_", tolower(x)))), sep = "_") }
pg <- read_html("http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary")
# target the table
tab <- html_node(pg, xpath=".//table[contains(thead, 'Liabilities')]")
# get the rows so we can target columns
rows <- html_nodes(tab, xpath=".//tr[td[not(@colspan)]]")
# make a data frame
do.call(
cbind.data.frame,
c(lapply(1:8, function(i) {
html_text(html_nodes(rows, xpath=sprintf(".//td[%s]", i)), trim=TRUE)
}), list(stringsAsFactors=FALSE))
) -> xdf
# make nicer names
xdf <- setNames(xdf, mcga(html_text(html_nodes(tab, "th")))) # get the header to get column names
str(xdf)
## 'data.frame': 4823 obs. of 8 variables:
## $ sno : chr "1" "2" "3" "4" ...
## $ candidate : chr "A Hasiv" "A Wahid" "Aan Shikhar Shrivastava" "Aaptab Urf Aftab" ...
## $ constituency : chr "ARYA NAGAR" "GAINSARI" "GOSHAINGANJ" "MUBARAKPUR" ...
## $ party : chr "BSP" "IND" "Satya Shikhar Party" "Islam Party Hind" ...
## $ criminal_case: chr "0" "0" "0" "0" ...
## $ education : chr "12th Pass" "10th Pass" "Graduate" "Illiterate" ...
## $ total_assets : chr "Rs 3,94,24,827 ~ 3 Crore+" "Rs 75,106 ~ 75 Thou+" "Rs 41,000 ~ 41 Thou+" "Rs 20,000 ~ 20 Thou+" ...
## $ liabilities : chr "Rs 58,46,335 ~ 58 Lacs+" "Rs 0 ~" "Rs 0 ~" "Rs 0 ~" ...
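Every column comes back as character, so the money columns need one more cleaning step before you can do arithmetic on them. Here is a minimal sketch, assuming the values always look like the ones printed above ("Rs <amount> ~ <rounded label>"); the vector `total_assets` is toy data copied from that output and `rupees` is a name I made up.

```r
# Sample values copied from the str() output above
total_assets <- c(
  "Rs 3,94,24,827 ~ 3 Crore+",
  "Rs 75,106 ~ 75 Thou+",
  "Rs 0 ~"
)

# Drop everything from "~" onwards, then keep only the digits
exact_part <- sub("~.*$", "", total_assets)
rupees <- as.numeric(gsub("[^0-9]", "", exact_part))
print(rupees)
```

Cells that hold no digits at all would come out as NA, so check for those on the full table.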
I have a data frame, headcount.df, which looks like this:
Classes ‘data.table’ and 'data.frame': 2762 obs. of 7 variables:
$ Worker ID : chr "1693" "1812" "1822" "1695" ...
$ Job Posting Title: chr "Accountant" "Business Analyst I" "Finance Analyst II" "Business Analyst V" ...
$ State/Province : chr "Texas" "Michigan" "Heredia" "California" ...
$ Country : chr "USA" "USA" "CRI" "USA" ...
$ Worker Start Date: POSIXct, format: "2016-05-01" "2016-05-01" "2016-05-01" "2016-05-01" ...
$ Worker End Date : POSIXct, format: "2017-04-30" "2017-04-30" "2017-04-30" "2017-04-30" ...
$ Labor Type : chr "Business Professional" "Business Professional" "Business Professional" "Business Professional" ...
Note that there may be duplicate records in it.
I am able to create a chart with ggplot2 using the code below:
library(dplyr)
library(ggplot2)
x <- "2017-03-03"
y <- "2017-10-31"
headcountbar <- headcount.df %>%
  filter(`Worker Start Date` >= x & `Worker End Date` <= y) %>%
  group_by(`State/Province`) %>%
  # the ID column is called `Worker ID` in the data frame
  summarise(Headcount = n_distinct(`Worker ID`))
ggplot(data = headcountbar, aes(x = `State/Province`, y = Headcount, fill = `State/Province`)) +
  geom_bar(stat = "identity", position = position_dodge())
The above code only gives me a headcount of workers between the two dates; I would like to be able to break it down by month/quarter as well.
I would like to use shinydashboard to make this more interactive where I can select the x axis to maybe show headcount by state over time range or headcount by labor type.
I know there is a lot in here so any guidance is greatly appreciated.
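One possible way to get a per-month breakdown is to count, for each month, the distinct workers whose start/end range overlaps that month. The sketch below uses base R and toy data; the column names `worker_id`, `start` and `end` are simplified stand-ins for the real ones, and "active during the month" is my assumption about what headcount should mean here (the filter above instead keeps workers whose whole assignment falls inside the range).

```r
# Toy data with simplified (made-up) column names
headcount <- data.frame(
  worker_id = c("1693", "1812", "1822"),
  start = as.Date(c("2016-05-01", "2017-01-15", "2017-02-01")),
  end   = as.Date(c("2017-04-30", "2017-02-28", "2017-03-31")),
  stringsAsFactors = FALSE
)

# First day of every month in the reporting range
months <- seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "month")

monthly <- data.frame(
  month = months,
  headcount = vapply(seq_along(months), function(i) {
    month_start <- months[i]
    # last day of the month = first day of the next month minus one day
    month_end <- seq(month_start, by = "month", length.out = 2)[2] - 1
    # a worker counts if their date range overlaps the month at all
    active <- headcount$start <= month_end & headcount$end >= month_start
    # unique() guards against duplicate records for the same worker
    length(unique(headcount$worker_id[active]))
  }, integer(1))
)
print(monthly)
```

The resulting `monthly` data frame can go straight into ggplot with month on the x axis; adding a grouping column such as state to the counting step would give one series per state.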
Can anyone suggest how to read the Avg Price and Value columns into R from the website below?
I don't understand what is happening: with the same code I am able to read all the columns except these two.
The code I am using is:
library(rvest)
library(dplyr)
url="http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell"
url_html<-read_html(url)
SharesTraded_html=html_nodes(url_html,'td:nth-child(6)')
SharesTraded=html_text(SharesTraded_html)
SharesTraded=as.numeric(gsub(",",'',SharesTraded))
AvgPriceDollars_html=html_node(url_html,'td:nth-child(7)')
AvgPriceDollars=html_text(AvgPriceDollars_html)
AvgPriceDollars
http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell
The simplest way to do that is to use html_table:
library(rvest)
library(dplyr)
url <- read_html("http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell")
tb <- url %>%
html_node("#insidertab") %>%
html_nodes("table") %>%
html_table(fill = TRUE) %>%
as.data.frame()
str(tb)
'data.frame': 253 obs. of 9 variables:
$ Reported.Time: chr "2017-09-15 21:00:47" "2017-09-15 20:11:26" "2017-09-15 20:11:26" "2017-09-15 20:10:27" ...
$ Tran. : chr "2017-09-12 Purchase" "2017-09-13 Sale" "2017-09-14 Sale" "2017-09-15 Sale" ...
$ Company : chr "Double Eagle Acquisition Corp." "PHIBRO ANIMAL HEALTH CORP" "PHIBRO ANIMAL HEALTH CORP" "Guidewire Software, Inc." ...
$ Ticker : chr "EAGL" "PAHC" "PAHC" "GWRE" ...
$ Insider : chr "SAGANSKY JEFFREYChief Executive Officer, Director, 10% owner" "Johnson Richard GChief Financial Officer" "Johnson Richard GChief Financial Officer" "Roza ScottChief Business Officer" ...
$ Shares.Traded: chr "30,000" "15,900" "39,629" "782" ...
$ Avg.Price : chr "$10.05" "$36.46" "$36.23" "$78.20" ...
$ Value : chr "$301,500" "$579,714" "$1,435,758" "$61,152" ...
$ Filing : logi NA NA NA NA NA NA ...
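The table comes back with Shares.Traded, Avg.Price and Value as character, so if you want numbers you still need to strip the "$" and "," characters. Here is a minimal sketch on toy data copied from the first two rows of the output above; the helper name `to_number` is made up for illustration.

```r
# Toy data: first two rows of the relevant columns from the output above
tb <- data.frame(
  Shares.Traded = c("30,000", "15,900"),
  Avg.Price = c("$10.05", "$36.46"),
  Value = c("$301,500", "$579,714"),
  stringsAsFactors = FALSE
)

# Remove dollar signs and thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

num_cols <- c("Shares.Traded", "Avg.Price", "Value")
tb[num_cols] <- lapply(tb[num_cols], to_number)
str(tb)
```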