I have a data frame headcount.df which looks like below:
Classes ‘data.table’ and 'data.frame': 2762 obs. of 7 variables:
$ Worker ID : chr "1693" "1812" "1822" "1695" ...
$ Job Posting Title: chr "Accountant" "Business Analyst I" "Finance Analyst II" "Business Analyst V" ...
$ State/Province : chr "Texas" "Michigan" "Heredia" "California" ...
$ Country : chr "USA" "USA" "CRI" "USA" ...
$ Worker Start Date: POSIXct, format: "2016-05-01" "2016-05-01" "2016-05-01" "2016-05-01" ...
$ Worker End Date : POSIXct, format: "2017-04-30" "2017-04-30" "2017-04-30" "2017-04-30" ...
$ Labor Type : chr "Business Professional" "Business Professional" "Business Professional" "Business Professional" ...
Note that there may be duplicate records in it.
I am able to create a chart with ggplot using the code below:
x <- "2017-03-03"
y <- "2017-10-31"
headcountbar <- headcount.df %>%
filter(`Worker Start Date`>=x & `Worker End Date`<=y) %>%
group_by(`State/Province`) %>%
summarise(Headcount = n_distinct(`Worker ID`))
ggplot(data = headcountbar, aes(x=`State/Province`,y = Headcount, fill = `State/Province` )) +
geom_bar(stat="identity",position = position_dodge())
The above code only gives me a headcount of workers between the two dates. I would like to be able to break it down by month or quarter as well.
I would like to use shinydashboard to make this more interactive where I can select the x axis to maybe show headcount by state over time range or headcount by labor type.
I know there is a lot in here so any guidance is greatly appreciated.
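For the month breakdown, one possible starting point is sketched below in base R. It is only a sketch under assumptions: the sample data is made up, the dates are stored as Date rather than POSIXct, and a worker is counted in a month if their start/end range overlaps that month (which is a different rule from the filter used above). The names headcount, months and monthly are hypothetical.

```r
# Hypothetical sample data using the column names from the question
headcount <- data.frame(
  `Worker ID` = c("1", "2", "2", "3"),          # worker 2 is a duplicate record
  `State/Province` = c("Texas", "Texas", "Texas", "Michigan"),
  `Worker Start Date` = as.Date(c("2017-01-15", "2017-02-01", "2017-02-01", "2017-03-10")),
  `Worker End Date` = as.Date(c("2017-03-20", "2017-02-28", "2017-02-28", "2017-04-30")),
  check.names = FALSE, stringsAsFactors = FALSE
)

# First day of each month in the reporting window (assumed window)
months <- seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "month")

# For each month, count distinct workers per state whose date range overlaps it
monthly <- do.call(rbind, lapply(seq_along(months), function(i) {
  m_start <- months[i]
  m_end <- seq(m_start, by = "month", length.out = 2)[2] - 1  # last day of month
  active <- headcount[headcount$`Worker Start Date` <= m_end &
                        headcount$`Worker End Date` >= m_start, ]
  if (nrow(active) == 0) return(NULL)  # nobody active in this month
  counts <- tapply(active$`Worker ID`, active$`State/Province`,
                   function(ids) length(unique(ids)))
  data.frame(Month = format(m_start, "%Y-%m"),
             `State/Province` = names(counts),
             Headcount = as.integer(counts),
             check.names = FALSE, stringsAsFactors = FALSE)
}))

print(monthly)
```

The resulting monthly data frame has one row per month and state, so it could be passed to ggplot with Month on the x axis and `State/Province` as the fill.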
Very new to R, but I'm hoping this will be simple. I have an HTML table I scraped into R and it looks like this:
`'data.frame': 238 obs. of 6 variables:
$ Facility Name : chr "Affinity Healthcare Center" "Alameda Care Center" "Alcott Rehabilitation Hospital" "Alden Terrace Convalescent Hospital" ...
$ City : chr "Paramount" "Burbank" "Los Angeles" "Los Angeles" ...
$ State : chr " CA" " CA" " CA" " CA" ...
$ Confirmed Staff : chr "26" "36" "14" "27" ...
$ Confirmed Residents: chr "29" "49" "26" "85" ...
$ Total Deaths : chr 26 36 14 27 19 3 1 7 16 3 ...`
I want Confirmed Staff, Confirmed Residents and Total Deaths to be integers so I can do some math on them and sort, order, etc.
I tried this for one variable and it seemed to work ok:
tbls_ls4$`Total Deaths` <- as.integer(tbls_ls4$`Total Deaths`)
But I'd like to apply it to all three variables and I'm not sure how to do that.
Several ways to do that:
library(data.table)
setDT(df)
cols <- c("Confirmed Staff", "Confirmed Residents", "Total Deaths")
df[,cols := lapply(.SD, as.integer),.SDcols = cols]
Or, if you prefer base R (on a plain data.frame, i.e. without the setDT() call):
df[cols] <- lapply(df[cols], as.integer)
You could also use mutate_at from the dplyr package (part of the tidyverse) to convert specific columns from character to integer. Because these column names contain spaces, they must be wrapped in backticks.
library(tidyverse)
df<-df %>%
# Indicate the column names you wish to convert inside vars
mutate_at(vars(`Confirmed Staff`,`Confirmed Residents`,`Total Deaths`),
# Indicate the function to apply to each column
as.integer)
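One thing to watch with scraped tables: as.integer returns NA for strings that contain thousands separators such as "1,036". The sketch below strips non-digit characters first. The sample values and the stray space are made up for illustration; only the column names come from the question.

```r
# Hypothetical sample rows; column names follow the question
df <- data.frame(
  `Facility Name` = c("A", "B"),
  `Confirmed Staff` = c("26", "1,036"),   # assumed value with a thousands separator
  `Total Deaths` = c(" 3", "19"),         # assumed value with a leading space
  check.names = FALSE, stringsAsFactors = FALSE
)

cols <- c("Confirmed Staff", "Total Deaths")

# Drop everything except digits and a minus sign, then convert
df[cols] <- lapply(df[cols], function(x) as.integer(gsub("[^0-9-]", "", x)))

str(df)
```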
I was wondering if there is a way to automatically pull the Russell 3000 holdings from the iShares website in R using the read_html (or rvest) function?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table on the bottom, not just top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the SP500 holdings from Wikipedia, but can't seem to figure out the path I need to put in to have R automatically pull from the iShares website (and there aren't other reputable websites I've found with all ~3000 holdings). Here is the code used for SP500:
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")%>%
html_node("table.wikitable")%>%
html_table()%>%
select('Symbol','Security','GICS Sector','GICS Sub Industry')%>%
as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here):
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you make sure you are abiding by those terms before using their data. For educational purposes, here is how the data could be obtained:
First you need to get to the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the network traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all). After locating this, it is just cleaning the data. Example:
library(jsonlite)
#Locate the raw data by searching the Network traffic:
url="https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
#pull the data in via fromJSON
x<-jsonlite::fromJSON(url,flatten=TRUE)
>Large list (10.4 Mb)
#use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y<-lapply(rapply(x, enquote, how="unlist"), eval)
>Large list (50677 elements, 6.9Mb)
y1<-y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: In case you are unable to clean the data, here you are:
testdf<- data.frame(matrix(unlist(y), nrow=50677, byrow=T),stringsAsFactors=FALSE)
#Where we want to break the DF at (every nth row)
breaks <- 17
#number of rows in full DF
nbr.row <- nrow(testdf)
repeats<- rep(1:ceiling(nbr.row/breaks),each=breaks)[1:nbr.row]
#split DF from clean-up
newDF <- split(testdf,repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
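If one row per holding is more convenient than a list of one-column data frames, the flattened values can also be laid out with matrix(..., byrow = TRUE). This is a sketch on a small made-up vector of 4 fields per record (the real data appears to have 17); the object names and column names are hypothetical.

```r
# Hypothetical flattened values: 2 records of 4 fields each
flat <- c("MSFT", "MICROSOFT CORP", "Equity", "2.95",
          "AAPL", "APPLE INC", "Equity", "2.89")

fields_per_record <- 4  # assumption for this toy example

# Fill the matrix row by row so each record becomes one row
holdings <- as.data.frame(
  matrix(flat, ncol = fields_per_record, byrow = TRUE),
  stringsAsFactors = FALSE
)
names(holdings) <- c("ticker", "name", "asset_class", "weight")

# Everything arrives as character, so convert numeric fields explicitly
holdings$weight <- as.numeric(holdings$weight)

print(holdings)
```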
I have a dataframe called forecast.df:
> str(forecast.df)
Classes ‘data.table’ and 'data.frame': 1027 obs. of 9 variables:
$ group : chr "IT" "IT" "IT" "IT" ...
$ Name : chr "name1" "name1" "name2" "name2" ...
$ position: chr "Specialist" "Specialist" "Analyst" "Analyst" ...
$ job : chr "job1" "job2" "job3" "job4" ...
$ dept : chr "IT" "FIN" "FIN" "P&C" ...
$ bucket : chr "Apr-18" "Apr-18" "Apr-18" "Apr-18" ...
$ start : Date, format: "2018-01-02" "2018-01-02" "2018-01-15" "2018-01-22" ...
$ end : Date, format: "2018-04-06" "2018-01-26" "2018-04-20" "2018-04-06" ...
$ hours : int 149 8 109 123 44 124 125 142 70 75 ...
- attr(*, ".internal.selfref")=<externalptr>
And instead of a start and end date, I am trying to transform it so each row has a single date, and a job that takes 3 days has three rows associated with it (needed for the visualization we are doing.)
The code I am using is this:
tidyForecast.df <- setDT(forecast.df)[ , list(group = group
, name = Name
, position = position
, job = job
, dept = dept
, bucket = bucket
, hours = hours
, date = seq(start
, end
, by = "day"))
, by = 1:nrow(forecast.df)]
And the error I am getting when I use this is:
Error in seq.int(0, to0 - from, by) : wrong sign in 'by' argument
I have never encountered this error before, and I used this same format earlier in the code and it worked, so maybe it's something nuanced?
Found what was going wrong: there was a single instance in the 1027 observations where the start date was after the end date. This is why it worked in the past but stopped working when I used it on new data. For that row the difference between the two dates was negative, so the positive by = "day" step pointed the wrong way, which is what the "wrong sign in 'by' argument" error reports.
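For anyone hitting the same error, a quick check for offending rows before expanding might look like the sketch below. It uses base R and a made-up three-row data frame rather than the real forecast.df; the names forecast, bad, ok and expanded are hypothetical.

```r
# Hypothetical sample: the third row has start after end
forecast <- data.frame(
  job = c("job1", "job2", "job3"),
  start = as.Date(c("2018-01-02", "2018-01-15", "2018-04-10")),
  end = as.Date(c("2018-01-04", "2018-01-15", "2018-04-06")),
  stringsAsFactors = FALSE
)

# Rows that would trigger "wrong sign in 'by' argument"
bad <- forecast[forecast$start > forecast$end, ]
print(bad)

# Expand only the valid rows to one row per day
ok <- forecast[forecast$start <= forecast$end, ]
expanded <- do.call(rbind, lapply(seq_len(nrow(ok)), function(i) {
  data.frame(job = ok$job[i],
             date = seq(ok$start[i], ok$end[i], by = "day"),
             stringsAsFactors = FALSE)
}))
print(expanded)
```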
I know there are many posts and references for regex and gsub solutions, but nothing I am doing is working so I apologize if this is repetitive but I've been stuck for days.
I have a list of text that looks like this in the data frame:
c("pop", "rap", "trap music")
I would like it to look like this instead, with the c, quotes, and parentheses removed:
pop, rap, trap music
I have tried so many combinations of str_replace and gsub. I have also tried to separate the list into different columns using tidyr, but values like "trap music" got split up into separate columns. Thanks for your help in advance.
EDIT: This is the str for the column I need help with.
> str(Artist_Genre_final$artist_genres)
List of 100
$ : chr [1:5] "canadian hip hop" "canadian pop" "hip hop" "pop rap" ...
$ : chr [1:4] "hip hop" "pop rap" "rap" "west coast rap"
$ : chr [1:3] "pop" "rap" "trap music"
$ : chr [1:2] "pop" "rap"
$ : chr [1:4] "edm" "electropop" "pop" "tropical house"
Here is the str for the entire data frame.
> str(Artist_Genre_final)
'data.frame': 100 obs. of 3 variables:
$ Artist : chr "Drake" "Kendrick Lamar" "Lil Uzi Vert" "Post Malone" ...
$ Track : chr "One Dance" "HUMBLE." "XO TOUR Llif3" "rockstar" ...
$ artist_genres:List of 100
..$ : chr "canadian hip hop" "canadian pop" "hip hop" "pop rap" ...
..$ : chr "hip hop" "pop rap" "rap" "west coast rap"
..$ : chr "pop" "rap" "trap music"
I reproduced three rows of your data frame:
Artist <- c("Drake", "Kendrick Lamar", "Lil Uzi Vert")
Track <- c("One Dance", "HUMBLE.", "XO TOUR Llif3")
artist_genres <- list(c("canadian hip hop", "canadian pop", "hip hop", "pop rap"),
c("hip hop", "pop rap", "rap", "west coast rap"),
c("pop", "rap", "trap music"))
Artist_Genre_final <- data.frame(Artist, Track, artist_genres=as.matrix(artist_genres), stringsAsFactors=FALSE)
Then I tested whether it gives the same str() output as yours:
str(Artist_Genre_final)
# 'data.frame': 3 obs. of 3 variables:
# $ Artist : chr "Drake" "Kendrick Lamar" "Lil Uzi Vert"
# $ Track : chr "One Dance" "HUMBLE." "XO TOUR Llif3"
# $ artist_genres:List of 3
# ..$ : chr "canadian hip hop" "canadian pop" "hip hop" "pop rap"
# ..$ : chr "hip hop" "pop rap" "rap" "west coast rap"
# ..$ : chr "pop" "rap" "trap music"
That seemed good, so I printed one element with cat(paste()):
cat(paste(Artist_Genre_final$artist_genres[[3]], collapse=", "))
# pop, rap, trap music
You need to access the atomic vector, hence [[3]]; otherwise you get c("pop", "rap", "trap music"), because you are printing a list of length 1, not the character vector itself.
EDIT:
Here's a simple function to apply it to the whole list. Of course, there might be smarter ways to do this, but at least this will get you started.
paste_genres <- function(x) {
result <- character()
for (i in seq_along(x)) result <- append(result, paste(x[[i]], collapse = ", "))
return(result)
}
temp <- paste_genres(Artist_Genre_final$artist_genres)
cat(temp, sep = "\n")
# canadian hip hop, canadian pop, hip hop, pop rap
# hip hop, pop rap, rap, west coast rap
# pop, rap, trap music
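As one of the possibly smarter ways, the loop can be replaced with vapply, which applies paste to each list element and guarantees one string per element. A minimal sketch on a made-up two-element list (genres_chr is a hypothetical name):

```r
# Hypothetical list column with the same shape as artist_genres
artist_genres <- list(c("pop", "rap", "trap music"),
                      c("pop", "rap"))

# paste(..., collapse = ", ") is applied to each element;
# character(1) asserts that every result is a single string
genres_chr <- vapply(artist_genres, paste, character(1), collapse = ", ")

print(genres_chr)
```

The result is a plain character vector, so it could be assigned back to the data frame column if a list column is no longer needed.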
Can anyone suggest how to read the Avg Price and Value columns in R from the given website?
I am not able to understand what is happening: with the same code I am able to read all the columns except these two.
The code I am using is:
library(rvest)
library(dplyr)
url="http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell"
url_html<-read_html(url)
SharesTraded_html=html_nodes(url_html,'td:nth-child(6)')
SharesTraded=html_text(SharesTraded_html)
SharesTraded=as.numeric(gsub(",",'',SharesTraded))
AvgPriceDollars_html=html_node(url_html,'td:nth-child(7)')
AvgPriceDollars=html_text(AvgPriceDollars_html)
AvgPriceDollars
The simplest way to do that is to use html_table:
library(rvest)
library(dplyr)
url <- read_html("http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell")
tb <- url %>%
html_node("#insidertab") %>%
html_nodes("table") %>%
html_table(fill = TRUE) %>%
as.data.frame()
str(tb)
'data.frame': 253 obs. of 9 variables:
$ Reported.Time: chr "2017-09-15 21:00:47" "2017-09-15 20:11:26" "2017-09-15 20:11:26" "2017-09-15 20:10:27" ...
$ Tran. : chr "2017-09-12 Purchase" "2017-09-13 Sale" "2017-09-14 Sale" "2017-09-15 Sale" ...
$ Company : chr "Double Eagle Acquisition Corp." "PHIBRO ANIMAL HEALTH CORP" "PHIBRO ANIMAL HEALTH CORP" "Guidewire Software, Inc." ...
$ Ticker : chr "EAGL" "PAHC" "PAHC" "GWRE" ...
$ Insider : chr "SAGANSKY JEFFREYChief Executive Officer, Director, 10% owner" "Johnson Richard GChief Financial Officer" "Johnson Richard GChief Financial Officer" "Roza ScottChief Business Officer" ...
$ Shares.Traded: chr "30,000" "15,900" "39,629" "782" ...
$ Avg.Price : chr "$10.05" "$36.46" "$36.23" "$78.20" ...
$ Value : chr "$301,500" "$579,714" "$1,435,758" "$61,152" ...
$ Filing : logi NA NA NA NA NA NA ...
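The columns come back as character, so to do any math on Shares.Traded, Avg.Price and Value you would still need to strip the "$" and "," characters. A sketch on a small hand-typed copy of the first two rows above (to_number and num_cols are hypothetical names):

```r
# Hand-typed copy of the first two rows shown in the str() output
tb <- data.frame(
  Shares.Traded = c("30,000", "15,900"),
  Avg.Price = c("$10.05", "$36.46"),
  Value = c("$301,500", "$579,714"),
  stringsAsFactors = FALSE
)

# Remove dollar signs and thousands separators, then convert
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

num_cols <- c("Shares.Traded", "Avg.Price", "Value")
tb[num_cols] <- lapply(tb[num_cols], to_number)

str(tb)
```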