Need help pulling JSON data with RSocrata from a website API - r

I need help drafting code that pulls public data directly from a website that is in Socrata format. Here is a link:
https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w
There is an API endpoint:
https://data.cityofchicago.org/resource/xzkq-xp2w.json
After the data is uploaded, null values in the "Annual Salary" should be replaced with 50000.

We can use the RSocrata package
library(RSocrata)
url <- "https://data.cityofchicago.org/resource/xzkq-xp2w.json"
data <- RSocrata::read.socrata(url)
head(data)
# name job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate
#1 AARON, JEFFERY M SERGEANT POLICE F Salary 111444 <NA> <NA>
#2 AARON, KARINA POLICE OFFICER (ASSIGNED AS DETECTIVE) POLICE F Salary 94122 <NA> <NA>
#3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER DAIS F Salary 118608 <NA> <NA>
#4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT F Salary 117072 <NA> <NA>
#5 ABARCA, FRANCES J POLICE OFFICER POLICE F Salary 48078 <NA> <NA>
The following will replace the NAs in annual_salary with 50000.
data[is.na(data$annual_salary),"annual_salary"] <- 50000
However, if you'd like to do what it suggests on the city of Chicago website, you could consider multipling typical_hours with hourly_rate to estimate salary.
ind <- is.na(data$annual_salary)
data[ind,]$annual_salary <- as.numeric(data[ind,]$typical_hours) * as.numeric(data[ind,]$hourly_rate) * 52

Related

Couldn't get tq_exchange() or stockSymbols() to work

I am trying to get stock symbols with these functions (both failed)
TTR::stockSymbols("AMEX")
Error in symbols[, sort.by] : incorrect number of dimensions
tidyquant::tq_exchange("AMEX")
Getting data...
Error: Can't rename columns that don't exist.
x Column Symbol doesn't exist.
Do these functions work for you? What fixes do you know to correct them? Thank you!
I get the same error. It seems like there has been some changes in the website from which these packages get the information. There is an open issue about this.
In the same thread it is mentioned that you can get the information from the underlying JSON which returns this information.
tmp <- jsonlite::fromJSON('https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0&exchange=AMEX&download=true')
head(tmp$data$rows)
# symbol name
#1 AAMC Altisource Asset Management Corp Com
#2 AAU Almaden Minerals Ltd. Common Shares
#3 ACU Acme United Corporation. Common Stock
#4 ACY AeroCentury Corp. Common Stock
#5 AE Adams Resources & Energy Inc. Common Stock
#6 AEF Aberdeen Emerging Markets Equity Income Fund Inc. Common Stock
# lastsale netchange pctchange volume marketCap country ipoyear
#1 $24.60 -0.3595 -1.44% 15183 40595215.00 United States
#2 $0.846 0.0359 4.432% 2272603 101984125.00 Canada 2015
#3 $33.82 0.61 1.837% 7869 112922038.00 United States 1988
#4 $11.76 2.01 20.615% 739133 18179596.00 United States
#5 $28.31 0.11 0.39% 6217 120099060.00 United States
#6 $9.10 0.09 0.999% 40775 461841180.00 United States
# industry sector
#1 Real Estate Finance
#2 Precious Metals Basic Industries
#3 Industrial Machinery/Components Capital Goods
#4 Diversified Commercial Services Technology
#5 Oil Refining/Marketing Energy
#6
# url
#1 /market-activity/stocks/aamc
#2 /market-activity/stocks/aau
#3 /market-activity/stocks/acu
#4 /market-activity/stocks/acy
#5 /market-activity/stocks/ae
#6 /market-activity/stocks/aef

converting as factor for a list of data frames

I am trying to create a custom function to give labels to modified list of data frames. For example, I have a data frame like below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: You want to take a list of data frames (ldat). For each of the dataframes in the list (df, df2) you want to take some existing columns (Q1, Q2, Q3) and replicate them with new names in the same data frame (Q1_new, Q2_new, Q3_new). This you could achieve like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
gender country Q1 Q2 Q3 Q4 Q5 cc2 Q1_new Q2_new Q3_new
1 Male USA Yes Available Partner Depends on sales Local 1 Yes Available Partner
2 female USA Yes Available Partner <NA> Overseas NA Yes Available Partner
3 Male CAN <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
4 female EU <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
5 Male UK <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
6 female BR <NA> <NA> <NA> <NA> Overseas NA <NA> <NA> <NA>
Is this what you had in mind?

Find the second highest value in data frame in R

Hello for below data frame in R, may I know the simplest command (without using any additional library like deplyr) how to find the second highest salary and store the name of the employee in a variable named 2nd_high_employee?
EmployeeID EmployeeName Department Salary
----------- --------------- --------------- ---------
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00
Next time you could consider to post a sample of your data using head(dput(x)), to ease SO members to read in your data.
df <- read.table(text = "
EmployeeID EmployeeName Department Salary
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00", header = T)
second_high_employee <- tail(sort(df$Salary),2)[1]
second_high_employee
[1] 50000
BTW, it is not possible to start an object name with a number. You could check: ?make.names
Also, for for each department you could do:
aggregate(Salary ~ Department, df, function(x) {tail(sort(x), 2)[1]})
Department Salary
1 Back-Office 15000
2 Finance 25000
3 IT 50000
In case there had been 2 top salaries of 80000 and you had wanted to find the second highest of 50000 again, you could have wrapped x or df$Salaray inside tail(sort(unique()), 2)[1]
Using Base R: Finding the 2nd highest salary:
if you need the subset without taking into consideration the department:
subset(dat,sort(z<-rank(Salary),T)[2]==z)
EmployeeID EmployeeName Department Salary
7 J Miller IT 50000
8 L Lewis IT 50000
if taking into consideration the department:
unsplit(by(dat,dat$Department,function(x)subset(x,(y<-rank(Salary))==sort(y,T)[2])),rep(1:3,each=2))
EmployeeID EmployeeName Department Salary
10 S Martin Back-Office 15000
11 J Garcia Back-Office 15000
2 D Michael Finance 25000
3 A Smith Finance 25000
7 J Miller IT 50000
8 L Lewis IT 50000
Just for the employee name:
as.character(subset(dat,sort(z<-rank(Salary),T)[2]==z)[,2])
[1] "Miller" "Lewis"

Matching data from multiple columns in r

I have two datasets:
Contacts2: This contains a list of ~100,000 contacts, their respective titles and a set of columns which describes the types of work contacts could be involved in. Here's an example dataset:
First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA
Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))
First Last Title Finance Executive Communications
1 George Washington CEO <NA> <NA> <NA>
2 Thomas Jefferson Accountant <NA> <NA> <NA>
3 James Madison Communications Specialist <NA> <NA> <NA>
4 Jimmy Carter President <NA> <NA> <NA>
5 Howard Taft Accountant <NA> <NA> <NA>
6 Herbert Hoover CFO <NA> <NA> <NA>
Note the last three columns are numeric.
TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns in which describes the type of work the contacts could be involved in. For each title I've put an 1 in the column(s) of the roles that describe that person's job.
Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))
Title Finance Executive Communications
1 CEO <NA> 1 <NA>
2 Accountant 1 <NA> <NA>
3 Communications Specialist <NA> <NA> 1
4 President 1 <NA> <NA>
5 CFO 1 1 <NA>
Note the last three columns are numeric.
I'm now trying to match the check boxes in TableOfTitle in Contacts2 based on the contact title field. For example, since TableOfTitle shows anyone with the title of CFO should have an x in the Finance and Executive field, the record for Herbert Hoover in Contacts2 should also have 1s in those columns as well.
Here's a solution that uses dplyr. It is essentially what some commenters have already recommended, except that this fulfills the request of not copying over any pre-existing data in the last 3 columns of Contacts2.
Note that ifelse() can be very slow with large datasets, but for your stated task this shouldn't really be noticeable. Algorithmically, this solution is also a bit clumsy in other ways, but I went for maximum readability here.
Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
transmute(First = First,
Last = Last,
Title = Title,
Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))
Example output:
First Last Title Finance Executive Communications
George Washington CEO <NA> 1 <NA>
Thomas Jefferson Accountant 1 <NA> <NA>
James Madison Communications Specialist <NA> <NA> 1
Jimmy Carter President 1 <NA> <NA>
Howard Taft Accountant 1 <NA> <NA>
Herbert Hoover CFO 1 1 <NA>

R, create new column that consists of 1st column or if condition is met, a value from the 2nd/3rd column

a b c d
1 boiler maker <NA> <NA>
2 clerk assistant <NA> <NA>
3 senior machine setter <NA>
4 operated <NA> <NA> <NA>
5 consultant legal <NA> <NA>
How do I create a new column that takes the value in column 'a' unless any of the other columns contain either legal or assistant in which case it takes that value?
Here is a base-R solution. We use apply and any to test every column at once.
df$col <- as.character(df$a)
df$col[apply(df == "Legal",1,any)] <- "Legal"
df$col[apply(df == "assistant",1,any)] <- "assistant"
Try this:
library("dplyr")
df %>%
mutate(new=ifelse(b=="Legal" | c=="Legal" | d=="Legal", "Legal",
ifelse(b=="assistant" | c=="assistant" | d=="assistant", "assistant",
as.character(a))))
as.character is need if values where factors. If not, it's unnecessary.
A base R alternative of #scoa's answer:
indx <- apply(mydf == "Legal",1,any) + apply(mydf == "assistant",1,any)*2 + 1L
mydf$col <- c("a","Legal","Assistent")[indx]
or in one go:
mydf$col <- c("a","Legal","Assistent")[apply(mydf == "Legal",1,any) + apply(mydf == "assistant",1,any)*2 + 1L]
which gives:
> mydf
a b c d col
1 boiler maker <NA> <NA> a
2 clerk assistant <NA> <NA> Assistent
3 senior machine setter <NA> a
4 operated <NA> <NA> <NA> a
5 consultant Legal <NA> <NA> Legal

Resources