Find the second highest value in data frame in R - r

Hello, for the data frame below in R, what is the simplest command (without using any additional library such as dplyr) to find the second highest salary and store the name of the employee in a variable named 2nd_high_employee?
EmployeeID EmployeeName Department Salary
----------- --------------- --------------- ---------
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00

Next time, consider posting a sample of your data using dput(head(x)), which makes it easier for SO members to read in your data.
df <- read.table(text = "
EmployeeID EmployeeName Department Salary
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00", header = T)
second_high_employee <- tail(sort(df$Salary),2)[1]
second_high_employee
[1] 50000
BTW, a syntactic object name cannot start with a number (2nd_high_employee would only work if quoted in backticks). You could check: ?make.names
Also, for each department you could do:
aggregate(Salary ~ Department, df, function(x) {tail(sort(x), 2)[1]})
Department Salary
1 Back-Office 15000
2 Finance 25000
3 IT 50000
If there had been two top salaries of 80000 and you still wanted the second highest to be 50000, you could wrap x or df$Salary in unique(), i.e. tail(sort(unique(x)), 2)[1].
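Note that the snippet above stores the salary, while the question asks for the employee name. A minimal sketch of one way to get the name(s), assuming a small hand-built data frame with a single EmployeeName column (the column layout here is an assumption for illustration, not the result of the read.table call above):

```r
# Toy data: a subset of the employees from the question
df <- data.frame(
  EmployeeName = c("T Cook", "D Michael", "A Smith", "D Adams",
                   "M Williams", "D Jones", "J Miller", "L Lewis"),
  Salary = c(40000, 25000, 25000, 15000, 80000, 40000, 50000, 50000),
  stringsAsFactors = FALSE
)

# Second highest distinct salary (unique() guards against ties at the top)
second_salary <- sort(unique(df$Salary), decreasing = TRUE)[2]

# All employees earning that salary; ties yield more than one name
second_high_employee <- df$EmployeeName[df$Salary == second_salary]
second_high_employee
# [1] "J Miller" "L Lewis"
```

Because two employees share the second highest salary, the result is a vector of length two rather than a single name.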

Using base R to find the 2nd highest salary (dat is the data frame from the question):
If you need the subset without taking the department into consideration:
subset(dat, sort(z <- rank(Salary), TRUE)[2] == z)
EmployeeID EmployeeName Department Salary
7 J Miller IT 50000
8 L Lewis IT 50000
If taking the department into consideration:
unsplit(by(dat, dat$Department, function(x) subset(x, (y <- rank(Salary)) == sort(y, TRUE)[2])), rep(1:3, each = 2))
EmployeeID EmployeeName Department Salary
10 S Martin Back-Office 15000
11 J Garcia Back-Office 15000
2 D Michael Finance 25000
3 A Smith Finance 25000
7 J Miller IT 50000
8 L Lewis IT 50000
Just for the employee name:
as.character(subset(dat, sort(z <- rank(Salary), TRUE)[2] == z)[, 2])
[1] "Miller" "Lewis"

Related

Need help pulling JSON data with RSocrata from a website API

I need help drafting code that pulls public data directly from a website that is in Socrata format. Here is a link:
https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w
There is an API endpoint:
https://data.cityofchicago.org/resource/xzkq-xp2w.json
After the data is loaded, null values in the "Annual Salary" column should be replaced with 50000.
We can use the RSocrata package:
library(RSocrata)
url <- "https://data.cityofchicago.org/resource/xzkq-xp2w.json"
data <- RSocrata::read.socrata(url)
head(data)
# name job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate
#1 AARON, JEFFERY M SERGEANT POLICE F Salary 111444 <NA> <NA>
#2 AARON, KARINA POLICE OFFICER (ASSIGNED AS DETECTIVE) POLICE F Salary 94122 <NA> <NA>
#3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER DAIS F Salary 118608 <NA> <NA>
#4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT F Salary 117072 <NA> <NA>
#5 ABARCA, FRANCES J POLICE OFFICER POLICE F Salary 48078 <NA> <NA>
The following will replace the NAs in annual_salary with 50000.
data[is.na(data$annual_salary),"annual_salary"] <- 50000
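However, if you'd like to follow the suggestion on the City of Chicago website, you could instead multiply typical_hours by hourly_rate to estimate the salary. Run this in place of the replacement above, since it relies on the NAs still being present; the factor of 52 treats typical_hours as weekly hours.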
ind <- is.na(data$annual_salary)
data[ind,]$annual_salary <- as.numeric(data[ind,]$typical_hours) * as.numeric(data[ind,]$hourly_rate) * 52
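The two replacements can also be combined. Below is a rough offline sketch on a made-up three-row data frame (the values are invented for illustration and the columns are character, as they often are after reading JSON): estimate the salary from hours and rate where possible, then fall back to 50000 for whatever is still missing.

```r
# Hypothetical stand-in for the downloaded data (values are made up)
data <- data.frame(
  annual_salary = c("111444", NA, NA),
  typical_hours = c(NA, "40", NA),
  hourly_rate   = c(NA, "20.5", NA),
  stringsAsFactors = FALSE
)

# Work with numbers rather than character strings
data$annual_salary <- as.numeric(data$annual_salary)

# Step 1: estimate from weekly hours * hourly rate * 52 weeks
ind <- is.na(data$annual_salary)
data$annual_salary[ind] <- as.numeric(data$typical_hours[ind]) *
  as.numeric(data$hourly_rate[ind]) * 52

# Step 2: anything still missing gets the flat default
data$annual_salary[is.na(data$annual_salary)] <- 50000

data$annual_salary
# [1] 111444  42640  50000
```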

How to convert a n x 3 data frame into a square (ordered) matrix?

I need to reshape a table (or data frame) to be able to use an R package (NetworkRiskMetrics). Suppose I have a data frame of lenders, borrowers and loan values:
lender borrower loan_USD
John Mark 100
Mark Paul 45
Joe Paul 30
Dan Mark 120
How do I convert this data frame into:
John Mark Joe Dan Paul
John
Mark
Joe
Dan
Paul
(placing zeros in empty cells)?
Thanks.
Use the reshape function:
d <- data.frame(lander=c('a','b','c', 'a'), borower=c('m','p','m','p'), loan=c(10,20,15,12))
d
lander borower loan
1 a m 10
2 b p 20
3 c m 15
4 a p 12
reshape(data=d, direction='long', varying=list('lander','borower'), idvar='loan', timevar='loan')
loan lander borower
10.1 1 a m
20.1 1 b p
15.1 1 c m
12.1 1 a p
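The reshape call above does not by itself give a square lender-by-borrower matrix. As an alternative sketch in base R (a different technique from the answer above, using the data from the question), one could build a zero matrix over all names and fill it by name-based matrix indexing:

```r
loans <- data.frame(
  lender   = c("John", "Mark", "Joe", "Dan"),
  borrower = c("Mark", "Paul", "Paul", "Mark"),
  loan_USD = c(100, 45, 30, 120),
  stringsAsFactors = FALSE
)

# Everyone who appears as a lender or a borrower, in order of first appearance
people <- unique(c(loans$lender, loans$borrower))

# Square matrix of zeros with people as both row and column names
m <- matrix(0, nrow = length(people), ncol = length(people),
            dimnames = list(people, people))

# A two-column character matrix indexes cells by (row name, column name)
m[cbind(loans$lender, loans$borrower)] <- loans$loan_USD
m
```

Rows are lenders and columns are borrowers. If the same lender/borrower pair can occur more than once, this assignment keeps only the last value; summing with xtabs(loan_USD ~ lender + borrower, loans) after setting both columns to factors with the same levels would be one way around that.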

How can I create a term matrix that sums numeric values associated to each document?

I'm a bit new to R and tm so struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.
Description Sold Items
Mrs White London UK 10
Mr Wolf London UK 20
Tania Maier Berlin Germany 10
Thomas Germany 30
Nick Forest Leeds UK 20
Silvio Verdi Italy Torino 10
Tom Cardiff UK 10
Mary House London 5
Using the tm package and DocumentTermMatrix, I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).
UK London Germany … Mary
Frequency 4 3 2 … 1
However, I would also like to sum the total amount of sold items.
The desired output should be:
UK London Germany … Mary
Frequency 4 3 2 … 1
Sum of Sold Items 60 35 40 … 5
How can I get to this result?
Assuming you can get to the stage where you have the Frequency table:
UK London Germany … Mary
Frequency 4 3 2 … 1
and can extract the words, you can use an apply function with grep. Here I create a vector representing the dictionary extracted from your frequency table:
S_data<-read.csv("data.csv",stringsAsFactors = F)
Words<-c("UK","London","Germany","Mary")
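Then use this in an apply as follows (assuming the columns are named Description and Items). This could be done more efficiently, but you will get the idea:
# simplify = FALSE keeps a named list even when every word matches the same number of rows
string_rows <- sapply(Words, function(x) grep(x, S_data$Description), simplify = FALSE)
string_sum <- sapply(string_rows, function(x) sum(S_data$Items[x]))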
> string_sum
UK London Germany Mary
60 35 40 5
Then just bind this onto your frequency table, e.g. with rbind.
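For a version that runs without data.csv, here is a small sketch that builds the data frame inline and computes both rows in one go. The column names Description and Items and the whole-word matching via \\b are assumptions of this sketch, not something the tm output dictates:

```r
S_data <- data.frame(
  Description = c("Mrs White London UK", "Mr Wolf London UK",
                  "Tania Maier Berlin Germany", "Thomas Germany",
                  "Nick Forest Leeds UK", "Silvio Verdi Italy Torino",
                  "Tom Cardiff UK", "Mary House London"),
  Items = c(10, 20, 10, 30, 20, 10, 10, 5),
  stringsAsFactors = FALSE
)
Words <- c("UK", "London", "Germany", "Mary")

# Logical matrix: one row per document, one column per word
hits <- sapply(Words, function(w) {
  grepl(paste0("\\b", w, "\\b"), S_data$Description, perl = TRUE)
})

# Column sums give the frequency; weighting each row by Items gives the totals
result <- rbind(
  Frequency           = colSums(hits),
  `Sum of Sold Items` = colSums(hits * S_data$Items)
)
result
#                   UK London Germany Mary
# Frequency          4      3       2    1
# Sum of Sold Items 60     35      40    5
```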

Re-Populate column in a relational data frame after randomization in R

I have a data frame of individuals and their spouses with some personal information (i.e. last names) that I have randomized with plyr::mapvalues in order to protect identities. Here is a reproducible example of how it looked before and after changing the surnames:
# before
d <- data.frame(id = c(1:6),
first_name = c('Jeff', 'Marilyn', 'Gwyn',
'Alice', 'Sam', 'Sarah'),
surname = c('Goldbloom', 'Monroe', 'Paltrow', 'Goldbloom',
'Smith', 'Silverman'),
spouse_id = c(2, 1, 1, 5, 4, "NA"),
spouse = c('Marilyn Monroe', 'Jeff Goldbloom', 'Jeff Goldbloom',
'Sam Smith', 'Alice Goldbloom', 'NA'))
d
> id first_name surname spouse_id spouse
1 Jeff Goldbloom 2 Marilyn Monroe
2 Marilyn Monroe 1 Jeff Goldbloom
3 Gwyn Paltrow 1 Jeff Goldbloom
4 Alice Goldbloom 5 Sam Smith
5 Sam Smith 4 Alice Goldbloom
6 Sarah Silverman NA NA
# replacement names to serve as surnames (doesn't matter what they are, just
# that the ratios remain the same as before; mapvalues takes care of this)
repnames <- c("Arman" , "Clovis" , "Garner" , "Casey" , "Birch")
s <- unique(d$surname)
d$surname <- plyr::mapvalues(d$surname, from = s, to = repnames) #replace surnames
# After replacement, the dataframe looks like:
d
> id first_name surname spouse_id spouse
1 Jeff Arman 2 Marilyn Monroe
2 Marilyn Clovis 1 Jeff Goldbloom
3 Gwyn Garner 1 Jeff Goldbloom
4 Alice Arman 5 Sam Smith
5 Sam Casey 4 Alice Goldbloom
6 Sarah Birch NA NA
Each person has his or her own id number, but not all people have spouses. If a person does have a spouse, their spouse's individual id is reflected in the spouse_id column. I did this so that I could filter individuals and their spouses separately later using something like dplyr::filter(d, spouse %in% spouse_id).
My question is, how can I use the relational id and spouse_id columns to re-populate the spouse column so that it reflects the new, randomized surnames? i.e. the final expected output would be:
id first_name surname spouse_id spouse
1 Jeff Arman 2 Marilyn Clovis
2 Marilyn Clovis 1 Jeff Arman
3 Gwyn Garner 1 Jeff Arman
4 Alice Arman 5 Sam Casey
5 Sam Casey 4 Alice Arman
6 Sarah Birch NA NA
...So some concatenation will be involved on the first_name and surname columns. I've never done something quite so conditional in R - in Excel I guess it would be nested VLOOKUP functions...
Thanks, sorry it's so specific but hopefully it presents a fun challenge to someone out there.
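Assuming that your NAs are actual NAs (so spouse_id is numeric rather than containing the string "NA") and that each id equals its row number, then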
d$spouse <- paste(d$first_name, d$surname)[d$spouse_id]
d$spouse
#[1] "Marilyn Clovis" "Jeff Arman" "Jeff Arman" "Sam Casey" "Alice Arman" NA
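If the ids are not guaranteed to equal the row numbers, a slightly more defensive sketch looks each spouse_id up in the id column with match(). The data frame below is the post-randomization example, rebuilt here with a real NA in spouse_id:

```r
d <- data.frame(
  id         = 1:6,
  first_name = c("Jeff", "Marilyn", "Gwyn", "Alice", "Sam", "Sarah"),
  surname    = c("Arman", "Clovis", "Garner", "Arman", "Casey", "Birch"),
  spouse_id  = c(2, 1, 1, 5, 4, NA),
  stringsAsFactors = FALSE
)

# Full names under the new surnames
full_name <- paste(d$first_name, d$surname)

# match() finds the row whose id equals each spouse_id; NA stays NA
d$spouse <- full_name[match(d$spouse_id, d$id)]
d$spouse
# [1] "Marilyn Clovis" "Jeff Arman"     "Jeff Arman"     "Sam Casey"
# [5] "Alice Arman"    NA
```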

How to find the second highest salary grouped by the business in r

What I want is for the output to contain, for each business, only the entry with the second highest salary.
For example:
customer_id name sales firstname lastname income business
6 Priyank Dwivedi 2 Priyank Dwivedi 650000 PES
4 Monika Maurya 3 Monika Maurya 200000 ITS
1 Rahul Ranjan 3 Rahul Ranjan 1000000 PES
7 Ambreen Khan 1 Ambreen Khan 800000 PES
3 P Paul 3 P Paul 500000 ITS
5 Sunny Tiwari 2 Sunny Tiwari 900000 Analytics
2 Mayank Agarwal 3 Mayank Agarwal 300000 PES
8 Shashank Rawat 1 Shashank Rawat 100000 Analytics
What I want as output is:
customer_id name sales firstname lastname income business
4 Monika Maurya 3 Monika Maurya 200000 ITS
8 Shashank Rawat 1 Shashank Rawat 100000 Analytics
7 Ambreen Khan 1 Ambreen Khan 800000 PES
That is, the second highest salary from each business.
One solution might be:
res <- t(sapply(unique(data[, "business"]),
                function(x, data) {
                  # rows belonging to this business
                  d <- data[x == data[, "business"], ]
                  # order by income (descending) and take the second row
                  d[order(d[, "income"], decreasing = TRUE)[2], ]
                }, data = data))
res
where data is your data frame.
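A variant that returns an ordinary data frame instead of a matrix of list cells is sketched below with split() and do.call(rbind, ...). It rebuilds only some of the columns from the question for brevity:

```r
data <- data.frame(
  customer_id = c(6, 4, 1, 7, 3, 5, 2, 8),
  name = c("Priyank Dwivedi", "Monika Maurya", "Rahul Ranjan", "Ambreen Khan",
           "P Paul", "Sunny Tiwari", "Mayank Agarwal", "Shashank Rawat"),
  income = c(650000, 200000, 1000000, 800000, 500000, 900000, 300000, 100000),
  business = c("PES", "ITS", "PES", "PES", "ITS", "Analytics", "PES", "Analytics"),
  stringsAsFactors = FALSE
)

# One data frame per business, then keep the row with the second highest income
second_by_business <- do.call(rbind, lapply(split(data, data$business), function(d) {
  d[order(d$income, decreasing = TRUE)[2], ]
}))
second_by_business$name
# [1] "Shashank Rawat" "Monika Maurya"  "Ambreen Khan"
```

The groups come back in alphabetical order of business (Analytics, ITS, PES). A business with only one row would produce a row of NAs, and ties are not treated specially, so this picks the second row after sorting rather than the second distinct income.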

Resources