I am trying to create a custom function that gives labels to a modified list of data frames. For example, I have a data frame like the one below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: you want to take a list of data frames (ldat). For each data frame in the list (df, df2), you want to take some existing columns (Q1, Q2, Q3) and replicate them under new names in the same data frame (Q1_new, Q2_new, Q3_new). You could achieve that like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
gender country Q1 Q2 Q3 Q4 Q5 cc2 Q1_new Q2_new Q3_new
1 Male USA Yes Available Partner Depends on sales Local 1 Yes Available Partner
2 female USA Yes Available Partner <NA> Overseas NA Yes Available Partner
3 Male CAN <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
4 female EU <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
5 Male UK <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
6 female BR <NA> <NA> <NA> <NA> Overseas NA <NA> <NA> <NA>
Is this what you had in mind?
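A minimal, self-contained sketch of the same idea, assuming two small made-up data frames in place of the real ldat (the element names ALL and SUB and all values are illustrative only):

```r
# Hypothetical stand-in for ldat: a named list of data frames
ldat <- list(
  ALL = data.frame(Q1 = c(1, NA, 1), Q2 = c(1, 1, NA), Q3 = c(NA, 1, 1)),
  SUB = data.frame(Q1 = c(NA, 1), Q2 = c(1, 1), Q3 = c(1, NA))
)

variables <- c("Q1", "Q2", "Q3")
new_label <- paste0(variables, "_new")  # "Q1_new" "Q2_new" "Q3_new"

# Copy the selected columns under the new names in every data frame
newdfs <- lapply(ldat, function(x) {
  x[new_label] <- x[variables]
  x
})

names(newdfs$ALL)
```

Building new_label with paste0() keeps the old and new names in step if the list of variables changes.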
Hello, for the data frame below in R, what is the simplest command (without using any additional library such as dplyr) to find the second highest salary and store the name of the employee in a variable named 2nd_high_employee?
EmployeeID EmployeeName Department Salary
----------- --------------- --------------- ---------
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00
Next time, consider posting a sample of your data using dput(head(x)), to make it easier for SO members to read in your data.
df <- read.table(text = "
EmployeeID EmployeeName Department Salary
1 T Cook Finance 40000.00
2 D Michael Finance 25000.00
3 A Smith Finance 25000.00
4 D Adams Finance 15000.00
5 M Williams IT 80000.00
6 D Jones IT 40000.00
7 J Miller IT 50000.00
8 L Lewis IT 50000.00
9 A Anderson Back-Office 25000.00
10 S Martin Back-Office 15000.00
11 J Garcia Back-Office 15000.00
12 T Clerk Back-Office 10000.00", header = TRUE)
second_high_employee <- tail(sort(df$Salary),2)[1]
second_high_employee
[1] 50000
BTW, it is not possible to start an object name with a number. You could check: ?make.names
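Note that the value stored above is the salary, not the employee's name. A rough sketch of getting the name(s) as well, assuming the data is built with data.frame() so that each column holds what its header says (the object names emp and second_salary are made up for illustration, and only a few rows are included):

```r
# Small illustrative subset of the table, built column by column
emp <- data.frame(
  EmployeeName = c("T Cook", "M Williams", "D Jones", "J Miller", "L Lewis"),
  Salary = c(40000, 80000, 40000, 50000, 50000),
  stringsAsFactors = FALSE
)

# Second value from the top of the sorted salaries
second_salary <- tail(sort(emp$Salary), 2)[1]

# Names of everyone earning exactly that salary (there may be ties)
second_high_employee <- emp$EmployeeName[emp$Salary == second_salary]
second_high_employee
# [1] "J Miller" "L Lewis"

# A name starting with a digit is not syntactically valid;
# make.names() shows what R would turn it into
make.names("2nd_high_employee")
# [1] "X2nd_high_employee"
```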
Also, for each department you could do:
aggregate(Salary ~ Department, df, function(x) {tail(sort(x), 2)[1]})
Department Salary
1 Back-Office 15000
2 Finance 25000
3 IT 50000
If there had been two top salaries of 80000 and you still wanted the second highest distinct salary of 50000, you could wrap x or df$Salary in unique(): tail(sort(unique(x)), 2)[1]
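The difference that unique() makes shows up on a small made-up vector with a tied top salary:

```r
x <- c(80000, 80000, 50000, 40000)  # illustrative values only

# Without unique(), the tied top value is counted twice
without_unique <- tail(sort(x), 2)[1]
without_unique
# [1] 80000

# With unique(), ties collapse and the second distinct value is returned
with_unique <- tail(sort(unique(x)), 2)[1]
with_unique
# [1] 50000
```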
Using Base R: Finding the 2nd highest salary:
if you need the subset without taking into consideration the department:
subset(dat,sort(z<-rank(Salary),T)[2]==z)
EmployeeID EmployeeName Department Salary
7 J Miller IT 50000
8 L Lewis IT 50000
if taking into consideration the department:
unsplit(by(dat,dat$Department,function(x)subset(x,(y<-rank(Salary))==sort(y,T)[2])),rep(1:3,each=2))
EmployeeID EmployeeName Department Salary
10 S Martin Back-Office 15000
11 J Garcia Back-Office 15000
2 D Michael Finance 25000
3 A Smith Finance 25000
7 J Miller IT 50000
8 L Lewis IT 50000
Just for the employee name:
as.character(subset(dat,sort(z<-rank(Salary),T)[2]==z)[,2])
[1] "Miller" "Lewis"
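The rank-based approach relies on how rank() treats ties: by default, tied values share the average of their ranks, so both 50000 salaries get the same rank and are selected together. A small sketch with made-up salaries:

```r
salary <- c(40000, 50000, 50000, 80000)  # illustrative values only

r <- rank(salary)                        # ties share the average rank
r
# [1] 1.0 2.5 2.5 4.0

second <- sort(r, decreasing = TRUE)[2]  # second-largest rank value
salary[r == second]
# [1] 50000 50000
```

If the highest salary itself is tied, the second element of the sorted ranks still belongs to the top salary, so this approach would then return the top earners.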
I have two datasets:
Contacts2: This contains a list of ~100,000 contacts, their respective titles and a set of columns which describes the types of work contacts could be involved in. Here's an example dataset:
First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA
Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))
First Last Title Finance Executive Communications
1 George Washington CEO <NA> <NA> <NA>
2 Thomas Jefferson Accountant <NA> <NA> <NA>
3 James Madison Communications Specialist <NA> <NA> <NA>
4 Jimmy Carter President <NA> <NA> <NA>
5 Howard Taft Accountant <NA> <NA> <NA>
6 Herbert Hoover CFO <NA> <NA> <NA>
Note the last three columns are numeric.
TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns, which describe the types of work the contacts could be involved in. For each title I've put a 1 in the column(s) of the roles that describe that person's job.
Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))
Title Finance Executive Communications
1 CEO <NA> 1 <NA>
2 Accountant 1 <NA> <NA>
3 Communications Specialist <NA> <NA> 1
4 President 1 <NA> <NA>
5 CFO 1 1 <NA>
Note the last three columns are numeric.
I'm now trying to copy the check boxes from TableOfTitle into Contacts2 based on the contact's title field. For example, since TableOfTitle shows that anyone with the title of CFO should have a 1 in the Finance and Executive fields, the record for Herbert Hoover in Contacts2 should have 1s in those columns as well.
Here's a solution that uses dplyr. It is essentially what some commenters have already recommended, except that this fulfills the request of not copying over any pre-existing data in the last 3 columns of Contacts2.
Note that ifelse() can be very slow with large datasets, but for your stated task this shouldn't really be noticeable. Algorithmically, this solution is also a bit clumsy in other ways, but I went for maximum readability here.
library(dplyr)

Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
transmute(First = First,
Last = Last,
Title = Title,
Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))
Example output:
First Last Title Finance Executive Communications
George Washington CEO <NA> 1 <NA>
Thomas Jefferson Accountant 1 <NA> <NA>
James Madison Communications Specialist <NA> <NA> 1
Jimmy Carter President 1 <NA> <NA>
Howard Taft Accountant 1 <NA> <NA>
Herbert Hoover CFO 1 1 <NA>
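For comparison, a base R sketch of the same fill-in logic using match(), assuming the flag columns are stored as numeric. The data frames below are cut-down stand-ins, and the pre-existing 1 in Carter's Communications column is invented to show that existing values are kept:

```r
Contacts2 <- data.frame(
  Last = c("Washington", "Carter", "Hoover"),
  Title = c("CEO", "President", "CFO"),
  Finance = rep(NA_real_, 3),
  Executive = rep(NA_real_, 3),
  Communications = c(NA, 1, NA),  # hypothetical pre-existing value
  stringsAsFactors = FALSE
)

TableOfTitle <- data.frame(
  Title = c("CEO", "President", "CFO"),
  Finance = c(NA, 1, 1),
  Executive = c(1, NA, 1),
  Communications = rep(NA_real_, 3),
  stringsAsFactors = FALSE
)

# Row of TableOfTitle that corresponds to each contact's title
idx <- match(Contacts2$Title, TableOfTitle$Title)

for (col in c("Finance", "Executive", "Communications")) {
  empty <- is.na(Contacts2[[col]])  # only fill cells that are still NA
  Contacts2[[col]][empty] <- TableOfTitle[[col]][idx[empty]]
}

Contacts2
```

Titles that have no entry in TableOfTitle give NA in idx, so those contacts simply keep their NA values.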
a b c d
1 boiler maker <NA> <NA>
2 clerk assistant <NA> <NA>
3 senior machine setter <NA>
4 operated <NA> <NA> <NA>
5 consultant Legal <NA> <NA>
How do I create a new column that takes the value in column 'a' unless any of the other columns contains either Legal or assistant, in which case it takes that value?
Here is a base-R solution. We use apply and any to test every column at once.
df$col <- as.character(df$a)
# na.rm = TRUE makes rows without a match FALSE instead of NA
df$col[apply(df == "Legal", 1, any, na.rm = TRUE)] <- "Legal"
df$col[apply(df == "assistant", 1, any, na.rm = TRUE)] <- "assistant"
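One detail worth knowing about apply and any when the data contains NA: any() returns NA rather than FALSE if no value matches but some are missing, and na.rm = TRUE turns that into a plain FALSE. A small sketch on a made-up row:

```r
row <- c("boiler", "maker", NA, NA)  # illustrative row with missing values

any(row == "Legal")
# [1] NA

any(row == "Legal", na.rm = TRUE)
# [1] FALSE
```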
Try this:
library("dplyr")
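# %in% returns FALSE for NA, whereas == would return NA and propagate through ifelse()
df %>%
  mutate(new = ifelse(b %in% "Legal" | c %in% "Legal" | d %in% "Legal", "Legal",
               ifelse(b %in% "assistant" | c %in% "assistant" | d %in% "assistant", "assistant",
                      as.character(a))))
as.character is needed if the values are factors. If they are not, it is unnecessary.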
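A base R alternative to @scoa's answer:
# 1 = no match, 2 = "Legal", 3 = "assistant"; na.rm = TRUE keeps NA cells from turning the index into NA
indx <- apply(mydf == "Legal", 1, any, na.rm = TRUE) + apply(mydf == "assistant", 1, any, na.rm = TRUE)*2 + 1L
# per row, pick column indx from (value of a, "Legal", "assistant")
mydf$col <- cbind(as.character(mydf$a), "Legal", "assistant")[cbind(seq_along(indx), indx)]
or in one go:
mydf$col <- cbind(as.character(mydf$a), "Legal", "assistant")[cbind(seq_len(nrow(mydf)), apply(mydf == "Legal", 1, any, na.rm = TRUE) + apply(mydf == "assistant", 1, any, na.rm = TRUE)*2 + 1L)]
which gives:
> mydf
a b c d col
1 boiler maker <NA> <NA> boiler
2 clerk assistant <NA> <NA> assistant
3 senior machine setter <NA> senior
4 operated <NA> <NA> <NA> operated
5 consultant Legal <NA> <NA> Legal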