I have wide supervisory data where a single observation consists of a level 1 employee and their department all the way down to level 8. I use a loop with some other commands to produce a list of all employees and the departments beneath them in long format, so that I can see what departments employees are responsible for at every level. There may be a more elegant way to do this than a loop, but it works fine. Sample data (through level 3 for succinctness):
data <- tibble(
  LV1_Employee_Name = "Chuck", LV1_Employee_Nbr = "1", LV1_Department = "Tha Boss", LV1_Department_Nbr = "90",
  LV2_Employee_Name = c("Alex", "Alex", "Paul", "Paul", "Jennifer", "Jennifer"), LV2_Employee_Nbr = c("2", "2", "3", "3", "4", "4"), LV2_Department = c("Leadership", "Leadership", "Finance", "Finance", "Philanthropy", "Philanthropy"), LV2_Department_Nbr = c("91", "91", "92", "92", "93", "93"),
  LV3_Employee_Name = c("Dan", "Wendy", "Sarah", "Monique", "Miguel", "Brandon"), LV3_Employee_Nbr = c("2", "2", "3", "3", "4", "4"), LV3_Department = c("Analytics", "Pop Health", "Acounting", "Investments", "Yacht Aquisitions", "Golf Junkets"), LV3_Department_Nbr = c("94", "95", "96", "97", "98", "99")
)
The loop below first produces six tibbles named level1_1, level1_2, level1_3, level2_2, level2_3, and level3_3. Each tibble pairs an employee's name and number with a department at the same department level or below. The code then collects these tibbles with ls(), binds their rows with bind_rows(), applies distinct(), and I've got what I need.
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
for(i in 1:3){
  for(k in first_department:3){
    assign(paste0("level", i, "_", k),
           setNames(distinct(as_tibble(c(data[, paste0("LV", i, "_", "Employee_Name")],
                                         data[, paste0("LV", i, "_", "Employee_Nbr")],
                                         data[, paste0("LV", k, "_", "Department")],
                                         data[, paste0("LV", k, "_", "Department_Nbr")]))),
                    data_colnames))
  }
  first_department <- first_department + 1
}
employees_departments <- distinct(bind_rows(mget(ls(pattern = "^level")))) %>%
  filter(is.na(Department) == FALSE)
rm(list = ls(pattern = "^level"))
What I'd like to do is, rather than produce six intermediate tibbles, have the loop itself build a list. This will save me from cluttering the environment with a huge number of tibbles, which, I'm told, is not very "R-like".
Here is a revised version that stores the results in a list within your loop. It uses an index idx that is incremented on each pass through the loop. Afterwards, you can use bind_rows() on this list to get a complete result.
library(tidyverse)
idx <- 1
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
data_lst <- list()
for(i in 1:3){
  for(k in first_department:3){
    data_lst[[idx]] <- setNames(
      distinct(as_tibble(
        c(data[, paste0("LV", i, "_", "Employee_Name")],
          data[, paste0("LV", i, "_", "Employee_Nbr")],
          data[, paste0("LV", k, "_", "Department")],
          data[, paste0("LV", k, "_", "Department_Nbr")]))),
      data_colnames)
    idx <- idx + 1
  }
  first_department <- first_department + 1
}
distinct(bind_rows(data_lst)) %>%
  filter(!is.na(Department))
Output
Employee Employee_Id Department Department_Number
<chr> <chr> <chr> <chr>
1 Chuck 1 Tha Boss 90
2 Chuck 1 Leadership 91
3 Chuck 1 Finance 92
4 Chuck 1 Philanthropy 93
5 Chuck 1 Analytics 94
6 Chuck 1 Pop Health 95
7 Chuck 1 Acounting 96
8 Chuck 1 Investments 97
9 Chuck 1 Yacht Aquisitions 98
10 Chuck 1 Golf Junkets 99
11 Alex 2 Leadership 91
12 Paul 3 Finance 92
13 Jennifer 4 Philanthropy 93
14 Alex 2 Analytics 94
15 Alex 2 Pop Health 95
16 Paul 3 Acounting 96
17 Paul 3 Investments 97
18 Jennifer 4 Yacht Aquisitions 98
19 Jennifer 4 Golf Junkets 99
20 Dan 2 Analytics 94
21 Wendy 2 Pop Health 95
22 Sarah 3 Acounting 96
23 Monique 3 Investments 97
24 Miguel 4 Yacht Aquisitions 98
25 Brandon 4 Golf Junkets 99
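As a side note, since the question mentions there may be a more elegant way than a loop, here is a hedged sketch of a loop-free version using purrr and tidyselect. It assumes the same data and column-naming scheme as above; the pairs helper and its column names i and k are illustrative only.
library(tidyverse)

# every (employee level, department level) pair where the department
# level is at or below the employee level -- the same pairs the loop builds
pairs <- expand_grid(i = 1:3, k = 1:3) %>% filter(k >= i)

employees_departments <- pairs %>%
  pmap(function(i, k) {
    data %>%
      select(Employee          = all_of(paste0("LV", i, "_Employee_Name")),
             Employee_Id       = all_of(paste0("LV", i, "_Employee_Nbr")),
             Department        = all_of(paste0("LV", k, "_Department")),
             Department_Number = all_of(paste0("LV", k, "_Department_Nbr")))
  }) %>%
  bind_rows() %>%
  distinct() %>%
  filter(!is.na(Department))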
My (simplified) dataset consists of donor occupations and contribution amounts. I'm trying to determine the average contribution amount by occupation (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements: one to get the sum of contribution amounts for each occupation and another to get a count of the number of donations from that occupation. I then bind the data frames with cbind and create a new column with mutate, where I divide the sum by the count.
Data example:
contributor_occupation contribution_receipt_amount
1 LISTING COORDINATOR 5.00
2 NOT EMPLOYED 2.70
3 TEACHER 2.70
4 ELECTRICAL DESIGNER 2.00
5 STUDENT 50.00
6 SOFTWARE ENGINEER 10.00
7 TRUCK DRIVER 2.70
8 NOT EMPLOYED 50.00
9 CONTRACTOR 5.00
10 ENGINEER 6.00
11 FARMER 2.70
12 ARTIST 50.00
13 CIRCUS ARTIST 100.00
14 CIRCUS ARTIST 27.00
15 INFORMATION SECURITY ANALYST 2.00
16 LAWYER 5.00
occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))

occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column

occ_avg <- final_occ %>%
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total/n) %>%
  rename("Number of Donations" = n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))
This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:
avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
  arrange(desc(avg_donation_by_occupation))
Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?
Thanks for the help clearing up any confusion!
We can get the sum, the count, and the mean in a single summarise() call; n() gives the number of observations in the grouped data, not in the whole data frame. Because summarise() is evaluated once per group after group_by(), dividing by n() divides by that occupation's group size, which is exactly the count you computed separately before. According to ?context:
n() gives the current group size.
and ?mean:
mean - Generic function for the (trimmed) arithmetic mean.
which is basically the sum of the observations divided by the number of observations.
library(dplyr)
out <- b %>%
  group_by(Occupation = contributor_occupation) %>%
  summarise(`Total Donated` = sum(contribution_receipt_amount),
            `Number of Donations` = n(),
            `Average Donation` = mean(contribution_receipt_amount),
            # or
            # `Average Donation` = `Total Donated`/`Number of Donations`,
            .groups = 'drop') %>%
  arrange(desc(`Average Donation`))
Output
out
# A tibble: 14 × 4
Occupation `Total Donated` `Number of Donations` `Average Donation`
<chr> <dbl> <int> <dbl>
1 CIRCUS ARTIST 127 2 63.5
2 ARTIST 50 1 50
3 STUDENT 50 1 50
4 NOT EMPLOYED 52.7 2 26.4
5 SOFTWARE ENGINEER 10 1 10
6 ENGINEER 6 1 6
7 CONTRACTOR 5 1 5
8 LAWYER 5 1 5
9 LISTING COORDINATOR 5 1 5
10 FARMER 2.7 1 2.7
11 TEACHER 2.7 1 2.7
12 TRUCK DRIVER 2.7 1 2.7
13 ELECTRICAL DESIGNER 2 1 2
14 INFORMATION SECURITY ANALYST 2 1 2
Data
b <- structure(list(contributor_occupation = c("LISTING COORDINATOR",
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT",
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR",
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST",
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5,
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16"))
I want to identify the unmatched values in the Vendors data frame for each vendor. In other words, for each vendor, find the countries from the Countries data frame that do not appear with that vendor in Vendors.
I have a data frame (Vendors) that looks like this:
Vendor_ID  Vendor       Country_ID  Country
1          Burger King  2           USA
1          Burger King  3           France
1          Burger King  5           Brazil
1          Burger King  7           Turkey
2          McDonald's   5           Brazil
2          McDonald's   3           France
Vendors <- data.frame (
Vendor_ID = c("1", "1", "1", "1", "2", "2"),
Vendor = c("Burger King", "Burger King", "Burger King", "Burger King", "McDonald's", "McDonald's"),
Country_ID = c("2", "3", "5", "7", "5", "3"),
Country = c("USA", "France", "Brazil", "Turkey", "Brazil", "France"))
and I have another data frame (Countries) that looks like this:
Country_ID  Country
2           USA
3           France
5           Brazil
7           Turkey
Countries <- data.frame (Country_ID = c("2", "3", "5", "7"),
Country = c("USA", "France", "Brazil", "Turkey"))
Desired Output:
Vendor_ID  Vendor      Country_ID  Country
2          McDonald's  2           USA
2          McDonald's  7           Turkey
Can someone please tell me how this could be achieved in R? I tried subset() and anti_join() but the results were not correct.
In base R we could first split the data by vendor:
VenList <- split(df, df$Vendor)
and then, for each vendor, we can check which countries are missing and return them.
res <- lapply(VenList, function(x){
  # Identify missing country of vendors
  tmp1 <- df2[!(df2[, "Country"] %in% x[, "Country"]), ]
  # get vendor and vendor ID
  tmp2 <- x[1:nrow(tmp1), 1:2]
  # cbind
  if(nrow(tmp2) == nrow(tmp1)){
    cbind(tmp2, tmp1)
  }
})
# Which yields
res
# $BurgerKing
# NULL
#
# $`McDonald's`
# Vendor_ID Vendor Country_ID Country
# 5 2 McDonald's 2 USA
# 6 2 McDonald's 7 Turkey
# If you want it as one df you could then flatten to
do.call(rbind, res)
# Vendor_ID Vendor Country_ID Country
# McDonald's.5 2 McDonald's 2 USA
# McDonald's.6 2 McDonald's 7 Turkey
Data
df <- read.table(text = "1 BurgerKing 2 USA
1 BurgerKing 3 France
1 BurgerKing 5 Brazil
1 BurgerKing 7 Turkey
2 McDonald's 5 Brazil
2 McDonald's 3 France", col.names = c("Vendor_ID", "Vendor", "Country_ID", "Country"))
df2 <- read.table(text = "2 USA
3 France
5 Brazil
7 Turkey", col.names = c("Country_ID", "Country"))
Solution using expand.grid to create all possible Vendor/Country combinations (assuming that "Countries" has only one entry per country) and then using dplyr joins against "Vendors" to find the missing countries:
Edit: The last two lines (left_joins) are only needed to "translate" the ID columns into "text":
library(dplyr)
expand.grid(Vendor_ID = unique(Vendors$Vendor_ID), Country_ID = Countries$Country_ID) %>%
  left_join(Vendors) %>%
  filter(is.na(Vendor)) %>%
  select(Vendor_ID, Country_ID) %>%
  left_join(Countries) %>%
  left_join(unique(Vendors[, c("Vendor_ID", "Vendor")]))
Returns
Vendor_ID Country_ID Country Vendor
1 2 2 USA McDonald's
2 2 7 Turkey McDonald's
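Since the question mentions trying anti_join, here is a hedged sketch of how that can work here: build every vendor/country combination first, then remove the combinations that already exist (it assumes the Vendors and Countries objects defined above).
library(dplyr)
library(tidyr)

# all vendor/country pairs, minus the pairs already present in Vendors
crossing(distinct(Vendors, Vendor_ID, Vendor), Countries) %>%
  anti_join(Vendors, by = c("Vendor_ID", "Country_ID"))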
https://www.dropbox.com/s/prqiojwzpax339z/Test123.xlsx?dl=0
The link contains an xlsx file with one sheet recording a batsman's runs scored in each innings of a test match. Because a batsman gets to bat in two innings of a test match, pairs of rows share identical values in columns such as Opposition, Ground, Start.DateAscending, Match.Number, and Result.
Question: how can we combine the rows based on these matching values and create a new data frame with the merged rows?
Example: from the data shared through the link, I am taking the first two rows as a sample to show what I want to achieve; below is the text representation of the R object for this sample (produced with dput()):
structure(list(Runs = c("10", "27"), Mins = c("30", "93"), BF = c("19",
"65"), X4s = c("1", "4"), X6s = c("0", "0"), SR = c("52.63",
"41.53"), Pos = c("6", "6"), Dismissal = c("bowled", "caught"
), Inns = c(2, 4), Opposition = c("v England", "v England"),
Ground = c("Lord's", "Lord's"), Start.DateAscending = structure(c(648930600,
648930600), class = c("POSIXct", "POSIXt"), tzone = ""),
Match.Number = c("Test # 1148", "Test # 1148"), Result = c("Loss",
"Loss")), .Names = c("Runs", "Mins", "BF", "X4s", "X6s",
"SR", "Pos", "Dismissal", "Inns", "Opposition", "Ground", "Start.DateAscending",
"Match.Number", "Result"), row.names = 1:2, class = "data.frame")
The data frame created from the above block looks like this:
Runs Mins BF X4s X6s SR Pos Dismissal Inns Opposition Ground
1 10 30 19 1 0 52.63 6 bowled 2 v England Lord's
2 27 93 65 4 0 41.53 6 caught 4 v England Lord's
Start.DateAscending Match.Number Result
1 1990-07-26 Test # 1148 Loss
2 1990-07-26 Test # 1148 Loss
What I want to achieve is to sum up the Runs column based on the common column values such as Match.Number, Opposition, Ground, and Start.DateAscending.
I expect values like the following, stored in a new data frame:
Runs Opposition Ground Start.DateAscending Match.Number Result
1 37 v England Lord's 1990-07-26 Test # 1148 Loss
We subset the columns of interest and use aggregate() after converting 'Runs' from character to a numeric (integer) class:
df1$Runs <- as.integer(df1$Runs)
colsofinterest <- names(df1)[c(1, 10:ncol(df1))]
aggregate(Runs ~ ., df1[colsofinterest], sum)
# Opposition Ground Start.DateAscending Match.Number Result Runs
#1 v England Lord's 1990-07-26 Test # 1148 Loss 37
Or we can use tidyverse
colsofinterest2 <- names(df1)[10:ncol(df1)]
library(dplyr)
df1 %>%
  group_by_(.dots = colsofinterest2) %>%
  summarise(Runs = sum(Runs))
# A tibble: 1 x 6
# Groups: Opposition, Ground, Start.DateAscending, Match.Number [?]
# Opposition Ground Start.DateAscending Match.Number Result Runs
# <chr> <chr> <dttm> <chr> <chr> <int>
#1 v England Lord's 1990-07-26 Test # 1148 Loss 37
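Note that group_by_() is deprecated in current dplyr; a hedged sketch of the same summary with present-day verbs (assuming the same df1, with Runs already numeric, and the colsofinterest2 vector from above) would be:
library(dplyr)

# group by the shared match-level columns and total the runs
df1 %>%
  group_by(across(all_of(colsofinterest2))) %>%
  summarise(Runs = sum(Runs), .groups = "drop")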
I have a table with flight ids, arrivals, and departures:
> test
arrival departure flight_id
1 9 2233
2 8 1982
3 1 2164
4 9 2081
5 2130
6 2 2040
7 9 2030
8 2130
9 4 3169
10 6 2323
11 8 2130
12 2220
13 3169
14 9 2204
15 1 1910
16 2 837
17 1994
18 9 8 1994
19 1994
20 1994
21 9 1 2338
22 1 8 1981
23 9 2365
24 8 2231
25 9 2048
My objective is to count only the rows where arrival and departure are blank, and then to aggregate by flight_id. But there is a catch. I believe this cannot be done with table(), aggregate() or rle() because they do not account for breaks.
For example, only consecutive rows of a flight id where arrival == "" and departure == "" should be counted, and the count should start again from zero if a row for that flight id with a non-blank value occurs. NOTE: Other flight ids appearing in between don't matter; each flight id is treated separately, which is why flight 2130 gets a count of 2.
In other words, the resulting output from the test should look exactly like this:
output
flight_id count
1 2130 2
2 2220 1
3 3169 1
4 1994 1
5 1994 2
Notice that flight id 1994 occurs three times where arrival and departure are blank but that there is a break in between at row 18. Therefore, the flight id must be counted twice.
I have tried writing a for loop, but I get an error that there is a missing value where TRUE/FALSE is needed:
raw_data = test
unique_id = unique(raw_data$flight_id)
output<- data.frame("flight_id"= integer(0), "count" = integer(0), stringsAsFactors=FALSE)
for (flight_id in unique_id)
{
  oneflight <- raw_data[which(raw_data$flight_id == flight_id), ]
  if (nrow(oneflight) >= 1) {
    for (i in 2:nrow(oneflight)) {
      if (oneflight[i, "arrival"] == "" & oneflight[i, "departure"] == "") {
        new_row <- c(flight_id, sum(flight_id)[i])
        output[nrow(output) + 1, ] = new_row
      }
    }
  }
}
How could one improve the above code or could someone suggest a quicker method with dplyr for example? Here is a sample of the data:
> dput(test)
structure(list(arrival = c("", "", "1", "", "", "2", "9", "",
"", "6", "", "", "", "", "1", "", "", "9", "", "", "9", "1",
"9", "", "9"), departure = c("9", "8", "", "9", "", "", "", "",
"4", "", "8", "", "", "9", "", "2", "", "8", "", "", "1", "8",
"", "8", ""), flight_id = c(2233, 1982, 2164, 2081, 2130, 2040,
2030, 2130, 3169, 2323, 2130, 2220, 3169, 2204, 1910, 837, 1994,
1994, 1994, 1994, 2338, 1981, 2365, 2231, 2048)), .Names = c("arrival",
"departure", "flight_id"), row.names = c(NA, 25L), class = "data.frame")
A base R approach:
do.call("rbind", lapply(split(test, test$flight_id), function(x) {
o = rle(x[["arrival"]] == "" & x[["departure"]] == "")
data.frame(flight_id = rep(unique(x[["flight_id"]]), sum(o$values)),
count = o$lengths[o$values])
}))
#flight_id count
# 1994 1
# 1994 2
# 2130 2
# 2220 1
# 3169 1
We split the dataframe by flight_id and for every group we apply rle to find continuous empty rows in arrival and departure and return a dataframe with the flight_id and the number of continuous empty rows in the group.
If I understand your question, one trick you could use is to add a decimal to the flight_id, indicating a group.
For example, get an index vector of the blank rows:
i <- which(oneflight$arrival == "" & oneflight$departure == "")
Then take cumsum(c(1, diff(i) != 1)) / 100 (or divide by a sufficiently large power of ten), add it to the flight IDs, and you then have flight groups that can be counted with table(), as in the sketch below.
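A minimal sketch of that idea for a single flight's rows (oneflight as built in the question's loop; the 0.01-style group suffix is purely illustrative):
# index the blank rows, then number each consecutive run of indices
blank <- oneflight$arrival == "" & oneflight$departure == ""
i <- which(blank)
grp <- cumsum(c(1, diff(i) != 1)) / 100   # 0.01 for the first run, 0.02 for the next, ...
# flight_id plus the run suffix gives one label per consecutive blank run
table(oneflight$flight_id[i] + grp)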
Here's a solution using data.table:
library(data.table)
flights <- test$flight_id[test$arrival=="" & test$departure==""]
setDT(test)[flight_id %in% flights, grp := rleid(arrival == "", departure == "")][
  arrival == "" & departure == "", .(count = .N), .(flight_id, grp)]
# flight_id grp count
#1: 2130 1 2
#2: 2220 3 1
#3: 3169 3 1
#4: 1994 3 1
#5: 1994 5 2
Explanation:
First we obtain the flight_ids that have at least one record with empty arrival and departure values. Then we use this vector, flights, to subset your data and generate a run-length id column called "grp" based on arrival == "" and departure == "". Lastly we count the records (i.e. .N) where arrival == "" & departure == "", grouped by the columns flight_id and grp.
You can consequently drop the grp column if needed.
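Since the question also asks whether there is a quicker method with dplyr, here is a hedged sketch of one; the blank and grp helper columns are illustrative, and lag() restarts the run counter whenever a flight's blank status changes:
library(dplyr)

test %>%
  mutate(blank = arrival == "" & departure == "") %>%
  group_by(flight_id) %>%
  # a new run id whenever 'blank' flips within the same flight_id
  mutate(grp = cumsum(blank != lag(blank, default = first(blank)))) %>%
  filter(blank) %>%
  group_by(flight_id, grp) %>%
  summarise(count = n(), .groups = "drop") %>%
  select(flight_id, count)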
I would like some assistance, please, in my quest to select parts of a string in certain rows of an R data frame. I have mocked up some dummy data below (floyd) to illustrate.
The first data frame row has only one word per column (it's a number, yes, but I am treating all numbers as characters/words), but rows 2 to 4 have more than one word. I would like to select the number in each row/cell based on a position passed to it by the named vector cool_floyd_position.
# please NB need stringr installed for my solution attempt!
# some scenario data
floyd = data.frame(people = c("roger", "david", "rick", "nick"),
spec1 = c("1", "3 5 75 101", "3 65 85", "12 2"),
spec2 = c("45", "75 101 85 12", "45 65 8", "45 87" ),
spec3 = c("1", "3 5 75 101", "75 98 5", "65 32"))
# tweak my data
rownames(floyd) = floyd$people
floyd$people = NULL
# ppl of interest
cool_floyd = rownames(floyd)[2:4]
# ppl string position criteria
cool_floyd_position = c(2,3,1)
names(cool_floyd_position) = c("david", "rick", "nick")
# my solution attempt
for(i in 1:length(cool_floyd))
{
  select_ppl = cool_floyd[i]
  string_select = cool_floyd_position[i]
  floyd[row.names(floyd) == select_ppl, ] = apply(floyd[row.names(floyd) == select_ppl], 1,
                                                  function(x) unlist(stringr::str_split(x, " ")[string_select]))
}
I am attempting to get my floyd data frame to look like the following, where the second word is selected for all david columns, the third word for all rick columns, and the first word for all nick columns (roger's row just remains as is):
my_target_df = data.frame(people = c("roger", "david", "rick", "nick"),
spec1 = c("1", "5", "85", "12"),
spec2 = c("45", "101", "8", "45" ),
spec3 = c("1", "5", "5", "65"))
row.names(my_target_df) = my_target_df$people
my_target_df$people = NULL
Many thanks in advance!
Here is another option using mapply
library(stringr)
#convert the factor columns to character
floyd[] <- lapply(floyd, as.character)
#transpose the floyd, subset the columns, convert to data.frame
# use mapply to extract the `word` specified in the corresponding c1
#transpose and assign it back to the row in 'floyd'
floyd[names(c1), ] <- t(mapply(function(x, y) word(x, y),
                               as.data.frame(t(floyd)[, names(c1)], stringsAsFactors = FALSE), c1))
floyd
# spec1 spec2 spec3
#roger 1 45 1
#david 5 101 5
#rick 85 8 5
#nick 12 45 65
where
c1 <- cool_floyd_position #just to avoid typing
You can try a combination of sapply to iterate over the data frame and mapply to extract the nth word from each column, i.e.,
library(stringr)
df1 <- rbind(df[1,-1], sapply(df[-1,-1], function(i) mapply(word, i, cool_floyd_position)))
rownames(df1) <- df$people
df1
# spec1 spec2 spec3
#roger 1 45 1
#david 5 101 5
#rick 85 8 5
#nick 12 45 65
The only downside of this solution is that people are displayed as row names rather than as a single column. There are many ways to make it a column, e.g.:
df1$people <- rownames(df1)
rownames(df1) <- NULL
df1[c(ncol(df1), 1:(ncol(df1) - 1))]
# people spec1 spec2 spec3
#1 roger 1 45 1
#2 david 5 101 5
#3 rick 85 8 5
#4 nick 12 45 65
Tidyverse solution:
library(stringi) # you have this installed if you have stringr
library(tidyverse)
pick_pos <- function(who, x, lkp) {
  if (who %in% names(lkp)) {
    map_chr(x, ~stri_split_fixed(., " ")[[1]][lkp[[who]]])
  } else {
    x
  }
}
rownames_to_column(floyd, "people") %>%
mutate_all(funs(as.character)) %>% # necessary since you have factors
group_by(people) %>%
mutate_all(funs(pick_pos(people, ., cool_floyd_position))) %>%
data.frame() %>%
column_to_rownames("people")
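As an aside, funs() and mutate_all() are deprecated in current dplyr; a hedged sketch of the same pipeline with across() (reusing the pick_pos() helper above) might look like this:
library(tidyverse)

rownames_to_column(floyd, "people") %>%
  mutate(across(everything(), as.character)) %>%   # still needed if the columns are factors
  group_by(people) %>%
  # the grouping column 'people' is excluded from everything() automatically
  mutate(across(everything(), ~ pick_pos(people, .x, cool_floyd_position))) %>%
  ungroup() %>%
  as.data.frame() %>%
  column_to_rownames("people")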