How to select string pattern with conditions in loop [r]


I would like some assistance, please, in selecting parts of a string in certain rows of an R data frame. I have mocked up some dummy data below (floyd) to illustrate.
The first data frame row has only one word per column (it's a number, yes, but I am treating all numbers as characters/words), whereas rows 2 to 4 have more than one word. I would like to select the word in each cell of a row based on the position given for that row by the named vector cool_floyd_position.
# please NB need stringr installed for my solution attempt!
# some scenario data
floyd = data.frame(people = c("roger", "david", "rick", "nick"),
spec1 = c("1", "3 5 75 101", "3 65 85", "12 2"),
spec2 = c("45", "75 101 85 12", "45 65 8", "45 87" ),
spec3 = c("1", "3 5 75 101", "75 98 5", "65 32"))
# tweak my data
rownames(floyd) = floyd$people
floyd$people = NULL
# ppl of interest
cool_floyd = rownames(floyd)[2:4]
# ppl string position criteria
cool_floyd_position = c(2,3,1)
names(cool_floyd_position) = c("david", "rick", "nick")
# my solution attempt
for(i in 1:length(cool_floyd))
{
select_ppl = cool_floyd[i]
string_select = cool_floyd_position[i]
floyd[row.names(floyd) == select_ppl,] = apply(floyd[row.names(floyd) == select_ppl], 1,
function(x) unlist(stringr::str_split(x, " ")[string_select]))
}
I am attempting to get my floyd dataframe to look like the following, where the second word is selected for all david columns, the third word for all rick columns and the first word for all nick columns (roger columns have to just remain as is)
my_target_df = data.frame(people = c("roger", "david", "rick", "nick"),
spec1 = c("1", "5", "85", "12"),
spec2 = c("45", "101", "8", "45" ),
spec3 = c("1", "5", "5", "65"))
row.names(my_target_df) = my_target_df$people
my_target_df$people = NULL
Many thanks in advance!

Here is another option using mapply
library(stringr)
#convert the factor columns to character
floyd[] <- lapply(floyd, as.character)
#transpose the floyd, subset the columns, convert to data.frame
# use mapply to extract the `word` specified in the corresponding c1
#transpose and assign it back to the row in 'floyd'
floyd[names(c1),] <- t(mapply(function(x,y) word(x, y),
as.data.frame(t(floyd)[, names(c1)], stringsAsFactors=FALSE), c1))
floyd
# spec1 spec2 spec3
#roger 1 45 1
#david 5 101 5
#rick 85 8 5
#nick 12 45 65
where
c1 <- cool_floyd_position #just to avoid typing
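For comparison, the same row-wise selection can be sketched in base R without stringr. This is only an illustration, assuming the words are separated by single spaces; nth_word is a hypothetical helper name, not part of any package.

```r
# Dummy data from the question, with people already used as row names
floyd <- data.frame(
  spec1 = c("1", "3 5 75 101", "3 65 85", "12 2"),
  spec2 = c("45", "75 101 85 12", "45 65 8", "45 87"),
  spec3 = c("1", "3 5 75 101", "75 98 5", "65 32"),
  row.names = c("roger", "david", "rick", "nick"),
  stringsAsFactors = FALSE
)
cool_floyd_position <- c(david = 2, rick = 3, nick = 1)

# nth_word() is a hypothetical helper: split each string on single spaces
# and keep the n-th piece (NA if a string has fewer than n words)
nth_word <- function(x, n) {
  unname(vapply(strsplit(x, " ", fixed = TRUE), function(w) w[n], character(1)))
}

# Overwrite only the rows named in cool_floyd_position; roger is left untouched
for (who in names(cool_floyd_position)) {
  floyd[who, ] <- nth_word(unlist(floyd[who, ]), cool_floyd_position[[who]])
}
floyd
```

Indexing the rows by name keeps the position vector and the data frame in step even if the row order changes.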

You can try a combination of sapply to iterate over the columns of the data frame and mapply to extract the nth word from each cell, e.g. (here df is the original data frame, with people still a regular column),
library(stringr)
df1 <- rbind(df[1,-1], sapply(df[-1,-1], function(i) mapply(word, i, cool_floyd_position)))
rownames(df1) <- df$people
df1
# spec1 spec2 spec3
#roger 1 45 1
#david 5 101 5
#rick 85 8 5
#nick 12 45 65
The only downside of this solution is that people are displayed as rownames rather than as a column. There are many ways to make it a column, e.g.,
df1$people <- rownames(df1)
rownames(df1) <- NULL
df1[c(ncol(df1), 1:(ncol(df1) - 1))]
# people spec1 spec2 spec3
#1 roger 1 45 1
#2 david 5 101 5
#3 rick 85 8 5
#4 nick 12 45 65

Tidyverse solution:
library(stringi) # you have this installed if you have stringr
library(tidyverse)
pick_pos <- function(who, x, lkp) {
if (who %in% names(lkp)) {
map_chr(x, ~stri_split_fixed(., " ")[[1]][lkp[[who]]])
} else {
x
}
}
rownames_to_column(floyd, "people") %>%
mutate_all(funs(as.character)) %>% # necessary since you have factors
group_by(people) %>%
mutate_all(funs(pick_pos(people, ., cool_floyd_position))) %>%
data.frame() %>%
column_to_rownames("people")

Related

Separating values into existing column in R

I'm tidying some data that I read into R from a PDF using tabulizer. Unfortunately some cells haven't been read properly. In column 9 (Split 5 at 37.1km) rows 3 and 4 contain information that should have ended up in column 10 (Final Time).
How do I separate that column (9) just for these rows and paste the necessary data into an already existing column (10)?
I know how to use the tidyr::separate function but can't figure out how (and if) to apply it here. Any help and guidance will be appreciated.
structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))

My answer is not really fancy, but it does the job for any number in the final time column. It works as long as there are always numbers in brackets at the end.
# dummy df
df <- data.frame("split" = c("49:49.65(24)", "49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"),
"final" = c("59:51.68 (23)", "59:57.73 (24)", "", ""))
# combining & splitting strings
merge_strings <- paste0(df$split, df$final)
split_strings <- strsplit(merge_strings, ")")
df$split <- paste0(unlist(lapply(split_strings, "[[", 1)),")")
df$final <- paste0(unlist(lapply(split_strings, "[[", 2)),")")
This gives:
split final
1 49:49.65(24) 59:51.68 (23)
2 49:40.49(23) 59:57.73 (24)
3 50:21.82(25) 1:00:28.39 (25)
4 50:30.02(26) 1:00:41.55 (26)

Calling your data frame df:
library(tidyr)
library(dplyr)
df %>%
separate(`Split 5 at 37.1km`, into = c("Split 5 at 37.1km","aux"), sep = "\\)") %>%
mutate(`Final Time` = coalesce(if_else(`Final Time`!="",`Final Time`, NA_character_), paste0(aux, ")")),
aux = NULL,
`Split 5 at 37.1km` = paste0(`Split 5 at 37.1km`, ")"))
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
Time Behind Average Speed
1 +4:47.49 44.302
2 +4:53.54 44.228
3 +5:24.20 43.854
4 +5:37.36 43.696

You could use dplyr and stringr:
library(dplyr)
library(stringr)
data %>%
mutate(`Final Time` = ifelse(`Final Time` == "", str_remove(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"), `Final Time`),
`Split 5 at 37.1km` = str_extract(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"))
which returns
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26)
Split 5 at 37.1km Final Time Time Behind Average Speed
1 49:49.65(24) 59:51.68 (23) +4:47.49 44.302
2 49:40.49(23) 59:57.73 (24) +4:53.54 44.228
3 50:21.82(25) 1:00:28.39 (25) +5:24.20 43.854
4 50:30.02(26) 1:00:41.55 (26) +5:37.36 43.696

I like to use regex and stringr. Whilst there's some suboptimal code here, the key step is str_match(). Using this we can select the two substrings we want: the first time and the second time. If a cell does not contain both times, the match fails and every group is NA, so we can fill in the columns based on where the missing values occur.
The regex string is as follows: ^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$. Here we have 4 capture groups: the first and third capture the two whole times respectively, and the second and fourth capture the optional hour part (this ensures that times over an hour are completely captured). Additionally, we allow an optional space before the bracketed rank of the second time.
My code is as follows:
library(tidyverse)
data <- structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))
# Take data and use a matching string to the regex pattern
data |>
mutate(match = map(`Split 5 at 37.1km`, ~unlist(str_match(., "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$")))) |>
# Grab the strings that match the whole first and second/final times
mutate(match1 = map(match, ~.[[2]]), match2 = map(match, ~.[[4]]), .keep = "unused") |>
# Check where the NAs are and put into the dataframe accordingly
mutate(`Split 5 at 37.1km`= ifelse(is.na(match1), `Split 5 at 37.1km`, match1),
`Final Time` = ifelse(is.na(match2), `Final Time`, match2), .keep = "unused")
#> Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km
#> 1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22)
#> 2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18)
#> 3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31)
#> 4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32)
#> Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
#> 1 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
#> 2 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
#> 3 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
#> 4 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
#> Time Behind Average Speed
#> 1 +4:47.49 44.302
#> 2 +4:53.54 44.228
#> 3 +5:24.20 43.854
#> 4 +5:37.36 43.696
Created on 2021-07-28 by the reprex package (v2.0.0)
Note that in the above I use the base pipe |>, available from R 4.1 onwards; it can simply be replaced with the magrittr pipe %>% if you are on an earlier R version.
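To see what the capture groups return, here is a small base R check of the same pattern on two of the sample strings. It is only a sketch: it uses regexec() and regmatches() instead of str_match(), and the object names are illustrative.

```r
# Same pattern as in the answer above, applied with base R
pattern <- "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$"
x <- c("49:49.65(24)", "50:21.82(25)1:00:28.39 (25)")

# One list element per string: the whole match followed by the four groups,
# or an empty character vector when the string does not match at all
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

split_time <- m[[2]][2]  # group 1: the first whole time
final_time <- m[[2]][4]  # group 3: the second whole time
```

The first string holds only one time, so it does not match and m[[1]] is empty; this is the case the answer handles by keeping the original column values.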

Loop Output Stored as List

I have wide supervisory data where a single observation consists of a level 1 employee and their department all the way down to level 8. I use a loop with other commands to produce a list of all employees and the departments beneath them in long format so that I can see what departments employees are responsible for at all levels. There may be a more elegant way to do this than a loop, but it works fine. Sample data (through level 3 for succinctness):
data <- tibble(LV1_Employee_Name = "Chuck", LV1_Employee_Nbr = "1", LV1_Department = "Tha Boss", LV1_Department_Nbr = "90",
LV2_Employee_Name = c("Alex", "Alex", "Paul", "Paul", "Jennifer", "Jennifer"), LV2_Employee_Nbr = c("2", "2", "3", "3", "4", "4"), LV2_Department = c("Leadership", "Leadership", "Finance", "Finance", "Philanthropy", "Philanthropy"), LV2_Department_Nbr = c("91", "91", "92", "92", "93", "93"),
LV3_Employee_Name = c("Dan", "Wendy", "Sarah", "Monique", "Miguel", "Brandon"), LV3_Employee_Nbr = c("2", "2", "3", "3", "4", "4"), LV3_Department = c("Analytics", "Pop Health", "Acounting", "Investments", "Yacht Aquisitions", "Golf Junkets"), LV3_Department_Nbr = c("94", "95", "96", "97", "98", "99"))
The loop below first produces six tibbles named level1_1, level1_2, level1_3, level2_2, level2_3, level3_3. Each tibble contains an employee name, number, and the department at the same department level or below. The code then lists and binds the rows of these tibbles with ls() and bind_rows(), then applies the distinct() command, and I've got what I need.
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
for(i in 1:3){
for(k in first_department:3){
assign(paste0("level", i, "_", k), setNames(distinct(as_tibble(c(data[ ,paste0("LV", i, "_", "Employee_Name")], data[ ,paste0("LV", i, "_", "Employee_Nbr")],
data[ ,paste0("LV", k, "_", "Department")], data[ ,paste0("LV", k, "_", "Department_Nbr")]))),
data_colnames))
}
first_department = first_department + 1
}
employees_departments <- distinct(bind_rows(mget(ls(pattern = "^level")))) %>%
filter(is.na(Department) == FALSE)
rm(list = ls(pattern = "^level"))
What I'd like to do is, rather than produce an initial output of six tibbles, have the loop itself output the list. This will save me from having a huge list of tibbles in the output which, I'm told, is not very "R-like".

Here is a revised version that stores the results in a list within your loop. This will include an index idx incremented each time through the loop. Afterwards, you can use bind_rows on this list to get a complete result.
library(tidyverse)
idx <- 1
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
data_lst <- list()
for(i in 1:3){
for(k in first_department:3){
data_lst[[idx]] <- setNames(
distinct(as_tibble(
c(data[ ,paste0("LV", i, "_", "Employee_Name")],
data[ ,paste0("LV", i, "_", "Employee_Nbr")],
data[ ,paste0("LV", k, "_", "Department")],
data[ ,paste0("LV", k, "_", "Department_Nbr")]))),
data_colnames)
idx <- idx + 1
}
first_department = first_department + 1
}
distinct(bind_rows(data_lst)) %>%
filter(!is.na(Department))
Output
Employee Employee_Id Department Department_Number
<chr> <chr> <chr> <chr>
1 Chuck 1 Tha Boss 90
2 Chuck 1 Leadership 91
3 Chuck 1 Finance 92
4 Chuck 1 Philanthropy 93
5 Chuck 1 Analytics 94
6 Chuck 1 Pop Health 95
7 Chuck 1 Acounting 96
8 Chuck 1 Investments 97
9 Chuck 1 Yacht Aquisitions 98
10 Chuck 1 Golf Junkets 99
11 Alex 2 Leadership 91
12 Paul 3 Finance 92
13 Jennifer 4 Philanthropy 93
14 Alex 2 Analytics 94
15 Alex 2 Pop Health 95
16 Paul 3 Acounting 96
17 Paul 3 Investments 97
18 Jennifer 4 Yacht Aquisitions 98
19 Jennifer 4 Golf Junkets 99
20 Dan 2 Analytics 94
21 Wendy 2 Pop Health 95
22 Sarah 3 Acounting 96
23 Monique 3 Investments 97
24 Miguel 4 Yacht Aquisitions 98
25 Brandon 4 Golf Junkets 99
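The same idea can also be sketched without a manual idx counter, by building the (i, k) level pairs first and letting Map() return the list. This is a base R illustration on a cut-down, two-level version of the sample data; the names pairs, pieces and result are assumptions made for the example.

```r
# Cut-down version of the sample data (two levels only), as a plain data frame
data <- data.frame(
  LV1_Employee_Name = "Chuck", LV1_Employee_Nbr = "1",
  LV1_Department = "Tha Boss", LV1_Department_Nbr = "90",
  LV2_Employee_Name = c("Alex", "Paul"), LV2_Employee_Nbr = c("2", "3"),
  LV2_Department = c("Leadership", "Finance"), LV2_Department_Nbr = c("91", "92"),
  stringsAsFactors = FALSE
)
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")

# All employee-level / department-level pairs with department level >= employee level
pairs <- subset(expand.grid(i = 1:2, k = 1:2), k >= i)

# Map() returns one data frame per pair, already collected in a list
pieces <- Map(function(i, k) {
  cols <- c(paste0("LV", i, c("_Employee_Name", "_Employee_Nbr")),
            paste0("LV", k, c("_Department", "_Department_Nbr")))
  setNames(unique(data[cols]), data_colnames)
}, pairs$i, pairs$k)

result <- unique(do.call(rbind, pieces))
```

With the full data, 1:2 would become 1:8 and bind_rows() could replace do.call(rbind, ...).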

Look up value based on partial string match in R

I have a table (table 1) with a bunch of cities (punctuation, capitalization and spaces have been removed).
I want to scan through the 2nd table (table 2) and pull out any record (the first) that exactly matches or contains the string anywhere within it.
# Table 1
city1
1 waterloo
2 kitchener
3 toronto
4 guelph
5 ottawa
# Table 2
city2
1 waterlookitchener
2 toronto
3 hamilton
4 cityofottawa
This would give the 3rd table seen below.
# Table 3
city1 city2
1 waterloo waterlookitchener
2 kitchener waterlookitchener
3 toronto toronto
4 guelph <N/A>
5 ottawa cityofottawa

I believe there are more sophisticated ways of completing your task, but here is a simple approach using tidyverse.
library(tidyverse)
df <- read_table2("city1
waterloo
kitchener
toronto
guelph
ottawa")
df2 <- read_table2("city2
waterlookitchener
toronto
hamilton
cityofottawa")
df3 <- df$city1 %>%
lapply(grep, df2$city2, value=TRUE) %>%
lapply(function(x) if(length(x) == 0) NA_character_ else x[1]) %>%
unlist
df3 <- cbind(df, city2 = df3)
Search for every element of df$city1 in df2$city2 (partial or complete match) and return the matching elements of df2$city2. See ?grep for more information.
Replace character(0) (no match found) with NA, and otherwise keep only the first match so that each city yields exactly one value. See How to convert character(0) to NA in a list with R language? for details.
Convert the list into a vector (unlist).
Attach the result to the list of cities (cbind).

You can also try using fuzzyjoin. In this case, you can use the function stri_detect_fixed from stringi package to identify at least one occurrence of a fixed pattern in a string.
library(fuzzyjoin)
library(stringi)
library(dplyr)
fuzzy_right_join(table2, table1, by = c("city2" = "city1"), match_fun = stri_detect_fixed) %>%
select(city1, city2)
Output
city1 city2
1 waterloo waterlookitchener
2 kitchener waterlookitchener
3 toronto toronto
4 guelph <NA>
5 ottawa cityofottawa
Data
table1 <- structure(list(city1 = c("waterloo", "kitchener", "toronto",
"guelph", "ottawa")), class = "data.frame", row.names = c(NA,
-5L))
table2 <- structure(list(city2 = c("waterlookitchener", "toronto", "hamilton",
"cityofottawa")), class = "data.frame", row.names = c(NA, -4L
))
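A dependency-free variant of the lookup can be sketched in base R with grepl() and fixed = TRUE, so that the city names are treated as literal text rather than regular expressions. The names first_match and table3 are illustrative only.

```r
city1 <- c("waterloo", "kitchener", "toronto", "guelph", "ottawa")
city2 <- c("waterlookitchener", "toronto", "hamilton", "cityofottawa")

# For each city in table 1, keep the first entry of table 2 that contains it,
# or NA when nothing in table 2 contains it
first_match <- vapply(city1, function(p) {
  hits <- city2[grepl(p, city2, fixed = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[1]
}, character(1), USE.NAMES = FALSE)

table3 <- data.frame(city1, city2 = first_match, stringsAsFactors = FALSE)
```

Because vapply() always returns exactly one value per input, the result has the same length as city1 even when a city matches several entries.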

Convert time column in character format to manipulable time format in R

My question is about the standardization of column b. I need these data to be in a format that makes it easier to construct graphics.
a<- c("Jackson Brice / The Shocker","Flash Thompson", "Mr. Harrington","Mac Gargan","Betty Brant", "Ann Marie Hoag","Steve Rogers / Captain America", "Pepper Potts", "Karen")
b<- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55",":45", "v")
ab <- cbind.data.frame(a,b)
a b
1 Jackson Brice / The Shocker 2:30
2 Flash Thompson 2:15
3 Mr. Harrington 2
4 Mac Gargan 1:15
5 Betty Brant 1:15
6 Ann Marie Hoag 1
7 Steve Rogers / Captain America :55
8 Pepper Potts :45
9 Karen v
Desired output:
a b
1 Jackson Brice / The Shocker 00:02:30
2 Flash Thompson 00:02:15
3 Mr. Harrington 00:02:00
4 Mac Gargan 00:01:15
5 Betty Brant 00:01:15
6 Ann Marie Hoag 00:01:00
7 Steve Rogers / Captain America 00:00:55
8 Pepper Potts 00:00:45
9 Karen 00:00:00
If possible, column b should be stored in a time format that can be manipulated.

So I've had to make a few assumptions about what you are trying to do, e.g. units and what you want done with character values but hopefully this function will give you something to work with.
The big challenge with time is that you need some fairly clear rules when parsing it from text. As a result I have had to put a number of if statements in the function to make it work, but wherever possible, try to keep your time formats as consistent as possible.
library(lubridate)
library(data.table) # for rbindlist()
formatTime <- function(x) {
# Check for a : separator in the text
if(grepl(":",x, fixed = TRUE)) {
y <- unlist(strsplit(x,":", fixed = TRUE))
# If there is no value before the : then add "00" before the :
if(y[1]=="") {
z <- ms(paste("00",y[2],sep = ":"), quiet=TRUE)
} else {
z <- ms(paste(y,collapse = ":"), quiet=TRUE)
}
} else {
# If there is no : then treat the value as minutes and append ":00"
z <- ms(paste(x,"00",sep = ":"), quiet=TRUE)
}
# If it did not parse with ms, i.e. it was a character, then assign zero time "0:00"
if(is.na(z)) z <- ms("0:00")
# Converted to duration due to issues returning period with lapply.
# Make dataframe to return units and name with lapply.
return(data.frame(time = as.duration(z)))
}
# Convert factor variable to character
ab$b <- as.character(ab$b)
ab <- cbind(ab,rbindlist(lapply(ab$b,formatTime)))
I started by trying to work with a time period but it wouldn't return correctly with the apply statement so I converted to a duration. This may not display the same as your example but it should play nice with graphs.
Let me know if I've missed what you needed and I'll update the answer.

A solution can be built with tidyr::separate and tidyr::unite. The approach is to first replace any value containing alphabetic characters with 00:00:00, then append :00 to values that have no colon. The value is then separated into 3 columns, dplyr::mutate_at converts all 3 columns to the 00 format, and finally the three columns are united again.
library(tidyverse)
ab %>% mutate_if(is.factor, as.character) %>% #Change any factor in character
mutate(b = ifelse(grepl("[[:alpha:]]", b), "00:00:00", b)) %>%
mutate(b = ifelse(grepl(":", b), b, paste(b,"00",sep=":")) ) %>%
separate(b, into = c("b1", "b2", "b3"), sep = ":", fill="left", extra = "drop") %>%
mutate_at(vars(starts_with("b")),
funs(sprintf("%02d", as.numeric(ifelse(is.na(.) | . == "",0,.))))) %>%
unite("b", starts_with("b"), sep=":")
# a b
# 1 Jackson Brice / The Shocker 00:02:30
# 2 Flash Thompson 00:02:15
# 3 Mr. Harrington 00:02:00
# 4 Mac Gargan 00:01:15
# 5 Betty Brant 00:01:15
# 6 Ann Marie Hoag 00:01:00
# 7 Steve Rogers / Captain America 00:00:55
# 8 Pepper Potts 00:00:45
# 9 Karen 00:00:00
Data:
a<- c("Jackson Brice / The Shocker","Flash Thompson", "Mr. Harrington","Mac Gargan","Betty Brant",
"Ann Marie Hoag","Steve Rogers / Captain America", "Pepper Potts", "Karen")
b<- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55",":45", "v")
ab <- cbind.data.frame(a,b)
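If plain seconds are enough for plotting, the conversion can also be sketched in base R. The rules below mirror the assumptions made in the answers above (a bare number means minutes, a leading colon means zero minutes, non-numeric text becomes zero); to_seconds is a hypothetical helper name.

```r
b <- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55", ":45", "v")

# to_seconds() is a hypothetical helper implementing the rules listed above
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    p[p == ""] <- "0"                              # ":55" -> 0 minutes, 55 seconds
    nums <- suppressWarnings(as.numeric(p))
    if (length(nums) == 0 || anyNA(nums)) return(0) # e.g. "v" -> zero time
    if (length(nums) == 1) nums <- c(nums, 0)       # bare number = minutes
    nums[1] * 60 + nums[2]
  }, numeric(1))
}

secs <- as.integer(to_seconds(b))

# Numeric seconds are convenient for graphs; the padded label is for display
hms <- sprintf("%02d:%02d:%02d", secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
```

Keeping secs as a numeric column and hms as a label avoids having to parse the text again when plotting.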

Construct a new data frame by merging two rows of an existing data frame using common column values in both the row

https://www.dropbox.com/s/prqiojwzpax339z/Test123.xlsx?dl=0
The link contains an xlsx file with one sheet holding a batsman's batting details, where the runs he scored in each innings of a Test match are recorded. Because a batsman can bat in two innings of a Test match, two rows belonging to the same match share identical values in columns such as Opposition, Ground, Start.DateAscending, Match.Number and Result.
Question: how can we combine the rows based on these matching values and create a new data frame with the merged rows?
Example: from the data shared through the link, I am taking the first two rows as a sample to show what I want to achieve. Below is the text representation of the R object of this sample data, derived using an R function:
structure(list(Runs = c("10", "27"), Mins = c("30", "93"), BF = c("19",
"65"), X4s = c("1", "4"), X6s = c("0", "0"), SR = c("52.63",
"41.53"), Pos = c("6", "6"), Dismissal = c("bowled", "caught"
), Inns = c(2, 4), Opposition = c("v England", "v England"),
Ground = c("Lord's", "Lord's"), Start.DateAscending = structure(c(648930600,
648930600), class = c("POSIXct", "POSIXt"), tzone = ""),
Match.Number = c("Test # 1148", "Test # 1148"), Result = c("Loss",
"Loss")), .Names = c("Runs", "Mins", "BF", "X4s", "X6s",
"SR", "Pos", "Dismissal", "Inns", "Opposition", "Ground", "Start.DateAscending",
"Match.Number", "Result"), row.names = 1:2, class = "data.frame")
The data derived from the above block will be something like below:
Runs Mins BF X4s X6s SR Pos Dismissal Inns Opposition Ground
1 10 30 19 1 0 52.63 6 bowled 2 v England Lord's
2 27 93 65 4 0 41.53 6 caught 4 v England Lord's
Start.DateAscending Match.Number Result
1 1990-07-26 Test # 1148 Loss
2 1990-07-26 Test # 1148 Loss
So what I want to achieve is to sum the Runs column values based on the common column values such as Match.Number, Opposition, Ground and Start.DateAscending.
I expect the values like below which will be stored in a new data frame
Runs Opposition Ground Start.DateAscending Match.Number Result
1 37 v England Lord's 1990-07-26 Test # 1148 Loss

We subset the columns of the dataset and use aggregate after converting 'Runs' to numeric class
df1$Runs <- as.numeric(df1$Runs)
colsofinterest <- names(df1)[c(1, 10:ncol(df1))]
aggregate(Runs~., df1[colsofinterest], sum)
# Opposition Ground Start.DateAscending Match.Number Result Runs
#1 v England Lord's 1990-07-26 Test # 1148 Loss 37
Or we can use tidyverse
colsofinterest2 <- names(df1)[10:ncol(df1)]
library(dplyr)
df1 %>%
group_by_(.dots = colsofinterest2) %>%
summarise(Runs = sum(Runs))
# A tibble: 1 x 6
# Groups: Opposition, Ground, Start.DateAscending, Match.Number [?]
# Opposition Ground Start.DateAscending Match.Number Result Runs
# <chr> <chr> <dttm> <chr> <chr> <int>
#1 v England Lord's 1990-07-26 Test # 1148 Loss 37
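As a self-contained check of the aggregate approach, here is a small sketch on a cut-down version of the two sample rows. Only some of the grouping columns are kept, and the name totals is an assumption for the example.

```r
# Cut-down version of the two sample rows; Runs is stored as character, as in the question
df1 <- data.frame(
  Runs = c("10", "27"),
  Opposition = c("v England", "v England"),
  Ground = c("Lord's", "Lord's"),
  Match.Number = c("Test # 1148", "Test # 1148"),
  Result = c("Loss", "Loss"),
  stringsAsFactors = FALSE
)

# sum() needs numbers, so convert Runs before aggregating
df1$Runs <- as.numeric(df1$Runs)

# One row per match, with the runs of both innings added together
totals <- aggregate(Runs ~ Opposition + Ground + Match.Number + Result,
                    data = df1, FUN = sum)
```

Listing the grouping columns explicitly in the formula has the same effect as Runs ~ . on a data frame that contains only those columns.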
