Turning horizontal data into vertical data - r

I have data that is structure like such:
actor_data <- structure(list(id = c(123L, 456L, 789L, 912L, 235L), name = c("Tom Cruise",
"Will Smith", "Ryan Reynolds", "Chris Rock", "Emma Stone"), locationid1 = c(5459L,
NA, 6114L, NA, NA), location1 = c("Paris, France", "", "Brooklyn, NY",
"", ""), locationid2 = c(NA, 5778L, NA, NA, 4432L), location3 = c("",
"Dolby Theater", "", "", "Hollywood"), locationid3 = c(NA, 2526L,
3101L, NA, NA), location3.1 = c("", "London", "Boston", "", ""
), locationid4 = c(6667L, 2333L, 1118L, NA, NA), location4 = c("Virginia",
"Maryland", "Washington", "", "")), class = "data.frame", row.names = c(NA,
-5L))
I am trying to make the location data run vertically instead of horizontally while also making sure its not accounting for blank fields.
So the final result will look like this:
actor_data_exp <- structure(list(id = c(123L, 123L, 456L, 456L, 456L, 789L, 789L,
789L, 235L), name = c("Tom Cruise", "Tom Cruise", "Will Smith",
"Will Smith", "Will Smith", "Ryan Reynolds", "Ryan Reynolds",
"Ryan Reynolds", "Emma Stone"), locationid = c(5459L, 6667L,
5778L, 2526L, 2333L, 6114L, 3101L, 1118L, 4432L), location = c("Paris, France",
"Virginia", "Dolby Theater", "London", "Maryland", "Brooklyn, NY",
"Boston", "Washington", "Hollywood")), class = "data.frame", row.names = c(NA,
-9L))
Or to give you a visual, will end up looking like this:

I would rename the columns that start with "location", so that there is an underscore before the number in each column name. Then use pivot_longer with the underscore as name_sep, and using c(".value", "var") in the names_to argument that to ensure both location and locationid have their own columns. This will also create the redundant column var which will contain the numbers 1-4 that were appended to the original column names.
Finally, filter out missing values and remove the redundant var column.
library(tidyverse)
actor_data %>%
rename_with(~ ifelse(grepl("location", .x),
sub("^(.*?)([0-9]\\.?\\d?)$", "\\1_\\2", .x), .x)) %>%
pivot_longer(starts_with("location"),
names_sep = "_", names_to = c(".value", "var")) %>%
filter(!is.na(locationid) & !is.na(location) & nzchar(location)) %>%
select(-var)
#> # A tibble: 6 x 4
#> id name locationid location
#> <int> <chr> <int> <chr>
#> 1 123 Tom Cruise 5459 Paris, France
#> 2 123 Tom Cruise 6667 Virginia
#> 3 456 Will Smith 2526 Dolby Theater
#> 4 456 Will Smith 2333 Maryland
#> 5 789 Ryan Reynolds 6114 Brooklyn, NY
#> 6 789 Ryan Reynolds 1118 Washington

Related

Trying to calculate the Expected Value of an observation

I have a tibble and am trying to use values from two specific rows (Pinnacle book) to perform a calculation. The values of the calculation will be written to a new column. Here is the output of dput
structure(list(id = c("5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51",
"5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51",
"5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51"
), start = structure(c(1676691000, 1676691000, 1676691000, 1676691000,
1676691000, 1676691000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), book = c("BetUS", "BetUS", "Bovada", "Bovada", "Pinnacle",
"Pinnacle"), home = c("San José St Spartans", "San José St Spartans",
"San José St Spartans", "San José St Spartans", "San José St Spartans",
"San José St Spartans"), away = c("New Mexico Lobos", "New Mexico Lobos",
"New Mexico Lobos", "New Mexico Lobos", "New Mexico Lobos", "New Mexico Lobos"
), team = c("San José St Spartans", "New Mexico Lobos", "San José St Spartans",
"New Mexico Lobos", "San José St Spartans", "New Mexico Lobos"
), price = c(-140, 120, -140, 120, -138, 117), update = c("2023-02-18T00:24:43Z",
"2023-02-18T00:24:43Z", "2023-02-18T00:25:10Z", "2023-02-18T00:25:10Z",
"2023-02-18T00:25:04Z", "2023-02-18T00:25:04Z"), bep = c(0.58333,
0.45455, 0.58333, 0.45455, 0.57983, 0.46083), no_vig = c(-128.33333,
128.33333, -128.33333, 128.33333, -125.82353, 125.82353), no_vig_bep = c(0.56204,
0.43796, 0.56204, 0.43796, 0.55718, 0.44282), win = c(71.43,
120, 71.43, 120, 72.46, 117)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
id = c("5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51",
"5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51",
"5d8f6b2536fbdc4ab6a3e9759ebc6c51", "5d8f6b2536fbdc4ab6a3e9759ebc6c51"
), book = c("BetUS", "BetUS", "Bovada", "Bovada", "Pinnacle",
"Pinnacle"), team = c("New Mexico Lobos", "San José St Spartans",
"New Mexico Lobos", "San José St Spartans", "New Mexico Lobos",
"San José St Spartans"), .rows = structure(list(2L, 1L,
4L, 3L, 6L, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
The following is the calculation
-4.482633 = (.55718 * 71.43) - (.44282 * 100)
The values in the calculation above correspond with the following variables
-4.482633 = Expected Value I am trying to derive
.55718 = "no_vig_bep" of Pinnacle
71.43 = "win" of observation 1
.44282 = 1 - "no_vig_bep" of Pinnacle or the last row
100 = a set amount
I Would then like to calculate the other side of the odds as follow
-2.5796 = (.44282 * 120.00) - (.55718 * 100)
The ultimate goal is to use the values of the Pinnacle book to perform the above calculation against all other books. The EV will be written to a new column.
Included additional id for further clarification
structure(list(id = c("073c154f3c8586868a3ba21522161a70",
"073c154f3c8586868a3ba21522161a70",
"073c154f3c8586868a3ba21522161a70", "073c154f3c8586868a3ba21522161a70",
"073c154f3c8586868a3ba21522161a70", "073c154f3c8586868a3ba21522161a70"
), book = c("Bovada", "Pinnacle", "MyBookie.ag", "MyBookie.ag",
"Pinnacle", "Bovada"), home = c("Western Michigan Broncos", "Western
Michigan Broncos",
"Western Michigan Broncos", "Western Michigan Broncos", "Western
Michigan Broncos",
"Western Michigan Broncos"), away = c("Ball State Cardinals",
"Ball State Cardinals", "Ball State Cardinals", "Ball State Cardinals",
"Ball State Cardinals", "Ball State Cardinals"), team = c("Western
Michigan Broncos",
"Ball State Cardinals", "Western Michigan Broncos", "Ball State
Cardinals",
"Western Michigan Broncos", "Ball State Cardinals"), price = c(-185,
-143, -142, 100, 108, 140), bep = c(0.64912, 0.58848, 0.58678,
0.5, 0.48077, 0.41667), no_vig = c(-155.78947, -122.40329, -117.35537,
117.35537, 122.40329, 155.78947), no_vig_bep = c(0.60905, 0.55037,
0.53992, 0.46008, 0.44963, 0.39095), win = c(54.05, 69.93, 70.42,
100, 108, 140), EV_1 = c(-15.2155015, -6.47562589999999,
-6.20594459999999,
-10.074, -6.47696000000001, 7.91119999999999)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups =
structure(list(
book = c("Bovada", "MyBookie.ag", "Pinnacle"), .rows = structure(list(
c(1L, 6L), 3:4, c(2L, 5L)), ptype = integer(0), class =
c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))
Edit1: Maybe purrr can help you here:
library(purrr)
no_vig_pin_list <- df |>
ungroup() |>
split(df$id) |>
map(~.x |> filter(book == "Pinnacle") |> pull(no_vig_bep))
df |>
ungroup() |>
group_split(id) |>
purrr::map2_dfr(no_vig_pin_list, ~ .x |>
group_by(book) |>
mutate(EV_1 = ifelse(row_number() == 1,
(.y[1] * win)- ((1-.y[1])*100),
(.y[2] * win)- ((1-.y[2])*100)))) |>
select(EV_1)
A tibble: 8 × 2
# Groups: book [2]
book EV_1
<chr> <dbl>
1 BetUS -5.03
2 BetUS -1.10
3 Pinnacle -4.07
4 Pinnacle -4.07
5 BetUS -4.48
6 BetUS -2.58
7 Pinnacle -3.91
8 Pinnacle -3.91
Maybe this helps, I am not quite sure that I understood what you are trying to achieve.
library(dplyr)
no_vig_pin <- df |>
filter(book == "Pinnacle") |>
pull(no_vig_bep)
df |>
group_by(book) |>
mutate(EV_1 = ifelse(row_number() == 1,
(no_vig_pin[1] * win)- ((1-no_vig_pin[1])*100),
(no_vig_pin[2] * win)- ((1-no_vig_pin[2])*100))) |>
select(EV_1)
Output:
# A tibble: 6 × 2
# Groups: book [3]
book EV_1
<chr> <dbl>
1 BetUS -4.48
2 BetUS -2.58
3 Bovada -4.48
4 Bovada -2.58
5 Pinnacle -3.91
6 Pinnacle -3.91

Extracting and evaluating words in a text string against another dataset

I have two sets of data that I will be evaluating against one another. A heavily reduced example looks like this:
library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Craig Mills"), biography = c("Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!")), class = "data.frame", row.names = c(NA,
-3L))
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
I am trying to create a match against the contents of the biography text string in people_records_ex against college_name in college_records_ex so the final output will look like this:
final_records_ex <- structure(list(id = c(123L, 456L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Jeff Smith", "Craig Mills"), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
Or to provide a more visual example of the final output I'm expecting:
But when I run the following code, it produces zero results, which is not correct:
college_extract <- people_records_ex %>%
left_join(college_records_ex, by = c("biography" = "college_name")) %>%
filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()
What am I doing incorrectly and what would the correct version look like?
Here's a very tidy and straightforward solution with fuzzy_join:
library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
people_records_ex, college_records_ex,
by = c("biography" = "college_name"),
match_fun = str_detect,
mode = "left"
) %>%
select(-biography)
id name college_id college_name college_city college_state
1 123 Anna Wilson 234 Ohio State University Columbus OH
2 456 Jeff Smith 567 Stanford Stanford CA
3 456 Jeff Smith 891 William & Mary Williamsburg VA
4 789 Craig Mills 345 University of North Texas Denton TX
Assuming the college names in the biographies are spelled out exactly as they appear in the colleges table and the datasets are relatively small, all matches can be generated with a regex of all college names as follows
library(dplyr)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
"Anna Wilson",
"Jeff Smith", "Craig Mills"
), biography = c(
"Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
NA,
-3L
)) %>% tibble::tibble()
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
"Ohio State University",
"Stanford", "William & Mary", "University of North Texas"
), college_city = c(
"Columbus",
"Stanford", "Williamsburg", "Denton"
), college_state = c(
"OH",
"CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
tibble::tibble()
# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")
colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"
# match all against bio, giving a list-column of matches
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#> id name biography matches
#> <int> <chr> <chr> <list>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. <chr[…]>
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3 789 Craig Mills University of North Texas Volleyball! <chr[…]>
# unnest the list column wider to give 1 row per person per match
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#> id name biography match…¹
#> <int> <chr> <chr> <chr>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. Ohio S…
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Willia…
#> 4 789 Craig Mills University of North Texas Volleyball! Univer…
#> # … with abbreviated variable name ¹​matches[,1]
Created on 2022-10-26 with reprex v2.0.2
This may be joined back to the college table such that it is annotated with college info.
In base R you can do:
do.call(rbind, lapply(college_records_ex$college_name,
\(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |>
cbind(college_records_ex[-1])
This does some matching and I subsetted the first two columns which are the id and name, cbinding it with the second data.frame getting rid of the first column
id name college_name college_city college_state
1 123 Anna Wilson Ohio State University Columbus OH
2 456 Jeff Smith Stanford Stanford CA
21 456 Jeff Smith William & Mary Williamsburg VA
3 789 Craig Mills University of North Texas Denton TX

Pivoting columns to rows

I have a set of data where the columns I'm trying to pivot vertically are stored as such:
testdata <- structure(list(id = c(723L, 621L, NA, NA, NA, NA, NA, NA, NA),
fullName = c("Will Smith", "Chris Rock", "", "", "", "",
"", "", ""), latestPosts.0.locationId = c(212928653L, 34505L,
NA, NA, NA, NA, NA, NA, NA), latestPosts.0.locationName = c("Miami",
"Atlanta", "", "", "", "", "", "", ""), latestPosts.1.locationId = c(1040683L,
20326736L, NA, NA, NA, NA, NA, NA, NA), latestPosts.1.locationName = c("New York",
"London", "", "", "", "", "", "", ""), latestPosts.2.locationId = c(NA,
215307317L, NA, NA, NA, NA, NA, NA, NA), latestPosts.2.locationName = c("",
"Paris", "", "", "", "", "", "", ""), latestPosts.3.locationId = c(1147378L,
34505L, NA, NA, NA, NA, NA, NA, NA), latestPosts.3.locationName = c("Seattle",
"Atlanta", "", "", "", "", "", "", ""), latestPosts.4.locationId = c(1147378L,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.4.locationName = c("Seattle",
"", "", "", "", "", "", "", ""), latestPosts.5.locationId = c(238334931,
9432076525, NA, NA, NA, NA, NA, NA, NA), latestPosts.5.locationName = c("San Francisco",
"Brooklyn", "", "", "", "", "", "", ""), latestPosts.6.locationId = c(881699386L,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.6.locationName = c("San Diego",
"", "", "", "", "", "", "", ""), latestPosts.7.locationId = c(NA,
234986797L, NA, NA, NA, NA, NA, NA, NA), latestPosts.8.locationId = c(1147378,
9021444765, NA, NA, NA, NA, NA, NA, NA), latestPosts.8.locationName = c("Seattle",
"Cleveland", "", "", "", "", "", "", ""), latestPosts.9.locationId = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.9.locationName = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.10.locationId = c(408631288L,
234986797L, NA, NA, NA, NA, NA, NA, NA), latestPosts.10.locationName = c("Portland",
"Orlando", "", "", "", "", "", "", ""), latestPosts.11.locationId = c(52043757619,
34505, NA, NA, NA, NA, NA, NA, NA), latestPosts.11.locationName = c("Nashville",
"Atlanta", "", "", "", "", "", "", "")), class = "data.frame", row.names = c(NA,
-9L))
I am trying to pivot where any time latestPosts.n.locationId OR latestPosts.n.locationName (in this case, n is a placeholder for the number in between the two) is not blank or not NA, it pivots so that the final output looks as such:
testdata_exp <- structure(list(id = c(723L, 724L, 725L, 726L, 727L, 728L, 729L,
730L, 731L, 621L, 622L, 623L, 624L, 625L, 626L, 627L, 628L, 629L
), fullName = c("Will Smith", "Will Smith", "Will Smith", "Will Smith",
"Will Smith", "Will Smith", "Will Smith", "Will Smith", "Will Smith",
"Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock",
"Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock"), locationId = c(212928653,
1040683, 1147378, 1147378, 238334931, 881699386, 1147378, 408631288,
52043757619, 34505, 20326736, 215307317, 34505, 9432076525, 234986797,
9021444765, 234986797, 34505), locationName = c("Miami Beach, Florida",
"Starbucks", "University of Evansville", "University of Evansville",
"Downtown Evansville", "Garden Of The Gods", "University of Evansville",
"Phi Gamma Delta - Epsilon Iota", "Nashville Pride", "University of the South",
"Riverview Camp For Girls", "Chattanooga, Tennessee", "University of the South",
"Grand Sirenis Riviera Maya Resort", "", "Sleepyhead Coffee",
"Sewanee, Tennessee", "University of the South")), class = "data.frame", row.names = c(NA,
-18L))
Or for a visual representation:
A couple things to keep in mind:
The number of latestPosts.n.locationId OR latestPosts.n.locationName might change from dataset to dataset, so it's best to account for not knowing how many there will be. This example goes up to 11, but other times it might be more or less.
If there is a locationId present, it doesn't always mean there will be a matching locationName field. Using this data as an example, there is a latestPosts.7.locationId field but no subsequent latestPosts.7.locationName
field.
Here's another variation (also using pivot_longer)
library(dplyr)
library(tidyr)
testdata %>%
pivot_longer(-c(id, fullName),
names_to = c("n", ".value"),
names_pattern = "latestPosts\\.([0-9]+)\\.(.+)") %>%
select(-n) %>%
filter(!((is.na(locationId) | locationId == '') & (is.na(locationName) | locationName == '')))
#> # A tibble: 18 × 4
#> id fullName locationId locationName
#> <int> <chr> <dbl> <chr>
#> 1 723 Will Smith 212928653 Miami
#> 2 723 Will Smith 1040683 New York
#> 3 723 Will Smith 1147378 Seattle
#> 4 723 Will Smith 1147378 Seattle
#> 5 723 Will Smith 238334931 San Francisco
#> 6 723 Will Smith 881699386 San Diego
#> 7 723 Will Smith 1147378 Seattle
#> 8 723 Will Smith 408631288 Portland
#> 9 723 Will Smith 52043757619 Nashville
#> 10 621 Chris Rock 34505 Atlanta
#> 11 621 Chris Rock 20326736 London
#> 12 621 Chris Rock 215307317 Paris
#> 13 621 Chris Rock 34505 Atlanta
#> 14 621 Chris Rock 9432076525 Brooklyn
#> 15 621 Chris Rock 234986797 <NA>
#> 16 621 Chris Rock 9021444765 Cleveland
#> 17 621 Chris Rock 234986797 Orlando
#> 18 621 Chris Rock 34505 Atlanta
You could do
library(tidyverse)
testdata %>%
rename_all(~sub("latestPosts\\.", "", .x)) %>%
mutate(across(contains("location"), as.character)) %>%
mutate(rownum = row_number()) %>%
pivot_longer(contains("location")) %>%
separate(name, into = c("group", "var")) %>%
group_by(id, fullName, group, rownum) %>%
summarise(var = c("locationId", "locationName"),
value = if(n() == 1) c(value, NA) else value, .groups = "drop") %>%
pivot_wider(names_from = var, values_from = value) %>%
select(id, fullName, locationId, locationName) %>%
filter((!is.na(locationName) & nzchar(locationName)) | !is.na(locationId)) %>%
mutate(locationId = as.numeric(locationId))
#> # A tibble: 18 x 4
#> id fullName locationId locationName
#> <int> <chr> <dbl> <chr>
#> 1 621 Chris Rock 34505 Atlanta
#> 2 621 Chris Rock 20326736 London
#> 3 621 Chris Rock 234986797 Orlando
#> 4 621 Chris Rock 34505 Atlanta
#> 5 621 Chris Rock 215307317 Paris
#> 6 621 Chris Rock 34505 Atlanta
#> 7 621 Chris Rock 9432076525 Brooklyn
#> 8 621 Chris Rock 234986797 NA
#> 9 621 Chris Rock 9021444765 Cleveland
#> 10 723 Will Smith 212928653 Miami
#> 11 723 Will Smith 1040683 New York
#> 12 723 Will Smith 408631288 Portland
#> 13 723 Will Smith 52043757619 Nashville
#> 14 723 Will Smith 1147378 Seattle
#> 15 723 Will Smith 1147378 Seattle
#> 16 723 Will Smith 238334931 San Francisco
#> 17 723 Will Smith 881699386 San Diego
#> 18 723 Will Smith 1147378 Seattle

Addition of specific rows in a dataframe

I'm trying to add specific row of a data frame together.
And short of using grepl to find lines and then rbinding them to the bottom, I'm not sure if there's a better way to do this.
this is my input df:
input = structure(list(
V1 = c("Sales", "Sales", "Sales", "Sales", "Sales","Sales"),
V2 = c("Johnny", "Meg", "Fred", "Johnny", "Meg", "Fred"),
V3 = c("Australia", "Australia", "Australia", "NZ", "NZ","NZ"),
V4 = c(154L, 1898L, 175L, 1235L, 23L, 255L)), row.names = c(NA,6L),
class = "data.frame")
and this is my expected output:
structure(list(
V1 = c("Sales", "Sales", "Sales", "Sales", "Sales",
"Sales", "Sales", "Sales", "Sales", "Sales", "Sales", "Sales"),
V2 = c("Johnny", "Meg", "Fred", "Johnny", "Meg", "Fred", "Johnny + Fred",
"Meg + Fred", "Johnny + Meg + Fred", "Johnny + Fred", "Meg + Fred",
"Johnny + Meg + Fred"),
V3 = c("Australia", "Australia", "Australia", "NZ",
"NZ", "NZ", "Australia", "Australia", "Australia", "NZ", "NZ", "NZ"),
V4 = c(154L, 1898L, 175L, 1235L, 23L, 255L, 329L, 2073L, 2227L, 1490L, 278L, 1513L)),
class = "data.frame", row.names = c(NA, -12L)
)
I would've thought there's a better way to there's a better way of adding these rows that filtering and then adding, and then joining etc.
Can anyone point me in the right direction of what I should be looking for?
I solve the problem using combn
Data input part
input = structure(list(
V1 = c("Sales", "Sales", "Sales", "Sales", "Sales","Sales"),
V2 = c("Johnny", "Meg", "Fred", "Johnny", "Meg", "Fred"),
V3 = c("Australia", "Australia", "Australia", "NZ", "NZ","NZ"),
V4 = c(154L, 1898L, 175L, 1235L, 23L, 255L)), row.names = c(NA,6L),
class = "data.frame")
structure(list(
V1 = c("Sales", "Sales", "Sales", "Sales", "Sales",
"Sales", "Sales", "Sales", "Sales", "Sales", "Sales", "Sales"),
V2 = c("Johnny", "Meg", "Fred", "Johnny", "Meg", "Fred", "Johnny + Fred",
"Meg + Fred", "Johnny + Meg + Fred", "Johnny + Fred", "Meg + Fred",
"Johnny + Meg + Fred"),
V3 = c("Australia", "Australia", "Australia", "NZ",
"NZ", "NZ", "Australia", "Australia", "Australia", "NZ", "NZ", "NZ"),
V4 = c(154L, 1898L, 175L, 1235L, 23L, 255L, 329L, 2073L, 2227L, 1490L, 278L, 1513L)),
class = "data.frame", row.names = c(NA, -12L)
)
Solution
library(dplyr)
TT = unique(input$V2)
> TT
[1] "Johnny" "Meg" "Fred"
comb2 = combn(TT,2,simplify = FALSE)
> comb2
[[1]]
[1] "Johnny" "Meg"
[[2]]
[1] "Johnny" "Fred"
[[3]]
[1] "Meg" "Fred"
comb3 = combn(TT,3,simplify = FALSE)
> comb3
[[1]]
[1] "Johnny" "Meg" "Fred"
result = function(data){
purrr::map_df(lapply(data,function(x){paste(x,collapse = '|')}), function(x){
df = input[grepl(x,input$V2),] %>% group_by(V3)%>%summarize(V1= 'Sales',
V2= paste(V2,collapse = '+'),
V4= sum(V4))
return(df)
}
)
}
Result
result(comb2)
# A tibble: 6 x 4
V3 V1 V2 V4
<chr> <chr> <chr> <int>
1 Australia Sales Johnny+Meg 2052
2 NZ Sales Johnny+Meg 1258
3 Australia Sales Johnny+Fred 329
4 NZ Sales Johnny+Fred 1490
5 Australia Sales Meg+Fred 2073
6 NZ Sales Meg+Fred 278
result(comb3)
# A tibble: 2 x 4
V3 V1 V2 V4
<chr> <chr> <chr> <int>
1 Australia Sales Johnny+Meg+Fred 2227
2 NZ Sales Johnny+Meg+Fred 1513
finalResult = bind_rows(A,B,input) %>%
select(V1,V2,V3,V4) %>% filter(! V2 %in% c('Johnny+Meg'))
> finalResult
# A tibble: 12 x 4
V1 V2 V3 V4
<chr> <chr> <chr> <int>
1 Sales Johnny+Fred Australia 329
2 Sales Johnny+Fred NZ 1490
3 Sales Meg+Fred Australia 2073
4 Sales Meg+Fred NZ 278
5 Sales Johnny+Meg+Fred Australia 2227
6 Sales Johnny+Meg+Fred NZ 1513
7 Sales Johnny Australia 154
8 Sales Meg Australia 1898
9 Sales Fred Australia 175
10 Sales Johnny NZ 1235
11 Sales Meg NZ 23
12 Sales Fred NZ 255
Using tidyverse we can first split the dataframe based on V3 then create combination of names and add sum to create a new tibble and bind it to the original dataframe.
library(tidyverse)
input %>%
bind_rows(input %>%
group_split(V3) %>%
map_dfr(function(x) map_dfr(2:nrow(x), ~tibble(
V1 = first(x$V1),
V2 = combn(x$V2, ., paste, collapse = " + "),
V3 = first(x$V3),
V4 = combn(x$V4, .,sum)) %>%
filter(grepl("\\bFred\\b", V2)))))
# V1 V2 V3 V4
#1 Sales Johnny Australia 154
#2 Sales Meg Australia 1898
#3 Sales Fred Australia 175
#4 Sales Johnny NZ 1235
#5 Sales Meg NZ 23
#6 Sales Fred NZ 255
#7 Sales Johnny + Fred Australia 329
#8 Sales Meg + Fred Australia 2073
#9 Sales Johnny + Meg + Fred Australia 2227
#10 Sales Johnny + Fred NZ 1490
#11 Sales Meg + Fred NZ 278
#12 Sales Johnny + Meg + Fred NZ 1513
Using the same logic but in base R, we can do
rbind(input, do.call(rbind, lapply(split(input, input$V3), function(x)
do.call(rbind, lapply(2:nrow(x), function(y)
subset(data.frame(V1 = x$V1[1],
V2 = combn(x$V2, y, paste, collapse = " + "),
V3 = x$V3[1],
V4 = combn(x$V4, y, sum)),
grepl("\\bFred\\b", V2)))))))

Compare value in two data frame; move values between data frames based on value

I have been unable to find an answer. There probably is one on stackoverflow... but I have not found one that I can use.
I have two data frames (db.1 and db.larger). what I need to do is:
if db.1$ID == db.larger$ID
db1$Gender <- db.larger$Gender
I need to copy the Gender value from db.larger to db.1 if the ID is a match.
Both data frames are between 500.000 rows and six million.
db.1 contains duplicates as more columns not shown in this example contain unique and vital information that I must keep.
both data frames contain more columns than shown
the ID values are characters as they can contain leading zeros.
I have been unable to use match as there are more than one occurrences of persons in db.1
Merge has not worked for me as it adds more data (columns) to the data frames than I want.
Here are the example output files:
db.1 <- structure(list(ID = c("453", "286", "345", "853", "675", "754","445", "564", "651", "685", "453", "286", "345"), Gender = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Name = c("Rashad Lawrence", "Ali Santana", "Cordell Cobb", "Amani Bennett", "Donavan Frank", "Jeffrey Michael", "Aliana Trujillo", "Cheyanne Wyatt", "Kayden Padilla", "Jasmine Glass", "Rashad Lawrence", "Ali Santana", "Cordell Cobb"), Score = c(0, 0.044, 0.822, 0.322, 0.394, 0.309, 0.826, 0.729, 0.318, 0.6, 0.648, 0.547, 0.53)), .Names = c("ID", "Gender","Name", "Score"), row.names = c(NA, -13L), class = "data.frame")
and
db.larger <- structure(list(ID = c("123", "158", "286", "345", "445", "453", "469", "546", "564", "566", "651", "675", "682", "685", "741", "754", "789", "852", "853", "963"), Gender = c(1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1), Name = c("Dexter Holmes", "Roman Macias", "Ali Santana", "Cordell Cobb", "Aliana Trujillo", "Rashad Lawrence", "Preston Mckee", "Kyra Howe", "Cheyanne Wyatt", "Tobias Hart", "Kayden Padilla", "Donavan Frank", "Jamie Yoder", "Jasmine Glass", "Jamar Carter", "Jeffrey Michael", "Erick Tate", "Darion Graves", "Amani Bennett", "Regina Sanders")), .Names = c("ID", "Gender", "Name"), row.names = c(NA, 20L), class = "data.frame")
Since you always have missing values in db.1$Gender, you can delete this column and then perform an inner_join from dplyr. This procedure keeps the duplicates in db.1.
library(dplyr)
db.1 <- db.1 %>%
select(-Gender)
db.combine <- inner_join(db.1,db.larger, by = "ID")
db.combine
ID Name.x Gender Name.y
1 453 Rashad Lawrence 1 Rashad Lawrence
2 286 Ali Santana 2 Ali Santana
3 345 Cordell Cobb 1 Cordell Cobb
4 853 Amani Bennett 1 Amani Bennett
5 675 Donavan Frank 2 Donavan Frank
6 754 Jeffrey Michael 2 Jeffrey Michael
7 445 Aliana Trujillo 1 Aliana Trujillo
8 564 Cheyanne Wyatt 2 Cheyanne Wyatt
9 651 Kayden Padilla 2 Kayden Padilla
10 685 Jasmine Glass 2 Jasmine Glass
11 453 Rashad Lawrence 1 Rashad Lawrence
12 286 Ali Santana 2 Ali Santana
13 345 Cordell Cobb 1 Cordell Cobb
Your Name variables are apparently not perfect matches, you could simply delete either Name.x or Name.y using select, however.

Resources