How to find search words from one table in another table, and then create new columns from the results?

I'm trying to find specific words listed in a tibble arbeit in another tibble, rawEng$Text. If a word, or words, are found, I want to create (mutate) a new data frame iDataArbeit with two new columns: one for the found word/s, wArbeit, and one for the sum of their tf-idf scores from arbeit$tfidf, iArbeit.
My Data:
arbeit:
X1 feature tfidf
<dbl> <chr> <dbl>
1 0 sick 0.338
2 2 contract 0.188
3 3 pay 0.175
4 4 job 0.170
5 5 boss 0.169
6 6 sozialversicherungsnummer 0.169
rawEng:
Gender Gruppe Datum Text
<chr> <chr> <dttm> <chr>
1 F Berlin Expats 2017-07-07 00:00:00 Anyone out there who's had to apply for Führung~
2 F FAB 2018-01-18 00:00:00 Dear FAB, I am in need of a Führungszeugnis no ~
3 M Free Advice ~ 2017-01-30 00:00:00 Dear Friends, i would like to ask you how can I~
4 M FAB 2018-04-12 00:00:00 "Does anyone know why the \"Standesamt Pankow (~
5 F Berlin Expats 2018-11-12 00:00:00 having trouble finding consistent information a~
6 F Toytown Berl~ 2017-06-08 00:00:00 "Hello\r\n\r\nI have a question regarding Airbn~
I've tried with dplyr::mutate, using this code:
idataEnArbeit <- mutate(rawEng,
                        wArbeit = ifelse(str_count(rawEng$Text, arbeit$feature) >= 1,
                                         arbeit$feature, NA),
                        iArbeit = ifelse(str_count(rawEng$Text, arbeit$feature) >= 1,
                                         arbeit$tfidf, NA))
but all I get is one word, and its tf-idf score, in the new columns iDataArbeit$wArbeit and iDataArbeit$iArbeit:
Gender Gruppe Datum Text wArbeit iArbeit
<chr> <chr> <dttm> <chr> <chr> <dbl>
1 F Berlin | Girl ~ 2018-09-11 13:22:05 "11 septembre, 13:21 GGI ~ sick 0.338
2 F ExpatBabies Be~ 2017-10-19 16:24:23 "16:24 Babysitter needed! B~ sick 0.338
3 F Berlin | Girl ~ 2018-06-22 18:24:19 "gepostet. Leonor Valen~ sick 0.338
4 F 'Neu in Berlin' 2018-09-18 23:19:51 "Hello guys, I am working wit~ sick 0.338
5 M Free Advice Be~ 2018-04-27 08:49:24 "In need of legal advice: Wha~ sick 0.338
6 F Free Advice Be~ 2018-07-04 18:33:03 "Is there somebody I can pay ~ sick 0.338
In summary: I want all words from arbeit$feature that are found in rawEng$Text to be added to iDataArbeit$wArbeit, and the sum of their tf-idf scores to be added to iDataArbeit$iArbeit.

Since I don't have your data, I'll import the gutenbergr library and play with Treasure Island.
library(tidytext)
library(gutenbergr)
## Now get the dataset
Treasure_Island <- gutenberg_works(title == "Treasure Island") %>%
  pull(gutenberg_id) %>%
  gutenberg_download()
## and construct a toy arbeit:
arbeit <- data.frame(feature = c("island", "treasure", "to"),
                     tfidf = c(0.3, 0.5, 0.6))
## Break the text up into individual words (the head is just to keep the example short... you'd omit it)
tidy_treasure <- unnest_tokens(Treasure_Island, feature, text, drop = FALSE) %>%
  head(500)
## now bring the tfidf into tidy_treasure
df <- left_join(tidy_treasure, arbeit, by = "feature")
## and now you can average by sentence normally.
## To get the words we have to throw out the words that don't contribute to our tfidf.
## Two options:
df %>%
  filter(!is.na(tfidf)) %>%
  group_by(text) %>%
  summarize(AveTFIDF = sum(tfidf, na.rm = TRUE),
            Words = paste(feature, collapse = ";"))
## Or if you want to keep a row for each found word, we can't use summarize, but we can still add them all up.
df %>%
  filter(!is.na(tfidf)) %>%
  group_by(text) %>%
  mutate(AveTFIDF = sum(tfidf, na.rm = TRUE))
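Applied back to your own data, the same pattern would look roughly like this. A hedged sketch, assuming rawEng and arbeit are structured as printed above; the post_id column is something I add so that two posts with identical text stay distinct:
library(dplyr)
library(tidytext)
iDataArbeit <- rawEng %>%
  mutate(post_id = row_number()) %>%
  ## one row per word, keeping the original Text alongside
  unnest_tokens(feature, Text, drop = FALSE) %>%
  ## keep only the words that appear in arbeit
  inner_join(arbeit, by = "feature") %>%
  group_by(post_id, Gender, Gruppe, Datum, Text) %>%
  summarize(wArbeit = paste(unique(feature), collapse = ";"),
            iArbeit = sum(tfidf), .groups = "drop")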


How to conditionally mutate a new column when data is in long format, and condition is depending on grouping combination

I have data in long format, and I'm trying to test each row against the mean of a certain grouping combination, in order to generate a new column with the conclusion from that test.
Example
In this toy example, I have data about 20 cars. Each car could be of one of three possible makers. We have mpg data for each car, measured 8 times: in the city or highway, in the morning or evening, during the winter or spring.
library(tidyr)
set.seed(2021)
df_id_and_makers <-
  data.frame(id = 1:20,
             maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))
df <- tidyr::expand_grid(df_id_and_makers,
                         road_type = c("city", "highway"),
                         time_of_day = c("morning", "evening"),
                         season = c("winter", "spring"))
df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)
df
#> # A tibble: 160 x 6
#> id maker road_type time_of_day season mpg_val
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 bmw city morning winter 28
#> 2 1 bmw city morning spring 22
#> 3 1 bmw city evening winter 40
#> 4 1 bmw city evening spring 18
#> 5 1 bmw highway morning winter 19
#> 6 1 bmw highway morning spring 36
#> 7 1 bmw highway evening winter 30
#> 8 1 bmw highway evening spring 16
#> 9 2 audi city morning winter 33
#> 10 2 audi city morning spring 18
#> # ... with 150 more rows
Created on 2021-07-07 by the reprex package (v2.0.0)
I want to analyze this data to test my hypothesis that mpg in the city is larger than mpg on the highway. To this end, I want to create a new column that tests whether the value in mpg_val when road_type is city is larger than the mean of mpg_val across rows where road_type is highway. Furthermore, I want to compare only among cars of the same maker.
So, for example, id = 1 is bmw, and therefore the new column should test each value of mpg_val in rows where road_type == city (i.e., rows 1-4, but not 5-8), and see whether mpg_val is larger than mean(mpg_val) in rows where road_type == highway and maker == bmw.
Expected output
Here's the manual and dumb way of doing this. I'll show only how I do this for maker = bmw for the sake of demonstration.
library(dplyr)
# step 1 -- calculate the mean of `mpg_val` for `road_type = highway` and only across bmw
mean_bmw_highway_mpg <-
  df %>%
  filter(maker == "bmw",
         road_type == "highway") %>%
  pull(mpg_val) %>%
  mean()
mean_bmw_highway_mpg
mean_bmw_highway_mpg
## [1] 26.22222
# step 2 -- compare each row where `maker = "bmw"` and `road_type = "city"` for its `mpg_val` against `mean_bmw_highway_mpg`
result_bmw_only <-
  df %>%
  mutate(is_mpg_city_larger_than_mpg_highway = case_when(
    maker != "bmw" ~ "not_relevant",
    road_type != "city" ~ "not_relevant",
    mpg_val > mean_bmw_highway_mpg ~ "yes",
    TRUE ~ "no"))
result_bmw_only
## # A tibble: 160 x 7
## id maker road_type time_of_day season mpg_val is_mpg_city_larger_than_mpg_highway
## <int> <chr> <chr> <chr> <chr> <int> <chr>
## 1 1 bmw city morning winter 28 yes ## because 28 > 26.222
## 2 1 bmw city morning spring 22 no ## because 22 < 26.222
## 3 1 bmw city evening winter 40 yes
## 4 1 bmw city evening spring 18 no
## 5 1 bmw highway morning winter 19 not_relevant
## 6 1 bmw highway morning spring 36 not_relevant
## 7 1 bmw highway evening winter 30 not_relevant
## 8 1 bmw highway evening spring 16 not_relevant
## 9 2 audi city morning winter 33 not_relevant
## 10 2 audi city morning spring 18 not_relevant
## # ... with 150 more rows
How could I achieve the same result as result_bmw_only (but applied to the entire df) in a more elegant way? Hopefully with a dplyr approach, because that is what I'm used to, but otherwise any method will do.
Thanks!
EDIT 1
One solution I could think of involves purrr, but I can't get this done yet.
library(purrr)
solution_purrr <-
  df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col = map(
    .x = data,
    .f = ~ .x %>%
      mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(
        road_type != "city" ~ "not_relevant",
        mpg_val > mean(mpg_val) ~ "yes",
        TRUE ~ "no"))))
It seems that solution_purrr gets the desired output, but not exactly. This is because the second logic in case_when (i.e., mpg_val > mean(mpg_val) ~ "yes") is not what I want. I want to compare mpg_val to mean(mpg_val) when that mean is computed based only on rows where road_type == "highway". But here mean(mpg_val) computes across all rows.
EDIT 2
Based on @Till's answer below, I'd like to clarify that I'm looking for a solution that avoids a separate calculation of the mean we want to test against. What I did above with mean_bmw_highway_mpg is the undesired way of working towards the output. I showed mean_bmw_highway_mpg only to demonstrate the kind of mean I need to calculate.
What you tried is already close. Take a look at the documentation of dplyr::group_by(); it is designed for these kinds of operations.
Below is how you can expand your BMW-only solution to the full dataset using group_by().
library(tidyverse)
mean_highway_mpg_df <-
  df %>%
  filter(road_type == "highway") %>%
  group_by(maker) %>%
  summarise(mean_highway_mpg = mean(mpg_val))
result_df <-
  df %>%
  filter(road_type == "city") %>%
  group_by(maker) %>%
  left_join(mean_highway_mpg_df) %>%
  mutate(mpg_city_higher_highway = mpg_val > mean_highway_mpg)
#> Joining, by = "maker"
result_df %>%
select(-(time_of_day:season))
#> # A tibble: 80 x 6
#> # Groups: maker [3]
#> id maker road_type mpg_val mean_highway_mpg mpg_city_higher_highway
#> <int> <chr> <chr> <int> <dbl> <lgl>
#> 1 1 bmw city 28 26.2 TRUE
#> 2 1 bmw city 22 26.2 FALSE
#> 3 1 bmw city 40 26.2 TRUE
#> 4 1 bmw city 18 26.2 FALSE
#> 5 2 audi city 33 28.1 TRUE
#> 6 2 audi city 18 28.1 FALSE
#> 7 2 audi city 35 28.1 TRUE
#> 8 2 audi city 36 28.1 TRUE
#> 9 3 audi city 25 28.1 FALSE
#> 10 3 audi city 32 28.1 TRUE
#> # … with 70 more rows
I think I got this. The following solution is based on both my EDIT 1 above, as well as @MrFlick's comment here.
First, we define a helper function:
is_x_larger_than_mean_y <- function(x, y) {
  x > mean(y)
}
Then, we run:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col = map(
    .x = data,
    .f = ~ .x %>%
      mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(
        road_type != "city" ~ "not_relevant",
        is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes",
        TRUE ~ "no")))) %>%
  select(-data) %>%
  unnest(cols = tbl_with_desired_new_col)
This way, the line within case_when() that says is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes" ensures that we compute the mean of mpg_val only based on rows in which road_type == "highway".
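For completeness, the same logic also works without nest()/map() at all: a grouped mutate() evaluates its expressions within each group, so you can subset mpg_val directly inside the call. A minimal sketch using the same df as above:
library(dplyr)
df %>%
  group_by(maker) %>%
  mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(
    road_type != "city" ~ "not_relevant",
    # mean() here sees only the current maker's highway rows
    mpg_val > mean(mpg_val[road_type == "highway"]) ~ "yes",
    TRUE ~ "no")) %>%
  ungroup()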

Multithread computation with R: how to get all different random numbers?

Does anyone know how to get all of the random numbers to be different in the following code, e.g. with the doRNG package? I don't care about reproducibility.
Edit: Duplicates by pure chance are accepted.
rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)
# ====== Data Preparation ======
dt = data.frame(id = 1:10,
                part = rep("dt", 10),
                HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
                random = NA)
# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)
cl = makeCluster(3, outfile = "")
registerDoParallel(cl)
# ====== SIMULATION ======
nsim = 1000 # number of simulations
iterChunk = 100 # split nsim into this many chunks
out = data.frame() # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
  strt = Sys.time()
  out_iter =
    foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE,
            .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr")) %dopar% {
      # simulation number
      id_sim = iterChunk * (iter - 1) + i
      ## Generate random numbers
      tmp_sim = is.na(dt$HG) # no results yet
      dt$random[tmp_sim] = runif(sum(tmp_sim))
      dt$HG[tmp_sim] = 3
      # Save Results
      dt$id_sim = id_sim
      dt$iter = iter
      dt$i = i
      print(Sys.time())
      return(dt)
    }#i;sim_forcycle
  out = rbind.data.frame(out, subset(out_iter, !is.na(random)))
  fnsh = Sys.time()
  cat(" [", iter, "] ", fnsh - strt, sep = "")
}#iter
# ====== Stop Parallel Computing ======
stopCluster(cl)
# ====== Distinct Random Numbers ======
length(unique(out$random)) # expectation: 6000
I have been struggling with this for 2 days. I asked this question earlier and got only a general response about random numbers.
Here I would like to ask for a solution (if anybody knows) how to set doRNG package options (or similar package) in a way that all the random numbers are different. Across all the loops.
I have tried tons of doRNG settings and I still can't get it to work. Tried R versions 3.5.3 and 3.6.3 on two different computers.
UPDATE Following discussion with #Limey
The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to split it into manageable parts and, after each iter, send the data to a PostgreSQL database so the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in. I want to simulate the rest.
When setting iterChunk to 1, everything is fine. Increasing iterChunk leads to generation of the same numbers within iter. For example, when I set nsim to 100 and iterChunk to 10 (all matches simulated 100 times, 10 per loop over 10 loops), I expect 600 random numbers (each match independently simulated across all the loops). However, I only get 180, following the logic: 3 cores * 6 matches * 10 iterChunks. Using 2 workers I do get 120 distinct random numbers (2 * 6 * 10).
Furthermore: excluding dt$HG[tmp_sim] = 3, I get all random numbers different with whatever setting.
To understand the problem, I suggest:
1. Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim & iterChunk, things may work as expected.
2. Comment out dt$HG[tmp_sim] = 3.
3. You will get 6000 different random numbers (600 if you changed nsim and iterChunk).
The line in the 2nd step assigns goals scored by the home team. It looks like some kind of bug I can't get over. Even hearing that someone gets the same result and doesn't know why would be helpful - it would lift the weight of my own stupidity off of me.
Thank you, I highly appreciate any effort.
I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: all the loops and parallel processes are working on the same object - the dt data frame. So they're constantly overwriting the changes that each makes, and at the end of the outer loop, you just have multiple copies of the changes made by the last loop to complete. The solution is equally simple: work on a copy of the dt data frame.
To minimise the changes, I renamed dt to baseDT
# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
                    part = rep("dt", 10),
                    HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
                    random = NA)
and then took a copy of it at the top of the foreach loop
out_iter = foreach(i = 1:iterChunk,
                   .combine = rbind, .multicombine = TRUE, .maxcombine = 100000,
                   .inorder = FALSE, .verbose = FALSE,
                   .packages = c("plyr", "dplyr")) %dopar% {
  dt <- baseDT
This gives
> length(unique(out$random)) # expectation: 6000
[1] 6000
as expected.
Modifying the "Hello World" example in the "getting started with doParallel" vignette to generate random numbers, I came up with:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
myFunc <- function(n) {runif(n)}
foreach(i=1:3) %dopar% myFunc(10)
[[1]]
[1] 0.18492375 0.13388278 0.65455450 0.93093066 0.41157625 0.89479764 0.14736529 0.47935995 0.03062963 0.16110714
[[2]]
[1] 0.89245145 0.20980791 0.83828019 0.04411547 0.38184303 0.48110619 0.51509058 0.93732055 0.40159834 0.81414140
[[3]]
[1] 0.74393129 0.66999730 0.44411989 0.85040773 0.80224527 0.72483644 0.64566262 0.22546420 0.14526819 0.05931329
This suggests that getting different random numbers across threads is straightforward. Indeed, the examples on pages 2 and 3 of the doRNG reference manual say the same thing.
In fact, if I understand you correctly, the purpose of doRNG is to do precisely the opposite of what you want: to make random processes reproducible across threads.
Of course, this doesn't guarantee that all numbers are different across all threads. But it makes duplication very unlikely. A guarantee of no duplicates would mean some degree of determinism in the process: a completely random process might produce duplicates by chance.
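For reference, a minimal sketch of what doRNG is actually designed for: it seeds an independent L'Ecuyer-CMRG stream per iteration, so the streams are both reproducible and non-overlapping (see the doRNG vignette for details). The seed value here is arbitrary:
library(doParallel)
library(doRNG)
cl <- makeCluster(2)
registerDoParallel(cl)
registerDoRNG(123)  # each foreach iteration gets its own RNG stream
res <- foreach(i = 1:3) %dopar% runif(10)
stopCluster(cl)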
Update
Following on from our conversation in the comments...
We've established that the problem is in your program logic, not the parallelisation per se. So we need to refocus the question: what are you trying to do? I'm afraid it's not at all clear to me. So that means we need to simplify.
I set nsim to 5 and iterChunk to 1. I get 5 data frames which look like
id part HG random id_sim iter i
1 1 dt 1 NA 1 1 1
2 2 dt 3 NA 1 1 1
3 3 dt 6 NA 1 1 1
4 4 dt 3 0.6919744 1 1 1
5 5 dt 3 0.5413398 1 1 1
6 6 dt 2 NA 1 1 1
7 7 dt 3 0.3983175 1 1 1
8 8 dt 3 0.3342174 1 1 1
9 9 dt 3 0.6126020 1 1 1
10 10 dt 3 0.4185468 1 1 1
In each, the values of id_sim and iter are always the same, and run from 1 in the first data frame to 5 in the fifth. i is 1 for all rows in all data frames. Values in random do appear to be random, and different between data frames. But the NAs are all in the same positions in every data frame: the 1st, 2nd, 3rd and 6th rows. The values of HG are as shown above for all five data frames.
Is that what you would expect? If not, what do you expect? Given we know the problem is not the paraellisation, you need to give us more information.
Update 2
Do you know Arduan? They posted a related question over the weekend...
I'm not going to tell you what's wrong with your code. I'll show you how I would approach your problem. I hope you'll agree it's more readable, if nothing else.
So, we're simulating some football matches. I'll assume it's a league format and use the English Premier League as an example. Start by generating the fixture list for a single season.
library(tidyverse)
teams <- c("Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion",
"Burnley", "Chelsea", "Crystal Palace", "Everton", "Leicester City",
"Liverpool", "Manchester City", "Manchester United", "Newcastle United",
"Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur",
"Watford", "West Ham United", "Wolverhampton Wanderers")
fixtures <- tibble(HomeTeam=teams, AwayTeam=teams) %>%
  complete(HomeTeam, AwayTeam) %>%
  filter(HomeTeam != AwayTeam) # A team can't play itself
fixtures %>% head(5)
# A tibble: 5 x 2
HomeTeam AwayTeam
<chr> <chr>
1 Arsenal Aston Villa
2 Arsenal Bournemouth
3 Arsenal Brighton & Hove Albion
4 Arsenal Burnley
5 Arsenal Chelsea
Suppose we know some results. I'll use yesterday's matches as an illustration.
knownResults <- tribble(~HomeTeam, ~AwayTeam, ~HomeGoals, ~AwayGoals,
                        "Burnley", "Sheffield United", 1, 1,
                        "Newcastle United", "West Ham United", 2, 2,
                        "Liverpool", "Aston Villa", 2, 0,
                        "Southampton", "Manchester City", 1, 0)
resultsSoFar <- fixtures %>%
  left_join(knownResults, by=c("HomeTeam", "AwayTeam"))
resultsSoFar %>% filter(!is.na(HomeGoals))
# A tibble: 4 x 4
HomeTeam AwayTeam HomeGoals AwayGoals
<chr> <chr> <dbl> <dbl>
1 Burnley Sheffield United 1 1
2 Liverpool Aston Villa 2 0
3 Newcastle United West Ham United 2 2
4 Southampton Manchester City 1 0
Now some utility functions. You could certainly combine them, but I think it's clearer to keep them separate so you can see exactly what each one is doing.
First, a function to simulate the results of all matches whose results are unknown. The details of how you simulate the scores are entirely arbitrary. I've assumed that home teams score an average of 1.5 goals a game, away teams score 1.2 goals per game. Later on, I'm going to use this to simulate many seasons in one go, so I'll add a variable (Iteration) to index the season.
simulateResults <- function(i=NA, data) {
  n <- nrow(data)
  data %>%
    add_column(Iteration=i, .before=1) %>%
    mutate(
      # Give the home team a slight advantage
      HomeGoals=ifelse(is.na(HomeGoals), floor(rexp(n, rate=1/1.5)), HomeGoals),
      AwayGoals=ifelse(is.na(AwayGoals), floor(rexp(n, rate=1/1.2)), AwayGoals)
    )
}
Use it, and check that we haven't overwritten known results:
simulateResults(1, resultsSoFar) %>% filter(HomeTeam=="Burnley", AwayTeam=="Sheffield United")
# A tibble: 1 x 5
Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Burnley Sheffield United 1 1
I'm going to parallelise the overall simulation, so now let's have a function to simulate a chunk of simulations. Again, create an index column to identify the chunk.
simulateChunk <- function(chunkID=NA, n) {
  bind_rows(lapply(1:n, simulateResults, data=resultsSoFar)) %>%
    add_column(Chunk=chunkID, .before=1)
}
simulateChunk(chunkID=1, n=3)
# A tibble: 1,140 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 0 0
3 1 1 Arsenal Brighton & Hove Albion 2 0
4 1 1 Arsenal Burnley 2 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 2 3
8 1 1 Arsenal Leicester City 2 0
9 1 1 Arsenal Liverpool 0 1
10 1 1 Arsenal Manchester City 4 0
OK. Now I'm ready to do the main simulation work. I'll run 10 chunks of 100 simulations each, to give 1000 simulated seasons in total, the same as you had.
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
chunkSize <- 100
nChunks <- 10
startedAt <- Sys.time()
x <- bind_rows(foreach(i=1:nChunks, .packages=c("tidyverse")) %dopar% simulateChunk(i, n=chunkSize))
finishedAt <- Sys.time()
print(finishedAt - startedAt)
Time difference of 6.772928 secs
stopCluster(cl)
> x
# A tibble: 380,000 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<int> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 3 1
3 1 1 Arsenal Brighton & Hove Albion 0 1
4 1 1 Arsenal Burnley 3 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 1 2
8 1 1 Arsenal Leicester City 0 0
9 1 1 Arsenal Liverpool 0 0
10 1 1 Arsenal Manchester City 0 0
Let's check I've got sensible results. As a basic check, I'll look at the results of Arsenal vs Aston Villa:
x %>%
  filter(HomeTeam == "Arsenal", AwayTeam=="Aston Villa") %>%
  group_by(HomeGoals, AwayGoals) %>%
  summarise(N=n(), .groups="drop") %>%
  pivot_wider(
    values_from="N", names_prefix="AwayGoals",
    names_sep="", names_from=AwayGoals
  )
# A tibble: 8 x 10
HomeGoals AwayGoals0 AwayGoals1 AwayGoals2 AwayGoals3 AwayGoals4 AwayGoals5 AwayGoals6 AwayGoals8 AwayGoals7
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 299 129 57 19 12 7 NA NA NA
2 1 135 63 25 6 4 4 1 2 NA
3 2 75 21 12 9 4 1 NA NA 1
4 3 30 13 10 1 NA NA NA NA NA
5 4 21 7 1 1 NA NA NA NA NA
6 5 11 2 1 NA 2 NA NA NA NA
7 6 4 2 2 NA NA NA NA NA NA
8 7 4 1 1 NA NA NA NA NA NA
That looks reasonable. Now confirm that the matches with known results don't vary. For example:
x %>%
  filter(HomeTeam == "Liverpool", AwayTeam=="Aston Villa") %>%
  group_by(HomeGoals, AwayGoals) %>%
  summarise(N=n(), .groups="drop") %>%
  pivot_wider(values_from="N", names_prefix="AwayGoals", names_sep="", names_from=AwayGoals)
HomeGoals AwayGoals0
<dbl> <int>
1 2 1000
All good.
So, that's 23 statements to generate the fixtures, take account of known results, simulate the remainder of the matches, and do some basic sanity checking. I could easily get that down to under 20 statements if I had to. That's about a third less than you were using just to try to simulate the unknown results. [The actual simulation takes fewer than 10 statements.] I think my approach is easier to understand: by using tidy verbs, the code is almost self-documenting.

Joining tables and applying functions to columns with the same name in R and tidyverse

I am looking to join tables with customer id (easy enough) but then I want to multiply the columns to get updated values.
Customer_Week_1 <- data.frame(First_name = c("John", "Mary", "David", "Paul"),
                              Last_name = c("Jackson", "Smith", "Williams", "Zimmerman"),
                              Factor_1 = c(2, 5, 8, 9),
                              Factor_2 = c(.5, .5, .75, .75),
                              Factor_3 = c(0, 1, 2, 3))
Customer_Week_2 <- data.frame(First_name = c("John", "Mary", "David", "Paul"),
                              Last_name = c("Jackson", "Smith", "Williams", "Zimmerman"),
                              Factor_1 = c(3, 7, 1, 7),
                              Factor_2 = c(.51, .65, .72, .4),
                              Factor_3 = c(1, 2, 3, 4))
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name"))
The expected results can be computed directly with
Customer_week3_expected<-Customer_Week_1[,3:5]*Customer_Week_2[,3:5]
And I know I can just manually type out every column. But I have dozens of columns and need to make this code as easy to follow as possible.
I also know that I can just bind that result to the name columns:
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name")) %>%
  select(1:2)
But that does not look like best practice to me, and I would rather this be done with a join somehow, to ensure everything lines up when I am iterating over the customers (tables).
Assuming I understand the output you're trying to get, I can think of two methods. If you know that the names are in the first two columns and are the same in both data frames (this might not be the case in real life), you can use the same multiplication operation you tried above, bound to the first two columns of either of the data frames.
cbind(Customer_Week_1[1:2], Customer_Week_1[-1:-2] * Customer_Week_2[-1:-2])
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> 1 John Jackson 6 0.255 0
#> 2 Mary Smith 35 0.325 2
#> 3 David Williams 8 0.540 6
#> 4 Paul Zimmerman 63 0.300 12
Or you can be more verbose but maybe more flexible, and reshape to a long data frame, then do a grouped operation to summarize products for each person and factor. Starting from the join you have above:
library(dplyr)
library(tidyr)
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name"))
Make long-shaped data, separate the Factor_1.x into Factor_1 and x, and make products as your summary calculation.
products <- Customer_week3 %>%
  gather(key = factor, value = value, -First_name, -Last_name) %>%
  separate(factor, into = c("factor", "week"), sep = "\\.") %>%
  group_by(First_name, Last_name, factor) %>%
  summarise(value = prod(value))
head(products)
#> # A tibble: 6 x 4
#> # Groups: First_name, Last_name [2]
#> First_name Last_name factor value
#> <fct> <fct> <chr> <dbl>
#> 1 David Williams Factor_1 8
#> 2 David Williams Factor_2 0.54
#> 3 David Williams Factor_3 6
#> 4 John Jackson Factor_1 6
#> 5 John Jackson Factor_2 0.255
#> 6 John Jackson Factor_3 0
If you need to get back to a wide format, spread back.
products %>%
  spread(key = factor, value = value)
#> # A tibble: 4 x 5
#> # Groups: First_name, Last_name [16]
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 David Williams 8 0.54 6
#> 2 John Jackson 6 0.255 0
#> 3 Mary Smith 35 0.325 2
#> 4 Paul Zimmerman 63 0.3 12
Similar to @camille's reshaping, but in data.table (and disregarding Customer_week3):
library(data.table)
# long format
long = rbindlist(list(Customer_Week_1, Customer_Week_2), id=TRUE)
# aggregate
long[, lapply(.SD, prod), by=.(First_name, Last_name), .SDcols=patterns("^Factor")]
First_name Last_name Factor_1 Factor_2 Factor_3
1: John Jackson 6 0.255 0
2: Mary Smith 35 0.325 2
3: David Williams 8 0.540 6
4: Paul Zimmerman 63 0.300 12
Going longer (again as seen in @camille's answer) might also make sense, so as to avoid repeatedly fiddling with the names of the Factor_* columns:
longer = melt(long, meas=patterns("^Factor")) # analogous to gather
longer[, .(value = prod(value)), by=.(First_name, Last_name, variable)]
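If you'd rather stay wide and still avoid typing every column, here's a hedged sketch of one more alternative, assuming dplyr >= 1.0 (where across() and cur_column() are available) and relying on the default ".x"/".y" suffixes that left_join() adds to the clashing columns:
library(dplyr)
library(stringr)
Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name")) %>%
  # multiply each week-1 column (".x") by its week-2 partner (".y"),
  # writing the product back under the unsuffixed name
  mutate(across(ends_with(".x"),
                ~ .x * get(str_replace(cur_column(), fixed(".x"), ".y")),
                .names = "{str_remove(.col, fixed('.x'))}")) %>%
  select(First_name, Last_name, Factor_1, Factor_2, Factor_3)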

Download list of files and apply unique cleaning function to each then bind to single dataframe

I am extracting information from more than 30 pdfs that describe expenditures for the Army, Navy, Marines, and Air Force.
Each service formats its pdf differently, so I have written four separate cleaning functions that extract the data I need. (However, the pdfs sometime vary across years. So I may someday need to write specific cleaning functions for specific years.)
What technique should I use to download, apply the associated cleaning function, and rbind the many files?
Conceptually, my idea is to store the relevant function in each row, and somehow use purrr to download each file, apply the associated function, then bind_rows the results.
I have not seen this done before, but believe it must be a common practice. Examples/references/tutorials very welcome indeed!
#### Data (Example)#####
df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = c(2010:2019)
) %>%
  tbl_df() %>%
  mutate(my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
         my.cleaning.function = str_c(service, "cleaner", sep = "_"))
# A tibble: 40 x 4
service year my.hyperlink my.cleaning.function
<fct> <int> <chr> <chr>
1 Army 2010 http://_Army_2010.html Army_cleaner
2 Navy 2010 http://_Navy_2010.html Navy_cleaner
3 Marines 2010 http://_Marines_2010.html Marines_cleaner
4 Air.Force 2010 http://_Air.Force_2010.html Air.Force_cleaner
5 Army 2011 http://_Army_2011.html Army_cleaner
6 Navy 2011 http://_Navy_2011.html Navy_cleaner
7 Marines 2011 http://_Marines_2011.html Marines_cleaner
8 Air.Force 2011 http://_Air.Force_2011.html Air.Force_cleaner
9 Army 2012 http://_Army_2012.html Army_cleaner
10 Navy 2012 http://_Navy_2012.html Navy_cleaner
# ... with 30 more rows
Here is a quick example of how you can possibly do this. Please let me know if the example is not clear enough.
library(tidyverse, quietly = TRUE)
df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = c(2010:2019)
) %>%
  tbl_df() %>%
  mutate(my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
         my.cleaning.function = str_c(service, "cleaner", sep = "_"))
# define two example functions
Army_cleaner <- function(txt) {
  tibble(
    my_text = str_to_lower(txt),
    my_num = runif(4)
  )
}
Navy_cleaner <- function(txt) {
  tibble(
    my_text = str_to_upper(txt),
    my_num = runif(4)
  )
}
# filter the data.frame only for the functions that we have defined,
# then run the example
df %>%
  filter(my.cleaning.function %in% c("Army_cleaner", "Navy_cleaner")) %>%
  mutate(my_data = map2(my.hyperlink, my.cleaning.function, ~ {
    FUN <- get(.y)
    FUN(.x)
  })) %>%
  unnest()
#> # A tibble: 80 x 6
#> service year my.hyperlink my.cleaning.funct… my_text my_num
#> <fct> <int> <chr> <chr> <chr> <dbl>
#> 1 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.478
#> 2 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.386
#> 3 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.225
#> 4 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.421
#> 5 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.450
#> 6 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.515
#> 7 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.429
#> 8 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.0371
#> 9 Army 2011 http://_Army_201… Army_cleaner http://_army… 0.433
#> 10 Army 2011 http://_Army_201… Army_cleaner http://_army… 0.354
#> # ... with 70 more rows
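A hedged variation on the same idea: instead of looking the functions up by name with get(), you can keep the cleaners in a named list and index into it, which avoids depending on the functions being visible by name in the calling environment:
# a named list mapping the my.cleaning.function strings to functions
cleaners <- list(Army_cleaner = Army_cleaner,
                 Navy_cleaner = Navy_cleaner)
df %>%
  filter(my.cleaning.function %in% names(cleaners)) %>%
  mutate(my_data = map2(my.hyperlink, my.cleaning.function,
                        ~ cleaners[[.y]](.x))) %>%
  unnest(my_data)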

Is it possible to sort Data Frame by a value in the row before?

I want to sort a data frame by the value in the previous row. I have two columns, "startpoint" and "endpoint".
I want to sort the rows so that each row with a certain "endpoint" is followed by the row with the same value in "startpoint".
for example:
+------------+-----------+
| Startpoint | Endpoint |
+------------+-----------+
| Berlin | Munich |
| Munich | Paris |
| Paris | Barcelona |
| Barcelona | Rom |
+------------+-----------+
I'm a bit conflicted about giving an answer to OPs that don't follow posting guidelines, but I found this to be an interesting puzzle to solve.
Provided there are no conflicts, such as multiple flights (I presume they are flights) with the same place of departure (or of destination), or a series that goes "round trip" (in which cases you'll run into infinite looping), here's a (somewhat surprisingly simple) solution.
flights <- read.table(sep = ",", stringsAsFactors = FALSE, header = TRUE, text = "
dep,arr
Montreal,Washington
Berlin,Munich
Miami,Paris
Munich,New York
Barcelona,Rome
New York,Montreal
Washington,Miami
Paris,Barcelona
")
cont <- TRUE
while (cont) {
  # move down a flight for which the dep is an arr further down
  for (i in 1:(nrow(flights)-1)) {
    ind <- which(flights$arr == flights$dep[i])
    if (length(ind) == 0) next
    if (ind > i) {
      flights <- flights[c(seq_len(i-1),
                           (i+1):ind,
                           i,
                           seq(from = ind + 1,
                               length.out = nrow(flights) - ind)),]
      break
    }
    cont <- FALSE
  }
}
Results
> flights
dep arr
2 Berlin Munich
4 Munich New York
6 New York Montreal
1 Montreal Washington
7 Washington Miami
3 Miami Paris
8 Paris Barcelona
5 Barcelona Rome
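As an aside, here's a hedged sketch of a more direct approach on the same flights data, under the same assumptions (a single unbroken chain: no repeated cities, no round trip). Find the departure that never appears as an arrival, then follow the chain with match():
start <- setdiff(flights$dep, flights$arr)  # the only dep that is never an arr
ord <- integer(nrow(flights))
ord[1] <- match(start, flights$dep)
for (k in 2:nrow(flights)) {
  # the next leg departs from where the previous one arrived
  ord[k] <- match(flights$arr[ord[k - 1]], flights$dep)
}
flights[ord, ]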
Here is a very convoluted first solution that I welcome attempts to improve, since I don't know much about sorting algorithms. I think this also won't work for any large table, because the number of permutations increases too rapidly (it's the factorial, after all).
I start by making a reordered version with some value column, since presumably that's why we have to rearrange the rows. First get the list of all the potential row orders. Then do a check on the rows to see if the next start is equal to the current end, and if so keep those rows together. Filter out the row orders that don't have those rows together, and then randomly shuffle the table to one of the remaining row orders. Stop when there's only one option left.
This clearly assumes that there is a unique solution (really there are at least two since you could just reverse this order and satisfy the condition but early on one will randomly be chosen).
library(tidyverse)
library(arrangements)
set.seed(100)
tbl <- tibble(
  start = c("Berlin", "Munich", "Paris", "Barcelona"),
  end = c("Munich", "Paris", "Barcelona", "Rome"),
  val = rnorm(4)
) %>%
  slice(sample(1:nrow(.), nrow(.))) %>%
  rowid_to_column()
tbl
#> # A tibble: 4 x 4
#> rowid start end val
#> <int> <chr> <chr> <dbl>
#> 1 1 Paris Barcelona -0.0789
#> 2 2 Berlin Munich -0.502
#> 3 3 Munich Paris 0.132
#> 4 4 Barcelona Rome 0.887
row_orders <- permutations(nrow(tbl)) %>%
  as_tibble() %>%
  unite(order, remove = FALSE) %>%
  nest(-order) %>%
  mutate(data = map(data, as.integer))
sample_orders <- row_orders
sample_tbl <- tbl
while (nrow(sample_orders) > 1) {
  keep_together <- sample_tbl %>%
    mutate(
      nrc = lead(start) == end,
      nrc = replace_na(nrc, FALSE),
      cumsum = cumsum(lag(nrc, default = FALSE) == FALSE)
    ) %>%
    group_by(cumsum) %>%
    summarise(row_groups = str_c(rowid, collapse = "_")) %>%
    filter(str_length(row_groups) > 1) %>%
    `[[`("row_groups")
  sample_orders <- sample_orders %>%
    filter(str_detect(order, keep_together))
  sample_tbl <- tbl %>%
    slice(sample_orders$data[[sample(1:nrow(sample_tbl), 1)]])
}
#> Error in slice_impl(.data, dots): Evaluation error: subscript out of bounds.
print(sample_tbl)
#> # A tibble: 4 x 4
#> rowid start end val
#> <int> <chr> <chr> <dbl>
#> 1 2 Berlin Munich -0.502
#> 2 3 Munich Paris 0.132
#> 3 1 Paris Barcelona -0.0789
#> 4 4 Barcelona Rome 0.887
Created on 2018-04-19 by the reprex package (v0.2.0).
