functional programming problems -- map_df & regex - r

I am trying to combine multiple spreadsheets (about 20) using a functional programming approach. Each spreadsheet contains an individual year of data. They are messy: some columns are unnamed, and the name of the same column changes across the spreadsheets.
I originally did all the cleaning up individually for each spreadsheet, but I want to learn how to do it with a functional programming approach to make it more reproducible.
My approach was to build a regex to match all the different names of the specified column, then rename the column using a custom function/regex. I thought I could then use map_dfr to apply this function to all the different spreadsheets to produce a final dataframe to work with.
However, I have encountered two problems:
1. The regex engine in R seems to have the global flag switched on, with no way to switch it off. I want to try the different alternatives in the regex in sequence and stop at the first match, rather than return all matches. For example, after I import the spreadsheets there are sometimes multiple unnamed columns, which are given names such as ...1. I only want to match the first instance. I cannot work out whether it is possible to disable the global flag, or whether there is a cleverer way of writing the regex so that it stops after the first match. Is there another, perhaps better, way of approaching this?
2. When I pass my custom function, which seems to work well enough on individual dataframes, to map_dfr, I get an error and I am not sure why.
I have produced a minimal reprex below, which I think highlights the issues.
All thoughts greatly received, including alternative approaches to this, as this must be a very common problem people come across. Thanks.
library(tidyverse)
year_1 <- tribble(
~`...1`, ~admissions,
"Hospital 1", 10,
"Hospital 2", 100,
"hospital 3", 200
)
year_2 <- tribble(
~provider_code, ~`...2`, ~admissions,
"H1", "Hospital 1", 20,
"H2", "Hospital 2", 400,
"H3", "hospital 3", 500
)
year_3 <- tribble(
~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
"H1", "Hospital 1", 350,
"H2", "Hospital 2", 350,
"H3", "hospital 3", 550
)
clean_up_area_column_name <- function(x){
rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
}
clean_up_area_column_name(year_1)
#> # A tibble: 3 × 2
#> area admissions
#> <chr> <dbl>
#> 1 Hospital 1 10
#> 2 Hospital 2 100
#> 3 hospital 3 200
clean_up_area_column_name(year_2)
#> # A tibble: 3 × 3
#> provider_code area admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 20
#> 2 H2 Hospital 2 400
#> 3 H3 hospital 3 500
clean_up_area_column_name(year_3)
#> # A tibble: 3 × 3
#> area1 area2 admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 350
#> 2 H2 Hospital 2 350
#> 3 H3 hospital 3 550
test_df <- map_dfr(c(year_1, year_2, year_3), clean_up_area_column_name)
#> Error in UseMethod("rename"): no applicable method for 'rename' applied to an object of class "character"
Created on 2022-08-08 by the reprex package (v2.0.1)

Passing multiple data.frames to map requires a list
test_df <- map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 x 5
area admissions provider_code area1 area2
<chr> <dbl> <chr> <chr> <chr>
1 Hospital 1 10 NA NA NA
2 Hospital 2 100 NA NA NA
3 hospital 3 200 NA NA NA
4 Hospital 1 20 H1 NA NA
5 Hospital 2 400 H2 NA NA
6 hospital 3 500 H3 NA NA
7 NA 350 NA H1 Hospital 1
8 NA 350 NA H2 Hospital 2
9 NA 550 NA H3 hospital 3

If you only want to grab the first instance, as you say, then the following tweak to your function should work. Rename any "area1" column to "area", then de-select the remaining "area" columns whose names have trailing digits (area2, area3, etc.).
clean_up_area_column_name <- function(x){
rename({{x}},
area = matches("\\.{3}[0-9]|commissioning region|hospital provider")) %>%
rename(., area = matches("area1")) %>%
select(-any_of(matches("area\\d")))
}
I'm not sure what you expect year_3 to return as it seems your regex is matching the provider_code as well as area:
map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 × 3
area admissions provider_code
<chr> <dbl> <chr>
1 Hospital 1 10 NA
2 Hospital 2 100 NA
3 hospital 3 200 NA
4 Hospital 1 20 H1
5 Hospital 2 400 H2
6 hospital 3 500 H3
7 H1 350 NA
8 H2 350 NA
9 H3 550 NA
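If the aim is to try the candidate names in priority order and stop at the first one that hits, one option is to drop the single alternation regex and loop over separate patterns instead. Below is a minimal sketch in base R, assuming the area column is whichever column matches the earliest pattern in the list. `rename_first_match()` and the `patterns` vector are hypothetical names introduced for illustration, not part of dplyr.

```r
# Hypothetical helper: try each pattern in order and rename only the
# first column that matches the first pattern with any hit.
rename_first_match <- function(df, patterns, new_name = "area") {
  for (p in patterns) {
    hits <- grep(p, names(df), ignore.case = TRUE)
    if (length(hits) > 0) {
      names(df)[hits[1]] <- new_name  # keep only the first hit
      return(df)                      # stop: later patterns are not tried
    }
  }
  df  # no pattern matched, return unchanged
}

# Patterns in priority order (an assumption about which name should win)
patterns <- c("commissioning region", "^\\.{3}[0-9]+$")

year_1 <- data.frame(`...1` = c("Hospital 1", "Hospital 2"),
                     admissions = c(10, 100), check.names = FALSE)
year_2 <- data.frame(provider_code = c("H1", "H2"),
                     `...2` = c("Hospital 1", "Hospital 2"),
                     admissions = c(20, 400), check.names = FALSE)
year_3 <- data.frame(`Hospital provider code` = c("H1", "H2"),
                     `Commissioning region/Provider` = c("Hospital 1", "Hospital 2"),
                     admissions = c(350, 350), check.names = FALSE)

cleaned <- lapply(list(year_1, year_2, year_3),
                  rename_first_match, patterns = patterns)
lapply(cleaned, names)
```

The resulting list can then be combined with dplyr::bind_rows(), or the helper can be passed to map_dfr() in place of the original function.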

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load library
library(pdftools)
library(stringr) # provides str_split_fixed()
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftool
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
table_number,
number_columns,
column_names,
first_row,
last_row) {
data <- list_of_tables[table_number]
data <- trimws(data)
data <- strsplit(data, "\n")
data <- data[[1]]
data <- data[min(grep(first_row, data)):
max(grep(last_row, data))]
data <- str_split_fixed(data, " {2,}", number_columns)
data <- data.frame(data)
names(data) <- column_names
return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data, 3, 5, names, "Care fees", "Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
Although readr::read_fwf() is meant for fixed-width files, it also performs pretty well on text from pdftools once the header and footer rows are removed. It works even when it has to guess the column widths, though those can be specified too.
library(pdftools)
library(dplyr, warn.conflicts = F)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
# re-connect strings in first column if values were split between rows
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
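For the case where the guessed widths are not good enough, the column positions can be passed explicitly. The following is only a sketch on a two-line toy string built with sprintf(), so that the widths are known exactly. The widths (20, 10, 10) and the column names are assumptions made for illustration and would need to be adapted to the real PDF text.

```r
library(readr)

# Build two fixed-width lines: a 20-character label and two 10-character fields
lines <- c(
  sprintf("%-20s%10s%10s", "Care fees", "210,187", "194,603"),
  sprintf("%-20s%10s%10s", "Management fees", "59,746", "50,959")
)
txt <- paste(lines, collapse = "\n")

out <- read_fwf(
  I(txt),  # I() marks the string as literal data rather than a file path
  col_positions = fwf_widths(c(20, 10, 10),
                             c("item", "h1_2022", "h1_2021")),
  col_types = "ccc"  # keep everything as character; convert afterwards
)
out
```

Keeping the values as character avoids surprises with thousands separators and bracketed negatives, which can be converted in a separate step.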

Generate codes based on Nominal Variables present in a dataframe

I have a data frame that has 1000 observations and it has this structure below.
Town <- c("TownA", "TownB", "TownC","TownD","Town A", "Town Z")
Ward <- c("Ward B","Ward Z","Ward A","Ward W","Ward X", "Ward ")
DF <- data.frame(Town, Ward)
I have another dataset that contains codes that represent the nominal observations of Town and Ward. The codes are the ones to be used for analysis. For example, Town A has the code 23, Town B has the code 15, Town Z has the code 7. Instead of manually creating a new column and populating the codes based on towns, is there a simpler way to do this in R?
My goal is to mutate a new column that will match the codes with the towns. The dataset has around 200 Towns.
You can create a new code table and then do joining:
library(tidyverse)
Town <- c("TownA", "TownB", "TownC","TownD","Town A", "Town Z")
Ward <- c("Ward B","Ward Z","Ward A","Ward W","Ward X", "Ward ")
DF <- data.frame(Town, Ward)
codes <- tribble(
~Town, ~Code,
"TownA", 23,
"TownB", 15,
"Town Z", 7
)
codes
#> # A tibble: 3 × 2
#> Town Code
#> <chr> <dbl>
#> 1 TownA 23
#> 2 TownB 15
#> 3 Town Z 7
DF %>%
left_join(codes)
#> Joining, by = "Town"
#> Town Ward Code
#> 1 TownA Ward B 23
#> 2 TownB Ward Z 15
#> 3 TownC Ward A NA
#> 4 TownD Ward W NA
#> 5 Town A Ward X NA
#> 6 Town Z Ward 7
Created on 2021-09-20 by the reprex package (v2.0.1)
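Note that "Town A" gets NA above because it does not match "TownA" exactly. If such spellings are meant to refer to the same town (an assumption about the data), one possible approach is to join on a normalised key instead. In this sketch, `normalise()` and the `key` column are hypothetical names introduced for illustration.

```r
library(dplyr)

DF <- data.frame(
  Town = c("TownA", "TownB", "TownC", "TownD", "Town A", "Town Z"),
  Ward = c("Ward B", "Ward Z", "Ward A", "Ward W", "Ward X", "Ward ")
)
codes <- data.frame(Town = c("TownA", "TownB", "Town Z"),
                    Code = c(23, 15, 7))

# Hypothetical helper: drop all whitespace and lower-case the name
normalise <- function(x) tolower(gsub("\\s+", "", x))

res <- DF %>%
  mutate(key = normalise(Town)) %>%
  left_join(transmute(codes, key = normalise(Town), Code), by = "key") %>%
  select(-key)
res
```

With this key both "TownA" and "Town A" pick up code 23, while towns missing from the code table still get NA.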

Multithread computation with R: how to get all different random numbers?

Anyone knows how to get all the random numbers different in the following code? E.g. with doRNG package? I don't care about reproducibility.
Edit: Duplicates by pure chance are accepted.
rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)
# ====== Data Preparation ======
dt = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)
cl = makeCluster(3, outfile = "")
registerDoParallel(cl)
# ====== SIMULATION ======
nsim = 1000 # number of simulations
iterChunk = 100 # split nsim into this many chunks
out = data.frame() # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
strt = Sys.time()
out_iter =
foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
# simulation number
id_sim = iterChunk * (iter - 1) + i
## Generate random numbers
tmp_sim = is.na(dt$HG) # no results yet
dt$random[tmp_sim] = runif(sum(tmp_sim))
dt$HG[tmp_sim] = 3
# Save Results
dt$id_sim = id_sim
dt$iter = iter
dt$i = i
print(Sys.time())
return(dt)
}#i;sim_forcycle
out = rbind.data.frame(out,subset(out_iter, !is.na(random)))
fnsh = Sys.time()
cat(" [",iter,"] ",fnsh - strt, sep = "")
}#iter
# ====== Stop Parallel Computing ======
stopCluster(cl)
# ====== Distinct Random Numbers ======
length(unique(out$random)) # expectation: 6000
I have been struggling with this for 2 days. I asked this question earlier and received only a general response about random numbers.
Here I would like to ask for a solution (if anybody knows) how to set doRNG package options (or similar package) in a way that all the random numbers are different. Across all the loops.
I have tried tons of doRNG settings and I still can't get it to work. Tried R versions 3.5.3 and 3.6.3 on two different computers.
UPDATE: following discussion with @Limey
The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to "split" the simulation into manageable parts, and after each iter I send the data to a PostgreSQL database so that the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in. I want to simulate the rest.
When iterChunk is set to 1 everything is fine. Increasing iterChunk leads to the same numbers being generated within an iter. For example, suppose I set nsim to 100 and iterChunk to 10 (all matches simulated 100 times, 10 times in each of 10 loops). I expect 600 random numbers (each match independently simulated across all the loops). However, I only get 180, following the logic 3 cores * 6 matches * 10 iterChunks. Using 2 workers I get 120 distinct random numbers (2 * 6 * 10).
Furthermore, if I exclude dt$HG[tmp_sim] = 3, I get all random numbers different with whatever setting.
To understand the problem, I suggest:
1. Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim and iterChunk things may work as expected.
2. Comment out dt$HG[tmp_sim] = 3. You will get 6000 different random numbers (600 if you change nsim and iterChunk).
The code in 2nd step assigns goals scored by home team. It looks like some kind of bug I can't get over. Even information that someone gets the same result and doesn't know why will be helpful - it will lift the weight of my own stupidity out of me.
Thank you, I highly appreciate any effort.
I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: all the loops and parallel processes are working on the same object - the dt data frame. So they're constantly overwriting the changes that each makes, and at the end of the outer loop, you just have multiple copies of the changes made by the last loop to complete. The solution is equally simple: work on a copy of the dt data frame.
To minimise the changes, I renamed dt to baseDT
# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
and then took a copy of it at the top of the foreach loop
out_iter = foreach(i = 1:iterChunk,
.combine = rbind, .multicombine = TRUE, .maxcombine = 100000,
.inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
dt <- baseDT
This gives
> length(unique(out$random)) # expectation: 6000
[1] 6000
as expected.
Modifying the "Hello World" example in the "getting started with doParallel" vignette to generate random numbers, I came up with:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
myFunc <- function(n) {runif(n)}
foreach(i=1:3) %dopar% myFunc(10)
[[1]]
[1] 0.18492375 0.13388278 0.65455450 0.93093066 0.41157625 0.89479764 0.14736529 0.47935995 0.03062963 0.16110714
[[2]]
[1] 0.89245145 0.20980791 0.83828019 0.04411547 0.38184303 0.48110619 0.51509058 0.93732055 0.40159834 0.81414140
[[3]]
[1] 0.74393129 0.66999730 0.44411989 0.85040773 0.80224527 0.72483644 0.64566262 0.22546420 0.14526819 0.05931329
This suggests that getting different random numbers across threads is straightforward. Indeed, the examples on pages 2 and 3 of the doRNG reference manual say the same thing.
In fact, if I understand you correctly, the purpose of doRNG is to do precisely the opposite of what you want: to make random processes reproducible across threads.
Of course, this doesn't guarantee that all numbers are different across all threads. But it makes duplication very unlikely. A guarantee of no duplicates would mean some degree of determinism in the process: a completely random process might produce duplicates by chance.
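To put a rough number on "very unlikely", the chance of at least one accidental duplicate can be estimated with the usual birthday-problem approximation. This is a back-of-the-envelope sketch, assuming runif() can return about 2^32 distinct values (the resolution of R's default generator) and that the draws are independent.

```r
n <- 6000   # number of random draws expected in the question
N <- 2^32   # assumed number of distinct values runif() can return

# Birthday-problem approximation: P(at least one duplicate among n draws)
p_dup <- 1 - exp(-n * (n - 1) / (2 * N))
p_dup
```

Under these assumptions the probability comes out at roughly 0.4%, so an occasional chance duplicate is possible but thousands of repeats are not.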
Update
Following on from our conversation in the comments...
We've established that the problem is in your program logic, not the parallelisation per se. So we need to refocus the question: what are you trying to do? I'm afraid it's not at all clear to me, so we need to simplify.
I set nsim to 5 and iterChunk to 1. I get 5 data frames which look like
id part HG random id_sim iter i
1 1 dt 1 NA 1 1 1
2 2 dt 3 NA 1 1 1
3 3 dt 6 NA 1 1 1
4 4 dt 3 0.6919744 1 1 1
5 5 dt 3 0.5413398 1 1 1
6 6 dt 2 NA 1 1 1
7 7 dt 3 0.3983175 1 1 1
8 8 dt 3 0.3342174 1 1 1
9 9 dt 3 0.6126020 1 1 1
10 10 dt 3 0.4185468 1 1 1
In each, the values of id_sim and iter are always the same, and run from 1 in the first data frame to 5 in the fifth. i is 1 for all rows in all data frames. Values in random do appear to be random, and different between data frames. But the NAs are all in the same positions in every data frame: the 1st, 2nd, 3rd and 6th rows. The values of HG are as shown above for all five data frames.
Is that what you would expect? If not, what do you expect? Given that we know the problem is not the parallelisation, you need to give us more information.
Update 2
Do you know Arduan? They posted a related question over the weekend...
I'm not going to tell you what's wrong with your code. I'll show you how I would approach your problem. I hope you'll agree it's more readable, if nothing else.
So, we're simulating some football matches. I'll assume it's a league format and use the English Premier League as an example. Start by generating the fixture list for a single season.
library(tidyverse)
teams <- c("Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion",
"Burnley", "Chelsea", "Crystal Palace", "Everton", "Leicester City",
"Liverpool", "Manchester City", "Manchester United", "Newcastle United",
"Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur",
"Watford", "West Ham United", "Wolverhampton Wanderers")
fixtures <- tibble(HomeTeam=teams, AwayTeam=teams) %>%
complete(HomeTeam, AwayTeam) %>%
filter(HomeTeam != AwayTeam) # A team can't play itself
fixtures %>% head(5)
# A tibble: 5 x 2
HomeTeam AwayTeam
<chr> <chr>
1 Arsenal Aston Villa
2 Arsenal Bournemouth
3 Arsenal Brighton & Hove Albion
4 Arsenal Burnley
5 Arsenal Chelsea
Suppose we know some results. I'll use yesterday's matches as an illustration.
knownResults <- tribble(~HomeTeam, ~AwayTeam, ~HomeGoals, ~AwayGoals,
"Burnley", "Sheffield United", 1, 1,
"Newcastle United", "West Ham United", 2, 2,
"Liverpool", "Aston Villa", 2, 0,
"Southampton", "Manchester City", 1, 0)
resultsSoFar <- fixtures %>%
left_join(knownResults, by=c("HomeTeam", "AwayTeam"))
resultsSoFar %>% filter(!is.na(HomeGoals))
# A tibble: 4 x 4
HomeTeam AwayTeam HomeGoals AwayGoals
<chr> <chr> <dbl> <dbl>
1 Burnley Sheffield United 1 1
2 Liverpool Aston Villa 2 0
3 Newcastle United West Ham United 2 2
4 Southampton Manchester City 1 0
Now some utility functions. You could certainly combine them, but I think it's clearer to keep them separate so you can see exactly what each one is doing.
First, a function to simulate the results of all matches whose results are unknown. The details of how you simulate the scores are entirely arbitrary. I've assumed that home teams score an average of 1.5 goals a game, away teams score 1.2 goals per game. Later on, I'm going to use this to simulate many seasons in one go, so I'll add a variable (Iteration) to index the season.
simulateResults <- function(i=NA, data) {
n <- nrow(data)
data %>%
add_column(Iteration=i, .before=1) %>%
mutate(
# Give the home team a slight advantage
HomeGoals=ifelse(is.na(HomeGoals), floor(rexp(n, rate=1/1.5)), HomeGoals),
AwayGoals=ifelse(is.na(AwayGoals), floor(rexp(n, rate=1/1.2)), AwayGoals)
)
}
Use it, and check that we haven't overwritten known results:
simulateResults(1, resultsSoFar) %>% filter(HomeTeam=="Burnley", AwayTeam=="Sheffield United")
# A tibble: 1 x 5
Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Burnley Sheffield United 1 1
I'm going to parallelise the overall simulation, so now let's have a function to simulate a chunk of simulations. Again, create an index column to identify the chunk.
simulateChunk <- function(chunkID=NA, n) {
bind_rows(lapply(1:n, simulateResults, data=resultsSoFar)) %>%
add_column(Chunk=chunkID, .before=1)
}
simulateChunk(chunkID=1, n=3)
# A tibble: 1,140 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 0 0
3 1 1 Arsenal Brighton & Hove Albion 2 0
4 1 1 Arsenal Burnley 2 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 2 3
8 1 1 Arsenal Leicester City 2 0
9 1 1 Arsenal Liverpool 0 1
10 1 1 Arsenal Manchester City 4 0
OK. Now I'm ready to do the main simulation work. I'll run 10 chunks of 100 simulations each, to give 1000 simulated seasons in total, the same as you had.
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
chunkSize <- 100
nChunks <- 10
startedAt <- Sys.time()
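x <- bind_rows(foreach(i=1:nChunks, .packages=c("tidyverse"),
# explicitly export the objects that simulateChunk() uses internally
.export=c("simulateResults", "resultsSoFar")) %dopar% simulateChunk(i, n=chunkSize))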
finishedAt <- Sys.time()
print(finishedAt - startedAt)
Time difference of 6.772928 secs
stopCluster(cl)
> x
# A tibble: 380,000 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<int> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 3 1
3 1 1 Arsenal Brighton & Hove Albion 0 1
4 1 1 Arsenal Burnley 3 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 1 2
8 1 1 Arsenal Leicester City 0 0
9 1 1 Arsenal Liverpool 0 0
10 1 1 Arsenal Manchester City 0 0
Let's check I've got sensible results. As a basic check, I'll look at the results of Arsenal vs Aston Villa:
x %>%
filter(HomeTeam == "Arsenal", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
values_from="N", names_prefix="AwayGoals",
names_sep="", names_from=AwayGoals
)
# A tibble: 8 x 10
HomeGoals AwayGoals0 AwayGoals1 AwayGoals2 AwayGoals3 AwayGoals4 AwayGoals5 AwayGoals6 AwayGoals8 AwayGoals7
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 299 129 57 19 12 7 NA NA NA
2 1 135 63 25 6 4 4 1 2 NA
3 2 75 21 12 9 4 1 NA NA 1
4 3 30 13 10 1 NA NA NA NA NA
5 4 21 7 1 1 NA NA NA NA NA
6 5 11 2 1 NA 2 NA NA NA NA
7 6 4 2 2 NA NA NA NA NA NA
8 7 4 1 1 NA NA NA NA NA NA
That looks reasonable. Now confirm that the matches with known results don't vary. For example:
x %>%
filter(HomeTeam == "Liverpool", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(values_from="N", names_prefix="AwayGoals", names_sep="", names_from=AwayGoals)
HomeGoals AwayGoals0
<dbl> <int>
1 2 1000
All good.
So, that's 23 statements to generate the fixtures, take account of known results, simulate the remainder of the matches and do some basic sanity checking. I could easily get that down to under 20 statements if I had to. That's about a third fewer than you were using just to try to simulate the unknown results. [The actual simulation takes fewer than 10 statements.] I think my approach is easier to understand: by using tidy verbs the code is almost self-documenting.

approximate character matching using R

I have two data files. One of the files contains only one column with the name of the company (usually a hospital), and the other one contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file in which each company is matched with an address, if one is available in the address file.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data, ignore_case = TRUE solves the matching problem. Depending on how the full data looks, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want.
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
stringdist_left_join(GH_MY,
by = c("hospital" = "hospital_name"),
ignore_case = TRUE,
max_dist = 2,
distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
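For comparison, the same idea can be sketched without extra packages using adist(), which ships with base R (utils) and computes edit distances between strings. This is only an illustration on made-up vectors: `hospitals`, `lookup`, and the cut-off of 2 are assumptions that mirror the max_dist used above, not the real data.

```r
# Toy inputs (hypothetical): names to match and the reference names
hospitals <- c("HOSPITAL PAPAR", "HOSPITAL PEKAN", "CLINIC EXAMPLE")
lookup    <- c("Hospital Papar", "Hospital Pekan", "Hospital Putrajaya")

# Edit-distance matrix: rows are hospitals, columns are lookup names
d <- adist(hospitals, lookup, ignore.case = TRUE)

# For each row, find the closest lookup name and its distance
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(hospitals), best)]

# Accept the match only if it is within the assumed cut-off of 2 edits
matched <- ifelse(best_dist <= 2, lookup[best], NA_character_)
data.frame(hospital = hospitals, hospital_name = matched, dist = best_dist)
```

Unlike stringdist_left_join(), this keeps a single best candidate per name, which may or may not be what is wanted when several reference names are equally close.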

Calculate Percentage Column for List of Dataframes When Total Value is Hidden Within the Rows

library(tidyverse)
I feel like there is a simple solution for this but I'm stuck. The code below creates a simple list of two dataframes (they are the same for simplicity of the example, but the real data has different values)
Loc<-c("Montreal","Toronto","Vancouver","Quebec","Ottawa","Hamilton","Total")
Count<-c("2344","2322","122","45","4544","44","9421")
Data<-tibble(Loc,Count) # tibble() replaces the deprecated data_frame()
Data2<-tibble(Loc,Count)
Data3<-list(Data,Data2)
Each dataframe has "Total" within the "Loc" column with the corresponding overall total of the "Count" column. I would like to calculate percentages for each dataframe by dividing each value in the "Count" column by the total, which is the last number in the "Count" column.
I would like the percentages to be added as new columns for each dataframe.
For this example, the total is the last number in the column, but in reality, it may be mixed anywhere in the column and can be found by the corresponding "Total" value in the "Loc" column.
I would like to use purrr and Tidyverse:
Below is an example of the code, but I'm stuck on the percentage...
Data3%>%map(~mutate(.x,paste0(round(100* (MISSING PERCENTAGE),2),"%"))
This solution uses only base-R:
for (i in seq_along(Data3)) {
Data3[[i]]$Count <- as.numeric(Data3[[i]]$Count)
n <- nrow(Data3[[i]])
Data3[[i]]$perc <- Data3[[i]]$Count / Data3[[i]]$Count[n]
}
> Data3
[[1]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
[[2]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
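The loop above relies on the total being in the last row. Since the question says the "Total" row may sit anywhere, a possible purrr/dplyr variant looks the total up by its label instead. This is a sketch that assumes each data frame contains exactly one row with Loc equal to "Total"; the `Percent` column name is made up for illustration.

```r
library(dplyr)
library(purrr)

# Example data with the "Total" row deliberately not in last position
Loc   <- c("Total", "Montreal", "Toronto", "Vancouver", "Quebec", "Ottawa", "Hamilton")
Count <- c("9421", "2344", "2322", "122", "45", "4544", "44")
Data3 <- list(tibble(Loc, Count), tibble(Loc, Count))

result <- Data3 %>%
  map(~ mutate(.x,
               Count   = as.numeric(Count),
               # divide by the Count found on the "Total" row, wherever it is
               Percent = paste0(round(100 * Count / Count[Loc == "Total"], 2), "%")))
result[[1]]
```

If a data frame could contain zero or several "Total" rows, the lookup would need an explicit check before dividing.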
