Generate codes based on Nominal Variables present in a dataframe - r

I have a data frame with 1000 observations and the structure below.
Town <- c("TownA", "TownB", "TownC","TownD","Town A", "Town Z")
Ward <- c("Ward B","Ward Z","Ward A","Ward W","Ward X", "Ward ")
DF <- data.frame(Town, Ward)
I have another dataset that contains codes that represent the nominal observations of Town and Ward. The codes are the ones to be used for analysis. For example, Town A has the code 23, Town B has the code 15, Town Z has the code 7. Instead of manually creating a new column and populating the codes based on towns, is there a simpler way to do this in R?
My goal is to mutate a new column that will match the codes with the towns. The dataset has around 200 Towns.

You can create a lookup table of codes and then join it to your data frame:
library(tidyverse)
Town <- c("TownA", "TownB", "TownC","TownD","Town A", "Town Z")
Ward <- c("Ward B","Ward Z","Ward A","Ward W","Ward X", "Ward ")
DF <- data.frame(Town, Ward)
codes <- tribble(
~Town, ~Code,
"TownA", 23,
"TownB", 15,
"Town Z", 7
)
codes
#> # A tibble: 3 × 2
#> Town Code
#> <chr> <dbl>
#> 1 TownA 23
#> 2 TownB 15
#> 3 Town Z 7
DF %>%
left_join(codes)
#> Joining, by = "Town"
#> Town Ward Code
#> 1 TownA Ward B 23
#> 2 TownB Ward Z 15
#> 3 TownC Ward A NA
#> 4 TownD Ward W NA
#> 5 Town A Ward X NA
#> 6 Town Z Ward 7
Created on 2021-09-20 by the reprex package (v2.0.1)
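For a lookup on a single key, base R's match() can give the same result as the join without loading any packages. The following is only a minimal sketch: it rebuilds a shortened DF and a small codes table holding the three codes used in the answer above, and it leaves towns without a code as NA.

```r
# Shortened example data (same shape as the question's DF)
DF <- data.frame(
  Town = c("TownA", "TownB", "TownC", "Town Z"),
  Ward = c("Ward B", "Ward Z", "Ward A", "Ward ")
)

# Lookup table: one row per town that has a code
codes <- data.frame(
  Town = c("TownA", "TownB", "Town Z"),
  Code = c(23, 15, 7)
)

# match() returns the row of `codes` for each town in DF (NA if absent)
DF$Code <- codes$Code[match(DF$Town, codes$Town)]
DF
```

Either way, the spelling of the town names has to agree exactly between the two tables ("TownA" and "Town A" are different keys).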

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load libraries
library(pdftools)
library(stringr) # provides str_split_fixed(), used below
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftool
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
table_number,
number_columns,
column_names,
first_row,
last_row) {
data <- list_of_tables[table_number]
data <- trimws(data)
data <- strsplit(data, "\n")
data <- data[[1]]
data <- data[min(grep(first_row, data)):
max(grep(last_row, data))]
data <- str_split_fixed(data, " {2,}", number_columns)
data <- data.frame(data)
names(data) <- column_names
return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data,3,5,names,"Care fees","Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
Although readr::read_fwf() is meant for fixed-width files, it also performs pretty well on text from pdftools once header and footer rows are removed, even when it has to guess the column widths. The widths can also be specified explicitly.
library(pdftools)
library(dplyr, warn.conflicts = FALSE)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
# re-connect strings in the first column if values were split between rows
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
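If guessing the column widths turns out to be unreliable, the widths can be passed explicitly through readr::fwf_widths(). The sketch below uses a made-up two-line excerpt rather than the real PDF text; the widths (17, 3, 10, 10) and the column names are assumptions chosen for this toy input, and I() for literal text assumes readr 2.0 or later.

```r
library(readr)

# Toy fixed-width text: label (17 chars), note (3), two value columns (10 each)
txt <- paste(
  sprintf("%-17s%-3s%-10s%-10s",
          c("Care fees", "Impairment loss"),
          c("", "2"),
          c("210,187", "(10,784)"),
          c("194,603", "-")),
  collapse = "\n"
)

parsed <- read_fwf(
  I(txt),
  col_positions = fwf_widths(
    c(17, 3, 10, 10),
    col_names = c("item", "note", "h1_2022", "h1_2021")
  ),
  # keep everything as text; the values contain commas and brackets
  col_types = cols(.default = col_character())
)
parsed
```

With explicit widths, the partially filled note column stays in its own position and is simply NA where no note is present.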

functional programming problems -- map_df & regex

I am trying to combine multiple spreadsheets (about 20) using a functional programming approach. Each spreadsheet contains an individual year of data. They are messy, with columns not named or the name of the same column changing across the spreadsheets.
I originally did all the cleaning up individually for each spreadsheet but want to learn how to do it with a functional programming approach to make it more reproducible.
My approach was to build a regex to match all the different names of the specified column, then rename the column using a custom function/regex. I thought I could then use map_dfr to apply this function to all the different spreadsheets to produce a final dataframe to work with.
However I have encountered 2 problems:
1. The regex engine in R seems to have the global parameter switched on, with no way to switch it off. I want to try the different alternatives in the regular expression in sequence and stop at the first match, not return all matches. For example, after I import the spreadsheets there are sometimes multiple unnamed columns, which are given names such as ...1. I only want to match the first instance. I cannot work out whether it is possible to disable the global parameter, or whether there is a cleverer way of writing the regex so that it stops after the first match. Is there another, perhaps better, way of approaching this?
2. When I pass my custom function, which seems to work well enough on individual data frames, to map_dfr, I get an error that I do not understand.
I have produced a minimal reprex below, which I think highlights the issues.
All thoughts greatly received, including alternative approaches to this, as this must be a very common problem people come across. Thanks.
library(tidyverse)
year_1 <- tribble(
~`...1`, ~admissions,
"Hospital 1", 10,
"Hospital 2", 100,
"hospital 3", 200
)
year_2 <- tribble(
~provider_code, ~`...2`, ~admissions,
"H1", "Hospital 1", 20,
"H2", "Hospital 2", 400,
"H3", "hospital 3", 500
)
year_3 <- tribble(
~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
"H1", "Hospital 1", 350,
"H2", "Hospital 2", 350,
"H3", "hospital 3", 550
)
clean_up_area_column_name <- function(x){
rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
}
clean_up_area_column_name(year_1)
#> # A tibble: 3 × 2
#> area admissions
#> <chr> <dbl>
#> 1 Hospital 1 10
#> 2 Hospital 2 100
#> 3 hospital 3 200
clean_up_area_column_name(year_2)
#> # A tibble: 3 × 3
#> provider_code area admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 20
#> 2 H2 Hospital 2 400
#> 3 H3 hospital 3 500
clean_up_area_column_name(year_3)
#> # A tibble: 3 × 3
#> area1 area2 admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 350
#> 2 H2 Hospital 2 350
#> 3 H3 hospital 3 550
test_df <- map_dfr(c(year_1, year_2, year_3), clean_up_area_column_name)
#> Error in UseMethod("rename"): no applicable method for 'rename' applied to an object of class "character"
Created on 2022-08-08 by the reprex package (v2.0.1)
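Passing multiple data frames to map_dfr requires a list. c(year_1, year_2, year_3) concatenates the data frames into one flat list of their columns, so the function is called on each column (a character vector first) rather than on each data frame: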
test_df <- map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 x 5
area admissions provider_code area1 area2
<chr> <dbl> <chr> <chr> <chr>
1 Hospital 1 10 NA NA NA
2 Hospital 2 100 NA NA NA
3 hospital 3 200 NA NA NA
4 Hospital 1 20 H1 NA NA
5 Hospital 2 400 H2 NA NA
6 hospital 3 500 H3 NA NA
7 NA 350 NA H1 Hospital 1
8 NA 350 NA H2 Hospital 2
9 NA 550 NA H3 hospital 3
If you only want to keep the first match, as you say, then the following tweak to your function should work. When several columns match, rename() numbers them area1, area2 and so on, so rename any "area1" to "area" and then drop the remaining "area" columns with trailing digits (area2, area3 etc.).
clean_up_area_column_name <- function(x){
rename({{x}},
area = matches("\\.{3}[0-9]|commissioning region|hospital provider")) %>%
rename(., area = matches("area1")) %>%
select(-matches("area\\d"))
}
I'm not sure what you expect year_3 to return as it seems your regex is matching the provider_code as well as area:
map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 × 3
area admissions provider_code
<chr> <dbl> <chr>
1 Hospital 1 10 NA
2 Hospital 2 100 NA
3 hospital 3 200 NA
4 Hospital 1 20 H1
5 Hospital 2 400 H2
6 hospital 3 500 H3
7 H1 350 NA
8 H2 350 NA
9 H3 550 NA
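On the "stop after the first match" part of the question: the pattern is tested against every column name separately, so there is no global flag to turn off. One possible alternative is to pick the first matching name yourself. The helper below, rename_first_match(), is hypothetical (not part of dplyr or purrr) and uses base R only; the pattern also leaves out "hospital provider" on the assumption that the provider code column should not be renamed.

```r
# Hypothetical helper: rename only the first column whose name matches `pattern`
rename_first_match <- function(df, pattern, new_name) {
  hits <- grep(pattern, names(df), ignore.case = TRUE)
  if (length(hits) > 0) {
    names(df)[hits[1]] <- new_name
  }
  df
}

# Cut-down version of year_3 from the question
year_3 <- data.frame(
  `Hospital provider code` = c("H1", "H2"),
  `Commissioning region/Provider` = c("Hospital 1", "Hospital 2"),
  admissions = c(350, 350),
  check.names = FALSE
)

out <- rename_first_match(year_3, "\\.{3}[0-9]|commissioning region", "area")
names(out)
```

Because the helper takes and returns a data frame, it can be mapped over a list of data frames in the same way as the original function.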

How to loop many factors into one function

I have a large data frame regarding Covid patients. I have included a very simplified version of what this frame looks like.
CovidFake <- data.frame(DateReporting=sample(seq(as.Date("2020-10-1"), as.Date("2020-11-01"), by="day"), 50, replace=TRUE),
Industry=sample(c("Minor or Student", "Educational Services", "Medical Services", "Food Production"),50, replace =TRUE))
I want to use ggplot to make a graph of the daily cases by industry of the patient. I have this function to structure the frame so ggplot can graph it.
library(zoo)
library(tidyr) # provides complete(), replace_na() and %>%
MainFunction <- function(MainFrame, CatVal){
Frame <- subset(MainFrame, Industry==CatVal)
Frame <- as.data.frame(table(Frame$DateReporting))
colnames(Frame) <- c("Var1", "Freq")
Frame$Var1 <- as.Date(Frame$Var1, "%Y-%m-%d")
Frame <- Frame %>% complete(Var1 = seq.Date(as.Date("2020-10-01", "%Y-%m-%d"),
as.Date("2020-11-01", "%Y-%m-%d"), by="day"))
Frame$Freq <- replace_na(Frame$Freq, 0)
Frame$CumSum <- cumsum(Frame$Freq)
Frame$Cat <- CatVal
Frame$SevenDayAverage <- rollmean(Frame$Freq, 7, fill=NA, align = "right")
colnames(Frame) <- c("Date", "DailyCases", "CumSum", "Industry", "SevenDayAve")
Frame <- subset(Frame, Date >= "2020-03-13")
return(Frame)
}
I need to create a frame that has all of these industries, so I've been doing something like this.
IndGraph <- rbind(MainFunction(CovidFake, "Minor or Student"),
MainFunction(CovidFake, "Educational Services"),
MainFunction(CovidFake, "Medical Services"),
MainFunction(CovidFake, "Food Production"))
The true frame has about 15 industries, so the code gets pretty long and seemingly unnecessarily repetitive. Is there any way to loop all the factors into the function and do this in one go? Or is there a simpler way to structure the frame? I'm new to R so any and all help is much appreciated.
Thanks!
Using a for loop over the unique industries (looping over CovidFake$Industry itself would call the function once per row, 50 times, and stack duplicated blocks):
IndGraph <- vector()
for(i in unique(CovidFake$Industry)){
IndGraph <- rbind(IndGraph, MainFunction(CovidFake, i))}
Output (4 industries x 32 days = 128 rows, assuming all four industries occur in the sample):
> IndGraph
# A tibble: 128 x 5
Date DailyCases CumSum Industry SevenDayAve
<date> <dbl> <dbl> <chr> <dbl>
1 2020-10-01 0 0 Minor or Student NA
2 2020-10-02 0 0 Minor or Student NA
3 2020-10-03 1 1 Minor or Student NA
4 2020-10-04 0 1 Minor or Student NA
5 2020-10-05 0 1 Minor or Student NA
6 2020-10-06 0 1 Minor or Student NA
7 2020-10-07 1 2 Minor or Student 0.286
8 2020-10-08 1 3 Minor or Student 0.429
9 2020-10-09 2 5 Minor or Student 0.714
10 2020-10-10 0 5 Minor or Student 0.571
# ... with 118 more rows
One option would be:
do.call("rbind", lapply(unique(CovidFake$Industry), FUN = function(x, y = CovidFake) MainFunction(y, x)))
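The lapply() plus do.call(rbind, ...) pattern may be easier to follow on a toy example. In this sketch, cases and summarise_one() are made-up stand-ins for CovidFake and MainFunction:

```r
# Toy data: five case counts across two industries
cases <- data.frame(
  industry = c("A", "B", "A", "B", "A"),
  n = 1:5
)

# Stand-in for MainFunction: one summary row per industry
summarise_one <- function(df, ind) {
  sub <- df[df$industry == ind, ]
  data.frame(industry = ind, total = sum(sub$n))
}

# Call the function once per unique industry, then stack the pieces
pieces <- lapply(unique(cases$industry), function(ind) summarise_one(cases, ind))
combined <- do.call(rbind, pieces)
combined
```

Building the list first and binding once also avoids growing a data frame inside a loop.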

Joining tables and applying functions to columns with the same name in R and tidyverse

I am looking to join tables with customer id (easy enough) but then I want to multiply the columns to get updated values.
Customer_Week_1<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(2,5,8,9),
Factor_2=c(.5,.5,.75,.75),
Factor_3=c(0,1,2,3))
Customer_Week_2<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(3,7,1,7),
Factor_2=c(.51,.65,.72,.4),
Factor_3=c(1,2,3,4))
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))
The expected values can be computed directly with:
Customer_week3_expected<-Customer_Week_1[,3:5]*Customer_Week_2[,3:5]
And I know I can just manually type out every column. But I have dozens of columns and need to make this code as easy to follow as possible.
I also know that I can just bind the results vector to
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))%>%
select(1:2)
But that does not look like best practice to me, and I would rather this be done with a join some way to ensure everything lines up when I am iterating over the customers(tables)
Assuming I understand the output you're trying to get, I can think of two methods. If you know that the names are in the first two columns and are the same in both data frames (this might not be the case in real life), you can use the same multiplication operation you tried above, bound to the first two columns of either of the data frames.
cbind(Customer_Week_1[1:2], Customer_Week_1[-1:-2] * Customer_Week_2[-1:-2])
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> 1 John Jackson 6 0.255 0
#> 2 Mary Smith 35 0.325 2
#> 3 David Williams 8 0.540 6
#> 4 Paul Zimmerman 63 0.300 12
Or you can be more verbose but maybe more flexible, and reshape to a long data frame, then do a grouped operation to summarize products for each person and factor. Starting from the join you have above:
library(dplyr)
library(tidyr)
Customer_week3 <- Customer_Week_1 %>%
left_join(Customer_Week_2, by = c("First_name", "Last_name"))
Make long-shaped data, separate the Factor_1.x into Factor_1 and x, and make products as your summary calculation.
products <- Customer_week3 %>%
gather(key = factor, value = value, -First_name, -Last_name) %>%
separate(factor, into = c("factor", "week"), sep = "\\.") %>%
group_by(First_name, Last_name, factor) %>%
summarise(value = prod(value))
head(products)
#> # A tibble: 6 x 4
#> # Groups: First_name, Last_name [2]
#> First_name Last_name factor value
#> <fct> <fct> <chr> <dbl>
#> 1 David Williams Factor_1 8
#> 2 David Williams Factor_2 0.54
#> 3 David Williams Factor_3 6
#> 4 John Jackson Factor_1 6
#> 5 John Jackson Factor_2 0.255
#> 6 John Jackson Factor_3 0
If you need to get back to a wide format, spread back.
products %>%
spread(key = factor, value = value)
#> # A tibble: 4 x 5
#> # Groups: First_name, Last_name [16]
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 David Williams 8 0.54 6
#> 2 John Jackson 6 0.255 0
#> 3 Mary Smith 35 0.325 2
#> 4 Paul Zimmerman 63 0.3 12
Similar to @camille's reshaping, but in data.table (and disregarding Customer_week3):
library(data.table)
# long format
long = rbindlist(list(Customer_Week_1, Customer_Week_2), idcol=TRUE)
# aggregate
long[, lapply(.SD, prod), by=.(First_name, Last_name), .SDcols=patterns("^Factor")]
First_name Last_name Factor_1 Factor_2 Factor_3
1: John Jackson 6 0.255 0
2: Mary Smith 35 0.325 2
3: David Williams 8 0.540 6
4: Paul Zimmerman 63 0.300 12
Going longer (again as seen in @camille's answer) might also make sense, so as to avoid repeatedly fiddling with names of Factor_* columns:
longer = melt(long, measure.vars=patterns("^Factor")) # analogous to gather
longer[, .(value = prod(value)), by=.(First_name, Last_name, variable)]
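If the main concern is that rows line up by customer rather than by position, the join itself can drive the multiplication. The following base R sketch is one possible way to do that; it uses shortened, made-up versions of the two tables (with the second deliberately in a different row order) and relies on merge() adding the suffixes .x and .y to the shared factor columns.

```r
keys <- c("First_name", "Last_name")

week_1 <- data.frame(
  First_name = c("John", "Mary"),
  Last_name = c("Jackson", "Smith"),
  Factor_1 = c(2, 5),
  Factor_2 = c(0.5, 0.5)
)
# Same customers, different row order
week_2 <- data.frame(
  First_name = c("Mary", "John"),
  Last_name = c("Smith", "Jackson"),
  Factor_1 = c(7, 3),
  Factor_2 = c(0.65, 0.51)
)

# Join on the keys; shared columns become Factor_1.x, Factor_1.y, ...
joined <- merge(week_1, week_2, by = keys, suffixes = c(".x", ".y"))

# Multiply each .x column by its .y partner
factors <- setdiff(names(week_1), keys)
result <- joined[keys]
result[factors] <- joined[paste0(factors, ".x")] * joined[paste0(factors, ".y")]
result
```

Here the factor columns are found by name, so adding more Factor_* columns needs no change to the code.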

reformat data frame in R

I am new to R.
I need to reformat the following data frame:
`Sample Name` `Target Name` `CT values`
<chr> <chr> <dbl>
1 Sample 1 actin 19.69928
2 Sample 1 Ho-1 27.71864
3 Sample 1 Nrf-2 26.00012
9 Sample 9 Ho-1 25.31180
10 Sample 9 Nrf-2 26.41421
11 Sample 9 C3 26.16980
...
15 Sample 1 actin 19.49202
Actually, I want to have the different 'Target Names' as column names, and the individual 'Sample Names' as row names. The table should then display the respective CT values.
But note that there are duplicates: for example, the combination of Sample 1 and the Target Name "actin" appears twice. I want the table to show each such combination only once, with the mean of the two CT values.
I guess this is a very basic R data frame manipulation, but as I said, I am quite new to R and messing around with different tutorials.
Thank you very much in advance!
One way of doing that using the tidyverse ecosystem of packages:
library(tidyverse)
tab <- tribble(
~`Sample Name`, ~`Target Name`, ~ `CT values`,
"Sample 1", "actin", 19.69928,
"Sample 1", "Ho-1", 27.71864,
"Sample 1", "Nrf-2", 26.00012,
"Sample 9", "Ho-1", 25.31180,
"Sample 9", "Nrf-2", 26.41421,
"Sample 9", "C3", 26.16980,
"Sample 1", "actin", 19.49202
)
tab %>%
# calculate the mean of your duplicates
group_by(`Sample Name`, `Target Name`) %>%
summarise(`CT values` = mean(`CT values`)) %>%
# reshape the data
spread(`Target Name`, `CT values`)
#> # A tibble: 2 x 5
#> # Groups: Sample Name [2]
#> `Sample Name` actin C3 `Ho-1` `Nrf-2`
#> * <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Sample 1 19.6 NA 27.7 26.0
#> 2 Sample 9 NA 26.2 25.3 26.4
You can also do this more concisely with data.table, using its dcast reshaping function:
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
setDT(tab)
dcast(tab, `Sample Name` ~ `Target Name`, fun.aggregate = mean)
#> Using 'CT values' as value column. Use 'value.var' to override
#> Sample Name C3 Ho-1 Nrf-2 actin
#> 1: Sample 1 NaN 27.71864 26.00012 19.59565
#> 2: Sample 9 26.1698 25.31180 26.41421 NaN
Created on 2018-01-13 by the reprex package (v0.1.1.9000).
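For comparison, the same averaged cross-table can be sketched in base R with tapply(), which returns a matrix with the samples as row names and the targets as column names. The data frame below re-enters the rows shown in the question; the shortened column names (sample, target, ct) are assumptions made for readability.

```r
tab <- data.frame(
  sample = c("Sample 1", "Sample 1", "Sample 1",
             "Sample 9", "Sample 9", "Sample 9", "Sample 1"),
  target = c("actin", "Ho-1", "Nrf-2", "Ho-1", "Nrf-2", "C3", "actin"),
  ct = c(19.69928, 27.71864, 26.00012, 25.31180, 26.41421, 26.16980, 19.49202)
)

# Mean CT value for every sample/target combination; empty cells become NA
ct_means <- tapply(tab$ct, list(tab$sample, tab$target), mean)
ct_means
```

The duplicated Sample 1 / actin pair is collapsed to its mean, and combinations that never occur are left as NA.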
