Filling missing values with NA for a data frame in R

I am currently trying to create a data frame from the following lists:
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
When I try to create a data frame I get the (obvious) error that the lengths of the lists are not equal. I want to find a way to either make the lists the same length or fill the missing data frame entries with NA. After some searching I have not been able to find a solution.

Here are purrr (part of tidyverse) and base R solutions, assuming you just want to fill remaining values in each list with NA. I'm taking the maximum length of any list as len, then for each list doing rep(NA) for the difference between the length of that list and the maximum length of any list.
library(tidyverse)
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
all_lists <- list(location, organization, person, date, Jobs)
len <- max(lengths(all_lists))
With purrr::map_dfc, you can map over the list of lists, tack on NAs as needed, convert to character vector, then get a data frame of all those vectors cbinded in one piped call:
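map_dfc(all_lists, function(l) {
  # pad with NA_character_ so the padding is still a true missing value
  # after as.character() is applied to the list
  c(l, rep(NA_character_, len - length(l))) %>%
    as.character()
})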
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
In base R, you can lapply the same function across the list of lists, then use Reduce to cbind the resulting lists and convert it to a data frame. Takes two steps instead of purrr's one:
cols <- lapply(all_lists, function(l) c(l, rep(NA, len - length(l))))
as.data.frame(Reduce(cbind, cols, init = NULL))
#> V1 V2 V3 V4 V5
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
For both of these, you can now set the names however you like.
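As one way to set the names, here is a minimal base R sketch: naming the elements of the outer list up front lets the names carry through to the data frame. The column names are assumptions taken from the variable names in the question, and the padding is done on unlisted character vectors rather than with the exact code above.

```r
# Name the outer list so lapply() and as.data.frame() keep the names
all_lists <- list(
  location     = list("USA", "Singapore", "UK"),
  organization = list("Microsoft", "University of London", "Boeing", "Apple"),
  person       = list(),
  date         = list("1989", "2001", "2018"),
  jobs         = list("CEO", "Chairman", "VP of sales", "General Manager", "Director")
)
len <- max(lengths(all_lists))

cols <- lapply(all_lists, function(l) {
  values <- as.character(unlist(l))  # character(0) for an empty list
  c(values, rep(NA_character_, len - length(values)))
})
df <- as.data.frame(cols, stringsAsFactors = FALSE)

# Names can still be changed afterwards (job_title is a hypothetical name)
names(df)[names(df) == "jobs"] <- "job_title"
```

Renaming after the fact with names(df) <- ... works the same way on the tibble returned by map_dfc.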

You could do:
data.frame(sapply(dyem_list, "length<-", max(lengths(dyem_list))))
location organization person date Jobs
1 USA Microsoft NULL 1989 CEO
2 Singapore University of London NULL 2001 Chairman
3 UK Boeing NULL 2018 VP of sales
4 NULL Apple NULL NULL General Manager
5 NULL NULL NULL NULL Director
Where dyem_list is the following:
dyem_list <- list(
location = list("USA","Singapore","UK"),
organization = list("Microsoft","University of London","Boeing","Apple"),
person = list(),
date = list("1989","2001","2018"),
Jobs = list("CEO","Chairman","VP of sales","General Manager","Director")
)
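Because the elements of dyem_list are lists, "length<-" pads them with NULL, as the output above shows. If true NA values are wanted, one option (a sketch, not part of the answer above) is to flatten each element to a character vector first, since lengthening an atomic vector pads it with NA.

```r
dyem_list <- list(
  location     = list("USA", "Singapore", "UK"),
  organization = list("Microsoft", "University of London", "Boeing", "Apple"),
  person       = list(),
  date         = list("1989", "2001", "2018"),
  Jobs         = list("CEO", "Chairman", "VP of sales", "General Manager", "Director")
)
n <- max(lengths(dyem_list))

padded <- lapply(dyem_list, function(x) {
  v <- as.character(unlist(x))  # atomic character vector; character(0) if x is empty
  length(v) <- n                # lengthening an atomic vector fills with NA
  v
})
df <- data.frame(padded, stringsAsFactors = FALSE)
```

The as.character() call matters for the empty person list: unlist(list()) is NULL, and the length of NULL cannot be extended.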

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load libraries
library(pdftools)
library(stringr) # provides str_split_fixed(), used below
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftool
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
table_number,
number_columns,
column_names,
first_row,
last_row) {
data <- list_of_tables[table_number]
data <- trimws(data)
data <- strsplit(data, "\n")
data <- data[[1]]
data <- data[min(grep(first_row, data)):
max(grep(last_row, data))]
data <- str_split_fixed(data, " {2,}", number_columns)
data <- data.frame(data)
names(data) <- column_names
return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data, 3, 5, names, "Care fees", "Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
Although readr::read_fwf() is meant for fixed-width files, it handles text from pdftools pretty well too once header and footer rows are removed. It works even when it has to guess the column widths, though those can also be specified.
library(pdftools)
library(dplyr, warn.conflicts = FALSE)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
# re-connect strings in first column if values were split between rows
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
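For the case where the column widths are specified rather than guessed, here is a small self-contained sketch. It does not read the PDF: it builds three fixed-width lines from values in the table above, and the widths (45, 4, 12, 12, 12) and column names are assumptions chosen for this example. For real pdf_text() output the widths would have to be measured from the lines.

```r
library(readr)

# Three sample rows: label, note, and three value columns
rows <- list(
  c("Care fees", "", "210,187", "194,603", "398,206"),
  c("Fair-value movement of investment properties", "3", "261,346", "285,143", "745,885"),
  c("Impairment loss", "2", "(10,784)", "-", "-")
)

# Lay each row out in fixed-width fields: 45 + 4 + 12 + 12 + 12 characters
lines <- vapply(rows, function(r) {
  sprintf("%-45s%-4s%12s%12s%12s", r[1], r[2], r[3], r[4], r[5])
}, character(1))

tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Give read_fwf() the widths explicitly instead of letting it guess
output <- read_fwf(
  tmp,
  col_positions = fwf_widths(
    c(45, 4, 12, 12, 12),
    col_names = c("item", "note", "h1_2022", "h1_2021", "fy_2022")  # hypothetical names
  ),
  col_types = cols(.default = col_character())
)
```

With explicit widths the note column always exists, so rows without a note get NA there instead of shifting the values one column to the left.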

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but I get Error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the data types before row binding. Because of how the rest of the project is structured, I would like to do this inside a pipe (a tidyverse solution would be ideal) rather than loop over the data frames, using something like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can change all the column classes to characters and bind them together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~ .x %>% mutate(across(everything(), as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
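The same idea can be tried without scraping anything. The sketch below uses two made-up tibbles whose Deaths columns have conflicting types, converts every column to character, and then binds the rows. It uses map() followed by bind_rows() instead of map_df(); the data and the names tables and combined are assumptions for illustration only.

```r
library(dplyr)
library(purrr)

# Two hypothetical tables: Deaths is integer in one and character in the other
tables <- list(
  tibble(Deaths = c(28L, 15L), Incident = c("a", "b")),
  tibble(Deaths = "100+", Incident = "c")
)

combined <- tables %>%
  map(~ mutate(.x, across(everything(), as.character))) %>%  # unify column types
  bind_rows()                                                # now the rows can be stacked
```

Calling bind_rows(tables) directly on this list fails with the same kind of "Can't combine" error as in the question.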

Approximate character matching using R

I have two data files. One contains a single column with the name of the company (usually a hospital), and the other contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file in which each company is matched with an address, if one is available in the address file.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data ignore_case = TRUE solves the matching problem. Depending on how the full data looks, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want.
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
stringdist_left_join(GH_MY,
by = c("hospital" = "hospital_name"),
ignore_case = TRUE,
max_dist = 2,
distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
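To see what a distance threshold like max_dist does, here is a dependency-free sketch that uses base R's adist() instead of fuzzyjoin. It is a different technique from the answer above, shown only to illustrate the idea. The two real names come from the result above; "HOSPITAL EXAMPLE" and the variable names are made up.

```r
hospitals <- c("HOSPITAL PAPAR", "HOSPITAL PEKAN", "HOSPITAL EXAMPLE")  # last one is made up
addresses <- data.frame(
  hospital_name = c("Hospital Papar", "Hospital Pekan"),
  town          = c("Sabah", "Pahang"),
  stringsAsFactors = FALSE
)
max_dist <- 2

# Edit distance between every hospital and every address name, ignoring case
d <- adist(hospitals, addresses$hospital_name, ignore.case = TRUE)

# Closest candidate per hospital, dropped if it is further away than max_dist
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(hospitals), best)]
best[best_dist > max_dist] <- NA

matched <- data.frame(
  hospital = hospitals,
  addresses[best, ],                          # an NA index yields a row of NAs
  dist = ifelse(is.na(best), NA, best_dist),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Unlike stringdist_left_join, this keeps only the single closest candidate per hospital, so it cannot return several matches for one name.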

Replacing items in a list with items from another list in R

I have a column in a table with country codes stored as characters, and I want to replace these with numeric codes. For the "decoding" I have a second table in which the character country codes are associated with the numeric codes.
I tried gsub:
for (i in 1:nrow(countries))
{gsub(countries$code3[i], countries$numcode[i], doc_report$nationality)}
I tried a for loop:
i <- NULL
n <- NULL
for (i in 1:nrow(doc_report)) {
for (n in 1:nrow(countries)) {
if(doc_report$nationality[i] == countries$code3[n])
doc_report$nationality[i] <- countries$numcode[n]
else
if(doc_report$nationality[i] == "NA")
doc_report$nationality[i] <- 000
}
}
I also had something in mind with merge().
This is what the column that has to be replaced looks like:
[nationality] IRL GBR ITA FRA POL BRA ESP GBR GBR GBR
This is what the second table, used for decoding, looks like:
[code3] AFG ALB DZA ASM AGO AIA ATG ARG ARM
[numcode] 4 8 12 16 24 660 NA 28 32 51
So in table one I want the numcode from table two rather than the code3 value.
Updated Answer
Here's an example with data formatted like yours to make it clearer that it does work despite duplicate country codes.
library(tidyverse)
country <- c("IRL", "GBR", "ITA", "FRA", "POL", "BRA", "ESP")
codes <- c(1,2,3,4,5,6,7)
countries <- tibble(country, codes)
doc_report <- tibble(x=c("a","b","c","d","e"),
country = c("ITA","ITA", "POL", "BRA","ESP"))
left_join(doc_report, countries, by="country")
The output of this code is:
# A tibble: 5 x 3
x country codes
<chr> <chr> <dbl>
1 a ITA 3
2 b ITA 3
3 c POL 5
4 d BRA 6
5 e ESP 7
Which I believe is the behavior you're looking for.
Original Answer
A simple solution would be to use the left_join() function in the dplyr package and then select() to remove the unneeded column.
Let's say doc_report identifies countries by a code3 column, and country_codes is a tibble with one column of character country codes (also named code3) and one column of the corresponding numeric codes. You could then do something like this:
## join the country codes
doc_report <- left_join(doc_report, country_codes, by="code3")
## remove the unneeded column
doc_report <- select(doc_report, -code3)
Does this make sense? Happy to expand otherwise.
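In the question the key columns have different names (nationality in doc_report, code3 in countries). A minimal sketch of the same join for that case follows, using a named by vector. The three sample rows, their ISO 3166-1 numeric codes, and the choice to turn unmatched values into 0 (mirroring the 000 in the question's loop) are assumptions for illustration.

```r
library(dplyr)

# Small stand-ins for the two tables in the question
countries <- tibble(
  code3   = c("IRL", "GBR", "ITA"),
  numcode = c(372, 826, 380)
)
doc_report <- tibble(nationality = c("IRL", "GBR", "ITA", "GBR", NA))

doc_report <- doc_report %>%
  # match doc_report$nationality against countries$code3
  left_join(countries, by = c("nationality" = "code3")) %>%
  # replace the character code with the numeric one; unmatched rows become 0
  mutate(nationality = coalesce(numcode, 0)) %>%
  select(-numcode)
```

Repeated codes such as GBR are handled naturally, because every row of doc_report is looked up independently.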

Running a function in a for loop and adding the returned dataframe to a list in R

I have a function that takes in one argument as a particular year. This function returns a dataframe. Now I want to create a for loop for a range of years and add these data frames to a list or to combine into a large dataframe.
Would something like this work?
l <- list()
for (year in 2010:2017) {l <- functionX(subset(dataset, Year==year))}
It's not working. The error I get is:
longer object length is not a multiple of shorter object length
I also tried calling the function just as :
functionX(subset(dataset, Year== 2010:2017))
This doesn't work either.
Edit:
I think the data frames are not getting added because their lengths differ from year to year. I made a slight change:
for (i in 2010:2017) {
df <- functionX(subset(dataset, Year==i))
l[i] <- df$Name
}
Error:
number of items to replace is not a multiple of replacement length
I'm not trying to replace, but just trying to add elements of a particular dataframe for each year to the list.
I updated your example to make it reproducible. The general idea is as follows: inside the for loop you put your data.frame inside a list. Then you append that list to the big list, l. That way your data.frame becomes an element inside of the list l:
l <- list()
functionX <- function(Year) {
set.seed(Year)
df <- data.frame(year=Year, x=rnorm(10))
return(df)
}
for (year in 2010:2011) {
l <- functionX(year) ## this will not error, but will just overwrite l on every loop
}
l
#> year x
#> 1 2011 -0.65480083
#> 2 2011 -0.02877456
#> 3 2011 -0.19413575
#> 4 2011 -0.90141523
#> 5 2011 1.31329723
#> 6 2011 -0.82243619
#> 7 2011 -0.25875645
#> 8 2011 0.23465318
#> 9 2011 -0.42060734
#> 10 2011 -0.63676356
l <- list()
for (year in 2010:2011) {
new_l <- list(functionX(year)) ## this will put the resulting df as an element in a list
l <- append(l, new_l) ## this appends the lists together
}
l
#> [[1]]
#> year x
#> 1 2010 -0.537472741
#> 2 2010 -0.005191135
#> 3 2010 1.005671811
#> 4 2010 0.214009870
#> 5 2010 -0.201253144
#> 6 2010 1.447430260
#> 7 2010 -0.539834711
#> 8 2010 -1.520636908
#> 9 2010 0.652780491
#> 10 2010 0.613471135
#>
#> [[2]]
#> year x
#> 1 2011 -0.65480083
#> 2 2011 -0.02877456
#> 3 2011 -0.19413575
#> 4 2011 -0.90141523
#> 5 2011 1.31329723
#> 6 2011 -0.82243619
#> 7 2011 -0.25875645
#> 8 2011 0.23465318
#> 9 2011 -0.42060734
#> 10 2011 -0.63676356
Created on 2018-08-02 by the reprex package (v0.2.0.9000).
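The question also asks about combining the results into one large data frame. As a sketch under the same assumptions as the example above (functionX takes a year and returns a data frame), lapply() can build the list without a loop and do.call(rbind, ...) can stack it. The names years and combined are introduced here for illustration.

```r
# Same stand-in function as in the example above
functionX <- function(Year) {
  set.seed(Year)
  data.frame(year = Year, x = rnorm(10))
}

years <- 2010:2011

# One data frame per year, collected in a list named by year
l <- lapply(years, functionX)
names(l) <- years

# Stack all list elements into a single data frame
combined <- do.call(rbind, l)
```

This works because every data frame returned by functionX has the same columns. The number of rows per year may differ.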
The following code will do what you want.
First, I will create a test dataset, since you have not posted one.
set.seed(527) # make the results reproducible
dataset <- data.frame(Year = sample(2000:2018, 100, TRUE), X = rnorm(100))
Now the function.
functionX <- function(DF, years){
res <- lapply(years, function(y) subset(DF, Year == y))
names(res) <- years
res
}
functionX(dataset, 2010:2017)
