Assign names based on total in R

I have two data frames. The first, Problems, holds the problem number and the number of problems to be solved. The second, Name_Data, holds the name of each person who will solve the problems and the number of days that person can work. I need a resulting data frame in which the problem list is assigned to the persons based on the total number of persons and their days of availability. Problem_Solver_Name has to be assigned evenly, by dividing the total number of problems in proportion to each solver's days of availability. Problem_with_Solver is my expected result.
Please help if anybody has a solution for this.
Name=c("Antony","Roger","Peter","Samuel")
Days=c(1,1,0.5,1)
Name_Data=data.frame(Name,Days)
Problem_List=c("Problem1","Problem2","Problem3","Problem4","Problem5","Problem6","Problem7","Problem8",
"Problem9","Problem10","Problem11","Problem12","Problem13","Problem14","Problem15","Problem16",
"Problem17","Problem18","Problem19")
Total_Problem=c(1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,18,19,20,21)
Problems=data.frame(Problem_List,Total_Problem)
Problem_Solver_Name=c("Antony","Antony","Antony","Antony","Antony","Antony","Roger","Roger","Roger","Roger","Roger","Peter","Peter","Peter","Samuel","Samuel","Samuel","Samuel","Samuel")
Problem_with_Solver=data.frame(Problem_List,Total_Problem,Problem_Solver_Name)

Here is a solution.
First assign names in proportion to each person's available days, then fill any remaining problems by cycling through the names until every problem has a solver.
This last assignment criterion is probably not the best, but it does end up assigning a name to every problem.
# Persons availability data
Name <- c("Antony","Roger","Peter","Samuel")
Days <- c(1,1,0.5,1)
Name_Data <- data.frame(Name, Days)
# Problems data
Problem_List <- c("Problem1","Problem2","Problem3","Problem4","Problem5","Problem6","Problem7","Problem8",
"Problem9","Problem10","Problem11","Problem12","Problem13","Problem14","Problem15","Problem16",
"Problem17","Problem18","Problem19")
Total_Problem <- c(1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,18,19,20,21)
Problems <- data.frame(Problem_List, Total_Problem)
#
# X: person and days availability data
#
assign_solver <- function(X, Problems) {
  N <- nrow(Problems)
  # share of the N problems proportional to each person's available days
  nn <- round(N * X$Days / sum(X$Days))
  #
  # psn: problem solver name
  psn <- character(N)
  psn[seq_len(sum(nn))] <- rep(X$Name, nn)
  # hand out the problems left over by rounding, cycling through the names
  k <- 1L + sum(nn)
  i <- 1L
  while(k <= N) {
    psn[k] <- X$Name[i]
    i <- if(i < nrow(X)) i + 1L else 1L
    k <- k + 1L
  }
  # keep each person's problems together, in the order of X$Name
  psn <- psn[order(match(psn, X$Name))]
  psn
}
Problem_with_Solver <- data.frame(Problem_List, Total_Problem)
Problem_with_Solver$Problem_Solver_Name <- assign_solver(Name_Data, Problems)
Problem_with_Solver
#> Problem_List Total_Problem Problem_Solver_Name
#> 1 Problem1 1 Antony
#> 2 Problem2 2 Antony
#> 3 Problem3 3 Antony
#> 4 Problem4 4 Antony
#> 5 Problem5 5 Antony
#> 6 Problem6 6 Antony
#> 7 Problem7 8 Roger
#> 8 Problem8 9 Roger
#> 9 Problem9 10 Roger
#> 10 Problem10 11 Roger
#> 11 Problem11 12 Roger
#> 12 Problem12 13 Peter
#> 13 Problem13 14 Peter
#> 14 Problem14 15 Peter
#> 15 Problem15 16 Samuel
#> 16 Problem16 18 Samuel
#> 17 Problem17 19 Samuel
#> 18 Problem18 20 Samuel
#> 19 Problem19 21 Samuel
Created on 2022-07-25 by the reprex package (v2.0.1)
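As a possible alternative to the round-robin fill (which the answer itself calls probably not the best), the leftover problems can go to the persons whose ideal share lost the most when rounded down. This is the largest-remainder method, swapped in here only as an illustration. `allocate_largest_remainder` is a hypothetical helper name and is not part of the answer above.

```r
# Sketch: largest-remainder allocation (hypothetical helper, base R only)
allocate_largest_remainder <- function(days, n_problems) {
  quota <- n_problems * days / sum(days)  # ideal, fractional share per person
  base  <- floor(quota)                   # whole problems everyone gets for sure
  left  <- n_problems - sum(base)         # problems still unassigned
  # one extra problem for the persons with the largest fractional parts
  extra <- order(quota - base, decreasing = TRUE)[seq_len(left)]
  base[extra] <- base[extra] + 1
  base
}

Name <- c("Antony", "Roger", "Peter", "Samuel")
Days <- c(1, 1, 0.5, 1)

counts <- allocate_largest_remainder(Days, 19)
solver <- rep(Name, times = counts)       # one name per problem, grouped by person
table(factor(solver, levels = Name))
```

With the data from the question this gives the same split as the expected result (6, 5, 3, 5). Ties between equal remainders are resolved by position in `Name`, which is an arbitrary choice.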

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load libraries
library(pdftools)
library(stringr) # provides str_split_fixed(), used below
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftool
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
table_number,
number_columns,
column_names,
first_row,
last_row) {
data <- list_of_tables[table_number]
data <- trimws(data)
data <- strsplit(data, "\n")
data <- data[[1]]
data <- data[min(grep(first_row, data)):
max(grep(last_row, data))]
data <- str_split_fixed(data, " {2,}", number_columns)
data <- data.frame(data)
names(data) <- column_names
return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data,3,5,names,"Care fees","Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
While readr::read_fwf() is meant for fixed-width files, it performs pretty well on text from pdftools too, once header and footer rows are removed. It works even when it has to guess the column widths, though those can be specified too.
library(pdftools)
library(dplyr, warn.conflicts = FALSE)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
# re-connect strings in first column if values were split between rows
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
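For the case where guessing the widths is not reliable, here is a minimal sketch of passing them explicitly with readr::fwf_widths(). The three input lines are a made-up excerpt padded with sprintf(), and the widths and column names (`item`, `note`, `h1_2022`, `h1_2021`) are assumptions for illustration, not values measured from the PDF. Wrapping the text in I() marks it as literal data, which assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical excerpt, padded to fixed widths: 20 + 6 + 12 + 12 characters
fmt <- "%-20s%-6s%12s%12s"
lines <- c(
  sprintf(fmt, "Care fees",       "",  "210,187",   "194,603"),
  sprintf(fmt, "Impairment loss", "2", "(10,784)",  "-"),
  sprintf(fmt, "Total expenses",  "",  "(318,283)", "(258,484)")
)

out <- read_fwf(
  I(paste(lines, collapse = "\n")),          # literal text, not a file path
  col_positions = fwf_widths(
    c(20, 6, 12, 12),
    col_names = c("item", "note", "h1_2022", "h1_2021")
  ),
  col_types = cols(.default = col_character())  # keep "(10,784)" and "-" as text
)
out
```

Rows without a note end up with NA in the `note` column instead of shifting the figures one column to the left.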

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric",nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index would apply to the original data.frame, yet you subset that and so it does not match anymore.
Try something like this: First find minimum age, then exclude current index and calculate average activity of cases with age >= pre-calculated minimum age.
for (i in 1:10){
test.avg[i] <- {amin=age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
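To make the index mismatch concrete, here is a small base R sketch using the data from the question. It keeps the "same age or older" reading used in the loop above. `mean_older_or_equal` is a hypothetical helper name introduced only for this illustration.

```r
age      <- c(21, 23, 4, 25, 7, 4, 12, 14, 9, 7)
activity <- c(19, 45, 78, 33, 2, 49, 22, 21, 112, 61)

# Row 2 (age 23): the subset holds only rows 2 and 4
i <- 2
kept <- activity[age >= age[i]]   # 45 33
kept[-i]                          # drops position 2 of the subset, so row 2's own 45 stays in

# Excluding the current row with a logical mask on the full vectors avoids this
mean_older_or_equal <- function(i) {
  keep <- age >= age[i] & seq_along(age) != i
  mean(activity[keep])
}

mean_older_or_equal(1)  # rows 2 and 4: (45 + 33) / 2
mean_older_or_equal(2)  # row 4 only
sapply(seq_along(age), mean_older_or_equal)
```

The oldest colony has no other colony of the same age or older, so its mean is NaN.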
You can use map_df:
library(tidyverse)
test.df %>%
mutate(map_df(1:nrow(test.df), ~
test.df %>%
filter(age >= test.df$age[.x]) %>%
summarise(av_acti= mean(activity))))

R - Sample consecutive series of dates in time series without replacement?

I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
ymd("2015-07-03"),
by = "days"),
seq(ymd("2015-08-09"),
ymd("2015-10-01"),
by = "days"),
seq(ymd("2015-11-12"),
ymd("2016-01-03"),
by = "days")),
sample_id = 0L)
I.e.:
> date_data
# A tibble: 211 x 2
dates sample_id
<date> <int>
1 2015-03-22 0
2 2015-03-23 0
3 2015-03-24 0
4 2015-03-25 0
5 2015-03-26 0
6 2015-03-27 0
7 2015-03-28 0
8 2015-03-29 0
9 2015-03-30 0
10 2015-03-31 0
# … with 201 more rows
What I want to do is to take ten 10-day long samples of continuous dates from within that time series without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10 because that falls completely within the dates column in my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column in date_data such as 1:10.
To be clear, my requirements are:
Each sample would be 10 consecutive days.
The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.
Each 10-day-long sample can't include any date that's not within date_data$dates.
At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 100 rows - 10 for each of the 10 samples - where sample_id != 0).
I am aware of dplyr::sample_n() but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled...
What's a good way to do this? A for loop?!?! Or perhaps something with purrr? Thank you very much for your help.
UPDATE: Thanks to @gfgm's solution, it reminded me that performance is an important consideration. My real dataset is quite a bit larger, and in some cases I would want to take 20+ samples instead of just 10. Ideally the size of the sample can be changed as well, i.e. not necessarily 10-days long.
This is tricky, as you anticipated, because of the requirement of sampling without replacement. I have a working solution below which achieves a random sample and works fast on a problem of the scale given in your toy example. It should also be fine with more observations, but will get really really slow if you need to pick a lot of points relative to the sample size.
The basic premise is to pick n=10 points, generate the 10 vectors from these points forwards, and if the vectors overlap ditch them and pick again. This is simple and works fine given that 10*n << nrow(df). If you wanted to get 15 subvectors out of your 200 observations this would be a good deal slower.
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
ymd("2015-07-03"),
by = "days"),
seq(ymd("2015-08-09"),
ymd("2015-10-01"),
by = "days"),
seq(ymd("2015-11-12"),
ymd("2016-01-03"),
by = "days")),
sample_id = 0L)
# A function that picks n indices, projects them forward 10,
# and if any of the segments overlap resamples
pick_n_vec <- function(df, n = 10, out = 10) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
vecs <- lapply(points, function(i){i:(i+(out - 1))})
while (max(table(unlist(vecs))) > 1) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
vecs <- lapply(points, function(i){i:(i+(out - 1))})
}
vecs
}
# demonstrate
set.seed(42)
indices <- pick_n_vec(date_data)
for (i in 1:10) {
date_data$sample_id[indices[[i]]] <- i
}
date_data[indices[[1]], ]
#> # A tibble: 10 x 2
#> dates sample_id
#> <date> <int>
#> 1 2015-05-31 1
#> 2 2015-06-01 1
#> 3 2015-06-02 1
#> 4 2015-06-03 1
#> 5 2015-06-04 1
#> 6 2015-06-05 1
#> 7 2015-06-06 1
#> 8 2015-06-07 1
#> 9 2015-06-08 1
#> 10 2015-06-09 1
table(date_data$sample_id)
#>
#> 0 1 2 3 4 5 6 7 8 9 10
#> 111 10 10 10 10 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
A marginally faster version:
pick_n_vec2 <- function(df, n = 10, out = 10) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
while (min(diff(sort(points))) < out) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
}
lapply(points, function(i){i:(i+(out - 1))})
}
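One caveat: the functions above sample row positions, so a window can straddle one of the breaks in the series. A hedged sketch of a gap-aware variant follows. It only accepts start rows whose window covers consecutive calendar days, and it places the samples one at a time instead of rejecting whole draws. `sample_runs` is a hypothetical helper, it assumes the dates are sorted and unique, and the result is random but not uniform over all valid layouts.

```r
# Sketch: gap-aware, non-overlapping runs of `out` consecutive days (base R)
sample_runs <- function(dates, n = 10, out = 10) {
  idx <- seq_len(max(0, length(dates) - out + 1))
  # a start row is valid only if its window spans exactly `out` calendar days
  valid <- idx[as.numeric(dates[idx + out - 1] - dates[idx]) == out - 1]
  starts <- integer(0)
  while (length(starts) < n && length(valid) > 0) {
    s <- valid[sample.int(length(valid), 1)]  # safe even when one candidate is left
    starts <- c(starts, s)
    valid <- valid[abs(valid - s) >= out]     # drop starts that would overlap s
  }
  if (length(starts) < n) stop("could not place all samples; retry or reduce n")
  lapply(starts, function(s) s:(s + out - 1))
}

dates <- c(seq(as.Date("2015-03-22"), as.Date("2015-07-03"), by = "day"),
           seq(as.Date("2015-08-09"), as.Date("2015-10-01"), by = "day"),
           seq(as.Date("2015-11-12"), as.Date("2016-01-03"), by = "day"))

set.seed(1)
runs <- sample_runs(dates, n = 10, out = 10)
sample_id <- integer(length(dates))
for (k in seq_along(runs)) sample_id[runs[[k]]] <- k
table(sample_id)
```

Because each placement only removes the nearby start rows, the loop runs at most n times, which may also help with the larger data and the 20+ samples mentioned in the update.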

Running a function in a for loop and adding the returned dataframe to a list in R

I have a function that takes in one argument as a particular year. This function returns a dataframe. Now I want to create a for loop for a range of years and add these data frames to a list or to combine into a large dataframe.
Will something like this help:
l <- list()
for (year in 2010:2017) {l <- functionX(subset(dataset, Year==year))}
It's not working. The error I get is-
longer object length is not a multiple of shorter object length
I also tried calling the function just as :
functionX(subset(dataset, Year== 2010:2017))
This doesn't work either.
Edit:
I think because the lengths of the data frames for each year are not same, hence they're not getting added. I made a slight change-
for (i in 2010:2017) {
df <- functionX(subset(dataset, Year==i))
l[i] <- df$Name
}
Error:
number of items to replace is not a multiple of replacement length
I'm not trying to replace, but just trying to add elements of a particular dataframe for each year to the list.
I updated your example to make it reproducible. The general idea is as follows: inside the for loop you put your data.frame inside a list. Then you append that list to the big list, l. That way your data.frame becomes an element inside of the list l:
l <- list()
functionX <- function(Year) {
set.seed(Year)
df <- data.frame(year=Year, x=rnorm(10))
return(df)
}
for (year in 2010:2011) {
l <- functionX(year) ## this will not error, but will just overwrite l on every loop
}
l
#> year x
#> 1 2011 -0.65480083
#> 2 2011 -0.02877456
#> 3 2011 -0.19413575
#> 4 2011 -0.90141523
#> 5 2011 1.31329723
#> 6 2011 -0.82243619
#> 7 2011 -0.25875645
#> 8 2011 0.23465318
#> 9 2011 -0.42060734
#> 10 2011 -0.63676356
l <- list()
for (year in 2010:2011) {
new_l <- list(functionX(year)) ## this will put the resulting df as an element in a list
l <- append(l, new_l) ## this appends the lists together
}
l
#> [[1]]
#> year x
#> 1 2010 -0.537472741
#> 2 2010 -0.005191135
#> 3 2010 1.005671811
#> 4 2010 0.214009870
#> 5 2010 -0.201253144
#> 6 2010 1.447430260
#> 7 2010 -0.539834711
#> 8 2010 -1.520636908
#> 9 2010 0.652780491
#> 10 2010 0.613471135
#>
#> [[2]]
#> year x
#> 1 2011 -0.65480083
#> 2 2011 -0.02877456
#> 3 2011 -0.19413575
#> 4 2011 -0.90141523
#> 5 2011 1.31329723
#> 6 2011 -0.82243619
#> 7 2011 -0.25875645
#> 8 2011 0.23465318
#> 9 2011 -0.42060734
#> 10 2011 -0.63676356
Created on 2018-08-02 by the reprex package (v0.2.0.9000).
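The same idea can be written without growing the list inside a loop. The sketch below reuses a stand-in `functionX` like the one above (the real function from the question is not shown, so this is an assumption), collects one data frame per year with lapply(), and then stacks them into one large data frame with do.call(rbind, ...), which is the other thing the question asked for.

```r
# Stand-in for the real function, which is assumed to return one data frame per year
functionX <- function(Year) {
  set.seed(Year)
  data.frame(year = Year, x = rnorm(10))
}

years <- 2010:2017

# one list element per year; [[ ]] would be needed to store a whole object in a loop
l <- lapply(years, functionX)
names(l) <- years

# stack all yearly data frames into a single one
combined <- do.call(rbind, l)
dim(combined)
```

In the edited loop from the question, `l[i] <- df$Name` fails because single brackets expect exactly one list element per index; `l[[as.character(i)]] <- df$Name` stores the whole vector under that year.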
The following code will do what you want.
First, I will create a test dataset, since you have not posted one.
set.seed(527) # make the results reproducible
dataset <- data.frame(Year = sample(2000:2018, 100, TRUE), X = rnorm(100))
Now the function.
functionX <- function(DF, years){
res <- lapply(years, function(y) subset(DF, Year == y))
names(res) <- years
res
}
functionX(dataset, 2010:2017)

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate. Consider an example like this one:
prices <- data.frame(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate<-0.02
for(i in c(1:nrow(prices))){
if(is.na(prices$wti[i]))
prices$wti[i]<-prices$wti[i-1]*(1+i_rate)
if(is.na(prices$brent[i]))
prices$brent[i]<-prices$brent[i-1]*(1+i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs because it uses the "original" ith element of a vector instead of the "revised" ith element. We'd have to first create a version of lag that will do this by creating a new function:
impute_inflation <- function(x, rate) {
output <- x
y <- rep(NA, length = length(x)) #Creating an empty vector to fill in with the loop. This makes R faster to run for vectors with a large number of elements.
for (i in seq_len(length(output))) {
if (i == 1) {
y[i] <- output[i] #To avoid an error attempting to use the 0th element.
} else {
y[i] <- output[i - 1]
}
if (is.na(output[i])) {
output[i] <- y[i] * (1 + rate)
} else {
output[i]
}
}
output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- tibble(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA)) %>%
mutate_at(vars(wti, brent), as.numeric)
prices %>%
mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
tidyr::gather(key = key, value = value, wti, brent) %>%
group_by(key) %>%
mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).
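Since the lag-based versions above leave the second of two consecutive NAs unfilled, here is a hedged, loop-free sketch in base R. It compounds the last observed price by the number of steps since it was observed, which gives the same numbers as filling one row at a time. `impute_inflation_vec` is a hypothetical name, and leading NAs (with no earlier price) are deliberately left as NA.

```r
# Sketch: fill NAs with last observed value * (1 + rate)^(steps since observed)
impute_inflation_vec <- function(x, rate) {
  obs <- !is.na(x)
  # row index of the most recent non-NA value at or before each position (0 if none)
  last_idx <- cummax(ifelse(obs, seq_along(x), 0L))
  gap <- seq_along(x) - last_idx
  fill <- !obs & last_idx > 0
  out <- x
  out[fill] <- x[last_idx[fill]] * (1 + rate)^gap[fill]
  out
}

prices <- data.frame(year  = 2018:2022,
                     wti   = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))

cols <- c("wti", "brent")
prices[cols] <- lapply(prices[cols], impute_inflation_vec, rate = 0.02)
prices
```

For wti this fills 2021 with 90 * 1.02 and 2022 with 90 * 1.02^2, matching the loop in the question.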

Resources