How to change a proportion of values in a data.frame - r

I'm trying to find out how to change the values in a data.frame based on specific proportions. This is an example of the values in the data.frame where class values (with different counts) are grouped by the field "id":
> head(pts)
id class
1 245 10
2 522 10
3 522 10
4 522 10
And this is an example of the proportions:
id class perc%
245 10 100
522 10 50
522 20 50
My objective is to be able to select the values for each "id" and change them according to the "perc%" field, e.g. if I have 100 values for id=522 then change 50 values to class=10 and then 50 values to class=20 (perc%=50).
I've tried subsetting the data.frame or making conditional selections but can't find a way to basically join the "perc%" with the counts of values per "id".
Thanks in advance.

I think I see better what you are trying to do now. The code below may not be very elegant, but it should get the job done. The "df.percentages" data frame represents the second table you described; note that I am renaming "perc%" to "perc", and that the code relies on dplyr.
library(dplyr)

colnames(df.percentages) <- c("id", "class", "perc")

# Sanity check: the percentages for each id should add up to 100
p.check <- df.percentages %>% group_by(id) %>% summarize(sumperc = sum(perc))
not100 <- which(p.check$sumperc != 100)
if (length(not100) != 0) {
  print(paste("ID", p.check$id[not100], "does not add up to 100%"))
}
rm(p.check)

ids <- unique(df.percentages$id)
for (i in 1:length(ids)) {
  print("")
  print(paste("Processing ID:", ids[i]))
  classes.to.reassign <- pts %>% filter(id == ids[i])
  if (nrow(classes.to.reassign) == 0) {
    print(paste("Could not find ID", ids[i], "in the pts dataframe!"))
    next
  }
  # Number of rows to assign to each class for this id
  class.rows <- df.percentages %>%
    filter(id == ids[i]) %>%
    mutate(rows = as.integer(round(nrow(classes.to.reassign) * (perc / 100))))
  if (nrow(classes.to.reassign) < nrow(class.rows)) {
    print(paste("Cannot split", nrow(classes.to.reassign), "classes into",
                nrow(class.rows), "segments for ID:", ids[i]))
    next
  }
  # Rounding can make the per-class row counts over- or undershoot the number of
  # available rows, so absorb the difference in the last class
  if (sum(class.rows$rows) != nrow(classes.to.reassign)) {
    class.rows$rows[nrow(class.rows)] <- class.rows$rows[nrow(class.rows)] +
      (nrow(classes.to.reassign) - sum(class.rows$rows))
  }
  class.rows <- class.rows %>%
    mutate(cumrows = cumsum(as.integer(rows)))
  print(paste("Total rows for ID", ids[i], "=", nrow(classes.to.reassign)))
  cur.row <- 1
  for (c in 1:nrow(class.rows)) {
    last.row <- class.rows$cumrows[c]
    print(paste("Assigning class", class.rows$class[c], "to rows", cur.row, "-", last.row))
    classes.to.reassign$class[cur.row:last.row] <- class.rows$class[c]
    cur.row <- last.row + 1
  }
  if (i == 1) {
    pts.new <- classes.to.reassign
  } else {
    pts.new <- rbind(pts.new, classes.to.reassign)
  }
  rm(classes.to.reassign, class.rows)
}
pts <- pts.new
View(pts)
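If you prefer something more compact, the same per-id splitting can be done in one grouped pass with dplyr::group_modify(). This is only a minimal sketch, assuming df.percentages has the columns id, class and perc as renamed above and that each id's percentages sum to 100:
library(dplyr)

pts.new <- pts %>%
  group_by(id) %>%
  group_modify(function(grp, key) {
    p <- df.percentages %>% filter(id == key$id)
    if (nrow(p) == 0) return(grp)                   # no proportions for this id: leave unchanged
    n <- round(nrow(grp) * p$perc / 100)            # rows to assign to each class
    n[length(n)] <- nrow(grp) - sum(n[-length(n)])  # absorb rounding error in the last class
    grp$class <- rep(p$class, times = n)
    grp
  }) %>%
  ungroup()
The last-class adjustment plays the same role as the rounding check in the loop above.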

Related

Scraping table in RVest when there are multiple rows that span columns

I am trying to scrape the following webpage: https://rptsvr1.tea.texas.gov/cgi/sas/broker?_service=marykay&_program=sfadhoc.budget_report_2022.sas&_service=appserv&_debug=0&who_box=&who_list=031901
Initially, I tried this code:
library(rvest)
library(tidyverse)
webpage = read_html("https://rptsvr1.tea.texas.gov/cgi/sas/broker?_service=marykay&_program=sfadhoc.budget_report_2022.sas&_service=appserv&_debug=0&who_box=&who_list=101912")
tables <- html_nodes(webpage, "table") %>%
html_table()
budget = tables[[2]]
Then I realized that the results were messy, because html_table() cannot read tables where rows span multiple columns. I could clean the dataframe up after scraping it, but I'm wondering if perhaps there's a way to scrape it that avoids the issue in the first place.
I read the answers to similar questions, but they all involved a situation where only the header rows spanned multiple columns, or only the first column spanned multiple rows. In this case, this table is made up of multiple tables squished together, so it's like there are headers all throughout the table. Is there a solution that can deal with rows spanning columns throughout the table?
One way could be as follows:
Decide on your final format. I have selected a flat file format, which entails splitting some header values (those with merged cells) into separate new columns and repeating those values down the rows
Gather all the rows of data and loop over them
In the loop, process each row according to a set of rules
Generate the output row in the loop and write it to a pre-sized list
Drop the empty entries from the list
Convert to a data.frame and add any additional info wanted
The rules I chose to apply were based on looking at the first column of each row, as follows (pseudo-code):
if (first column className contains "linecontent") {
if (first column has br and span child elements) {
split column text, after trimming, on line break
assign split to red and measure columns (re-use as filldown)
} else {
assign measure the trimmed column text
}
} else {
we are on a non-header row so grab the $ and % values by matching on class "data"
}
R:
library(tidyverse)
library(rvest)
library(httr2)
page <- request("https://rptsvr1.tea.texas.gov/cgi/sas/broker?_service=marykay&_program=sfadhoc.budget_report_2022.sas&_service=appserv&_debug=0&who_box=&who_list=031901") %>%
req_headers(
"user-agent" = "Mozilla/5.0",
"accept" = "text/html",
"connection" = "keep-alive"
) %>%
req_perform() %>%
resp_body_html()
rows <- page %>% html_elements(".table tbody tr")
row_data <- vector("list", length(rows))
row <- 1
for (row_node in rows) {
  first_column_node <- row_node %>% html_element("td")
  first_column_node_classname <- html_attr(first_column_node, "class")
  is_new_column <- if_else(grepl("linecontent", first_column_node_classname), 1, 0)
  column_needs_split <- if_else((length(html_elements(first_column_node, "br")) > 0 &
                                   !is.na(html_element(first_column_node, "span"))), 1, 0)
  node_text <- first_column_node %>% html_text2()
  data <- NULL
  if (is_new_column) {
    if (column_needs_split) {
      new_values <- trimws(str_split_1(trimws(node_text), "\\n"))
      red <- new_values[[1]]
      measure <- new_values[[2]]
    } else {
      measure <- trimws(node_text)
    }
  } else {
    data <- row_node %>%
      html_elements(".data") %>%
      html_text2()
  }
  if (!is.null(data)) {
    row_data[[row]] <- c(c(red, measure), data)
    # print(c(c(red, measure), data))
    row <- row + 1
  }
}
row_data <- discard(row_data, is.null)
df <- do.call(rbind, row_data) %>% as.data.frame()
colnames(df) <- c(
"red", "category", "measure",
"gen_fund", "gen_fund_perc", "gen_fund_per_student",
"all_fund", "all_fund_perc", "all_fund_per_student"
)
summary_info <- page %>% html_element('.c.systemtitle') %>% html_text(trim = T)
additional_info <- str_match_all(
gsub("\\n", "", summary_info),
"(\\d{4}\\s-\\s\\d{4}).*Totals for (.*?)\\sISD.*?\\((\\d{6})\\)"
)
df$year <- additional_info[[1]][, 2]
df$district <- additional_info[[1]][, 3]
df$isd_code <- additional_info[[1]][, 4]
head(df)
Sample output:
> head(df)
red category measure gen_fund gen_fund_perc
1 Revenues Operating Revenue Local Property Tax from M&O (excluding recapture) $70,019,020 15.03%
2 Revenues Operating Revenue State Operating Funds $333,999,269 71.68%
3 Revenues Operating Revenue Federal Funds $59,326,937 12.73%
4 Revenues Operating Revenue Other Local $2,644,317 0.57%
5 Revenues Operating Revenue Total Operating Revenue $465,989,543 100.00%
6 Revenues Other Revenue Local Property Tax from I&S $0 0.00%
gen_fund_per_student all_fund all_fund_perc all_fund_per_student year district isd_code
1 $1,823 $70,019,020 15.03% $1,823 2021 - 2022 BROWNSVILLE 031901
2 $8,695 $333,999,269 71.68% $8,695 2021 - 2022 BROWNSVILLE 031901
3 $1,544 $59,326,937 12.73% $1,544 2021 - 2022 BROWNSVILLE 031901
4 $69 $2,644,317 0.57% $69 2021 - 2022 BROWNSVILLE 031901
5 $12,131 $465,989,543 100.00% $12,131 2021 - 2022 BROWNSVILLE 031901
6 $0 $9,212,992 17.23% $240 2021 - 2022 BROWNSVILLE 031901
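If you would rather clean up the html_table() output after the fact instead of walking the rows, the merged "header" rows can also be carried down onto the data rows with tidyr::fill(). A rough sketch of that idea, assuming budget is the messy data frame from the question and that X1/X2 are whatever names html_table() gave its first two columns (adjust to your actual column names):
library(dplyr)
library(tidyr)

budget_clean <- budget %>%
  # header rows have text in the first column but nothing in the second
  mutate(section = if_else(is.na(X2) | X2 == "", X1, NA_character_)) %>%
  fill(section, .direction = "down") %>%   # carry the section label down to its data rows
  filter(!(is.na(X2) | X2 == ""))          # drop the header-only rows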

R - Sum total time of multiple overlapping and/or discontinuous periods

For example, these are the days certain types of roles are present in an office:
type  day_in  day_out
A     1       10
A     5       15
A     31      35
B     5       15
C     10      20
C     45      55
D     41      50
I want the number of days the office is occupied. There is a continuous office presence from days 1 to 20, 31 to 35, and 41 to 55, so the answer I want is 40 days.
I have a solution based on pivoting the data and setting flags on the days when the state switches between occupied and unoccupied, using a for loop to cycle through each row. But I came to this solution reluctantly after failing to work out a vectorized approach.
Is there a vectorized way to do the operation from my for loop? Any ideas for different algorithms would also be welcome.
My solution with example data is below:
library(dplyr)
library(tidyr)
df_raw <- read.table(
header = TRUE,
text = "
type day_in day_out
A 1 10
A 5 15
A 31 35
B 5 15
C 10 20
C 45 55
D 41 50
"
)
# occupancy from day 1 to 20, 31 to 35 & 41 to 55 = 40 days
# Unoccupied for 15 days
df <- df_raw %>%
tidyr::pivot_longer(cols = c(day_in, day_out), names_to = "in_out", values_to = "day") %>%
arrange(day)
# Create these columns to prevent warning "Unknown or uninitialised column" later
df$current_types <- NA
df$flag <- NA
# Loop to create flags on day when occupancy switches from occupied to unoccupied or vice-versa
for (rown in 1:nrow(df)) {
  df$current_types[rown] <- if (rown == 1) {
    df$type[rown]
  } else {
    if (df$in_out[rown] == "day_in") {
      paste(df$current_types[rown - 1], df$type[rown], collapse = " ")
    } else {
      trimws(gsub(paste0("\\s?", df$type[rown], "\\s?"), " ", df$current_types[rown - 1]))
    }
  }
  # if there are no current type then unoccupied. It may or may not be occupied again afterwards.
  df$flag[rown] <- if (rown == 1 | (df$in_out[rown] == "day_out" & nchar(df$current_types[rown]) == 0)) {
    1
  } else {
    if (df$in_out[rown] == "day_in" & nchar(df$current_types[rown - 1]) == 0) 1 else 0
  }
}
# Then filter the flags, "pivot" to get each occupancy start and end in one row and sum the total days occupied
df %>%
filter(flag == 1) %>%
mutate(
start = if_else(in_out == "day_out" & lag(in_out) == "day_in", dplyr::lag(day), NULL),
stop = if_else(in_out == "day_out", day, NULL)
) %>%
filter(in_out == "day_out") %>%
summarise(days_occupied = sum(stop - start + 1))
You can generate day sequences for each role and count the number of unique days:
length(unique(unlist(apply(df_raw[, c('day_in', 'day_out')],
1,
function(x) seq(x[1], x[2])))))
Or using pipes:
df_raw[, c('day_in', 'day_out')] %>%
apply(1, function(x) seq(x[1], x[2])) %>%
unlist %>%
unique %>%
length
Another simple solution would be to create a vector with the size of your timespan and flag all occupied days and count them afterwards.
df <- data.frame(
type = c("A","A","A","B","C","C","D"),
day_in = c(1,5,31,5,10,45,41),
day_out = c(10,15,35,15,20,55,50))
occupation <- rep(0, max(df$day_out))
for(i in 1:nrow(df)){
occupation[df[i,'day_in']:df[i,'day_out']] <- 1
}
# 40
sum(occupation)
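If the day ranges were much longer, expanding every sequence or flagging every day could get expensive. A fully vectorised alternative is the usual interval-union trick: sort by start day, keep a running maximum of the end days, and start a new block whenever an interval begins after everything seen so far has ended. A sketch with dplyr, using df_raw from the question:
library(dplyr)

df_raw %>%
  arrange(day_in) %>%
  mutate(grp = cumsum(day_in > lag(cummax(day_out), default = day_in[1]))) %>%
  group_by(grp) %>%
  summarise(days = max(day_out) - min(day_in) + 1) %>%   # length of each occupied block
  summarise(days_occupied = sum(days))
# days_occupied = 40 for the example data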

How to efficiently match offerlines based on a rolling window in R

I currently have 600,000+ offerlines, which I want to efficiently match based on the product bought and the timeframe.
By timeframe I mean that, from the base line, I look at all offerlines that are at most 10 days before or 10 days after the base line. Everything within that window with the same product should be matched.
However, it is very time-expensive: after running it for a complete night, I only got to line 45,000.
I know parallelism is one option, but I want to know if there are better ways (packages, functions, logic).
Input data
Offerline n°,Customer n°,Offerdate,Product
(we clean to 1 offerline n° per day per custno, for a certain product)
Logic => match lines with same product, different Customer n°
Desired output
Base customer, Related Customer, Offerline n°, Matched Offerline n°, Product, Offerdate base, Offerdate matched line.
Current code
for (i in 1:nrow(X)) {
  sku <- X[i, ]$product
  date <- X[i, ]$order.offer_date
  cust <- X[i, ]$customer_code
  oon <- X[i, ]$order.offer_number
  F <- data.frame()
  F <- X %>%
    filter(product == (X[i, ]$product) &
             (order.offer_date <= date + 10 & order.offer_date >= date - 10) &
             customer_code != cust)
  if (nrow(F) == 0) {
    next
  } else {
    for (j in 1:nrow(F)) {
      skuc <- F[j, ]$product
      datec <- F[j, ]$order.offer_date
      custc <- F[j, ]$customer_code
      oonc <- F[j, ]$order.offer_number
      if (custc == cust | oon == oonc) {
        next
      } else if (skuc != sku) {
        next
      } else if (skuc == sku) {
        if (datec <= date + 10 & datec >= date - 10) {
          z <- z + 1
          Y[z, ]$count <- j
          Y[z, ]$base <- oon
          Y[z, ]$related <- oonc
          Y[z, ]$baseSku <- sku
          Y[z, ]$relSku <- skuc
          Y[z, ]$basedate <- as.Date(date)
          Y[z, ]$reldate <- as.Date(datec)
          Y[z, ]$basecust <- cust
          Y[z, ]$relcust <- custc
        } else {
          next
        }
      }
    }
    next
  }
}
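One alternative to the double loop is to express the ±10-day, same-product matching as a single non-equi self-join. A sketch with data.table, assuming X has the columns used above (product, order.offer_date of class Date, customer_code, order.offer_number):
library(data.table)

DT <- as.data.table(X)
DT[, `:=`(window_start = order.offer_date - 10,
          window_end   = order.offer_date + 10)]

# For every base line (i.*) find all lines of the same product whose offer date
# falls inside the base line's +/- 10 day window.
matched <- DT[DT,
              on = .(product, order.offer_date >= window_start, order.offer_date <= window_end),
              allow.cartesian = TRUE,
              .(base_cust  = i.customer_code,
                rel_cust   = x.customer_code,
                base_offer = i.order.offer_number,
                rel_offer  = x.order.offer_number,
                product    = x.product,
                base_date  = i.order.offer_date,
                rel_date   = x.order.offer_date)]

# Keep only pairs with different customers and different offer lines, as in the loop
matched <- matched[base_cust != rel_cust & base_offer != rel_offer]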

cbind 1:nrows of same ID variable value to original data.frame

I have a large dataframe, where a variable id (first column) recurs with different values in the second column. My idea is to order the dataframe, split it into a list, and then lapply a function which cbinds the sequence 1:n (where n is the number of rows per id) to each group. My code so far:
DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF <- splitstackshape::getanID(DF, "cell")[] # thanks to akrun's answer
ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = cell)) +
geom_line(aes(group = cell)) + geom_point(size=0.1)
}
And the plot looks like this (image not reproduced here): most cells increase in area; only some decrease. This is only a first try to visualize my data, so what you can't see very well is that the areas drop down periodically due to cell division.
Additional question:
There is a problem I didn't take into account beforehand, which is that after a cell division a new cell is added to the data.frame and is handed the initial index 1 (you can see in the plot that all cells start from .id=1, not later), which is not what I want: it needs to inherit the index of its creation time. The first thing that comes to mind is that I could use a parsing mechanism that does this job for a newly added cell variable:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition which may ease the problem: fixed number of cells at the beginning (32). Another solution would be to cut away all data before the last daughter cell is created.
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L # Indexing
title <- getwd()
myplot <- ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = factor(cell))) +
geom_line(aes(group = cell)) + geom_line(size=0.1) + theme(legend.position="none") + ggtitle(title)
#save the plot
image=myplot
ggsave(file="cell_areas_time.svg", plot=image, width=10, height=8)
}
We can use getanID from splitstackshape
library(splitstackshape)
getanID(DF, "cell")[]
There's a much easier method to accomplish that goal. Use ave with seq.int
DF$group_seq <- ave(DF[,1], DF[,1], FUN = function(x) seq.int(length(x)))
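The same per-group counter can also be written with dplyr, if you are already using it elsewhere; a small sketch:
library(dplyr)

DF <- DF %>%
  group_by(cell) %>%
  mutate(counter = row_number()) %>%  # 1, 2, 3, ... within each cell
  ungroup()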

Assigning data to a vector of dates with unevenly spaced events

I am sorry for the cryptic title but I didn't know how to adequately summarise my problem. So here's my question. I have a data frame with dates and a name for several entities:
df <- data.frame(
time=rep(as.Date(seq(as.Date("2004/1/1"), as.Date("2005/12/1"), by = "1 month ")),2),
name=c(rep("a",24),rep("b",24))
)
str(df)
'data.frame': 48 obs. of 2 variables:
$ time: Date, format: "2004-01-01" "2004-02-01" ...
$ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
And I have another dataframe with several unevenly spaced events:
events <- data.frame(
time = c("2004-12-1", "2005-8-1", "2005-6-1", "2004-4-1"),
event = c("normal", "extraordinary", "normal", "extraordinary"),
name = c("a", "a", "b", "b")
)
I want to merge these two data frames so that each row gets the event assigned either from the beginning of the data set up to the event, or from the last event up to the next event or the end of the data set. This would look something like:
date name event
2004-01-01 a normal
2004-02-01 a normal
...
2004-12-01 a extraordinary
2005-01-01 a extraordinary
Is there an easy way of doing this in R that I don't see, or do I have to merge these by hand? Thank you very much for your help!
I don't know of any function to do this, but here is some R code to do it yourself:
# Needed type coercions (Date for comparisons, characters to avoid 'factor' problems)
events$time <- as.Date(events$time)
events$event <- as.character(events$event)
events$name <- as.character(events$name)
df$name <- as.character(df$name)
# Events ordering (needed to detect previous events as non NA)
events <- events[ order(events$time) ,]
# Updates
df$event = NA
for(i in 1:nrow(events)) {
# Update where time is lesser than the limit, if names correspond and if an event was not already assigned to the row
df[ df$time <= events[i,"time"] & df$name == events[i,"name"] & is.na(df$event) , "event" ] = events[i,"event"]
}
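The per-event loop above can also be replaced by a single findInterval() lookup per name, which scales better when there are many events. A minimal sketch that reuses the coerced and sorted events from above and keeps the same convention (a row dated exactly on an event gets that event; rows after the last event stay NA):
df$event <- NA_character_
for (nm in unique(df$name)) {
  ev  <- events[events$name == nm, ]
  sel <- df$name == nm
  # index of the first event on or after each date; dates after the last event
  # get an out-of-range index and therefore NA
  idx <- findInterval(df$time[sel], ev$time, left.open = TRUE) + 1
  df$event[sel] <- ev$event[idx]
}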
Here is a function to do what you want:
event.aligning <- function(time.dataframe, events){
  if(!class(events[["time"]]) == 'Date'){
    events[["time"]] <- as.Date(events[["time"]])
  }
  ## let's sort on time
  events <- events[order(events[["time"]]),]
  ## setup event column (define the possible event types from the events table
  ## before they are used)
  event.types <- levels(factor(events$event))
  time.dataframe$event <- NA
  time.dataframe$event <- as.factor(time.dataframe$event)
  levels(time.dataframe$event) <- event.types
  rownames.tdf <- rownames(time.dataframe)
  res.time.dataframe <- NULL
  for(i in 1:length(levels(events$name))){
    i.name <- levels(events$name)[i]
    i.name.events <- subset(events, name == i.name)
    first.time <- time.dataframe$time[time.dataframe$name == i.name][1]
    first.event <- i.name.events$time[1]
    ## assume 2 events
    first.event.type <- i.name.events$event[1]
    second.event.type <- unique(i.name.events$event[i.name.events$event != first.event.type])
    sub.time.df <- time.dataframe[time.dataframe$name == i.name,]
    rownames(sub.time.df) <- 1:length(sub.time.df[,1])
    sub.time.df[1:(as.numeric(rownames(sub.time.df[sub.time.df$time == first.event,])) - 1),]$event <- second.event.type
    cur.event <- first.event
    for(j in 2:length(i.name.events[,1])){
      next.event <- i.name.events$time[j]
      sub.time.df[rownames(sub.time.df[sub.time.df[["time"]] == cur.event,]) :
                    (as.numeric(rownames(sub.time.df[sub.time.df[["time"]] == next.event,])) - 1),]$event <- i.name.events$event[j-1]
      cur.event <- next.event
      next.event.type <- i.name.events$event[j]
    }
    last.time <- sub.time.df$time[length(sub.time.df$time)]
    last.event <- i.name.events$time[length(i.name.events$time)]
    sub.time.df[rownames(sub.time.df[sub.time.df$time == last.event,]):length(sub.time.df$time),]$event <- next.event.type
    res.time.dataframe <- rbind(res.time.dataframe, sub.time.df)
  }
  rownames(res.time.dataframe) <- rownames.tdf
  return(res.time.dataframe)
}
df2 <- event.aligning(df, events)
