I have a dataset of over 300K rows spanning more than 20 years. I'm trying to create a Load Duration Curve for each of the XX years, i.e., the MW used in every hour of the year (8,760 hours per year, or 8,784 in a leap year). Currently I make a new dataframe by filtering by year, reorder it by descending MW (descending order for the curve), and then create another column matching the row order so I can use that column as a placeholder for the x-axis. This seems pretty inefficient and could be difficult to update if needed (see the playground code below for what I've been doing). I also don't want to use facet_wrap() because the resulting graphs are too small for what is needed.
Dummy_file:
Where hrxhr is the running total of hours in a given year.
YEAR | MONTH | DAY | HOUR OF DAY | MW   | Month_num | Date       | Date1 | hrxhr
2023 | Dec   | 31  | 22          | 2416 | 12        | 2023-12-31 | 365   | 8758
2023 | Dec   | 31  | 23          | 2412 | 12        | 2023-12-31 | 365   | 8759
2023 | Dec   | 31  | 24          | 2400 | 12        | 2023-12-31 | 365   | 8760
2024 | Jan   | 01  | 1           | 2271 | 12        | 2024-01-01 | 1     | 1
2023 | Jan   | 01  | 2           | 2264 | 12        | 2024-01-01 | 1     | 2
### ------------- Load packages ------------ ###
library(readr)        # read_csv()
library(plotly)       # plot_ly(), layout(), %>%
library(htmlwidgets)  # saveWidget()
### ------------ Load in source ------------ ###
dummy_file <- 'Dummydata.csv'
forecast_df <- read_csv(dummy_file)
### ---- Order df by MW (load) and YEAR ---- ###
ordered_df <- forecast_df[order(forecast_df$MW, decreasing = TRUE), ]
ordered_df <- ordered_df[order(ordered_df$YEAR, decreasing = FALSE), ]
### -------------- Playground -------------- ###
## Create a dataframe for the forecast for calendar year 2023
cy23_df <- ordered_df[ordered_df$YEAR == 2023,]
## Add placeholder column for graphing purposes (add order number)
cy23_df$placeholder <- row.names(cy23_df)
## Check df structure and change columns as needed
str(cy23_df)
# Change placeholder column from character to numeric for graphing purposes
cy23_df$placeholder <- as.numeric(cy23_df$placeholder)
# Check if changed correctly
class(cy23_df$placeholder) #YES
## Load duration curve - Interactive
LF_cy23_LDC <- plot_ly(cy23_df,
                       x = ~placeholder,
                       y = ~MW,
                       type = 'scatter',
                       mode = 'lines',
                       hoverinfo = 'text',
                       text = paste("Megawatts: ", cy23_df$MW,
                                    "Date: ", cy23_df$MONTH, cy23_df$DAY,
                                    "Hour: ", cy23_df$hrxhr)) %>%
  layout(title = 'CY2023 Load Forecast - LDC')
# "Hour: ", orderby_MW$yrhour))
saveWidget(LF_cy23_LDC, "cy23_LDC.html")
Current output for CY2023: the y-axis is megawatts used (MW) and the x-axis is the placeholder column. I then just repeat the playground code for the rest of the years, changing 2023 to 2024, then 2025, and so on.
Sorry if this is a long post, tmi, or not enough information. I'm fairly new to R and this community. Many thanks for your help!
Simply generalize your playground process into a user-defined method, then iterate through the years with lapply.
# USER DEFINED METHOD TO RUN A SINGLE YEAR
build_year_plot <- function(year) {
  ### -------------- Playground -------------- ###
  ## Create a dataframe for the forecast for calendar year
  cy_df <- ordered_df[ordered_df$YEAR == year, ]
  ## Add placeholder column for graphing purposes (add order number)
  cy_df$placeholder <- row.names(cy_df)
  ## Check df structure and change columns as needed
  str(cy_df)
  # Change placeholder column from character to numeric for graphing purposes
  cy_df$placeholder <- as.numeric(cy_df$placeholder)
  # Check if changed correctly
  class(cy_df$placeholder) #YES
  ## Load duration curve - Interactive
  LF_cy_LDC <- plot_ly(
    cy_df, x = ~placeholder, y = ~MW, type = 'scatter',
    mode = 'lines', hoverinfo = 'text',
    text = paste(
      "Megawatts: ", cy_df$MW,
      "Date: ", cy_df$MONTH, cy_df$DAY,
      "Hour: ", cy_df$hrxhr
    )
  ) %>% layout( # MAGRITTR PIPE (BASE R 4.1.0+ |> ALSO WORKS)
    title = paste0('CY', year, ' Load Forecast - LDC')
  )
  saveWidget(LF_cy_LDC, paste0("cy", year - 2000, "_LDC.html"))
  return(LF_cy_LDC)
}
# CALLER TO RUN THROUGH SEVERAL YEARS
LF_cy_plots <- lapply(2023:2025, build_year_plot)
Consider also by() (an object-oriented wrapper to tapply, roughly equivalent to split + lapply) to avoid indexing by year. Notice the input parameter change below and the variables used in the title and filename:
# USER DEFINED METHOD TO RUN A SINGLE DATA FRAME
build_year_plot <- function(cy_df) {
  ### -------------- Playground -------------- ###
  ## Add placeholder column for graphing purposes (add order number)
  cy_df$placeholder <- row.names(cy_df)
  ...SAME AS ABOVE...
  ) %>% layout(
    title = paste0('CY', cy_df$YEAR[1], ' Load Forecast - LDC')
  )
  saveWidget(LF_cy_LDC, paste0("cy", cy_df$YEAR[1] - 2000, "_LDC.html"))
  return(LF_cy_LDC)
}
# CALLER TO RUN THROUGH SEVERAL YEARS
LF_cy_plots <- by(ordered_df, ordered_df$YEAR, build_year_plot)
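For reference, the split + lapply form that by() is roughly equivalent to, reusing the same ordered_df and the data-frame version of build_year_plot above:
# SPLIT BY YEAR, THEN APPLY THE PLOT BUILDER TO EACH YEARLY DATA FRAME
LF_cy_plots <- lapply(split(ordered_df, ordered_df$YEAR), build_year_plot)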
The tidyverse counterparts would be purrr::map:
# METHOD RECEIVES YEAR (lapply counterpart)
LF_cy_plots <- purrr::map(2023:2025, build_year_plot)
# METHOD RECEIVES DATA FRAME (by counterpart)
LF_cy_plots <- ordered_df %>%
  split(.$YEAR) %>%
  purrr::map(build_year_plot)
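As an aside, if the main concern is rebuilding the placeholder column one year at a time, it can also be computed for every year in a single pass with dplyr; a minimal sketch, assuming the YEAR and MW columns from the dummy file (ldc_df is just an illustrative name):
library(dplyr)
# Rank hours within each year by descending MW; row_number() becomes the x-axis placeholder
ldc_df <- forecast_df %>%
  group_by(YEAR) %>%
  arrange(desc(MW), .by_group = TRUE) %>%
  mutate(placeholder = row_number()) %>%
  ungroup()
split(ldc_df, ldc_df$YEAR) could then feed the same lapply / purrr::map call.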
My code is meant to order a table called Football (imported with read.csv2) and then, using a for loop, go through the data and return the row numbers of the start year and end year.
Football[order(Football$Year),]
start_year <- min(Football$Year)
end_year <- max(Football$Year)
for (i in 1:nrow(Football)
{
if (Football$Year[i] = start_year)
{
row_of_start <- i
}
if (Football$Year[i] = end_year)
{
row_of_end <- i
}
}
This produces the following error:
> if (Football$Year[1] = start_year) row_of_start <- 1
Error: unexpected '=' in "if (Football$Year[1] ="
I appreciate there are probably ways of doing this without a for loop (which I would be glad to learn), although I would also like to know how to make the for loop work, to further my understanding.
You can skip the loop entirely using which(). This will usually be faster and more legible:
# Create example data
set.seed(123)
Football <- data.frame(Year = sample(1990:2000, size = 10),
foo = sample(letters, size = 10))
# Sort the data as you have done
Football_sort <- Football[order(Football$Year), ]
# Get the row numbers of the min and max (start and end years)
which(with(Football_sort, Year == min(Year)))
#> [1] 1
which(with(Football_sort, Year == max(Year)))
#> [1] 10
Depending upon what you actually want to do, you can skip the ordering step as well. Both of the below depend upon the dplyr package to work.
If you just want the start and end year rows rather than their row numbers:
library(dplyr)
Football %>%
filter(Year %in% c(min(Year), max(Year)))
#> Year foo
#> 1 2000 e
#> 2 1990 d
If you want the "year number" of the start and end year:
Football %>%
summarise(start_year = 1,
end_year = max(Year) - min(Year))
#> start_year end_year
#> 1 1 10
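As for making the loop itself run: the error comes from using = (assignment) where == (comparison) is needed inside the if conditions, and the for line is missing its closing parenthesis. Note also that the ordering on the first line isn't kept unless you assign it back, e.g. Football <- Football[order(Football$Year), ]. A corrected version of the loop:
for (i in 1:nrow(Football)) {
  if (Football$Year[i] == start_year) {
    row_of_start <- i
  }
  if (Football$Year[i] == end_year) {
    row_of_end <- i
  }
}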
I am creating a bar graph with continuous x-labels of 'Fiscal Years', such as "2009/10", "2010/11", etc. I have a column in my dataset with a specific Fiscal Year that I would like the x-labels to begin at (see the example data below). Then, I would like the x-labels to cover every continuous Fiscal Year until the present; the last x-label should be "2018/19". When I try to set the limits with scale_x_continuous, I receive the error Error: Discrete value supplied to continuous scale. However, if I use scale_x_discrete, I get a graph with only two bars: my chosen "Start" date and the "End" of 2018/19.
Start<-Project_x$Start[c(1)]
End<-"2018/2019"
ggplot(Project_x, (aes(x=`FY`, y=Amount)), na.rm=TRUE)+
geom_bar(stat="identity", position="stack")+
scale_x_continuous(limits = c(Start,End))
Error: Discrete value supplied to continuous scale
Thank you.
My data is:
df <- data.frame(Project = c(5, 6, 5, 5, 9, 5),
                 FY = c("2010/11", "2017/18", "2012/13", "2011/12", "2003/04", "2000/01"),
                 Start = c("2010/11", "2011/12", "2010/11", "2010/11", "2001/02", "2010/11"),
                 Amount = c(500, 502, 788, 100, 78, NA))
To use the code in the answer below, I need to base my Start_Year off of my Start column rather than the FY column, and the graph should just be for Project #5.
as.tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start)))
xlabel_start<-subset(df$Start_Year, Project == 5)
xlabel_end<-2018
filter(between(Start_Year,xlabel_start,xlabel_end)) %>%
ggplot(aes(x = FY, y = Amount))+
geom_col()
When running this, my xlabel_start is NULL.
In ggplot, a continuous scale is meant for numerical values. Here, your fiscal years are in character (or factor) format, so they are treated as discrete values and are sorted alphabetically by ggplot2.
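A minimal illustration of the error, using a small made-up data frame with a character x variable:
library(ggplot2)
d <- data.frame(FY = c("2010/11", "2011/12"), Amount = c(1, 2))
# character x mapped onto an explicitly continuous scale fails when the plot is drawn
ggplot(d, aes(x = FY, y = Amount)) +
  geom_col() +
  scale_x_continuous(limits = c("2010/11", "2011/12"))
# Error: Discrete value supplied to continuous scale (wording may vary by ggplot2 version)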
One possible solution to get your expected plot is to create a new variable containing the starting year of the fiscal year and filter for values between 2010 and 2018.
But first, we are going to isolate the project and the starting year of interest by creating a new dataframe:
library(dplyr)
xlabel_start <- as.tibble(df) %>%
  mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start))) %>%
  distinct(Project, Start_Year) %>%
  filter(Project == 5)
# A tibble: 1 x 2
Project Start_Year
<dbl> <dbl>
1 5 2010
Now, using almost the same pipeline, we can isolate the values of interest by doing:
library(tidyverse)
as.tibble(df) %>%
  mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
  filter(Project == 5 & between(Year, xlabel_start$Start_Year, xlabel_end))
# A tibble: 3 x 5
Project FY Start Amount Year
<dbl> <fct> <fct> <dbl> <dbl>
1 5 2010/11 2010/11 500 2010
2 5 2012/13 2010/11 788 2012
3 5 2011/12 2010/11 100 2011
And once you have done this, you can simply add the ggplot plotting part at the end of this pipe sequence:
library(tidyverse)
as.tibble(df) %>%
  mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
  filter(Project == 5 & between(Year, xlabel_start$Start_Year, xlabel_end)) %>%
  ggplot(aes(x = FY, y = Amount)) +
  geom_col()
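If every fiscal year from the start through 2018/19 should appear on the x-axis even when it has no data, one further option (a sketch, not part of the code above; all_fy is just an illustrative name) is to make FY a factor whose levels span the full range and keep the empty levels with scale_x_discrete(drop = FALSE):
all_fy <- paste0(2010:2018, "/", 11:19)   # "2010/11" ... "2018/19"; adjust to match xlabel_start/xlabel_end
as.tibble(df) %>%
  mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
  filter(Project == 5 & between(Year, xlabel_start$Start_Year, xlabel_end)) %>%
  mutate(FY = factor(FY, levels = all_fy)) %>%
  ggplot(aes(x = FY, y = Amount)) +
  geom_col() +
  scale_x_discrete(drop = FALSE)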
Does it answer your question?
I have a raw text file in the following format:
RELEASE VERSION: 20150514 (May 14, 2015)
======================================================================== VERSION
STUDY VARIABLE: Version Number Of Release
QUESTION:
--------- Version of Cumulative Data File
NOTES:
------ This variable appears in the data as:
ANES_cdf_VERSION:YYYY-mmm-DD where mmm is standard 3-character month abbreviation (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec).
TYPE:
----- Character-1
======================================================================== VCF0004
STUDY VARIABLE: Year of Study
QUESTION:
--------- Year of study (4-digit)
TYPE:
----- Numeric Dec 0-1
===================================================================== VCF0006
...
and so on
Observations are bounded by a row of "=" characters, and each observation has some number of variables (not all may be present).
I am trying to create a data table out of it.
I created a vector of observations; within each observation, columns are separated by '|'. Then I use fread to make a data table:
dt <- fread(paste(rawObs, collapse = '\n'),sep = '|',header = F, fill = T)
However, this is not really a solution: fill = T only handles missing columns at the end of an observation, not in between. In the example above, the result should be:
id | study_var | question | notes | type
version | s1 | q1 | notes1 | character-1
VCF0004 | s2 | q2 | NA | numeric
But R creates it as
id | study_var | question | notes | type
version | s1 | q1 | notes1 | character-1
VCF0004 | s2 | q2 | numeric | NA
The type of the second observation is shifted leftward. As a solution, I was thinking of determining the missing columns within each observation and inserting NAs explicitly into the input file, using the maximum number of variables found, but that might be slow for large files.
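Here is a small reproducible example of that shift, with a made-up three-column input:
library(data.table)
# the second data row is missing its middle field; fill = TRUE only pads on the right,
# so "numeric" lands under notes and type becomes NA
fread("id|notes|type\nversion|notes1|character-1\nVCF0004|numeric\n",
      sep = "|", header = TRUE, fill = TRUE)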
Thanks for help. Any comments are appreciated.
Here is all code:
library(magrittr)
library(data.table)
path <- 'Downloads/anes_timeseries_cdf_codebook_var.txt'
raw_data <- readLines(path)
head(raw_data)
#remove empty lines
raw_data <- raw_data[raw_data != ""]
#remove header
raw_data <- raw_data[-c(1,2)]
data_entries_index <- grep('^=+', raw_data)+1
#add end position of the last observation
data_entries_index <- c(data_entries_index, length(raw_data))
#opening file shows editor couldn't read two characters - we can ignore it though
data_entries_index
parseRawObservation <- function(singleRawObs, VariableIndex){
  count = length(VariableIndex) - 1
  for (i in 1:count){
    start = VariableIndex[i] + 2
    end = VariableIndex[i+1] - 1
    varValue <- paste(singleRawObs[start:end], collapse = ' ')
    if (i == 1)
      obsSpaced <- varValue
    else
      obsSpaced <- paste(obsSpaced, varValue, sep = '|')
  }
  obsSpaced
}
#create a vector of raw observations
numObs <- length(data_entries_index)
count = numObs - 1
rawObs = vector()
for (i in 1:count) {
  start <- data_entries_index[i]
  end <- data_entries_index[i+1] - 2
  singleRawObs <- raw_data[start:end]
  VariableIndex <- grep("^-+", singleRawObs) - 1
  #add end of the last variable index
  VariableIndex <- c(VariableIndex, length(singleRawObs) + 1)
  rawObs[i] <- parseRawObservation(singleRawObs, VariableIndex)
  #add first two columns separately as they do not have dashes at the next line
  rawObs[i] <- paste(singleRawObs[1], singleRawObs[2], rawObs[i], sep = '|')
}
#determine max number of fields
numOfCol <- max(sapply(rawObs, FUN = function(x) length(strsplit(x,'|')[[1]])))
which.max(sapply(rawObs, FUN = function(x) length(strsplit(x,'|')[[1]])))
dt <- fread(textConnection(rawObs),sep = '|',header = F)
dt <- fread(paste(rawObs[1:2], collapse = '\n'),sep = '|',header = F, fill = T)
rawObs[653]
There is a handy alternative for reading files like this one: read.dcf().
read.dcf() reads files in Debian Control Format (DCF) which consist of regular lines of form tag:value. Records are separated by one or more empty lines.
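For illustration, a minimal two-record input in that tag:value form (using the id: tag introduced below) and how read.dcf() reads it:
dcf_example <- "id: VERSION\nSTUDY VARIABLE: Version Number Of Release\n\nid: VCF0004\nSTUDY VARIABLE: Year of Study\n"
# returns a character matrix with one row per record and one column per tag (id, STUDY VARIABLE)
read.dcf(textConnection(dcf_example))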
However, the input file needs to be modified to conform with the DCF format (plus some additional modifications to meet OP's expected result):
Empty rows need to be removed as they would be mistaken as record separator.
The streaks of equal signs = which are used as record separator need to be replaced by multiple empty lines and the missing tag id:.
The streaks of dashes should be removed.
The first row containing RELEASE VERSION: should be removed to be in line with OP's expectations.
The code below assumes that the raw text file is named "raw.txt".
library(data.table)
library(magrittr)
# read raw file, skip first row
raw <- fread("raw.txt", sep = "\n", header = FALSE, skip = 1L)
# replace streaks of "=" and "-"
raw[, V1 := V1 %>%
stringr::str_replace("[=]+", "\n\nid:") %>%
stringr::str_replace(": [-]+", ": ")][]
# now read the modified data using DCF format skipping empty rows
dt <- as.data.table(read.dcf(textConnection(raw[V1 != "", V1])))
dt
id STUDY VARIABLE QUESTION
1: VERSION Version Number Of Release Version of Cumulative Data File
2: VCF0004 Year of Study Year of study (4-digit)
3: VCF0006 NA NA
NOTES
1: This variable appears in the data as: ANES_cdf_VERSION:YYYY-mmm-DD [...]
2: NA
3: NA
TYPE
1: Character-1
2: Numeric Dec 0-1
3: NA
Thanks in advance! I have been trying this for a few days and I am kind of stuck. I am trying to loop through a text file (imported as a list) and create a data frame from it. The data frame should start a new row whenever an item in the list contains a day of the week, putting that item in the first column (V1). I want to put the rest of the comments in the second column (V2), which may mean concatenating strings together. I am trying to use a conditional with grepl(), but I am lost on the logic after setting up the initial data frame.
Here is an example text I am bringing into R (it is Facebook data from a text file). The []'s signify the list number. It is a lengthy file (50K+ lines) but I have the date column set up.
[1]
Thursday, August 25, 2016 at 3:57pm EDT
[2]
Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3]Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6]Sunday, August 14, 2016 at 8:44am EDT
[7]Michael and Kurt are now friends.
The end result would be a data frame where each day-of-week item starts a new row, and the rest of the list items are concatenated into the second column. So the end data frame would be:
Row 1 ([1] in V1 and [2] in V2)
Row 2 ([3] in V1 and [4],[5] in V2)
Row 3 ([6] in V1 and [7] in V2)
Here is the start of my code, and I can get V1 to populate correctly, but not the second column of the data frame.
### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")
### Remove empty lines from the text file
temp <- temp[temp!=""]
### Create the temp char file as a list file
tmp <- as.list(temp)
### A days vector for searching through the list of days.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
df <- {}
### Loop through the list
for (n in 1:length(tmp)){
  ### Search to see if there is a day in the list item
  for(i in 1:length(days)){
    if(grepl(days[i], tmp[n]) == 1){
      ### Bind the row to the df if there is a day in the list item
      df <- rbind(df, tmp[n])
    }
  }
  ### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.
  d <- c(d, tmp[n])
}
Here's an option using the tidyverse:
library(tidyverse)
text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT
[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3]Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6]Sunday, August 14, 2016 at 8:44am EDT
[7]Michael and Kurt are now friends."
df <- data_frame(lines = read_lines(text)) %>% # read data, set up data.frame
  filter(lines != '') %>% # filter out empty lines
  # set grouping by cumulative number of rows with weekdays in them
  group_by(grp = cumsum(grepl(paste(weekdays(Sys.Date() + 0:6, abbreviate = FALSE), collapse = '|'), lines))) %>%
  # collapse each group to two columns
  summarise(V1 = lines[1], V2 = list(lines[-1]))
df
## # A tibble: 3 × 3
## grp V1 V2
## <int> <chr> <list>
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]>
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]>
This approach uses a list column for V2, which is probably the best approach in terms of preserving your data, but use paste or toString if you need a single string instead.
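For example, to get V2 as a plain character column, only the summarise step changes (a sketch of the same pipeline; df_flat is just an illustrative name):
df_flat <- data_frame(lines = read_lines(text)) %>%
  filter(lines != '') %>%
  group_by(grp = cumsum(grepl(paste(weekdays(Sys.Date() + 0:6, abbreviate = FALSE), collapse = '|'), lines))) %>%
  summarise(V1 = lines[1], V2 = paste(lines[-1], collapse = ' '))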
Roughly equivalent base R:
df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)
df <- df[df$V2 != '', , drop = FALSE]
df$grp <- cumsum(grepl(paste(weekdays(Sys.Date() + 0:6, abbreviate = FALSE), collapse = '|'), df$V2))
df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})
df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})
df
## grp V1
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT
## V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2 [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3 [7]Michael and Kurt are now friends.
I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below shows my problem using the example from relsurv but changing the life-table data. I've only used two years and two ages in the life-table data below; the actual data is much larger but gives the same error. The error is 'invalid ratetable argument', but I've formatted the table as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
  for (s in 1:2){
    # probability of death in next year, transformed to hazard (see ratetables help)
    rtab[ , s, y - min(years) + 1] = -1 * log(1 - subset(raw, Year == y & sex == s)$qx) / 365.24
  }
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable() without using verbose first, because the verbose check will lie about valid rate tables.
The rest of this comment is about this lie.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
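Evaluating that expression for a few factor values makes the problem concrete:
fac <- c(0, 1, 12)
1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0)
# [1] 2 5 4   -- a 1 is never produced, and fac == 1 yields 5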
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
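Putting the first two points together on the rate table from the question (a sketch; rtab is the array built above):
# apply the factor fix from point 1, then use point 2 to re-check the table
attributes(rtab)$factor <- c(0, 1, 0)  # age varies over time, sex is fixed, year varies
is.ratetable(rtab)                     # quick TRUE/FALSE check
is.ratetable(rtab, verbose = TRUE)     # reports what (if anything) is still wrong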