Looking for advice on refining my code and also trimming to a date range.
The spreadsheet is exported from another system, so the structure of the Excel file cannot be changed. The exported data effectively starts at E2, with the first date column in F2 and the first item in E3. The data keeps extending to the right for as many dates as there are. I have replicated the structure below.
And I want it to look like:
I have come up with the below, which works, but I was looking for advice on refining it down to fewer individual steps.
In the below code:
#1 = extracting data
#2 = pulling the dates out
#3 = formatting from Excel number to an actual date
#4 = grabbing the item names
#5 = transposing data and skipping some parts
#6 = adding in dates to the row names
#1
library(readxl)
# named gtb because every later step refers to gtb
gtb <- data.frame(read_excel("C:/example.xlsx",
sheet = "Sheet1"))
#2
dfdate <- gtb[1, -c(1,2,3,4,5)]
#3
dfdate <- format(as.Date(as.numeric(dfdate),
origin = "1899-12-30"), "%d/%m/%Y")
#4
rownames(gtb) <- gtb[,1]
#5
gtb <- as.data.frame(t(gtb[, -c(1,2,3,4,5)]))
#6
rownames(gtb) <- dfdate
After the row names have been added the structure is such that I am happy to start creating the visuals where needed.
thanks for your advice
David
Here is one suggestion. I don't have easy access to your data, so I use a small stand-in, but I am including code to remove those columns as you do, based on their names, which can be nicer than removing by index.
library(dplyr)
library(tibble)
library(lubridate)
df <- read.table( text=
"Item_Code 01/01/2018 01/02/2018 01/03/2018 01/04/2018
Item 99 51 60 69
Item2 42 47 88 2
Item3 36 81 42 48
",header=TRUE, check.names=FALSE) %>%
rename( `Item Code` = Item_Code )
x <- df %>% select( -matches("Code \\d|Internal Code") ) %>%
column_to_rownames("Item Code") %>%
t %>% as.data.frame %>%
rownames_to_column("Item Code") %>%
mutate( `Item Code` = dmy(`Item Code`) )
x
Output:
> x
Item Code Item Item2 Item3
1 2018-01-01 99 42 36
2 2018-02-01 51 47 81
3 2018-03-01 60 88 42
4 2018-04-01 69 2 48
I went back and forth a bit on this solution, but it can be nice to also showcase how to remove columns by a regex on their column names, since you are removing several similarly named columns.
The t trick, which you also use, works because there is really only one other column that would cause problems, as others have commented, and it can be temporarily stowed away as row names. If that weren't the case, you would be looking at a more complex solution involving pivot_wider and pivot_longer, or splitting the data.frame and transposing only one of the halves.
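For comparison, the pivot_longer / pivot_wider route mentioned above could look roughly like the sketch below. It rebuilds the same toy table, reshapes it, and then trims to a date range, which the question also asked about. The column names `Date` and `value` and the range bounds are placeholders chosen for illustration, and it assumes tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, built directly so the example is self-contained
df <- data.frame(
  `Item Code`  = c("Item", "Item2", "Item3"),
  `01/01/2018` = c(99, 42, 36),
  `01/02/2018` = c(51, 47, 81),
  `01/03/2018` = c(60, 88, 42),
  `01/04/2018` = c(69, 2, 48),
  check.names = FALSE
)

x <- df %>%
  # one row per item/date pair
  pivot_longer(-`Item Code`, names_to = "Date", values_to = "value") %>%
  # column headers are day/month/year strings
  mutate(Date = as.Date(Date, format = "%d/%m/%Y")) %>%
  # one column per item, one row per date
  pivot_wider(names_from = `Item Code`, values_from = value) %>%
  # keep only the dates inside the (illustrative) range
  filter(Date >= as.Date("2018-02-01"), Date <= as.Date("2018-03-01"))
```

Because the dates end up in a real Date column rather than in row names, the range filter is an ordinary comparison.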
I have a report that I need to do on a quarterly basis that involves adding various components of revenue together to produce a trailing 12 month and trailing 24 month total.
Rather than retyping a bunch of column names to add each column together on a rolling basis, I was hoping to create a function where I could declare variables for the trailing months so I can sum them together more easily.
My data frame all_rel contains all the data I need to sum together. It contains the following fields (unfortunately I just inherited this report and it isn't exactly in tidy format):
Total_Processing_Revenue
Ancillary_Revenue
In the data frame I have the trailing 24 months (T24) of these data points in separate columns.
The script I inherited uses the following to add the columns together:
all_rel$anci_rev_cy_ytd = all_rel$X201701Ancillary_Revenue+all_rel$X201702Ancillary_Revenue+all_rel$X201703Ancillary_Revenue+...+all_rel$X201712Ancillary_Revenue
I was hoping to do something with paste but can't seem to get it to work:
dfname <- 'all_rel$X'
revmonth1 <- '01'
revmonth2 <- '02'
revmonth3 <- '03'
revmonth4 <- '04'
revmonth5 <- '05'
revmonth6 <- '06'
revmonth7 <- '07'
revmonth8 <- '08'
revmonth9 <- '09'
revmonth10 <- '10'
revmonth11 <- '11'
revmonth12 <- '12'
cy <- '2017'
py <- '2016'
rev1 <- 'Total_Processing_Revenue'
rev2 <- 'Ancillary_Revenue'
all_rel$anci_rev_py_ytd = paste(dfname,py,revmonth1,rev2, sep ='')+paste(dfname,py,revmonth2,rev2, sep ='')+...paste(dfname,py,revmonth12,rev2, sep ='')
When I try to sum these fields together I get a "non-numeric argument to binary operator" error. Is there something else I can do instead of what I've been trying?
paste(dfname,py,revmonth1,rev2, sep ='') returns "all_rel$X201601Ancillary_Revenue"
Is there a way to tell R that I'm pasting these names in order to reference the data in those columns, rather than the pasted text itself?
I'm fairly new to R (I've been learning on the fly to try to make my life easier).
Ultimately I need to figure out how to convert this mess to a tidy data format where each of the revenue columns has a month and year, but I was hoping to use this issue to understand how to use substitution logic to better automate processes. Maybe I just worded my searches incorrectly, but I was struggling to find the exact issue I'm trying to solve.
Any help is greatly appreciated.
::edit::
added dput(head)
structure(list(Chain = c("000001", "000029", "000060", "000064","000076", "000079"), X201601Net_Revenue = c(-2.92, 25005.14,55787.59, 3996.69, 14229.41, 3455.85),X201601Total_Processing_Revenue = c(0,16140.48, 23238.89, 3574.17, 4093.51, 641.1), X201601Ancillary_Revenue = c(-2.92,8864.66, 32548.7, 422.52, 10135.9, 2814.75), X201602Net_Revenue = c(0,41918.84, 56696.34, 4789.57, 13113.2, 5211.27), X201602Total_Processing_Revenue = c(0,13253.19, 24733.04, 4395.69, 4102.79, 546.68), X201602Ancillary_Revenue = c(0,28665.65, 31963.3, 393.88, 9010.41, 4664.59), X201603Net_Revenue = c(0,23843.76, 62494.51, 5262.87, 20551.79, 7646.75), X201603Total_Processing_Revenue = c(0,15037.39, 27523.19,4792.63,4805.61,2134.72)),.Names=c("Chain","X201601Net_Revenue","X201601Total_Processing_Revenue","X201601Ancillary_Revenue","X201602Net_Revenue","X201602Total_Processing_Revenue","X201602Ancillary_Revenue","X201603Net_Revenue", "X201603Total_Processing_Revenue"), row.names = c(NA,6L), class = "data.frame")
Here's how to tidy your data (calling your data dd):
library(tidyr)
library(dplyr)
gather(dd, key = key, value = value, -Chain) %>%
mutate(year = substr(key, start = 2, 5),
month = substr(key, 6, 7),
metric = substr(key, 8, nchar(key))) %>%
select(-key) %>%
spread(key = metric, value = value)
# Chain year month Ancillary_Revenue Net_Revenue Total_Processing_Revenue
# 1 000001 2016 01 -2.92 -2.92 0.00
# 2 000001 2016 02 0.00 0.00 0.00
# 3 000001 2016 03 NA 0.00 0.00
# 4 000029 2016 01 8864.66 25005.14 16140.48
# 5 000029 2016 02 28665.65 41918.84 13253.19
# 6 000029 2016 03 NA 23843.76 15037.39
# 7 000060 2016 01 32548.70 55787.59 23238.89
# 8 000060 2016 02 31963.30 56696.34 24733.04
# 9 000060 2016 03 NA 62494.51 27523.19
# 10 000064 2016 01 422.52 3996.69 3574.17
# 11 000064 2016 02 393.88 4789.57 4395.69
# 12 000064 2016 03 NA 5262.87 4792.63
# 13 000076 2016 01 10135.90 14229.41 4093.51
# 14 000076 2016 02 9010.41 13113.20 4102.79
# 15 000076 2016 03 NA 20551.79 4805.61
# 16 000079 2016 01 2814.75 3455.85 641.10
# 17 000079 2016 02 4664.59 5211.27 546.68
# 18 000079 2016 03 NA 7646.75 2134.72
With that done, you can use whatever grouped operations you want - sums, rolling sums or averages, etc. You might be interested in the yearmon class provided in the zoo package, this question on rolling sums by group, and of course the R-FAQ on grouped sums.
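On the original substitution question: a pasted string is just text, so adding two of them fails with the "non-numeric argument" error. One way around it, sketched here in base R on made-up numbers, is to build the column names as strings and use them to index the data frame. The toy `all_rel` below only mimics the column naming from the question; its values are invented for illustration.

```r
# Toy stand-in for all_rel (illustrative values, same naming scheme)
all_rel <- data.frame(
  Chain = c("000001", "000029"),
  X201601Ancillary_Revenue = c(1, 10),
  X201602Ancillary_Revenue = c(2, 20),
  X201603Ancillary_Revenue = c(3, 30)
)

py     <- "2016"
rev2   <- "Ancillary_Revenue"
months <- sprintf("%02d", 1:3)          # "01", "02", "03"

# Column names as a character vector, e.g. "X201601Ancillary_Revenue"
cols <- paste0("X", py, months, rev2)

# A single column by name: [[ ]] takes a string, unlike $
first_month <- all_rel[[cols[1]]]

# Several columns by name, summed row by row
all_rel$anci_rev_py_ytd <- rowSums(all_rel[cols])
```

Note that the data frame name is not part of the pasted string; only the column names are built as text, and `[[ ]]` or `[ ]` does the lookup.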
I have a text field containing dates in different formats, along with their IDs. I need to extract all the date strings using regex.
df <- data.frame(id=1:8,text=c("deficit based on wage statement 7/14/ to 7/17/2015.",
"Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2.",
"Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;Deficit due to wage,",
"statement: 4/22/15 thru 5/12/15,depos transcript 7/10/15 for 7/8/15 depos,",
"difference owed for 4/25/15-5/22/15 10-29-99 Feb. 25, 2009,",
"tpd 4:30:2015 - 5:22:2015--09/26/99, 7-14 1.3.99, 1.3.1999,",
"Medical TREATMENT DATES: 6/30/2015 - 30/06/2015 09/26/1999,",
"4/25/15-5/22/15,Medical 2010-01-29 **2010-01-30 February25,2009, February 25, 2009"))
So far, I have created regex using multiple OR statements.
#Different string patterns
#all day formats
day<-c(31:1,"01","02","03","04","05","06","07","08","09")
day_p<-paste(day,collapse = "|")
day_p <- paste0("(",day_p,")")
#all month formats
month<-c(12:1,"01","02","03","04","05","06","07","08","09")
month_p<-paste(month,collapse="|")
month_p <- paste0("(",month_p,")")
#all year 4 digit formats
year<-"\\d{4}"
year_p<-paste(year,collapse="|")
year_p <- paste0("(",year_p,")")
#all year 2 digit formats
year_i<-"\\d{2}"
year_i_p<-paste(year_i,collapse="|")
year_i_p <- paste0("(",year_i_p,")")
#all separator symbols
symbol_p<-paste(c("\\.","\\|","\\/","\\-","\\:","\\,"),collapse="|")
symbol_p <- paste0("(",symbol_p,")")
patterns<-paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_p,")","|",
"(",year_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,symbol_p,day_p,symbol_p,year_i_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_i_p,")","|",
"(",year_i_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,"\\-",day_p,")","|",
"(",day_p,"\\-",month_p,")","|",
"(",month_p,"\\/",day_p,")","|",
"(",day_p,"\\/",month_p,")")
#String extraction
library(stringr)
extract= str_extract_all(df$text,patterns)
Is there an approach to put all the regex rules in a data frame, name each rule and do a string extraction?
#regex patterns in a data frame
df_patterns<-data.frame(pattern=c(paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")"),
paste0("(",day_p,symbol_p,month_p,symbol_p,year_p,")")),
rule=c(1,2))
The output data frame should include the extraction values and the rule which triggered its extraction.
#output data frame
output<-data.frame(id=c(1,1,2,3,3),string=c("7/14","7/17/2015",NA,"7/14/15","10/5/15"),rule=c(9,1,NA,2,3))
stringr has a function called str_match_all that can extract all matches as well as return the capture groups that matched in separate columns. This is convenient for this question since you can name the capture groups and associate them to each column of output from str_match_all:
#Different string patterns
#all day formats
day_p <- "[0-3]?[0-9]"
#all month formats
month_p <- "[0-1]?[0-9]"
#all year 4 digit formats
year_p <- "\\d{4}"
#all year 2 digit formats
year_i_p <- "\\d{2}"
#all separator symbols
symbol_p <- "[-/:.]"
# Patterns to match structured as combination of capture groups
patterns<-paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_p,")","|",
"(",year_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,symbol_p,day_p,symbol_p,year_i_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_i_p,")","|",
"(",year_i_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,"[-]",day_p,")","|",
"(",day_p,"[-]",month_p,")","|",
"(",month_p,"[/]",day_p,")","|",
"(",day_p,"[/]",month_p,")","|",
"(", "\\w+[.]?[\\s]?\\d+[,]\\s?",year_p,")")
# Name the capture groups
rule_names = c("MDYYYY", "DMYYYY",
"YYYYMD", "MDYY",
"DMYY", "YYMD",
"MD_dash", "DM_dash",
"MD_slash", "DM_slash",
"MDYYYY_word")
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df$text %>%
str_match_all(patterns) %>%
map2(df$id, function(x, y){
# give ids without any match a single all-NA row
if(nrow(x) == 0){
x = rbind(x, NA)
}
data.frame(id = y, x)
}) %>%
do.call(rbind, .) %>%
# X1 is the full match; X2 to X12 are the 11 capture groups
mutate_at(vars(X2:X12), ~ ifelse(!is.na(.), 1, NA)) %>%
setNames(c("id", "string", rule_names)) %>%
gather(rule, value, -id, -string) %>%
na.omit() %>%
select(-value) %>%
arrange(id)
Notes:
This final part does all the work. str_match_all returns a list with each element a character matrix of matches and capture groups for each df$text value.
map2 binds the id's with the character matrices, so that each row refers to an id + match combination. The if statement checks if an element has no match and rbinds an NA value if it is the case. This allows id to have at least one row to bind to.
mutate_at converts each of the "capture_group" columns to dummy variables indicating whether "this capture group has a match"
Rename capture group columns with rule_names and transform all dummy into one single categorical variable.
An important caveat: there is no way of knowing whether "5/6/2015" is in MDYYYY or DMYYYY format, so you have to order the patterns so that one of them takes precedence (e.g. if MDYYYY comes before DMYYYY in patterns, MDYYYY will match first for "5/6/2015").
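The precedence point can be checked on a cut-down pattern with only two alternatives. This is a sketch using the same day and month sub-patterns as above, with the separator fixed to "/" for brevity; the variable names are made up for the example.

```r
library(stringr)

day_p   <- "[0-3]?[0-9]"
month_p <- "[0-1]?[0-9]"

# Group 1 = month/day/year, group 2 = day/month/year
two_rules <- paste0("(", month_p, "/", day_p, "/\\d{4})", "|",
                    "(", day_p, "/", month_p, "/\\d{4})")

# Ambiguous date: both alternatives could match, the first one listed wins
ambiguous <- str_match("5/6/2015", two_rules)

# Unambiguous date: "30" cannot be a month, so only group 2 is filled
day_first <- str_match("30/06/2015", two_rules)
```

In each result, column 1 is the full match and columns 2 and 3 are the capture groups; a group that did not take part in the match is NA, which is what the dummy-variable step above relies on.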
Result:
id string rule
1 1 7/17/2015 MDYYYY
2 1 7/14 MD_slash
3 3 7/14/15 MDYY
4 3 10/5/15 MDYY
5 4 4/22/15 MDYY
6 4 5/12/15 MDYY
7 4 7/10/15 MDYY
8 4 7/8/15 MDYY
9 5 4/25/15 MDYY
10 5 5/22/15 MDYY
11 5 10-29-99 MDYY
12 5 Feb. 25, 2009 MDYYYY_word
13 6 4:30:2015 MDYYYY
14 6 5:22:2015 MDYYYY
15 6 1.3.1999 MDYYYY
16 6 09/26/99 MDYY
17 6 1.3.99 MDYY
18 6 7-14 MD_dash
19 7 6/30/2015 MDYYYY
20 7 09/26/1999 MDYYYY
21 7 30/06/2015 DMYYYY
22 8 2010-01-29 YYYYMD
23 8 2010-01-30 YYYYMD
24 8 4/25/15 MDYY
25 8 5/22/15 MDYY
26 8 February25,2009 MDYYYY_word
27 8 February 25, 2009 MDYYYY_word
Answer
Brief
Correct me if I'm wrong, but I believe R does support PCRE regex (base functions such as gregexpr accept perl = TRUE). That being the case, you can use the following regex to catch any date in the formats you specified.
Code
See this regex in use here
(?(DEFINE)
(?# Definitions )
(?<day>[12]\d|3[01]|0?[1-9])
(?<month>1[0-2]|0?[1-9])
(?<year>\d+)
(?<separator>[.|\/:,-])
(?# Date formats )
(?<mdy>(?&month)(?<mdy_1>(?&separator))(?&day)(?&mdy_1)(?&year))
(?<dmy>(?&day)(?<dmy_1>(?&separator))(?&month)(?&dmy_1)(?&year))
(?<ymd>(?&year)(?<ymd_1>(?&separator))(?&month)(?&ymd_1)(?&day))
(?<md>(?&month)(?<md_1>(?&separator))(?&day)(?&md_1)?)
(?<dm>(?&day)(?<dm_1>(?&separator))(?&month)(?&dm_1)?)
(?# Date )
(?<date>(?&mdy)|(?&dmy)|(?&ymd)|(?&md)|(?&dm))
)
(?<=\b|\s)(?&date)(?=\b|\s)
Explanation
The define block specifies all our definitions for what constitutes a day, month, year, separator. It also defines our date formats (mdy, dmy, ymd, md, dm). Finally it defines our date group which is a simple OR between all our date formats.
The final regex simply specifies that the preceding or following tokens should be word boundary characters \b or whitespace character \s (whitespace added here in the case of the last character being a word boundary character, it will catch the final character as well - you can test this with the first match by removing the |\s in the final regex to see the result).
Please note that this assumes the days of a month can go to 31 (a more specific check would result in a very lengthy regex and seems pointless when you can validate it through code).
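To try the DEFINE approach from R, a cut-down version of the pattern can be passed to base R's gregexpr with perl = TRUE, which selects the PCRE engine. This is only a sketch: it covers the month/day/year order alone, uses a smaller separator set, and replaces the lookarounds with plain \b. As far as I know, stringr's ICU engine does not accept (?(DEFINE)...), hence the base functions here.

```r
# Reusable sub-patterns go in a (?(DEFINE)...) block and are called with (?&name)
define <- paste0(
  "(?(DEFINE)",
  "(?<day>[12]\\d|3[01]|0?[1-9])",
  "(?<month>1[0-2]|0?[1-9])",
  "(?<year>\\d+)",
  "(?<separator>[./:-])",
  ")"
)

# Month, separator, day, separator, year, bounded by word boundaries
pattern <- paste0(define,
                  "\\b(?&month)(?&separator)(?&day)(?&separator)(?&year)\\b")

text <- "statement: 4/22/15 thru 5/12/15"

# perl = TRUE switches gregexpr() to PCRE; regmatches() pulls out the matched text
dates <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
```

The remaining formats would be added as further alternatives, in the same way the full pattern does.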
Results
Input
deficit based on wage statement 7/14/ to 7/17/2015.
Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2.
Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;Deficit due to wage,
statement: 4/22/15 thru 5/12/15,depos transcript 7/10/15 for 7/8/15 depos,
difference owed for 4/25/15-5/22/15 10-29-99 Feb. 25, 2009,
tpd 4:30:2015 - 5:22:2015--09/26/99, 7-14 1.3.99, 1.3.1999,
Medical TREATMENT DATES: 6/30/2015 - 30/06/2015 09/26/1999,
4/25/15-5/22/15,Medical 2010-01-29 **2010-01-30 February25,2009, February 25, 2009
Output
7/14/
7/17/2015
7/14/15
10/5/15
4/22/15
5/12/15
7/10/15
7/8/15
4/25/15
5/22/15
10-29-99
4:30:2015
5:22:2015
09/26/99
7-14
1.3.99
1.3.1999
6/30/2015
30/06/2015
09/26/1999
4/25/15
5/22/15
2010-01-29
2010-01-30
Edits
Code
See this code in use here
(?(DEFINE)
(?# Definitions )
(?<day>[12]\d|3[01]|0?[1-9])
(?<month>1[0-2]|0?[1-9])
(?<year>\d+)
(?<separator>[.|\/:,-])
(?# Date formats )
(?<f_mdy>(?&month)(?<mdy_1>(?&separator))(?&day)(?&mdy_1)(?&year))
(?<f_dmy>(?&day)(?<dmy_1>(?&separator))(?&month)(?&dmy_1)(?&year))
(?<f_ymd>(?&year)(?<ymd_1>(?&separator))(?&month)(?&ymd_1)(?&day))
(?<f_md>(?&month)(?<md_1>(?&separator))(?&day)(?&md_1)?)
(?<f_dm>(?&day)(?<dm_1>(?&separator))(?&month)(?&dm_1)?)
(?<f_Mdy>(?:jan(?:uary|\.)?|feb(?:ruary|\.)?|mar(?:ch|\.)?|apr(?:il|\.)?|may|jun(?:e|\.)?|jul(?:y|\.)?|aug(?:ust|\.)?|sep(?:tember|\.)?|oct(?:ober|\.)?|nov(?:ember|\.)?|dec(?:ember|\.)?)\s*(?&day)(?:\s*(?&separator)|(?&separator)\s*|\s+)(?&year))
)
(?<=\b|\s)(?:(?<mdy>(?&f_mdy))|(?<dmy>(?&f_dmy))|(?<ymd>(?&f_ymd))|(?<md>(?&f_md))|(?<dm>(?&f_dm))|(?<Mdy>(?&f_Mdy)))(?=\b|\s)
This will set captures into named capture groups. If you look at the output in the link, you'll see named groups with the content it matched.