Extracting the Package name from fully defined class names using R scripting - r

I have the following sort of data set (ds1) in my CSV file, which includes class names and their corresponding faults. I want to extract the package name from the rows where the number of faults equals 2, using an R script.
Class Faults
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Delete 1
org.apache.tools.ant.taskdefs.Deltree 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Echo 1
org.apache.tools.ant.Exec 2
org.apache.tools.ant.Exec 2
I have written the following code, but it does not produce the desired output:
dschanged<- subset(ds1, grep( "/^([^\\.]+)/", class) & Faults==2 )
Technically, I need a proper regular expression that pulls the string before the last dot (.) to generate the following output:
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant 2
org.apache.tools.ant 2

grep (and grepl) are inappropriate for this: you aren't filtering based on textual content. You are (a) filtering based on Faults, and (b) changing the text in Class.
Your data:
ds1 <- structure(list(Class = c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Delete", "org.apache.tools.ant.taskdefs.Deltree", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Echo", "org.apache.tools.ant.Exec", "org.apache.tools.ant.Exec"),
Faults = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L)),
.Names = c("Class", "Faults"), class = "data.frame", row.names = c(NA, -12L))
Filter on Faults (you already had this). You only need one of these two commands, they both do the same thing; the major differences are in readability (personal preference) and performance (the second one, in this case, takes about 35% less time, though since they are both measured in microseconds, it seems silly to compete).
ds2 <- subset(ds1, Faults == 2)
ds2 <- ds1[ds1$Faults == 2,]
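One behavioural difference between the two filters is worth knowing, although it does not matter for the data above: subset() drops rows where the condition is NA, while logical indexing keeps them as all-NA rows. A minimal sketch, using a made-up data frame `d` (not part of the question):

```r
# Hypothetical data: one row has a missing fault count
d <- data.frame(Class = c("a.b.C", "a.b.D", "a.E"),
                Faults = c(2L, NA, 2L),
                stringsAsFactors = FALSE)

by_subset <- subset(d, Faults == 2)   # an NA condition is treated as FALSE
by_index  <- d[d$Faults == 2, ]       # an NA condition yields a row of NAs

nrow(by_subset)  # 2
nrow(by_index)   # 3

# which() drops the NA positions, so this matches subset()
by_which <- d[which(d$Faults == 2), ]
```

If Faults can contain NA in the real CSV, wrapping the condition in which() is one way to keep the indexing form without picking up the extra rows.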
Update Class to remove the last word (and dot):
ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class)
# Class Faults
# 1 org.apache.tools.ant.taskdefs 2
# 2 org.apache.tools.ant.taskdefs 2
# 4 org.apache.tools.ant.taskdefs 2
# 5 org.apache.tools.ant.taskdefs 2
# 6 org.apache.tools.ant.taskdefs 2
# 7 org.apache.tools.ant.taskdefs 2
# 8 org.apache.tools.ant.taskdefs 2
# 9 org.apache.tools.ant.taskdefs 2
# 11 org.apache.tools.ant 2
# 12 org.apache.tools.ant 2
Note: this can also be done with sub instead of gsub, but the latter is my first-reach since most of my uses deal with larger and repeating regexes. The major (only?) difference between the two is that:
'sub' and 'gsub' perform replacement of the first and all matches respectively
(from ?sub).
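To make that difference concrete, here is a small sketch on one of the class names from the question. With the end-anchored pattern there is at most one match per string, so sub and gsub agree; with an unanchored pattern they do not. The variable names are only for illustration.

```r
x <- "org.apache.tools.ant.taskdefs.Definer"

# End-anchored pattern: at most one match per string, so both agree
pkg_sub  <- sub("\\.[^.]*$", "", x)
pkg_gsub <- gsub("\\.[^.]*$", "", x)

# Unanchored pattern: sub() replaces only the first dot, gsub() all of them
first_only <- sub("\\.", "/", x)
all_dots   <- gsub("\\.", "/", x)
```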
I know of no tool that does both the filtering and changing in a single command (though perhaps data.table does, I don't know).
Similar to @egnha's solution (that uses magrittr), here's one using dplyr, which many people allege is very easy to read and adapt (at the potential cost of performance):
library(dplyr)
ds2 <- ds1 %>%
  filter(Faults == 2) %>%
  mutate(Class = gsub("\\.[^.]*$", "", Class))
Since I mentioned performance, here's a comparison:
microbenchmark(indexing = { ds2 <- ds1[ds1$Faults == 2,]; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
subset = { ds2 <- subset(ds1, Faults == 2) ; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
dplyr = { ds1 %>% filter(Faults == 2) %>% mutate(Class = gsub("\\.[^.]*$", "", Class)) })
# Unit: microseconds
# expr min lq mean median uq max neval
# indexing 71.841 87.7045 109.4496 104.2975 120.7075 269.493 100
# subset 102.473 115.6020 147.0108 139.1230 165.5620 287.726 100
# dplyr 1067.030 1156.3745 1323.1174 1225.4805 1351.2920 4270.308 100
For the record, dplyr used in this way is not often this speed-poor in comparison to other methods. It is not commonly faster, but it is not often an order-of-magnitude slower.

I don't think you are looking for filtering based on class name.
Just do it in 2 steps.
# Filter
dschanged <- ds1[ds1$Faults == 2,]
# Extract package name
dschanged$Class <- sub('(.*)[.](.*)', '\\1', dschanged$Class)

You can also do this without any fancy regexes: split each Class string on the dots, then dot-paste all but the last substring.
library(magrittr) # Provides pipe operator `%>%`
dschanged <- subset(ds1, Faults == 2)
dschanged$Class <- dschanged$Class %>%
strsplit(split = "[.]") %>%
sapply(function(x) head(x, -1L) %>% paste(collapse = "."))
Note that strings without dots will be transformed to empty strings. It is also quite a bit slower than the solution suggested by @r2evans.
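The caveat about dotless strings can be seen in a short base-R sketch. The second input value, "Exec", is a made-up class name with no package prefix; the split-and-paste step is written with vapply instead of the pipe so that it needs no extra packages.

```r
x <- c("org.apache.tools.ant.Exec", "Exec")  # second value has no dot (hypothetical)

# Split-and-paste: the dotless string collapses to ""
split_paste <- vapply(strsplit(x, ".", fixed = TRUE),
                      function(parts) paste(head(parts, -1L), collapse = "."),
                      character(1))

# Regex: nothing matches in the dotless string, so it is returned unchanged
regex_based <- sub("\\.[^.]*$", "", x)
```

Which behaviour is preferable depends on whether a class without a package should map to an empty package name or be left as it is.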


How to count each column values frequency combinations in R?

In the original dataset I have 3k+ rows and 2 columns: ids and the languages each id can apply in practice. My first step was to find the frequency of combinations of chosen languages. For example, how many times Python was chosen along with R and SQL, or how many times Java was picked with JavaScript, C++ and so on.
Some research on Stackoverflow helped me to find these possible patterns. Here's some code with a sample data set:
sample <- data.frame(id = rep(randomNames::randomNames(4), each = 4),
                     programming = c("R", "Python", "C#", "Other",
                                     "R", "Tableu", "Assembler",
                                     "Other", "Java", "JavaScript",
                                     "Python", "C#", "R", "Python", "C#",
                                     "Other"))  # last value inferred from the pair counts in the answer below
gr <- sample %>%
  group_by(id) %>%
  arrange(programming) %>%
  summarise(programming = paste(sort(unique(programming)), collapse = ", "))
But now I wonder how I can find the most frequent picks for each language. For instance, R was picked with Java and Kotlin very few times, so this is not a very popular setting. But R picked with Python and SQL is more popular. My purpose is to find which languages have the greatest frequency of being picked.
I also did some research (example), and, unfortunately, have not found the solution.
I think I should iterate my programming column to find all possible picks (R + ..., Python + ...; then R + Python + ...). I tried using lapply but struggled with writing a lambda function.
What are the possible ways to solve the issue? Is there any effective function for such purposes?
One option would be to create combinations of languages within each id and count the combinations which most frequently occur together.
sample %>%
group_by(id) %>%
summarise(programming = combn(sort(programming), 2,
paste0, collapse = '-'), .groups = 'drop') %>%
count(programming, sort = TRUE)
# programming n
# <chr> <int>
# 1 C#-Python 3
# 2 Other-R 3
# 3 C#-Other 2
# 4 C#-R 2
# 5 Other-Python 2
# 6 Python-R 2
# 7 Assembler-Other 1
# 8 Assembler-R 1
# 9 Assembler-Tableu 1
#10 C#-Java 1
#11 C#-JavaScript 1
#12 Java-JavaScript 1
#13 Java-Python 1
#14 JavaScript-Python 1
#15 Other-Tableu 1
#16 R-Tableu 1
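For the "for each language" part of the question, a base-R alternative is a co-occurrence matrix. This is only a sketch: the ids "a" to "d" are made up, and the last language value is assumed to be "Other", which is what the pair counts above imply. It also assumes each id lists a language at most once.

```r
# Hypothetical ids; the languages follow the sample data in the question
langs <- data.frame(
  id = rep(c("a", "b", "c", "d"), each = 4),
  programming = c("R", "Python", "C#", "Other",
                  "R", "Tableu", "Assembler", "Other",
                  "Java", "JavaScript", "Python", "C#",
                  "R", "Python", "C#", "Other"),
  stringsAsFactors = FALSE
)

# Incidence matrix: one row per id, one column per language (0/1)
inc <- unclass(table(langs$id, langs$programming))

# t(inc) %*% inc: cell [i, j] counts the ids that picked both i and j
co <- crossprod(inc)

co["C#", "Python"]  # ids that picked both C# and Python
co["R", "R"]        # diagonal: ids that picked R at all

# Most frequent partner for each language (diagonal masked out)
partners <- co
diag(partners) <- 0
best_partner <- colnames(partners)[max.col(partners, ties.method = "first")]
names(best_partner) <- rownames(partners)
```

Each row of `co` then shows how often that language was chosen alongside every other one, and `best_partner` picks the top entry per row (ties are broken by column order).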

If else statement with a value that is part of a continuous character in R

My dataframe (df) contains a list of values which are labelled following a format of 'Month' 'Name of Site' and 'Camera No.'. I.e., if my value is 'DECBUTCAM27' then Dec-December, BUT-Name of Site and CAM27-Camera No.
I have 100 such values with 19 different site names.
I want to write if-else code such that only the site names are recognised and a corresponding number is added.
My initial idea was to add the corresponding number for all the 100 values, but since if-else does not work beyond 50 values, I couldn't use that option.
This is what I had written for the option that I tried:
df <- df2 %>% mutate(Site_ID =
ifelse (CT_Name == 'DECBUTCAM27', "1",
ifelse (CT_Name == 'DECBUTCAM28', "1",
ifelse (CT_Name == 'DECI2NCAM01', "2",
ifelse (CT_Name == 'DECI2NCAM07', "2",
ifelse (CT_Name == 'DECI5CAM39', "3",
ifelse (CT_Name == 'DECI5CAM40', "3","NoVal")))))))
I am looking for code such that only the sites, i.e. 'BUT', 'I2N' and 'I5', are recognised and a corresponding number is added.
Any help would be greatly appreciated.
Extract the sitename using regex and use match + unique to assign unique number.
df2$site_name <- sub('...(.*)CAM.*', '\\1', df2$CT_Name)
df2$Site_ID <- match(df2$site_name, unique(df2$site_name))
For example:
CT_Name <- c('DECBUTCAM27', 'DECBUTCAM28', 'DECI2NCAM01', 'DECI2NCAM07', 'DECI5CAM39', 'DECI5CAM40')
site_name <- sub('...(.*)CAM.*', '\\1', CT_Name)
#[1] "BUT" "BUT" "I2N" "I2N" "I5" "I5"
Site_ID <- match(site_name, unique(site_name))
#[1] 1 1 2 2 3 3
Here is a tidyverse solution:
You haven't provided a reproducible example, but let's use the CT_Names that you have supplied to create a test dataframe:
data <- tribble(
  ~ CT_Name,
  "DECBUTCAM27", "DECBUTCAM28", "DECI2NCAM01",
  "DECI2NCAM07", "DECI5CAM39", "DECI5CAM40"
)
Let's assume that the string format is 3 letters for months, 2 or more letters or numbers for site and CAM + 1 or more digits for camera number (adjust these as needed). We can use a regular expression in tidyr's extract() function to split up the string into its components:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera"))
(add remove = FALSE if you want to keep the original CT_Name variable)
This yields:
# A tibble: 6 x 3
  Month Site  Camera
  <chr> <chr> <chr>
1 DEC   BUT   CAM27
2 DEC   BUT   CAM28
3 DEC   I2N   CAM01
4 DEC   I2N   CAM07
5 DEC   I5    CAM39
6 DEC   I5    CAM40
We can then group by site and assign a group ID as your Site_ID:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera")) %>%
group_by(Site) %>%
mutate(Site_ID = cur_group_id())
This produces:
# A tibble: 6 x 4
# Groups: Site [3]
  Month Site  Camera Site_ID
  <chr> <chr> <chr>    <int>
1 DEC   BUT   CAM27        1
2 DEC   BUT   CAM28        1
3 DEC   I2N   CAM01        2
4 DEC   I2N   CAM07        2
5 DEC   I5    CAM39        3
6 DEC   I5    CAM40        3
Here is a quick example that uses a regex to find the site code and apply() to return a vector of codes.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$loc <- apply(df, 1, function(x) gsub("CAM.*$","",gsub("^.{3}",'',x[1])))
unique(df$loc) # all the locations found in the data
df$n <- as.numeric(as.factor(df$loc)) # get a number for each location
Mind that I use x[1] here because the codes are in the first column of my data.frame, which may differ for you.
---EDIT--- This was my previous answer. It also works, but requires more work from you. However, it allows you to choose which numeric code (or text) is assigned to each location, for example if the locations are ordered.
It requires you to list all the codes for each site, which I find heavy in terms of code, but it works. The switch part is roughly the same as an ifelse.
The regex excludes the first 3 characters and everything from the 'CAM' sequence onwards.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$n <- apply(df, 1, function(x) switch(gsub("CAM.*$","",gsub("^.{3}",'',x[1])),
                                        BUT = 1,
                                        DUC = 2))
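If specific numbers have to be attached to specific sites, a named lookup vector is another way to express the same mapping as the switch or the nested ifelse. A minimal sketch: the `site_ids` mapping and the extra 'JANXYZCAM01' value are made up for illustration, and the regex assumes a 3-character month prefix and a 'CAM' + digits suffix.

```r
ct_name <- c("DECBUTCAM27", "DECBUTCAM28", "DECI2NCAM01",
             "DECI2NCAM07", "DECI5CAM39", "DECI5CAM40",
             "JANXYZCAM01")  # last value: a site that is not in the lookup

# Drop the 3-character month and the trailing camera number
site <- sub("^.{3}(.*)CAM\\d+$", "\\1", ct_name)

# Hypothetical mapping from site name to the ID you want to assign
site_ids <- c(BUT = "1", I2N = "2", I5 = "3")

# Index the lookup by name; unknown sites come back as NA
site_id <- unname(site_ids[site])
site_id[is.na(site_id)] <- "NoVal"
```

Adding a twentieth site then means adding one entry to `site_ids` rather than another branch.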

How to optimize iterating over a huge dataframe with non-unique rows

I understand that if R is not updating a variable in place within the confines of a for loop then I've just made some horrendously slow and expensive code. Unfortunately, with a set of very tight deadlines and a strong background in C++/Java it's my go-to behaviour until I can get my R hat on.
I have a function I need to improve. It takes a dataframe (as below), returns the unique patid values and uses those to retrieve subsets of that dataframe for date modifications. A trimmed example is below (note, I just pulled this out of a completed run, so the date has already been modified). The last R run I performed was over a dataframe of 27 million rows and took about four to five hours. The size of the dataframe will be a lot bigger.
patid eventdate
1 12/03/1998
1 12/03/1998
2 04/03/2007
3 15/11/1980
3 15/11/1980
3 01/02/1981
A trimmed example of the function:
rearrangeDates <- function(dataFrame) {
  #return a list of the unique patient ids
  uniquePatids <- getUniquePatidList(dataFrame) #this is only called once and is very fast
  out <- NULL # accumulator for the re-ordered subsets
  for(i in 1:length(uniquePatids)) { # iterate over the list
    idf <- subset(dataFrame, dataFrame$patid==uniquePatids[[i]])
    idf$eventdate <- as.POSIXct(idf$eventdate,format="%d/%m/%Y")
    idf <- idf[order(idf$eventdate,decreasing=FALSE),]
    out = rbind(out,idf)
  }
  out
}
Can anyone suggest improvements?
Since you want to sort your data on patid & eventdate this should work.
library(dplyr)
df %>%
mutate(eventdate = as.Date(eventdate, format="%d/%m/%Y")) %>%
arrange(patid, eventdate)
Output is:
patid eventdate
1 1 1998-03-12
2 1 1998-03-12
3 2 2007-03-04
4 3 1980-11-15
5 3 1980-11-15
6 3 1981-02-01
Sample data:
df <- structure(list(patid = c(1L, 1L, 2L, 3L, 3L, 3L), eventdate = c("12/03/1998",
"12/03/1998", "04/03/2007", "15/11/1980", "15/11/1980", "01/02/1981"
)), class = "data.frame", row.names = c(NA, -6L))
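The same idea works in base R without any packages: parse the date column once for the whole data frame, then sort with a single order() call instead of subsetting per patient. A minimal sketch, using a deliberately shuffled copy of the sample rows (the name `visits` is made up):

```r
# Hypothetical, deliberately unsorted copy of the question's sample rows
visits <- data.frame(
  patid = c(3L, 1L, 3L, 2L, 1L, 3L),
  eventdate = c("01/02/1981", "12/03/1998", "15/11/1980",
                "04/03/2007", "12/03/1998", "15/11/1980"),
  stringsAsFactors = FALSE
)

# Parse the whole column once instead of once per patient
visits$eventdate <- as.Date(visits$eventdate, format = "%d/%m/%Y")

# A single order() call sorts by patient, then by date within patient
sorted <- visits[order(visits$patid, visits$eventdate), ]
rownames(sorted) <- NULL
```

This avoids both the repeated subset() scans and the growing rbind(), which are the two expensive parts of the loop.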
This is ideally suited to data.table: your data has a well-defined key that you group by (patid, eventdate); you know the size of the output df will be <= the size of the input df, so it's safe to do in-place assignments (waaay faster) instead of appends; you don't need to append to the output iteratively; and data.table has a nice fast unique function. So please try out the (loop-free!) code below and let us know how it compares both to your original and to the dplyr approach:
library(data.table)
dt = data.table(patid=c(1,1,2,3,3,3), eventdate=c('12/03/1998','12/03/1998',
                '04/03/2007', '15/11/1980', '15/11/1980','01/02/1981'))
dt[, eventdate := as.POSIXct(eventdate,format="%d/%m/%Y") ]
# If you set a key, the `by` operation will be super-fast
setkeyv(dt, c('patid','eventdate'))
odt <- dt  # setkeyv() has already sorted dt by patid, then eventdate; a by= without a j expression is ignored
patid eventdate
1: 1 1998-03-12
2: 1 1998-03-12
3: 2 2007-03-04
4: 3 1980-11-15
5: 3 1980-11-15
6: 3 1981-02-01
(One last thing: don't be afraid of POSIXct/lt, convert to them early, they're more efficient than strings, they support comparison operators hence the column can be used as key, sorted on, compared.)
(And for the fastest dplyr implementation, use dplyr::distinct())

Reading Excel file: How to find the start cell in messy spreadsheets?

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:
How do I automate the search of Excel cells for my "Monthly return" string using R?
At the moment, the best idea I can come up with is to load everything into R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution.
I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.
For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.
f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
  f_path <- file.choose()
}
# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)
skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10
# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0
}
# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)
# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
# select(-(1:col_skip))
Okay, as the format was specified as xls, I have updated this from csv to the correctly suggested xls loading.
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
Then you can dynamically filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
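For a self-contained run of this idea, here is a sketch that builds the same cells as plain character columns (rather than factors) and then tidies the extracted block. The column names given to the result and the percent-to-fraction conversion are my own additions for illustration.

```r
# The cells from the structure above, as plain character columns
data <- data.frame(
  V1 = c("MY COMPANY PTY LTD", "MC Pension Fund", "GROSS PERFROANCE DETAILS",
         "updated by JS on 6/4/2017", "", "Mar-14", "Apr 14"),
  V2 = c("", "", "", "", "Monthly return", "0.907%", "0.069%"),
  stringsAsFactors = FALSE
)

# One row per matching cell, with named columns "row" and "col"
target <- which(data == "Monthly return", arr.ind = TRUE)
hdr_row <- target[1, "row"]
hdr_col <- target[1, "col"]

# Everything below the header cell, plus the date column to its left
returns <- data[(hdr_row + 1):nrow(data), (hdr_col - 1):hdr_col]
names(returns) <- c("Date", "Monthly return")

# Strip the percent sign and rescale to a fraction
returns[["Monthly return"]] <-
  as.numeric(sub("%", "", returns[["Monthly return"]], fixed = TRUE)) / 100
```

Indexing the match by the "row" and "col" column names avoids relying on the element order of the matrix that which() returns.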
With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
  first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
  skip <- which(first_pass[[2]] == "Monthly return")
  message("For sheet '", if (is.null(sheet)) 1 else sheet,
          "' we'll skip ", skip, " rows.")
  read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
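The two-pass pattern is not specific to readxl. As a rough analogue that runs without any Excel file, here is the same idea on a CSV using only base R: a cheap first pass over the raw lines to learn where the header is, then a "real" read that benefits from automatic type conversion. The function name `two_pass_csv`, the temporary file and its contents are all made up for this sketch.

```r
two_pass_csv <- function(path, marker = "Monthly return", n_max = 10) {
  # Pass 1: read only the first few raw lines and locate the header line
  head_lines <- readLines(path, n = n_max)
  header_line <- grep(marker, head_lines, fixed = TRUE)[1]
  if (is.na(header_line)) stop("marker not found in the first ", n_max, " lines")
  # Pass 2: let read.csv() do type conversion from the header line onwards
  read.csv(path, skip = header_line - 1, check.names = FALSE)
}

# Hypothetical messy file: three junk lines, then the header, then the data
path <- tempfile(fileext = ".csv")
writeLines(c("MY COMPANY PTY LTD,",
             "MC Pension Fund,",
             "updated by JS,",
             ",Monthly return",
             "2017-03-14,0.00907",
             "2017-04-14,0.00069"), path)

real_data <- two_pass_csv(path)
```

As with the readxl version, `n_max` encodes the assumed upper bound on the number of junk rows.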
In cases like these it's important to know the possible conditions of your data. I'm going to assume that you only want to remove columns and rows that don't conform to your table.
I have this Excel book:
I added 3 blank columns at the left because, when I loaded the book in R with one blank column, the program omitted it. This confirms that R omits empty columns at the left.
First: load data
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
1 MC Pension Fund <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some columns with NA and '' values in case your data contain some
dat$x2 <- NA
dat$x4 <- NA
1 MC Pension Fund <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: Remove columns when all values are NA or ''. I have had to deal with this kind of problem in the past
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
1 MC Pension Fund <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: Keep only rows with complete observations (that's what I suppose from your example)
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally if you want to keep the header you can make something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
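The column and row filters above can also be written without apply(), by building one logical "blank" matrix and summing over it. This is a sketch on a small made-up data frame that mimics the layout (two real columns plus an all-NA and an all-'' column); it is not the data from the workbook.

```r
# Hypothetical import result: junk rows on top, empty columns on the right
dat <- data.frame(
  a  = c("MC Pension Fund", NA, "Mar-14", "Apr-14"),
  b  = c(NA, "Monthly return", "0.0097", "6e-04"),
  x2 = NA,
  x3 = "",
  stringsAsFactors = FALSE
)

# TRUE wherever a cell is NA or an empty string
blank <- is.na(dat) | dat == ""

# Keep columns that have at least one non-blank cell
keep_cols <- colSums(!blank) > 0
dat2 <- dat[, keep_cols, drop = FALSE]

# Keep rows that are complete across the remaining columns
keep_rows <- rowSums(blank[, keep_cols, drop = FALSE]) == 0
dat3 <- dat2[keep_rows, ]

# The header sits on the row just above the first complete row
names(dat3) <- c("Month", dat2[which(keep_rows)[1] - 1, 2])
```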
Here is how I would tackle it.
Read the excel spreadsheet in without the headers.
Find the row index for your string ("Monthly return" in this case).
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.
Function Definition:
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
filepath = paste(path, "/", filename, sep = "")
input = read.xlsx(filepath, sheetnum, header = FALSE)
start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
output = input[start_idx:dim(input)[1],]
rownames(output) <- NULL
colnames(output) <- c("Date","Monthly Return")
  output = output[-1, ]
  return(output)
}
final_df <- extract_return(
path = "~/Desktop",
filename = "Apr2017.xlsx",
sheetnum = 2)
No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.
This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
tidy_solution <- function() {
raw <- xlsx_cells("messyExcel.xlsx")
start <- raw %>%
filter_all(any_vars(. %in% c("Monthly return"))) %>%
select(row, col)
month.col <- raw %>%
filter(row >= start$row + 1, col == start$col - 1) %>%
pivot_wider(date, col)
return.col <- raw %>%
filter(row >= start$row + 1, col == start$col) %>%
pivot_wider(numeric, col)
  output <- cbind(month.col, return.col)
  output
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100
This gives you the first column, the one with the year. Or use "-14" or whatever you have for years.
In a similar way, grep("Monthly",dat)[1] gives you the second column.

dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

I am computing a dplyr::summarize across a dataframe of sales data.
I do a group-by (S,D,Y), then within each group, compute medians and means for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for that S,D,Y and set of weeks, there will simply be no row with those values in df (take it that means zero sales for that particular set of parameters). In other words, impute X=0 in any structurally missing rows (but I hope I don't need to melt/cast the original df, to avoid bloat. Similar to cast(fill....,add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
Is it better to use summarize than dplyr::filter, because filter physically drops rows so I have to assign the results to df.tmp then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of summarize computations make the code harder to read.
Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
Not all combinations of S,D,Y-groups and filter (for those weeks) have rows, so how to get the summarize to replace NA on any missing rows? Currently I do as below.
Sorry both the code and dataset are proprietary, but here's the code idiom, and below is code you should run first to generate sample-data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
summarize(ysd_med543_X = median(X),
ysd_mean543_X = mean(X)
) %.% ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
and run this first to generate sample-data:
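rep_vector <- function(vv, n) {
  unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}
n <- 7  # inferred, not in the original post: D = 20:26 has 7 values per block
m <- 3  # inferred, not in the original post: three S/Y blocks, so m*n rows
df = data.frame(S = rep_vector(10:12, n), D = 20:26,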
Y = rep_vector(2005:2007, n),
Week = round(52*runif(m*n)),
X = 4e4*runif(m*n) + 1e4 )
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]
I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:
df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
md_X = median(X[Week >=5 & Week <= 43]),
mn_X = mean(X[Week >=5 & Week <= 43]))
Here, still we've the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here. So, you'll have to use ifelse, once again IIUC. But that'd be a little nicer if mutate supports expressions.
What I've in mind is something like:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
  { tmp = Week >= 5 & Week <= 43;
    md_X = ifelse(length(tmp), median(X[tmp]), 0),
    md_Y = ifelse(length(tmp), mean(X[tmp]), 0)
  })
So, we'll have to workaround in this manner probably:
df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43,
md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
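For comparison, the same "windowed statistic per group, 0 where the group has no rows in the window" can be sketched in base R, which sidesteps the %.% operator (no longer available in current dplyr releases). The column names follow the question; the four-row data frame and the helper name `windowed` are made up for this sketch.

```r
# Hypothetical data: the first group has two in-window rows, the others none
df <- data.frame(S = c(10, 10, 10, 11),
                 D = c(20, 20, 26, 20),
                 Y = c(2005, 2005, 2005, 2006),
                 Week = c(6, 10, 0, 44),
                 X = c(100, 300, 500, 700))

in_win <- df$Week >= 5 & df$Week <= 43   # computed once, reused for every statistic
key <- paste(df$S, df$D, df$Y)           # one label per S,D,Y group

windowed <- function(f) {
  # Statistic over in-window rows only, one value per group that has any
  stat <- tapply(df$X[in_win], key[in_win], f)
  # Map back to every row; groups with no in-window rows get NA, then 0
  out <- as.vector(stat)[match(key, names(stat))]
  out[is.na(out)] <- 0
  out
}

df$ysd_med543_X  <- windowed(median)
df$ysd_mean543_X <- windowed(mean)
```

Because `in_win` is computed once and reused, adding more summary variables does not repeat the subsetting expression, which was one of the concerns in the question.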
