Improve code efficiency in R

I've been working on code that reads in all the sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature", or "LEVEL" and "TEMPERATURE". The code works, but I am working on improving my coding clarity and efficiency, so any advice in those regards would be greatly appreciated.
My function 1) reads in the data to a list of dataframes, 2) gets rid of any NA columns that were accidentally read in, 3) combines "Date" and "Time" to "DateTime" for each dataframe, 4) rounds "DateTime" to the nearest 5 minutes for each dataframe, 5) replaces "Date" and "Time" in each dataframe with "DateTime". I've started getting more comfortable with lapply, but am wondering if I can improve the code efficiency at all instead of having so many separate lapply calls.
library(readxl)
library(plyr)

read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  # 1) read every sheet into a list of dataframes
  data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(data) <- sheets
  # 2) drop any columns that contain NAs (the accidentally read-in NA columns)
  clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
  # 3) combine "Date" and "Time" into "DateTime"
  date <- lapply(clean, "[[", 1)
  time <- lapply(clean, "[[", 2)
  time <- lapply(time, function(z) format(z, format = "%H:%M"))
  datetime <- Map(paste, date, time)
  datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
  # 4) round "DateTime" to the nearest 5 minutes (300 seconds)
  rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b) / (5 * 60)) * (5 * 60), origin = "1970-01-01"))
  # 5) append "DateTime" and drop the original "Date" and "Time" columns
  addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = FALSE)
  final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
  return(final)
}
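One way to cut down on the repeated lapply passes (a sketch using a hypothetical read_excel_allsheets2; it assumes every sheet has the same layout of Date, Time, then two measurement columns) is to do all the per-sheet work inside a single lapply, so each sheet is only touched once:

read_excel_allsheets2 <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  data <- lapply(sheets, function(sheet) {
    y <- readxl::read_excel(filename, sheet = sheet)
    y <- y[, colSums(is.na(y)) == 0]  # drop columns containing any NA
    dt <- as.POSIXct(paste(y[[1]], format(y[[2]], "%H:%M")),
                     format = "%Y-%m-%d %H:%M")
    # round to the nearest 5 minutes (300 seconds)
    y$DateTime <- as.POSIXct(round(as.numeric(dt) / 300) * 300,
                             origin = "1970-01-01")
    y[!(names(y) %in% c("Date", "Time"))]
  })
  names(data) <- sheets
  data
}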
Next, I would like to plot all of my data. So, I 1) run my code for a file, 2) combine the list of dataframes into one dataframe while maintaining an "ID" for each dataframe as a column, 3) combine the lowercase and uppercase versions of the variable columns, and 4) add two new columns that split the "ID". Each ID is something like B1CC or B2CO, and I want to split it into "B1" and "CC". Now I can use ggplot very easily.
mysheets <- read_excel_allsheets(filename)
df <- ldply(mysheets)
df$Temp <- rowSums(df[, c("Temperature", "TEMPERATURE")], na.rm = TRUE)
df$Lev <- rowSums(df[, c("Level", "LEVEL")], na.rm = TRUE)
df <- df[!names(df) %in% c("Level", "LEVEL", "Temperature", "TEMPERATURE")]
df$exp <- gsub("^[[:alnum:]]{2}", "", df$.id)   # drop the first two characters: "B1CC" -> "CC"
df$plot <- gsub("[[:alnum:]]{2}$", "", df$.id)  # drop the last two characters: "B1CC" -> "B1"
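For fixed-width IDs, substr may be clearer and faster than the regex splits (a sketch assuming every .id is exactly four characters, like "B1CC"):

df$exp <- substr(df$.id, 3, 4)   # "B1CC" -> "CC"
df$plot <- substr(df$.id, 1, 2)  # "B1CC" -> "B1"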
Here are the data for the first two dataframes; there are over 50 of them, each is relatively big, and there are many files to read. Therefore, I'm looking to improve efficiency (in terms of run time) where I can. Any help or advice is greatly appreciated!
dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612,
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89,
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173,
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level",
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl",
"data.frame"))
dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688,
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031,
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651,
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL",
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =
c("tbl_df", "tbl", "data.frame"))

Related

Updating multiple date columns in R conditioned on a particular column

I have a table that consists of only columns of type Date. The data is about the shopping behavior of customers on a website. The columns correspond to the first time an event is triggered by a customer (NULL if the event never occurred). One of the columns is the purchase event.
Here's an MRE (minimal reproducible example) for the starting state of the database:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, 10613), class = "Date"), action_C = structure(c(10707,
10729, 10739, NA, NA), class = "Date")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"))
I want to update the table so that, within each row, every event date that did not occur within the 30 days prior to the purchase is replaced with NULL. However, if the purchase date is NULL, I'd like to keep the dates of the other events.
So after my envisioned transformation, the above table should look like the following:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, NA), class = "Date"), action_C = structure(c(10707,
10729, NA, NA, NA), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I have yet to be able to achieve this transformation, and would appreciate the help!
Finally, I'd like to transform the above table into a binary format. I've achieved this via the code segment below; however, I'd like to know if I can do this in a simpler way.
df_c <- df_b %>%
  is.na() %>%
  magrittr::not() %>%
  data.frame()
df_c <- df_c * 1
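A simpler equivalent (a sketch): negate is.na() directly and let unary + coerce the logical matrix to 0/1:

df_c <- data.frame(+!is.na(df_b))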
I assume that by saying "replaced by NULL" you actually mean "replaced by NA".
I also assume that the first structure in your question is df_a.
library(dplyr)

df_b <- df_a %>%
  mutate(across(starts_with("action"),
                ~ if_else(purchase - . > 30, as.Date(NA), .)))
mutate(across(cols, func)) applies func to all selected cols.
The real trick here is to use if_else and cast NA to the Date class; otherwise, the dates would be converted to numeric vectors.
Result:
# A tibble: 5 x 4
#   purchase   action_A action_B   action_C
# 1 NA         NA       NA         NA
# 2 NA         NA       NA         NA
# 3 1999-05-18 NA       1999-05-02 1999-05-28
# 4 NA         NA       NA         NA
# 5 1999-05-26 NA       NA         NA
One problem remains as a homework exercise: how do you modify the if_else so that you keep the action if purchase is NA? (This should now be very simple!) I did not include that on purpose, because you omitted it from the question.
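A possible completion of that exercise (a sketch; the !is.na(purchase) guard is the assumed fix for keeping events when purchase is NA):

df_b <- df_a %>%
  mutate(across(starts_with("action"),
                ~ if_else(!is.na(purchase) & purchase - . > 30, as.Date(NA), .)))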

Coalescing columns more efficiently in R

I have two dataframes. The first dataframe (df1) has columns with values for only part of the year; these are the columns that don't have "Stationary" or "Air" in the column name. I used a linear model to predict the rest of the values for the year, which I put into a second dataframe (df2).
df1 = structure(list(Date_Time_GMT_3 =
structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct","POSIXt"),
tzone = "EST"),
X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)),
row.names = c(NA, 6L), class = "data.frame")
df2 = structure(list(Date_Time_GMT_3 =
structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct","POSIXt"),
tzone = "EST"),
Predicted_X20817727_F8AR_U = c(17.421, 15.441, 17.631, 15.781, 15.001, 16.821),
Predicted_X20819742_X1AR_U = c(15.596, 17.996, 13.676, 13.620, 12.860, 13.245),
Predicted_X20819742_R2AR_U = c(14.781, 15.521, 15.821, 17.421, 15.441, 17.631),
Predicted_X20817727_R5AR_U = c(15.596, 17.996, 13.676, 13.620, 12.860, 13.245),
Predicted_X20817727_R7AR = c(13.942, 13.942, 13.846, 17.421, 15.441, 17.631)),
row.names = c(NA, 6L), class = "data.frame")
I am trying to fill values from df2 into df1 where the column names and Date_Time match and df1 has no value (i.e., NA), while maintaining the original values that are already recorded in df1. The code I'm using now works, but I was wondering if there is a more efficient way of doing it, without having to type out each column name. Here is the code I've been using:
### Combine the predicted columns with the mobile loggers so that values in mobile loggers are preserved
df1$Predicted_F8AR <- df2$Predicted_X20817727_F8AR_U
df1$Predicted_R2AR <- df2$Predicted_X20819742_R2AR_U
df1$Predicted_R5AR <- df2$Predicted_X20817727_R5AR_U
df1$Predicted_X1AR <- df2$Predicted_X20819742_X1AR_U

F8AR_U <- df1 %>%
  mutate(F8AR_U = coalesce(X20817727_F8AR_U, Predicted_F8AR)) %>%
  select(X20817727_F8AR_U, F8AR_U)
df1$X20817727_F8AR_U <- F8AR_U$F8AR_U

R2AR_U <- df1 %>%
  mutate(R2AR_U = coalesce(X20819742_R2AR_U, Predicted_R2AR)) %>%
  select(X20819742_R2AR_U, R2AR_U)
df1$X20819742_R2AR_U <- R2AR_U$R2AR_U

R5AR_U <- df1 %>%
  mutate(R5AR_U = coalesce(X20817727_R5AR_U, Predicted_R5AR)) %>%
  select(X20817727_R5AR_U, R5AR_U)
df1$X20817727_R5AR_U <- R5AR_U$R5AR_U

X1AR_U <- df1 %>%
  mutate(X1AR_U = coalesce(X20819742_X1AR_U, Predicted_X1AR)) %>%
  select(X20819742_X1AR_U, X1AR_U)
df1$X20819742_X1AR_U <- X1AR_U$X1AR_U

### Get rid of the predicted columns for the final table export
df1 <- df1[, c(1:13)]
Any ideas?
Consider:
library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(-Date_Time_GMT_3, names_to = c('name1', 'name', 'name2'),
               names_pattern = '(.*?)_([^_]+)_(.*)') %>%
  left_join(df2 %>%
              pivot_longer(-Date_Time_GMT_3,
                           names_pattern = 'Predicted_[^_]+_([^_]+)'),
            by = c('Date_Time_GMT_3', 'name')) %>%
  mutate(value = coalesce(value.x, value.y)) %>%
  pivot_wider(id_cols = Date_Time_GMT_3, names_from = c(name1, name, name2))
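An alternative that avoids reshaping (a sketch; it assumes the rows of df1 and df2 are already aligned on Date_Time_GMT_3, as in the example data) is to match each Predicted_ column to its target by name and coalesce in a loop:

library(dplyr)

pred <- setdiff(names(df2), "Date_Time_GMT_3")
targets <- sub("^Predicted_", "", pred)  # "Predicted_X20817727_F8AR_U" -> "X20817727_F8AR_U"
for (i in seq_along(pred)) {
  df1[[targets[i]]] <- coalesce(df1[[targets[i]]], df2[[pred[i]]])
}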

na.rm in custom function returns zeros

I have a data frame and a function I wrote for it that includes an na.rm = T argument. Here is the data frame:
df<- structure(list(BLG = c(38.4615384615385, 33.3333333333333, 0),
LMB = c(20, 100, NA), RSF = c(100, 100, NA), WHC = c(66.6666666666667,
33.3333333333333, NA), BLC = c(NA_real_, NA_real_, NA_real_
), GSF = c(NA_real_, NA_real_, NA_real_), WSH = c(NA_real_,
NA_real_, NA_real_), CCF = c(NA_real_, NA_real_, NA_real_
), group = c(1L, 1L, 1L)), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
This is the function:
estimate = function(df, y_true, na.rm = T) {
  sqrt(colSums((t(t(df) - y_true))^2, na.rm = T) / 3) / y_true * 100
}
y_true<- c(28, 50, 77, 46, 80, 12, 100, 50)
Run function:
library(dplyr)

final <- df %>%
  group_by(group) %>%
  group_modify(~ as.data.frame.list(estimate(., y_true)))
The issue is that when all three observations are NA, the function returns 0 instead of NA (columns 6:9 of final). If only a fraction of the observations are NA, the function works as intended (column 5 of final).
I'm guessing this is due to how I formatted the na.rm argument in the function, but I'm not sure how to fix it. Does anyone know where I'm going wrong?
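The zeros come from colSums(): with na.rm = TRUE, summing a column whose values are all NA is a sum over an empty set, which is 0. Note also that the function passes the literal T to colSums instead of its own na.rm argument. A minimal sketch of one way to fix both (the NA guard is an assumption about the desired output):

estimate <- function(df, y_true, na.rm = TRUE) {
  out <- sqrt(colSums((t(t(df) - y_true))^2, na.rm = na.rm) / 3) / y_true * 100
  out[colSums(!is.na(df)) == 0] <- NA  # all-NA columns report NA instead of 0
  out
}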

How to check if subsequent date variables occur on or after previous ones

I have 10 date variables, with the assumption that each subsequent variable is on or after the previous one (I have to check whether this assumption is true). I want to compare TloCriminal1CrimeDetails1Di_0001 to TloCriminal1CrimeDetails2Di_0001, TloCriminal1CrimeDetails2Di_0001 to TloCriminal1CrimeDetails3Di_0001, ..., TloCriminal1CrimeDetails9Di_0001 to TloCriminal1CrimeDetails10D_0001. Ideally, for each of those pairs, I want to output variables called, for example, compare1to2, compare2to3, ..., compare9to10 that equal 1 if the second instance of the pair is on or after the first instance and 0 otherwise. If this isn't possible, then an "overall" variable that equals 1 if any of the pairs are "bad" (e.g., the second date is before the first) and 0 otherwise would suffice.
I tried working in SAS but realized it was quite impossible, so I swapped over to R. I don't have a good starting point. Here is a snippet of my dataset. Thank you for your help!
structure(list(TloCriminal1CrimeDetails1Di_0001 = structure(c(10197,
12205, 15979, 12586, NA, 13787, 12913, 14616), label = "TloCriminal1CrimeDetails1DispositionDate", format.sas = "DATE", class = "Date"),
TloCriminal1CrimeDetails2Di_0001 = structure(c(10148, NA,
15979, 12586, NA, 14516, 12913, 14665), label = "TloCriminal1CrimeDetails2DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails3Di_0001 = structure(c(10148, NA,
NA, 12586, NA, 13787, 12913, 14665), label = "TloCriminal1CrimeDetails3DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails4Di_0001 = structure(c(NA, NA, NA,
NA, NA, NA, 12913, 14670), label = "TloCriminal1CrimeDetails4DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails5Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails5DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails6Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails6DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails7Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails7DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails8Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails8DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails9Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails9DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails10D_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails10DispositionDate", format.sas = "MMDDYY", class = "Date")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"), label = "CRIME_CHK")
We can compare the data frame without its first column to the data frame without its last column, which lines each column up against the previous one, and do the comparison in a vectorized way:
out <- +(df1[-1] >= df1[-ncol(df1)])
out[is.na(out)] <- FALSE
If instead you want a single overall check per comparison column:
colSums(out, na.rm = TRUE) == 0
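To build the compare1to2-style indicator columns the question asks for (a sketch that reuses out from above):

colnames(out) <- paste0("compare", 1:ncol(out), "to", 2:(ncol(out) + 1))
df1 <- cbind(df1, out)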
You could loop with sapply over the columns and check whether the differences are all greater than or equal to zero.
sapply(d, function(x) all(diff(na.omit(x)) >= 0))
# TloCriminal1CrimeDetails1Di_0001 TloCriminal1CrimeDetails2Di_0001
# FALSE FALSE
# TloCriminal1CrimeDetails3Di_0001 TloCriminal1CrimeDetails4Di_0001
# FALSE TRUE
# TloCriminal1CrimeDetails5Di_0001 TloCriminal1CrimeDetails6Di_0001
# TRUE TRUE
# TloCriminal1CrimeDetails7Di_0001 TloCriminal1CrimeDetails8Di_0001
# TRUE TRUE
# TloCriminal1CrimeDetails9Di_0001 TloCriminal1CrimeDetails10D_0001
# TRUE TRUE
Or over the rows:
data.frame(check=apply(d, 1, function(x) all(diff(as.Date(na.omit(x))) >= 0)))
# check
# 1 FALSE
# 2 TRUE
# 3 TRUE
# 4 TRUE
# 5 TRUE
# 6 FALSE
# 7 TRUE
# 8 TRUE

Remove NA columns in a list of dataframes

I am having some trouble cleaning data that I imported from Excel with readxl. readxl created a large list of objects with classes c("data.frame", "tbl_df", "tbl") (I would also like to know why/how it has multiple classes assigned to it). Each of those objects is one of the sheets in the original Excel workbook. The problem is that each of those objects (sheets) may have many columns entirely filled with NAs. I have scanned through Stack Overflow, found some similar problems, and tried to apply the given solutions, like here and here (the first one is the most like my problem). However, when I try this:
lapply(x, function(y) y[, !is.na(y)])
I get the following error:
Error in `[.data.frame`(y, , !is.na(y)) : undefined columns selected
I've also tried this:
lapply(x, function(y) y[!is.na(y)])
but it reduces all of my dataframes to only the first column. I think it's something to do with my dataframe-within-list syntax. I've experimented with different iterations of y[[]][] and even recently found this interesting pattern for lapply: lapply(x, "[[", y), but couldn't make it work.
Here are the first two objects in my list of dataframes (any hints on how to be more efficient in dput-ing this data are also appreciated). As you can see, the first object has no NA columns, whereas the second has 5 NA columns. I would like to remove those 5 NA columns, but do so for all objects in my list.
Any help is greatly appreciated!
dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612,
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89,
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173,
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level",
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl",
"data.frame"))
dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688,
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031,
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651,
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL",
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =
c("tbl_df", "tbl", "data.frame"))
How about this:
lapply(df_list, function(df) df[, colSums(is.na(df)) == 0])
Or maybe:
lapply(df_list, function(df) df[, colSums(is.na(df)) < nrow(df)])
if you want to keep columns where some, but not all, rows are NA.
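An equivalent using base Filter, which keeps only the columns with no NA values (a sketch):

lapply(df_list, function(df) Filter(function(col) !anyNA(col), df))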
