Remove NA columns in a list of dataframes - r

I am having some trouble cleaning data that I imported from Excel with readxl. readxl created a large list of objects with class c("tbl_df", "tbl", "data.frame") (I would also like to know why/how an object gets multiple classes assigned to it). Each of those objects is one of the sheets in the original Excel workbook. The problem is that each object (sheet) may have many columns entirely filled with NAs. I have scanned through Stack Overflow, found some similar problems, and tried to apply the given solutions, like here and here (the first one is the most like my problem). However, when I try this:
lapply(x, function(y) y[, !is.na(y)])
I get the following error:
Error in `[.data.frame`(y, , !is.na(y)) : undefined columns selected
I've also tried this:
lapply(x, function(y) y[!is.na(y)])
but it reduces all of my dataframes to only the first column. I think it's something to do with my dataframe-within-list syntax. I've experimented with different iterations of y[[]][], and recently found this interesting pattern in lapply: lapply(x, "[[", y), but couldn't make it work.
Here are the first two objects in my list of dataframes (any hints on how to be more efficient in dput-ing this data are also appreciated). As you can see, the first object has no NA columns, whereas the second has 5 NA columns. I would like to remove those 5 NA columns, but do so for all objects in my list.
Any help is greatly appreciated!
dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612,
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89,
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173,
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level",
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl",
"data.frame"))
dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000,
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688,
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031,
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651,
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL",
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =
c("tbl_df", "tbl", "data.frame"))

Your first attempt fails because is.na(y) on a data frame returns a logical matrix with the same dimensions as y, not one value per column, so y[, !is.na(y)] asks for far more columns than exist, hence "undefined columns selected". (The multiple classes are just S3 inheritance: a tibble carries class c("tbl_df", "tbl", "data.frame"), so tibble methods are tried first and plain data.frame methods serve as the fallback.) What you want is a per-column test. How about this:
lapply(df_list, function(df) df[, colSums(is.na(df)) == 0])
which drops every column containing any NA. Or maybe:
lapply(df_list, function(df) df[, colSums(is.na(df)) < nrow(df)])
if you only want to drop columns where every row is NA.
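A quick check against the dput()'d list above (a sketch, with x as that list; drop = FALSE guards the plain data.frame case, and tibbles never drop anyway):
x_clean <- lapply(x, function(df) df[, colSums(is.na(df)) < nrow(df), drop = FALSE])
names(x_clean[[2]])
# [1] "Date" "Time" "LEVEL" "TEMPERATURE"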

Related

Catching an issue in imp.dates when two events are taking place on the same day

I am creating a calendar using a dynamic dataframe of the following pattern.
structure(list(Date = structure(c(19304, 19305, 19311,
19311, 19312), class = "Date"), Category = c("4",
"6", "1", "0",
"3"), Units_Sold = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), Raised = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), Method = c("Trad",
"Trad", "Unknown", "Trad",
"Unknown"), Day = c(8, 9, 15, 15, 16)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
This is then fed into imp.dates like this:
imp.dates <- rep(NA, 31)
imp.dates[df$Day] <- df$Category
This is then fed into calendR and used to plot some colour-coded events.
As you can probably see from above, two events of two different categories take place on the same day. This causes a bit of a problem: the code doesn't know which category to plot them under, or what text to put in the calendar on that date.
In an attempt to catch this issue, I'm thinking of putting a condition on the dataframe that checks whether more than one event takes place on the same day, and then dropping those rows and renaming the category to something like "2 events taking place today".
My question is whether this would be the best solution, and if not, whether there are better options available. Any advice or pointers greatly appreciated.
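A minimal sketch of that collapsing step, assuming dplyr and the dput()'d df above (the "9" is a hypothetical category code reserved for multi-event days):
library(dplyr)
df_collapsed <- df %>%
  group_by(Date) %>%
  summarise(
    # keep single-event days as-is; collapse multi-event days to a placeholder
    Category = if (n() > 1) "9" else first(Category),
    Day = first(Day),
    .groups = "drop"
  )
imp.dates <- rep(NA, 31)
imp.dates[df_collapsed$Day] <- df_collapsed$Category
The placeholder category can then be mapped to a label like "2 events today" wherever the categories are turned into calendar text.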

Coalescing columns more efficiently in R

I have 2 dataframes. The first (df1) has columns with values at different times of the year; these are the columns that don't have "Stationary" or "Air" in the column name. I used a linear model to predict the rest of the values for the year, which I collected into a second dataframe (df2).
df1 = structure(list(Date_Time_GMT_3 =
structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct","POSIXt"),
tzone = "EST"),
X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)),
row.names = c(NA, 6L), class = "data.frame")
df2 = structure(list(Date_Time_GMT_3 =
structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct","POSIXt"),
tzone = "EST"),
Predicted_X20817727_F8AR_U = c(17.421, 15.441, 17.631, 15.781, 15.001, 16.821),
Predicted_X20819742_X1AR_U = c(15.596, 17.996, 13.676, 13.620, 12.860, 13.245),
Predicted_X20819742_R2AR_U = c(14.781, 15.521, 15.821, 17.421, 15.441, 17.631),
Predicted_X20817727_R5AR_U = c(15.596, 17.996, 13.676, 13.620, 12.860, 13.245),
Predicted_X20817727_R7AR = c(13.942, 13.942, 13.846, 17.421, 15.441, 17.631)),
row.names = c(NA, 6L), class = "data.frame")
I am trying to fill the matching columns in df1 (matched by column name and Date_Time) with values from df2 wherever df1 has no value (i.e. NA), while keeping the original values already recorded in df1. The code I'm using now works, but I was wondering if there is a more efficient way of doing it, without having to type out each column name. Here is the code I've been using:
library(dplyr)
### Combine the predicted columns with the mobile loggers so that values
### already in the mobile loggers are preserved
df1$Predicted_F8AR = df2$Predicted_X20817727_F8AR_U
df1$Predicted_R2AR = df2$Predicted_X20819742_R2AR_U
df1$Predicted_R5AR = df2$Predicted_X20817727_R5AR_U
df1$Predicted_X1AR = df2$Predicted_X20819742_X1AR_U
F8AR_U = df1 %>%
  mutate(F8AR_U = coalesce(X20817727_F8AR_U, Predicted_F8AR)) %>%
  select(X20817727_F8AR_U, F8AR_U)
df1$X20817727_F8AR_U = F8AR_U$F8AR_U
R2AR_U = df1 %>%
  mutate(R2AR_U = coalesce(X20819742_R2AR_U, Predicted_R2AR)) %>%
  select(X20819742_R2AR_U, R2AR_U)
df1$X20819742_R2AR_U = R2AR_U$R2AR_U
R5AR_U = df1 %>%
  mutate(R5AR_U = coalesce(X20817727_R5AR_U, Predicted_R5AR)) %>%
  select(X20817727_R5AR_U, R5AR_U)
df1$X20817727_R5AR_U = R5AR_U$R5AR_U
X1AR_U = df1 %>%
  mutate(X1AR_U = coalesce(X20819742_X1AR_U, Predicted_X1AR)) %>%
  select(X20819742_X1AR_U, X1AR_U)
df1$X20819742_X1AR_U = X1AR_U$X1AR_U
### Get rid of the predicted columns for the final table export
df1 = df1[, c(1:13)]
Any ideas?
consider:
library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(-Date_Time_GMT_3, names_to = c('name1', 'name', 'name2'),
               names_pattern = '([^_]+)_([^_]+)_?(.*)') %>%
  left_join(df2 %>%
              pivot_longer(-Date_Time_GMT_3,
                           names_pattern = 'Predicted_[^_]+_([^_]+)'),
            by = c('Date_Time_GMT_3', 'name')) %>%
  mutate(value = coalesce(value.x, value.y)) %>%
  pivot_wider(id_cols = Date_Time_GMT_3,
              names_from = c(name1, name, name2))
The optional trailing chunk in names_pattern lets two-part names like X20817727_R7AR match too; the join keys on the middle token (e.g. F8AR), and coalesce() keeps the original value whenever one exists.
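If the reshaping feels heavy, a plain base-R loop gives the same result under the same assumptions (rows of df1 and df2 are already aligned on Date_Time_GMT_3, and each prediction column is named "Predicted_" plus the df1 column name):
for (nm in names(df1)[-1]) {
  pred <- paste0("Predicted_", nm)
  if (pred %in% names(df2)) {
    # fill only the NAs; existing df1 values are preserved
    df1[[nm]] <- ifelse(is.na(df1[[nm]]), df2[[pred]], df1[[nm]])
  }
}
No Predicted_* columns are ever added to df1 this way, so the final cleanup step goes away.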

R, in data.table, select some rows

I have this strange situation: I am simply trying to select some rows from a data.table.
dput(DT)
structure(list(Date = structure(c(10959, 10960, 10961, 10962,
10963, 10966, 10967, 10968, 10969, 10970, 10974, 10975, 10976,
10977, 10980, 10981, 10982, 10983, 10984, 10987), class = "Date"),
A = c(51.502148, 47.567955, 44.61731, 42.918453, 46.494991,
49.311516, 48.640915, 47.657368, 48.372677, 48.909157, 51.144493,
50.071529, 48.730328, 49.177395, 48.998569, 48.417381, 48.864449,
48.953861, 48.685623, 47.344421), AA = c(96.840897, 97.561798,
103.329002, 101.598839, 101.406601, 101.214363, 100.397339,
99.820618, 97.802101, 96.120003, 93.717003, 93.813118, 88.093979,
90.400864, 88.045921, 86.748299, 85.450684, 84.489479, 83.287979,
83.432159), AAC = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), AACG = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), row.names = c(NA, -20L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fa9640148e0>, sorted = "Date")
StartDate <- as.Date("2000-01-05")
TestDates <- c(StartDate,
StartDate + duration(6, units = "day"),
StartDate + duration(2, units = "week"))
DT[Date %in% TestDates, ] # works well here.
The real data of "DT" has 20 million rows. Using this same block of codes, R reported:
Empty data.table (0 rows and 7347 cols)
Does anyone know how to pick rows using a vector, in a more reliable way?
I found the problem. In this line of code:
StartDate <- as.Date("2000-01-05")
I was trying to set a base date and then use the following code to get different dates.
TestDates <- c(StartDate,
StartDate + duration(6, units = "day"),
StartDate + duration(2, units = "week"))
But using duration here is wrong: a duration is a fixed number of seconds, not a calendar period, so it can drift off the intended calendar date. Instead, I need:
TestDates <- c(StartDate,
StartDate + days(6),
StartDate + weeks(2))
In my case, I need to get data from different years, for example 2000-01-01 and 2020-01-01. Periods like seconds, minutes, hours, days, weeks, months and years work at the human (calendar) level, so I do not need to worry about leap years. For example:
StartDate <- ymd("2020-01-01") # note, 2020 is leap year
StartDates + duration(1, units = "year")
>[1] "2020-12-31 06:00:00 UTC"
StartDates + years(1)
>[1] "2021-01-01"

How to check if subsequent date variables occur on or after previous ones

I have 10 date variables, with the assumption that each subsequent variable is on or after the previous one (I have to check whether this assumption is true). I want to compare TloCriminal1CrimeDetails1Di_0001 to TloCriminal1CrimeDetails2Di_0001, TloCriminal1CrimeDetails2Di_0001 to TloCriminal1CrimeDetails3Di_0001, ..., TloCriminal1CrimeDetails9Di_0001 to TloCriminal1CrimeDetails10D_0001. Ideally, for each of those pairs, I want to output variables called, for example, compare1to2, compare2to3, ..., compare9to10 that equal 1 if the second date of the pair is on or after the first and 0 otherwise. If this isn't possible, then an "overall" variable that equals 1 if any of the pairs is "bad" (i.e., the second date is before the first) and 0 otherwise would suffice.
I tried working in SAS but realized it was quite impossible, so I swapped over to R. I don't have a good starting point. Here is a snippet of my dataset. Thank you for your help!
structure(list(TloCriminal1CrimeDetails1Di_0001 = structure(c(10197,
12205, 15979, 12586, NA, 13787, 12913, 14616), label = "TloCriminal1CrimeDetails1DispositionDate", format.sas = "DATE", class = "Date"),
TloCriminal1CrimeDetails2Di_0001 = structure(c(10148, NA,
15979, 12586, NA, 14516, 12913, 14665), label = "TloCriminal1CrimeDetails2DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails3Di_0001 = structure(c(10148, NA,
NA, 12586, NA, 13787, 12913, 14665), label = "TloCriminal1CrimeDetails3DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails4Di_0001 = structure(c(NA, NA, NA,
NA, NA, NA, 12913, 14670), label = "TloCriminal1CrimeDetails4DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails5Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails5DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails6Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails6DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails7Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails7DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails8Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails8DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails9Di_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails9DispositionDate", format.sas = "MMDDYY", class = "Date"),
TloCriminal1CrimeDetails10D_0001 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "TloCriminal1CrimeDetails10DispositionDate", format.sas = "MMDDYY", class = "Date")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"), label = "CRIME_CHK")
We can compare the data without its first column against the data without its last column; this lines each column up with its successor and does all nine comparisons in one vectorized step:
out <- +(df1[-1] >= df1[-ncol(df1)])
out[is.na(out)] <- FALSE
If instead you want a single flag per column pair, marking pairs where no row has the second date on or after the first:
colSums(out, na.rm = TRUE) == 0
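To get the compare1to2, ..., compare9to10 variables the question asks for, the columns of out can be renamed and bound back onto the data (a sketch, using the dput()'d data as df1):
colnames(out) <- paste0("compare", 1:9, "to", 2:10)
df1_checked <- cbind(df1, out)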
You could loop with sapply over the columns and check whether the differences are all greater than or equal to zero.
sapply(d, function(x) all(diff(na.omit(x)) >= 0))
# TloCriminal1CrimeDetails1Di_0001 TloCriminal1CrimeDetails2Di_0001
# FALSE FALSE
# TloCriminal1CrimeDetails3Di_0001 TloCriminal1CrimeDetails4Di_0001
# FALSE TRUE
# TloCriminal1CrimeDetails5Di_0001 TloCriminal1CrimeDetails6Di_0001
# TRUE TRUE
# TloCriminal1CrimeDetails7Di_0001 TloCriminal1CrimeDetails8Di_0001
# TRUE TRUE
# TloCriminal1CrimeDetails9Di_0001 TloCriminal1CrimeDetails10D_0001
# TRUE TRUE
Or over the rows:
data.frame(check=apply(d, 1, function(x) all(diff(as.Date(na.omit(x))) >= 0)))
# check
# 1 FALSE
# 2 TRUE
# 3 TRUE
# 4 TRUE
# 5 TRUE
# 6 FALSE
# 7 TRUE
# 8 TRUE

Improve code efficiency

I've been working on code that reads in all the sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature", or "LEVEL" and "TEMPERATURE". The code works, but I am trying to improve my coding clarity and efficiency, so any advice in those regards would be greatly appreciated.
My function 1) reads the data into a list of dataframes, 2) gets rid of any NA columns that were accidentally read in, 3) combines "Date" and "Time" into "DateTime" for each dataframe, 4) rounds "DateTime" to the nearest 5 minutes, and 5) replaces "Date" and "Time" with "DateTime". I've been getting more comfortable with lapply, but I am wondering if I can improve the code's efficiency instead of having so many separate lapply passes.
library(readxl)
library(plyr)
read_excel_allsheets <- function(filename) {
sheets <- readxl::excel_sheets(filename)
data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
names(data) <- sheets
clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
date <- lapply(clean, "[[", 1)
time <- lapply(clean, "[[", 2)
time <- lapply(time, function(z) format(z, format = "%H:%M"))
datetime <- Map(paste, date, time)
datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b)/(5*60))*(5*60),origin='1970-01-01'))
addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = F)
final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
return(final)
}
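If the goal is fewer passes over the list, the per-sheet steps can also be fused into a single lapply; a sketch under the same assumptions as the code above (same column layout in every sheet; as.POSIXct replaces as.POSIXlt so the new column binds cleanly):
read_excel_allsheets2 <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  data <- lapply(sheets, function(s) {
    y <- readxl::read_excel(filename, sheet = s)
    y <- y[, colSums(is.na(y)) == 0]          # drop the all-NA columns
    dt <- as.POSIXct(paste(y[[1]], format(y[[2]], "%H:%M")),
                     format = "%Y-%m-%d %H:%M")
    # round to the nearest 5 minutes (300 seconds)
    y$DateTime <- as.POSIXct(round(as.numeric(dt) / 300) * 300,
                             origin = "1970-01-01")
    y[!(names(y) %in% c("Date", "Time"))]
  })
  names(data) <- sheets
  data
}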
Next, I would like to plot all of my data. So I 1) run my function on a file, 2) combine the list of dataframes into one dataframe while keeping each dataframe's ID as a column, 3) combine the lowercase and uppercase versions of the variable columns, and 4) add two new columns that split the ID. Each ID is something like B1CC or B2CO, and I want to split it into, e.g., "B1" and "CC". Then I can use ggplot very easily.
mysheets <- read_excel_allsheets(filename)
df <- ldply(mysheets)
df$Temp <- rowSums(df[, c("Temperature", "TEMPERATURE")], na.rm = TRUE)
df$Lev <- rowSums(df[, c("Level", "LEVEL")], na.rm = TRUE)
df <- df[!names(df) %in% c("Level", "LEVEL", "Temperature", "TEMPERATURE")]
df$exp <- gsub("^[[:alnum:]]{2}", "", df$.id)  # keep everything after the first two characters
df$plot <- gsub("[[:alnum:]]{2}$", "", df$.id) # keep everything before the last two characters
Here are the data for the first two dataframes, but there are over 50 of them, each is relatively big, and there are many files to read. Therefore, I'm looking to improve efficiency (in terms of run time) wherever I can. Any help or advice is greatly appreciated!
(The dput() output for x[[1]] and x[[2]] is identical to that shown under the first question above.)
