Trimming NAs based on column subset - a more elegant solution? - r

A New Year's quandary for the Stack Overflow community, whose posts and answers have been a great help in the past (this is my first question). I've found a workaround, but I'm wondering whether other approaches or solutions might be suggested.
I am attempting to remove trailing NAs from a large data.frame, but those NAs are found in only a few of its columns, and I would like to retain all columns in the output. Here is a representative data subset:
df=data.frame(var1=rep("A", 8), var2=c("a","b","c","d","e","f","g","h"), var3=c(0,1,NA,2,3,NA,NA,NA), var4=c(0,0,NA,4,5,NA,NA,NA), var5=c(0,0,NA,0,2,4,NA,NA))
Goals of the process:
Trim trailing NAs based on NA presence in var3,var4 and var5
Retain all columns in final output
Only remove trailing NAs (i.e. row 3 remains in record as a placeholder)
Only trim a row if all three columns are NA in it (i.e. rows 7 and 8, but not row 6)
Based on these goals, the solution should remove the last two rows of df:
df.output = df[-c(7,8),]
The behaviour of na.trim (in the zoo package) is ideal (as it limits removal to those NA's at the end of the data.frame, with sides="right"), and my work-around involved altering the na.trim.default function to include a subset term.
Any suggestions? Many thanks for any help.
EDIT: Just to complete this question, below is the function I created from the na.trim.default code which also works, but as noted, does require loading the zoo package.
na.trim.multiplecols <- function(object, colrange, sides = c("both", "left", "right"),
                                 is.na = c("any", "all"), ...)
{
  is.na <- match.arg(is.na)
  nisna <- if (is.na == "any" || length(dim(object[, colrange])) < 1) {
    complete.cases(object[, colrange])
  } else rowSums(!is.na(object[, colrange])) > 0
  idx <- switch(match.arg(sides),
                left = cumsum(nisna) > 0,
                right = rev(cumsum(rev(nisna) > 0) > 0),
                both = (cumsum(nisna) > 0) & rev(cumsum(rev(nisna)) > 0))
  if (length(dim(object)) < 2)
    object[idx]
  else object[idx, , drop = FALSE]
}

Something based on max(which(!is.na())) will work. We use this to find the largest index of non-missing data from the columns of interest.
Using your df
ind <- max(max(which(!is.na(df$var3))),
max(which(!is.na(df$var4))),
max(which(!is.na(df$var5))))
df[1:ind, ]
var1 var2 var3 var4 var5
1 A a 0 0 0
2 A b 1 0 0
3 A c NA NA NA
4 A d 2 4 0
5 A e 3 5 2
6 A f NA NA 4
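The same idea can be wrapped in a small base R helper so that the columns of interest are passed by name. This is only a sketch: `trim_trailing_na` and its arguments are hypothetical names, not part of base R or zoo, and returning an empty data frame when every row is NA is an assumption.

```r
# Hypothetical helper: drop trailing rows in which all of `cols` are NA
trim_trailing_na <- function(data, cols) {
  # TRUE for rows with at least one non-NA value in the chosen columns
  has_value <- rowSums(!is.na(data[, cols, drop = FALSE])) > 0
  # Assumption: if no row has a value, return an empty data frame
  if (!any(has_value)) return(data[0, , drop = FALSE])
  # Keep everything up to the last row that still holds a value
  data[seq_len(max(which(has_value))), , drop = FALSE]
}

df <- data.frame(var1 = rep("A", 8),
                 var2 = c("a", "b", "c", "d", "e", "f", "g", "h"),
                 var3 = c(0, 1, NA, 2, 3, NA, NA, NA),
                 var4 = c(0, 0, NA, 4, 5, NA, NA, NA),
                 var5 = c(0, 0, NA, 0, 2, 4, NA, NA))

out <- trim_trailing_na(df, c("var3", "var4", "var5"))
```

Row 3 stays in place because only rows after the last non-missing row are dropped.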

Edit: First solution using base rle and apply
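# Flag rows where var3..var5 are all NA, then run-length encode the flags
runs <- rle(apply(as.matrix(df[, 3:5]), 1, function(x) all(is.na(x))))
# Length of the final run if it is an all-NA run, otherwise 0
r <- if (runs$values[length(runs$values)]) runs$lengths[length(runs$lengths)] else 0
# head(df, -0) would return zero rows, so only trim when r > 0
if (r > 0) head(df, -r) else df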
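To make the run-length step easier to follow, here is a sketch that prints the intermediate values for the example df. The variable names (`all_na`, `runs`, `n_trail`, `trimmed`) are illustrative only.

```r
df <- data.frame(var1 = rep("A", 8),
                 var2 = c("a", "b", "c", "d", "e", "f", "g", "h"),
                 var3 = c(0, 1, NA, 2, 3, NA, NA, NA),
                 var4 = c(0, 0, NA, 4, 5, NA, NA, NA),
                 var5 = c(0, 0, NA, 0, 2, 4, NA, NA))

# One flag per row: are var3, var4 and var5 all missing?
all_na <- apply(as.matrix(df[, 3:5]), 1, function(x) all(is.na(x)))

# Run-length encoding of the flags
runs <- rle(unname(all_na))
print(runs$values)   # FALSE  TRUE FALSE  TRUE
print(runs$lengths)  # 2 1 3 2

# Only the final run matters, and only if it is a run of all-NA rows
n_trail <- if (tail(runs$values, 1)) tail(runs$lengths, 1) else 0L

# Guard against n_trail == 0, because head(df, -0) returns no rows
trimmed <- if (n_trail > 0) head(df, -n_trail) else df
```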
Second solution using Rle from package IRanges:
require(IRanges)
t <- min(sapply(df[,3:5], function(x) {
o <- Rle(x)
val <- runValue(o)
if (is.na(val[length(val)])) {
len <- runLength(o)
out <- len[length(len)]
} else {
out <- 0
}
}))
if (t > 0) head(df, -t) else df  # head(df, -0) would drop every row

Related

R Error (from NA's to 0): duplicate subscripts for column in Data Frame

I looked at multiple answers on Stack Overflow, but I still couldn't fix my problem. I made a function that works, but I added something and now it won't work anymore. I want to replace all NAs with 0, which seems simple to me.
This is my function, and I added bframe[is.na(bframe)] <- 0:
B <- function(frame1, frame2, column){
bframe <- merge(frame1, frame2, by = column, all = TRUE)
bframe$result <- bframe$freq.x - bframe$freq.y
bframe$percentage <- (bframe$result/bframe$freq.y)*100
bframe[is.na(bframe)] <- 0
return(bframe)
}
B(DT2_1, 2_1, "BurgS")
However, it gives this error: Error in `[<-.data.frame`(`*tmp*`, thisvar, value = 0) : duplicate subscripts for columns.
The error occurs because there are NAs and it cannot perform the calculations:
BurgS freq.x freq.y  result percentage percentageABS
    1   9204 184042 -174838  -94.99897      94.99897
    2    150   3034   -2884  -95.05603      95.05603
    3    130   2602   -2472  -95.00384      95.00384
   98     NA     47      NA         NA            NA
Not every data frame has this structure, so I am looking for a solution that changes NA's in the whole data set. Can someone help me out?
Edited 26/6/2018: I stumbled upon the solution myself. The code is below: the NA in freq.x is changed to 0 right after the merge, so it can still take part in the calculations whose outcome is shown in the last three columns:
B <- function(frame1, frame2, column){
  bframe <- merge(frame1, frame2, by = column, all = TRUE)
  # Replace missing freq.x values once bframe exists, before the calculations
  bframe$freq.x[is.na(bframe$freq.x)] <- 0
  bframe$result <- bframe$freq.x - bframe$freq.y
  bframe$percentage <- (bframe$result/bframe$freq.y)*100
  return(bframe)
}
B(DT2_1, 2_1, "BurgS")
I was getting the same error and found another solution.
Instead of:
bframe[is.na(bframe)] <- 0
try adding a comma after the is.na function:
bframe[is.na(bframe), ] <- 0
This worked perfectly for me!
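Another way to zero out the NAs, without indexing the whole data frame at once, is to replace them column by column and only in numeric columns. The sketch below is an assumption-laden illustration: `frame1` and `frame2` are made-up inputs built from the numbers in the printed output above, and a shared `freq` column is assumed so that merge() produces `freq.x` and `freq.y`.

```r
# Made-up inputs; values are taken from the output shown in the question
frame1 <- data.frame(BurgS = c(1, 2, 3), freq = c(9204, 150, 130))
frame2 <- data.frame(BurgS = c(1, 2, 3, 98), freq = c(184042, 3034, 2602, 47))

# Full outer join; the shared non-key column becomes freq.x / freq.y
bframe <- merge(frame1, frame2, by = "BurgS", all = TRUE)

# Replace NA with 0 in every numeric column, one column at a time
num_cols <- vapply(bframe, is.numeric, logical(1))
bframe[num_cols] <- lapply(bframe[num_cols], function(col) {
  col[is.na(col)] <- 0
  col
})

# The calculations now work for rows that were missing on one side
bframe$result <- bframe$freq.x - bframe$freq.y
bframe$percentage <- (bframe$result / bframe$freq.y) * 100
```

Doing the replacement before the calculations means the derived columns never contain NA in the first place.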

R: find a condition to replace with NA and redirect to a dataset

I have a dataset in which I need to filter out negative values and replace them with NA.
Here is my code:
A[ A < 0 ] = NA
How can I generate one new dataset like this:
A_postive <- A[ A < 0 ] = NA
Thanks!
We can use subset to get the 'value' column with values less than 0 and replace that with NA using transform.
A_positive <- transform(subset(A, value < 0), value = NA)
Thanks to @thelatemail for the good comments. The simple way:
A_positive <- replace(A, A < 0 | A > 10, NA)
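As a small illustration of why replace() fits here: it returns a modified copy and leaves the original object untouched. The data frame below is made up for the example, and only the `A < 0` condition from the question is used.

```r
# Made-up numeric data frame
A <- data.frame(x = c(-1, 2, 5), y = c(3, -4, 6))

# replace() returns a copy with the selected cells set to NA
A_positive <- replace(A, A < 0, NA)

print(A)           # original still contains the negative values
print(A_positive)  # negatives are now NA
```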

Suppress replacement errors in R

I have a sequence of data frame subsetting operations. Some of them might fail because the rows to be replaced do not exist. I would still like the others to execute. Example:
source_data[source_data$abbr_d == "bdp",]$party_id <- 32
source_data[source_data$abbr_d == "svp",]$party_id <- 4
source_data[source_data$abbr_d == "cvp",]$party_id <- 2
source_data[source_data$abbr_d == "fdp",]$party_id <- 1
source_data[source_data$abbr_d == "gps",]$party_id <- 13
source_data[source_data$abbr_d == "sp",]$party_id <- 3
source_data[source_data$abbr_d == "csp",]$party_id <- 8
source_data[source_data$abbr_d == "pcs",]$party_id <- 8
Error in `$<-.data.frame`(`*tmp*`, "party_id", value = 13) :
replacement has 1 row, data has 0
source_data[source_data$abbr_d == "lega",]$party_id <- 18
source_data[source_data$abbr_d == "edu",]$party_id <- 16
source_data[source_data$abbr_d == "glp",]$party_id <- 31
I would like the script to continue after the error has been thrown. I've tried using tryCatch() but that doesn't really help because I don't know in advance at which point the replacement will fail.
Is there a way to tell R to just "not care" about those replacement errors? And still continue with the next replacement operations?
The only solution I came up with is to use if-statements like this, which is tedious:
if(nrow(source_data[source_data$abbr_d == "lega", 1]) > 0){
source_data[source_data$abbr_d == "lega",]$party_id <- 18
}
if(nrow(source_data[source_data$abbr_d == "edu", 1]) > 0){
source_data[source_data$abbr_d == "edu",]$party_id <- 16
}
etc...
That is quite verbose code. Luckily, there is a way to get this done in a fraction of the code while also avoiding your issue. My suggestion is to use a lookup table to build the party_id column:
df = data.frame(abbr_d = sample(LETTERS[1:8], 100, replace = TRUE))
lookup_table = 1:8
names(lookup_table) = LETTERS[1:8]
# A B C D E F G H
# 1 2 3 4 5 6 7 8
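# as.character() makes sure a factor column is looked up by name, not by integer code
df$party_id = lookup_table[as.character(df$abbr_d)]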
So, you create the link between abbr_d and party_id once (here letters and simple numbers, but simply substitute your own values), and use the df$abbr_d column to index the lookup table by name. This maps each label in abbr_d to its corresponding party_id.
The error you see is avoided because only abbr_d values that are actually in the data are looked up. Entries in the lookup table that never occur in the data do not pose an issue.
A dplyr approach as a bonus:
library(dplyr)
df %>% mutate(party_id = lookup_table[as.character(abbr_d)])
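Applied to the abbreviations from the question, the lookup table could look like the sketch below. The codes are copied from the question; the small `source_data` frame, the name `party_lookup`, and the unknown label "xyz" are made up for illustration.

```r
# Codes copied from the question's assignments
party_lookup <- c(bdp = 32, svp = 4, cvp = 2, fdp = 1, gps = 13, sp = 3,
                  csp = 8, pcs = 8, lega = 18, edu = 16, glp = 31)

# Made-up data: "gps" does not occur, and "xyz" is not in the table
source_data <- data.frame(abbr_d = c("svp", "sp", "fdp", "xyz"),
                          stringsAsFactors = FALSE)

# Look up by name; labels without an entry become NA instead of raising an error
source_data$party_id <- unname(party_lookup[as.character(source_data$abbr_d)])
```

Parties that are absent from the data (such as "gps" here) are simply never looked up, so no replacement can fail.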
You can use the data.table library to avoid the issue:
txt<-"
1,a,1
2,b,2
3,c,3
4,d,4
"
dat = read.delim(textConnection(txt),
header=FALSE,sep=",",strip.white=TRUE)
dat
dat[dat$V2=="e",]$V3<-4
# Error in `$<-.data.frame`(`*tmp*`, "V3", value = 4) :
# replacement has 1 row, data has 0
library(data.table)
data=as.data.table(dat)
data[data$V2=="e",]$V3<-4
# no error thrown
data.table is often faster than data frame, afaik.
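For a plain data frame, a possible alternative (not taken from the answers above) is to select the column first and then index it with the condition. Assigning into a zero-length selection of a vector is a no-op, so no error is raised. The sketch assumes the target column already exists; the data is made up.

```r
dat <- data.frame(V1 = 1:4,
                  V2 = c("a", "b", "c", "d"),
                  V3 = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)

# No row has V2 == "e": nothing is replaced and no error is thrown
dat$V3[dat$V2 == "e"] <- 4

# A matching row is updated as usual
dat$V3[dat$V2 == "b"] <- 20
```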

Operations on elements of column vectors

I have a column vector containing 1's. I also have another numeric column containing numbers.
Example:
day_eq day
1 1
1 5
1 3
1 2
I now want to say:
If an element from day is smaller than its corresponding element in day_eq,
make invalid (a column vector element) = 5.
This is my code:
for (i in 1:nrow(setin)){
if (setin[[i,"day"]]<setin[[i,"day_eq"]]){
setin[[i,"valid"]] = 0
setin[[i,"invalid_code"]] = 5
}
}
It isn't working. It keeps saying:
Error in if (setin[[i, "day"]] < setin[[i, "day_eq"]]) { :
missing value where TRUE/FALSE needed
or
In if (test.ID1$day_eq > test.ID1$day) { :
the condition has length > 1 and only the first element will be used
Where test.ID1 is the set name.
You don't need a loop for that. I'm not sure exactly what you are doing... but ifelse should be able to help you...
setin$valid <- ifelse(setin$day < setin$day_eq, 0, NA)
setin$invalid_code <- ifelse(setin$day < setin$day_eq, 5, NA)
your data is
day_eq <- c(1,1,1,1)
day <- c (1,5,3,2)
setin <- data.frame(day_eq,day)
the solution using dplyr is
library(dplyr)
setin %>% mutate(invalid = ifelse (day < day_eq, 5, 0))
I used setin as the data frame name; since you also use test.ID1, just substitute that name if needed.
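The "missing value where TRUE/FALSE needed" error typically appears when the comparison inside if() evaluates to NA, for example because day is missing in some row. A vectorised sketch that tolerates NA is shown below. It assumes that rows start out as valid (valid = 1) and that a missing day should be left unflagged; both are assumptions, as is the made-up data.

```r
# Made-up data with one missing day
setin <- data.frame(day_eq = c(1, 1, 1, 1), day = c(1, 0, NA, 2))

# Assumed defaults: every row starts as valid with no invalid code
setin$valid <- 1
setin$invalid_code <- NA_real_

# which() keeps only positions where the comparison is TRUE and drops NA
bad <- which(setin$day < setin$day_eq)

setin$valid[bad] <- 0
setin$invalid_code[bad] <- 5
```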

Last Observation Carried Forward In a data frame? [duplicate]

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.
Here is a simple code to do it (question after it):
LOCF <- function(x)
{
# Last Observation Carried Forward (for a left to right series)
LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
x[LOCF:length(x)] <- x[LOCF]
return(x)
}
# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))
Now this works great for simple vectors. But if I were to try to use it on a data frame:
a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess
It will turn my data frame into a character matrix.
Can you think of a way to do LOCF on a data.frame, without turning it into a matrix? (I could use loops and such to correct the mess, but would love for a more elegant solution)
This already exists:
library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))
If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.
na.locf <- function(x) {
v <- !is.na(x)
c(NA, x[v])[cumsum(v)+1]
}
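To see how the indexing trick works and how it can be applied to a data frame without going through a matrix, here is a sketch. The function is renamed `locf_vec` (a hypothetical name) to avoid masking zoo's na.locf. It fills down each column rather than across rows, and strings are kept as character because factor columns are not handled by this sketch.

```r
locf_vec <- function(x) {
  v <- !is.na(x)                 # TRUE where a value is observed
  # cumsum(v) counts observed values so far; +1 skips the leading NA slot
  c(NA, x[v])[cumsum(v) + 1]
}

# Leading NA stays NA, later gaps take the last observed value
filled <- locf_vec(c(NA, 1, NA, 3, NA))

# Column-wise on a data frame; a[] <- keeps the data.frame structure
a <- data.frame(id = rep("a", 4), n = 1:4, val = c(1, NA, 3, NA),
                stringsAsFactors = FALSE)
a[] <- lapply(a, locf_vec)
```

Because each column is processed on its own, the character, integer and numeric columns keep their types.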
Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:
library(tidyr)  # tidyr also makes the %>% pipe available
a <- data.frame(col1 = rep("a",4), col2 = 1:4,
col3 = 1:4, col4 = c(1,NA,NA,NA))
a
# col1 col2 col3 col4
# 1 a 1 1 1
# 2 a 2 2 NA
# 3 a 3 3 NA
# 4 a 4 4 NA
a %>% tidyr::fill(col4)
# col1 col2 col3 col4
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 3 3 1
# 4 a 4 4 1
There are a bunch of packages implementing exactly this functionality.
(with same basic functionality, but some differences in additional options)
spacetime::na.locf
imputeTS::na_locf
zoo::na.locf
xts::na.locf
tidyr::fill
Added a benchmark of these methods for @Alex:
I used the microbenchmark package and the tsNH4 time series, which has 4552 observations.
In this case, na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. Be careful, though: this is a benchmark on one specific time series, so the code is included for you to test your own use case.
Here is the code, if you want to recreate the benchmark with a self selected time series:
library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)
# Create a data.frame from the tsNH4 series
df <- as.data.frame(tsNH4)
res <- microbenchmark(imputeTS::na_locf(tsNH4),
zoo::na.locf0(tsNH4),
zoo::na.locf(tsNH4),
tidyr::fill(df, everything()),
spacetime::na.locf(tsNH4),
times = 100)
ggplot2::autoplot(res)
plot(res)
# code just to show that each method produces correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())
This question is old, but for posterity: the best solution is to use the data.table package with roll = TRUE (a rolling join).
I ended up solving this using a loop:
fillInTheBlanks <- function(S) {
L <- !is.na(S)
c(S[L][1], S[L])[cumsum(L)+1]
}
LOCF.DF <- function(xx)
{
# won't work well if the first observation is NA
orig.class <- lapply(xx, class)
new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))
for(i in seq_along(orig.class))
{
if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])
}
#t(na.locf(t(a)))
return(new.xx)
}
a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)
Instead of apply() you can use lapply() and then transform the resulting list to data.frame.
LOCF <- function(x) {
# Last Observation Carried Forward (for a left to right series)
LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
x[LOCF:length(x)] <- x[LOCF]
return(x)
}
a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
