I have a vector x made up of 0s and 1s separated by runs of NAs, for example:
x=c(NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA....)
or
x=c(NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA....)
y=c(seq(1:length(x)))
I would like z to be a new vector equal to y, except where the NAs fall between a 0 and a 1 (not between a 1 and a 0); there, z should repeat the value of y at the position where x=0.
[1] 1 2 3 4 5 6 7 8 9 9 9 9 13 14 15 16 16 16 16 16 21 22 23 24 25 25
or
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25 26
depending on x
I really don't know how to translate this condition into R.
My solution is clunkier than @James's (now deleted) answer, but maybe (?) it's more flexible:
## identify strings of NAs preceded by 0
library(zoo)
na_following_zero <- na.locf(c(1,x))[-1]==0 & is.na(x)
## now identify the 'chunks' to reset
## (there may be a more elegant way to do this)
rr <- rle(na_following_zero)
startvals <- cumsum(c(0,rr$lengths))+1
endvals <- cumsum(rr$lengths)
values <- c(NA,y[startvals-1])
z <- y
## replace values in chunks
for (i in seq_along(rr$values)[rr$values])
  z[startvals[i]:endvals[i]] <- values[i]
If time isn't prohibitive, you can just use a "for" loop:
z <- y
between.0.1 <- rep(FALSE, length(x))
for(i in 2:length(x)){
  if(!is.na(x[i-1]) && x[i-1]==0){ # switch on after a 0
    between.0.1[i] <- TRUE
    value.at.0 <- y[i-1]
    z[i] <- value.at.0
  }
  if(between.0.1[i-1]){ # if switched on, stay switched on
    between.0.1[i] <- TRUE
    z[i] <- value.at.0
  }
  if(!is.na(x[i]) && x[i]==1){ # switch off if at a 1
    between.0.1[i] <- FALSE
    z[i] <- y[i]               # the 1 position itself keeps its own index
    value.at.0 <- NA
  }
}
z[between.0.1] # diagnostic check
Another approach:
y0 <- which(x==0)
y1<-which(x==1)
# need a kicker to start with first zero
y1<-y1[y1>y0[1]]
# and check for end of sequence
if(y1[length(y1)] < length(x)) y1[length(y1)+1] <- length(x)+1
#now y0 and y1 better be same length
z<-y
#now do a loop any way you want
for (jj in 1: length(y0) ) z[y0[jj]:(y1[jj]-1)]<-y[y0[jj]]
Rgames> z
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25
[26] 26
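For completeness, the same fill can be done without an explicit loop by combining the na.locf() mask from the first answer with the index of the most recent 0. This is a sketch of my own rather than one of the posted answers; it assumes x and y as defined in the question:
library(zoo)
## TRUE where x is NA and the most recent non-NA value of x was a 0
after_zero <- na.locf(c(1, x))[-1] == 0 & is.na(x)
## position of the most recent 0 at each index (NA before the first 0 is seen)
zero_pos <- na.locf(ifelse(!is.na(x) & x == 0, seq_along(x), NA), na.rm = FALSE)
## repeat y at the 0 position inside those stretches, keep y everywhere else
z <- ifelse(after_zero, y[zero_pos], y)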
With the help of sebastian-c, I figured out my problem with daily data. Please see: R ifelse condition: frequency of continuously NA
And now I have a data set with hourly data:
set.seed(1234)
day <- c(rep(1:2, each=24))
hr <- c(rep(0:23, 2))
v <- c(rep(NA, 48))
A <- data.frame(cbind(day, hr, v))
A$v <- sample(c(NA, rnorm(100)), nrow(A), prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
What I need to do is: if there are 4 or more consecutive missing day-hours (7AM-7PM) or 3 or more consecutive missing night-hours (7PM-7AM), delete the entire day from the data frame; otherwise just run linear interpolation. Thus, the second day should be deleted entirely, since it has 4 consecutive NAs during the daytime (7AM-10AM). The result should preferably remain a data frame. Please help, thank you!
If I modify the NA_run function from the question you linked to take a variable named v instead of value and return the boolean rather than the data.frame:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$v))
any(runs$lengths[runs$values] >= maxlen)
}
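To make the run-length logic concrete, here is what the rle() check sees on a small hand-made vector (an illustration of my own, not part of the answer):
v <- c(1, NA, NA, NA, NA, 2, NA, 3)
runs <- rle(is.na(v))
runs$lengths[runs$values]            # lengths of the NA runs: 4 1
any(runs$lengths[runs$values] >= 4)  # TRUE: there is a run of 4 or more consecutive NAs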
I can then write a wrapper function to call it twice for daytime and nighttime:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
any(daytime, nighttime)
}
This gives me a data.frame flagging which days to drop (ddply() here is from the plyr package):
> ddply(A, .(day), dropfun)
day V1
1 1 TRUE
2 2 FALSE
>
We can alter the dropfun to return the dataframe instead though:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
if(any(daytime, nighttime)) NULL else x
}
> ddply(A, .(day), dropfun)
day hr v
1 2 0 NA
2 2 1 NA
3 2 2 2.54899107
4 2 3 NA
5 2 4 -0.03476039
6 2 5 NA
7 2 6 0.65658846
8 2 7 0.95949406
9 2 8 NA
10 2 9 1.08444118
11 2 10 0.95949406
12 2 11 NA
13 2 12 -1.80603126
14 2 13 NA
15 2 14 NA
16 2 15 0.97291675
17 2 16 NA
18 2 17 NA
19 2 18 NA
20 2 19 -0.29429386
21 2 20 0.87820363
22 2 21 NA
23 2 22 0.56305582
24 2 23 -0.11028549
>
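The question also asks for linear interpolation on the days that are kept, which the answer above doesn't cover. A minimal sketch of my own using na.approx() from the zoo package (assuming the filtered result is stored in B):
library(plyr)
library(zoo)
B <- ddply(A, .(day), dropfun)  # with the second dropfun: only the days that pass the checks
## interpolate the remaining NAs within each day; leading/trailing NAs are left as NA
B$v <- ave(B$v, B$day, FUN = function(z) na.approx(z, na.rm = FALSE))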
I have a data.frame df in "long" format.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows whose df$time value is not present for every level of df$site?
In this case I want to remove df[3, ], because the timestamp 33 is present only for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way:
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, and the final result should therefore be empty (as Ben pointed out in the comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
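For comparison, the same condition can also be checked in base R with ave(), without data.table (a sketch of my own; it keeps only the times seen at every site):
n_sites <- length(unique(df$site))
seen_at <- ave(seq_along(df$time), df$time, FUN = function(i) length(unique(df$site[i])))
df[seen_at == n_sites, ]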
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution I put above. Rather than look for a run length of 1, it deletes rows that do not match certain conditions on the result of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we wanted to keep timestamps present at at least two sites, we could change the condition >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site))-1 (here, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
A=factor(rep("X",10),levels=c('X','Y')),
B=round(runif(10)*10),
C=rep(10,10),
D=c(rep(10,9),1),
E=factor(rep("A",10)),
F=factor(rep(c("I","J"),5)),
G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
which(out==1)
}
And here's the result of the process:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
You may also want to look into the nearZeroVar() function in the caret package.
If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
Don't use table() - very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
out <- lapply(dat, function(x) length(unique(x)))
want <- which(!out > 1)
unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
Which is an order of magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
It will be TRUE for zero-variance columns, NA for columns containing NAs, and FALSE for columns with non-zero variance.
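One way to use that vector to drop the constant columns while keeping the NA-flagged ones, shown on the dat example from the question (a sketch of my own):
var0 <- unlist(lapply(dat, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
dat[, !var0 | is.na(var0), drop = FALSE]   # keeps B, D, F and G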
Use the caret package and the function nearZeroVar():
require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ]
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]
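To actually drop the flagged columns afterwards, something along these lines should work (a sketch; with saveMetrics = TRUE, nearZeroVar() returns one row per variable with logical zeroVar and nzv columns):
dataset[, !NZV$zeroVar, drop = FALSE]               # drop only the strictly zero-variance columns
dataset[, !(NZV$zeroVar | NZV$nzv), drop = FALSE]   # also drop the near-zero-variance columns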
Well, save yourself some coding time:
Rgames> foo
[,1] [,2] [,3]
[1,] 1 1e+00 1
[2,] 1 2e+00 1
[3,] 1 3e+00 1
[4,] 1 4e+00 1
[5,] 1 5e+00 1
[6,] 1 6e+00 2
[7,] 1 7e+00 3
[8,] 1 8e+00 1
[9,] 1 9e+00 1
[10,] 1 1e+01 1
Rgames> sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
To avoid nasty floating-point roundoff, take that output vector, which I'll call bar, and do something like bar[bar < 2*.Machine$double.eps] <- 0; then indexing your data frame with dat[, as.logical(bar)] should do the trick.
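Spelled out as code, the sequence described above looks roughly like this (a sketch; it assumes an all-numeric matrix or data frame such as the foo shown above):
bar <- apply(foo, 2, sd)                   # column standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0    # squash floating-point noise to exactly zero
foo[, as.logical(bar), drop = FALSE]       # keep only the columns with non-zero spread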
How about using factor to count the number of unique elements and looping with sapply:
dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
NAs are excluded by default, but this can be changed with the exclude parameter of factor:
dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on:
library(tidyverse)
df <- df %>%
select(
- {
df %>%
map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
{which(. == 1)} %>%
names()
}
)
I think this could be made shorter but I'm too tired!
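For what it's worth, the same idea can be written a little more compactly with select(where(...)) (a sketch; where() needs dplyr 1.0 or later):
library(tidyverse)
df <- df %>%
  select(where(~ length(table(.x, useNA = "ifany")) > 1))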
I think having zero variance is equivalent to being constant, and one can get by without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:
removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
        , ifelse(all(bkeep)
                 , 'nothing'
                 , paste(names(a_dataframe)[!bkeep], collapse=','))
        , ' removed', '\n')
  }
  return (a_dataframe[, bkeep])
}
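Applied to the dat example from the question, a quick usage sketch:
removeConstantColumns(dat, verbose = TRUE)   # drops A, C, E and G; keeps B, D and F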
Check this custom function. I did not try it on data frames with 100+ variables.
library(tidyverse)   # the function below uses dplyr, purrr and tidyr verbs

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() # see how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>%
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()
  if(length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  } else {
    print("There are no low variance columns with this threshold")
  }
  # how long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}
Let's say I have a matrix x which contains 10 rows and 2 columns. I want to generate a new matrix M that contains each unique pair of rows from x (including each row paired with itself), that is, a new matrix with 55 rows and 4 columns.
E.g.,
x <- matrix(nrow=10, ncol=2, 1:20)
M <- data.frame(matrix(ncol=4, nrow=55))
k <- 1
for (i in 1:nrow(x))
  for (j in i:nrow(x))
  {
    M[k,] <- unlist(cbind(x[i,], x[j,]))
    k <- k + 1
  }
So, x is:
[,1] [,2]
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
And then M has 4 columns, the first two are one row from x and the next 2 are another row from x:
> head(M,10)
X1 X2 X3 X4
1 1 11 1 11
2 1 11 2 12
3 1 11 3 13
4 1 11 4 14
5 1 11 5 15
6 1 11 6 16
7 1 11 7 17
8 1 11 8 18
9 1 11 9 19
10 1 11 10 20
Is there either a faster or simpler (or both) way of doing this in R?
The expand.grid() function is useful for this:
R> GG <- expand.grid(1:10,1:10)
R> GG <- GG[GG[,1]>=GG[,2],] # trim it to your 55 pairs
R> dim(GG)
[1] 55 2
R> head(GG)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
R>
Now you have the n*(n+1)/2 subsets and you can simply index your original matrix.
I'm not quite grokking what you are doing, so I'll just throw out something that may or may not help.
Here's what I think of as the Cartesian product of the two columns:
expand.grid(x[,1],x[,2])
You can also try the "relations" package. Here is the vignette. It should work like this:
relation_table(x %><% x)
Using Dirk's answer:
idx <- expand.grid(1:nrow(x), 1:nrow(x))
idx<-idx[idx[,1] >= idx[,2],]
N <- cbind(x[idx[,2],], x[idx[,1],])
> all(M == N)
[1] TRUE
Thanks everyone!
Inspired by the other answers, here is a function implementing the Cartesian product of two matrices: with two arguments it returns the full Cartesian product, and with a single argument it omits one of each pair of rows:
cartesian_prod <- function(M1, M2) {
  if(missing(M2)) {
    M2 <- M1
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
    ind <- ind[ind[,1] >= ind[,2],]
  } else {
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
  }
  rbind(cbind(M1[ind[,1],], M2[ind[,2],]))
}
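Usage on the x from the question, as a quick sketch: with a single argument it reproduces the 55 row pairs, with two arguments it gives the full 100-row product:
M1 <- cartesian_prod(x)      # 55 x 4: one row per unordered pair of rows (self-pairs included)
M2 <- cartesian_prod(x, x)   # 100 x 4: full Cartesian product of the rows
dim(M1); dim(M2)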