I have a well-balanced panel data set which contains NA observations. I will be using LOCF, and would like to know how many consecutive NA's there are in each panel before carrying observations forward. LOCF is a procedure whereby missing values are "filled in" using the "last observation carried forward". This can make sense in some time-series applications; perhaps we have weather data in 5-minute increments: a good guess at the value of a missing observation might be the observation made 5 minutes earlier.
Obviously, it makes more sense to carry an observation forward one hour within one panel than it does to carry that same observation forward to the next year in the same panel.
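As a quick illustration of what LOCF does, here is a minimal sketch using zoo::na.locf on a toy vector (not my data):

library(zoo)
v <- c(1, NA, NA, 4, NA, 6)
na.locf(v, na.rm = FALSE)               # 1 1 1 4 4 6: every NA filled
na.locf(v, na.rm = FALSE, maxgap = 1)   # 1 NA NA 4 4 6: gaps longer than 1 are left alone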
I am aware that you can set a "maxgap" argument in zoo::na.locf; however, I want to get a better feel for my data first. Please see a simple example:
require(data.table)
set.seed(12345)
### Create a "panel" data set
data <- data.table(id = rep(1:10, each = 10),
                   date = seq(as.POSIXct('2012-01-01'),
                              as.POSIXct('2012-01-10'),
                              by = '1 day'),
                   x = runif(100))
### Randomly assign NA's to our "x" variable
na <- sample(1:100, size = 52)
data[na, x := NA]
### Calculate the max number of consecutive NA's by group...this is what I want:
### ID Consecutive NA's
# 1 1
# 2 3
# 3 3
# 4 3
# 5 4
# 6 5
# ...
# 10 2
### Count the total number of NA's by group...this is as far as I get:
data[is.na(x), .N, by = id]
All solutions are welcome, but data.table solutions are highly preferred, as the data file is large.
This will do it:
data[, max(with(rle(is.na(x)), lengths[values])), by = id]
I just ran rle to find all consecutive NA's and picked the max length.
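For reference, here is what that one-liner operates on, shown on a toy vector: rle(is.na(x)) encodes each group as runs of TRUE (NA) and FALSE (non-NA), and lengths[values] keeps only the NA run lengths. Note that a group with no NA's at all would give max(integer(0)), i.e. -Inf with a warning.

rle(is.na(c(1, NA, NA, 3, NA)))
# Run Length Encoding
#   lengths: int [1:4] 1 2 1 1
#   values : logi [1:4] FALSE TRUE FALSE TRUE
with(rle(is.na(c(1, NA, NA, 3, NA))), lengths[values])
# [1] 2 1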
Here's a rather convoluted answer to the comment question of recovering the date ranges for the above max:
data[, {
  tmp = rle(is.na(x))
  tmp$lengths[!tmp$values] = 0            # modify rle result to ignore non-NA runs
  n = which.max(tmp$lengths)              # index in rle of the longest NA sequence
  tmp = rle(is.na(x))                     # back to the unmodified rle
  start = sum(tmp$lengths[0:(n-1)]) + 1   # start and end row indices of that run
  end = sum(tmp$lengths[1:n])
  list(date[start], date[end], max(tmp$lengths[tmp$values]))
}, by = id]
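The same idea can also be written with a single rle pass; this is just a sketch of the equivalent logic (a group with no NA's would need extra handling):

data[, {
  r = rle(is.na(x))
  len = replace(r$lengths, !r$values, 0L)   # zero out the non-NA runs
  n = which.max(len)                        # position of the longest NA run
  end = cumsum(r$lengths)[n]                # last row of that run
  start = end - r$lengths[n] + 1L           # first row of that run
  list(date[start], date[end], len[n])
}, by = id]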
You can use rle with the modification suggested here (and pasted below) to count NA values.
foo <- data[, rle(x), by=id]
foo[is.na(values), max(lengths), by=id]
# id V1
# 1: 1 1
# 2: 2 3
# 3: 3 3
# 4: 4 3
# 5: 5 4
# 6: 6 5
# 7: 7 3
# 8: 8 5
# 9: 9 2
# 10: 10 2
Amended rle function:
rle <- function(x)
{
    if (!is.vector(x) && !is.list(x))
        stop("'x' must be an atomic vector")
    n <- length(x)
    if (n == 0L)
        return(structure(list(lengths = integer(), values = x),
                         class = "rle"))
    #### BEGIN NEW SECTION PART 1 ####
    naRepFlag <- FALSE
    IS_LOGIC <- FALSE  # initialize here so PART 2 also works when x has no NA's
    if (any(is.na(x))) {
        naRepFlag <- TRUE
        IS_LOGIC <- typeof(x) == "logical"
        if (IS_LOGIC) {
            x <- as.integer(x)
            naMaskVal <- 2
        } else if (typeof(x) == "character") {
            naMaskVal <- paste(sample(c(letters, LETTERS, 0:9), 32, replace = TRUE),
                               collapse = "")
        } else {
            naMaskVal <- max(0, abs(x[!is.infinite(x)]), na.rm = TRUE) + 1
        }
        x[which(is.na(x))] <- naMaskVal
    }
    #### END NEW SECTION PART 1 ####
    y <- x[-1L] != x[-n]
    i <- c(which(y), n)
    #### BEGIN NEW SECTION PART 2 ####
    if (naRepFlag)
        x[which(x == naMaskVal)] <- NA
    if (IS_LOGIC)
        x <- as.logical(x)
    #### END NEW SECTION PART 2 ####
    structure(list(lengths = diff(c(0L, i)), values = x[i]),
              class = "rle")
}
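A quick sanity check of the amended function (which masks base::rle once defined):

rle(c(1, NA, NA, 2, NA))
# Run Length Encoding
#   lengths: int [1:4] 1 2 1 1
#   values : num [1:4] 1 NA 2 NA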
I have a dataframe as shown below.
Data for replication:
x <- data.frame(cluster = c(1, 2, 3, 4, 5),
                groups = c('20000127 20000128',
                           '20000127 20000128 20000134',
                           '20000129 20000130 20000131 20000132',
                           '20000133 20000134 20000135 20000136',
                           '20000128 20000133 20000134 20000135 20000136'),
                chr = c(17, 26, 35, 35, 44), stringsAsFactors = FALSE)
I'm trying to come up with a way to analyze the 'groups' column for any rows with overlapping elements and remove the row with the higher character count.
For example, element 20000128 is present in rows 1, 2 & 5. Since row 1 has the lowest character count, I want to remove rows 2 & 5. I appreciate any help!!
Ideally the end result should only have clusters 1, 3 and 4: each element appears only once, kept in the cluster with the lowest character count.
Exploring this problem has been fun. I've learned that this is a variation of the set cover problem and is NP-complete.
It would help to understand the scope of your problem. If we are talking 10s of clusters, we could use brute force. If it's thousands of clusters, we are going to have to use an approximation.
I have learned there is an R implementation of the greedy algorithm in the RcppGreedySetCover package.
First we need to convert to two column long form. We can use dplyr.
library(tidyverse)
longx <- x %>%
  mutate(splitgroups = strsplit(as.character(groups), " ")) %>%
  unnest(splitgroups) %>%
  select(cluster, splitgroups)
Then we can use greedySetCover to approximate the smallest set that covers all elements.
library(RcppGreedySetCover)
greedySetCover(longx)
#100% covered by 3 sets.
# cluster splitgroups
# 1: 2 20000127
# 2: 3 20000129
# 3: 3 20000130
# 4: 3 20000131
# 5: 3 20000132
# 6: 5 20000128
# 7: 5 20000133
# 8: 5 20000134
# 9: 5 20000135
#10: 5 20000136
This suggests the set of 2, 3, and 5 covers everything. But this does not fully answer your question because, as you know, there is a combination of clusters with a lower total character count.
However, what we have learned is that the minimum cover uses 3 clusters. Now we can test all combinations of 3 clusters.
set.size <- length(unique(greedySetCover(longx)$cluster))
binary.matrix <- table(longx)
combinations <- combn(unique(x$cluster), set.size)
total.lengths <- apply(combinations, 2, function(x) {
  if (sum(as.logical(colSums(binary.matrix[x, ]))) == ncol(binary.matrix)) {
    sum(rowSums(binary.matrix[x, ]))
  } else {
    NA
  }
})
min.length <- min(total.lengths, na.rm = TRUE)
min.set <- combinations[, which(total.lengths == min.length)]
x[min.set, ]
# cluster groups chr
#1 1 20000127 20000128 17
#3 3 20000129 20000130 20000131 20000132 35
#4 4 20000133 20000134 20000135 20000136 35
I had to use a while loop, maybe there's a less loopy solution...
foo <- function(x) {
  i <- 1
  while (i < nrow(x)) {
    grps <- strsplit(x$groups, " ")
    keep <- unlist(lapply(grps, function(x)
      identical(x, grps[[i]]) |
        !any((length(x) > length(grps[[i]]) &
                duplicated(c(grps[[i]], x))))))
    x <- x[keep, ]
    i <- i + 1
  }
  x
}
foo(x)
cluster groups chr
1 1 20000127 20000128 17
3 3 20000129 20000130 20000131 20000132 35
4 4 20000133 20000134 20000135 20000136 35
Explanation.
# I created a function to keep things compact and allow it to be used for other datasets.
# The `x` is the argument, assumed to be your data frame.
# 1: foo <- function(x) {
# Start the ball rolling with a counter to use in the while loop.
# 2: i <- 1
# This starts the while loop and will continue until "i" reaches the end of the data.
# But note later that the data may change if there are rows that meet your condition.
# 3: while(i < nrow(x)) {
# Split the groups variable at the " " and store in "grps"
# 4: grps <- strsplit(x$groups, " ")
# This next line does the work.
# It creates a vector of logical indices which are used to remove rows of "x"
# I split this into many lines to explain better.
# 5: keep <- unlist(lapply(grps, function(x)  # apply a function to each element of "grps"
#      identical(x, grps[[i]]) |              # TRUE for the group currently being checked
#      !any(                                  # negate the next conditions; they flag rows to remove
#      (length(x) > length(grps[[i]]) &       # TRUE (negated = FALSE) if this group is longer than the current one
#      duplicated(c(grps[[i]], x))))))        # and shares elements with it (duplicated returns TRUE)
# Update "x" by keeping only the rows that meet the criteria defined in step 5.
# 6: x <- x[keep,]
# Increase i
# 7: i <- i+1
# 8: } # This ends the while loop
# 9: x # Return the result
} # End of function
I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1, 4, 5, 6, 3, 3, 2),
                 Code = c(1, 1, 2, 2, 2, 1, 1))
I want a result
df_result <- data.frame(Values = c(1, 4, 5, 6, 3, 3, 2),
                        Code = c(1, 0, 2, 2, 2, 1, 0))
The data frame is time-ordered, so I need to keep the same order after replacing the values. I guess that nth() or duplicated() could be useful here, but I'm not sure how to use them. What I'm missing is a function that counts the number of consecutive occurrences of a given value; once I have it, I could use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by @Gregor, I wrote the following function which solves the problem:
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  y[, code_rleid := rleid(Code)]
  y[, seq := seq_along(Code), by = code_rleid]
  y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame()
}
To get the solution, simply run replace_nth(df, 2, 1)
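On the example data this reproduces df_result:

replace_nth(df, 2, 1)
#   Values Code
# 1      1    1
# 2      4    0
# 3      5    2
# 4      6    2
# 5      3    2
# 6      3    1
# 7      2    0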
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
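For example, a combined version might look like this (a sketch of the same logic, with the helper column dropped at the end):

setDT(df)
df[, seq := seq_len(.N), by = rleid(Code)
   ][seq == 2 & Code == 1, Code := 0
   ][, seq := NULL]
df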
I am using the data.table package to work through the House Prices data set from Kaggle.
When I retrieve the matches from the data table syntax, the row numbers are not returned with the data.
combined_df[is.na(GarageArea), garage_num_vars, with = FALSE]
GarageYrBlt GarageCars GarageArea
1: 1923 NA NA
How can I get the actual row number with that observation? I have seen many solutions using .I and using which = TRUE but how would I add the which = TRUE argument to my current syntax?
In addition to adding a column of row number as suggested in the comment, you can also use which argument in this way:
DT <- data.table(val = c(1, 2, 3, NA, 4))
# > DT
# val
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 4
x <- DT[is.na(val), which = TRUE]
cbind(rownum = x, DT[x])
# rownum val
# 1: 4 NA
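Applied to the syntax in the question, that might look like the following (column names taken from the question; a sketch, since I don't have the actual data):

rows <- combined_df[is.na(GarageArea), which = TRUE]
cbind(rownum = rows, combined_df[rows, garage_num_vars, with = FALSE])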
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages and functions like dplyr, sample(), order(), and sort(), and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to step 2.
# some vector of valences
val <- rep(1:3, each = 50)

pseudoRandomize <- function(x, n){
  # take an initial sample (of the function argument, not the global 'val')
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while (bad.seq) {
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for (i in cumsum(l1)[ind]) {
      # take it out of the original sample
      tmp <- out[-i]
      # pick a new position at random
      pos <- sample(2:(l0 - 2), 1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos - 1)], out[i], tmp[pos:(l0 - 1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the Valence Categories appropriately. I've called the valence categories treat.
# Generate example data
s1 <- data.frame(id = 1:10, treat = NA)
# Set the first two rows
s1[1, "treat"] <- sample(1:3, 1)
s1[2, "treat"] <- sample(1:3, 1)
# Loop through the remainder of the rows
for (i in 3:nrow(s1)) {
  s1[i, "treat"] <- sample(1:3, 1)
  # Check if the treat value is equal to the previous two values.
  if (s1[i, "treat"] == s1[i - 1, "treat"] & s1[i - 1, "treat"] == s1[i - 2, "treat"]) {
    # If so, draw one of the values not equal to that value
    a <- 1:3
    remove <- s1[i, "treat"]
    a <- a[!a == remove]
    s1[i, "treat"] <- sample(a, 1)
  }
}
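A quick way to verify the constraint afterwards (assuming s1 from the loop above):

max(rle(s1$treat)$lengths)  # should be at most 2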
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
I have a table of 55,000 rows, which looks like the left table in the picture (the code to generate sample data is below). Now I need to convert every row of this table to 6 rows, each containing one letter of the "hexamer" (the right table in the picture), with some calculations:
# input for the function is one row of the source table; output is 6 rows
splithexamer <- function(x) {
  dir <- x$dir      # strand direction: +1 or -1
  pos <- x$pos      # hexamer position
  out <- x[0, ]     # template of output
  hexamer <- as.character(x$hexamer)
  for (i in 1:nchar(hexamer)) {
    letter <- substr(hexamer, i, i)
    if (dir == 1) {
      newpos <- pos + i - 1
    } else {
      newpos <- pos + 6 - i
    }
    y <- x
    y$pos <- newpos
    y$letter <- letter
    out <- rbind(out, y)
  }
  out
}
# Sample data generation:
set.seed(123)
size <- 55000
letters <- c("G","A","T","C")
df <- data.frame(
  HSid = paste0("Hs.", 1:size),
  hexamer = replicate(n = size, paste0(sample(letters, 6, replace = TRUE), collapse = "")),
  chr = sample(c(1:23, "X", "Y"), size, replace = TRUE),
  pos = sample(1:99999, size, replace = TRUE),
  dir = sample(c(1, -1), size, replace = TRUE)
)
Now I would like to get some advice on the most efficient way to apply my function to every row. So far I tried the following:
# Variant 1: for() with rbind
tmp <- data.frame()
for (i in 1:nrow(df)) {
  tmp <- rbind(tmp, splithexamer(df[i, ]))
}
# Variant 2: for() with direct writing to file
for (i in 1:nrow(df)) {
  write.table(splithexamer(df[i, ]), file = "d:/test.txt",
              append = TRUE, quote = FALSE, col.names = FALSE)
}
# Variant 3: ddply (requires library(plyr))
tmp <- ddply(df, .(HSid), .fun = splithexamer)
# Variant 4: apply - I don't know the correct syntax
tmp <- apply(X = df, 1, FUN = splithexamer)  # this causes an error
All of the above are extremely slow; I am wondering if there's a better way to solve this task...
Solution using data.table:
df$hexamer <- as.character(df$hexamer)
dt <- data.table(df)
dt[, id := seq_len(nrow(df))]
setkey(dt, "id")
dt.out <- dt[, {
  mod.pos <- pos:(pos + 5)
  if (dir == -1) mod.pos <- rev(mod.pos)
  list(split = unlist(strsplit(hexamer, "")),
       mod.pos = mod.pos)
}, by = id][dt][, id := NULL]
dt.out
# split mod.pos HSid hexamer chr pos dir
# 1: G 95982 Hs.1 GCTCCA 5 95982 1
# 2: C 95983 Hs.1 GCTCCA 5 95982 1
# 3: T 95984 Hs.1 GCTCCA 5 95982 1
# 4: C 95985 Hs.1 GCTCCA 5 95982 1
# 5: C 95986 Hs.1 GCTCCA 5 95982 1
# ---
# 329996: A 59437 Hs.55000 AATCTG 7 59436 1
# 329997: T 59438 Hs.55000 AATCTG 7 59436 1
# 329998: C 59439 Hs.55000 AATCTG 7 59436 1
# 329999: T 59440 Hs.55000 AATCTG 7 59436 1
# 330000: G 59441 Hs.55000 AATCTG 7 59436 1
Explanation of the main line:
The by=id will group by id and since they are all unique, it'll group by every line, one at a time.
Then, the code within {} sets mod.pos to pos:(pos+5) and reverses it if dir == -1.
Now, the list argument: It creates the column split by creating 6 nucleotides from your hexamer using strsplit and also sets mod.pos which we've already calculated in the step before.
This will result in a data.table with columns id, split and mod.pos.
The next part [dt] is a typical usage of data.table's X[Y] syntax which performs a join on the data.tables based on the key column ( = id, here). Since there are 6 rows for every id you get all the other columns in dt duplicated during this join.
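As a toy illustration of the X[Y] join, separate from the hexamer data:

X <- data.table(id = c(1, 1, 2), v = 1:3, key = "id")
Y <- data.table(id = 1:2, w = c("a", "b"), key = "id")
X[Y]
#    id v w
# 1:  1 1 a
# 2:  1 2 a
# 3:  2 3 b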
I'd suggest you take a look at data.table FAQ first and then its documentation (intro). These links can be obtained by installing the package and loading it and then typing ?data.table. I also suggest you work through the many examples in there one by one with a test data.table to understand practically the features of data.table.
Hope this helps.