I'm constructing a data.table from two (or more) input vectors with different lengths:
x <- c(1,2,3,4)
y <- c(8,9)
dt <- data.table(x = x, y = y)
And need the shorter vector(s) to be filled with NA rather than recycling their values, resulting in a data.table like this:
x y
1: 1 8
2: 2 9
3: 3 NA
4: 4 NA
Is there a way to achieve this without explicitly filling the shorter vector(s) with NA before passing them to the data.table() constructor?
Thanks!
One can use out-of-range indices, which return NA for the missing positions:
library("data.table")
x <- c(1,2,3,4)
y <- c(8,9)
n <- max(length(x), length(y))
dt <- data.table(x = x[1:n], y = y[1:n])
# > dt
# x y
# 1: 1 8
# 2: 2 9
# 3: 3 NA
# 4: 4 NA
Or you can extend the shorter vector in place (as @Roland recommended in the comments):
length(y) <- length(x) <- max(length(x), length(y))
dt <- data.table(x, y)
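To see why the out-of-range approach pads with NA, here is a minimal base R sketch. It is only an illustration, not part of the answer above. As an assumption about inputs that might be empty, seq_len(n) is used in place of 1:n, because 1:0 counts down to c(1, 0) instead of being empty.

```r
y <- c(8, 9)
n <- 4

# Indexing past the end of a vector yields NA for the missing positions.
padded <- y[seq_len(n)]
print(padded)

# seq_len(0) is empty, whereas 1:0 is c(1, 0) and would still select y[1].
empty <- y[seq_len(0)]
print(length(empty))
```

The same idea carries over to the data.table() call: index every input with seq_len(n), where n is the longest length.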
An option is cbind.fill from rowr
library(rowr)
setNames(cbind.fill(x, y, fill = NA), c("x", "y"))
Or place the vectors in a list and pad each one with NA up to the length of the longest list element:
library(data.table)
lst <- list(x = x, y = y)
as.data.table(lapply(lst, `length<-`, max(lengths(lst))))
# x y
#1: 1 8
#2: 2 9
#3: 3 NA
#4: 4 NA
The "out of range indices" answer provided by jogo can be extended cleanly to in-place assignment using .N:
x <- c(1,2,3,4)
y <- c(8,9)
n <- max(length(x), length(y))
dt <- data.table(x = x[1:n], y = y[1:n])
z <- c(6,7)
dt[, z := z[1:.N]]
# x y z
# 1: 1 8 6
# 2: 2 9 7
# 3: 3 NA NA
# 4: 4 NA NA
Related
I have a new question related to my earlier topic,
"deleting outlier in r with account of nominal var."
In the new case, the variables x and x1 have different lengths:
x <- c(-10, 1:6, 50)
x1<- c(-20, 1:5, 60)
z<- c(1,2,3,4,5,6,7,8)
bx <- boxplot(x)
bx$out
bx1 <- boxplot(x1)
bx1$out
x<- x[!(x %in% bx$out)]
x1 <- x1[!(x1 %in% bx1$out)]
x_to_remove<-which(x %in% bx$out)
x <- x[!(x %in% bx$out)]
x1_to_remove<-which(x1 %in% bx1$out)
x1 <- x1[!(x1 %in% bx1$out)]
z<-z[-unique(c(x_to_remove,x1_to_remove))]
z
data.frame(cbind(x,x1,z))
Then I get this warning:
Warning message:
In cbind(x, x1, z) :
number of rows of result is not a multiple of vector length (arg 2)
So in the new data frame, the observations of z do not correspond to x and x1.
How can I solve this problem?
This solution did not help me:
Rsolnp: In cbind(temp, funv) : number of rows of result is not a multiple of vector length (arg 1)
Or am I just doing something wrong?
Edit
x_to_remove<-which(x %in% bx$out)
x <- x[!(x %in% bx$out)]
x1_to_remove<-which(x1 %in% bx1$out)
x1 <- x1[!(x1 %in% bx1$out)]
z<-z[-unique(c(x_to_remove,x1_to_remove))]
z
d=data.frame(cbind(x,x1,z))
d
The result is wrong:
Warning message:
In cbind(x, x1, z) :
number of rows of result is not a multiple of vector length (arg 2)
d
x x1 z
1 1 1 2
2 2 2 3
3 3 3 4
4 4 4 5
5 5 5 6
6 6 1 2
How can I get this output for these 3 columns?
NA NA NA
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
NA NA NA
NA NA NA
The sixth row of d is superfluous.
The different lengths of the original x, x1 and z vectors are the first problem: how can you tell which z values relate to which x and x1 values?
x <- c(-10, 1:6, 50)
x1<- c(-20, 1:5, 60)
z<- c(1,2,3,4,5,6,7,8)
length(x)
[1] 8
length(x1)
[1] 7
length(z)
[1] 8
Another problem is here:
x<- x[!(x %in% bx$out)] #remove this
x1 <- x1[!(x1 %in% bx1$out)] #remove this
x_to_remove<-which(x %in% bx$out)
x <- x[!(x %in% bx$out)]
x1_to_remove<-which(x1 %in% bx1$out)
x1 <- x1[!(x1 %in% bx1$out)]
You clean x and x1 before calculating x_to_remove and x1_to_remove, so which() no longer finds any outliers to report.
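A small sketch of that ordering problem, using only the x vector from the question. As an assumption for illustration, boxplot.stats() stands in for boxplot() here because it returns the same outliers without drawing a plot.

```r
x <- c(-10, 1:6, 50)
out <- boxplot.stats(x)$out        # the values flagged as outliers

# Correct order: record the positions while x is still intact.
to_remove <- which(x %in% out)

# Wrong order: once the outliers are gone, there is nothing left to find.
cleaned <- x[!(x %in% out)]
to_remove_late <- which(cleaned %in% out)

print(to_remove)
print(to_remove_late)
```

With the empty index, z[-unique(...)] no longer drops the matching rows, so z goes out of step with x and x1.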
EDIT:
To achieve your desired output, try this code (the added lines are marked in comments):
x <- c(-10, 1:6, 50)
x1<- c(-20, 1:5, 60)
z<- c(1,2,3,4,5,6,7,8)
length_max<-min(length(x),length(x1),length(z)) #Added: shortest original length (despite the name), taken before outlier detection
bx <- boxplot(x)
bx1 <- boxplot(x1)
x_to_remove<-which(x %in% bx$out)
x <- x[!(x %in% bx$out)]
x1_to_remove<-which(x1 %in% bx1$out)
x1 <- x1[!(x1 %in% bx1$out)]
z<-z[-unique(c(x_to_remove,x1_to_remove))]
length_min<-min(length(x),length(x1),length(z)) #Minimum length after outlier removal
d=data.frame(cbind(x[1:length_min],x1[1:length_min],z[1:length_min])) #Bind columns
colnames(d)<-c("x","x1","z")
d_NA<-as.data.frame(matrix(rep(NA,(length_max-length_min)*3),nrow=(length_max-length_min))) #Create NA rows
colnames(d_NA)<-c("x","x1","z")
d<-rbind(d,d_NA) #Your desired output
d
x x1 z
1 1 1 2
2 2 2 3
3 3 3 4
4 4 4 5
5 5 5 6
6 NA NA NA
7 NA NA NA
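An alternative sketch that keeps the rows aligned by blanking outlier rows instead of deleting them. It rests on a loud assumption that the question does not confirm: the three vectors line up by position from the top, so the shorter x1 is simply missing its last observation. boxplot.stats() is used in place of boxplot() to avoid plotting.

```r
x  <- c(-10, 1:6, 50)
x1 <- c(-20, 1:5, 60)
z  <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Find the outliers on the original vectors, before any padding.
out_x  <- boxplot.stats(x)$out
out_x1 <- boxplot.stats(x1)$out

# Assumption: x1 aligns with x and z from the top, so pad its end with NA.
length(x1) <- max(length(x), length(x1), length(z))
d <- data.frame(x = x, x1 = x1, z = z)

# Blank every row that holds an outlier or a padded value.
bad <- d$x %in% out_x | d$x1 %in% out_x1 | !complete.cases(d)
d[bad, ] <- NA
print(d)
```

Under that alignment assumption, rows 1, 7 and 8 become NA and rows 2 to 6 keep their x, x1 and z values together.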
MWE.
library(data.table)
x <- data.table(
g=rep(c("x", "y"), each=4), # grouping variable
time=c(1,3,5,7,2,4,6,8), # time index
val=1:8) # value
setkeyv(x, c("g", "time"))
cumsd <- function(x) sapply(sapply(seq_along(x)-1, head, x=x), sd)
x[, cumsd(val), by=g]
## Output
# g V1
# 1: x NA
# 2: x NA
# 3: x 0.7071068
# 4: x 1.0000000
# 5: y NA
# 6: y NA
# 7: y 0.7071068
# 8: y 1.0000000
I want to compute the standard deviation (or more generally, a mathematical function) of all prior values (not including the current value), per observation, by group, in R.
The cumsd ("cumulative sd") function above does what I need. For e.g. row 3, V1 = sd(c(1, 2)), corresponding to the values in rows 1 and 2. Row 7, V1 = sd(c(5, 6)), corresponding to the values in rows 5 and 6.
However, cumsd is very slow (too slow to use in my real-world application). Any ideas on how to do the computation more efficiently?
Edit
For sd we can use runSD from library TTR as discussed here: Calculating cumulative standard deviation by group using R
Gabor's answer below addresses the more general case of any arbitrary mathematical function on prior values, though that generality potentially comes at some cost in efficiency.
We can specify the window widths as a vector and then omit the last value in the window for each application of sd.
library(zoo)
x[, sd:=rollapplyr(val, seq_along(val), function(x) sd(head(x, -1)), fill = NA), by = g]
giving:
> x
g time val sd
1: x 1 1 NA
2: x 3 2 NA
3: x 5 3 0.7071068
4: x 7 4 1.0000000
5: y 2 5 NA
6: y 4 6 NA
7: y 6 7 0.7071068
8: y 8 8 1.0000000
Alternately we can specify the offsets in a list. Negative offsets, used here, refer to prior values so -1 is the immediate prior value, -2 is the value before that and so on.
negseq <- function(x) -seq_len(x)
x[, sd:=rollapplyr(val, lapply(seq_along(val)-1, negseq), sd, fill = NA), by = g]
giving:
> x
g time val sd
1: x 1 1 NA
2: x 3 2 NA
3: x 5 3 0.7071068
4: x 7 4 1.0000000
5: y 2 5 NA
6: y 4 6 NA
7: y 6 7 0.7071068
8: y 8 8 1.0000000
We can use TTR::runSD with shift:
library(TTR);
setDT(x)[, cum_sd := shift(runSD(val, n = 2, cumulative = TRUE)) , g]
# g time val cum_sd
#1: x 1 1 NA
#2: x 3 2 NA
#3: x 5 3 0.7071068
#4: x 7 4 1.0000000
#5: y 2 5 NA
#6: y 4 6 NA
#7: y 6 7 0.7071068
#8: y 8 8 1.0000000
It turned out that none of the options was fast enough for my application (millions of groups and observations). But your comments inspired me to write a small function in Rcpp that did the trick. Thanks everyone!
library(data.table)
library(Rcpp)
x <- data.table(
g=rep(c("x", "y"), each=4), # grouping variable
time=c(1,3,5,7,2,4,6,8), # time index
val=1:8) # value
setkeyv(x, c("g", "time"))
cumsd <- function(x) sapply(sapply(seq_along(x)-1, head, x=x), sd)
x[, v1:=cumsd(val), by=g]
cppFunction('
Rcpp::NumericVector rcpp_cumsd(Rcpp::NumericVector inputVector){
int len = inputVector.size();
Rcpp::NumericVector outputVector(len, NumericVector::get_na());
if (len < 3) return (outputVector);
for (int i = 2; i < len; ++i){
outputVector(i) = Rcpp::sd(inputVector[Rcpp::seq(0, i - 1)]);
}
return(outputVector);
};
')
x[, v2:= rcpp_cumsd(val), by=g]
all.equal(x$v1, x$v2)
## TRUE
The speed difference seems to depend on the number of groups vs. the number of observations per group in the data.table. I won't post benchmarks but in my case, the Rcpp version was much, much faster.
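For the sd case specifically, a vectorised base R sketch is also possible. This is not the Rcpp function above but a swapped-in technique: running sums via cumsum() and the textbook variance identity. The name cumsd_prior is hypothetical, and the identity can lose precision when the values are large relative to their spread, so treat it as a starting point only.

```r
# Hypothetical helper: sd of all values strictly before each position.
cumsd_prior <- function(v) {
  if (length(v) == 0) return(numeric(0))
  v  <- as.numeric(v)              # avoid integer overflow in the running sums
  k  <- seq_along(v)               # count of values up to and including position i
  s1 <- cumsum(v)                  # running sum
  s2 <- cumsum(v^2)                # running sum of squares
  var_incl <- (s2 - s1^2 / k) / (k - 1)   # sample variance of v[1:i]
  var_incl[k < 2] <- NA_real_      # sd of a single value is undefined
  sd_incl <- sqrt(pmax(var_incl, 0))      # clamp tiny negative rounding errors
  c(NA_real_, head(sd_incl, -1))   # shift by one so position i sees only prior values
}

res <- cumsd_prior(c(5, 6, 7, 8))
print(res)
```

It can be applied per group in the same way as the functions above, for example x[, v3 := cumsd_prior(val), by = g].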
I would like to be able to apply a function to all combinations of a set of input arguments. I have a working solution (below) but would be surprised if there's not a better / more generic way to do this using, e.g. plyr, but so far have not found anything. Is there a better solution?
# Apply function FUN to all combinations of arguments and append results to
# data frame of arguments
cmapply <- function(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
USE.NAMES = TRUE)
{
l <- expand.grid(..., stringsAsFactors=FALSE)
r <- do.call(mapply, c(
list(FUN=FUN, MoreArgs = MoreArgs, SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES),
l
))
if (is.matrix(r)) r <- t(r)
cbind(l, r)
}
examples:
# calculate sum of combinations of 1:3, 1:3 and 1:2
cmapply(arg1=1:3, arg2=1:3, 1:2, FUN=sum)
# paste input arguments
cmapply(arg1=1:3, arg2=c("a", "b"), c("x", "y", "z"), FUN=paste)
# function returns a vector
cmapply(a=1:3, b=2, FUN=function(a, b) c("x"=b-a, "y"=a+b))
This function isn't necessarily any better, just slightly different:
rcapply <- function(FUN, ...) {
## Cross-join all vectors
DT <- CJ(...)
## Get the original names
nl <- names(list(...))
## Make names, if all are missing
if(length(nl)==0L) nl <- make.names(1:length(list(...)))
## Fill in any missing names
nl[!nzchar(nl)] <- paste0("arg", 1:length(nl))[!nzchar(nl)]
setnames(DT, nl)
## Call the function using all columns of every row
DT2 <- DT[,
as.data.table(as.list(do.call(FUN, .SD))), ## Use all columns...
by=.(rn=1:nrow(DT))][ ## ...by every row
, rn:=NULL] ## Remove the temp row number
## Add res to names of unnamed result columns
setnames(DT2, gsub("(V)([0-9]+)", "res\\2", names(DT2)))
return(data.table(DT, DT2))
}
head(rcapply(arg1=1:3, arg2=1:3, 1:2, FUN=sum))
## arg1 arg2 arg3 res1
## 1: 1 1 1 3
## 2: 1 1 2 4
## 3: 1 2 1 4
## 4: 1 2 2 5
## 5: 1 3 1 5
## 6: 1 3 2 6
head(rcapply(arg1=1:3, arg2=c("a", "b"), c("x", "y", "z"), FUN=paste))
## arg1 arg2 arg3 res1
## 1: 1 a x 1 a x
## 2: 1 a y 1 a y
## 3: 1 a z 1 a z
## 4: 1 b x 1 b x
## 5: 1 b y 1 b y
## 6: 1 b z 1 b z
head(rcapply(a=1:3, b=2, FUN=function(a, b) c("x"=b-a, "y"=a+b)))
## a b x y
## 1: 1 2 1 3
## 2: 2 2 0 4
## 3: 3 2 -1 5
A slight simplification of your original code:
cmapply <- function(FUN, ..., MoreArgs = NULL)
{
l <- expand.grid(..., stringsAsFactors=FALSE)
r <- .mapply(FUN=FUN, dots=l, MoreArgs = MoreArgs)
r <- simplify2array(r, higher = FALSE)
if (is.matrix(r)) r <- t(r)
return(cbind(l, r))
}
This does not require a do.call.
It drops the SIMPLIFY and USE.NAMES arguments, but the way you are using them makes them unusable anyway: with SIMPLIFY = FALSE the cbind() will fail, and USE.NAMES = TRUE has no effect because the names are lost after the cbind() anyway.
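To make the intermediate steps of the simplified version visible, here is a short walk-through in base R. It only unrolls the function body above on one of the question's examples; the variable names grid, f, res, res_mat and out are chosen here for illustration.

```r
# Every combination of the arguments, one row per combination.
grid <- expand.grid(a = 1:3, b = 2, stringsAsFactors = FALSE)

f <- function(a, b) c(x = b - a, y = a + b)

# .mapply takes the columns of grid as parallel arguments: a list of 3 results.
res <- .mapply(f, dots = grid, MoreArgs = NULL)

# Each result is a length-2 vector, so simplify2array gives a 2 x 3 matrix;
# transposing puts one combination per row.
res_mat <- t(simplify2array(res, higher = FALSE))

out <- cbind(grid, res_mat)
print(out)
```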
I am trying to construct a matrix in a for loop. I am declaring an empty matrix called matrix1 in my code and I am trying to fill it row-wise with the simulated x, y and z variables:
simulation <- function(ss){
nsim <- 100
matrix1 <- matrix(0, ncol=3, nrow=nsim)
colnames(matrix1) <- c("x", "y", "z")
for(i in 1:nsim) {
set.seed(i)
x <- relevel(as.factor(sample(1:4,ss, replace=TRUE)), ref="4")
y <- relevel(as.factor(sample(1:3,ss, replace=TRUE)), ref="3")
z <- relevel(as.factor(sample(1:2,ss, replace=TRUE)), ref="2")
matrix1[i, ] <- cbind(x, y, z)
}
return(matrix1)
}
Now, when I run this, I am getting an error:
Error in matrix1[i, ] <- cbind(x, y, z) :
number of items to replace is not a multiple of replacement length
I don't see why that is happening since matrix1 has 3 columns and I am filling it recursively with the 3 variables x, y and z.
To fill it row by row, you can do this (it is not efficient):
x = data.frame(name = character(), y = numeric())
for(i in 1:10){
x = rbind(x, data.frame(name = letters[i], y = runif(1)))
}
# name y
1 a 0.09931082
2 b 0.85088120
3 c 0.39535348
4 d 0.08633770
5 e 0.08329996
6 f 0.46080032
7 g 0.39309986
8 h 0.01993358
9 i 0.96079532
10 j 0.19371701
If you have a lot of rows, this is very inefficient. In this case, you could try to pre-allocate the rows:
n = 10
x = data.frame(name = rep(NA_character_, n), y = rep(NA_real_, n))
name y
1 <NA> NA
2 <NA> NA
3 <NA> NA
4 <NA> NA
5 <NA> NA
6 <NA> NA
7 <NA> NA
8 <NA> NA
9 <NA> NA
10 <NA> NA
Then loop over the rows and columns to set the values.
for (i in 1:n){
for (j in 1:2){
# x[i, j] = ...
}
}
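As for the error in the question itself: cbind(x, y, z) has ss rows and 3 columns, so it cannot fit into the 3 cells of matrix1[i, ] unless ss is 1. A minimal sketch of one possible fix follows, assuming the goal is to keep each simulated sample whole. It stores one matrix per iteration in a pre-allocated list; the factor releveling from the question is left out to keep the example short, and nsim and ss are set to small illustrative values.

```r
nsim <- 3                            # kept small for illustration
ss <- 5
results <- vector("list", nsim)      # pre-allocated, one slot per simulation

for (i in seq_len(nsim)) {
  set.seed(i)
  x <- sample(1:4, ss, replace = TRUE)
  y <- sample(1:3, ss, replace = TRUE)
  z <- sample(1:2, ss, replace = TRUE)
  results[[i]] <- cbind(x, y, z)     # an ss-by-3 matrix, too big for one row
}

dims <- dim(results[[1]])
print(dims)
```

If a single row per simulation is what is needed, each iteration has to reduce x, y and z to one value apiece before assigning to matrix1[i, ].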
Please forgive me if I missed an answer to such a simple question.
I want to use cbind() to bind two columns. One of them is a single entry shorter in length.
Can I have R supply an NA for the missing value?
The documentation discusses a deparse.level argument but this doesn't seem to be my solution.
Further, if I may be so bold, would there also be a quick way to prepend the shorter column with NA's?
Try this:
x <- c(1:5)
y <- c(4:1)
length(y) = length(x)
cbind(x,y)
x y
[1,] 1 4
[2,] 2 3
[3,] 3 2
[4,] 4 1
[5,] 5 NA
or this:
x <- c(4:1)
y <- c(1:5)
length(x) = length(y)
cbind(x,y)
x y
[1,] 4 1
[2,] 3 2
[3,] 2 3
[4,] 1 4
[5,] NA 5
I think this will do something similar to what DWin suggested and work regardless of which vector is shorter:
x <- c(4:1)
y <- c(1:5)
lengths <- max(c(length(x), length(y)))
length(x) <- lengths
length(y) <- lengths
cbind(x,y)
The code above can also be condensed to:
x <- c(4:1)
y <- c(1:5)
length(x) <- length(y) <- max(c(length(x), length(y)))
cbind(x,y)
EDIT
Here is what I came up with to address the question:
"Further, if I may be so bold, would there also be a quick way to prepend the shorter column with NA's?"
inserted into the original post by Matt O'Brien.
x <- c(4:1)
y <- c(1:5)
first <- 1 # 1 means add NA to top of shorter vector
# 0 means add NA to bottom of shorter vector
if(length(x)<length(y)) {
if(first==1) x = c(rep(NA, length(y)-length(x)),x);y=y
if(first==0) x = c(x,rep(NA, length(y)-length(x)));y=y
}
if(length(y)<length(x)) {
if(first==1) y = c(rep(NA, length(x)-length(y)),y);x=x
if(first==0) y = c(y,rep(NA, length(x)-length(y)));x=x
}
cbind(x,y)
# x y
# [1,] NA 1
# [2,] 4 2
# [3,] 3 3
# [4,] 2 4
# [5,] 1 5
Here is a function:
x <- c(4:1)
y <- c(1:5)
first <- 1 # 1 means add NA to top of shorter vector
# 0 means add NA to bottom of shorter vector
my.cbind <- function(x,y,first) {
if(length(x)<length(y)) {
if(first==1) x = c(rep(NA, length(y)-length(x)),x);y=y
if(first==0) x = c(x,rep(NA, length(y)-length(x)));y=y
}
if(length(y)<length(x)) {
if(first==1) y = c(rep(NA, length(x)-length(y)),y);x=x
if(first==0) y = c(y,rep(NA, length(x)-length(y)));x=x
}
return(cbind(x,y))
}
my.cbind(x,y,first)
my.cbind(c(1:5),c(4:1),1)
my.cbind(c(1:5),c(4:1),0)
my.cbind(c(1:4),c(5:1),1)
my.cbind(c(1:4),c(5:1),0)
my.cbind(c(1:5),c(5:1),1)
my.cbind(c(1:5),c(5:1),0)
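The padding logic can also be factored into a one-vector helper, which then works for any number of columns. This is a sketch rather than a replacement for my.cbind; pad_na and its prepend argument are hypothetical names introduced here.

```r
# Hypothetical helper: pad a single vector with NA up to length n.
pad_na <- function(v, n, prepend = FALSE) {
  fill <- rep(NA, max(n - length(v), 0))   # nothing to add if v is long enough
  if (prepend) c(fill, v) else c(v, fill)
}

x <- 4:1
y <- 1:5
n <- max(length(x), length(y))

m <- cbind(x = pad_na(x, n, prepend = TRUE),
           y = pad_na(y, n, prepend = TRUE))
print(m)
```

Setting prepend = FALSE puts the NA values at the bottom instead, matching first == 0 above.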
This version allows you to cbind two vectors of different mode:
x <- c(4:1)
y <- letters[1:5]
first <- 1 # 1 means add NA to top of shorter vector
# 0 means add NA to bottom of shorter vector
my.cbind <- function(x,y,first) {
if(length(x)<length(y)) {
if(first==1) x = c(rep(NA, length(y)-length(x)),x);y=y
if(first==0) x = c(x,rep(NA, length(y)-length(x)));y=y
}
if(length(y)<length(x)) {
if(first==1) y = c(rep(NA, length(x)-length(y)),y);x=x
if(first==0) y = c(y,rep(NA, length(x)-length(y)));x=x
}
x <- as.data.frame(x)
y <- as.data.frame(y)
return(data.frame(x,y))
}
my.cbind(x,y,first)
# x y
# 1 NA a
# 2 4 b
# 3 3 c
# 4 2 d
# 5 1 e
my.cbind(c(1:5),letters[1:4],1)
my.cbind(c(1:5),letters[1:4],0)
my.cbind(c(1:4),letters[1:5],1)
my.cbind(c(1:4),letters[1:5],0)
my.cbind(c(1:5),letters[1:5],1)
my.cbind(c(1:5),letters[1:5],0)
A while back I had put together a function called Cbind that was meant to do this sort of thing. In its current form, it should be able to handle vectors, data.frames, and matrices as the input.
For now, the function is here: https://gist.github.com/mrdwab/6789277
Here is how one would use the function:
x <- 1:5
y <- letters[1:4]
z <- matrix(1:4, ncol = 2, dimnames = list(NULL, c("a", "b")))
Cbind(x, y, z)
# x y z_a z_b
# 1 1 a 1 3
# 2 2 b 2 4
# 3 3 c NA NA
# 4 4 d NA NA
# 5 5 <NA> NA NA
Cbind(x, y, z, first = FALSE)
# x y z_a z_b
# 1 1 <NA> NA NA
# 2 2 a NA NA
# 3 3 b NA NA
# 4 4 c 1 3
# 5 5 d 2 4
The three functions required are padNA, dotnames, and Cbind, which are defined as follows:
padNA <- function (mydata, rowsneeded, first = TRUE) {
## Pads vectors, data.frames, or matrices with NA
temp1 = colnames(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(
data.frame(matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
if (isTRUE(first)) rbind(mydata, temp2)
else rbind(temp2, mydata)
}
dotnames <- function(...) {
## Gets the names of the objects passed through ...
vnames <- as.list(substitute(list(...)))[-1L]
vnames <- unlist(lapply(vnames,deparse), FALSE, FALSE)
vnames
}
Cbind <- function(..., first = TRUE) {
## cbinds vectors, data.frames, and matrices together
Names <- dotnames(...)
datalist <- setNames(list(...), Names)
nrows <- max(sapply(datalist, function(x)
ifelse(is.null(dim(x)), length(x), nrow(x))))
datalist <- lapply(seq_along(datalist), function(x) {
z <- datalist[[x]]
if (is.null(dim(z))) {
z <- setNames(data.frame(z), Names[x])
} else {
if (is.null(colnames(z))) {
colnames(z) <- paste(Names[x], sequence(ncol(z)), sep = "_")
} else {
colnames(z) <- paste(Names[x], colnames(z), sep = "_")
}
}
padNA(z, rowsneeded = nrows, first = first)
})
do.call(cbind, datalist)
}
Part of the reason I stopped working on the function was that the gdata package already has a function called cbindX that handles cbinding data.frames and matrices with different numbers of rows. It will not work directly on vectors, so you need to convert them to data.frames first.
library(gdata)
cbindX(data.frame(x), data.frame(y), z)
# x y a b
# 1 1 a 1 3
# 2 2 b 2 4
# 3 3 c NA NA
# 4 4 d NA NA
# 5 5 <NA> NA NA