What is the most efficient way to transpose
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
into:
> output
cn v1 v2 v3 v4 v5 v6 v7
1: x 1 1 3 1 3 1 1
2: y 1 2 1 2 2 1 1
dcast.data.table is supposed to be efficient, but I can't figure out exactly how it should be applied here.
How about data.table::transpose:
data.table(cn = names(dt), transpose(dt))
# cn V1 V2 V3 V4 V5 V6 V7
#1: x 1 1 3 1 3 1 1
#2: y 1 2 1 2 2 1 1
If you are really concerned about efficiency, this may be better:
tdt <- transpose(dt)[, cn := names(dt)]
setcolorder(tdt, c(ncol(tdt), 1:(ncol(tdt) - 1)))
tdt
# cn V1 V2 V3 V4 V5 V6 V7
#1: x 1 1 3 1 3 1 1
#2: y 1 2 1 2 2 1 1
transpose seems to be a little faster than t (which calls do_transpose internally), but not by a large margin. I would guess that both implementations are roughly at the upper bound of efficiency for non-in-place transposition algorithms.
Dt <- data.table(
  x = rep(c(1, 1, 3, 1, 3, 1, 1), 10e2),
  y = rep(c(1, 2, 1, 2, 2, 1, 1), 10e2))
all.equal(data.table(t(Dt)), data.table(transpose(Dt)))
#[1] TRUE
microbenchmark::microbenchmark(
"base::t" = data.table(t(Dt)),
"data.table::transpose" = data.table(transpose(Dt))
)
#Unit: milliseconds
# expr min lq mean median uq max neval
#base::t 415.4200 434.5308 481.4373 458.1619 507.9556 747.2022 100
#data.table::transpose 409.5624 425.8409 474.9709 444.5194 510.3750 685.0543 100
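Newer versions of data.table (1.12.0 or later, if I remember correctly) can produce the desired layout in a single call via transpose's keep.names argument:
transpose(dt, keep.names = "cn")
#    cn V1 V2 V3 V4 V5 V6 V7
# 1:  x  1  1  3  1  3  1  1
# 2:  y  1  2  1  2  2  1  1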
Code to identify the fields of the object [temp_table] and report them via the object [temp_table_schema]:
temp_table
# class of each column; note class() can return more than one value
temp_table_data_types <- sapply(temp_table, class)
temp_table_schema <- NULL
for (x in seq_along(temp_table_data_types)) {
  temp_table_schema <- base::rbind(
    temp_table_schema,
    data.table(ROWID       = x,
               COLUMN_NAME = names(temp_table_data_types[x]),
               DATA_TYPE   = temp_table_data_types[[x]][[1]],
               # a second class entry, if present, is reported via DETAILS
               DETAILS     = if (length(as.list(temp_table_data_types[[x]])) > 1) {
                 as.list(temp_table_data_types[[x]])[[2]]
               } else {
                 ""
               }))
}
temp_table_schema
rm(list = c("temp_table_data_types"))
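For reference, the same schema can be built without growing a table inside a loop. A minimal sketch, keeping the column names from above:
# class() can return more than one value, e.g. c("POSIXct", "POSIXt")
classes <- lapply(temp_table, class)
temp_table_schema <- data.table(
  ROWID       = seq_along(classes),
  COLUMN_NAME = names(classes),
  DATA_TYPE   = vapply(classes, `[`, character(1), 1),
  DETAILS     = vapply(classes, function(cl) if (length(cl) > 1) cl[2] else "", character(1))
)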
My dataset has 575 rows and 368 columns and it looks like this:
NUTS3_2016 URAU_CODE FUA_CODE X2018.01.01.x X2018.01.02.x X2018.01.03.x ...
1 AT130 AT001C1 AT001L3 0.46369280 0.3582241 0.2777274 ...
2 AT211 AT006C1 AT006L2 -0.04453125 -0.3092773 -0.3284180 ...
3 AT312 AT003C1 AT003L3 1.02993164 0.9640137 0.6413086 ...
4 AT323 AT004C1 AT004L3 1.21105239 1.4335363 1.2400620 ...
... ... .... ... ... ... .... ...
I want to calculate the probability that x>2.5 for each row.
I also want to calculate for how many consecutive days x remains >2.5 for each row.
What are your suggestions?
Many thanks
Attempt:
A <- c("a", "b", "c", "d", "e")
B <- c(1:5)
C <- c(1:5)
x <- data.frame(A,B,C)
x$prob <- rowMeans(x[-1] > 2)  # share of entries above the cutoff in each row
x
# A B C prob
# 1 a 1 1 0
# 2 b 2 2 0
# 3 c 3 3 1
# 4 d 4 4 1
# 5 e 5 5 1
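Applied to the question's data (assuming the data frame is called df and the first three columns are the identifiers NUTS3_2016, URAU_CODE, FUA_CODE), that would presumably become:
# share of daily values above 2.5 in each row, skipping the three ID columns
df$prob <- rowMeans(df[-(1:3)] > 2.5)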
We can use rle for finding the length of the maximum streak.
## Some sample data:
set.seed(47)
data = matrix(rnorm(24, mean = 2.5), nrow = 3)
data = cbind(ID = c("A", "B", "C"), as.data.frame(data))
data
# ID V1 V2 V3 V4 V5 V6 V7 V8
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191
data$max_streak = apply(data[-1], 1, function(x) with(rle(x > 2.5), max(lengths[values])))
# ID V1 V2 V3 V4 V5 V6 V7 V8 max_streak
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148 2
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584 3
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191 3
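One caveat not covered above: max(lengths[values]) warns and returns -Inf for a row that never exceeds the cutoff. A guarded variant (my sketch) might look like:
max_streak <- function(x, cutoff = 2.5) {
  r <- rle(x > cutoff)
  if (any(r$values)) max(r$lengths[r$values]) else 0L  # 0 if never above cutoff
}
data$max_streak <- apply(data[-1], 1, max_streak)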
I've got a data frame like this:
df:
A B C
1 1 2 3
2 2 2 4
3 2 2 3
I would like to subtract from each column the next smaller one (A-0, B-A, C-B). So my results should look like this:
df:
A B C
1 1 1 1
2 2 0 2
3 2 0 1
I tried the following loop, but it didn't work.
for (i in 1:3) {
j <- data[,i+1] - data[,i]
}
Try this
df - cbind(0, df[-ncol(df)])
# A B C
# 1 1 1 1
# 2 2 0 2
# 3 2 0 1
Data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))
We can also do the subtraction by removing the first column from one copy and the last column from the other:
df[-1] <- df[-1]-df[-length(df)]
data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))
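The same operation can also be read as a row-wise first difference after padding a zero column on the left; a minimal sketch of that view:
# diff() along each row of cbind(0, A, B, C) gives A-0, B-A, C-B
as.data.frame(t(apply(cbind(0, as.matrix(df)), 1, diff)))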
So I have a data.table where I need to fill in values based on the index of the column and then also based on the placeholder character. Example:
V1 V2 V3 V4
Row1 1 1 a d
Row2 1 1 a d
Row3 1 1 a d
Row4 1 2 a h
Row5 1 2 a h
Row6 1 2 a h
Row7 2 1 b i
Row8 2 1 b i
Row9 2 1 b i
Row10 2 2 b t
Row11 2 2 b t
Row12 2 2 b t
....
Row350k ...
What I need to figure out is how to write a for loop with an assignment-by-reference statement that slides along column 1's index. Basically:
For each column index, one at a time:
For each V1 = 1 and V2 = 1, replace character 'a' with one iteration of 0.0055 + rnorm(1, 0.0055, 0.08).
For each V1 = 1 and V2 = 2, replace character 'a' with one iteration of 0.0055 + rnorm(1, 0.0055, 0.08) (same variation but with another iteration of the rnorm).
For each V1 = 2 and V2 = 1, replace character 'b' with one iteration of 0.0055 + rnorm(1, 0.001, 0.01).
For each V1 = 2 and V2 = 2, replace character 'b' with one iteration of 0.0055 + rnorm(1, 0.001, 0.01) (same variation but with another iteration of the rnorm).
And so on for each incrementing value of Col1 and Col2. In actuality it's 20+ values instead of just 2 for the second index.
Desired output then is:
Col1 Col2 Col3 Col4
Row1 1 1 0.00551 d
Row2 1 1 0.00551 d
Row3 1 1 0.00551 d
Row4 1 2 0.00553 h
Row5 1 2 0.00553 h
Row6 1 2 0.00555 h
Row7 2 1 0.0011 i
Row8 2 1 0.0011 i
Row9 2 1 0.0011 i
Row10 2 2 0.0010 t
Row11 2 2 0.0010 t
Row12 2 2 0.0010 t
....
Row350k ...
I'm just not sure how to do this with a loop, since the values in col1 are repeated a certain number of times. Column 1 has 300k-plus values, so the sliding loop needs to be dynamically scalable.
Here's what I have tried:
for (i in seq(1, 4000, 1)) {
  for (ii in seq(1, 2, 1)) {
    data.table[V3 == "a", V3 := 0.0055 + rnorm(1, 0.0055, 0.08)]
    data.table[V3 == "b", V3 := 0.0055 + rnorm(1, 0.001, 0.01)]
  }
}
Thanks!
If I understand your problem correctly this might be of help.
library(data.table)
dt <- data.table(V1 = c(rep(1, 6), rep(2, 6)),
                 V2 = rep(c(rep(1, 3), rep(2, 3)), 2),
                 V3 = c(rep("a", 6), rep("b", 6)),
                 V4 = c(rep("d", 3), rep("h", 3), rep("i", 3), rep("t", 3)))
# define a catalog to join on V3 which contains the parameters for the random number generation
catalog <- data.table(V3 = c("a", "b"),
                      const = 0.0055,
                      mean = c(0.0055, 0.001),
                      std = c(0.08, 0.01))
# for each value of V3 generate .N (number of observations of the current V3 value) random numbers with the specified parameters
dt[catalog, V5 := i.const + rnorm(.N, i.mean, i.std), on = "V3", by = .EACHI]
dt[, V3 := V5]
dt[, V5 := NULL]
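If instead one draw per (V1, V2) group is wanted, which is what the desired output seems to show, a variant of the same join (my sketch, starting again from a fresh dt):
# pull the parameters in with an update join (keeps row order),
# then draw a single random number per (V1, V2) group
dt[catalog, on = "V3", `:=`(const = i.const, mu = i.mean, sigma = i.std)]
dt[, V5 := const + rnorm(1, mu, sigma), by = .(V1, V2)]
dt[, c("const", "mu", "sigma") := NULL]
dt[, V3 := V5][, V5 := NULL]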
Ok, so I figured out that I wasn't incrementing my counters properly. For a matrix/data.table with 4000 scenarios in the first column, each with 11 repeats in the second column, I used the following:
for (col1counter in 1:4000) {
  for (col2counter in 1:11) {
    test1[V1 == col1counter & V2 == col2counter & V3 == "a",
          V55 := 0.00558 + rnorm(1, 0.00558, 2)]
  }
}
Using both indices in the conditional statement ensures that it crawls accurately through the rows.
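For what it's worth, the same crawl can likely be done without the explicit double loop, drawing once per (V1, V2) group by reference (assuming the same test1 columns):
# one draw per (V1, V2) group, only for rows where V3 == "a"
test1[V3 == "a", V55 := 0.00558 + rnorm(1, 0.00558, 2), by = .(V1, V2)]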
How to find the indices of the top k (say k = 3) values for each column?
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
Required output:
> output.1
x y
1: 1 2
2: 3 4
3: 5 5
Or even better (notice the additional helpful descending sort in x):
> output.2
var top1 top2 top3
1: x 3 5 1
2: y 2 4 5
Having the first output would already be a great help.
We can use sort (with index.return=TRUE) after looping over the columns of the dataset with lapply
dt[, lapply(.SD, function(x) sort(head(sort(x,
decreasing=TRUE, index.return=TRUE)$ix,3)))]
# x y
#1: 1 2
#2: 3 4
#3: 5 5
Or use order
dt[, lapply(.SD, function(x) sort(head(order(-x),3)))]
If the order of elements having the same rank doesn't matter, then this answer would also be valid.
The order information can be extracted from data.table index.
library(data.table)
dt = data.table(x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1))
set2key(dt, x)
set2key(dt, y)
tail.index = function(dt, index, n) {
  idx = attr(attr(dt, "index"), index)
  rev(tail(idx, n))
}
tail.index(dt, "__x", 3L)
#[1] 5 3 7
tail.index(dt, "__y", 3L)
#[1] 5 4 2
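On current data.table versions set2key has, as far as I know, been superseded by setindex; the index attribute should then be readable the same way:
setindex(dt, x)  # modern replacement for set2key(dt, x)
setindex(dt, y)
tail.index(dt, "__x", 3L)
tail.index(dt, "__y", 3L)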
Here's a verbose solution which I'm sure undermines the slickness of the data.table package:
dt$idx <- seq_len(nrow(dt))
k <- 3
top_x <- dt[order(-x), idx[1:k]]
top_y <- dt[order(-y), idx[1:k]]
dt_top <- data.table(top_x, top_y)
dt_top
# top_x top_y
# 1: 3 2
# 2: 5 4
# 3: 1 5
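None of the above produces the output.2 layout directly. A sketch that does, using data.table::transpose and starting from the original two-column dt (the names top1..top3 are mine):
k <- 3
top <- dt[, lapply(.SD, function(x) head(order(-x), k))]
out2 <- data.table(var = names(top), transpose(top))
setnames(out2, c("var", paste0("top", 1:k)))
out2
#    var top1 top2 top3
# 1:   x    3    5    1
# 2:   y    2    4    5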
I have a data frame containing a list vector with jagged entries:
df = data.frame(x = rep(c(1,2), 2), y = rep(c("a", "b"), each = 2))
L = list()
for (each in round(runif(4, 1,5))) L = c(L, list(1:each))
df$L = L
For example,
x y L
1 a 1
2 a 1, 2, 3, 4
1 b 1, 2, 3
2 b 1, 2, 3
How could I create a table which counts the values of L for each x, across the values of y? So, in this example it would output something like,
1 2 3 4
X
1 2 1 1 0
2 2 2 2 1
I had some luck using
tablist = function(L) table(unlist(L))
tapply(df$L, df$x, tablist)
which produces,
$`1`
1 2 3
2 1 1
$`2`
1 2 3 4
2 2 2 1
However, I'm not sure how to go from here to a single table. Also, I'm beginning to suspect that this approach might take an unruly amount of time for large data frames. Any thoughts / suggestions would be greatly appreciated!
Using plyr:
library(plyr)
df = data.frame(x = rep(c(1,2), 2), y = rep(c("a", "b"), each = 2))
L = list()
set.seed(2)
for (each in round(runif(4, 1,5))) L = c(L, list(1:each))
df$L = L
> df
x y L
1 1 a 1, 2
2 2 a 1, 2, 3, 4
3 1 b 1, 2, 3
4 2 b 1, 2
> table(ddply(df,.(x),summarize,unlist(L)))
..1
x 1 2 3 4
1 2 2 1 0
2 2 2 1 1
If you're not into plyr...
vals <- unique(unlist(df$L))
names(vals) <- vals
do.call("rbind",
lapply(split(df,df$x),function(byx){
sapply(vals, function(i){
sum(unlist(sapply(byx$L,"==",i)))
})
})
)
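A compact base-R alternative, for what it's worth: repeat each x by the length of its list entry and cross-tabulate against the unlisted values:
# lengths() gives the length of each list element (R >= 3.2.0)
table(rep(df$x, lengths(df$L)), unlist(df$L))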