Some example data:
library(data.table)
mydat <- data.table(id1 = rep(c("A","B","C"), each = 3),
                    id2 = c("D","E","G", "D","E","F", "G","E","D"),
                    val = c(1,2,4, 1,2,3, 4,2,1))
Which gives
id1 id2 val
1: A D 1
2: A E 2
3: A G 4
4: B D 1
5: B E 2
6: B F 3
7: C G 4
8: C E 2
9: C D 1
My goal is to get the unique combinations of id2 and val, and then generate a variable that depends on those unique values (e.g. the sum across unique observations, as below). This variable should then be put into a column in the original data.table. I often find myself writing code like the following:
## This is the most obvious way
tmp <- unique(mydat[,.(id2,val)])
tmp[,weight:=val/sum(val)]
tmp[,val:=NULL]
mydat <- merge(mydat,tmp,by="id2",all.x=TRUE)
## A second option which doesn't require merging
mydat[,first:=FALSE]
mydat[mydat[,.I[1],by=.(id2)]$V1,first:=TRUE]
mydat[first==TRUE,weight2:=val/sum(val)]
mydat[,weight2:=max(weight2,na.rm = TRUE),by=.(id2)]
mydat[,first:=NULL]
This gives
id2 id1 val weight weight2
1: D A 1 0.1 0.1
2: D B 1 0.1 0.1
3: D C 1 0.1 0.1
4: E A 2 0.2 0.2
5: E B 2 0.2 0.2
6: E C 2 0.2 0.2
7: F B 3 0.3 0.3
8: G A 4 0.4 0.4
9: G C 4 0.4 0.4
Entirely out of curiosity, is there a cleaner (more data.table) way to do this? Perhaps with self joins? Performance is important because the actual data I'm working with tends to be quite large.
I agree with @thelatemail that the approaches in the OP are already pretty clean.
If you must use this structure, there's:
setorder(mydat, id2)
mydat[unique(id2), on=.(id2), mult="first", w2 := val/sum(val)]
mydat[, w2 := nafill(w2, type="locf")]
I'm just sorting because that's shown in the desired output. To keep the original sorting, drop setorder and change the last line to mydat[order(id2), w2 := nafill(w2, type="locf")].
The nafill function is available in data.table 1.12.3+ (at the time of writing, a development version not yet on CRAN).
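If you are on a CRAN release that predates nafill, the same forward fill can be done with zoo's na.locf (a sketch, assuming the zoo package is available; it replaces the nafill line above):
library(zoo)
mydat[, w2 := na.locf(w2, na.rm = FALSE)]   # forward-fill w2 down the sorted table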
I would suggest instead using a set of normalized/"tidy" tables: val is an attribute of id2, so you could have an id2 table containing such things.
# same as OP's tmp
id2DT = unique(mydat[, .(id2, val)])
setkey(id2DT, id2)
id2DT[, w := val/sum(val)]
# drop redundant repeating val unless you really need it there
# to save on space and improve readability
mydat[, val := NULL]
# merge w in if/when needed
mydat[, w := id2DT[.SD, on=.(id2), x.w]]
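With the normalized layout kept, the same join idiom can also be used for an on-the-fly lookup when storing w in mydat isn't wanted (a sketch):
id2DT[mydat, on=.(id2), .(id1, id2, w)]   # one row per row of mydat, w looked up from the id2 table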
Here is a merge-free option:
total_val <- mydat[!duplicated(id2, val), sum(val)] # Just the scalar we are after
mydat[, `:=`(val = val[1], weight = val[1] / total_val), by = id2]
# id1 id2 val weight
# 1: A D 1 0.1
# 2: B D 1 0.1
# 3: C D 1 0.1
# 4: A E 2 0.2
# 5: B E 2 0.2
# 6: C E 2 0.2
# 7: B F 3 0.3
# 8: A G 4 0.4
# 9: C G 4 0.4
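Note that duplicated(id2, val) passes val as duplicated()'s incomparables argument; it happens to work here because val is constant within each id2. An equivalent, perhaps more transparent way to get the same scalar is a sketch like:
total_val <- mydat[, val[1], by = id2][, sum(V1)]   # one val per id2 level, then summed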
Let me illustrate with an example:
library(data.table)
A <- data.table( value = c(1.3,2.1,2.7), '1' = c(0.4,0.3,0.5), '2' = c(1.1,1.3,1.7) , '3' = c(2.1,2.4,2.6) )
> A
value 1 2 3
1: 1.3 0.4 1.1 2.1
2: 2.1 0.3 1.3 2.4
3: 2.7 0.5 1.7 2.6
I would like to use x = 1,2,3 and, for each row, y taken from the columns "1", "2", "3", and interpolate at the point given in the "value" column.
So, for the first row, x = 1,2,3 and y = 0.4, 1.1, 2.1, and the value at which to interpolate is x0 = 1.3. And so on for the following rows. I came up with the following function, applied by row over the data.table:
## Function to Interpolate
interpol <- function(dt){
  # Define X
  x <- c(1, 2, 3)
  # Grab each row of the data.table
  y <- as.numeric(dt[, c('1','2','3'), with = FALSE])
  # Interpolate and get the value of Y
  approx(x, y, dt[, 'value', with = FALSE])$y
}
# Apply by row
A[, interpol(.SD), by = 1:nrow(A)]
The problem is that this seems to be extremely slow for a data.table of a few million rows. What would be the best way to optimize this?
Side Note
Originally, my problem was as follows:
I had to do the same interpolation, but with the data split across two tables, A2 and B2:
A2 <- data.table(name = LETTERS[1:3], value = c(1.3,2.1,2.7))
B2 <- data.table(name = LETTERS[1:3], '1' = c(0.4,0.3,0.5), '2' = c(1.1,1.3,1.7) , '3' = c(2.1,2.4,2.6) )
> A2
name value
1: A 1.3
2: B 2.1
3: C 2.7
> B2
name 1 2 3
1: A 0.4 1.1 2.1
2: B 0.3 1.3 2.4
3: C 0.5 1.7 2.6
I decided to merge these two data.tables into the one above because I believed it would be easier. But maybe it would run faster with them kept as separate data.tables?
First, I'd suggest storing the B2 data in long format:
dat = melt(B2, id="name", value.name = "v")
dat[, variable := as.numeric(as.character(variable))]
name variable v
1: A 1 0.4
2: B 1 0.3
3: C 1 0.5
4: A 2 1.1
5: B 2 1.3
6: C 2 1.7
7: A 3 2.1
8: B 3 2.4
9: C 3 2.6
From here, you can join on A2 to interpolate by group:
dat[A2, on=.(name), approx(variable, v, value)$y, by=.EACHI]
name V1
1: A 0.61
2: B 1.41
3: C 2.33
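If the interpolated values are wanted as a column on A2 rather than as a separate result, a sketch using an update join (y_interp is just an illustrative column name):
res <- dat[A2, on=.(name), .(y = approx(variable, v, value)$y), by=.EACHI]
A2[res, on=.(name), y_interp := i.y]
A2
#    name value y_interp
# 1:    A   1.3     0.61
# 2:    B   2.1     1.41
# 3:    C   2.7     2.33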
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor=NA
df <- df[complete.cases(df), ]
This is the full data frame; cor = NA represents a correlation smaller than 0.8:
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the above df, F does not appear in V1, meaning that F is not of interest,
so here I remove the rows where V2 = F (more generally, where V2 takes a value that does not appear in V1):
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now df.gp is the dataset I need to work on.
I drop the unused level in V2 (which is F in this example):
df.gp$V2 <- droplevels(df.gp$V2)
I do not want to exclude the self-correlated entries, because some variables in V1 may not be correlated with any others, and each of those should be put in its own group.
Looking at cor, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should produce three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat <- as.matrix(dcast(data = df,formula = V1~V2))[,-1]
row.names(cormat) <- colnames(cormat)[-length(colnames(cormat))]
cormat
After I had your correlation matrix, it is easy to see which indices of non-NA values are shared between variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
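To turn that shared index into explicit group labels, a small sketch continuing from the a data.frame above (the group column name is just for illustration):
a$group <- as.integer(factor(a[[1]]))   # first index column (X1 in the output above)
split(a$var, a$group)
# roughly:
# $`1`
# [1] "A" "B"
# $`2`
# [1] "C" "D"
# $`3`
# [1] "E"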
Edited by cyrusjan:
The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold (taken as 0.8 in the question above).
Contributed by alexis_laz:
By using cutree and hclust, we can set the threshold inside the script (i.e. h = 0.8), as below:
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
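The resulting cor.gp data.frame should have one row per variable (row names A to E); to list the members of each group, a small sketch:
split(rownames(cor.gp), cor.gp$cor.gp)
# should give A & B together, C & D together, and E on its own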
I have a data.frame of this form:
sequence support
1 a-b 0.6
2 b-c 0.6
3 a-c 0.6
4 a-b-c 1.0
5 a-d 0.6
and I can transform this to the following:
1 2 3 support
1 a b <NA> 0.6
2 b c <NA> 0.6
3 a c <NA> 0.6
4 a b c 1.0
5 a d <NA> 0.6
I need to transform the table above into something like this:
1 2 support
1 a b 0.6
2 b c 0.6
3 a d 1.0
More specifically, I want to draw a Sankey Diagram.
So I have to transform the first data.frame into a 'start node' / 'end node' form.
For example, to draw the sequences 'a-b-c' and 'a-d', I need following data.frame:
start end
a b
b c
a d
How can I do this?
Using strsplit and apply:
# data
df1 <- read.table(text = "sequence support
1 a-b 0.6
2 b-c 0.6
3 a-c 0.6
4 a-b-c 1.0
5 a-d 0.6", header = TRUE, as.is = TRUE)
# result - input for sankey
datSankey <-
  do.call(rbind,
          apply(df1, 1, function(i){
            x <- unlist(strsplit(i[1], "-"))
            cbind.data.frame(
              From = x[1:(length(x) - 1)],
              To = x[2:length(x)],
              Weight = as.numeric(i[2]),
              deparse.level = 0)
          }))
# From To Weight
# 1 a b 0.6
# 2 b c 0.6
# 3 a c 0.6
# 4.sequence1 a b 1.0
# 4.sequence2 b c 1.0
# 5 a d 0.6
# plot
library(googleVis)
plot(gvisSankey(datSankey,
from = "From", to = "To", weight = "Weight"))
We can try
library(splitstackshape)
i1 <- grepl("-[^-]+-", df$sequence)
df$sequence[i1] <- sub("-[^-]+", "", df$sequence[i1])
res <- cSplit(df[!(duplicated(df$sequence)|duplicated(df$sequence,
fromLast=TRUE)),], "sequence", "-")
res[, 2:3, with = FALSE]
# sequence_1 sequence_2
#1: a b
#2: b c
#3: a d
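If the support values are needed alongside the split nodes, note that cSplit only splits the sequence column and carries the other columns through, so a small sketch:
res[, .(sequence_1, sequence_2, support)]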
I would like to remove data points above the 97.5th percentile and below the 2.5th percentile. I created the following parsimonious data set to explain the issue:
y <- data.table(a = rep(c("b","d"), each = 2, times = 3), c=rep(c("e","f"),
each = 3, times = 2), seq(1,6))
I created the following script to accomplish the task:
require(data.table)
y[, trimErr := ifelse(y$V3 < quantile(y$V3, 0.95) & y$V3 > quantile(y$V3, 0.05),y$V3, NA),
by = list(a,c)]
I then got 4 warning messages; I will only show the first one:
Warning messages:
1: In `[.data.table`(y, , `:=`(trimErr, ifelse(y$V3 < quantile(y$V3, :
RHS 1 is length 12 (greater than the size (3) of group 1). The last 9 element(s) will be discarded.
Can you please explain what the warning means and how I can modify my code?
Would you also suggest better code to remove the top and bottom 2.5% of the data? Thanks in advance.
You're grouping by a and c, but passing in a vector that is the length of the entire data.table, instead of just the data for each group.
You don't need the y$ inside the [.data.table call:
y[, trimErr:=ifelse(V3 < quantile(V3, 0.95) & V3 > quantile(V3, 0.05),V3, NA),
by=list(a,c)]
y
# a c V3 trimErr
# 1: b e 1 NA
# 2: b e 2 2
# 3: d e 3 NA
# 4: d f 4 NA
# 5: b f 5 5
# 6: b f 6 NA
# 7: d e 1 NA
# 8: d e 2 2
# 9: b e 3 NA
#10: b f 4 NA
#11: d f 5 5
#12: d f 6 NA
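As an aside, the question mentions 2.5%/97.5% cutoffs while the code above uses 5%/95%; if the former is what's wanted, a sketch along the same lines (trimErr2 is just an illustrative column name, keeping values strictly inside the bounds):
y[, trimErr2 := {
  qs <- quantile(V3, c(0.025, 0.975))
  ifelse(V3 > qs[1] & V3 < qs[2], V3, NA)
}, by = .(a, c)]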
I have the diagram below (image not included here).
The red arrows represent weighting factors for each node, in the direction they point to. The input file contains the value of each factor and its direction.
Can this kind of factor diagram be plotted with R?
First, some dummy data which (I hope) emulates yours (hard to say, considering how little information you gave):
ow <- expand.grid(c(1.5,2.5),c(1.5,2.5))
row.names(ow)<-letters[1:4]
pw <- expand.grid(1:3,1:3)
row.names(pw)<-LETTERS[1:9]
B <- rbind(expand.grid("a", row.names(pw)[c(1,2,4,5)]),
           expand.grid("b", row.names(pw)[c(2,3,5,6)]),
           expand.grid("c", row.names(pw)[c(4,5,7,8)]),
           expand.grid("d", row.names(pw)[c(5,6,8,9)]))
B <- cbind(B, abs(rnorm(16)))
So we have:
# The location of your oil wells:
ow
Var1 Var2
a 1.5 1.5
b 2.5 1.5
c 1.5 2.5
d 2.5 2.5
# Of your production wells:
pw
Var1 Var2
A 1 1
B 2 1
C 3 1
D 1 2
E 2 2
F 3 2
G 1 3
H 2 3
I 3 3
# And a b value for each pair of neighbouring oil/production wells:
Var1 Var2 abs(rnorm(16))
1 a A 1.78527757
2 a B 1.61794028
3 a D 1.80234599
4 a E 0.04202002
5 b B 0.90265280
6 b C 1.05214769
7 b E 0.67932237
8 b F 0.11497430
9 c D 0.26288589
10 c E 0.50745137
11 c G 0.74102529
12 c H 1.43919338
13 d E 1.04111278
14 d F 0.49372216
15 d H 0.21500663
16 d I 0.20156929
And here is a simple function that plots more or less the kind of graph you showed:
weirdplot <- function(ow_loc, pw_loc, B,
                      pch_ow = 19, pch_pw = 17,
                      col_ow = "green", col_pw = "blue", col_b = "red", breaks){
  # with ow_loc and pw_loc the locations of your wells
  # B the correspondence table
  # pch_ow and pch_pw the point types for the wells
  # col_b, col_ow and col_pw the colours for the arrows and the wells
  # and breaks a vector of size categories for b values
  plot(pw_loc, type = "n")
  b <- cut(B[,3], breaks = breaks)
  for(i in 1:nrow(B)){
    start <- ow_loc[row.names(ow_loc) == B[i,1], ]
    end   <- pw_loc[row.names(pw_loc) == B[i,2], ]
    arrows(x0 = start[,1], y0 = start[,2],
           x1 = end[,1],   y1 = end[,2], lwd = b[i], col = col_b)
  }
  points(pw_loc, pch = pch_pw, col = col_pw)
  points(ow_loc, pch = pch_ow, col = col_ow)
}
So with the values we created earlier:
weirdplot(ow, pw, B, breaks=c(0,0.5,1,1.5,2))
It's not particularly pretty but it should get you started.