Weighted row average in a time series join in R

Hello, I'm looking for the cleanest/fastest way to solve the following problem.
My setup looks like this:
library(data.table)
set.seed(1234)
DT1 <- data.table(replicate(12,runif(5)))
setnames(DT1,LETTERS[1:12])
DT1[,time:=100]
DT2 <- data.table(time=rep(100,12), grp=rep(c("X","Y","Z"),each=4),
                  sub=LETTERS[1:12], weight=sample(1:100,12))
options(digits=2)
DT1
A B C D E F G H I J K L time
1: 0.11 0.6403 0.69 0.84 0.32 0.811 0.46 0.76 0.55 0.50 0.074 0.50 100
2: 0.62 0.0095 0.54 0.29 0.30 0.526 0.27 0.20 0.65 0.68 0.310 0.49 100
3: 0.61 0.2326 0.28 0.27 0.16 0.915 0.30 0.26 0.31 0.48 0.717 0.75 100
4: 0.62 0.6661 0.92 0.19 0.04 0.831 0.51 0.99 0.62 0.24 0.505 0.17 100
5: 0.86 0.5143 0.29 0.23 0.22 0.046 0.18 0.81 0.33 0.77 0.153 0.85 100
DT2
time grp sub weight
1: 100 X A 87
2: 100 X B 5
3: 100 X C 32
4: 100 X D 2
5: 100 Y E 23
6: 100 Y F 68
7: 100 Y G 29
8: 100 Y H 48
9: 100 Z I 99
10: 100 Z J 52
11: 100 Z K 11
12: 100 Z L 80
I want to compute a weighted average (per row) of the columns of DT1, using the groups, subclasses and weights from DT2, joining per time point.
DT1 then gets columns X, Y and Z bound to it; in this case, the value of column X for the first row is (87*0.11 + 5*0.64 + 32*0.69 + 2*0.84) / (87 + 5 + 32 + 2) ≈ 0.29.
There are millions of rows in DT1 with different time points, so memory might be a limiting factor.
Any advice would be much appreciated!

Something like this perhaps:
library(reshape2)
setkey(DT2, time, sub)
DT2[melt(DT1, id.var = 'time')[, row := 1:.N, by = list(time, variable)]][,
    sum(weight * value) / sum(weight), by = list(time, grp, row)]
# time grp row V1
# 1: 100 X 1 0.29
# 2: 100 X 2 0.57
# 3: 100 X 3 0.51
# 4: 100 X 4 0.69
# 5: 100 X 5 0.69
# 6: 100 Y 1 0.67
# 7: 100 Y 2 0.36
# 8: 100 Y 3 0.52
# 9: 100 Y 4 0.71
#10: 100 Y 5 0.31
#11: 100 Z 1 0.50
#12: 100 Z 2 0.59
#13: 100 Z 3 0.51
#14: 100 Z 4 0.39
#15: 100 Z 5 0.59
You can also reshape the above result if you like:
# assuming you called the above table "res"
dcast.data.table(res, row + time ~ grp)
#Using 'V1' as value column. Use 'value.var' to override
# row time X Y Z
#1: 1 100 0.29 0.67 0.50
#2: 2 100 0.57 0.36 0.59
#3: 3 100 0.51 0.52 0.51
#4: 4 100 0.69 0.71 0.39
#5: 5 100 0.69 0.31 0.59
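Note that melt is nowadays part of data.table itself (the reshape2 methods for data.tables are deprecated), so the same join can be written without loading reshape2. A sketch, assuming a reasonably recent data.table; the column name wavg is just an illustrative choice:
# melt DT1 and number the rows within each time point
long <- melt(DT1, id.vars = "time")[, row := seq_len(.N), by = .(time, variable)]
# join on (time, sub = variable), then take the weighted mean per time/group/row
res <- DT2[long, on = .(time, sub = variable)][
  , .(wavg = sum(weight * value) / sum(weight)), by = .(time, grp, row)]
dcast(res, row + time ~ grp, value.var = "wavg")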

Related

R: need help matching up table rows and getting differences

I have chromatographic data in a table organized by peak position and integration value for various samples. Every sample in the table also has a repeated measurement, recorded under a different sample log number.
What I'm interested in is the repeatability of the measurements of the various peaks. The measure for that would be the difference in peak integration, which should ideally be 0 for each sample.
The data
Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03
where Log1 is the original sample log number and Log2 is the repeat log number.
How can I construct a new variable for every peak (being the difference PeakX_Log1 - PeakX_Log2)?
Mind that in my example I only have 5 peaks; the real-life situation is a complex mixture involving >20 peaks, so it is very hard to do by hand.
If you only ever have two values for each sample, something like this could work:
df <- data.table::fread(
"Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03"
)
library(tidyverse)
new_df <- df %>%
  # label each row as the original (lower log number) or the repeat measurement
  mutate(Log = ifelse(Log1 < Log2, "Log1", "Log2")) %>%
  select(-Log1, -Log2) %>%
  # one row per sample/peak/measurement, then original and repeat side by side
  pivot_longer(cols = starts_with("Peak"), names_to = "Peak") %>%
  pivot_wider(values_from = value, names_from = Log) %>%
  mutate(Variation = Log1 - Log2)
new_df
# A tibble: 15 × 5
Sample Peak Log1 Log2 Variation
<chr> <chr> <dbl> <dbl> <dbl>
1 A Peak1 0.2 0.21 -0.0100
2 A Peak2 0.8 0.7 0.100
3 A Peak3 0.3 0.33 -0.0300
4 A Peak4 0 0 0
5 A Peak5 0 0.1 -0.1
6 B Peak1 0.25 0.2 0.05
7 B Peak2 0.73 0.73 0
8 B Peak3 0.29 0.37 -0.08
9 B Peak4 0.01 0 0.01
10 B Peak5 0.04 0.03 0.01
11 C Peak1 0.2 0.22 -0.0200
12 C Peak2 0.8 0.81 -0.0100
13 C Peak3 0.3 0.31 -0.0100
14 C Peak4 0 0.04 -0.04
15 C Peak5 0.07 0 0.07
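For completeness, the same reshape-free idea can be written in data.table. A sketch, assuming exactly two rows per Sample and that the row with the smaller Log1 is the original measurement:
library(data.table)
dt <- as.data.table(df)
setorder(dt, Sample, Log1)  # original measurement first within each sample
peaks <- grep("^Peak", names(dt), value = TRUE)
# per sample: first row (original) minus second row (repeat), for every peak
dt[, lapply(.SD, function(p) p[1] - p[2]), by = Sample, .SDcols = peaks]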

R read.table fill empty data with value above

I have a text file with horrible formatting that I need to read into R. I am reading a bunch of other, well-formatted files with read.table, so I would like to continue using this function if possible.
The file looks like this:
M D YY CONC
7 1 78 15
0.00
0.15
1.06
1.21
10.91
34.55
69.09
87.27
73.67
38.65
12.27
2.27
6.52
0.45
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.19
0.96
4.59
4.55
4.59
7.25
7.13
11.60
1.06
0.15
1.50
1.16
0.00
0.00
0.00
0.00
0.00
7 1 78 16
0.00
0.00
0.00
0.00
7.25
1.50
9.00
20.25
51.25
55.00
53.75
3.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.80
0.98
4.00
2.47
5.63
3.50
7.88
0.43
2.30
0.00
0.00
0.00
0.00
0.00
0.00
0.00
7 1 78 17
4.15
0.00
0.00
0.15
2.27
16.36
54.37
67.96
58.07
3.58
0.89
0.20
0.52
0.59
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
5.44
0.00
3.09
3.26
7.17
9.39
8.65
3.09
0.45
7.41
3.18
0.00
2.05
0.00
There is one CONC per hour for the date provided in the first row. My ultimate goal is to have the date repeated on every row and to add a column for the hour. So the first bit should look like:
M D YY H CONC
7 1 78 1 15
7 1 78 2 0.00
7 1 78 3 0.15
7 1 78 4 1.06
7 1 78 5 1.21
7 1 78 6 10.91
7 1 78 7 34.55
7 1 78 8 69.09
I can read in the file using this:
monitor_datai <- read.table(file =file,header = TRUE, stringsAsFactors = FALSE, skip = 0, sep = "", fill = TRUE)
BUT the issue with that approach is that the first column gets filled with either the month (when a date line is provided) or the concentration (when no month is on the line), looking something like this:
head(monitor_datai)
V1 V2 V3 V4
1 7.00 1 78 15
2 0.00 NA NA NA
3 0.15 NA NA NA
4 1.06 NA NA NA
5 1.21 NA NA NA
6 10.91 NA NA NA
So, I need help reading in the file and fixing the formatting.
Thanks!
Here is my approach, using the weapons of the data.table package.
I was not sure what the values of H should become: just 1:128, a sequence by group, or something else? Please specify and I'll add it to the answer.
I included comments and in-between results in the code below, so you can (hopefully) follow the steps and adjust if/where needed.
library( data.table )
#read the file as-is, complete lines, no separator
DT <- fread( "./temp/testfile.txt", sep = "", skip = 1, header = FALSE )
# head(DT)
# V1
# 1: 7 1 78 15
# 2: 0.00
# 3: 0.15
# 4: 1.06
# 5: 1.21
# 6: 10.91
#get column names from the file, store in a vector
colnames = names( fread( "./temp/testfile.txt", sep = " ", nrows = 1, header = TRUE ) )
# split the rows that contain a space into the desired columns,
# using one space (or several in a row) as the separator
DT[ grepl(" ", V1), (colnames) := tstrsplit( V1, "[ ]+", perl = TRUE ) ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> <NA>
# 3: 0.15 <NA> <NA> <NA> <NA>
# 4: 1.06 <NA> <NA> <NA> <NA>
# 5: 1.21 <NA> <NA> <NA> <NA>
# ---
# 124: 7.41 <NA> <NA> <NA> <NA>
# 125: 3.18 <NA> <NA> <NA> <NA>
# 126: 0.00 <NA> <NA> <NA> <NA>
# 127: 2.05 <NA> <NA> <NA> <NA>
# 128: 0.00 <NA> <NA> <NA> <NA>
# where CONC is NA, copy the value of V1
DT[ is.na( CONC ), CONC := V1 ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> 0.00
# 3: 0.15 <NA> <NA> <NA> 0.15
# 4: 1.06 <NA> <NA> <NA> 1.06
# 5: 1.21 <NA> <NA> <NA> 1.21
# ---
# 124: 7.41 <NA> <NA> <NA> 7.41
# 125: 3.18 <NA> <NA> <NA> 3.18
# 126: 0.00 <NA> <NA> <NA> 0.00
# 127: 2.05 <NA> <NA> <NA> 2.05
# 128: 0.00 <NA> <NA> <NA> 0.00
#now we can drop the V1-column
DT[, V1 := NULL]
#set all columns to the right (numeric) type
DT[, (names(DT)) := lapply( .SD, as.numeric ) ]
#and fill down the missing values of M, D and YY
setnafill( DT, type = "locf", cols = c("M", "D", "YY") )
# M D YY CONC
# 1: 7 1 78 15.00
# 2: 7 1 78 0.00
# 3: 7 1 78 0.15
# 4: 7 1 78 1.06
# 5: 7 1 78 1.21
# ---
# 124: 7 1 78 7.41
# 125: 7 1 78 3.18
# 126: 7 1 78 0.00
# 127: 7 1 78 2.05
# 128: 7 1 78 0.00
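If H should simply number the readings within each header block (note that the date columns alone cannot separate the blocks here, since all three header lines read 7 1 78), one option is to tag the blocks before the V1 column is dropped above. A sketch, with block as a hypothetical helper column:
# run before DT[, V1 := NULL]: header lines are the ones still containing a space
DT[, block := cumsum(grepl(" ", V1))]
# then, at the end, number the rows within each block and drop the helper
DT[, H := seq_len(.N), by = block][, block := NULL]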

Create a single heatmap based on two symmetric matrices in R

Suppose I have two symmetric matrices:
library(Matrix)
set.seed(123)
s1<-forceSymmetric(matrix(round(rnorm(25),2),5))
colnames(s1)<-LETTERS[1:5]
rownames(s1)<-LETTERS[6:10]
diag(s1)<-1
s2<-forceSymmetric(matrix(round(rbinom(25,25,0.3),2),5))
colnames(s2)<-LETTERS[1:5]
rownames(s2)<-LETTERS[6:10]
diag(s2)<-1
s1
# 5 x 5 Matrix of class "dsyMatrix"
# A B C D E
# F 1.00 1.72 1.22 1.79 -1.07
# G 1.72 1.00 0.36 0.50 -0.22
# H 1.22 0.36 1.00 -1.97 -1.03
# I 1.79 0.50 -1.97 1.00 -0.73
# J -1.07 -0.22 -1.03 -0.73 1.00
s2
# 5 x 5 Matrix of class "dsyMatrix"
# A B C D E
# F 1 6 8 7 9
# G 6 1 5 9 8
# H 8 5 1 10 9
# I 7 9 10 1 1
# J 9 8 9 1 1
What I want is to generate a single heatmap where the upper triangle and its legend are based on matrix s1, while the lower triangle and its legend are based on matrix s2.
Get the upper triangle:
# reverse the column order, zero out the unwanted half, then reverse back
reverse = s1[, ncol(s1):1]
diag(reverse) = 0
reverse[lower.tri(reverse, diag = FALSE)] <- 0
upper = reverse[, ncol(reverse):1]
Get the lower triangle:
# the same trick for s2, zeroing the opposite half
reverse1 = s2[, ncol(s2):1]
diag(reverse1) = 0
reverse1[upper.tri(reverse1, diag = FALSE)] <- 0
upper1 = reverse1[, ncol(reverse1):1]
Time to add them up.
merged = as.matrix(upper+upper1)
merged
A B C D E
F 1.00 1.72 1.22 1.79 0
G 1.72 1.00 0.36 0.00 8
H 1.22 0.36 0.00 10.00 9
I 1.79 0.00 10.00 1.00 1
J 0.00 8.00 9.00 1.00 1
Plot it.
heatmap(merged)
You can find nicer heatmap options elsewhere.
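For instance, if you really want two separate legends (one per triangle), one route is ggplot2 together with the ggnewscale package; neither is used in the answer above, so treat this as a hedged sketch rather than the canonical way:
library(ggplot2)
library(ggnewscale)  # assumed installed; provides new_scale_fill()
m1 <- as.matrix(s1)
m2 <- as.matrix(s2)
# helper: long format for the cells selected by a logical matrix
to_long <- function(m, keep) {
  data.frame(row = rownames(m)[row(m)[keep]],
             col = colnames(m)[col(m)[keep]],
             value = m[keep])
}
up <- to_long(m1, upper.tri(m1, diag = TRUE))  # s1: upper triangle + diagonal
lo <- to_long(m2, lower.tri(m2))               # s2: strict lower triangle
ggplot() +
  geom_tile(data = up, aes(col, row, fill = value)) +
  scale_fill_gradient(name = "s1", low = "white", high = "red") +
  new_scale_fill() +  # start a second fill scale, hence a second legend
  geom_tile(data = lo, aes(col, row, fill = value)) +
  scale_fill_gradient(name = "s2", low = "white", high = "blue")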

Find position of elements of a dataframe inside other dataframe with R

I have the following dataframe (DF_A):
PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1: 1000000 0.03 0.58 0.01 0.42 0.69 0.98 0.55 0.96
2: 1000001 0.80 0.37 0.10 0.95 0.77 0.69 0.23 0.07
3: 1000002 0.25 0.73 0.79 0.83 0.24 0.82 0.81 0.01
4: 1000003 0.10 0.96 0.53 0.59 0.96 0.10 0.98 0.76
5: 1000004 0.36 0.87 0.76 0.03 0.95 0.40 0.53 0.89
6: 1000005 0.15 0.78 0.24 0.21 0.03 0.87 0.67 0.64
And I have this other dataframe (DF_B):
V1 V2 V3 V4 PARTY_ID
1 0.58 0.69 0.96 0.98 1000000
2 0.69 0.77 0.80 0.95 1000001
3 0.79 0.81 0.82 0.83 1000002
4 0.76 0.96 0.96 0.98 1000003
5 0.76 0.87 0.89 0.95 1000004
6 0.64 0.67 0.78 0.87 1000005
I need to find, for each row, the positions of the elements of DF_B within the corresponding row of DF_A, to get something like this:
PARTY_ID P1 P2 P3 P4
1 1000000 3 6 9 7
...
Currently I'm working with the match function, but it takes a lot of time (I have 400K rows). I'm doing this:
i <- 1
while (i < nrow(DF_A)) {
  position <- match(DF_B[i, ], DF_A[i, ])
  i <- i + 1
}
Although it works, it's very slow, and I know it's not the best answer to my problem. Can anyone help me, please?
You can merge and then Map with a by-group operation (df_a and df_b below are the DF_A and DF_B tables):
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]
df_a3 <- df_a2[, c(PARTY_ID,
                   Map(f = function(x, y) which(x == y),
                       x = list(.SD[, names(df_a), with = FALSE]),
                       y = .SD[, paste0("V", 1:4), with = FALSE])),
               by = 1:nrow(df_a2)]
setnames(df_a3, paste0("V", 1:5), c("PARTY_ID", paste0("P", 1:4)))[, nrow := NULL]
df_a3
# PARTY_ID P1 P2 P3 P4
#1: 1000000 3 6 9 7
#2: 1000001 7 6 2 5
#3: 1000002 4 8 7 5
#4: 1000003 9 3 3 8
#5: 1000003 9 6 6 8
#6: 1000004 4 3 9 6
#7: 1000005 9 8 3 7
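If the two tables are already row-aligned by PARTY_ID (as in the example), a base-R sketch that vectorises the row-wise match without a join; the + 1L compensates for the PARTY_ID column, so the positions match the expected output:
A <- as.matrix(as.data.frame(DF_A)[, -1])               # the PROBS_* columns
B <- as.matrix(as.data.frame(DF_B)[, paste0("V", 1:4)]) # V1..V4
pos <- t(vapply(seq_len(nrow(A)),
                function(i) match(B[i, ], A[i, ]) + 1L,
                integer(4)))
result <- setNames(data.frame(DF_A$PARTY_ID, pos),
                   c("PARTY_ID", paste0("P", 1:4)))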
Here is an example with 1 million rows and two columns. It takes 14 ms on my computer.
# create data tables with matching ids but in different positions
x <- data.table(id = sample(1:1000000, 1000000, replace = FALSE),
                y = sample(LETTERS, 1000000, replace = TRUE))
y <- data.table(id = sample(1:1000000, 1000000, replace = FALSE),
                z = sample(LETTERS, 1000000, replace = TRUE))
# add a column to both data tables to store the row position in x and y
x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)
# set the key in both data tables using the matching column name
setkey(x, "id")
setkey(y, "id")
# merge the data tables into one
z <- merge(x, y)
# now extract, for example, the position in y of the 100th record of x
z[x_row_nr == 100, y_row_nr]
z will contain the matching row records from both datasets with their columns attached.

Sorting values in dataframe by order of values in another dataframe R

I would like to sort values in columns of the xy1 dataframe, based on the increasing order of values in columns of the xy dataframe.
x <- c(3,1,7,45,22,2)
y <- c(23,65,1,23,2,11)
xy <- data.frame(x,y)
x1 <- c(0.34,0.3,0.7,0.22,0.67,0.87)
y1 <- c(0.4,0.13,0.17,0.72,0.61,0.7)
xy1 <- data.frame(x1,y1)
> xy
x y
1 3 23
2 1 65
3 7 1
4 45 23
5 22 2
6 2 11
> xy1
x1 y1
1 0.34 0.40
2 0.30 0.13
3 0.70 0.17
4 0.22 0.72
5 0.67 0.61
6 0.87 0.70
The following is the new data.frame result that I desire. Note that it deals with repeated observations (two identical values in y); x1 and y1 are now sorted according to the order of the values in each column of the xy dataframe.
x1 y1
1 0.30 0.17
2 0.87 0.61
3 0.34 0.70
4 0.70 0.40
5 0.67 0.72
6 0.22 0.13
You can use the order function to get the sorting order of a vector.
x <- c(3,1,7,45,22,2)
y <- c(23,65,1,23,2,11)
xy <- data.frame(x,y)
x1 <- c(0.34,0.3,0.7,0.22,0.67,0.87)
y1 <- c(0.4,0.13,0.17,0.72,0.61,0.7)
xy1 <- data.frame(x1,y1)
result <- data.frame(x1[order(x)], y1[order(y)])
result
This produces
x1.order.x.. y1.order.y..
1 0.30 0.17
2 0.87 0.61
3 0.34 0.70
4 0.70 0.40
5 0.67 0.72
6 0.22 0.13
You can beautify the output by setting the column names in the result:
data.frame(x1=x1[order(x)], y1=y1[order(y)])
Now, if you don't want to type everything in manually but have two data frames with the same dimensions, you can use this one-liner,
sapply(1:ncol(xy1), function(i) {xy1[order(xy[,i]), i]})
which produces
[,1] [,2]
[1,] 0.30 0.17
[2,] 0.87 0.61
[3,] 0.34 0.70
[4,] 0.70 0.40
[5,] 0.67 0.72
[6,] 0.22 0.13
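Note that sapply returns an unnamed matrix; if you want a data.frame with the original column names back, a small wrapper does it (nothing assumed beyond base R):
sorted <- as.data.frame(sapply(seq_len(ncol(xy1)), function(i) xy1[order(xy[, i]), i]))
names(sorted) <- names(xy1)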
As this is based on ordering corresponding columns of both datasets, Map can be used:
xy1[] <- Map(function(x,y) x[order(y)], xy1, xy)
xy1
# x1 y1
#1 0.30 0.17
#2 0.87 0.61
#3 0.34 0.70
#4 0.70 0.40
#5 0.67 0.72
#6 0.22 0.13
Or, another option is to order by the column index (col(xy)) together with the unlisted values of xy:
xy1[] <- as.matrix(xy1)[order(col(xy), unlist(xy))]
xy1
# x1 y1
#1 0.30 0.17
#2 0.87 0.61
#3 0.34 0.70
#4 0.70 0.40
#5 0.67 0.72
#6 0.22 0.13
You could try this:
library(tidyverse)
df_1 <- xy %>%
  bind_cols(xy1) %>%
  arrange(x) %>%
  select(x1)
df_2 <- xy %>%
  bind_cols(xy1) %>%
  arrange(y) %>%
  select(y1)
df <- bind_cols(df_1, df_2)
Which returns:
# A tibble: 6 x 2
x1 y1
<dbl> <dbl>
1 0.30 0.17
2 0.87 0.61
3 0.34 0.70
4 0.70 0.40
5 0.67 0.72
6 0.22 0.13
Basically just arrange x1 and y1 by x and y separately, then combine x1 and y1.
