Inner join with multiple conditions in R data.table

I am trying to do an inner join using data.table with multiple, fairly dynamic conditions, and I am getting tripped up on the syntax. First, I create the two objects, x and x2, that I want to inner join.
library(data.table)
set.seed(1)
# generate data
x = data.table(CJ(t=1:10, d=1:3, p1s=seq(1,3,by=0.1), p1sLAST=seq(1,3,by=0.1)))
x[d==1, p1sLAST:=3]
x = x[p1s<=p1sLAST]
x2 = data.table(CJ(tprime=1:10, p1sLASTprm=seq(1,3,by=0.1)))
With the objects:
> x
t d p1s p1sLAST
1: 1 1 1.0 3.0
2: 1 1 1.0 3.0
3: 1 1 1.0 3.0
4: 1 1 1.0 3.0
5: 1 1 1.0 3.0
---
9026: 10 3 2.8 2.9
9027: 10 3 2.8 3.0
9028: 10 3 2.9 2.9
9029: 10 3 2.9 3.0
9030: 10 3 3.0 3.0
> x2
tprime p1sLASTprm
1: 1 1.0
2: 1 1.1
3: 1 1.2
4: 1 1.3
5: 1 1.4
---
206: 10 2.6
207: 10 2.7
208: 10 2.8
209: 10 2.9
210: 10 3.0
Now, I want to do the following three steps in a single inner join.
joined = x[, x2[], by=names(x)]
joined = joined[(p1sLASTprm==p1s & d!=3) | (d==3 & p1sLASTprm==3)]
joined = joined[tprime==t+1]
Resulting in the final output:
> joined
t d p1s p1sLAST tprime p1sLASTprm
1: 1 1 1.0 3.0 2 1.0
2: 1 1 1.1 3.0 2 1.1
3: 1 1 1.2 3.0 2 1.2
4: 1 1 1.3 3.0 2 1.3
5: 1 1 1.4 3.0 2 1.4
---
4343: 9 3 2.8 2.9 10 3.0
4344: 9 3 2.8 3.0 10 3.0
4345: 9 3 2.9 2.9 10 3.0
4346: 9 3 2.9 3.0 10 3.0
4347: 9 3 3.0 3.0 10 3.0

I do not think a single inner join can accomplish those three steps: the filter contains a |, so a union of two join results will most likely be required.
A more memory-efficient approach could be:
ux <- unique(x)[, upt := t + 1]
rbindlist(list(
  # d != 3: match on p1s == p1sLASTprm and t + 1 == tprime
  ux[d != 3][x2,
             c(mget(names(ux)), mget(names(x2))),
             on = c("p1s" = "p1sLASTprm", "upt" = "tprime"),
             nomatch = 0L],
  # d == 3: keep only rows of x2 with p1sLASTprm == 3 and match on t + 1 == tprime
  ux[d == 3][x2[p1sLASTprm == 3],
             c(mget(names(ux)), mget(names(x2))),
             on = c("upt" = "tprime"),
             nomatch = 0L]
))

Related

Organize data to match quantities of table A to the quantities of table B, and vice versa

I have these two tables. I want each row of the two tables to match in quantity, with the difference carried to a new row below that matches against the next row.
Current Data:
Quant Symbol Price
20    B      1.5
15    B      1.8
31    B      1.9
14    B      2.2
20    B      2.3
10    B      2.5
and
Quant Symbol Price
20    S      2.6
10    S      2
35    S      1.8
15    S      1.6
10    S      1.5
I would like them to turn into the following tables:
Quant Symbol Price
20    B      1.5
10    B      1.8
5     B      1.8
30    B      1.9
1     B      1.9
14    B      2.2
10    B      2.3
10    B      2.3
and
Quant Symbol Price
20    S      2.6
10    S      2
5     S      1.8
30    S      1.8
1     S      1.6
14    S      1.6
10    S      1.5
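One way to approach this (a hedged sketch of my own, not an accepted answer): walk both tables row by row, emit the smaller of the two current quantities at each step, and carry the remainder of the larger row forward to be matched against the next row. The function name match_quantities and the inputs buys and sells are assumptions; rows left over once one table is exhausted are not emitted here.
# Sketch only: assumes `buys` and `sells` are data.frames with columns
# Quant, Symbol and Price, shaped like the tables above.
match_quantities <- function(buys, sells) {
  out_b <- list(); out_s <- list()
  i <- 1; j <- 1
  qb <- buys$Quant[1]; qs <- sells$Quant[1]
  while (i <= nrow(buys) && j <= nrow(sells)) {
    q <- min(qb, qs)                              # matched quantity for this step
    out_b[[length(out_b) + 1]] <- data.frame(Quant = q, Symbol = buys$Symbol[i],  Price = buys$Price[i])
    out_s[[length(out_s) + 1]] <- data.frame(Quant = q, Symbol = sells$Symbol[j], Price = sells$Price[j])
    qb <- qb - q; qs <- qs - q                    # carry the remainder forward
    if (qb == 0) { i <- i + 1; if (i <= nrow(buys))  qb <- buys$Quant[i] }
    if (qs == 0) { j <- j + 1; if (j <= nrow(sells)) qs <- sells$Quant[j] }
  }
  list(buys = do.call(rbind, out_b), sells = do.call(rbind, out_s))
}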

Setting cluster id in MClust

When I cluster a dataset using Mclust, I use the following code:
library(mclust)
i = 2
print(paste("Number of clusters =", i))
cluster_model1 <- Mclust(cc[2:6], G=i)
When I repeat the clustering, the cluster classification (id) in each iteration can stay the same or flip from 1 to 2 and vice versa. Is it possible to fix the cluster ids so that they do not change arbitrarily? I want to count how many times data from 10 imputed datasets belong to cluster 1 or cluster 2, and I can only do that if the cluster ids stay the same.
The dataset cc has this data
head(cc[2:6])
ea pa sa en pn
1 1.0 1.0 1.0 2.2 1.6
2 3.2 2.4 1.0 3.2 1.8
3 1.2 1.0 1.0 2.0 1.0
4 1.6 1.2 1.2 1.0 1.2
5 3.6 1.0 1.6 4.0 2.6
6 1.6 1.0 1.4 1.4 1.2
When I cluster, the classification could be
head(cluster_model1$classification)
[1] 2 1 2 1 1 1
or
head(cluster_model1$classification)
[1] 1 2 1 2 2 2
While the clustering results are correct, is it possible to make the labels come out as 2 1 2 1 1 1 every time the clustering is done?
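One common workaround (a hedged sketch of my own, not something from this thread): relabel the clusters after fitting with a deterministic, data-driven rule, for example ordering them by the cluster mean of one of the clustering variables. The choice of ea as the ordering variable is an assumption; any stable per-cluster summary would do.
# Sketch only: relabel clusters so the ids no longer depend on the
# arbitrary order Mclust happens to return them in.
library(mclust)
cluster_model1 <- Mclust(cc[2:6], G = 2)
cl <- cluster_model1$classification

# order the original labels by their cluster mean of `ea` (assumption),
# then map every observation onto the reordered labels
ord <- order(tapply(cc$ea, cl, mean))
relabelled <- match(cl, ord)
head(relabelled)
With this, the cluster whose mean ea is smallest is always labelled 1, so counts across the 10 imputed datasets become comparable.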

R: convert data frame values from integer to double based on the column index

I have a data frame where each column corresponds to a variable and each cell holds a numerical category, e.g. 0, 1 or 2.
df <- data.frame(TKI=c(1,1,2,0,1),
                 Chemo=c(1,2,2,0,1),
                 Radio=c(1,1,2,0,1),
                 EGFR=c(1,2,2,0,1),
                 ALK=c(1,1,2,0,1))
df
TKI Chemo Radio EGFR ALK
1 1 1 1 1 1
2 1 2 1 2 1
3 2 2 2 2 2
4 0 0 0 0 0
5 1 1 1 1 1
I would like to convert each value to a double based on the column index. For example, the table above would be transformed into this:
1.1 2.1 3.1 4.1 5.1
1.1 2.2 3.1 4.2 5.1
1.2 2.2 3.2 4.2 5.2
1.0 2.0 3.0 4.0 5.0
1.1 2.1 3.1 4.1 5.1
I would also like to obtain a list of the columns and their indices, like this:
1 - TKI
2 - Chemo
3 - Radio
4 - EGFR
5 - ALK
How can I do this conversion?
Thank you for your time and help!
You can use col() to get the column index, divide the data frame by 10 to create the decimal part, and add the two together:
col(df) + df/10
# TKI Chemo Radio EGFR ALK
#1 1.1 2.1 3.1 4.1 5.1
#2 1.1 2.2 3.1 4.2 5.1
#3 1.2 2.2 3.2 4.2 5.2
#4 1.0 2.0 3.0 4.0 5.0
#5 1.1 2.1 3.1 4.1 5.1
To get the column names and their positions you can do:
ref_df <- data.frame(index = seq_along(df),
                     names = names(df))
# index names
#1 1 TKI
#2 2 Chemo
#3 3 Radio
#4 4 EGFR
#5 5 ALK
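If a plain named vector is enough, the same lookup can be written as a one-liner (my addition, not part of the original answer):
# index lookup as a named vector instead of a data.frame
setNames(seq_along(df), names(df))
#  TKI Chemo Radio  EGFR   ALK
#    1     2     3     4     5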

Creating new column names from existing column names using paste function

Assume I have a data frame df with variables A, B and C in it. I would like to create 3 more corresponding columns named A_ranked, B_ranked and C_ranked. For the sake of this question it doesn't matter how I fill them, so let's assume I set them all to 5. I tried the following code:
for (i in 1:length(df)){
df%>%mutate(
paste(colnames(df)[i],"ranked", sep="_")) = 5
}
I also tried:
for (i in 1:length(df)){
df%>%mutate(
as.vector(paste(colnames(df)[i],"ranked", sep="_")) = 5
}
And:
for (i in 1:length(df)){
df$paste(colnames(df)[i],"ranked", sep="_")) = 5
}
None of them seems to work. Can somebody please tell me the correct way to do this?
Here is a data.table option using the iris data set (here we create 4 more columns based on colnames of existing columns).
# data
df <- iris[, 1:4]
str(df)
# new columns
library(data.table)
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := 5][]
# output
Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_ranked
1: 5.1 3.5 1.4 0.2 5
2: 4.9 3.0 1.4 0.2 5
3: 4.7 3.2 1.3 0.2 5
4: 4.6 3.1 1.5 0.2 5
5: 5.0 3.6 1.4 0.2 5
---
146: 6.7 3.0 5.2 2.3 5
147: 6.3 2.5 5.0 1.9 5
148: 6.5 3.0 5.2 2.0 5
149: 6.2 3.4 5.4 2.3 5
150: 5.9 3.0 5.1 1.8 5
Sepal.Width_ranked Petal.Length_ranked Petal.Width_ranked
1: 5 5 5
2: 5 5 5
3: 5 5 5
4: 5 5 5
5: 5 5 5
---
146: 5 5 5
147: 5 5 5
148: 5 5 5
149: 5 5 5
150: 5 5 5
# If you want to fill new columns with different values you can try something like
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := list(Sepal.Length/2,
                                                             Sepal.Width/2,
                                                             Petal.Length/2,
                                                             Petal.Width/2)][]
This should work:
df[paste(names(df), "ranked", sep = "_")] <- 5
df
# A B C A_ranked B_ranked C_ranked
# 1 1 2 3 5 5 5
Data:
df <- data.frame(A = 1, B = 2, C = 3)
Does this help? (mutate_each() is superseded; across() does the same thing.)
library(dplyr)
dat <- data.frame(A = 5, B = 5, C = 5)
dat %>%
  mutate(across(everything(), sum, .names = "{.col}_ranked")) %>%
  head()

How to remove first and last N rows in a dataframe using R?

I have a dataframe df of 2M+ rows that looks like this:
LOCUS POS COUNT
1: CP007539.1 1 4
2: CP007539.1 2 7
3: CP007539.1 3 10
4: CP007539.1 4 15
5: CP007539.1 5 21
6: CP007539.1 6 28
Currently I am using this in order to remove the first and last 1000 rows:
> df_adj = df[head(df$POS, -1000), ]
> df_adj = df_adj[tail(df_adj$POS, -1000), ]
But even I can see that this can be done in a better way. Suggestions?
You can do this by specifying the range of rows you want to keep in the final dataset:
df_adj <- df[1001:(nrow(df) - 1000), ]
Just make sure you have enough rows before subsetting. A safer approach might be:
df_adj <- if (nrow(df) > 2000) df[1001:(nrow(df) - 1000), ] else df
Another way would be to combine seq_len with nrow and range, which drops just the first and last row (i.e. N = 1), shown here on the first 10 rows of iris:
df <- head(iris, 10)
df[-range(seq_len(nrow(df))), ]
which will generate the output below:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
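For arbitrary N, a compact base R alternative (my addition, not from the answers above) is to chain head() and tail() with negative lengths:
# drop the first and last n rows; assumes df has more than 2 * n rows
n <- 1000
df_adj <- head(tail(df, -n), -n)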
