Populating data from one data.table to another - r

I have a distance matrix (as data.table) showing pairwise distances between a number of items, but not all items are in the matrix. I need to create a larger data.table that has all the missing items populated. I can do this with matrices fairly easily:
items=c("a", "b", "c", "d")
small_matrix=matrix(c(0, 1, 2, 3), nrow=2, ncol=2,
dimnames=list(c("a", "b"), c("a", "b")))
# create zero matrix of the right size
full_matrix <- matrix(0, ncol=length(items), nrow=length(items),
dimnames=list(items, items))
# populate items from the small matrix
full_matrix[rownames(small_matrix), colnames(small_matrix)] <- small_matrix
full_matrix
# a b c d
# a 0 2 0 0
# b 1 3 0 0
# c 0 0 0 0
# d 0 0 0 0
What is the equivalent of that in data.table? I can create an 'id' column in small_DT and use it as the key, but I'm not sure how to overwrite items in full_DT that has the same id/column pair.

Let's convert to data.table and keep the row names as an extra column:
dts = as.data.table(small_matrix, keep.rownames = TRUE)
# rn a b
#1: a 0 2
#2: b 1 3
dtf = as.data.table(full_matrix, keep.rownames = TRUE)
# rn a b c d
#1: a 0 0 0 0
#2: b 0 0 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Now just join on the rows, and assuming small matrix is always a subset you can do the following:
dtf[dts, names(dts) := dts, on = 'rn']
dtf
# rn a b c d
#1: a 0 2 0 0
#2: b 1 3 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
The above assumes data.table version 1.9.5+, where the on= argument is available. Otherwise you'll need to set the key first.
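For versions without on=, the keyed variant mentioned above might look like the following. This is a minimal sketch, not the answerer's code: it assumes both tables keep the row names in a column called rn, and it uses the mget(paste0("i.", cols)) idiom to pull the matching columns from the small table. The name cols is introduced here for illustration only.

```r
library(data.table)

items <- c("a", "b", "c", "d")
small_matrix <- matrix(c(0, 1, 2, 3), nrow = 2, ncol = 2,
                       dimnames = list(c("a", "b"), c("a", "b")))
full_matrix <- matrix(0, nrow = length(items), ncol = length(items),
                      dimnames = list(items, items))

dts <- as.data.table(small_matrix, keep.rownames = TRUE)
dtf <- as.data.table(full_matrix, keep.rownames = TRUE)

# Key both tables on the row-name column so that dtf[dts] joins on it
setkey(dts, rn)
setkey(dtf, rn)

# Update only the value columns; leaving rn alone keeps the key of dtf intact
cols <- setdiff(names(dts), "rn")  # hypothetical helper name
dtf[dts, (cols) := mget(paste0("i.", cols))]
dtf
```

Updating only the value columns, rather than names(dts), avoids overwriting the key column itself.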

Suppose you have these two data.tables:
dt1 = as.data.table(small_matrix)
# a b
#1: 0 2
#2: 1 3
dt2 = as.data.table(full_matrix)
# a b c d
#1: 0 0 0 0
#2: 0 0 0 0
#3: 0 0 0 0
#4: 0 0 0 0
You can't assign into a data.table subset the way you would with a data.frame or matrix, e.g. by doing:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1), with=F] <- dt1
This code will raise an error, because to assign new values you need to use the := operator:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1):=dt1][]
# a b c d
#1: 0 2 0 0
#2: 1 3 0 0
#3: 0 0 0 0
#4: 0 0 0 0
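Because these data.tables carry no row names, the logical selection above only works when the rows of the small matrix appear in the same relative order in the full matrix. A hedged alternative is to compute explicit row positions with match() from the original matrices; rows and cols below are names introduced for this sketch.

```r
library(data.table)

items <- c("a", "b", "c", "d")
small_matrix <- matrix(c(0, 1, 2, 3), nrow = 2, ncol = 2,
                       dimnames = list(c("a", "b"), c("a", "b")))
full_matrix <- matrix(0, nrow = length(items), ncol = length(items),
                      dimnames = list(items, items))

dt1 <- as.data.table(small_matrix)
dt2 <- as.data.table(full_matrix)

# Position of each small-matrix row within the full matrix
rows <- match(rownames(small_matrix), rownames(full_matrix))
cols <- names(dt1)

# Assign by reference: row k of dt1 goes to row rows[k] of dt2
dt2[rows, (cols) := dt1]
dt2
```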

Related

Create an adjacency matrix from unbalanced trade flow data in R

I have a dataset of bilateral trade flows of dimension 84x244.
How can I balance the dataset to look like a 244x244 matrix but keeping the same order and names as the columns?
Non-symmetric matrix
For example the matrix resembles:
A B C D
B 0 0 0 1
D 2 0 0 0
and it should look like
A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
With A B C D as row and column names
Here are two methods that ensure the column names and row names are effectively the same, using a default value of 0 for missing rows/columns. These do not assume that the columns are always full; if this is guaranteed, then you can ignore the column-adding portions.
Both start with:
m <- as.matrix(read.table(header=TRUE, text="
A B C D
B 0 0 0 1
D 2 0 0 0"))
First
needrows <- setdiff(colnames(m), rownames(m))
m <- rbind(m, matrix(0, nrow=length(needrows), ncol=ncol(m), dimnames=list(needrows, colnames(m))))
needcols <- setdiff(rownames(m), colnames(m))
m <- cbind(m, matrix(0, nrow=nrow(m), ncol=length(needcols), dimnames=list(rownames(m), needcols)))
m
# A B C D
# B 0 0 0 1
# D 2 0 0 0
# A 0 0 0 0
# C 0 0 0 0
And to order the rows same as the columns ... note that if there are row names not present in the column names, they will be removed in this, though you can include them with another setdiff if needed.
m[colnames(m),]
# A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Second
allnames <- sort(unique(unlist(dimnames(m))))
m2 <- matrix(0, nrow=length(allnames), ncol=length(allnames),
dimnames=list(allnames, allnames))
m2[intersect(rownames(m), allnames), colnames(m)] <-
m[intersect(rownames(m), allnames), colnames(m)]
m2[rownames(m), intersect(colnames(m), allnames)] <-
m[rownames(m), intersect(colnames(m), allnames)]
m2
# A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Here is a base R solution. The basic idea is that you first construct a square matrix with all zeros, use the column names as both row and column names, and then assign values to the rows according to their row names, i.e.,
M <- `dimnames<-`(matrix(0,nrow = ncol(m),ncol = ncol(m)),
replicate(2,list(colnames(m))))
M[rownames(m),] <- m
such that
> M
A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
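The same idea can be wrapped in a small helper that also copes with row names missing from the columns. This is only a sketch: square_up and its fill argument are hypothetical names, and the example matrix is rebuilt by hand from the question.

```r
# Example matrix from the question: rows B and D, columns A to D
m <- matrix(c(0, 2, 0, 0, 0, 0, 1, 0), nrow = 2,
            dimnames = list(c("B", "D"), c("A", "B", "C", "D")))

# Hypothetical helper: pad a matrix to a square one over all names seen
square_up <- function(m, fill = 0) {
  # Column names first, so the original column order is preserved
  nms <- union(colnames(m), rownames(m))
  out <- matrix(fill, nrow = length(nms), ncol = length(nms),
                dimnames = list(nms, nms))
  # Copy the known cells across by name
  out[rownames(m), colnames(m)] <- m
  out
}

M <- square_up(m)
M
```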

Count the appearance of a string and the belonging result in the rows above

I have a data frame like this:
df <- data.frame(value = c("a","b","b","d","a","b","b","d","a","b","c","d"),
pattern = c("NA","a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"))
The value column indicates the actual behaviour, and the pattern column shows the cumulative behaviour before this action happens.
Now I want to compare each pattern with the 4 patterns above it and count the number of appearances, plus the number of appearances of the corresponding letter in the "value" column, to calculate the expected result.
The result should look like this:
value pattern appearance a b c d exp.result
1 a NA 0 0 0 0 0 <NA>
2 b a 0 0 0 0 0 <NA>
3 b ab 0 0 0 0 0 <NA>
4 d abb 0 0 0 0 0 <NA>
5 a bbd 0 0 0 0 0 <NA>
6 b bda 0 0 0 0 0 <NA>
7 b dab 0 0 0 0 0 <NA>
8 d abb 1 0 0 0 1 d
9 a bbd 1 1 0 0 0 a
10 b bda 1 0 1 0 0 b
11 c dab 1 0 1 0 0 b
12 d abc 0 0 0 0 0 <NA>
I hope somebody can help me with this problem.
You can use this approach :
df <- data.frame(
# value must be a factor so that table() reports all four levels
value = factor(c("a","b","b","d","a","b","b","d","a","b","c","d")),
pattern = c(NA,"a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"))
win <- 4
analyzeWindow <- function(idx){
idxs <- max(1,idx-win):(idx-1)
if(idx == 1) idxs <- integer()
winDF <- df[idxs,]
winDF <- winDF[which(winDF$pattern == df$pattern[idx]),]
expValWeights <- unlist(as.list(table(winDF$value)))
c(appearances=nrow(winDF),expValWeights)
}
newCols <- t(sapply(1:nrow(df),analyzeWindow))
df2 <- cbind(df,newCols)
df2$exp.result <- colnames(newCols)[-1][max.col(newCols[,-1],ties.method='first')]
df2$exp.result[rowSums(newCols[,-1]) == 0] <- NA
> df2
value pattern appearances a b c d exp.result
1 a <NA> 0 0 0 0 0 <NA>
2 b a 0 0 0 0 0 <NA>
3 b ab 0 0 0 0 0 <NA>
4 d abb 0 0 0 0 0 <NA>
5 a bbd 0 0 0 0 0 <NA>
6 b bda 0 0 0 0 0 <NA>
7 b dab 0 0 0 0 0 <NA>
8 d abb 1 0 0 0 1 d
9 a bbd 1 1 0 0 0 a
10 b bda 1 0 1 0 0 b
11 c dab 1 0 1 0 0 b
12 d abc 0 0 0 0 0 <NA>
NOTE:
This code requires the "value" column to be a factor. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so use factor() or as.factor() if it isn't one.
The function rollapply from the package zoo may be helpful.
Define your original data.frame and load package:
library(zoo)
df <- data.frame(value = c("a","b","b","d","a","b",
"b","d","a","b","c","d"),
pattern = c("NA","a","ab","abb","bbd","bda",
"dab","abb","bbd","bda","dab","abc"))
Define a function that will spit out the number of times the fifth element appears in the first four:
f <- function(x) sum(x[5] == x[1:4])
Apply this function using rollapply:
df$appearance <- rollapply(df$pattern, 5, f, align = 'right', fill = NA)
I'm not sure if I'm interpreting your letter columns right, but you could use the same (or similar) function for the individual letters and then split the resulting column into 4 based on the value column.
df$letters <- rollapply(df$value, 5, f, align = 'right', fill = NA)
df$a <- 0
df$a[df$value == 'a'] <- df$letters[df$value == 'a']
It's up to you how to handle the NA values at the start.
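If you would rather have 0 than NA for the first rows, the window count can also be sketched in base R. The function count_prev below is a hypothetical helper, not part of zoo; it counts how often the current element occurs among the (up to) win elements before it.

```r
pattern <- c(NA, "a", "ab", "abb", "bbd", "bda",
             "dab", "abb", "bbd", "bda", "dab", "abc")

# Hypothetical helper: occurrences of x[i] among the previous `win` elements
count_prev <- function(x, win = 4) {
  vapply(seq_along(x), function(i) {
    prev <- tail(x[seq_len(i - 1)], win)  # shorter window near the start
    sum(prev == x[i], na.rm = TRUE)       # NA comparisons count as no match
  }, integer(1))
}

appearance <- count_prev(pattern, win = 4)
appearance
# 0 0 0 0 0 0 0 1 1 1 1 0
```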
If I may take a guess, it looks like you are working with DNA codons. On the off chance you haven't done so already, you may want to take a look at existing packages. Bioconductor in particular has a number of useful ones for working with biological data.

how to select subset only by [] in r?

a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to convert the elements of q1 and q2 that are != 1 to 0, and I want to use only []. I believe all subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
but it doesn't work, what's the problem?
We create a numeric index of the column names that start with 'q' followed by numbers ('nm1'), use that to subset the columns in 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of class character by using stringsAsFactors = FALSE in the data.frame() call.
The above replacement is based on a logical matrix (a[nm1]!=1) which may create memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace with 0
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
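As for why the original attempt does not work: grep() returns column positions, so comparing its result with 1 tests the positions, not the cell values. A short sketch illustrates this; col_idx is a name introduced here.

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

col_idx <- grep("q\\d", colnames(a))
col_idx        # 1 2: positions of q1 and q2
col_idx != 1   # FALSE TRUE: says "column 2 is not column 1", nothing about values

# Comparing the values themselves gives a logical matrix usable inside []
a[col_idx][a[col_idx] != 1] <- 0
head(a, 4)
```

Used as a row index, FALSE TRUE is recycled and selects every second row, which is why the attempt zeroes the wrong cells.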
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
data.table approach:
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
a[,grep("^q", colnames(a), value = T):=lapply(a[,grep("^q", colnames(a), value = T), with = F], function(x) ifelse(x == 1, 1, 0))]
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C

R - Create matrix from 3 raw vector

I have 3 vectors as the following:
A <- c("A", "B", "C", "D", "E")
B <- c("1/1/1", "1/1/1", "2/1/1", "2/1/1", "3/1/1")
C <- c(1, 1, -1, 1, -1)
and I want to create a matrix like the following using these 3 vectors:
- 1/1/1 2/1/1 3/1/1
A 1 0 0
B 1 0 0
C 0 -1 0
D 0 1 0
E 0 0 -1
where vector A and B are rows and columns respectively and I have the data as C.
Any help would be appreciated.
Use ?xtabs
xtabs(C ~ A+B)
# B
#A 1/1/1 2/1/1 3/1/1
# A 1 0 0
# B 1 0 0
# C 0 -1 0
# D 0 1 0
# E 0 0 -1
You can try:
`[<-`(array(0,c(length(unique(A)),length(unique(B))),
list(unique(A),unique(B))),
cbind(A,B),C)
# 1/1/1 2/1/1 3/1/1
#A 1 0 0
#B 1 0 0
#C 0 -1 0
#D 0 1 0
#E 0 0 -1
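The one-liner above can be unrolled into two steps, which may be easier to read. This sketch relies on base R matrix indexing with a two-column character matrix, where each row of cbind(A, B) names one (row, column) cell.

```r
A <- c("A", "B", "C", "D", "E")
B <- c("1/1/1", "1/1/1", "2/1/1", "2/1/1", "3/1/1")
C <- c(1, 1, -1, 1, -1)

# Zero matrix with one row per unique A and one column per unique B
m <- matrix(0, nrow = length(unique(A)), ncol = length(unique(B)),
            dimnames = list(unique(A), unique(B)))

# Fill the cells named by each (A, B) pair with the matching C value
m[cbind(A, B)] <- C
m
```

Unlike xtabs(), which sums duplicates, this keeps only the last value if an (A, B) pair occurs more than once.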
Another option is acast from reshape2 after creating a data.frame
library(reshape2)
acast(data.frame(A, B, C), A~B, value.var = "C", fill =0)
# 1/1/1 2/1/1 3/1/1
#A 1 0 0
#B 1 0 0
#C 0 -1 0
#D 0 1 0
#E 0 0 -1

Vectorizing a for-loop that merges two data frames by column

Suppose I have two dataframes df1 and df2.
df1 <- data.frame(matrix(c(0,0,1,0,0,1,1,1,0,1),ncol=10,nrow=1))
colnames(df1) <- LETTERS[seq(1,10)]
df2 <- data.frame(matrix(c(1,1,1,1),ncol=4,nrow=1))
colnames(df2) <- c("C","D","A","I")
Some of the column names in df2 match column names in df1 and df1 always contains every possible column name that can occur in df2. I want to append df1 with a new row which holds the value of df2 for matching columns and a 0 for non-matching columns. My current approach uses a for-loop:
for(i in 1:ncol(df1)){
if(colnames(df1)[i] %in% colnames(df2)){
df1[2,i] <- df2[1,which(colnames(df2)==colnames(df1)[i])]
} else {
df1[2,i] <- 0
}
}
Well, it works. But I wonder if there is a cleaner (and faster) solution for this task, perhaps taking advantage of vectorized operations.
res <-merge(df1,df2,all=T)[,colnames(df1)]
res[is.na(res)] <- 0
res
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 0 1 1 0 0 0 0 1 0
Possibly more efficient would be bind_rows from "dplyr" (the replacement for the deprecated rbind_list and rbind_all):
library(dplyr)
bind_rows(df1, df2)
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 NA 1 1 NA NA NA NA 1 NA
Assign to "res" and replace NA with 0 in the same way as in the merge answer above.
Just using assignment:
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 1 1 0 0 0 0 1 0
...and just to prove it works with other values:
df2$C <- 8
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 8 1 0 0 0 0 1 0
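If you prefer to build the new row without modifying df1 in place, a match()-based variant is one option. This is a sketch under the question's assumption that every column of df2 also exists in df1; pos, vals and new_row are names introduced here.

```r
df1 <- data.frame(matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1), ncol = 10, nrow = 1))
colnames(df1) <- LETTERS[seq(1, 10)]
df2 <- data.frame(matrix(c(1, 1, 1, 1), ncol = 4, nrow = 1))
colnames(df2) <- c("C", "D", "A", "I")

# For each df1 column, its position in df2 (NA when df2 lacks the column)
pos <- match(names(df1), names(df2))

# Take the df2 value where a match exists, 0 otherwise
vals <- ifelse(is.na(pos), 0, unlist(df2[1, ])[pos])

# Turn the values into a one-row data frame and append it
new_row <- as.data.frame(setNames(as.list(vals), names(df1)))
res <- rbind(df1, new_row)
res
```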
