I was wondering if you can help me build an adjacency matrix. I have data in CSV format like this:
Paper_ID Author
2 Foster-McGregor, N.
3 Van Houte, M.
4 van de Meerendonk, A.
5 Farla, K.
6 van Houte, M.
6 Siegel, M.
8 Farla, K.
11 Farla, K.
11 Verspagen, B.
As you can see, the column "Paper_ID" has a repeated value of 11, meaning that "Farla, K." and "Verspagen, B." are coauthors of a publication. I need to build a square weighted matrix from the author names, counting the number of times each pair has collaborated.
Does the following do what you are looking for?
# simulate data.
d <- data.frame(
id=c(2,3,4,5,6,6,8,11,11,12,12),
author=c("FN", "VM","VA","FK","VM","SM","FK","FK","VB","FK","VB")
)
d
id author
1 2 FN
2 3 VM
3 4 VA
4 5 FK
5 6 VM
6 6 SM
7 8 FK
8 11 FK
9 11 VB
10 12 FK
11 12 VB
# create incidence matrix:
m <- xtabs(~author+id,d)
m
id
author 2 3 4 5 6 8 11 12
FK 0 0 0 1 0 1 1 1
FN 1 0 0 0 0 0 0 0
SM 0 0 0 0 1 0 0 0
VA 0 0 1 0 0 0 0 0
VB 0 0 0 0 0 0 1 1
VM 0 1 0 0 1 0 0 0
# convert to adjacency matrix.
# tcrossprod does "m %*% t(m)"
tcrossprod(m)
author
author FK FN SM VA VB VM
FK 4 0 0 0 2 0
FN 0 1 0 0 0 0
SM 0 0 1 0 0 1
VA 0 0 0 1 0 0
VB 2 0 0 0 2 0
VM 0 0 1 0 0 2
Note that crossprod() will give you the corresponding adjacency matrix for the id variable (i.e. it computes t(m) %*% m, linking papers that share authors).
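For instance, a quick sketch using the m from above (the commented matrix is what I would expect, where entry (i, j) counts the authors papers i and j share):
crossprod(m)
#     id
# id    2 3 4 5 6 8 11 12
#   2   1 0 0 0 0 0  0  0
#   3   0 1 0 0 1 0  0  0
#   4   0 0 1 0 0 0  0  0
#   5   0 0 0 1 0 1  1  1
#   6   0 1 0 0 2 0  0  0
#   8   0 0 0 1 0 1  1  1
#   11  0 0 0 1 0 1  2  2
#   12  0 0 0 1 0 1  2  2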
I have a dataset (STATPOP2016 by the Swiss Federal Statistical Office) that contains the number of households of different sizes for each hectare of Swiss territory. In other words, for each hectare i I have:
x1 households consisting of one individual
x2 households consisting of two individuals
...
x6 households with 6 or more individuals (I treat them as having 6 people for simplicity).
I need to create a variable that shows the interquartile range of household size for each hectare. I have code that works, but it is very slow. Is there a smarter way to do the same thing?
Here is my code:
library(Hmisc) # for wtd.quantile
# Vector that contains all possible sizes of households
vector_hh_size <- 1:6
# Variable for the interquartile range of household size. A is my data frame
A$hh_size_IQR <- 0
# Vector that contains the frequency of each household size in a given hectare
vector_hh_frequency <- c(0,0,0,0,0,0)
for (i in 1:NROW(A)) {
for (j in 1:6){
vector_hh_frequency[j] <- eval(parse(text = paste("A$hh",j,"[",i,"]",sep = "")))
}
A$hh_size_IQR[i] <- wtd.quantile(vector_hh_size, weights = vector_hh_frequency)[4] - wtd.quantile(vector_hh_size, weights = vector_hh_frequency)[2]
}
Here is an example of the data:
hh1 hh2 hh3 hh4 hh5 hh6 IQR
1 0 3 0 0 0 0 0
2 0 3 0 0 0 0 0
3 0 0 3 0 0 0 0
4 0 3 0 0 0 0 0
5 3 6 3 3 0 0 1
6 0 3 0 0 3 0 3
7 11 7 4 7 3 0 3
8 3 3 0 3 0 0 3
9 3 3 0 3 0 0 3
10 0 3 0 0 0 0 0
# OBS is the observation number, hhi shows how many households with i people there are, and IQR is the interquartile range for each observation (the variable I am building).
Here is a shorter version of your code:
library("Hmisc")
A <- read.table(header=TRUE, text=
" hh1 hh2 hh3 hh4 hh5 hh6
1 0 3 0 0 0 0
2 0 3 0 0 0 0
3 0 0 3 0 0 0
4 0 3 0 0 0 0
5 3 6 3 3 0 0
6 0 3 0 0 3 0
7 11 7 4 7 3 0
8 3 3 0 3 0 0
9 3 3 0 3 0 0
10 0 3 0 0 0 0")
vector_hh_size <- 1:ncol(A)
myIQR <- function(Ai) wtd.quantile(vector_hh_size, weights=Ai)[4] - wtd.quantile(vector_hh_size, weights=Ai)[2]
A$IQR <- apply(A, 1, myIQR)
# > A
# hh1 hh2 hh3 hh4 hh5 hh6 IQR
# 1 0 3 0 0 0 0 0
# 2 0 3 0 0 0 0 0
# 3 0 0 3 0 0 0 0
# 4 0 3 0 0 0 0 0
# 5 3 6 3 3 0 0 1
# 6 0 3 0 0 3 0 3
# 7 11 7 4 7 3 0 3
# 8 3 3 0 3 0 0 3
# 9 3 3 0 3 0 0 3
# 10 0 3 0 0 0 0 0
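If apply() over wtd.quantile is still too slow on the full grid, here is a vectorized base-R sketch. It assumes a step-function (type 1) weighted quantile, which reproduces the IQR column above for this sample but can differ from wtd.quantile's default estimator for some weight patterns:
# Row-wise cumulative weight proportions; the p-quantile is then the first
# household size whose cumulative proportion reaches p.
w <- as.matrix(A[, paste0("hh", 1:6)])
cp <- t(apply(w, 1, cumsum)) / rowSums(w) # rows with no households give NaN
qp <- function(p) rowSums(cp < p) + 1
A$IQR2 <- qp(0.75) - qp(0.25)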
I would like to say thank you in advance to anyone who looks at my question and shares their thoughts and experiences. I am trying to run a quadratic assignment procedure (QAP) on correlations of behaviors within a community of five individuals. I have ten matrices that represent frequencies of behavior between individuals, and I calculated correlations (Pearson's r) between pairs of matrices. For example, I found the correlation between matrix 1 and matrix 2, matrix 2 and matrix 3, matrix 3 and matrix 4, and so on.
I then wanted to assess the significance of these correlations using the qaptest function from the R package sna. As per the R documentation on qaptest, I placed all of my matrices into an array. I then calculated the QAP p-value between pairs of matrices (matrix 1 and matrix 2, matrix 2 and matrix 3, etc.). However, I noticed that if I changed the number of matrices in the array (for example, if I only placed the first five into the array), the QAP p-values for the first set of matrices changed dramatically. Based on my understanding of arrays and QAP, this should not happen, because the removed matrices have nothing to do with running a QAP test on matrix 1 and matrix 2.
Has anyone else run into this problem before? I have included my matrices and my script below.
Here are my matrices in list format (in the code below, this is the step where I create filelist1; the second half of the code only uses matrices 1-5):
[[1]]
1 2 3 4 5
1 1 0 0 0 0
2 5 0 3 5 0
3 0 0 0 0 0
4 0 0 0 0 0
5 2 0 1 0 0
[[2]]
1 2 3 4 5
1 0 0 1 0 0
2 3 6 10 1 2
3 0 0 0 0 0
4 0 5 0 0 0
5 0 0 5 0 0
[[3]]
1 2 3 4 5
1 0 1 0 0 0
2 2 0 5 7 0
3 0 0 0 0 3
4 1 0 0 0 0
5 1 2 2 3 0
[[4]]
1 2 3 4 5
1 0 6 0 0 2
2 2 0 8 5 0
3 0 5 0 0 0
4 1 0 0 0 0
5 0 0 1 3 2
[[5]]
1 2 3 4 5
1 0 0 0 0 0
2 1 0 2 5 1
3 0 0 0 0 0
4 1 2 3 0 1
5 0 3 3 1 0
[[6]]
1 2 3 4 5
1 0 0 0 0 0
2 2 0 3 0 3
3 0 0 0 0 0
4 1 0 4 0 0
5 1 5 7 0 0
[[7]]
1 2 3 4 5
1 0 0 0 0 0
2 2 0 6 0 3
3 0 0 0 0 0
4 6 0 4 0 0
5 1 0 2 0 0
[[8]]
1 2 3 4 5
1 0 0 0 1 0
2 2 0 1 6 0
3 0 0 0 0 0
4 0 0 0 0 0
5 6 0 2 2 0
[[9]]
1 2 3 4 5
1 0 0 0 0 0
2 0 0 2 3 2
3 0 0 0 0 0
4 0 0 0 0 0
5 1 0 2 0 0
[[10]]
1 2 3 4 5
1 0 0 0 0 0
2 1 0 1 1 0
3 0 0 0 0 0
4 0 0 0 0 0
5 6 0 1 2 0
This is my R script:
# read in all ten of the matrices
a<-read.csv("test1.csv")
b<-read.csv("test2.csv")
c<-read.csv("test3.csv")
d<-read.csv("test4.csv")
e<-read.csv("test5.csv")
f<-read.csv("test6.csv")
g<-read.csv("test7.csv")
h<-read.csv("test8.csv")
i<-read.csv("test9.csv")
j<-read.csv("test10.csv")
filelist<-list(a,b,c,d,e,f,g,h,i,j) #place files in a list
filelist1<-lapply(filelist,function(x){
x<-x[1:5, 2:6] #choose only columns in the matrix
colnames(x)<-1:5 #rename columns according to identity
x<-as.matrix(x) #make a matrix
return(x)
})
ee<-array(dim=c(5,5,10)) #create an empty array
array<-function(files) {
names(files) <- c("c1","c2","c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10") #name the matrices
invisible(lapply(names(files), function(x) assign(x,files[[x]],envir=.GlobalEnv))) #place the matrices in a global environment
ee[,,1]<-c(c1) #place each matrix in order into the array
ee[,,2]<-c(c2)
ee[,,3]<-c(c3)
ee[,,4]<-c(c4)
ee[,,5]<-c(c5)
ee[,,6]<-c(c6)
ee[,,7]<-c(c7)
ee[,,8]<-c(c8)
ee[,,9]<-c(c9)
ee[,,10]<-c(c10)
return(ee) #return the completely filled in array
}
a.array<-array(filelist1) # apply the function to the list of matrices
q1.2<-qaptest(a.array,gcor,g1=1,g2=2) #run the qaptest function
#a.array is the array with the matrices,gcor tells the function that we want a correlation
#g1=1 and g2=2 indicates that the qap analysis should be run between the first and second matrices in the array.
summary.qaptest(q1.2) #provides a summary of the qap results
#in this case, the p-value is roughly: p(f(perm) >= f(d)): 0.176
############ If I take out the last five matrices, the q1.2 p-value changes dramatically
#first clear the memory or R will not create another blank array
rm(list = ls())
a<-read.csv("test1.csv") #read in all five files
b<-read.csv("test2.csv")
c<-read.csv("test3.csv")
d<-read.csv("test4.csv")
e<-read.csv("test5.csv")
filelist<-list(a,b,c,d,e) #create a list of the files
filelist1<-lapply(filelist,function(x){
x<-x[1:5, 2:6] #include only the matrix
colnames(x)<-1:5 #rename the columns
x<-as.matrix(x) #make it a matrix
return(x)
})
ee<-array(dim=c(5,5,5)) #this time the array only has five slots
array<-function(files) {
names(files) <- c("c1","c2","c3", "c4", "c5")
invisible(lapply(names(files), function(x) assign(x,files[[x]],envir=.GlobalEnv)))
ee[,,1]<-c(c1)
ee[,,2]<-c(c2)
ee[,,3]<-c(c3)
ee[,,4]<-c(c4)
ee[,,5]<-c(c5)
return(ee)
}
a.array<-array(filelist1)
q1.2<-qaptest(a.array,gcor,g1=1,g2=2)
#in this case, the p-value is roughly: p(f(perm) >= f(d)): 0.804
summary.qaptest(q1.2)
I cannot think of a reason why the p-values would be so different when I am analyzing the exact same pair of matrices. The only difference is the number of additional matrices placed in the array. Has anyone else experienced this issue?
Thank you!
qaptest() reads graphs from the first dimension of the array, not the last. So ee[,,1]<-c(c1) (etc.) should read ee[1,,]<-c(c1) (etc.). When you place all the graphs in the first dimension, the qaptests should yield identical results. Personally, I prefer using list() instead of array() with qaptest.
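A minimal sketch of the fix, reusing filelist1 from your script (note that naming your helper function array also masks base::array, which is why the workspace had to be cleared between runs; the version below avoids both problems):
# Stack the graphs along the FIRST dimension: dim = c(graphs, n, n)
ee <- array(dim = c(10, 5, 5))
for (k in seq_along(filelist1)) ee[k, , ] <- filelist1[[k]]
q1.2 <- qaptest(ee, gcor, g1 = 1, g2 = 2)
# Or skip the array entirely; qaptest() also accepts a list of matrices:
q1.2 <- qaptest(filelist1, gcor, g1 = 1, g2 = 2)
summary(q1.2)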
Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a <- subset(Test, Total != 0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation, i.e., convert data frame a back into the original form of Test?
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant: the ID numbers are just row numbers created by R, and in a real example the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and final values of Pos, the first and last positions in the restored data frame will correspond to the values present in the "compressed" data frame a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
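An equivalent base-R sketch that avoids merge(), using match() to drop the nonzero rows into the full range of positions (same assumption as above: the range is taken from a unless you supply a wider one):
rng <- a$Pos[1]:a$Pos[nrow(a)] # change range if required
restored <- a[match(rng, a$Pos), ] # positions absent from a become all-NA rows
restored$Pos <- rng
restored[is.na(restored)] <- 0
rownames(restored) <- NULL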
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
All_Pos %>%
left_join(Nonzero) %>%
mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate(across(...)) call converts the NA values introduced by the join to zeros. If you only know the largest position MaxPos, you can construct it using
All_Pos <- data.frame(Pos = seq_len(MaxPos))
(the column must be named Pos so that left_join() can match on it).
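For what it's worth, here is a shorter sketch of the same reconstruction with tidyr (an assumption on my part that tidyr is available; full_seq() builds the complete run of positions and complete() fills the gaps with zeros):
library(tidyr)
Reconstruct2 <- Nonzero %>%
complete(Pos = full_seq(Pos, period = 1), fill = list(Watson = 0, Crick = 0, Total = 0))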
Hi, I have a question regarding a matrix operation in R. I have a data set like the one below:
Sample Data:
d <- data.frame(id=c(2,3,4,5,6,6,8,11,11,11,12,12,12),author=c("FN","VM","VA","FK","VM","SM","FK","FK","VB","VA","FK","VB","VA"))
d
id author
1 2 FN
2 3 VM
3 4 VA
4 5 FK
5 6 VM
6 6 SM
7 8 FK
8 11 FK
9 11 VB
10 11 VA
11 12 FK
12 12 VB
13 12 VA
1) Create an incidence matrix:
> m <- xtabs(~author+id,d)
> m
id
author 2 3 4 5 6 8 11 12
FK 0 0 0 1 0 1 1 1
FN 1 0 0 0 0 0 0 0
SM 0 0 0 0 1 0 0 0
VA 0 0 1 0 0 0 1 1
VB 0 0 0 0 0 0 1 1
VM 0 1 0 0 1 0 0 0
What I want to do is generate all pair combinations from the author list and multiply the corresponding rows element-wise. For instance, for the pair FK-VA, the corresponding rows in the incidence matrix are:
FK 0 0 0 1 0 1 1 1
VA 0 0 1 0 0 0 1 1
The expected outcome is the element-wise product of the two rows:
FK-VA (0*0),(0*0),(0*1),(1*0),(0*0),(1*0),(1*1),(1*1)
FK-VA 0 0 0 0 0 0 1 1
2) The expected outcome would be this matrix:
FK FN 0 0 0 0 0 0 0 0
FK SM 0 0 0 0 0 0 0 0
FK VA 0 0 0 0 0 0 1 1
FK VB 0 0 0 0 0 0 1 1
FK VM 0 0 0 0 0 0 0 0
FN SM 0 0 0 0 0 0 0 0
FN VA 0 0 0 0 0 0 0 0
FN VB 0 0 0 0 0 0 0 0
FN VM 0 0 0 0 0 0 0 0
SM VA 0 0 0 0 0 0 0 0
SM VB 0 0 0 0 0 0 0 0
SM VM 0 0 0 0 1 0 0 0
VA VB 0 0 0 0 0 0 1 1
VA VM 0 0 0 0 0 0 0 0
VB VM 0 0 0 0 0 0 0 0
3) Delete the empty rows.
As you can see, I need help with steps 2 and 3.
Thank you
Mario
Maybe this helps:
indx <- combn(dimnames(m)$author,2)
res <- cbind(t(indx), as.data.frame(do.call(rbind,
lapply(split(indx, col(indx)), function(x) m[x[1],]*m[x[2],]))))
colnames(res)[1:2] <- paste0('author', 1:2)
head(res,3)
# author1 author2 2 3 4 5 6 8 11 12
#1 FK FN 0 0 0 0 0 0 0 0
#2 FK SM 0 0 0 0 0 0 0 0
#3 FK VA 0 0 0 0 0 0 1 1
Or
cbind(t(indx),as.data.frame(t(combn(dimnames(m)$author,2,
FUN=function(x) m[x[1],] * m[x[2],]))))
If you want to keep only the rows that have at least one value other than 0 (rowSums() is zero only for all-zero rows, and !! coerces the sums to logical):
res1 <- res[!!rowSums(res[,-(1:2)]),]
Update
For the sum, you can do rowSums
res$Sum <- rowSums(res[,-(1:2)])
head(res,3)
# author1 author2 2 3 4 5 6 8 11 12 Sum
#1 FK FN 0 0 0 0 0 0 0 0 0
#2 FK SM 0 0 0 0 0 0 0 0 0
#3 FK VA 0 0 0 0 0 0 1 1 2
Update2
Regarding the second question, about dividing res1 by CL:
CL <- colSums(res1[,-(1:2)]) # total per id column
CL <- CL-1
CL[ CL<1 ] <- 0 # floor the adjusted totals at zero
res1[-(1:2)]/CL[col(res1[-(1:2)])] # divide each column by its adjusted total (a total of 0 gives Inf/NaN)
Update3
Regarding the new dataset,
d <- read.csv('AuthorsRevised.csv', stringsAsFactors=FALSE)
m <- xtabs(~Authors+ID,d)
indx <- combn(dimnames(m)$Authors,2)
dim(indx)
#[1] 2 435711
res <- cbind(t(indx), as.data.frame(do.call(rbind,
lapply(split(indx, col(indx)), function(x) m[x[1],]*m[x[2],]))))
colnames(res)[1:2] <- paste0('author', 1:2)
dim(res)
#[1] 435711 534
res[1:3,1:3]
# author1 author2 1
#1 Abe S.-i. Achterberg W. 0
#2 Abe S.-i. Adebowale B.O.A. 0
#3 Abe S.-i. Aghion P. 0
Another option if you just want the sum would be
t1 <- crossprod(table(d))
t1[upper.tri(t1, diag=TRUE)] <- NA
library(reshape2)
res1 <- melt(t1, na.rm=TRUE)[,c(2:1,3)]
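To unpack that: crossprod(table(d)) builds the author-by-author co-occurrence counts directly, blanking the upper triangle and diagonal keeps each unordered pair once, and melt() returns one row per remaining pair. A quick check sketch (column names assume reshape2's default Var1/Var2/value for a melted matrix; the expected counts are read off the data above, not verified output):
res1[res1$value > 0, ]
# expected pairs: FK-VA 2, FK-VB 2, SM-VM 1, VA-VB 2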
I have been using the textmatrix() function for a while to create DTMs (document-term matrices) which I can then use for LSI.
library(lsa) # provides textmatrix() and lsa()
dirLSA <- function(dir){
dtm <- textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
textmatrix() is a function which takes a directory (folder path) and returns document-wise term frequencies. These are used in further analysis such as Latent Semantic Indexing/Analysis (LSI/LSA).
However, I have now run into a problem: my tweet data comes in batch files (~500000 tweets/batch), and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem I face is that textmatrix() does not support this.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried sending table(tokenize(cleanline[i])) into a loop, but it won't add new columns for words not already in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this (tagging each frequency table with a document id first, so that xtabs() has two dimensions to tabulate over and can build a term-by-line matrix):
ll <- list(df1, df2)
dd <- do.call("rbind", Map(cbind, ll, doc = seq_along(ll)))
dtm <- xtabs(Freq ~ Var1 + doc, data = dd)
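A quick sanity check on the result, using the df1 and df2 shown above (the expected values are read off those tables, not verified output):
dim(dtm) # one row per unique token, one column per document
dtm["xbox", "1"] # expected: 1 (once in cleanline[10])
dtm["wish", "2"] # expected: 1 (once in cleanline[12])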
Something that works for me:
textLSA <- function(text){
# seed the result with the term frequencies of the first line
a <- data.frame(table(tokenize(text[1])))
colnames(a)[2] <- paste(c("Line", 1), collapse = ' ')
df <- a
# merge in one frequency column per remaining line
for(i in seq_along(text)[-1]){
a <- data.frame(table(tokenize(text[i])))
colnames(a)[2] <- paste(c("Line", i), collapse = ' ')
df <- merge(df, a, all = TRUE)
}
df[is.na(df)] <- 0 # terms absent from a line get frequency 0
dtm <- as.matrix(df[, -1])
rownames(dtm) <- df$Var1
return(lsa(dtm))
}
What do you think of this code?