I am trying to convert cross-sectional data into an adjacency matrix, because I want to use social network analysis to examine how often certain variables occur together.
In case empirical examples would help with the logic, it's basically analogous to presenting 4 people with a choice of three objects; they can choose from 0 to 3 of the objects. I'd like to analyze how commonly different objects were chosen together and visualize this as a network of preferences.
The data is set up as cross-sectional data, below:
ID1 <- c(1,0,0)
ID2 <- c(1,0,1)
ID3 <- c(1,1,1)
ID4 <- c(0,0,0)
IDs <- c("1","2","3","4")
df <- data.frame(rbind(ID1, ID2, ID3, ID4))
df <- cbind(IDs, df)
colnames(df) <- c("ID", "Var1", "Var2", "Var3")
I'd like to create a weighted adjacency matrix for Var1, Var2 and Var3, with each cell containing the total number of times the two variables occur together among the observations.
So the basic procedure I was thinking about is to create a separate matrix for each row (each ID number) with a 1 or 0 for each cell indicating whether or not both variables are present for the ID. And then add these matrices together, so the final matrix gives the total number of joint appearances.
I've been looking around and haven't quite gotten it right. I thought of using outer, but it would need to work for each column in sequence. This answer was pretty close, but I wasn't sure how the values were being added together; I ended up with a list of matrices whose values didn't correspond to the initial data: Convert categorical data in data frame to weighted adjacency matrix.
This answer was also close, although it seemed to use a different type of data; it gave me an adjacency matrix based on the IDs:
http://r.789695.n4.nabble.com/Conversion-to-Adjacency-Matrix-td794102.html
Here is very messy code to manually create a matrix for one observation, just so you get a sense for what I'm going for (using a vector representing just the first ID observation)
ID1 <- c(1,0,0)
var1 <- ID1[[1]]
var2 <- ID1[[2]]
var3 <- ID1[[3]]
onetwo <- var1 * var2
onethree <- var1 * var3
twothree <- var2 * var3
oneone <- var1 * var1
twotwo <- var2 * var2
threethree <- var3 * var3
rows1 <- rbind(oneone, onetwo, onethree)
rows2 <- rbind(onetwo, twotwo, twothree)
rows3 <- rbind(onethree, twothree, threethree)
df2 <- cbind(rows1, rows2, rows3)
This is obviously not ideal: my actual dataset has 198 observations and 33 variables, so even with looping or apply functions it would be very inefficient.
I can't tell if I'm making this more difficult than it needs to be, or if I'm trying to force my data to do something it wasn't meant to do. But if anyone has run into this sort of task before, please let me know. Is there a way to create the desired adjacency matrix directly? Should I convert the data into an edge list first, and is there a good way to do that? Is there code that would make the first step (creating a matrix for each row of the data frame) more efficient?
Thanks for your help,
I'm not sure if I understand the question, but is this what you want?
nc=33
nr=198
m3<-matrix(sample(0:1,nc*nr,replace=TRUE),nrow=nr)
df3<-data.frame(m3)
m3b <-matrix(0,nrow=nc,ncol=nc)
for(i in seq(1,nc)) {
for (j in seq(1,nc)) {
t3<-table(df3[,i],df3[,j])
m3b[i,j] = t3[2,2] # t3[2,2] contains the count of df3[,i] = df3[,j] = 1
# or
# t3 = sum(df3[,i]==df3[,j] & df3[,i] == 1)
# m3b[i,j] = t3
}
}
Or, if you want the sum of the products, which gives the same result when every value is 1 or 0:
m3c <-matrix(0,nrow=nc,ncol=nc)
for(i in seq(1,nc)) {
for (j in seq(1,nc)) {
sv=0
for (k in seq(1,nr)) {
vi = df3[k,i]
vj = df3[k,j]
sv=sv+vi*vj
}
m3c[i,j] = sv
}
}
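For 0/1 data, the "one matrix per row, then add them up" procedure described in the question is a matrix cross-product, so the loops can probably be avoided. The following is a minimal sketch under that assumption, using the four-ID example from the question; the names m, adj, adj_check and edges are introduced here for illustration only.

```r
# Data from the question: one row per ID, one 0/1 column per variable
df <- data.frame(
  ID   = c("1", "2", "3", "4"),
  Var1 = c(1, 1, 1, 0),
  Var2 = c(0, 0, 1, 0),
  Var3 = c(0, 1, 1, 0)
)

# Drop the ID column and work with a numeric matrix
m <- as.matrix(df[, -1])

# crossprod(m) is t(m) %*% m: cell [i, j] counts rows where both i and j are 1
adj <- crossprod(m)

# Check: summing one outer-product matrix per row gives the same matrix
adj_check <- Reduce(`+`, lapply(seq_len(nrow(m)), function(k) outer(m[k, ], m[k, ])))

# Optional edge list built from the upper triangle (each pair listed once)
idx <- which(upper.tri(adj) & adj > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(adj)[idx[, 1]],
  to     = colnames(adj)[idx[, 2]],
  weight = adj[idx]
)
```

The diagonal of adj holds how often each variable occurs on its own; for a network plot it may make sense to set it to zero with diag(adj) <- 0.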
I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than with packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
variableOne<-rnorm(N,sample(1:100, NEACH),0.5)
variableOne[variableOne<0]<-0
variableTwo<-rnorm(N,sample(1:100, NEACH),0.5)
variableTwo[variableTwo<0]<-0
variableThree<-rnorm(N,sample(1:100, NEACH),0.5)
variableThree[variableThree<0]<-0
geocodeNum<-factor(rep(seq(1,N/NEACH),each=NEACH))
year<-rep(seq(2000,2000+NEACH-1,1),N/NEACH)
# Putting it all together
AllData<-data.frame(geocodeNum,
year,
variableOne,
variableTwo,
variableThree)
return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
# lookup names of variables
v1 <- myVariables[p[1]]
v2 <- myVariables[p[2]]
# perform the cor.test()
htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
# record pertinent info in a dataframe
data.frame(Var1 = v1,
Var2 = v2,
Pval = htest$p.value,
Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
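The question also asks for the correlations per country rather than over all rows at once. One way that might work is to split the data by country and run the same pairwise cor.test() inside each piece. This is only a sketch in base R: the small data frame dat below is made up so the example is self-contained, and the names dat, vars, var_pairs and result are assumptions, not part of the original code.

```r
# Made-up stand-in for the simulated data: 3 countries, 10 years each
set.seed(1)
dat <- data.frame(
  geocodeNum    = factor(rep(1:3, each = 10)),
  year          = rep(2000:2009, times = 3),
  variableOne   = rnorm(30),
  variableTwo   = rnorm(30),
  variableThree = rnorm(30)
)

# Every column except the identifiers is treated as a variable
vars <- setdiff(names(dat), c("geocodeNum", "year"))

# List of unique variable pairs, each a character vector of length 2
var_pairs <- combn(vars, 2, simplify = FALSE)

# For each country, build one row per pair with the p value and r value
by_country <- lapply(split(dat, dat$geocodeNum), function(d) {
  do.call(rbind, lapply(var_pairs, function(p) {
    htest <- cor.test(d[[p[1]]], d[[p[2]]])
    data.frame(geocodeNum = as.character(d$geocodeNum[1]),
               Var1 = p[1],
               Var2 = p[2],
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  }))
})

# Stack the per-country data frames into one result
result <- do.call(rbind, by_country)
```

With 3 countries and 3 variable pairs, result has 9 rows; squaring Rval would give R^2 if that is what is needed.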
I have a quite complex set of functions that I need to apply to four different dummy variables with same core name but different number at the end. I am looking to apply these functions in one go rather than repeating it four times.
As an example, here's a made up dataset just for illustrative purposes:
n <- c(1:100)
var1 <-NA
var1[n < 20] <- 1
var1[n >50] <- 0
var2 <-NA
var2[n < 30] <- 1
var2[n >50] <- 0
var3 <-NA
var3[n < 10] <- 1
var3[n >40] <- 0
var4 <-NA
var4[n < 20] <- 1
var4[n > 450] <- 0
df <- data.frame(var1, var2, var3, var4, n)
In terms of the functions I need to loop over, it's mainly three with regards to these variables. I need to be able to first subset the dataframe, create a new variable for each of the original ones, and write the new results into a dataframe. Please don't ask me why I need to do these, they're a part of a much larger code.
These are the steps I need to perform but on all 4:
df_sub <- subset(df, !is.na(df$var1))
sample1 <- nrow(df_sub[df_sub$var1 == 1,])
if(sample1 < 35) {
a1 <- NA
} else {
a1 <- mean(df_sub$n[df_sub$var1==1])
}
# after repeating the steps above for var2 to var4:
new_df <- data.frame(a1,a2,a3,a4)
I was thinking of looping over the suffix, but I cannot figure out how R deals with this. I found a solution for creating a variable in a loop through assign() (https://stats.stackexchange.com/questions/10838/produce-a-list-of-variable-name-in-a-for-loop-then-assign-values-to-them),
but I still cannot figure out how to handle the subset, or more generally how to loop over a number in a variable name rather than over a column number, list, etc.
Alternatively, is there a way to write a function that creates variables and exports them into the environment outside the function, so that I can apply it to var1 - var4 in df and still get four different versions of a (a1 - a4) in new_df?
You can start the loop and update the variable over which you work by using get() and then use assign(). As an example:
for (i in 1:number_of_variables){
variable=get(paste0("var",i))
... work on the variable ...
# Returns
assign(paste0("df_sub",i), ... your result ...)
}
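As an alternative to get() and assign(), the column name can be built as a string and used with [[ ]], which keeps everything inside the data frame. The sketch below assumes that approach; the helper mean_if_enough and its min_n argument are hypothetical names, and the data is a simplified version of the question's example.

```r
# Simplified version of the example data: 1 for low n, 0 for high n, NA between
n <- 1:100
df <- data.frame(
  var1 = ifelse(n < 20, 1, ifelse(n > 50, 0, NA)),
  var2 = ifelse(n < 30, 1, ifelse(n > 50, 0, NA)),
  var3 = ifelse(n < 10, 1, ifelse(n > 40, 0, NA)),
  var4 = ifelse(n < 20, 1, ifelse(n > 50, 0, NA)),
  n    = n
)

# Hypothetical helper: subset on one column (given by name), then return the
# mean of n where that column is 1, or NA if there are fewer than min_n such rows
mean_if_enough <- function(col, data, min_n = 35) {
  data_sub <- data[!is.na(data[[col]]), ]
  is_one <- data_sub[[col]] == 1
  if (sum(is_one) < min_n) NA_real_ else mean(data_sub$n[is_one])
}

# Build the column names from the numeric suffix and apply the helper to each
vars <- paste0("var", 1:4)
a <- sapply(vars, mean_if_enough, data = df, min_n = 15)

# One-row data frame with columns a1 to a4
new_df <- as.data.frame(t(a))
names(new_df) <- paste0("a", 1:4)
```

Here min_n = 15 is used only so that some values are not NA; with the question's threshold of 35, every column in this small example falls below the cutoff.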
I am trying to make it feasible to compute dissimilarities between the records of a massive dataset (600,000 records).
The first task is to compute the dissimilarity using Euclidean Distance between one single record and the whole data.frame excluding that record.
Considering the following sample:
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1,]
The question has two steps:
1. use a vectorized operation to return a vector of length 4 with the dissimilarity values of one_row compared to each row of mydf[-1,]
2. from the vector in point 1, extract the index of the row most similar to one_row
Then I could iterate this process over every row in mydf and thereby find, for each row, its most similar row. This would allow me to perform agglomerative clustering as well as compute statistical criteria such as Silhouette that are based on a distance matrix.
Update
One possible approach is to replicate one_row to the same size as mydf and vectorize the similarity computation by performing it pair-wise.
replicated <- one_row[rep(1, nrow(mydf)), 1:ncol(mydf)]
Correct Answer
Both Jesse Tweedle's and won782's answers correctly address my question.
The advantage of Jesse Tweedle's answer is that the distance function can be customized, which allows mixed data types. The drawback is that it is a pipe of functions rather than a single expression.
The advantage of won782's answer is that it is a single expression. The drawback is that it only works for matrices, and therefore only for numeric variables.
I chose won782's answer because the solution can easily be extended into a building block for computing the Silhouette criterion without storing the dissimilarity matrix.
If I understood your question correctly, you want to perform a row-wise operation for a given vector and compute the Euclidean distance to every row.
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1,]
result = apply(mydf, 1, function(x) {
sqrt(sum((x - one_row)^2))
})
result
[1] 0.000000 3.333031 3.737814 1.875482 4.216042
The result is a vector of Euclidean distances. You can then use the which.min function to find the index of the minimum value.
Using matrix operation :
sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
Benchmark two methods on larger dataset
> mydf <- data.frame(var1 = rnorm(10000), var2 = rnorm(10000), var3 = rnorm(10000))
> one_row <- mydf[1,]
> # Matrix operation method
> system.time({
+ sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
+ })
user system elapsed
0.000 0.000 0.001
> # Apply Method
> system.time({
+ apply(mydf, 1, function(x) {
+ sqrt(sum((x - one_row)^2))
+ })
+ })
user system elapsed
5.186 0.014 5.204
So clearly, the matrix operation is the faster method.
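To go from the distance vector to "the most similar other row", the row's own zero distance has to be excluded before calling which.min. The following sketch assumes the matrix approach above, written with sweep() instead of the double transpose; nearest_row is a hypothetical helper name.

```r
set.seed(42)
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
m <- as.matrix(mydf)

# Hypothetical helper: index of the row closest to row i, excluding row i itself
nearest_row <- function(m, i) {
  # subtract row i from every row (column-wise sweep), then Euclidean norm per row
  d <- sqrt(rowSums(sweep(m, 2, m[i, ])^2))
  d[i] <- Inf  # a row must not be matched with itself
  which.min(d)
}

# Apply to every row; only one distance vector is held in memory at a time
nearest <- vapply(seq_len(nrow(m)), nearest_row, integer(1), m = m)
```

Because each call only computes one row of the distance matrix, the full 600,000 x 600,000 matrix is never stored, although the total work is still quadratic in the number of rows.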
Problem:
You could use dist on mydf, but the result would be too big for your computer (1e11-ish elements). So the challenge is to calculate the Euclidean distance from each row to the whole dataset. You don't want to replicate the whole thing over and over, because you'd be doing it 600,000 times. But you can write a vectorized function to calculate the Euclidean distance and use tidyverse tools to apply it succinctly.
Answer:
Write a function euc and vectorize it over the second argument.
library(tidyverse)
euc <- function(x, y) {
sqrt(sum((x - y)^2))
}
euc_ <- Vectorize(euc, vectorize.args = "y")
calculate_distances <- function(row, df) {
dists <- euc_(row, split(df, 1:nrow(df)))
# gives you name of row and distance that gives minimum distance.
dists[dists>0 & dists == min(dists[dists>0])] %>% enframe()
}
The calculate_distances function then calculates the Euclidean distance from a single row to the rest of the dataset and collapses the result down to the name and value of the row with the minimum distance (excluding the row itself, which is why we need the dists>0 condition).
Then you combine the vars into one column (which makes it easier to pass to a function like calculate_distances without specifying the column names, var1, etc). Then use mutate and map to apply the function to every row, then unnest to unpack the results (and keep the original data, if you like).
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
mydf %>%
mutate(n = row_number()) %>%
group_by(n) %>%
nest(var1, var2, var3) %>%
mutate(ans = map(data, calculate_distances, df = mydf)) %>%
unnest(ans, data)
# A tibble: 5 x 6
n name value var1 var2 var3
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 4 1.027080 0.035684445 0.3152272 1.9001506
2 2 5 1.453509 -0.985996620 0.2650241 -0.2146157
3 3 2 1.645737 0.009665813 -0.8393461 0.4907029
4 4 1 1.027080 0.314943627 0.9910671 1.1789382
5 5 2 1.453509 0.436344415 0.5309611 -0.3521368
👍 Good luck! 🤞 Hope this helps.
I have this piece of script for R and I want to adjust it a little bit.
Here's the script I have, mydata is an imported .csv file of n columns:
library(orddom)
R=6
delta = numeric (R)
for (i in 1:R) {
a <- data.matrix(sample(mydata, 2, replace=FALSE))
drops <- c(colnames(a))
b <- data.matrix(mydata[,!(names(mydata) %in% drops)])
a1 <- na.omit(t(matrix(a,1)))
b1 <- na.omit(t(matrix(b,1)))
colnames(a1) <- c("Group 1")
colnames(b1) <- c("Group 2")
delta [i] <- abs(as.numeric(orddom(a1, b1, alpha = 0.05, paired=FALSE)[13,1]))
}
The problem is that, for a, the columns of mydata are sampled at random, so several delta values end up identical: on each iteration the same set of columns may be selected again.
I want the columns not to be resampled at random. Instead I want every possible combination of columns without repetition, where order does not matter (for example, columns 1, 2 and 3 are the same combination as columns 2, 1 and 3) and a column is never combined with itself.
Is there a way to exclude column combinations that have already been selected?
I would then like to calculate delta for every combination and store the results in a vector.
orddom: Ordinal Dominance Statistics
You can try the following:
#get the combos outside the loop
combos<-combn(length(mydata),2)
R<-ncol(combos)
delta<-numeric(R)
#in the loop, replace the first line
a <- mydata[,combos[,i]]
#the rest should be ok
There are some improvements you could make in the code but they are not relevant in what you are asking.
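Put together, the loop might look like the sketch below. To keep it self-contained, it uses a made-up four-column data frame and a simple stand-in statistic (the absolute difference of group means) in place of the orddom() call, which is an assumption for illustration only; in the real script the delta line from the question would go there instead.

```r
# Made-up stand-in for the imported .csv file
mydata <- data.frame(
  A = c(1, 2, 3, NA),
  B = c(2, 3, 4, 5),
  C = c(5, 6, 7, 8),
  D = c(0, 1, 0, 1)
)

# Each column of combos is one unique pair of column indices (no repeats)
combos <- combn(ncol(mydata), 2)
delta <- numeric(ncol(combos))

for (i in seq_len(ncol(combos))) {
  keep <- combos[, i]
  # Group 1: values of the selected pair of columns, pooled, NAs dropped
  a1 <- na.omit(unlist(mydata[, keep], use.names = FALSE))
  # Group 2: values of all remaining columns, pooled, NAs dropped
  b1 <- na.omit(unlist(mydata[, -keep, drop = FALSE], use.names = FALSE))
  # Stand-in statistic; replace with the orddom() expression from the question
  delta[i] <- abs(mean(a1) - mean(b1))
}

# Label each delta with the pair of columns it came from
names(delta) <- apply(combos, 2, function(k) paste(names(mydata)[k], collapse = "+"))
```

With four columns there are six unique pairs, so delta has six entries and no pair is evaluated twice.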
I have a list of identically sorted dataframes. More specifically, these are the imputed dataframes I get after doing multiple imputation with the AmeliaII package. Now I want to create a new dataframe that is identical in structure but contains the mean values of the cells calculated across the dataframes.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the Mean of the matrices
# abind() comes from the abind package; only a to e exist because m=5
mean.right <- apply(abind(a,b,c,d,e,along=3),c(1,2),mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose there is a much better way of doing this using apply, plyr or the like, but as an R newbie I am a bit lost here. Do you have any suggestions on how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
#since all variables are numerical, use a threedimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
#rowMeans for a threedimensional array and dims=2 does the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions of the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
for (j in 1:dim(a)[2]) {
imp.df[i, j] <- mean(as.numeric(a[i, j,]))
}}