I'm using R to analyze an undirected network of individuals with ethnicities as attributes. I want to create a tie accounts table, or "preference matrix": a square matrix with the ethnicity values arrayed on both dimensions, where each cell tells you how many ties correspond to that type of relationship. (From this you could calculate the probability of one group sending ties to another group, but I just want to use it as an argument to igraph's preference.game function.) Here's what I tried:
# I create a variable for ethnicity by assigning the names of my vertices to their corresponding ethnicities
eth <- atts$Ethnicity[match(V(mahmudNet)$name,atts$Actor)]
# I create an adjacency matrix from my network data
mat <- as.matrix(get.adjacency(mahmudNet))
# I create the dimensions for my preference matrix from the Ethnicity values
eth.value <- unique(sort(eth))
# I create an empty matrix using these dimensions
eth.mat <- array(NA,dim=c(length(eth.value),length(eth.value)))
# I loop over the ethnicity values to populate the empty cells of the matrix
for (i in eth.value) {
  for (j in eth.value) {
    eth.mat[i, j] <- sum(mat[eth == i, eth == j])
  }
}
My problem is at the end, I think: I need to figure out an expression that tells R how to populate the cells. The expression I used doesn't seem to work, but I want it set up so that I could potentially write
a <- sum(mat[eth=="White", eth=="Black"])
And then "a" would return the sum of all the cells in the adjacency matrix that correspond to a White-Black relationship.
Here's a sample of my data:
# data frame with Ethnicity attributes:
Actor Ethnicity
1 Sultan Mahmud of Siak 2
2 Daeng Kemboja 1
3 Raja Kecik of Trengganu 1
4 Raja Alam 2
5 Tun Dalam 2
6 Raja Haji 1
7 The Suliwatang 1
8 Punggawa Miskin 1
9 Tengku Selangor 1
10 Tengku Raja Said 1
11 Datuk Bendahara 2
12 VOC 3
13 King of Selangor 1
14 Dutch at Batavia 3
15 Punggawa Tua 2
16 Raja Tua Encik Andak 1
17 Raja Indera Bungsu 2
18 Sultan of Jambi 2
19 David Boelen 3
20 Datuk Temenggong 2
21 Punggawa Opu Nasti 1
# adjacency matrix with relations
                   Daeng Kemboja Punggawa Opu Nasti Raja Haji Daeng Cellak Daeng Kecik
Daeng Kemboja                  0                  1         1            1           1
Punggawa Opu Nasti             1                  0         1            0           0
Raja Haji                      1                  1         0            0           0
Daeng Cellak                   1                  0         0            0           0
Daeng Kecik                    1                  0         0            0           0
This is a simple job for table, once you have your data in the right shape.
First a sample dataset:
# fake ethnicity data by actor
actor_eth <- data.frame(actor = letters[1:10],
                        eth = sample(1:3, 10, replace = TRUE))
# fake adjacency matrix
adj_mat <- matrix(rbinom(100, 1, .5), ncol=10)
dimnames(adj_mat) <- list(letters[1:10], letters[1:10])
# blank out lower triangle & diagonal,
# so the random data is not asymmetric & there are no self-ties
adj_mat[lower.tri(adj_mat)] <- NA
diag(adj_mat) <- NA
Here's our fake adjacency matrix:
a b c d e f g h i j
a NA 1 1 1 0 0 1 1 0 1
b NA NA 0 1 0 1 0 0 1 0
c NA NA NA 1 1 0 0 1 0 0
d NA NA NA NA 1 0 0 1 1 0
e NA NA NA NA NA 0 0 1 0 1
f NA NA NA NA NA NA 1 1 0 1
g NA NA NA NA NA NA NA 1 1 0
h NA NA NA NA NA NA NA NA 0 0
i NA NA NA NA NA NA NA NA NA 1
j NA NA NA NA NA NA NA NA NA NA
Here's our fake eth table:
actor eth
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 3
7 g 3
8 h 3
9 i 1
10 j 2
So what you want to do is 1) put this in long format, so you have a bunch of rows with a source actor and a target actor, each representing a tie. Then 2) replace the actor name with ethnicity, so you have ties with source/target ethnicity. Then 3) you can just use table to make a cross tab.
# use `melt` to put this in long form, omitting rows showing "non connections"
library(reshape2)
actor_ties <- subset(melt(adj_mat), value==1)
# now replace the actor names with their ethnicities to create a data.frame
# of ties by ethnicity
eth_ties <-
data.frame(source_eth = with(actor_eth, eth[match(actor_ties$Var1, actor)]),
target_eth = with(actor_eth, eth[match(actor_ties$Var2, actor)]))
# now here's your cross tab
table(eth_ties)
Result:
target_eth
source_eth 1 2 3
1 0 2 1
2 2 0 1
3 3 5 9
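One caveat worth flagging (a hedged aside using the objects built above): because only the upper triangle of the adjacency matrix was kept, ties between two different ethnicities get split across the two off-diagonal cells of this cross tab. If you want a symmetric tie-count matrix for the undirected network, and assuming the cross tab is square (every group shows up as both source and target), you can fold it like this:
# sketch: fold the upper-triangle cross tab into a symmetric count matrix
tie_tab <- as.matrix(table(eth_ties))
sym_ties <- tie_tab + t(tie_tab)
diag(sym_ties) <- diag(tie_tab)  # within-group ties were already counted once each
sym_ties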
I have a string of basketball player stats like in the example below:
stats <- c("40pt 2rb 1as 2st 2to 4trey 11-20fg 14-14ft",
"7pt 5rb 1as 2st 1bl 3to 3-5fg 1-4ft",
"0pt 1rb 1as 0-2fg")
Ideally I would like to transform this string into tabular format, with one row per player and one column per stat.
This is the key for each column:
pt=points
rb=rebounds
as=assists
st=steals
bl=blocks
to=turnovers
trey=3 pointers made
fg=field goals made-attempted
ft=free throws made-attempted
We split the string at each boundary between a digit and a letter (and on whitespace) to create the list 'lst', loop through the list and convert each element to a data.frame whose column names come from the alternating split values, rbind the elements with rbindlist, split the columns containing '-' into multiple columns with cSplit, and convert the NA values to 0.
library(data.table)
library(splitstackshape)
lst <- strsplit(stats, "(?<=[0-9])(?=[a-z])|\\s+", perl = TRUE)
lst1 <- lapply(lst, function(x)
  as.data.frame.list(setNames(x[c(TRUE, FALSE)], x[c(FALSE, TRUE)])))
res <- cSplit(rbindlist(lst1, fill = TRUE), c('fg', 'ft'), '-')
for (nm in seq_along(res)) {
  set(res, i = NULL, j = nm, value = as.numeric(as.character(res[[nm]])))
  set(res, i = which(is.na(res[[nm]])), j = nm, value = 0)
}
res
# pt rb as st to trey bl fg_1 fg_2 ft_1 ft_2
#1: 40 2 1 2 2 4 0 11 20 14 14
#2: 7 5 1 2 3 0 1 3 5 1 4
#3: 0 1 1 0 0 0 0 0 2 0 0
Use dcast from the reshape2 package:
m=gsub("(\\d+)-(\\d+)(\\w+)","\\1\\3_m \\2\\3_a",stats)
n=gsub("(\\d+)(\\S*)","\\1 \\2",gsub("\\s","\n",m))
o=cbind(read.table(text=n),group=rep(1:length(n),lengths(strsplit(n,"\n"))))
dcast(o,group~V2,value.var="V1")
group as bl fg_a fg_m ft_a ft_m pt rb st to trey
1 1 1 NA 20 11 14 14 40 2 2 2 4
2 2 1 1 5 3 4 1 7 5 2 3 NA
3 3 1 NA 2 0 NA NA 0 1 NA NA NA
Using base R
> m=gsub("(\\d+)-(\\d+)(\\w+)","\\1\\3_m \\2\\3_a",stats)
> n=gsub("(\\d+)(\\S*)","\\1 \\2",gsub("\\s","\n",m))
> o=lapply(n,function(x)rev(read.table(text=x)))
> p=Reduce(function(x,y)merge(x,y,by="V2",all=T),o)
> read.table(text=do.call(paste,data.frame(t(p))),h=T)
as fg_a fg_m ft_a ft_m pt rb st to trey bl
1 1 20 11 14 14 40 2 2 2 4 NA
2 1 5 3 4 1 7 5 2 3 NA 1
3 1 2 0 NA NA 0 1 NA NA NA NA
I am trying to create a scatterpie plot with the scatterpie package in R. My data looks something like this:
EEE Innovation n equal negative positive n_mod
0 0 2 NA 2 NA 0.3162278
0 1 6 4 2 NA 0.5477226
0 2 1 NA 1 NA 0.2236068
0 3 2 NA 2 NA 0.3162278
0 5 1 1 NA NA 0.2236068
1 0 4 2 1 1 0.4472136
1 1 14 4 5 5 0.5916080
1 2 9 3 2 4 0.4743416
1 3 1 NA 1 NA 0.1581139
1 5 1 NA 1 NA 0.1581139
2 1 3 NA 2 1 0.2738613
3 0 1 NA 1 NA 0.1581139
3 1 3 1 2 NA 0.2738613
3 2 4 NA 2 2 0.3162278
4 0 3 2 1 NA 0.2738613
4 1 14 5 3 6 0.5916080
4 2 14 4 NA 10 0.5916080
For creating my plot I use this command:
ggplot() +
geom_scatterpie(aes(x=EEE,y=Innovation, r = n_mod), data=pie_data,
cols=c("equal","negative","positive")) +
geom_scatterpie_legend((all_pie_data$n_mod), n=7,
labeller= function(x) x=sort(unique(pie_data$n)))
I use n_mod which I got with
for (l in 1:17) {
all_pie_data$n_mod[l] <- sqrt(all_pie_data$n[l]/40)
}
instead of n as radius because the radii of the pies would be too large for my graph and smaller pies would be buried under the larger ones. For the legend I want to have the radii of the n_mod, but with the label of the "real" n values.
When I try to create this plot I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "label", value = c(1L, 2L, 3L, 4L, :
  replacement has 7 rows, data has 5
This error does not show up if I use anything lower than 24 in my n_mod creation:
for (l in 1:17) {
all_pie_data$n_mod[l] <- sqrt(all_pie_data$n[l]/24)
}
The pies generated by this are still too large for my graphs:
Does anyone have an idea how I can solve this problem or another way to create smaller pies?
P.S.: This is my first question here; if I did something wrong with the formatting or any information is missing, I am happy to improve it!
You could set "r" to:
r = n_mod/2
This should make them look smaller.
ggplot() +
geom_scatterpie(aes(x=EEE,y=Innovation, r = n_mod/2), data=pie_data,
cols=c("equal","negative","positive")) +
geom_scatterpie_legend((all_pie_data$n_mod), n=7,
labeller= function(x) x=sort(unique(pie_data$n)))
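As for the legend error itself (a hedged reading of geom_scatterpie_legend's behaviour, worth checking against your version of scatterpie): the labeller is applied to the radii of the circles the legend actually draws, and that can be fewer than the n you requested, which is why a fixed vector like sort(unique(pie_data$n)) can end up the wrong length. Inverting the n_mod = sqrt(n/40) transformation inside the labeller avoids the mismatch, since the function is applied elementwise to whatever radii are drawn. A sketch, assuming pie_data and all_pie_data are the same data frame and x = 0, y = 6 are placeholder legend coordinates:
# sketch: back-transform each legend radius to the original n
library(ggplot2)
library(scatterpie)
ggplot() +
  geom_scatterpie(aes(x = EEE, y = Innovation, r = n_mod / 2), data = pie_data,
                  cols = c("equal", "negative", "positive")) +
  geom_scatterpie_legend(pie_data$n_mod / 2, x = 0, y = 6,
                         labeller = function(r) round(40 * (2 * r)^2))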
I have a categorical dataset that I am trying to summarize, and it has inherent differences in the nature of the questions that were asked. The data below represent a questionnaire that had standard close-ended questions, but also questions where one could choose multiple answers from a list. "village" and "income" represent close-ended questions; "responsible.1", etc., represent a list where the respondent either said yes or no to each item.
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both DLNR NA DEQ NA Public
k regular.income DLNR NA NA NA NA
k regular.income DLNR CRM DEQ Mayor NA
l both DLNR NA NA Mayor NA
j both DLNR CRM NA Mayor NA
m regular.income DLNR NA NA NA Public
What I want is a 3-way table output with "village", "income", and the suite of "responsible" variables wrapped up into an ftable. This way, I could use the table with numerous R packages for graphs and analyses.
RESPONSIBLE
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both 2 1 1 1 1
k regular income 2 1 1 1 0
l both 1 0 0 1 0
m regular income 1 0 0 0 1
as.data.frame(table(village, responsible.1)) would get me the first, but I can't figure out how to get the entire thing wrapped up in a nice ftable.
> aggregate(dat[-(1:2)], dat[1:2], function(x) sum(!is.na(x)) )
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 j both 2 1 1 1 1
2 l both 1 0 0 1 0
3 k regular.income 2 1 1 1 0
4 m regular.income 1 0 0 0 1
I'm guessing you actually had another grouping vector, perhaps the first "responsible" column?
I don't really understand the sorting rules but reversing the order of the grouping columns may be closer to what you posted:
> aggregate(dat[-(1:2)], dat[2:1], function(x) sum(!is.na(x)) )
INCOME VILLAGE responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 both j 2 1 1 1 1
2 regular.income k 2 1 1 1 0
3 both l 1 0 0 1 0
4 regular.income m 1 0 0 0 1
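If you really need an ftable object rather than a data frame (a hedged sketch, assuming dat has VILLAGE and INCOME in the first two columns and the responsible.* columns after that), you can reshape the aggregate result to long form and pass it through xtabs:
# sketch: long form -> xtabs -> ftable
agg  <- aggregate(dat[-(1:2)], dat[1:2], function(x) sum(!is.na(x)))
long <- reshape(agg, direction = "long",
                varying = list(names(agg)[-(1:2)]), v.names = "count",
                times = names(agg)[-(1:2)], timevar = "RESPONSIBLE")
ftable(xtabs(count ~ VILLAGE + INCOME + RESPONSIBLE, data = long))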
I have a data frame that I'm working with that contains experimental data. For the purposes of this post we can limit the discussion to five columns: ExperimentID, ROI, isContrast, isTreated, and Value. ROI is a text-based factor that indicates where a region of interest is drawn, e.g. 'ROI_1', 'ROI_2', etc. isTreated and isContrast are binary fields indicating whether or not some treatment was applied. I want to make a scatter plot comparing the values of, e.g., 'ROI_1' vs. 'ROI_2', which means I need the data paired in such a way that when I plot it, the first X value is from Experiment_1 and ROI_1, the first Y value is from Experiment_1 and ROI_2, the next X value is from Experiment_2 and ROI_1, the next Y value is from Experiment_2 and ROI_2, etc. I only want to make this comparison for common values of isContrast and isTreated (i.e. one plot for each combination of these variables, so four plots altogether).
Subsetting doesn't solve my problem because data from different experiments/ROIs was sometimes entered out of numerical order.
The following code produces a mock data set to demonstrate the problem
expID = c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa','Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe','Bob','Bob','Alice','Alice','Lisa','Lisa')
treated = c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi = c(rep('A',16),'B','B','B','B','B','B')
myFrame = data.frame(ExperimentID=expID,isTreated = treated, isContrast= contrast,Value = val, ROI=roi)
ExperimentID isTreated isContrast Value ROI
1 Bob 0 0 1 A
2 Bob 0 1 2 A
3 Bob 1 0 3 A
4 Bob 1 1 4 A
5 Lisa 0 0 1 A
6 Lisa 0 1 2 A
7 Lisa 1 0 3 A
8 Lisa 1 1 4 A
9 Alice 0 0 1 A
10 Alice 0 1 2 A
11 Alice 1 0 3 A
12 Alice 1 1 4 A
13 Joe 0 0 1 A
14 Joe 0 1 2 A
15 Joe 1 0 3 A
16 Joe 1 1 4 A
17 Bob 0 0 6 B
18 Bob 0 1 7 B
19 Alice 0 0 8 B
20 Alice 0 1 9 B
21 Lisa 0 0 10 B
22 Lisa 0 1 11 B
Now let's say I want to scatter plot values for A vs. B. That is to say, I want to plot x vs. y where {(x, y)} = {(Bob's Value from ROI A, Bob's Value from ROI B), (Alice's Value from ROI A, Alice's Value from ROI B), ...}, and these all must have the same values for isTreated and isContrast for the comparison to make sense. Now, if I just go and subset, I'll get something like:
> x= myFrame$Value[(myFrame$ROI == 'A') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> x
[1] 1 1 1 1
> y= myFrame$Value[(myFrame$ROI == 'B') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> y
[1] 6 8 10
Now as you can see, the values in x correspond to the first rows of Bob, Lisa, Alice and Joe respectively, but the values of y correspond to Bob, Alice and Lisa respectively, and there is no value for Joe.
So say I ignored the value for Joe because that data is missing for B, and just decided to plot the first 3 values of x vs. the first 3 values of y. The data are still out of order, because x = (Bob, Lisa, Alice) but y = (Bob, Alice, Lisa) in terms of where the values are coming from. So I would like to know how to make vectors such that the order is correct and the plot makes sense.
Similar to @Matthew's answer, with ggplot:
The idea is to reshape your data so that the values from ROI=A and ROI=B are in different columns. This can be done (with your sample data) as follows:
library(reshape2)
zz <- dcast(myFrame,
value.var="Value",
formula=ExperimentID+isTreated+isContrast~ROI)
zz
ExperimentID isTreated isContrast A B
1 Alice 0 0 1 8
2 Alice 0 1 2 9
3 Alice 1 0 3 NA
4 Alice 1 1 4 NA
5 Bob 0 0 1 6
6 Bob 0 1 2 7
7 Bob 1 0 3 NA
8 Bob 1 1 4 NA
9 Joe 0 0 1 NA
10 Joe 0 1 2 NA
11 Joe 1 0 3 NA
12 Joe 1 1 4 NA
13 Lisa 0 0 1 10
14 Lisa 0 1 2 11
15 Lisa 1 0 3 NA
16 Lisa 1 1 4 NA
Notice that your sample data is rather sparse (lots of NAs).
To plot:
library(ggplot2)
ggplot(zz, aes(x = A, y = B, color = factor(isTreated))) +
  geom_point(size = 4) + facet_wrap(~isContrast)
Produces this:
The reason there are no blue points is that, in your sample data, there are no occurrences of isTreated=1 and ROI=B.
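If you'd rather drop the incomplete pairs up front (as in the question, where Joe has no ROI B observation) instead of letting ggplot silently drop the NA points, a small hedged addition on top of zz:
# sketch: keep only rows that have both an A and a B value before plotting
zz_complete <- zz[complete.cases(zz[, c("A", "B")]), ]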
Something like this, perhaps:
myFrameReshaped <- reshape(myFrame, timevar='ROI', direction='wide', idvar=c('ExperimentID','isTreated','isContrast'))
plot(Value.B ~ Value.A, data=myFrameReshaped)
To condition by the isTreated and isContrast variables, lattice comes in handy:
library(lattice)
xyplot(Value.B~Value.A | isTreated + isContrast, data=myFrameReshaped)
Values that are not present for one of the conditions give NA, and are not plotted.
head(myFrameReshaped)
## ExperimentID isTreated isContrast Value.A Value.B
## 1 Bob 0 0 1 6
## 2 Bob 0 1 2 7
## 3 Bob 1 0 3 NA
## 4 Bob 1 1 4 NA
## 5 Lisa 0 0 1 10
## 6 Lisa 0 1 2 11
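One small hedged tweak: isTreated and isContrast are numeric 0/1 columns, so lattice treats them as continuous shingles when conditioning; wrapping them in factor() gives plainly labelled panels instead.
# sketch: condition on factors so each 0/1 combination gets its own labelled panel
xyplot(Value.B ~ Value.A | factor(isTreated) + factor(isContrast),
       data = myFrameReshaped)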
Still getting the gist of R. I have two data frames whose rows are named with different coordinates (e.g. x_1013y_41403; see below). The coordinates form sets of five; each set makes a cross if plotted onto a grid. The center coordinate is in one data frame, and the four peripheral coordinates are in the other.
Center A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA
Peripheral A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0
To form a set, the peripheral coordinates would have the same x or y location as the center coordinate. For example, the center coordinate x_723y_6363 would be associated with x_723y_6100 and x_723y_6627 (same x location), as well as x_459y_6363 and x_987y_6363 (same y location).
I would like to combine the coordinates into their respective sets, and name the set with the center coordinate. For the case above, I would end up with two rows, where each row is the summation of a set.
A B C D E F
x_723y_6363.txt 554 5 604 1 645 8
x_749y_41403.txt 14 4 6 0 13 0
I am not sure at all how this can be done. I have thought about creating regular expressions to pick out the x and y coordinates individually and then doing a comparison across the two data frames. Any help would be greatly appreciated!
I hope someone else comes up with a better answer, as this is ugly. I would first split the .txt names into x and y values, then loop over each of the variables that is NA in center and sum all values that share an x or y value with that center. Edit: Changed the sapply to make it slightly nicer.
center <- read.table(textConnection("
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA"),
header = TRUE)
peripheral <- read.table(textConnection("
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0"),
header = TRUE)
xpat <- "^([^y]+).*"
ypat <- ".*(y_[0-9]+)\\.txt"
center$x <- gsub(xpat, "\\1", rownames(center))
center$y <- gsub(ypat, "\\1", rownames(center))
peripheral$x <- gsub(xpat, "\\1", rownames(peripheral))
peripheral$y <- gsub(ypat, "\\1", rownames(peripheral))
vars <- c("B", "D", "F")
center[vars] <- sapply(peripheral[vars], function(col)
  apply(center, 1, function(row)
    sum(col[peripheral$x %in% row["x"] | peripheral$y %in% row["y"]])))
R> center
A B C D E F x y
x_723y_6363.txt 554 5 604 1 645 8 x_723 y_6363
x_749y_41403.txt 14 4 6 0 13 0 x_749 y_41403
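A small hedged cleanup afterwards, if you don't want the helper columns in the final table:
# drop the temporary coordinate columns used for matching
center$x <- NULL
center$y <- NULL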
Another option:
# function to split coordinates x and y:
f <- function(DF) structure(
  t(sapply(strsplit(row.names(DF), "[_y.]"), `[`, c(2, 4))),
  dimnames = list(NULL, c("x", "y")))
# get x and y for peripheral data:
P <- cbind(Peripheral, f(Peripheral))
# get x and y for centers, and mark ids:
C <- cbind(Center, f(Center), id=1:nrow(Center))
# matching:
Q <- merge(merge(P, C[,c("x","id")], all=TRUE), C[,c("y","id")], by="y", all=TRUE)
# prepare for union:
R <- within(Q, {id <- ifelse(is.na(id.y), id.x, id.y); id.x <- NULL; id.y <- NULL})
# join everything and aggregate:
S <- rbind(R, C)
aggregate(S[,3:8], by=list(id=S$id), FUN=sum, na.rm=TRUE)
Result:
id A B C D E F
1 1 554 5 604 1 645 8
2 2 14 4 6 0 13 0
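A hedged follow-up if you want the rows keyed by the center coordinate names, as in the desired output, rather than by the numeric id:
# sketch: reattach the center row names via the id column, then drop id
out <- aggregate(S[, 3:8], by = list(id = S$id), FUN = sum, na.rm = TRUE)
rownames(out) <- rownames(Center)[out$id]
out$id <- NULL
out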