Convert categorical data in data frame to weighted adjacency matrix

I have the following data frame, call it DF, which consists of three columns: "Chunk", "Name", and "Frequency". I need to turn it into a NameXName adjacency matrix where two Names are considered adjacent when they reside in the same chunk. So for example, in the first lines below, Gretel and Friedrich are adjacent because they are both in chunk 2. The weight of the relationship should be based on "Frequency": specifically, the number of times the two names are co-present in the same chunk, so for the Gretel/Friedrich example, Frequency(Gretel) + Frequency(Friedrich) - 1 = 5.
Chunk Name Frequency
1 2 Gretel 2
2 2 Pollock 1
3 2 Adorno 1
4 2 Friedrich 4
5 3 Max 1
6 3 Horkheimer 1
7 3 Adorno 1
8 4 Friedrich 5
9 4 Pollock 1
10 4 March 1
11 5 Comte 3
12 7 Jaspers 1
13 7 Huxley 2
14 8 Nietzsche 1
15 8 Sade 2
16 8 Felix 1
17 8 Weil 1
18 8 Western 1
19 8 Lowenthal 1
20 8 Kant 1
21 8 Hitler 1
I started to crack at this by splitting the data frame according to DF$Chunk,
> DF.split<-split(DF, DF$Chunk)
$`2`
Chunk Name Frequency
1 2 Gretel 2
2 2 Pollock 1
3 2 Adorno 1
4 2 Friedrich 4
$`3`
Chunk Name Frequency
5 3 Max 1
6 3 Horkheimer 1
7 3 Adorno 1
$`4`
Chunk Name Frequency
8 4 Friedrich 5
9 4 Pollock 1
10 4 March 1
which I thought got closer, but it returns list items that I am having trouble turning back into workable data frames.
I have also tried to start by turning this into a ChunkXName adjacency matrix:
> chunkbyname<-tapply(DF$Frequency , list(DF$Name,DF$Chunk) , as.character )
with the hope of multiplying chunkbyname by its transpose to get the NameXName matrix, but it seems this matrix is too sparse or complex (Error in a %*% b : requires numeric/complex matrix/vector arguments).
Any help getting this data frame into an adjacency matrix greatly appreciated.

Is this what you are looking for?
df3 <- by(df, df$Chunk, function(x){
  mm <- outer(x$Frequency, x$Frequency, "+") - 1  # pairwise Frequency(a) + Frequency(b) - 1
  rownames(mm) <- x$Name
  colnames(mm) <- x$Name
  mm
})
df3
# $`2`
# Gretel Pollock Adorno Friedrich
# Gretel 3 2 2 5
# Pollock 2 1 1 4
# Adorno 2 1 1 4
# Friedrich 5 4 4 7
#
# $`3`
# Max Horkheimer Adorno
# Max 1 1 1
# Horkheimer 1 1 1
# Adorno 1 1 1
#
# $`4`
# Friedrich Pollock March
# Friedrich 9 5 5
# Pollock 5 1 1
# March 5 1 1
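For completeness, the incidence-matrix route attempted in the question can also be made to work; it failed there because tapply(..., as.character) produces a character matrix full of NAs. A sketch, assuming the original DF: xtabs() builds a numeric Chunk-by-Name table (with zeros instead of NAs), and crossprod() multiplies it by its own transpose. Note the resulting weights differ from the Frequency(a) + Frequency(b) - 1 rule above: they are products of frequencies, or counts of shared chunks if you binarize first.
inc <- xtabs(Frequency ~ Chunk + Name, data = DF)  # numeric Chunk x Name incidence matrix
adj <- crossprod(inc)       # Name x Name; weight = sum over chunks of freq(a) * freq(b)
adj2 <- crossprod(inc > 0)  # Name x Name; weight = number of chunks shared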

Related

Creating a new column using for/nested loop in r

I'm just getting started with R and I need some help in understanding the application of for/nested loops.
StudyID<-c(1:5)
SubjectID<-c(1:5)
df<-data.frame(StudyID=rep(StudyID, each=5), SubjectID=rep(SubjectID, each=1))
How can I create a new column called ID, which uses the combination of StudyID and SubjectID to create a unique ID?
So for this data, unique ID should be from 1:25.
So the final data looks like this:
UniqueID<- c(1:25)
df<-cbind(df,UniqueID)
View(df)
Is there any other way which is faster and more efficient than looping?
Using the dplyr package, you could do:
library(dplyr)
df$Id = group_indices(df,StudyID,SubjectID)
This returns:
#StudyID SubjectID Id
# 1 1 1
# 1 2 2
# 1 3 3
# 1 4 4
# 1 5 5
# 2 1 6
# 2 2 7
# 2 3 8
# 2 4 9
# 2 5 10
# 3 1 11
# 3 2 12
# 3 3 13
# 3 4 14
# 3 5 15
# 4 1 16
# 4 2 17
# 4 3 18
# 4 4 19
# 4 5 20
# 5 1 21
# 5 2 22
# 5 3 23
# 5 4 24
# 5 5 25
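One caveat: this form of group_indices() was deprecated in dplyr 1.0.0. A rough equivalent with the newer API (same df as above) would be:
library(dplyr)
df <- df %>%
  group_by(StudyID, SubjectID) %>%
  mutate(Id = cur_group_id()) %>%  # one integer id per (StudyID, SubjectID) combination
  ungroup()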
Another method to achieve this without loading any library (base R) would be the following (assuming the data frame is sorted by the two columns):
StudyID<-c(1:5)
SubjectID<-c(1:5)
df<-data.frame(StudyID=rep(StudyID, each=5), SubjectID=rep(SubjectID, each=1))
df$uniqueID <- cumsum(!duplicated(df[1:2]))
or you can use this solution, mentioned in the comments (I prefer this over the first solution):
df$uniqueID <- as.numeric(factor(do.call(paste, df)))
The output would be:
> print(df, row.names = FALSE)
#StudyID SubjectID uniqueID
# 1 1 1
# 1 2 2
# 1 3 3
# 1 4 4
# 1 5 5
# 2 1 6
# 2 2 7
# 2 3 8
# 2 4 9
# 2 5 10
# 3 1 11
# 3 2 12
# 3 3 13
# 3 4 14
# 3 5 15
# 4 1 16
# 4 2 17
# 4 3 18
# 4 4 19
# 4 5 20
# 5 1 21
# 5 2 22
# 5 3 23
# 5 4 24
# 5 5 25
You could go for interaction in base R:
df$uniqueID <- with(df, as.integer(interaction(StudyID,SubjectID)))
For example (this example better illustrates what interaction() does):
set.seed(10)
df <- data.frame(StudyID=sample(5,10,replace = T), SubjectID=rep(1:5,times=2))
df$uniqueID <- with(df, as.integer(interaction(StudyID,SubjectID)))
# StudyID SubjectID uniqueID
# 1 3 1 3
# 2 2 2 6
# 3 3 3 11
# 4 4 4 16
# 5 1 5 17
# 6 2 1 2
# 7 2 2 6
# 8 2 3 10
# 9 4 4 16
# 10 3 5 19

Compute degree of each vertex from data frame

I have the following dataset:
V1 V2
2 1
3 1
3 2
4 1
4 2
4 3
5 1
6 1
7 1
7 5
7 6
I tried to compute the degree of each vertex with the code
e<-read.table("ex.txt")
library(igraph)
g1<-graph.data.frame(e, directed=FALSE)
adj<- get.adjacency(g1,type=c("both", "upper", "lower"),attr=NULL, names=TRUE, sparse=FALSE)
d<-rowSums(adj)
e$degreeOfV1<-d[e$V1]
e$degofV2<-d[e$V2]
the degree given by this code is not correct.
The problem with this code is that the nodes have been added to your graph in a different order than you expected:
V(g1)
# + 7/7 vertices, named:
# [1] 2 3 4 5 6 7 1
The first node in the graph (corresponding to element 1 of your d object) is actually node number 2 in e, element 2 is node number 3 in e, etc.
You can deal with this by using the node names instead of the node numbers when calculating the degrees:
d <- degree(g1)
e$degreeOfV1 <- d[as.character(e$V1)]
e$degreeOfV2 <- d[as.character(e$V2)]
# V1 V2 degreeOfV1 degreeOfV2
# 1 2 1 3 6
# 2 3 1 3 6
# 3 3 2 3 3
# 4 4 1 3 6
# 5 4 2 3 3
# 6 4 3 3 3
# 7 5 1 2 6
# 8 6 1 2 6
# 9 7 1 3 6
# 10 7 5 3 2
# 11 7 6 3 2
Basically the way this works is that degree(g1) returns a named vector of the degrees of each node in your graph:
(d <- degree(g1))
# 2 3 4 5 6 7 1
# 3 3 3 2 2 3 6
When you index by strings (as.character(e$V1) instead of e$V1), then you get the node by name instead of by index number.
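As an aside, the dot-separated igraph functions in the question (graph.data.frame, get.adjacency) have underscore-named replacements in current igraph releases; the same fix, sketched with those names:
library(igraph)
g1 <- graph_from_data_frame(e, directed = FALSE)
d  <- degree(g1)                       # named vector, so name-based lookup is safe
e$degreeOfV1 <- d[as.character(e$V1)]
e$degreeOfV2 <- d[as.character(e$V2)]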

R cumulative sum based upon other columns

I have a data.frame as below. The data is sorted by column txt and then by column val. The summ column is the sum of the value in the val column and the summ value from the preceding row, provided that the current row and the preceding row have the same value in the txt column. How could I do this in R?
txt=c(rep("a",4),rep("b",5),rep("c",3))
val=c(1,2,3,4,1,2,3,4,5,1,2,3)
summ=c(1,3,6,10,1,3,6,10,15,1,3,6)
dd=data.frame(txt,val,summ)
> dd
txt val summ
1 a 1 1
2 a 2 3
3 a 3 6
4 a 4 10
5 b 1 1
6 b 2 3
7 b 3 6
8 b 4 10
9 b 5 15
10 c 1 1
11 c 2 3
12 c 3 6
If by "earlier" you mean the nearest preceding row, which is what your expected output implies, then what you're describing is a cumulative sum within each txt group. You can apply cumsum() separately to each group of txt with ave():
dd <- data.frame(txt=c(rep("a",4),rep("b",5),rep("c",3)), val=c(1,2,3,4,1,2,3,4,5,1,2,3) );
dd$summ <- ave(dd$val,dd$txt,FUN=cumsum);
dd;
## txt val summ
## 1 a 1 1
## 2 a 2 3
## 3 a 3 6
## 4 a 4 10
## 5 b 1 1
## 6 b 2 3
## 7 b 3 6
## 8 b 4 10
## 9 b 5 15
## 10 c 1 1
## 11 c 2 3
## 12 c 3 6
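If you are already using dplyr, the same grouped cumulative sum can be sketched as (same dd as above):
library(dplyr)
dd <- dd %>%
  group_by(txt) %>%
  mutate(summ = cumsum(val)) %>%  # running total restarts within each txt group
  ungroup()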

Reflecting changes in a dataframe by modifying a list containing the dataframe

I have a list containing a couple of data frames. I'm applying lapply on the list and assigning the output back to the same list. I expected that this would change the data frames themselves, but it doesn't. Can someone help with this? I guess it should be quite straightforward, but I can't find anything that helps.
Thanks.
Sample data: (Source: Change multiple dataframes in a loop)
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
ll <- lapply(ll,function(df){
df$log_a <- log(df$a) ## new column with the log a
df$tans_col <- df$a+df$b+df$c ## new column with sums of some columns or any other
df
})
Results:
ll
[[1]]
a b c log_a tans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9
[[2]]
a b c log_a tans_col
1 6 2 8 1.7917595 16
2 0 7 4 -Inf 11
3 9 2 1 2.1972246 12
4 1 2 9 0.0000000 12
5 2 1 2 0.6931472 5
[[3]]
a b c log_a tans_col
1 0 4 2 -Inf 6
2 0 1 9 -Inf 10
3 1 9 7 0.000000 17
4 5 2 1 1.609438 8
5 1 3 1 0.000000 5
data_frame1
a b c
1 1 3 4
2 5 6 4
3 3 1 1
4 3 5 9
5 2 5 2
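That behaviour is expected: list(data_frame1, data_frame2, data_frame3) copies the data frames into ll, and R's copy-on-modify semantics mean lapply() only ever touches those copies, never the original objects. If you want the changes reflected in the standalone data frames, one sketch (assuming you are happy to overwrite them in the global environment) is to name the list elements and push them back with base R's list2env():
names(ll) <- c("data_frame1", "data_frame2", "data_frame3")
list2env(ll, envir = .GlobalEnv)  # overwrites the originals with the modified copies
data_frame1                       # now includes log_a and tans_col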

Replicate each row of data.frame and specify the number of replications for each row?

I am programming in R and I got the following problem:
I have a data frame jb that is quite long. Here's a simple version of it:
jb:
 a  b frequency
 5  3         2
 5  7         1
 9  1        40
12  4         5
12  5        13

jb.expanded:
 a  b
 5  3
 5  3
 5  7
 9  1
 9  1
...
I want to replicate the rows, with the number of replications given by the frequency column. That means the first row is replicated two times, the second row one time, and so on. I already solved that problem with the code
jb.expanded <- jb[rep(row.names(jb), jb$freqency), 1:2]
Now here is the problem:
Whenever any number in the frequency column is greater than 10, the number of replicated rows is wrong. For example:
Frequency: 43 --> 14 rows
           40 --> 13 rows
           13 --> 11 rows
           14 --> 12 rows
Can you help me? I have no idea how to fix that, I also cannot find anything on the internet.
Thanks for your help!
Update
Upon revisiting this question, I have a feeling that @Codoremifa was correct in their assumption that your "frequency" column might be a factor.
Here's an example if that were the case. It won't match your actual data since I don't know what other levels are in your dataset.
mydf$F2 <- factor(as.character(mydf$frequency))
## expandRows(mydf, "F2")
mydf[rep(rownames(mydf), mydf$F2), ]
# a b frequency F2
# 1 5 3 2 2
# 1.1 5 3 2 2
# 1.2 5 3 2 2
# 2 5 7 1 1
# 3 9 1 40 40
# 3.1 9 1 40 40
# 3.2 9 1 40 40
# 3.3 9 1 40 40
# 4 12 4 5 5
# 4.1 12 4 5 5
# 4.2 12 4 5 5
# 4.3 12 4 5 5
# 4.4 12 4 5 5
# 5 12 5 13 13
# 5.1 12 5 13 13
Hmmm. That doesn't look like 61 rows to me. Why not? Because rep uses the numeric values underlying the factor, which is quite different in this case from the displayed value:
as.numeric(mydf$F2)
# [1] 3 1 4 5 2
To properly convert it, you would need:
as.numeric(as.character(mydf$F2))
# [1] 2 1 40 5 13
Original answer
A while ago I wrote a function that is a more general version of @Simono101's answer. The function looks like this:
expandRows <- function(dataset, count, count.is.col = TRUE) {
  if (!isTRUE(count.is.col)) {
    if (length(count) == 1) {
      # single number: repeat every row 'count' times
      dataset[rep(rownames(dataset), each = count), ]
    } else {
      if (length(count) != nrow(dataset)) {
        stop("Expand vector does not match number of rows in data.frame")
      }
      # vector: repeat row i count[i] times
      dataset[rep(rownames(dataset), count), ]
    }
  } else {
    # 'count' names a column: repeat by that column and drop it from the result
    dataset[rep(rownames(dataset), dataset[[count]]),
            setdiff(names(dataset), names(dataset[count]))]
  }
}
For your purposes, you could just use expandRows(mydf, "frequency")
head(expandRows(mydf, "frequency"))
# a b
# 1 5 3
# 1.1 5 3
# 2 5 7
# 3 9 1
# 3.1 9 1
# 3.2 9 1
Other options are to repeat each row the same number of times:
expandRows(mydf, 2, count.is.col=FALSE)
# a b frequency
# 1 5 3 2
# 1.1 5 3 2
# 2 5 7 1
# 2.1 5 7 1
# 3 9 1 40
# 3.1 9 1 40
# 4 12 4 5
# 4.1 12 4 5
# 5 12 5 13
# 5.1 12 5 13
Or to specify a vector of how many times to repeat each row.
expandRows(mydf, c(1, 2, 1, 0, 2), count.is.col=FALSE)
# a b frequency
# 1 5 3 2
# 2 5 7 1
# 2.1 5 7 1
# 3 9 1 40
# 5 12 5 13
# 5.1 12 5 13
Note the required count.is.col = FALSE argument in those last two options.
Nearly. You want to pass a vector of row indices to [, not row.names. Try this:
jb[ rep( seq_len( nrow(jb) ) , times = jb$frequency ) , ]
rep( seq_len( nrow(jb) ) , times = jb$frequency )
# [1] 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [39] 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5
This might be more of a comment, but seeing that all the other answers are suggesting new options: if you correct the spelling of jb$freqency when creating jb.expanded, and convert jb$frequency to an integer, then the construction you mention in your question also works.
The reason I suspect jb$frequency is a factor is that the incorrect row counts are neatly ordered as 11, 12, 13, 14.
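Putting that comment into code (a sketch, assuming jb$frequency really is a factor): convert via character first, because as.integer() applied directly to a factor returns the internal level codes, which is exactly the bug above.
jb$frequency <- as.integer(as.character(jb$frequency))  # undo the factor safely
jb.expanded <- jb[rep(seq_len(nrow(jb)), times = jb$frequency), c("a", "b")]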
