Table of Raw Data to Adjacency Matrix/Sociomatrix in R

I have a data table arranged like so:
ID      Category 1   Category 2   Category 3
Name 1  Example 1    Example 2    Example 3
Name 2  Example 1    Example 2    Example 4
Name 3  Example 5    Example 6    Example 4
....    ....         ....         ....
I'm trying to turn it into a table like this:
        Name 1  Name 2  Name 3  ....
Name 1  0       2       0
Name 2  2       0       1
Name 3  0       1       0
....
Each cell in the output table represents how many of the categories were the same when compared between two IDs. This could also be how many of the categories were different; either one will work. I've looked into adjacency matrices and sociomatrices on Stack Overflow, as well as some of the matrix-matching recommendations, but I don't think my data table is set up properly. Does anyone have any recommendations on how this should be done?
EDIT: Ah, apologies. I'm using R as my program; I left that bit out.

You can do this by putting your data into a long format first, at which point it becomes a pretty straightforward exercise:
# your data
tdf <- data.frame(ID = paste0("Name ", 1:3),
                  cat1 = paste0("Example ", c(1, 1, 5)),
                  cat2 = paste0("Example ", c(2, 2, 6)),
                  cat3 = paste0("Example ", c(3, 4, 4)))
tdf
#>       ID      cat1      cat2      cat3
#> 1 Name 1 Example 1 Example 2 Example 3
#> 2 Name 2 Example 1 Example 2 Example 4
#> 3 Name 3 Example 5 Example 6 Example 4
# the categories are extraneous, what matters is the relationship of ID to
# the Example values, so we melt the df to long format using the
# melt function from the package reshape2
lfd <- reshape2::melt(tdf, id.vars = "ID")
#> Warning: attributes are not identical across measure variables; they will
#> be dropped
# create an affiliation matrix
adj1 <- as.matrix(table(lfd$ID, lfd$value))
adj1
#>
#>          Example 1 Example 2 Example 3 Example 4 Example 5 Example 6
#>   Name 1         1         1         1         0         0         0
#>   Name 2         1         1         0         1         0         0
#>   Name 3         0         0         0         1         1         1
# The adjacency matrix is simply the product of the affiliation matrix and its transpose
id_id_adj_mat <- adj1 %*% t(adj1)
# Set the diagonal to zero (currently diagonal displays degree of each node)
diag(id_id_adj_mat) <- 0
id_id_adj_mat
#>
#>          Name 1 Name 2 Name 3
#>   Name 1      0      2      0
#>   Name 2      2      0      1
#>   Name 3      0      1      0
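If you prefer the tidyverse reshaping tools, a roughly equivalent sketch uses tidyr::pivot_longer instead of reshape2::melt; tcrossprod(adj2) is just shorthand for adj2 %*% t(adj2), and the object names lfd2/adj2/id_id below are only illustrative:
library(tidyr)
# pivot the category columns into one long column of Example values
lfd2 <- pivot_longer(tdf, cols = -ID, names_to = "category", values_to = "value")
# affiliation matrix (IDs x Examples), then ID-by-ID shared-category counts
adj2 <- as.matrix(table(lfd2$ID, lfd2$value))
id_id <- tcrossprod(adj2)
diag(id_id) <- 0
id_id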

Related

How do you prepare longitudinal data for survival analysis with various specifications?

I have a question regarding longitudinal study analysis and working with R.
I have the following data format:
ID Visit Behaviour Distance_to_first_visit_in_month
1 0 1 0
1 1 1 6
1 2 1 12
1 3 1 50
2 0 3 0
2 1 3 8
2 2 3 16
2 3 3 25
2 4 3 40
2 5 3 60
3 0 1 0
3 1 1 6
3 2 1 12
3 3 3 24
3 4 3 30
3 5 3 55
I need the data in the following format:
ID Visit Behaviour Distance_to_first_visit_in_month Status
1 0 1 0 0
2 0 3 0 1
3 3 3 24 1
If a person has Behaviour 1 at every visit until the end, he should only be censored, because the study is finished. If a person has Behaviour 3 for the first time, I need the Distance_to_first_visit_in_month at that visit, because that is where he gets status 1 in the Kaplan-Meier curve.
I tried to filter the maximal Distance_to_first_visit_in_month and get the Behaviour. When I bring the data to wide format it is easy to get those. But I can't get the Distance_to_first_visit_in_month when the person has Behaviour 3 at the beginning, or in the other cases.
I have 300 IDs with sometimes 11 visits, so I can't prepare the data manually.
Do you have an idea?
Thank you in advance.
Best Christina
As you don't explain how to aggregate your data to the second dataset, I can only show you how to get the ID's that match your conditions and how to implement the status variable. See this example:
library(dplyr)
# get IDs whose Behaviour is 1 at every visit
id_list1 <- lapply(df %>% split(.$ID), function(x){
  if(all(x$Behaviour == 1)){
    return(unique(x$ID))
  }
}) %>%
  unlist()
# get IDs with 3 as first value
id_list3 <- lapply(df %>% split(.$ID), function(x){
  if(x[x$Visit == 0, "Behaviour"] == 3){
    return(unique(x$ID))
  }
}) %>%
  unlist()
df %>%
  mutate(Status = ifelse(ID %in% id_list3, 1, 0)) %>%
  mutate(new_dist = ifelse(!ID %in% id_list3, Distance_to_first_visit_in_month, NA))
Please note that you'll get named vectors in id_list1 and id_list3. There are no duplicates; the name of each element simply matches the element.
And by "at the beginning", do you mean Visit number 0? If not, you'll have to adjust x$Visit == 0.
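If the end goal is one row per ID for the Kaplan-Meier fit (event = first visit with Behaviour 3, otherwise censored at the last visit), a minimal dplyr sketch under that assumption could look like the following; adjust the censoring time if your rule differs:
library(dplyr)
surv_df <- df %>%
  group_by(ID) %>%
  summarise(
    Status = as.integer(any(Behaviour == 3)),   # 1 = event observed, 0 = censored
    time   = if (any(Behaviour == 3)) {
               min(Distance_to_first_visit_in_month[Behaviour == 3])  # first visit with Behaviour 3
             } else {
               max(Distance_to_first_visit_in_month)                  # censored at the last visit
             }
  ) %>%
  ungroup()
surv_df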

Subsetting columns by the names of rows of another dataframe

I need to subset the columns of a dataframe (in R), taking into account the row names of another dataframe.
I'm trying to select the representative species of the Brazilian Amazon by subsetting a large Brazilian database according to the percentage of representative locations, information which is in another dataframe.
> a <- data.frame("John" = c(2,1,1,2), "Dora" = c(1,1,3,2), "camilo" = c(1:4),"alex"=c(1,2,1,2))
> a
John Dora camilo alex
1 2 1 1 1
2 1 1 2 2
3 1 3 3 1
4 2 2 4 2
> b <- data.frame("SN" = 1:3, "Age" = c(15,31,2), "Name" = c("John","Dora","alex"))
> b
SN Age Name
1 1 15 John
2 2 31 Dora
3 3 2 alex
> result <- a[,rownames(b)[1:3]]
Error in `[.data.frame`(a, , rownames(b)[1:3]) :
undefined columns selected
I want to get this dataframe
John Dora alex
1 2 1 1
2 1 1 2
3 1 3 1
4 2 2 2
The simple a[,b$Name] does not work because b$Name is considered a factor. Be careful because it won't throw an error but you will get the wrong answer!
But this is easy to fix by using a[, as.character(b$Name)] instead!
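For completeness, a quick check with the example data (explicitly forcing Name to be a factor to reproduce the issue; since R 4.0, data.frame() uses stringsAsFactors = FALSE by default, so b$Name may already be character for you):
a <- data.frame("John" = c(2, 1, 1, 2), "Dora" = c(1, 1, 3, 2),
                "camilo" = 1:4, "alex" = c(1, 2, 1, 2))
b <- data.frame("SN" = 1:3, "Age" = c(15, 31, 2),
                "Name" = c("John", "Dora", "alex"),
                stringsAsFactors = TRUE)   # force a factor, as in older R versions
result <- a[, as.character(b$Name)]        # convert the factor to character before indexing
result
#   John Dora alex
# 1    2    1    1
# 2    1    1    2
# 3    1    3    1
# 4    2    2    2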

How to keep User ID using Rtsne package

I want to use t-SNE to visualize my users' variables, but I also want to be able to join the result back to each user's social information.
Unfortunately, the output of Rtsne doesn't seem to return the data with the user id.
The data looks like this:
client_id recency frequen monetary
1 2 1 1 1
2 3 3 1 2
3 4 1 1 2
4 5 3 1 1
5 6 4 1 2
6 7 5 1 1
and the Rtsne output:
x y
1 -6.415009 -0.4726438
2 -16.027732 -9.3751709
3 17.947615 0.2561859
4 1.589996 13.8016613
5 -9.332319 -13.2144419
6 10.545698 8.2165265
and the code:
tsne = Rtsne(rfm[, -1], dims=2, check_duplicates=F)
Rtsne preserves the input order of the dataframe you pass to it.
Try:
Tsne_with_ID = cbind.data.frame(rfm[, 1], tsne$Y)
and then just fix the first column name:
colnames(Tsne_with_ID)[1] <- paste(colnames(rfm)[1])
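Putting it together, a minimal sketch, assuming rfm is the data frame shown above with client_id in its first column (the tiny perplexity is only needed because the toy example has just 6 rows; Rtsne requires 3 * perplexity < nrow(X) - 1):
library(Rtsne)
set.seed(42)   # t-SNE is stochastic, so fix the seed for reproducibility
tsne <- Rtsne(rfm[, -1], dims = 2, check_duplicates = FALSE, perplexity = 1)
# Rtsne keeps the row order of its input, so the ids can be bound straight back on
Tsne_with_ID <- cbind.data.frame(rfm[, 1], tsne$Y)
colnames(Tsne_with_ID) <- c(colnames(rfm)[1], "x", "y")
Tsne_with_ID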

for loop & if function in R

I was writing a loop with an if statement in R. The table is like below:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use a for loop with an if statement to add another column that counts the rows within each ID, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output1 is the table name):
for (i in 2:nrow(output1)){
  if(output1[i,1] == output[i-1,1]){
    output1[i,"rn"] <- output1[i-1,"rn"] + 1
  }
  else{
    output1[i,"rn"] <- 1
  }
}
But in the result, the Count column values are all 1:
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops try:
output1$rn <- 1
for (i in 2:nrow(output1)){
  if(output1[i,1] == output1[i-1,1]){
    output1[i,"rn"] <- output1[i-1,"rn"] + 1
  }
  else{
    output1[i,"rn"] <- 1
  }
}
With your original code, when you made the call output1[i-1,"rn"] + 1 in the third line of your loop, you were referencing a column ("rn") that didn't yet exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to.
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the package dplyr you can accomplish it quickly with:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use:
output1$rn <- with(output1, ave(as.character(category), ID, FUN=seq))
There are vignettes and tutorials for the two packages mentioned, and help for the last approach is available via ?ave in the R console.
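If you want the base R version to return numeric counts rather than characters, one small variant of the same idea (just counting row positions within each ID) is:
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))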
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
What you want here is actually the numeric factor level of category (which matches the count because each ID's categories start at "a" and run in order). Do this:
df$count=as.numeric(df$category)
This will give output as:
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
provided your category is already a factor. If not, first convert it to a factor:
df$category=as.factor(df$category)
df$count=as.numeric(df$category)

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to the repeated values of the variable s, also taking into account the id associated with each row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried to use both duplicated() and which() so that I could then use subset(), but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's run on from one id into the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
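If you prefer dplyr, an equivalent sketch keeps, within each id, every non-3 row plus only the first s == 3 row; cumsum(s == 3) == 1 flags that first 3 per group:
library(dplyr)
dat %>%
  group_by(id) %>%
  filter(s != 3 | cumsum(s == 3) == 1) %>%
  ungroup()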
