Create Edge list for (900+) nodes (in R)

I have a problem creating an edge list of 406,351 edges (all possible combinations of two nodes out of 902 unique nodes). The nodes correspond to doctors, and each edge weight is the number of patients the two doctors share within a period of time (one year).
I have medical claims data, with one observation being one visit of one patient to a particular doctor.
So far, what worked for me was: first, tabulate the doctors against the patient IDs, obtaining the number of visits each patient had with each doctor. Then, I took the list of unique doctor IDs and created an edge list with all possible 2-element combinations. Finally, I "filled up" the edge list with a for loop that looks at each pair of columns (doctors) in the table and counts how many patients had visits > 0 for both columns (doctors).
This works, but my problem is that the loop is too slow and I would like to know if there is a faster way of doing this.
Here an example of my approach so far:
#DATA
library(data.table)

case_number <- c("123", "3456", "5433", "5678", "9874", "8566")
doctor_id   <- c("333", "444", "555", "333", "666", "555")
patient_id  <- c("1", "2", "2", "2", "1", "1")
DATA <- data.frame(case_number, doctor_id, patient_id)

# Table doctors vs patients: number of visits per patient-doctor pair
table <- as.data.table(as.data.frame.matrix(table(DATA$patient_id,
                                                  DATA$doctor_id)))

# Create edge list: all 2-element combinations of unique doctors
Drs <- unique(DATA$doctor_id)
edge_list <- as.data.table(t(combn(as.vector(unique(Drs)), 2)))

# "Fill up" the edge list: count patients with visits > 0 for both doctors
for (z in colnames(table)) {
  for (y in colnames(table)) {
    edge_list <- edge_list[(V1 == z & V2 == y) | (V2 == z & V1 == y),
                           Weight := nrow(table[table[[z]] > 0 & table[[y]] > 0])]
  }
}

Never mind. I realized that with some matrix algebra I could create an adjacency matrix and then use the network package to create the edge list. Comment if you need the code!
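For anyone who wants a concrete starting point, here is a minimal sketch of that matrix-algebra idea using the example DATA above. It is not the original poster's exact code: it builds the weighted edge list directly from the cross-product rather than going through the network package, but the V1/V2/Weight column names match the loop version.

library(data.table)
# Patient-by-doctor incidence matrix: 1 if the patient visited the doctor at least once
M <- (table(DATA$patient_id, DATA$doctor_id) > 0) * 1
# Doctor-by-doctor matrix of shared patients: crossprod(M) is t(M) %*% M
shared <- crossprod(M)
diag(shared) <- 0                                   # no self-loops
# Convert the adjacency matrix to a weighted edge list, keeping each unordered pair once
idx <- which(upper.tri(shared), arr.ind = TRUE)
edge_list <- data.table(V1     = rownames(shared)[idx[, 1]],
                        V2     = colnames(shared)[idx[, 2]],
                        Weight = shared[idx])

Because crossprod works on the whole incidence matrix at once, this avoids the double loop entirely and scales comfortably to 902 doctors.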

Related

R: Optimally Sharing Cookies Within Groups of Friends

I am working with the R programming language.
Suppose there are 100 people - each person is denoted with an ID from 1:100. Each person can be friends with other people. The dataset can be represented in graph/network format and looks something like this:
# Set the seed for reproducibility
set.seed(123)
# Generate a vector of ID's from 1 to 100
ids <- 1:100
# Initialize an empty data frame to store the "from" and "to" values
edges <- data.frame(from=integer(), to=integer(), stringsAsFactors=FALSE)
# Iterate through the ID's
for (id in ids) {
  # Randomly select a minimum of 1 and a maximum of 8 neighbors for the current ID
  neighbors <- sample(ids[ids != id], size = sample(1:8, size = 1))
  # Add a new row to the data frame for each "to" value
  for (neighbor in neighbors) {
    edges <- rbind(edges, data.frame(from = id, to = neighbor))
  }
}
As we can see, the data can be visualized to reveal the graph/network format:
library(igraph)
library(visNetwork)
# Convert the data frame to an igraph object
g <- graph_from_data_frame(edges, directed=FALSE)
# Plot the graph
plot(g)
# Optional visualization
#visIgraph(g)
Now, suppose each person in this dataset has a certain number of cookies. This looks something like this:
set.seed(123)
cookies = data.frame(id = 1:100, number_of_cookies = c(abs(as.integer(rnorm(25, 15, 5))), abs(as.integer(rnorm(75, 5, 5)))))
Here is my question:
I want to make sure that no person in this dataset has fewer than 12 cookies - that is, if someone has fewer than 12 cookies, they can "pool" their cookies with their neighbors (first-degree neighbors, second-degree neighbors, ... n-th degree neighbors, until the condition is satisfied) and see whether the pooled total now reaches 12 cookies.
However, I also want to make sure that during this pooling process, no pooled group of friends has more than 20 cookies (this might require "un-pooling" neighbors that were previously pooled together).
And finally, if someone is already in a group with other people, that person cannot then be placed into another group (no "double dipping").
I wrote a function that takes an ID as an input and then returns the total number of cookies for this ID and all of this ID's neighbors:
library(data.table)
library(dplyr)
sum_cookies_for_id <- function(id) {
  # Get the IDs connected to the given ID (an ID can appear in either column of edges)
  connected_ids <- unique(c(id,
                            edges$to[edges$from == id],
                            edges$from[edges$to == id]))
  # Sum the number of cookies for the ID and all of its first-degree neighbors
  sum(cookies[cookies$id %in% connected_ids, "number_of_cookies"])
}
# Test the function
sum_cookies_for_id(23)
But beyond this, I am not sure how to continue.
Can someone please show me how I might be able to continue writing the code for this problem? Can Dynamic Programming be used in such an example?
Thanks!
Notes:
I think that this problem might have a "stochastic" nature - depending on which ID you begin with and which other IDs you group it with, you might end up with fewer or more people left below 12 cookies.
I think that writing a "greedy" algorithm that performs random groupings might be the easiest option for such a problem (a rough sketch of this idea follows after these notes).
In the future, I would be interested in seeing how more "sophisticated algorithms" (e.g. Genetic Algorithm) could be used to make groupings such that the fewest number of people with less than 12 cookies are left behind.
Food for Thought: Is it possible that perhaps some pre-existing graph/network clustering algorithm could be used for this problem while taking into consideration these total/sum constraints (e.g. Louvain Community Detection)?
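To make the "greedy" idea from the notes concrete, here is a rough, hedged sketch. It is not a full solution: it only pools first-degree neighbours, makes no attempt at optimality, and the function name and thresholds are illustrative.

pool_greedily <- function(edges, cookies, min_c = 12, max_c = 20) {
  # First-degree neighbours of an id, reading both directions of the edge list
  neighbours <- function(id) unique(c(edges$to[edges$from == id],
                                      edges$from[edges$to == id]))
  used   <- setNames(rep(FALSE, nrow(cookies)), cookies$id)
  groups <- list()
  # Visit people from fewest to most cookies
  for (id in cookies$id[order(cookies$number_of_cookies)]) {
    if (used[as.character(id)]) next
    group <- id
    total <- cookies$number_of_cookies[cookies$id == id]
    if (total >= min_c) { used[as.character(id)] <- TRUE; next }
    # Add unused neighbours while the pooled total stays at or below max_c
    for (nb in neighbours(id)) {
      if (used[as.character(nb)]) next
      nb_cookies <- cookies$number_of_cookies[cookies$id == nb]
      if (total + nb_cookies <= max_c) {
        group <- c(group, nb)
        total <- total + nb_cookies
        if (total >= min_c) break
      }
    }
    if (total >= min_c) {
      # Commit the group only if it reaches the floor; mark members as used
      used[as.character(group)] <- TRUE
      groups[[length(groups) + 1]] <- list(members = group, total = total)
    }
  }
  groups
}
result <- pool_greedily(edges, cookies)
length(result)   # number of committed groups

People are visited from fewest to most cookies; a group is only committed once it reaches the 12-cookie floor, and its members are marked as used so they cannot be pooled again ("no double dipping").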

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the rank of a particular cell. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.
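For the full setting in the question (99 permutations plus the observed matrix), one hedged way to apply the same vectorised idea to every cell at once is to stack the list into an array and rank along the third dimension. This assumes the observed matrix is the last element of the list:

arr      <- simplify2array(ls)      # nrow x ncol x length(ls) array
obs_rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
obs_rank                            # matrix of the observed cell's rank among all matrices

A rank close to length(ls) for a dyad means the observed association rate exceeds most of the randomized ones, which is the one-sided comparison the question is after.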

Conditionally removing duplicates in R (20K observations)

I am currently working in a large data set looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting, I need to delete one of the duplicate observations.
I have this written at the moment,
# Updated Metal Tag Number: flag a row when it repeats the previous RightID
# but carries a different MetalTagNu, then drop the flagged rows afterwards
rows_to_drop <- integer(0)
for (i in 2:nrow(duplicate.rights)) {
  if (duplicate.rights[i, "RightID"] == duplicate.rights[i - 1, "RightID"] &
      duplicate.rights[i, "MetalTagNu"] != duplicate.rights[i - 1, "MetalTagNu"]) {
    rows_to_drop <- c(rows_to_drop, i)
  }
  print(i)
}
if (length(rows_to_drop) > 0) duplicate.rights <- duplicate.rights[-rows_to_drop, ]
The original data frame is set up similarly:
RightID Source Use MetalTagNu
1-0000 Wolf Creek Irrigation N/A
1-0000 Wolf Creek Irrigation 12345
1-0001 Bear River Domestic N/A
1-0002 Beaver Stream Domestic 00001
1-0002 Beaver Stream Irrigation 00001
E.g. right holder 1-0002 is necessary to keep because he is using his water right for two different purposes. However, right holder 1-0000 is an unnecessary repeat.
Right holder 1-0000 I need to eliminate, but right holder 1-0002 is valuable to my end goal. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and original entry will not be next to each other in the dataset.
I am quite the novice, so please forgive my poor previous attempt. I know I can use the lapply function to make this go faster and more efficiently. Any guidance there would be much appreciated.
So I would suggest the following:
1) You say that you want to keep some duplicates (metal tag number was assigned to a specific water right). I don't know what this means. But I assume that it is something like this - if metal tag number = 1 then even if there are duplicates, you want to keep them. So I propose that you take these rows in your data (let's call this data) out:
data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:
deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
3) Now you can merge the two dataframes back together:
final_data <- rbind(data_to_keep, deduped_data)
If this is what you wanted please up-mark and suggest that the answer is correct. Thanks!
Create a new column, key, which is a combination of RightID and Use.
Assuming your dataframe is called df,
df$key <- paste(df$RightID,df$Use)
Then, remove duplicates using this command:
df1 <- df[!duplicated(df$key), ]
df1 will have no duplicate RightID-Use combinations.
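As a quick, hedged check (not part of either answer), here is the key approach applied to the sample rows from the question; NA stands in for the "N/A" entries:

df <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Source     = c("Wolf Creek", "Wolf Creek", "Bear River", "Beaver Stream", "Beaver Stream"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001"),
  stringsAsFactors = FALSE
)
df$key <- paste(df$RightID, df$Use)
df[!duplicated(df$key), ]
# 1-0000 collapses to one row, and both 1-0002 rows (different Use) survive.
# Note: duplicated() keeps the first occurrence, so if you want to keep the row
# that carries the metal tag, order the data so tagged rows come first, e.g.
# df <- df[order(df$RightID, is.na(df$MetalTagNu)), ]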

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted <- svrepdesign(data = els, repweights = ~els$F3F1PNLWT,
                            combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted into a contingency table that shows frequencies. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[Image: mock-up of the desired contingency table]
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes, indexed 1-9.)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code only tells me how many cases occur for each combination. This is a picture of what my current output looks like; there are hundreds more rows, one for each combination, as I keep scrolling down:
[Image: current svyby output, one row per combination of the four variables]
You can see in that picture that I only get one number for F1FCOMP per row – the number of cases who fit the specified combination – a specific subpopulation. I want to know more about that subpopulation. That is, F1COMP has nine different outcomes (indexed 1-9), and I want to see how many of each subpopulation fits into each of the 9 outcomes of F1COMP.
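Not an answer from the thread, but one hedged way to approach this with the survey package is to restrict the design to the subpopulation first with subset() and then tabulate F1FCOMP within it (variable names and codes are taken from the question):

library(survey)
# Keep only the subpopulation of interest; subsetting the design object (rather
# than the raw data frame) keeps the replicate weights intact
sub_design <- subset(els.weighted,
                     (F1RTRCC == 2 | F1RTRCC == 3) &
                       BYS33C == 1 &
                       (F1A10 == 2 | F1A10 == 3) &
                       byurban == 3)
svytable(~F1FCOMP, sub_design)   # weighted frequency of each of the 9 F1FCOMP outcomes

Subsetting the design rather than the data is the usual route to correct subpopulation estimates with replicate-weight designs.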

How to select multiple cells in a matrix and perform an operation on corresponding cells in another matrix of the same size?

I am trying to write an R script to do pollution routing in world rivers, and need some help on selecting matrix cell coordinates and applying these to other matrices of the same dimension.
My data: I have several matrices corresponding to hydrological parameters of world rivers on a half degree grid (360 rows, 720 columns). These matrices represent flow accumulation (how many cells flow into this cell), flow direction (which of the 8 surrounding cells does the load of certain cell flow to) and pollutant load.
My idea: compute pollutant load in each grid cell from the start to the end of a river. I can base this on flow accumulation (low to high). However, each river basin can have multiple cells with the same flow accumulation value.
The problem: I need to select all matrix cells with each value of flow accumulation (low to high), find their coordinates (row, column), and transfer the corresponding pollutant load to the correct adjacent cell using the flow direction matrix. I have tried various ways, but I cannot get the selection of the correct cell coordinates, and their application to another matrix, to work.
I will give an example of what I have tried, using two for loops on a single river basin. In this example, a flow direction value of 1 means that the pollutant load needs to be transferred to the adjacent cell to the right (same row, column + 1):
BasinFlowAccumulation <- FlowAccumulation[Basin]
BasinFlowAccumulationMaximum <- max(BasinFlowAccumulation)
BasinFlowDirection <- FlowDirection[Basin]
BasinPollutant <- Pollutant[Basin]

for (b in 0:BasinFlowAccumulationMaximum) {
  # Row/column coordinates of every cell with the current flow accumulation value
  cells.index <- which(BasinFlowAccumulation == b, arr.ind = TRUE)
  for (j in seq_len(nrow(cells.index))) {
    Row    <- cells.index[j, 1]
    Column <- cells.index[j, 2]
    print(BasinFlowDirection[Row, Column])
    # Flow direction 1: transfer this cell's load to the adjacent cell on the right
    if (BasinFlowDirection[Row, Column] == 1) {
      BasinPollutant[Row, Column + 1] <- BasinPollutant[Row, Column + 1] +
        BasinPollutant[Row, Column]
    }
  }
}
Any advice would be greatly appreciated!
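One way to generalise the single direction handled above is to map every flow-direction code to a (row, column) offset and look it up inside the inner loop. The coding below is an assumption (the question only defines code 1 = right), so adapt it to the actual direction scheme:

# Hypothetical mapping of the 8 flow-direction codes to (row, column) offsets
direction_offsets <- matrix(c( 0,  1,   # 1: right
                               1,  1,   # 2: down-right
                               1,  0,   # 3: down
                               1, -1,   # 4: down-left
                               0, -1,   # 5: left
                              -1, -1,   # 6: up-left
                              -1,  0,   # 7: up
                              -1,  1),  # 8: up-right
                            ncol = 2, byrow = TRUE)
# Inside the inner loop, instead of the single if:
# offset        <- direction_offsets[BasinFlowDirection[Row, Column], ]
# target_row    <- Row + offset[1]
# target_column <- Column + offset[2]
# BasinPollutant[target_row, target_column] <-
#   BasinPollutant[target_row, target_column] + BasinPollutant[Row, Column]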
