I am trying to combine two CSV files together in a graph_from_data_frame using multiple cores.
I already have the code developed (see below); I just need to adapt it to use more than one core.
Examples of the two CSV files are posted below (the node file first, then the edge file). Because of the volume of data in the CSVs, multiple cores are needed.
id
123
321
231
423
353
534
345
646
346
from to weight
123 456 2
123 435 3
432 654 2
342 543 4
234 323 3
432 543 4
234 543 1
234 654 1
edges <- read.csv("/Users/holly/edgeR.csv", header=T, as.is=T)
nodes <- read.csv("/Users/holly/nodeR.csv", header=T, as.is=T)
#libraries
library(igraph)
library(tictoc)
library(network)
library(data.table)
#Edges data set includes from and to addresses for block 200k from Neo4j
#edges <- read.csv("/Users/jonathanbailey/edges.csv", header=T, as.is=T)
#Node data set contains all the address nodes ids for block 200k
#nodes <- read.csv("/Users/jonathanbailey/nodes.csv", header=T, as.is=T)
#Show titles of data set
head(nodes)
head(edges)
#Remove the weights column
#edges$weights <- NULL
#Removes duplicate values in the nodes data set
nodes <- nodes[!duplicated(nodes$id),]
# coerces the edge data to a matrix and converts the from/to columns to character for igraph
el=as.matrix(edges)
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
#Creates a graph in R with edges and nodes
clustergraph1 <- graph_from_data_frame(el, directed = FALSE, vertices = nodes)
#Applies the Louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)
Is there a way to make the two csv files merge into the graph data frame using parallel cores?
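A hedged sketch of one option: as far as I know, graph_from_data_frame and cluster_louvain themselves run on a single core, so the part that can be spread over cores is reading the two files. The example below reads both CSVs in separate processes with the base parallel package, then checks that every edge endpoint is listed in the node table (graph_from_data_frame stops with an error otherwise). The temporary files and their contents are hypothetical stand-ins for the real CSVs.

```r
library(parallel)

# Hypothetical stand-ins for the two CSV files, written to temporary paths
nodes_path <- tempfile(fileext = ".csv")
edges_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(123, 456, 435, 432, 654)),
          nodes_path, row.names = FALSE)
write.csv(data.frame(from = c(123, 123, 432),
                     to = c(456, 435, 654),
                     weight = c(2, 3, 2)),
          edges_path, row.names = FALSE)

# mclapply() forks worker processes; forking is not available on Windows,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
tables <- mclapply(c(nodes = nodes_path, edges = edges_path),
                   read.csv, mc.cores = n_cores)
nodes <- tables$nodes
edges <- tables$edges

# Edge endpoints that are absent from the node table
missing_ids <- setdiff(unique(c(edges$from, edges$to)), nodes$id)
```

If missing_ids is empty, nodes and edges can be passed to graph_from_data_frame as in the code above.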
I am trying to build a data frame so I can generate a Plot with a specific set of data, but I am having trouble getting the data into a table correctly.
So, here is what I have available from a data query:
> head(c, n=10)
EVTYPE FATALITIES INJURIES
834 TORNADO 5633 91346
856 TSTM WIND 504 6957
170 FLOOD 470 6789
130 EXCESSIVE HEAT 1903 6525
464 LIGHTNING 816 5230
275 HEAT 937 2100
427 ICE STORM 89 1975
153 FLASH FLOOD 978 1777
760 THUNDERSTORM WIND 133 1488
244 HAIL 15 1361
I then tried to generate a set of data variables to build a finished data.frame like this:
a <- c(c[1,1], c[1,2], c[1,3])
b <- c(c[6,1], c[4,2] + c[6,2], c[4,3] + c[6,3])
d <- c(c[2,1], c[2,2], c[2,3])
e <- c(c[3,1], c[3,2], c[3,3])
f <- c(c[5,1], c[5,2], c[5,3])
g <- c(c[7,1], c[7,2], c[7,3])
h <- c(c[8,1], c[8,2], c[8,3])
i <- c(c[9,1], c[9,2], c[9,3])
j <- c(c[10,1], c[10,2], c[10,3])
k <- c(c[11,1], c[11,2], c[11,3])
df <- data.frame(a,b,d,e,f,g,h,i,j)
names(df) <- c("Event", "Fatalities","Injuries")
But that is failing miserably. What I am getting is a long string of all the data variables, repeated 10 times. Nice trick, but that is not what I am looking for.
I would like to get a finished data.frame with ten (10) rows of the data, like it was originally, but with my combined data in place. Is that possible?
I am using R version 3.5.3. and the tidyverse library is not available for install on that version.
Any ideas as to how I can generate that data.frame?
If a barplot is what you're after, here's a piece of code to get you that:
First, you need to get the data in the right format (that's probably what you tried to do in df), by column-binding the two numerical variables using cbind and transposing the resulting matrix using t (i.e., turning rows into columns and vice versa):
plotdata <- t(cbind(c$FATALITIES, c$INJURIES))
Then set the layout to your plot, with a wide margin for the x-axis to accommodate your long factor names:
par(mfrow=c(1,1), mar = c(8,3,3,3))
Now you're ready to plot the data; you grab the labels from c$EVTYPE, reduce the label size in cex.names and rotate them with las to avoid overplotting:
barplot(plotdata, beside=T, names = c$EVTYPE, col=c("red","blue"), cex.names = 0.7, las = 3)
(You can add main = to define the heading of your plot.)
That's the barplot you should obtain:
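For the combined data.frame the question asks about, here is a minimal base-R sketch (no tidyverse needed). It assumes the goal is to fold EXCESSIVE HEAT into HEAT, as the b vector in the question suggests. The name storm is a hypothetical stand-in for the queried data, chosen to keep it distinct from base c().

```r
# Hypothetical copy of the ten rows shown in the question
storm <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING",
             "HEAT", "ICE STORM", "FLASH FLOOD", "THUNDERSTORM WIND", "HAIL"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937, 89, 978, 133, 15),
  INJURIES = c(91346, 6957, 6789, 6525, 5230, 2100, 1975, 1777, 1488, 1361),
  stringsAsFactors = FALSE
)

# Relabel the rows to merge, then sum both numeric columns per event type
storm$EVTYPE[storm$EVTYPE == "EXCESSIVE HEAT"] <- "HEAT"
combined <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                      data = storm, FUN = sum)

# Restore the descending order by injuries and rename the columns
combined <- combined[order(-combined$INJURIES), ]
names(combined) <- c("Event", "Fatalities", "Injuries")
```

Because two of the ten rows are merged, the result has nine rows, one per event type.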
I have a large data set with a field containing a combined FIPS code and zip code, and another data set with population weighted centroids for block groups combined with some zip code data. I want to stratify my data by "FIPS code" and then assign each row a set of coordinates for a block group centroid, where the centroid's probability of being selected is proportional to its population.
I was originally using a sample of the data (1000 rows) and the strata function from the sampling package, which worked fine. Now that I want to do this for every row in the data set, however, I'm getting this error:
Error in strata(popCenters2, stratanames = "FIPS", method = "systematic", :
not enough obervations in the stratum 1
I suspect that this is because strata does not use replacement and my data set is much larger than the centroid data set.
This is the code I used with the strata function applied to my sample:
## Combined fields to match format of other data
popCenters2 <- within(popCenters2,
                      FIPS <- paste(stateFIPS,
                                    countyFIPS,
                                    zipcode,
                                    sep = ""))
sample %>% group_by(FIPS) %>% count() -> sampleCounts
popCenters2[order(popCenters2$FIPS), ] -> popCenters2
sampleCounts[order(sampleCounts$FIPS), ] -> sampleCounts
st = strata(popCenters2, stratanames = "FIPS", method = "systematic",
            size = sampleCounts$n, pik = popCenters2$contribPop)
stTable = getdata(popCenters2, st)
My sample had 5 rows with the "FIPS" variable equal to 4200117325; this is the centroid data corresponding to that:
FIPS tract blkGroup latitude longitude contribPop
4200117325 030200 1 +40.000254 -077.137559 452
4200117325 030200 2 +39.959070 -077.160354 324
4200117325 030400 1 +39.915855 -077.406954 194
4200117325 030400 2 +39.923503 -077.298505 131
4200117325 030400 3 +39.878509 -077.307547 173
4200117325 030400 4 +39.873705 -077.360488 176
4200117325 030400 5 +39.880362 -077.412175 108
4200117325 030500 1 +39.926149 -077.227283 630
4200117325 030500 2 +39.921269 -077.260640 459
My question is, how can I reproduce this sort of procedure if, for example, my actual data set has 20 rows corresponding to 4200117325? I've read through the documentation for the strata function and a few others (Strata from DescTools, the survey package) but have been unable to find anything that allows replacement.
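One possible workaround, sketched in base R rather than with the sampling package: for each FIPS value, draw centroids with replacement using sample.int, with contribPop as the selection weights. This is plain weighted sampling with replacement, not the systematic method used above. The records data frame (20 rows with the same FIPS) is a hypothetical stand-in for the real data set, and the centroid rows are copied from the table in the question.

```r
set.seed(1)  # only to make the sketch reproducible

# Centroid rows copied from the table above
popCenters2 <- data.frame(
  FIPS = "4200117325",
  latitude = c(40.000254, 39.959070, 39.915855, 39.923503, 39.878509,
               39.873705, 39.880362, 39.926149, 39.921269),
  longitude = c(-77.137559, -77.160354, -77.406954, -77.298505, -77.307547,
                -77.360488, -77.412175, -77.227283, -77.260640),
  contribPop = c(452, 324, 194, 131, 173, 176, 108, 630, 459),
  stringsAsFactors = FALSE
)

# Hypothetical data set: 20 rows that all fall in the same stratum
records <- data.frame(FIPS = rep("4200117325", 20), stringsAsFactors = FALSE)

picked <- integer(nrow(records))
for (f in unique(records$FIPS)) {
  need <- which(records$FIPS == f)      # rows that need a centroid
  pool <- which(popCenters2$FIPS == f)  # candidate centroids in this stratum
  # Draw with replacement, probability proportional to population
  picked[need] <- pool[sample.int(length(pool), length(need), replace = TRUE,
                                  prob = popCenters2$contribPop[pool])]
}

records$latitude <- popCenters2$latitude[picked]
records$longitude <- popCenters2$longitude[picked]
```

A FIPS value with no matching centroid rows makes sample.int fail, so such rows would need to be filtered out or handled separately first.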
I want to create a hierarchical cluster to show types of careers and the balance that those who are in those careers have in their bank account.
I have a dataset with two variables, job and balance:
job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88
I want the result to look like this:
Where A, B, C, etc. are the job categories.
Can anyone help me start this or give me some help?
I have no idea how to begin.
Thanks!
You can start by using the dist and hclust functions.
df <- read.table(text = " job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88")
dist computes the distance between each pair of elements (by default, the Euclidean distance):
distances <- dist(df$balance)
You can then cluster your values using the distance matrix generated above:
clusters <- hclust(distances)
By default, hclust applies complete-linkage clustering to your data.
Finally, you can plot your results as a tree:
plot(clusters, labels = df$job)
Here, we clustered all the entries in your data frame, that's why some jobs are duplicated. If you want to have a single value per job, you can for example take the mean balance for each job using tapply:
means <- tapply(df$balance, df$job, mean)
And then cluster the jobs:
distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
You can then try to use other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
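As a sketch of what trying other methods could look like, the example below swaps in the Manhattan distance and average linkage, then cuts the tree into groups with cutree. The choice of three groups is an arbitrary assumption for illustration, and the job and balance vectors simply repeat the ten rows from the question.

```r
# The ten rows from the question
job <- c("unemployed", "services", "management", "management", "blue-collar",
         "management", "self-employed", "technician", "entrepreneur", "services")
balance <- c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)

# One mean balance per job category
means <- tapply(balance, job, mean)

# Manhattan distance and average linkage instead of the defaults
clusters_avg <- hclust(dist(means, method = "manhattan"), method = "average")

# Assign each job to one of k groups (k = 3 is an arbitrary choice)
groups <- cutree(clusters_avg, k = 3)

plot(clusters_avg)
```

groups is a named integer vector, so groups["management"] gives the cluster that job falls into.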
I am doing a multi-part project. To begin with, I had a data set which provided the deposits per district over the years. After scrubbing the data set, I was able to create a data frame which provides the growth of deposits by district. I have growth of deposits for 3 different kinds of institutions (foreign banks, public banks and private banks) in 3 different data frames, as the number of rows differs in each frame. I have been asked to create 3 maps (heat maps) showing deposit growth for each kind of bank.
My data frame looks like the attached picture.
I want to make a heat map for the growth column.
Thanks.
This answer may be off-topic, so feel free to delete it without hesitation.
I'll show you how I make some heatmaps in R:
Fake data:
Gene Patient_A Patient_B Patient_C Patient_D
BRCA1 52 46 124 148
TP53 512 487 112 121
FOX3D 841 658 321 364
MAPK1 895 541 198 254
RASA1 785 554 125 69
ADAM18 12 65 85 121
library(gplots)  # provides heatmap.2() and redgreen()
hmcols <- rev(redgreen(2750))
heatmap.2(hm_mx, scale="row", key=TRUE, lhei=c(2,5), symkey=FALSE, density.info="none", trace="none", cexRow=1.1, cexCol=1.1, col=hmcols, dendrogram = "none")
If you load the data with read.table, you will probably have to convert the data frame to a matrix and use the first column as row names to avoid errors from R:
hm <- read.table("hm1.txt", sep = '\t', header=TRUE, stringsAsFactors=FALSE)
row.names(hm) <- hm$Gene
hm_mx <- data.matrix(hm)
hm_mx <- hm_mx[,-c(1)]
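If gplots is not installed, a similar picture can be sketched with base R's heatmap function. The example below rebuilds the fake data table above as a matrix directly, so it does not depend on the hm1.txt file. The colour palette and margins are arbitrary choices.

```r
# Fake data from the table above (rows = genes, columns = patients)
hm_mx <- matrix(
  c( 52,  46, 124, 148,
    512, 487, 112, 121,
    841, 658, 321, 364,
    895, 541, 198, 254,
    785, 554, 125,  69,
     12,  65,  85, 121),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("BRCA1", "TP53", "FOX3D", "MAPK1", "RASA1", "ADAM18"),
                  c("Patient_A", "Patient_B", "Patient_C", "Patient_D"))
)

# Rowv = NA and Colv = NA suppress the dendrograms and any reordering;
# scale = "row" standardises each gene across patients
heatmap(hm_mx, scale = "row", Rowv = NA, Colv = NA,
        col = heat.colors(50), margins = c(8, 6))
```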
I would like to run some social network analysis (SNA). I work with RStudio and the igraph package.
My input data is from a text file (created from Excel as a tab-separated text file).
The data file has 3 columns. The 1st and 2nd columns are the network data (vertices) and the 3rd column is the weight for each edge. I use airport connections data that looks like this:
1 54 28382 (Airport ID Origin Airport / Airport ID Destination Airport / Passenger number as a weight)
I loaded it with these commands:
USAN_num1 <- read.table('USAN_num.txt', header=T)
USAN_g_num1 <- graph.data.frame(USAN_num1)
> summary(USAN_g_num1)
Vertices: 626
Edges: 7078
Directed: TRUE
No graph attributes.
Vertex attributes: name.
Edge attributes: PAX.
Data looks like this:
ORIGN DESTN PAX
1 1 604 646
2 2 42 3736
3 2 118 5189
Now to the problem that occurred:
My network consists of 6 different clusters when I check it with igraph. Even when I draw my network, it has 6 separate parts. That makes no sense, since my data should form one connected network. I checked through my dataset and could not find any separate sub-networks.
Here is the cluster characteristics I get:
$csize
[1] 5 608 2 4 5 2
$no
[1] 6
One vertex in a small cluster is even a huge airport that should be connected to many others and not just 1 other...
UPDATE:
I now updated to the newest igraph version but it still does not work.
I uploaded an exemplary part of my data as a .txt file here: USAN_numS.txt
Would be great if someone has an idea on what I did wrong.
Thank you
So, as I said in my comment above, a possible source of confusion is that your graph has symbolic vertex names that are actually numbers and don't match igraph's vertex ids. The workaround is to drop the vertex names, or to specify them explicitly when creating the graph, so that they match the igraph vertex ids.
But your graph really does have multiple components. In the following code I check the original table: two vertices appear exactly once in the table, and they form a component of two by themselves.
Maybe the network really has multiple components, or there are mistakes in the file.
library(igraph)
USAN_num1 <- read.table('USAN_numS.txt', header=T)
USAN_g_num1 <- graph.data.frame(USAN_num1,
                                vertices=data.frame(id=1:max(USAN_num1[,1:2])))
clu <- clusters(USAN_g_num1)
clu$csize
## [1] 5 607 2 4 5 1 2 1
## The '1's appear because we counted the vertices that are
## not in the table
## Third component has two vertices only, let's check them in the
## original table
which(clu$membership == 3)
## [1] 64 617
## List the table rows where any of these two appear
USAN_num1[ USAN_num1[,1] %in% c(64, 617) | USAN_num1[,2] %in% c(64, 617), ]
## ORIGN DESTN PAX
## 691 64 617 636
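To look for other suspicious rows without going through igraph, one option is to count how often each airport ID occurs across both columns of the table. This is only a sketch: the edges data frame below is a toy stand-in built from the four rows quoted in this thread, not the full USAN_numS.txt file.

```r
# Toy stand-in: the three rows shown in the question plus row 691 from above
edges <- data.frame(ORIGN = c(1, 2, 2, 64),
                    DESTN = c(604, 42, 118, 617),
                    PAX = c(646, 3736, 5189, 636))

# Number of table rows in which each airport ID occurs (either column)
counts <- table(c(edges$ORIGN, edges$DESTN))

# IDs that occur exactly once
once <- as.integer(names(counts)[counts == 1])

# Rows where both endpoints occur only once: each such row is an isolated pair
isolated <- edges[edges$ORIGN %in% once & edges$DESTN %in% once, ]
```

In this toy table nearly every ID occurs only once, so the check is only informative on the full file, where a large airport showing up in isolated would point to a mistake in the data.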