What is membership in community detection? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I am finding it hard to understand what membership and modularity return and why exactly they are used.
library(igraph)  # provides walktrap.community(); the karate graph is available e.g. via the igraphdata package
wc <- walktrap.community(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
for the above code I get the following when I execute membership:
[1] 1 1 2 1 5 5 5 1 2 2 5 1 1 2 3 3 5 1 3 1 3 1 3 4 4 4 3 4 2 3 2 2 3
for the above code I get the following when I execute modularity:
[1] 0.3532216
I read the documentation, but it is still a bit confusing.

The result of walktrap.community is a partition of your graph into communities, which are numbered with ids from 1 to 5 in your case. The membership function returns a vector of community ids, one for every node in your graph. So in your case node 1 belongs to community 1, and node 3 belongs to community 2.
The partition of the graph into communities is found by optimizing a so-called modularity function. When you call modularity you get the final value of that function after the optimization is complete. A high modularity value indicates a good partition of the graph into clear communities, while a low value indicates the opposite.
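For instance, here is a minimal sketch of how to inspect such a partition, assuming (as is common) that karate is the Zachary karate club graph from the igraphdata package:
library(igraph)
data(karate, package = "igraphdata")   # assumption: the karate graph comes from igraphdata
wc <- walktrap.community(karate)       # newer igraph versions also expose this as cluster_walktrap()
sizes(wc)                              # how many nodes each community id contains
which(membership(wc) == 1)             # the vertices assigned to community 1
modularity(wc)                         # the modularity score of this particular partition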

Related

Should multiple dummy variables start from different numbers when handling multiple categorical features in a data set? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
Considering multiple independent categorical features in a data set, we want to encode the variables in each category. Should the dummy variables be different in each category, or is it reasonable to start the dummies in each category from 0? Consider the following example:
Distance_Group   Airlines_with_HIGHEST_fare   dummies_1
G                Atlantic Airways             0
A                Bahamas Air                  1
B                Bahamas Air                  1
C                Jet Blue                     2
A                United Airline               3

Distance_Group   Airlines_with_LOWEST_fare    dummies_2
F                Jet Blue                     0
E                United Airline               1
A                Lufthansa                    2
G                Georgia Airways              3
Starting each category from 0, Jet Blue corresponds to dummy variable 2 in the first category and to dummy variable 0 in the second.
Is this the right encoding for the two categories?
In case the code is needed to clarify the example, this Python snippet loops over all the unique categories while counting up:
map_dict1 = {}
for token, value in enumerate(Data['Airlines_with_HIGHEST_fare'].unique()):
    map_dict1[value] = token                      # assign the next integer to each unseen airline
Data['Airlines_with_HIGHEST_fare'].replace(map_dict1, inplace=True)
The same logic is applied to the Airlines_with_LOWEST_fare category to encode those airlines.
I am trying to cluster airline fares based on some numerical features such as Distance_Group, number of passengers, etc. The example above shows the two categorical features (the airline names). All of these features are inputs to a neural network, which is why they must be numeric: neural networks do not accept categorical variables directly.

Retrieve Census tract from Coordinates [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
The community reviewed whether to reopen this question 1 year ago and left it closed:
Original close reason(s) were not resolved
I have a dataset with longitude and latitude coordinates. I want to retrieve the corresponding census tract for each point. Is there a dataset or API that would allow me to do this?
My dataset looks like this:
lat lon
1 40.61847 -74.02123
2 40.71348 -73.96551
3 40.69948 -73.96104
4 40.70377 -73.93116
5 40.67859 -73.99049
6 40.71234 -73.92416
I want to add a column with the corresponding census tract.
Final output should look something like this (these are not the right numbers, just an example).
lat lon Census_Tract_Label
1 40.61847 -74.02123 5.01
2 40.71348 -73.96551 20
3 40.69948 -73.96104 41
4 40.70377 -73.93116 52.02
5 40.67859 -73.99049 58
6 40.71234 -73.92416 60
The tigris package includes a function called call_geolocator_latlon that should do what you're looking for. Here is some code using that function with your example coordinates:
> coord <- data.frame(lat = c(40.61847, 40.71348, 40.69948, 40.70377, 40.67859, 40.71234),
+ long = c(-74.02123, -73.96551, -73.96104, -73.93116, -73.99049, -73.92416))
>
> coord$census_code <- apply(coord, 1, function(row) call_geolocator_latlon(row['lat'], row['long']))
> coord
lat long census_code
1 40.61847 -74.02123 360470152003001
2 40.71348 -73.96551 360470551001009
3 40.69948 -73.96104 360470537002011
4 40.70377 -73.93116 360470425003000
5 40.67859 -73.99049 360470077001000
6 40.71234 -73.92416 360470449004075
As I understand it, the 15-digit code is several codes put together: the first two digits are the state, the next three the county, the following six the tract, and the last four the block. To get just the census tract code, I'd use the substr function to pull out those six digits.
> coord$census_tract <- substr(coord$census_code, 6, 11)
> coord
lat long census_code census_tract
1 40.61847 -74.02123 360470152003001 015200
2 40.71348 -73.96551 360470551001009 055100
3 40.69948 -73.96104 360470537002011 053700
4 40.70377 -73.93116 360470425003000 042500
5 40.67859 -73.99049 360470077001000 007700
6 40.71234 -73.92416 360470449004075 044900
I hope that helps!
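Following the same logic, the other pieces of the 15-digit code can be pulled out with substr as well; here is a small sketch reusing the coord data frame from above (the new column names are just illustrative):
coord$state_fips  <- substr(coord$census_code, 1, 2)   # state (2 digits)
coord$county_fips <- substr(coord$census_code, 3, 5)   # county (3 digits)
coord$tract_fips  <- substr(coord$census_code, 6, 11)  # tract (6 digits)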

How to Count Number of Events Currently Elapsing When a New Event Begins from BeginTime and EndTime [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Here is some Example Data:
Begin = c("10-10-2010 12:15:35", "10-10-2010 12:20:52", "10-10-2010 12:23:45", "10-10-2010 12:25:01", "10-10-2010 12:30:29")
End = c("10-10-2010 12:24:23", "10-10-2010 12:23:30", "10-10-2010 12:45:15", "10-10-2010 12:32:11", "10-10-2010 12:45:05")
df = data.frame(Begin, End)
I want to count the number of events that have not yet finished when a new event begins and record that count in a new column. For this particular example, the desired end result is a column with the values: 0, 1, 1, 1, 2.
I need this coded in R, please. I found a way to calculate this in SAS with a lag function, but I do not like that method for various reasons and would like something that works better in R.
In reality I have 36,000 rows and this is dealing with power outages.
Someone asked me to post what I have tried. In SAS I was successful with a lag function, as I said; that method did not work well because you have to hardcode a lot and it is not efficient.
In R I tried sorting by begin time and numbering the rows from 1 to 36,000, then sorting by end time and numbering again, and then applying some if-then logic, but I hit a wall and do not think that approach will work either.
I was told to edit my question to make it available to the community again. The only reason I can imagine is that there are too many possible answers. Well, I didn't edit anything, but I added this excerpt. In programming there will be many answers for any good question that is not the most trivial one (and even those have many answers, especially in R). This is a question I know many people will ask over time, and frankly it is hard to find information online on how to do this in R. The answer to this question was very short and it worked perfectly. It would be a shame not to make this question available to the community, as the point of Stack Overflow is to build a repertoire of great questions so that they come up when people google things along these lines.
Maybe this helps:
library(lubridate)
library(data.table)
df <- as.data.frame(lapply(df, dmy_hms))   # parse the "dd-mm-yyyy hh:mm:ss" strings
dt <- as.data.table(df)
setkey(dt, Begin, End)[, id := .I]         # key on the interval and keep a row id
# foverlaps(dt, dt) pairs every event with the events it overlaps; keeping only the
# partners that started earlier (id > i.id) and counting them per event gives the
# number of still-unfinished events at each Begin time.
merge(dt, foverlaps(dt, dt)[id > i.id, .N, by = "Begin,End"],
      all.x = TRUE)[, id := NULL][is.na(N), N := 0][]
# Begin End N
# 1: 2010-10-10 12:15:35 2010-10-10 12:24:23 0
# 2: 2010-10-10 12:20:52 2010-10-10 12:23:30 1
# 3: 2010-10-10 12:23:45 2010-10-10 12:45:15 1
# 4: 2010-10-10 12:25:01 2010-10-10 12:32:11 1
# 5: 2010-10-10 12:30:29 2010-10-10 12:45:05 2
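If data.table is not an option, a minimal base-R sketch of the same count (assuming Begin and End have already been parsed to POSIXct, for example with lubridate::dmy_hms as above) could be:
# For each event, count the earlier events whose End time lies after this event's Begin.
df$N <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$Begin < df$Begin[i] & df$End > df$Begin[i])
})
df   # reproduces the desired 0, 1, 1, 1, 2 for the example data
This is quadratic in the number of events, so on 36,000 rows the foverlaps approach above will be considerably faster.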

finding the most frequent item using bigmemory techniques and parallel computing? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
How can I find which months have the most frequent delays without using regression? The following CSV is a sample of a 100 MB file. I know I should use bigmemory techniques, but I am not sure how to approach this. Here months are stored as integers, not factors.
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2006,1,11,3,743,745,1024,1018,US,343,N657AW,281,273,223,6,-2,ATL,PHX,1587,45,13,0,,0,0,0,0,0,0
2006,1,11,3,1053,1053,1313,1318,US,613,N834AW,260,265,214,-5,0,ATL,PHX,1587,27,19,0,,0,0,0,0,0,0
2006,1,11,3,1915,1915,2110,2133,US,617,N605AW,235,258,220,-23,0,ATL,PHX,1587,4,11,0,,0,0,0,0,0,0
2006,1,11,3,1753,1755,1925,1933,US,300,N312AW,152,158,126,-8,-2,AUS,PHX,872,16,10,0,,0,0,0,0,0,0
2006,1,11,3,824,832,1015,1015,US,765,N309AW,171,163,132,0,-8,AUS,PHX,872,27,12,0,,0,0,0,0,0,0
2006,1,11,3,627,630,834,832,US,295,N733UW,127,122,108,2,-3,BDL,CLT,644,6,13,0,,0,0,0,0,0,0
2006,1,11,3,825,820,1041,1021,US,349,N177UW,136,121,111,20,5,BDL,CLT,644,4,21,0,,0,0,0,20,0,0
2006,1,11,3,942,945,1155,1148,US,356,N404US,133,123,121,7,-3,BDL,CLT,644,4,8,0,,0,0,0,0,0,0
2006,1,11,3,1239,1245,1438,1445,US,775,N722UW,119,120,103,-7,-6,BDL,CLT,644,4,12,0,,0,0,0,0,0,0
2006,1,11,3,1642,1645,1841,1845,US,1002,N104UW,119,120,105,-4,-3,BDL,CLT,644,4,10,0,,0,0,0,0,0,0
2006,1,11,3,1836,1835,NA,2035,US,1103,N425US,NA,120,NA,NA,1,BDL,CLT,644,0,17,0,,1,0,0,0,0,0
2006,1,11,3,NA,1725,NA,1845,US,69,0,NA,80,NA,NA,NA,BDL,DCA,313,0,0,1,A,0,0,0,0,0,0
Let's say your data.frame is called dd. If you want to see the total weather delay for each month across all years, you can do:
delay <- aggregate(WeatherDelay~Month, dd, sum)
delay[order(-delay$WeatherDelay),]
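If "most frequent delays" means the number of delayed flights rather than the total delay time, a small variation on the same aggregate call counts occurrences instead; this sketch assumes a flight counts as delayed when ArrDelay is positive:
dd$Delayed <- dd$ArrDelay > 0                        # TRUE when the flight arrived late
delay_counts <- aggregate(Delayed ~ Month, dd, sum)  # delayed flights per month
delay_counts[order(-delay_counts$Delayed), ]         # months with the most delays first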
Is this closer to what you want? I don't know R well enough to sum the rows, but this at least aggregates them. I am learning, too!
delays <- read.csv("tmp.csv", stringsAsFactors = FALSE)
delay <- aggregate(cbind(ArrDelay, DepDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay) ~ Month, delays, sum)
delay
It outputs:
Month ArrDelay DepDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
1 1 10 -16 0 0 0 0
2 2 -31 -2 0 0 0 0
3 3 9 -4 0 20 0 0
Note: I changed your data a bit to provide some diversity in the Month column:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2006,1,11,3,743,745,1024,1018,US,343,N657AW,281,273,223,6,-2,ATL,PHX,1587,45,13,0,,0,0,0,0,0,0
2006,1,11,3,1053,1053,1313,1318,US,613,N834AW,260,265,214,-5,0,ATL,PHX,1587,27,19,0,,0,0,0,0,0,0
2006,2,11,3,1915,1915,2110,2133,US,617,N605AW,235,258,220,-23,0,ATL,PHX,1587,4,11,0,,0,0,0,0,0,0
2006,2,11,3,1753,1755,1925,1933,US,300,N312AW,152,158,126,-8,-2,AUS,PHX,872,16,10,0,,0,0,0,0,0,0
2006,1,11,3,824,832,1015,1015,US,765,N309AW,171,163,132,0,-8,AUS,PHX,872,27,12,0,,0,0,0,0,0,0
2006,1,11,3,627,630,834,832,US,295,N733UW,127,122,108,2,-3,BDL,CLT,644,6,13,0,,0,0,0,0,0,0
2006,3,11,3,825,820,1041,1021,US,349,N177UW,136,121,111,20,5,BDL,CLT,644,4,21,0,,0,0,0,20,0,0
2006,1,11,3,942,945,1155,1148,US,356,N404US,133,123,121,7,-3,BDL,CLT,644,4,8,0,,0,0,0,0,0,0
2006,3,11,3,1239,1245,1438,1445,US,775,N722UW,119,120,103,-7,-6,BDL,CLT,644,4,12,0,,0,0,0,0,0,0
2006,3,11,3,1642,1645,1841,1845,US,1002,N104UW,119,120,105,-4,-3,BDL,CLT,644,4,10,0,,0,0,0,0,0,0
2006,3,11,3,1836,1835,NA,2035,US,1103,N425US,NA,120,NA,NA,1,BDL,CLT,644,0,17,0,,1,0,0,0,0,0
2006,1,11,3,NA,1725,NA,1845,US,69,0,NA,80,NA,NA,NA,BDL,DCA,313,0,0,1,A,0,0,0,0,0,0

"Too few positive probabilities" error in R [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I wrote some code in R, but it produces an error. The first error is "Too few positive probabilities", and this leads to NAs, so the code does not work. Can you please take a look and let me know what is wrong? Here are the headings and the first 5 rows of the data (since I do not know how to upload a text file; please tell me how if you do):
year month day n_cases n_controls weekd leapyr
1999 1 1 127 62 6 0
1999 1 2 88 46 7 0
1999 1 3 26 15 1 0
1999 1 4 606 275 2 0
1999 1 5 479 252 3 0
and here is the R code
##########
a<-read.table("e29.txt",header=T)
attach(a)
cases<-a[,4]# fourth column in data "Cases"
data<-cases[1:2555]
weeklydata<-matrix(data,7,365)
y=apply(weeklydata,2,sum)
#
T<-length(y)
N<-1000
a<-0.98
pfstate<-matrix(0,T+1,N)
pfomega<-matrix(0,T+1,N)
pfphi<-matrix(0,T+1,N)#storage of phi
pfb<-matrix(0,T+1,N)#storage of b
wts<-matrix(0,T+1,N)
wnorm<-matrix(0,T+1,N)
set.seed(046)
pfstate[1,]<-rnorm(N,0,100)#rep(0,N)#
pfomega[1,]<-runif(N,0,1)
pfb[1,]<-runif(N,0,5)
wts[1,]<-rep(1/N,N)
for(t in 2:(T+1)){
##compute means and variances of the particle cloud for sigma and omega
meanomega<-weighted.mean(pfomega[t-1,],wts[t-1,])
varomega<-weighted.mean((pfomega[t-1,]-meanomega)^2,wts[t-1,])
meanb<-weighted.mean(pfb[t-1,],wts[t-1,])
varb<-weighted.mean((pfb[t-1,]-meanb)^2,wts[t-1,])
##compute the parameters of gamma kernel
muomega<-a*pfomega[t-1,]+(1-a)*meanomega
var2omega<-(1-a^2)*varomega
alphaomega<-muomega^2/var2omega
betaomega<-muomega/var2omega
mub<-a*pfb[t-1,]+(1-a)*meanb
var2b<-(1-a^2)*varb
alphab<-mub^2/var2b
betab<-mub/var2b
##1.1 draw the auxiliary indicator variables
probs<-wts[t-1,]*dpois(y[t-1],exp(pfstate[t-1,]))
auxInd<-sample(N,N,replace=TRUE,prob=probs)
##1.2 draw the values of variances of sigma and omega and delta
pfomega[t,]<-rgamma(N,shape=alphaomega[auxInd],rate= betaomega[auxInd])
pfb[t,]<-rgamma(N,shape=alphab[auxInd],rate= betab[auxInd])
pfphi[t,]<-(pfb[t,]-1)/(1+pfb[t,])
##1.3 draw the states
pfstate[t,]<-rnorm(N,mean=pfphi[t,]*pfstate[t-1,auxInd],sd=sqrt(pfomega[t,]))
##compute the weigths
wts[t,]<-exp(dpois(y[t-1],exp(pfstate[t,]),log=TRUE)-
dpois(y[t-1],exp(pfstate[t-1,auxInd]),log=TRUE))
#print(wts)
wnorm[t,]<-wts[t,]/sum(wts[t,])
#print(wnorm)
}
### The first error occurs here
Error in sample.int(x, size, replace, prob) :
too few positive probabilities
ESS<-rep(0,T+1)
ESSthr<-N/2
for(t in 2:(T+1)){
ESS[t]<-1/sum(wnorm[t,]^2)
if(ESS[t]<ESSthr){
pfstate[t,]<-sample(pfstate[t,],N,replace=T,prob=wnorm[t,])
wnorm[t,]<-1/N
}
}
#THe second error occurs here
#Error in if (ESS[t] < ESSthr) { : missing value where TRUE/FALSE needed
The problem seems to be here:
probs<-wts[t-1,]*dpois(y[t-1],exp(pfstate[t-1,]))
auxInd<-sample(N,N,replace=TRUE,prob=probs)
It looks like your vector of probabilities becomes all 0s at some point. This can happen, for example, if y[t-1] is very large: dpois(300, 3) evaluates to 0.
By the way, this problem could be an indication that something is wrong conceptually in your experiment design. Since I don't know what you are doing, I can't help there.
Anyway, if you are confident that the algorithm is correct but want to avoid this error, one solution is to use the log form of dpois and then subtract a constant, since all that matters for the call to sample is the relative weights. Something like this might work:
lprobs<-dpois(y[t-1],exp(pfstate[t-1,]),log=T)  # log densities stay finite even when the density underflows
lprobs<-lprobs-max(lprobs)                      # shift so the largest log weight is 0
probs<-wts[t-1,]*exp(lprobs)                    # strictly positive relative weights for sample()
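A quick console check (with made-up numbers) shows both the underflow and why the rescaling restores usable weights for sample():
dpois(300, 3)              # 0: the density underflows to exactly zero
dpois(300, 3, log = TRUE)  # roughly -1088: still finite on the log scale
lp <- dpois(300, c(3, 250, 300), log = TRUE)
exp(lp - max(lp))          # the largest weight becomes 1, so sample() always gets a positive entry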
