I have an issue that I've been unable to optimise, and I'm sure that either igraph or tidygraph must already provide this function, or there must be a better way to do this. I am using R and igraph, but possibly tidygraph would also do the job.
Problem: How to split a list of over two million edges (node 1 linked to node 2) into their own separate networks, and then label each network with the weight of its highest-weighted node category.
Data:
Edges:

from  to
1     2
3     4
5     6
7     6
8     6
This creates 3 networks. N.B. in the real example we have loops and multiple edges to and from nodes (this is why I've used igraph, as it easily deals with these).
Data: Node categories:

id  cat               weight
1   traffic accident  10
2   abuse             50
3   abuse             50
4   speeding          5
5   murder            100
6   abuse             50
7   speeding          5
8   abuse             50
Final table:
The final table categorises each node and labels each network with the maximum node weight found in that network:
id  idcat             networkid  networkcat
1   traffic accident  1          50
2   abuse             1          50
3   abuse             2          50
4   speeding          2          50
5   murder            3          100
6   abuse             3          100
7   speeding          3          100
8   abuse             3          100
Current iterative solution and code:
If there is no better solution, then maybe we can speed this iteration up?
library(tidyverse)
library(igraph)
library(purrr)     # might be an answer
library(tidygraph) # might be an answer
from <- c(1,3,5,7,8)
to <- c(2,4,6,6,6)
edges <- data.frame(from,to)
id <- c(1,2,3,4,5,6,7,8)
cat <- c("traffic accident","abuse","abuse","speeding","murder","abuse","speeding","abuse")
weight <- c(10,50,50,5,100,50,5,50)
details <- data.frame(id,cat,weight)
g <- graph_from_data_frame(edges)
# we could add the vertex details here as well:
# g <- graph_from_data_frame(edges, vertices = details)
# but we join these in later
plot(g)
dg <- decompose(g) # decomposing the graph defines the separate networks
networks <- data.frame(id = as.integer(),
                       network_id = as.integer())
for (i in 1:length(dg)) { # this is likely too many to do at once. As the networks are already defined we can split this into chunks. There is a case here for parallelisation
  n <- dg[[i]][1] %>%   # using the decomposed list of lists from igraph. There is an issue here as the list comes back with the node as an index; I can't find an easier way to get this out
    as.data.frame() %>% # I can't work out a way to bring out the data without changing to a data frame and then using row names
    row.names() %>%     # and this returns a vector
    as.data.frame() %>%
    rename(id = 1) %>%
    mutate(network_id = i,
           id = as.integer(id))
  networks <- bind_rows(n, networks)
}
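(As an aside, since purrr is flagged above as a possible answer: here is a minimal sketch, assuming dg from decompose(g), that builds the same networks data frame in one pass by reading the vertex names directly with names(V(.x)) instead of round-tripping through row names.)

library(purrr)
networks <- imap_dfr(dg, ~ data.frame(id = as.integer(names(V(.x))),
                                      network_id = .y))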
networks <- networks %>%
inner_join(details) # one way to bring in details
n_weight <- networks %>%
group_by(network_id) %>%
summarise(network_weight=max(weight))
networks <- networks %>%
inner_join(n_weight)
networks # final answer
filtered_n_id <- networks %>%
  filter(network_weight == 100) %>%
  select(network_id) %>%
  distinct() # this brings out just the network IDs of whatever we happen to want
filtered_n <- networks %>%
  filter(network_id %in% filtered_n_id$network_id)
edges %>%
filter(from %in% filtered_n$id | to %in% filtered_n$id ) %>%
graph_from_data_frame() %>%
plot() # returns only the network/s that we want to view
Here is a solution just using igraph and base R.
networkid <- components(g)$membership # component (network) id for each vertex
# max node weight within each network; the ids 1..8 double as indices into weight
Table <- aggregate(id, list(networkid), function(x) { max(weight[x]) })
networkcat <- Table$x[networkid]      # look up each vertex's network maximum
Final <- data.frame(id, idcat = cat, networkid, networkcat)
Final
id idcat networkid networkcat
1 1 traffic accident 1 50
2 2 abuse 1 50
3 3 abuse 2 50
4 4 speeding 2 50
5 5 murder 3 100
6 6 abuse 3 100
7 7 speeding 3 100
8 8 abuse 3 100
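If you would rather stay in the tidyverse, here is a tidygraph sketch of the same idea (untested at the two-million-edge scale), assuming the g and details objects from the question; group_components() assigns each vertex its weak-component id:

library(tidygraph)
library(dplyr)
as_tbl_graph(g) %>%
  activate(nodes) %>%
  mutate(id = as.integer(name),          # vertex names back to integer ids
         networkid = group_components()) %>%
  as_tibble() %>%
  inner_join(details, by = "id") %>%
  group_by(networkid) %>%
  mutate(networkcat = max(weight)) %>%   # max node weight per network
  ungroup() %>%
  select(id, idcat = cat, networkid, networkcat)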
I am looking to add a column to my data that lists, for each observation, its count within the dataset. I have data on NBA teams and each of their games, listed by date, and I want to create a column giving each game's number within the season for each team.
My data looks like this:
# gmDate teamAbbr opptAbbr id
# 2012-10-30 WAS CLE 2012-10-30WAS
# 2012-10-30 CLE WAS 2012-10-30CLE
# 2012-10-30 BOS MIA 2012-10-30BOS
Commas separate each column
I've tried to use "add_count" but this has provided me with the total # of games each team has played in the dataset.
Prior attempts:
nba_box %>% add_count()
I expect the added column to display the # game for each team (1-82), but instead it now shows the total number of games in the dataset (82).
Here is a base R example that approaches the problem from a for-loop standpoint. Given that a team can appear in either column, we keep track of each team's position by unlisting the data and using the table function to sum the previous rows.
# initialize some fake data
test <- as.data.frame(t(replicate(6, sample(LETTERS[1:3], 2))),
                      stringsAsFactors = F)
colnames(test) <- c("team1","team2")
# initialize two new columns
test$team2_gamenum <- test$team1_gamenum <- NA
count <- NULL
for(i in 1:nrow(test)){
out <- c(count, table(unlist(test[i,c("team1","team2")])))
count <- table(rep(names(out), out)) # prob not optimum way of combining two table results
test$team1_gamenum[i] <- count[which(names(count) == test[i,1])]
test$team2_gamenum[i] <- count[which(names(count) == test[i,2])]
}
test
# team1 team2 team1_gamenum team2_gamenum
#1 B A 1 1
#2 A C 2 1
#3 C B 2 2
#4 C B 3 3
#5 A C 3 4
#6 A C 4 5
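For comparison, a tidyverse sketch of the same per-team count, assuming the test data above: pivot_longer() stacks the two team columns so that row_number() can run within each team.

library(dplyr)
library(tidyr)
test %>%
  select(team1, team2) %>%
  mutate(game = row_number()) %>%        # original fixture order
  pivot_longer(c(team1, team2), values_to = "team") %>%
  group_by(team) %>%
  mutate(gamenum = row_number()) %>%     # running game count per team
  ungroup()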
I want to create a table from the existing data. I have 5 varieties and 3 clusters in the data. In the expected table I want to show the number and the names of the varieties in each cluster, but I cannot work out how. This is my data:
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
variety cluster
1 a 1
2 b 2
3 c 2
4 d 3
5 e 1
My desirable table is like this:

cluster  number  variety name
1        2       a, e
2        2       b, c
3        1       d
I would be grateful if anyone helps me.
The following can give the results you're looking for:
library(plyr)
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
ddply(data, .(cluster), summarise,
      n = length(variety),
      group = paste(variety, collapse = ','))
Here is one option with the tidyverse. Grouped by 'cluster', get the number of rows (n()) and paste the 'variety' values into a single string (toString).
library(tidyverse)
data %>%
group_by(cluster) %>%
summarise(number = n(), variety_name = toString(variety))
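A base R sketch of the same summary, for reference (two aggregate() passes merged on cluster; the column names here are just illustrative):

nums <- aggregate(variety ~ cluster, data, length)    # count per cluster
labs <- aggregate(variety ~ cluster, data, toString)  # collapsed names
merge(setNames(nums, c("cluster", "number")),
      setNames(labs, c("cluster", "variety_name")))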
To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form my single quanteda::dfmSparse object that will be used in natural language processing (NLP) predictions. Carrying around all the 1-grams would be pointless, because only the top ten (or twenty) will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that instructs it to only return the top N features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package" I'm using what appears to be an rbind method for dfmSparse objects to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at thus far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
library(data.table) # for the frequency table below
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50],
           frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
           keep.rownames = F, stringsAsFactors = F)
For extracting the top 10 unigrams, this strategy will work just fine:
1) Sort the dfm in the (default) decreasing order of overall feature frequency, which you have already done, but then add a step to slice out the first 10 columns.
2) Combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
Your example code includes some use of data.table, although that is incidental to the question. In v0.99 we have added a new function textstat_frequency() which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
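As a side note, quanteda also provides topfeatures(), which returns the n most frequent features as a named vector; here is a sketch of using it for the selection step, assuming the objects from the question:

top10 <- names(topfeatures(dfm1gramsSorted, 10))            # top 10 feature names
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = top10)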
I'm dealing with a dataset where I have students' ratings of teachers. Some students rated the same teacher more than once.
What I would like to do with the data is to subset it with the following criteria:
1) Keep all unique student IDs and ratings.
2) In cases where a student rated a teacher twice, keep only one rating, selected at random.
3) If possible I'd like to be able to run the code in a munging script at the top of every analysis file and ensure that the dataset created is exactly the same for each analysis (set seed?).
# data
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id,teacher.id,rating)
Thanks for any guidance for how to move forward.
Assuming that each student.id is only applied to one teacher, you could use the following method.
# get a list containing data.frames for each student
myList <- split(df, df$student.id)
# take a sample of each data.frame if more than one observation or the single observation
# bind the result together into a data.frame
set.seed(1234)
do.call(rbind, lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
This returns
student.id teacher.id rating
1 1 1 100
2 2 1 89
3 3 1 99
4 4 2 87
5 5 2 24
6 6 2 52
7 7 2 99
8 8 2 79
9 9 2 12
If the same student.id rates multiple teachers, then this method requires the construction of a new variable with the interaction function:
# create new interaction variable
df$stud.teach <- interaction(df$student.id, df$teacher.id)
myList <- split(df, df$stud.teach)
then the remainder of the code is identical to that above.
A potentially faster method is to use the data.table library and rbindlist.
library(data.table)
# convert into a data.table
setDT(df)
myList <- split(df, df$stud.teach)
# put together data.frame with rbindlist
rbindlist(lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
This can now be done much faster using data.table. Your question is equivalent to sampling rows from within groups, see
Sample random rows within each group in a data.table
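For completeness, a sketch of the grouped-sampling idiom from that linked question, applied to the df above; the .I/.N expression picks one random row index per student-teacher group:

library(data.table)
setDT(df)
set.seed(1234)
df[df[, .I[sample(.N, 1)], by = .(student.id, teacher.id)]$V1]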
To create some plots, I've already summarized my data using the following approach, which includes all the needed information.
# Load Data
RawDataSet <- read.csv("http://pastebin.com/raw/VP6cF31A", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
# summarising the data
new.df <- RawDataSet %>%
group_by(UserEmail,location,context) %>%
tally() %>%
mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
group_by(UserEmail,location) %>%
mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n))
With some other analysis I've identified distinct user groups. Since I would like to plot my data, it would be great to have a plot visualizing my data in the right order.
The order is based on the UserEmail and is defined by the following:
order <- c("28","27","25","23","22","21","20","16","12","10","9","8","5","4","2","1","29","19","17","15","14","13","7","3","30","26","24","18","11","6")
Asking for the type of my new.df with typeof(new.df), it says that this is a list. I've already tried some approaches like order_by or with_order, but so far I have not managed to order my new.df according to my order vector. Of course, the ordering could also be done in the summarising step.
Is there any way to do so?
I couldn't bring myself to create a vector named order out of respect for the R function by that name. Using match to construct an index to use as the basis for ordering:
sorted.df <- new.df[ order(match(new.df$UserEmail,
                    as.integer(c("28","27","25","23","22","21","20","16","12","10",
                                 "9","8","5","4","2","1","29","19","17","15",
                                 "14","13","7","3","30","26","24","18","11","6")))), ]
head(sorted.df)
#---------------
Source: local data frame [6 x 6]
Groups: UserEmail, location [4]
UserEmail location context n n2 p
(int) (fctr) (fctr) (int) (dbl) (dbl)
1 28 NOT_WITHIN Clicked A 16 -16 -0.8421053
2 28 NOT_WITHIN Clicked B 3 -3 -0.1578947
3 28 WITHIN Clicked A 2 2 1.0000000
4 27 NOT_WITHIN Clicked A 4 -4 -0.8000000
5 27 NOT_WITHIN Clicked B 1 -1 -0.2000000
6 27 WITHIN Clicked A 1 1 1.0000000
(I didn't load plyr or reshape2, since at least one of those packages has a nasty habit of interacting poorly with the dplyr functions.)
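For the record, a dplyr sketch of the same idea that does name the lookup vector (user_order here, to avoid masking base::order) and sorts with arrange():

user_order <- as.integer(c("28","27","25","23","22","21","20","16","12","10",
                           "9","8","5","4","2","1","29","19","17","15",
                           "14","13","7","3","30","26","24","18","11","6"))
sorted.df <- new.df %>%
  ungroup() %>%
  arrange(match(UserEmail, user_order))  # sort by position in user_order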