Creating network data from event-level data in R

As a newbie to network analysis, I am struggling to transform an event-level dataset I want to plot into the correct shape. I am grateful for any hints, leads, etc. What I have done so far broadly follows this introduction.
The dataset in question contains events organized by the political party Jobbik. Each event, identified by a unique id (id), has associated organizational sponsors (org_names) and their type (org). There is no hierarchy between org_1 and org_2, or between org_names1 and org_names2.
Originally the dataset comes in a wide format. Although I am not sure this is what I should be doing, my first step is to transform the data into a long format and clean up the strings a bit. This is the code for reading in the data and getting it into long format:
jobbik <- read.csv("http://eborbath.github.io/stackoverflow/jobbik.csv")
library(tidyverse)
library(stringr)
library(igraph)
# long format
jobbik <- reshape(as.data.frame(jobbik), dir='long',
varying=list(c(3:13), c(14:24)),
v.names=c('org_names', 'org'), times = c(as.character(seq(1:11))))
jobbik$org <- str_trim(jobbik$org, side="both")
jobbik$org_names <- str_trim(jobbik$org_names, side="both")
jobbik <- jobbik %>%
filter(!(org=="no other organizer" & org_names=="")) %>%
filter(!(org=="JOBBIK" & org_names %in% c("Jobbik",
"Jobbik Magyarországért Mozgalom",
"",
"JObbik",
"jobbik",
"aktivisté Jobbiku",
"a Jobbik"))) %>%
mutate(org_names=ifelse(org_names=="", org, org_names)) %>%
distinct(.)
In the next step I want to create the network dataset. To do so, I calculate the number of times each unique organization has been involved in events with Jobbik, add Jobbik as one side of each edge, and plot the data with igraph:
network <- jobbik %>%
select(id, org_names) %>%
group_by(org_names) %>%
summarise(weight = n()) %>%
ungroup() %>%
mutate(from=1,
org_names=as.factor(org_names)) %>%
mutate(org_id=as.numeric(factor(org_names)))
edges <- network %>% select(from, org_id, weight)
nodes <- network %>% select(org_id, org_names) %>%
mutate(org_names=as.character(org_names))
routes_igraph <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
plot(routes_igraph, layout = layout_with_graphopt)
While this runs and creates the network, it only gives me the relationship between each unique organization and Jobbik, but not the relationships among these organizations that do not involve Jobbik. I realize that the error is in my data transformation: I should use the event-level information to calculate the number of times each organizational pair has been involved in organizing something together, and then plot that data. Unfortunately, though, I don't know how to get there. I am grateful for any help.

I am not exactly an expert in network analysis, or igraph in particular, but I think something like this might be helpful.
I changed the preprocessing part of your analysis, because I found a few complications along the way:
Encoding of the Hungarian text: it took some time to find the right encoding (see locale = locale(encoding = 'cp1250') in the read_csv call);
After gathering, I renamed org_names* to org and org* to type;
I use chop to make the spread -> unnest step easier (a small standalone illustration follows the code below);
I tried to make the filter call shorter, but without much success;
I use stringr::str_to_title() to unify the org variable, because some names differ only in whether a given word is capitalized;
I use coalesce to fill NAs in the org variable with values from the type variable.
library(tidyverse)
library(magrittr)
library(igraph)
jobbik <- read_csv(
"http://eborbath.github.io/stackoverflow/jobbik.csv",
trim_ws = T,
locale = locale(encoding = 'cp1250')
)
jobbik %<>%
gather('key', 'val', -c('id', 'date')) %>%
mutate(
key = case_when(
grepl('^org_names\\d+$', key) ~ 'org',
grepl('^org\\d+$', key) ~ 'type',
TRUE ~ key
)
) %>%
chop(val) %>%
spread(key, val) %>%
unnest(c(org, type)) %>%
filter(
!(is.na(org) & (type == 'no other organizer')) &
!((is.na(org) | grepl('.*jobbik.*', org, T )) & (type == 'JOBBIK'))
) %>%
mutate(org = str_to_title(coalesce(org, type)))
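As an aside, here is a tiny standalone illustration of the chop -> spread -> unnest trick on made-up toy data (not from the question): spread() alone complains about duplicate id/key combinations, which is exactly what chop() avoids.
# (tidyverse already loaded above)
toy <- tibble(
  id  = c(1, 1, 1, 1),
  key = c("org", "org", "type", "type"),
  val = c("A", "B", "civic", "party")
)
toy %>%
  chop(val) %>%          # one list cell per id/key combination, so no duplicates remain
  spread(key, val) %>%   # spreading now succeeds
  unnest(c(org, type))   # expand the parallel list columns back into rows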
To form the data frame of graph edges, I group by the id of the event, filter out all events supported by only one organization (since there is no connection with other organizations), and finally create pairs within each id between the organizations with the combn function. The result is a character vector Org A-Org B which, after unnesting, I separate into two columns from and to using - as the split (which is potentially dangerous if an organization name contains a - symbol; a safer variant is sketched after the code below). I also filter out any self loops. The last operation is count, to calculate how frequently each individual pair appears across the list of Jobbik meetings. I assign it to width because, when plotting, igraph's plot will use it as the width of the edges.
ed <- jobbik %>%
group_by(id) %>%
filter(n() > 1) %>%
summarise(edge = list(combn(org, 2, paste, collapse = '-'))) %>%
unnest(edge) %>%
separate(edge, into = c('from', 'to'), sep = '-') %>%
filter(from != to) %>%
count(from, to, name = 'width')
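If an organization name could itself contain a -, a slightly safer variant (only a sketch, not tested against the full dataset) keeps each pair as two columns instead of pasting and re-splitting a string:
ed <- jobbik %>%   # jobbik as prepared above, tidyverse loaded
  group_by(id) %>%
  filter(n() > 1) %>%
  # keep each pair as two separate columns rather than a pasted "A-B" string
  summarise(pair = list(as.data.frame(t(combn(org, 2)), stringsAsFactors = FALSE))) %>%
  unnest(pair) %>%
  rename(from = V1, to = V2) %>%
  filter(from != to) %>%
  count(from, to, name = 'width')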
A similar step is performed for the vertices. Here I add extra information for each vertex, namely the event id, date and organization type (which you could use further), color, mapping the number of times a given organization supported Jobbik, and some additional graphical parameters for the later plot.
nd <- jobbik %>%
filter(org %in% c(ed$from, ed$to)) %>%
group_by(name = org) %>%
summarise(
id = sprintf('Event ids: %s', paste(id, collapse = ', ')),
date = sprintf('Event dates: %s', paste(date, collapse = ', ')),
type = sprintf('Org. type: %s', paste(type, collapse = '; ')),
color = n()
) %>%
ungroup() %>%
mutate(
color = heat.colors(10)[cut(color, 10)],
frame.color = NA,
label.dist = 1,
label.cex = .5,
label.color = 'gray10'
)
With these data we can make an undirected graph using the graph_from_data_frame() function:
g <- graph_from_data_frame(ed, F, nd)
vertex_attr(g, 'size') <- degree(g, mode = 'all')
In the second line above, I add the vertex attribute size to map the degree of each vertex to its plotted size.
And finally, to plot the community, I can simply do:
plot(
g,
edge.curved = .2,
layout = layout_with_kk,
asp = 1,
main = 'Jobbik interaction network'
)

Reduce duplicated entries considering more than one column

I have a long dataset in which there are duplicated entries whose data I need to merge, e.g. paste values together.
In my case, I have a database of scientific articles: the strongest unique identifiers are the DOI and the article title, but the first may be missing in one of the copies, and the second may have slight phonetic/graphic differences that are easy to spot for humans but not programmatically (e.g. one copy uses β and the other plain beta).
A "match" are two articles that share at least one of the two columns. That is, I need a way to dplyr::group_by by the DOI OR the article title (usual group_by uses an AND logic).
The only solution that comes to my mind is to repeat the aggregation twice, for each column. Not very efficient given the large number of records.
Example:
imagine an input like:
df <- data.frame(
ID = c(1, NA, 2, 2),
Title = c('A', 'A', 'beta', 'β'),
to.join = 1:4
)
After (OR-)grouping and summarising:
df %>%
group_by_OR(ID, Title) %>% # dummy function
summarise(
ID = na.omit(ID)[1],
Title = Title[1],
joined = paste(to.join, collapse = ', '))
I should get something like this:
  ID Title joined
1  1     A   1, 2
2  2  beta   3, 4
That is, the data was grouped by the title for the first group and by the id for the second.
I don't think you can avoid grouping the data twice, but we can do it sequentially; that way we are as efficient as possible.
library(dplyr)
df_aggregated <- df %>%
group_by(ID) %>%
arrange(Title) %>%
summarise(Title = first(Title),
to.join = paste0(to.join, collapse=", ")) %>%
group_by(Title) %>%
arrange(ID) %>%
summarise(ID = first(ID),
to.join = paste0(to.join, collapse=", ")) %>%
select(ID, Title, joined=to.join) %>%
as.data.frame()
Now,
df_aggregated
is:
  ID Title joined
1  1     A   1, 2
2  2  beta   3, 4
Eventually I found a solution, thanks also to @dario.
First I group by Title and impute the missing DOIs when at least one of the copies has one. Then I ungroup and create a new unique ID, using the DOI if present and the Title for entries where no copy has a DOI.
Finally I group and summarise by this ID (a small worked example on the data from the question follows the code below).
This way the computationally heavy summarising step is done only once.
records %>%
mutate(
uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+') # Improve matching between slightly different copies
) %>%
group_by(uID) %>%
mutate(DOI = na.omit(DOI)[1]) %>%
ungroup() %>%
mutate(
uID = ifelse(is.na(DOI), uID, DOI)
) %>%
group_by(uID) %>%
summarise(...) # various stuff here.
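For illustration, roughly the same idea applied to the small example data from the question, treating ID as the DOI analogue (a sketch only; the real data would use DOI and the full summarise):
library(dplyr)
library(stringr)

df <- data.frame(
  ID = c(1, NA, 2, 2),
  Title = c('A', 'A', 'beta', 'β'),
  to.join = 1:4
)

df %>%
  # normalized title used for the first grouping pass
  mutate(uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+')) %>%
  group_by(uID) %>%
  mutate(ID = na.omit(ID)[1]) %>%   # impute the missing ID within title groups
  ungroup() %>%
  # use the ID when present, otherwise fall back to the normalized title
  mutate(uID = ifelse(is.na(ID), uID, ID)) %>%
  group_by(uID) %>%
  summarise(
    ID = ID[1],
    Title = Title[1],
    joined = paste(to.join, collapse = ', ')
  )
# two rows, as desired: (1, A, "1, 2") and (2, beta, "3, 4")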

Showing flows for ggalluvial

Seeking some advice around the use of ggalluvial to demonstrate the distribution of preferences in Australia.
Context: in Australia we have preferential voting. Say I live in an area with 4 candidates contesting.
The ballot is completed by numbering the boxes 1-4 according to your party/candidate preference.
The candidate with the lowest proportion of the vote after the first count is eliminated and their votes are apportioned to whichever candidates their voters have indicated on their ballot papers. This process is repeated until two candidates remain, and a candidate is elected when they have greater than 50% of the two-party-preferred vote.
I'm seeking to visualise this repeated distribution process with a flow diagram using ggalluvial.
However, I can't quite seem to map the aesthetics so as to show the flows feeding votes to candidates in the next count of the votes.
Here's what I get so far:
library(tidyverse)
library(magrittr)
library(ggalluvial)
Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(4,9,10,14)
cooper %>% ggplot(aes(x = CountNumber, alluvium = PartyNm, stratum = `Preference Percent`, y = `Preference Percent`, fill = PartyAb)) +
geom_alluvium(aes(fill = PartyAb), decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
geom_text(stat = "stratum",decreasing = TRUE, aes(label = after_stat(fill))) +
stat_stratum(decreasing = TRUE) +
stat_stratum(geom = "text", aes(label = PartyAb), decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()
Output image
Would appreciate any guidance on how to show where the votes after each subsequent count are flowing to which political party in the next stratum.
Unfortunately your dataset is not well suited for the kind of plot you have in mind. While the plotting itself is easy, achieving the desired plot involves "some" data wrangling and preparation steps.
The general issue is that your dataset, as is, does not show the flow of votes from one party to another. It only shows the overall number of votes a party lost or received in each count.
However, as only one party drops out in each step, this missing information can be extracted from your data. The basic idea is to split the observations for each party, or more precisely for each party that drops out in one of the later counts, by the voters' secondary party preference.
I'm not sure whether each step is clear, but I added some explanations as comments, and I added a plot of the final structure of the dataset, which hopefully makes clearer what the final result of all the steps is:
library(tidyverse)
library(magrittr)
library(ggalluvial)
# Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(count = CountNumber, party = PartyAb, pref = `Preference Count`, trans = `Transfer Count`)
# Helper function to
make_rows <- function(x) {
# Name of party which gets dropped in this period
dropped <- filter(x, trans < 0) %>% pull(party)
if (length(dropped) > 0) {
x <- filter(x, trans >= 0)
# Replacements are added two times. Once for the period where the party drops out,
# and also for the previous period
xdrop <- mutate(x, party = dropped, pref = trans, trans = 0, is_drop = FALSE)
xdrop1 <- mutate(xdrop, count = count - 1, to = party, is_drop = FALSE)
# For the parties to keep or which receive transfered votes have to adjust the number of votes
xkeep <- mutate(x, pref = pref - trans, trans = 0)
bind_rows(xdrop1, xdrop, xkeep)
} else {
x
}
}
cooper1 <- cooper %>%
# First: Convert count to a numeric. Add a "to" variable for second
# party preference or the party where votes are transferred to. This variable
# will later on be mapped on the "fill" aes
mutate(to = party, count = as.numeric(as.character(count))) %>%
group_by(party) %>%
# Add identifier of obs. to drop. Obs. to drop are obs. of parties which
# drop out in the following count
mutate(is_drop = lead(trans, default = 0) < 0) %>%
ungroup() %>%
# Split obs. to be dropped by secondary party preference, i.e. in count 0 the
# obs for party "IND" is replaced by seven obs. reflecting the secondary preference
# for one of the other seven parties
split(.$count) %>%
map(make_rows) %>%
bind_rows() %>%
# Now drop original obs.
filter(!is_drop, pref > 0) %>%
# Add a unique identifier
group_by(count, party) %>%
mutate(id = paste0(party, row_number())) %>%
ungroup() %>%
# To make the flow chart work we have make the dataset complete, i.e. add
# "empty" obs for each type of voter and each count
complete(count, id, fill = list(pref = 0, trans = 0, is_drop = FALSE)) %>%
# Fill up party and "to" columns
mutate(across(c(party, to), ~ if_else(is.na(.), str_extract(id, "[^\\d]+"), .))) %>%
# Filling up the "to" column with last observed value for "to" if any
group_by(id) %>%
mutate(last_id = last(which(party != to)),
to = if_else(count >= last_id & !is.na(last_id), to[last_id], to)) %>%
ungroup()
The final structure of the dataset could be illustrated by means of a tile plot:
cooper1 %>%
add_count(count, party) %>%
ggplot(aes(count, reorder(id, n), fill = to)) +
geom_tile(color = "white")
As I said, after all the cumbersome data wrangling, making the flow chart itself is the easiest task and can be achieved like so:
cooper1 %>%
ggplot(aes(x = count, alluvium = id, stratum = to, y = pref, fill = to)) +
geom_flow(decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()

Is there a way to loop through different levels of a factor for anomaly detection

I am using the 'anomalize' package for anomaly detection. My data consists of three columns, the date, an agent (this is where the different levels come from), and the number of schedules that agent had on a particular day. I can run the anomaly detection just fine when I remove the 'agent' column and sum the number of consults by day using this code:
df <- scheds %>%
group_by(date) %>%
summarise(
new_scheds = sum(new_scheds)
)
df_ts <- df %>% rownames_to_column() %>% as_tibble() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>% select(-one_of('rowname'))
df_ts <- df_ts[order(df_ts$date),]
########## TS Decomp ###############
df_ts %>%
time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
plot_anomaly_decomposition()
But I cannot find out how I would do this same type of thing for each agent individually without manually typing everything out and using filter(). I have tried the following loop with no luck:
agents <- levels(ts_agents$agent)
results <- matrix(NA, length(agents))
for(i in 1:length(agents)){
ts_agents %>%
time_decompose(new_scheds)[i] %>%
anomalize(remainder)[i] %>%
time_recompose()[i] %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)[i] }
but I get the following error:
'Error in time_decompose(new_scheds) : object 'new_scheds' not found'
Any tips or pointers would be greatly appreciated!
The reason for your error is that the pipe operator %>% doesn't work the way you expect when you combine it with subsetting such as time_decompose(new_scheds)[i].
If you enclose each step in braces and use . to refer to the piped input, you will avoid this error:
for(i in 1:length(agents)){
ts_agents %>% {
time_decompose(., new_scheds)[i]
} %>% {
...
This fixes the immediate problem of the error, but I'm not sure how well the subsetting will work. It may be that you need filter() in the loop, or even group_by(df, agent) without any loop at all. (If you provide a full reproducible example including data, it will be easier to help.)
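For example, a rough sketch of the filter() approach, assuming ts_agents is a tibble with columns date, agent and new_scheds (untested, since no data was provided); the grouped alternative follows at the end:
library(dplyr)
library(anomalize)

agents <- unique(ts_agents$agent)

for (a in agents) {
  p <- ts_agents %>%
    filter(agent == a) %>%                          # one agent at a time
    time_decompose(new_scheds, method = "stl") %>%
    anomalize(remainder, method = "gesd", alpha = 0.05) %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE) +
    ggplot2::ggtitle(paste("Agent:", a))            # label each plot
  print(p)
}

# Or, since anomalize supports grouped tibbles, possibly no loop at all:
ts_agents %>%
  group_by(agent) %>%
  time_decompose(new_scheds, method = "stl") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3)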

Efficiently create data.frames for a changing number of input csv files with identical 'tidy' format and size

I can't figure out how to:
efficiently create, with rbind or another way, a data.frame compiling the csv-derived data.frames, whose number varies between projects. Or similarly:
efficiently create a data.frame of the differences between a csv-derived "baseline scenario"'s values and those of the rest of the csv-based alternative scenarios.
The csvs are timeseries of hydrologic model output, already in long 'tidy' format, and they're identical in format, size, and order -- there are just different numbers of them for different projects. There are always at least two, a baseline and an alternative, but there are usually quite a few. E.g., Project A might have four csvs/scenarios and Project B might have thirty csvs/scenarios.
I'm hoping to have one code template that will efficiently accommodate projects with any number of scenarios. Without an efficient way, I need to add or delete quite a few lines to match the number of scenarios I have on a sub-daily basis, so it's a time-consuming step I'd like to avoid. After df and df_diff are created, both are used for later summaries and plots.
I'll manually enter the names of the scenarios as they always differ, eg:
library(dplyr)
scenarios <- c("baseline", "alt1", "alt1b", "no dam")
length(scenarios) will always match the number of CSVs I have for a given project.
Read in the csvs (one csv for each scenario) and keep them unmodified for later, separate processing:
#In my case these csv#s are from a separate file's list of csvs,
#eg csv1 <- read.csv("baseline.csv")
# csv2 <- read.csv("alt1.csv"), etc - all tidy monthly timeseries of many variables
#For reproducibility, simplifying:
csv1 <- data.frame("variable" = "x", "value" = 13) #baseline scenario
csv2 <- data.frame("variable" = "x", "value" = 5) #"alternative 1"
csv3 <- data.frame("variable" = "x", "value" = 109) #"alternative 1b"
csv4 <- data.frame("variable" = "x", "value" = 11) #"dam removal"
#csv5 <- data.frame("variable" = "x", "value" = 2.5) #"100 extra flow for salmon sep-dec"
#...
#csv30 <- data.frame("variable" = "x", "value" = 41) #"alternative H3"
Copy the csvs and connect data to scenario:
baseline <- csv1 %>% mutate(scenario = as.factor(paste0(scenarios[1])))
scen2 <- csv2 %>% mutate(scenario = as.factor(paste0(scenarios[2])))
scen3 <- csv3 %>% mutate(scenario = as.factor(paste0(scenarios[3])))
scen4 <- csv4 %>% mutate(scenario = as.factor(paste0(scenarios[4])))
df <- rbind(baseline, scen2, scen3, scen4) #data.frame #1 I'm looking for.
#eg, if csv1-csv30 were included, how to compile in df efficiently, w/o needing the "scen" lines?
There are 4 scenarios in this case, so df$scenario has 4 levels.
Now for the second "difference" data.frame:
bslnevals <- baseline %>% select(value)
scen2vals <- scen2 %>% select(value)
scen3vals <- scen3 %>% select(value)
scen4vals <- scen4 %>% select(value)
scen2diff <- (scen2vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[2], " - baseline"))) %>%
data.frame(scen2) %>% select(-value, -scenario)
scen3diff <- (scen3vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[3], " - baseline"))) %>%
data.frame(scen3) %>% select(-value, -scenario)
scen4diff <- (scen4vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[4], " - baseline"))) %>%
data.frame(scen4) %>% select(-value, -scenario)
df_diff <- rbind(scen2diff, scen3diff, scen4diff) #data.frame #2 I'm looking for.
#same as above, if csv1 - csv30 were included, how to compile in df_diff efficiently, w/o
#needing the "scen#vals" and "scen#diff" lines?
rm(baseline, scen2, scen3, scen4) #declutter - now unneeded (but csv1, csv2, etc orig csv#s needed later)
rm(bslnevals, scen2vals, scen3vals, scen4vals) #unneeded
rm(scen2diff, scen3diff, scen4diff) #unneeded
With 4 scenarios, there are 3 differences from the baseline so df_diff$scenario has 3 levels.
So, whether I have 4 csvs (1 baseline, 3 alternatives) or maybe 30 csvs (1 baseline, 29 alternatives), I tried to write functions and for loops that would assign the scen2, scen3, ... scen28 and scen2diff, scen3diff, ... scen28diff variables dynamically, but I failed. I'm looking for a way that works and doesn't need much modification when applied to a project with any number of scenarios: a clean way for a user to create df and df_diff for however many scenarios (i.e. csvs) happen to be given to me or them for a given project.
Any help is greatly appreciated.
I can't test with your real data, but this may be a good starting point for refactoring your code. I use case_when to generate rules that map the name of each CSV file to a scenario, and I subtract the baseline value from the value in each scenario.
library(dplyr)
library(readr)
library(purrr)
library(tidyr)
baseline_df <- read_csv("baseline.csv") %>%
mutate(id = row_number())
# list all csv files (in current directory), then read them all, and row-bind them.
# use case_when to apply rules to change filenames to "scenarios" (grepl to check presence of string)
# join with baseline df (by scenario row number) for easy subtracting.
# calculate differences values.
# remove baseline-baseline rows (diff is 0)
diff_df <- list.files(path = getwd(), pattern = "*.csv", full.names = TRUE) %>%
tibble(filename = .) %>%
mutate(data = map(filename, read_csv)) %>%
unnest() %>%
mutate(scenario = case_when(
grepl("baseline", filename) ~ "baseline",
grepl("alternative1", filename) ~ "alt1",
grepl("alternative2", filename) ~ "alt2",
grepl("dam_removal", filename) ~ "no dam",
TRUE ~ "other"
)) %>%
group_by(scenario) %>%
mutate(id = row_number()) %>%
left_join(baseline_df, by = "id", suffix = c("_new", "_baseline")) %>%
mutate(Value_diff = Value_new - Value_baseline) %>%
filter(scenario != "baseline")
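If, as in the reproducible example, the CSVs are already read into objects csv1, csv2, ..., a purrr-based variant keyed by your scenarios vector might look roughly like this (a sketch based on the simplified variable/value structure, untested against the real files, and relying on every scenario's rows being in the same order as the baseline):
library(dplyr)
library(purrr)

scenarios <- c("baseline", "alt1", "alt1b", "no dam")
csv_list  <- mget(paste0("csv", seq_along(scenarios)))  # i.e. list(csv1, csv2, csv3, csv4)

# data.frame #1: stack all scenarios, labelled by the scenarios vector
df <- csv_list %>%
  set_names(scenarios) %>%
  bind_rows(.id = "scenario") %>%
  mutate(scenario = as.factor(scenario))

# data.frame #2: each alternative minus the baseline, row by row
baseline_vals <- csv_list[[1]]$value
df_diff <- map2_dfr(
  csv_list[-1], scenarios[-1],
  ~ .x %>%
    transmute(
      variable,
      value_diff    = value - baseline_vals,
      scenario_diff = as.factor(paste0(.y, " - baseline"))
    )
)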

Error All select() inputs must resolve to integer column positions. The following do not:

I am trying to use the dplyr computation below, wrapped in a function so that I can change the column name and dataset name. The code is as follows:
sample_table <- function(byvar = TRUE, dataset = TRUE) {
tcount <-
df2 %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(tcount = n) %>%
left_join(
select(
dataset %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(scount = n), byvar, scount
), by = c("byvar")
) %>%
mutate_each(funs(replace(., is.na(.), 0)),-byvar %>% mutate(
tperc = round(tcount / rcount, digits = 2), sperc = round(scount / samplesize, digits = 2),
absdiff = abs(sperc - tperc)
) %>%
select(byvar, tcount, tperc, scount, sperc, absdiff)
return(tcount)
}
category_Sample1 <- sample_table(byvar = "category", dataset = Sample1)
My function name is sample_table.
The error message is as below:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* byvar
I know this is a repeat question and I have gone through the below links:
Function writing passing column reference to group_by
Error when combining dplyr inside a function
I am not sure where I am going wrong. rcount is the number of rows in df2 and samplesize is the number of rows in the "dataset" data frame. I have to compute the same thing for another variable with three different "dataset" names.
You are mixing column references given as strings (byvar; standard evaluation) with normal references (tcount, tperc, etc.; non-standard evaluation).
Make sure you use one or the other, with the appropriate function: select() or select_(). You can fix your issue by using
select(one_of(c(byvar, 'tcount')))
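For illustration, a minimal sketch with hypothetical toy data (not your df2/dataset): wrap the string in one_of() inside select(), and use a scoped verb such as group_by_at() for grouping by a column name held in a variable:
library(dplyr)

toy <- data.frame(category = c("a", "a", "b"), x = 1:3)
byvar <- "category"

toy %>%
  group_by_at(byvar) %>%               # group by a column given as a string
  tally() %>%
  rename(tcount = n) %>%
  select(one_of(c(byvar, "tcount")))   # select by string column names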
