I have data representing transit between UK cities.
Transit: 1 if there is a transit between the two cities, otherwise 0
ave.pas: average number of passengers
library(plotly)
library(ggraph)
library(tidyverse)
library(tidygraph)
library(igraph)
library(edgebundleR)
df2 <- data.frame (City1 = c("London", "London", "London", "London" ,"Liverpool","Liverpool","Liverpool" , "Manchester", "Manchester", "Bristol"),
City2 = c("Liverpool", "Manchester", "Bristol","Derby", "Manchester", "Bristol","Derby","Bristol","Derby","Derby"),
Transit = c(1,0,1,1,1,1,1,1,0,1),
ave.pas = c(10,0,11,24,40,45,12,34,0,29))
df2:
City1 City2 Transit ave.pas
1 London Liverpool 1 10
2 London Manchester 0 0
3 London Bristol 1 11
4 London Derby 1 24
5 Liverpool Manchester 1 40
6 Liverpool Bristol 1 45
7 Liverpool Derby 1 12
8 Manchester Bristol 1 34
9 Manchester Derby 0 0
10 Bristol Derby 1 29
Now I plot a circular network:
df <- subset(df2, Transit== 1, select = c("City1","City2"))
edgebundle(graph_from_data_frame(df), directed = FALSE, tension = 0.1, fontsize = 10)
My goal is to set the size or colour intensity of the edges based on the corresponding value of the 'ave.pas' variable in the dataset.
linked links: link1 link2 link3 link4
(Plot must be made using edgebundle() function)
The intensity of the edges in the linked plots appears to be a function of the number of edges joining the vertices. We can make the number of edges equal to the number of passengers, but the problem here is that after a few lines are plotted on top of each other, the intensity stops increasing. It is therefore good for showing the difference between, say, 1 and 3 edges, but the difference between 10 and 30 is much less obvious. As a compromise, we can make the number of edges approximately proportional to the number of passengers. One way to do this is to create the graph from an adjacency matrix:
cities <- unique(c(df2$City1, df2$City2))
m <- matrix(0, nrow = length(cities), ncol = length(cities),
dimnames = list(cities, cities))
for(i in seq(nrow(df2))) m[df2[i, 1], df2[i, 2]] <- df2[i, 4]
m <- m/min(m[m > 0])
edgebundle(graph_from_adjacency_matrix(m))
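To see how many parallel edges each pair of cities would end up with, the scaled matrix can be inspected before plotting. The sketch below rebuilds the matrix using only base R. Rounding to whole numbers is an assumption made here to get integer edge counts; it is not part of the answer above.

```r
# Passenger data for the city pairs from the question
df2 <- data.frame(
  City1 = c("London", "London", "London", "London", "Liverpool",
            "Liverpool", "Liverpool", "Manchester", "Manchester", "Bristol"),
  City2 = c("Liverpool", "Manchester", "Bristol", "Derby", "Manchester",
            "Bristol", "Derby", "Bristol", "Derby", "Derby"),
  ave.pas = c(10, 0, 11, 24, 40, 45, 12, 34, 0, 29),
  stringsAsFactors = FALSE
)

cities <- unique(c(df2$City1, df2$City2))
m <- matrix(0, nrow = length(cities), ncol = length(cities),
            dimnames = list(cities, cities))

# Fill all cells at once: a two-column character matrix indexes by dimnames
m[cbind(df2$City1, df2$City2)] <- df2$ave.pas

# Scale so the smallest non-zero flow becomes 1 edge, then round
# (the rounding step is an assumption of this sketch)
edge_counts <- round(m / min(m[m > 0]))
edge_counts
```

Here London to Liverpool (10 passengers) maps to 1 edge and Liverpool to Manchester (40 passengers) maps to 4, which gives a feel for how much contrast the plot can show.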
I want to create a spatial map showing drug mortality rates by US county, but I'm having trouble merging the drug mortality dataset, crude_rate, with the shapefile, usa_county_df. Can anyone help out?
I've created a key variable, "County", in both sets to merge on but I don't know how to format them to make the data mergeable. How can I make the County variables correspond? Thank you!
head(crude_rate, 5)
Notes County County.Code Deaths Population Crude.Rate
1 Autauga County, AL 1001 74 975679 7.6
2 Baldwin County, AL 1003 440 3316841 13.3
3 Barbour County, AL 1005 16 524875 Unreliable
4 Bibb County, AL 1007 50 420148 11.9
5 Blount County, AL 1009 148 1055789 14.0
head(usa_county_df, 5)
long lat order hole piece id group County
1 -97.01952 42.00410 1 FALSE 1 0 0.1 1
2 -97.01952 42.00493 2 FALSE 1 0 0.1 2
3 -97.01953 42.00750 3 FALSE 1 0 0.1 3
4 -97.01953 42.00975 4 FALSE 1 0 0.1 4
5 -97.01953 42.00978 5 FALSE 1 0 0.1 5
crude_rate$County <- as.factor(crude_rate$County)
usa_county_df$County <- as.factor(usa_county_df$County)
merge(usa_county_df, crude_rate, "County")
[1] County long lat order hole
[6] piece id group Notes County.Code
[11] Deaths Population Crude.Rate
<0 rows> (or 0-length row.names)
My take on this. First, you cannot expect a full answer with code because you did not provide a link to your data. Next time, please provide a full description of the problem along with the data.
I just used the data you provided here to illustrate.
require(tidyverse)
# Load the data
crude_rate = read.csv("county_crude.csv", header = TRUE)
usa_county = read.csv("usa_county.csv", header = TRUE)
# Create the variable "county_join" within the county_crude to "left_join" on with the usa_county data. Note that you have to have the same type of data variable between the two tables and the same values as well
crude_rate = crude_rate %>%
mutate(county_join = c(1:5))
# Join the dataframes using a left join on the county_join and County variables
df_all = usa_county %>%
left_join(crude_rate, by = c("County"="county_join")) %>%
distinct(order,hole,piece,id,group, .keep_all = TRUE)
Data link: county_crude
Data link: usa_county
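Before joining, it can help to check that the key columns have the same type and to see which keys fail to match. A minimal base R sketch on made-up toy tables (the values and the names shape and rates are hypothetical, not the real county data):

```r
# Toy stand-ins for the shapefile data and the mortality table
shape <- data.frame(County = 1:3,
                    long = c(-97.01, -97.02, -97.03),
                    lat  = c(42.00, 42.01, 42.02))
rates <- data.frame(County = c("1", "2", "4"),
                    Crude.Rate = c(7.6, 13.3, 11.9),
                    stringsAsFactors = FALSE)

# Give both keys the same type before joining
rates$County <- as.integer(rates$County)

# Keys present in one table but not in the other
unmatched_in_shape <- setdiff(shape$County, rates$County)
unmatched_in_rates <- setdiff(rates$County, shape$County)

# all.x = TRUE keeps every shape row; unmatched rows get NA rates
joined <- merge(shape, rates, by = "County", all.x = TRUE)
joined
```

If the unmatched sets are large, the two key columns probably do not describe the same thing, which is what happens when a county name such as "Autauga County, AL" is compared with a numeric index.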
Sample data
library(raster)
dat <- getData('GADM', country='FRA', level=1)
plot(dat)
text(dat, labels=as.character(dat$ID_1), col="darkred", font=2, offset=0.5, adj=c(0,2))
To save the IDs of provinces, I can do this
province.id <- dat$ID_1
However, I want to arrange these IDs by direction (e.g. south to north).
For example, my province.id should start with 10 (the southernmost province) and end with 17 (the northernmost province).
One way I thought of is to generate the centroid of each province and, based on the centroids,
determine the order of the provinces from south to north.
library(rgeos)
trueCentroids = gCentroid(dat,byid=TRUE)
plot(dat)
points(coordinates(dat),pch=1)
But I still cannot export the output or arrange the centroids in south-to-north order to save them as a vector.
An easy approach would be to take the minimum latitude of each polygon and sort your IDs based on that:
# data
library(raster)
dat <- getData('GADM', country='FRA', level=1)
# create south to north index
sn_index <- unlist(lapply(dat@polygons, function(x) min(x@Polygons[[1]]@coords[,2])))
#sort IDs
dat$ID_1[order(sn_index)]
# [1] 10 13 21 16 22 3 2 14 20 6 18 8 11 7 1 9 15 4 5 19 12 17
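Since the question asks about centroids, the same ordering idea applies to centroid latitudes. A minimal sketch with made-up coordinates (the centroids table is hypothetical; with the real data the latitudes could presumably be taken from coordinates(trueCentroids)[, 2]):

```r
# Hypothetical province IDs with centroid coordinates (not real GADM values)
centroids <- data.frame(id  = c(1, 2, 3, 4),
                        lon = c(2.5, 4.8, -1.2, 6.1),
                        lat = c(48.9, 45.7, 47.2, 43.9))

# order() returns the row positions sorted by latitude, south first
south_to_north <- centroids$id[order(centroids$lat)]
south_to_north

# Reverse the direction with decreasing = TRUE
north_to_south <- centroids$id[order(centroids$lat, decreasing = TRUE)]
north_to_south
```

The result is an ordinary vector, so it can be saved or written out like any other.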
I'm trying to find the largest goal differentials in the group stage of the 2014 World Cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score)
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. First compute the score differences and the winners. When result is w, home is the winner, and when it is l, away is the winner, so you do not have to compare the scores at all. Once the score difference and the winner are added, you can keep the rows where the score difference equals max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option that creates the "winner" column without ifelse, using row/column indexing. The numeric column index is created by matching the result column against its possible values (match(football$result, ...)), and the row index is just 1:nrow(football). Subset the "football" dataset to the columns 'home' and 'away', and cbind it with an additional column 'draw' filled with NA, so that the 'd' elements in "result" map to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
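The row/column indexing idea can be tried on a small example. This sketch uses three results shown above plus one made-up draw (the TeamA/TeamB row is hypothetical), and a plain character matrix instead of a data frame:

```r
games <- data.frame(
  home   = c("Cameroon", "Spain", "Germany", "TeamA"),
  away   = c("Croatia", "Netherlands", "Portugal", "TeamB"),
  result = c("l", "l", "w", "d"),   # from the home team's point of view
  stringsAsFactors = FALSE
)

# Column 1 = home, column 2 = away, column 3 = NA for draws
teams <- cbind(home = games$home, away = games$away, draw = NA)

# Map each result to the column holding the winner: w -> 1, l -> 2, d -> 3
col_idx <- match(games$result, c("w", "l", "d"))

# A two-column (row, column) matrix picks one cell per game
games$winner <- teams[cbind(seq_len(nrow(games)), col_idx)]
games
```

Each game selects exactly one cell, so drawn games pick the NA column and end up with a missing winner.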
I wonder if there is a simpler way than writing if...else... for the following case. I have a dataframe and I only want the rows where the number in column "percentage" is >= 95. Moreover, for one object, if there are multiple rows fitting this criterion, I only want the largest one(s). If there is more than one largest, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result to show:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using if else statement, but I feel there should be some smarter ways to say: 1. >=95 2. if more than one, choose the largest 3. if more than one largest, choose them all.
You can do this by creating a variable that indicates the rows that have the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are at least 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the combined condition evaluates to a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
FUN=function(i) i==max(i)) , ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
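A variant of the same idea, sketched here as an alternative rather than the answer's exact code: let ave() return each group's maximum and compare against it, so the filter is an explicit logical comparison.

```r
dat <- data.frame(
  object     = c("A", "A", "A", "B", "B", "C"),
  city       = c("NY", "NY", "NY", "CA", "WA", "NA"),
  street     = c("Sun", "Malino", "Waterfall", "Washington", "Lieber", "Moon"),
  percentage = c(100, 97, 100, 98, 95, 75),
  stringsAsFactors = FALSE
)

# ave() repeats each object's maximum percentage on every row of that object
group_max <- ave(dat$percentage, dat$object, FUN = max)

# Keep rows that reach the threshold and equal their group's maximum
res <- dat[dat$percentage >= 95 & dat$percentage == group_max, ]
res
```

Ties are kept automatically, because every row equal to the group maximum passes the comparison.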
You could also do this in data.table, using the same approach as @user20650
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
library(dplyr)
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
The following also works:
ddf2 = ddf[ddf$percentage>=95,]
ddf3 = ddf2[-c(1:nrow(ddf2)),]
for(oo in unique(ddf2$object)){
tempdf = ddf2[ddf2$object == oo, ]
maxval = max(tempdf$percentage)
tempdf = tempdf[tempdf$percentage==maxval,]
for(i in 1:nrow(tempdf)) ddf3[nrow(ddf3)+1,] = tempdf[i,]
}
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98
Hey there everyone, I'm just getting started with R, so I decided to make some data up with the eventual goal of superimposing it on top of a map.
Before I can get there, I'm trying to add a name column to my data so I can sort by province.
Drugs <- c("Azin", "Prolof")
Provinces <- c("Ontario", "British Columbia", "Quebec")
Gender <- c("Female", "Male")
raw <- c(10,16,8,20,7,12,13,11,9,7,14,7)
yomom <- matrix(raw, nrow = 6, ncol = 2)
colnames(yomom) <- Drugs
bro <- data.frame(Gender, yomom)
idunno <- data.frame(Provinces, bro)
The first problem I've encountered is that the provinces vector is repeating, I'm not sure how to make it look like this in R. I'm basically trying to get it to skip a row.
Something like this?
idunno <- data.frame(Provinces=rep(Provinces,each=2), bro)
idunno
# Provinces Gender Azin Prolof
# 1 Ontario Female 10 13
# 2 Ontario Male 16 11
# 3 British Columbia Female 8 9
# 4 British Columbia Male 20 7
# 5 Quebec Female 7 14
# 6 Quebec Male 12 7
Read the documentation on rep(...)
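For reference, a quick sketch of the two rep() arguments that matter here: each repeats every element in place, while times repeats the whole vector.

```r
Provinces <- c("Ontario", "British Columbia", "Quebec")

# each = 2: every province appears twice in a row
by_each <- rep(Provinces, each = 2)

# times = 2: the whole vector is repeated end to end,
# the same pattern that recycling produces
by_times <- rep(Provinces, times = 2)

by_each
by_times
```

The first form lines up with the alternating Female/Male rows, which is why each = 2 is used in the answer.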