My problem:
I have a street network (df.net) and a list containing the Origins and Destinations of trips (df.trips).
I need to find the flow on every link, i.e. the total number of trips whose shortest path uses that link.
library(dplyr)
df.net = tribble(~from, ~to, ~weight,1,2,1,2,1,1,1,9,3,9,1,2,2,10,1,10,2,2,9,10,8,10,9,15,9,8,1,8,9,2,7,8,2,12,7,3,9,12,10,12,9,9,12,6,2,6,12,5,11,12,3,12,11,3,5,6,1,11,5,4,5,11,3,11,4,3,4,3,5,3,10,4,10,11,10)
df.trips = tribble(~from, ~to, ~N,1,2,45,1,4,24,1,5,66,1,9,12,1,11,54,2,3,63,2,4,22,2,7,88,2,12,44,3,2,6,3,8,43,3,10,20,3,11,4,4,1,9,4,5,7,4,6,35,4,9,1,5,7,55,5,8,21,5,1,23,5,7,12,5,2,18,6,2,31,6,3,6,6,5,15,6,8,19,7,1,78,7,2,48,7,3,92,7,6,6,8,2,77,8,4,5,8,5,35,8,6,63,8,7,22)
This is my solution:
library(igraph)
# I construct a directed igraph network:
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
plot(graph)
# I make a vector of edge_ids:
edges = paste0(df.net$from,":",df.net$to)
# and an empty vector of same length to fill with the flow afterwards:
N = integer(length(edges))
# I loop through all Origin-Destination-pairs:
for(i in 1:nrow(df.trips)){
# provides one shortest path between one Origin & one Destination:
path = shortest_paths(graph = graph,
from = as.character(df.trips$from[i]),
to = as.character(df.trips$to[i]),
mode = "out",
weights = NULL)
# Extract the names of vertices on the path:
a = names(path$vpath[[1]])
# Make a vector of the edge_ids:
a2 = a[2:length(a)]
a = a[1:(length(a)-1)]
a = paste0(a,":",a2)
# and fill the vector with the trips
v = integer(length(edges))
v[edges %in% a] = pull(df.trips[i,3])
# adding the trips of this iteration to the sum
N = N + v
}
# attach vector to network-dataframe:
df.net = data.frame(df.net, N)
Theoretically it works. It just takes approx. 8h for my real network to finish (about 500 000 Origin-Destination-pairs on a network with a bit less than 50 000 links).
I am pretty sure my for-loop is the culprit.
So my questions concerning optimization are:
1) Is there an igraph function which simply does what I want to do? I could not find one...
2) Maybe there is another package better suited to my needs which I haven't stumbled upon?
3) If not, should I go for loop-performance improvement by rewriting it with the Rcpp-package?
Anyways, I am grateful for any help you can provide me.
Thanks in advance!
I have what I hope is a faster solution, although I get slightly different results from you.
This approach multithreads with data.table, calls igraph::shortest_paths only once per from vertex, and avoids using the names attributes of the graph until the trivial last step.
library(igraph)
library(tibble)
library(data.table)
library(zoo)
library(purrr)
df.net = tribble(~from, ~to, ~weight,1,2,1,2,1,1,1,9,3,9,1,2,2,10,1,10,2,2,9,10,8,10,9,15,9,8,1,8,9,2,7,8,2,12,7,3,9,12,10,12,9,9,12,6,2,6,12,5,11,12,3,12,11,3,5,6,1,11,5,4,5,11,3,11,4,3,4,3,5,3,10,4,10,11,10)
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
df.trips = tribble(~from, ~to, ~N,1,2,45,1,4,24,1,5,66,1,9,12,1,11,54,2,3,63,2,4,22,2,7,88,2,12,44,3,2,6,3,8,43,3,10,20,3,11,4,4,1,9,4,5,7,4,6,35,4,9,1,5,7,55,5,8,21,5,1,23,5,7,12,5,2,18,6,2,31,6,3,6,6,5,15,6,8,19,7,1,78,7,2,48,7,3,92,7,6,6,8,2,77,8,4,5,8,5,35,8,6,63,8,7,22)
l.trips <- split(df.trips,1:nrow(df.trips))
setDT(df.trips)
Result <- df.trips[,setnames(lapply(shortest_paths(graph = graph,from= from,to = to,weights=NULL,mode = "out")$vpath,
function(x){zoo::rollapply(x,width=2,c)}) %>% map2(.,N,~ {.x %x% rep(1,.y)} %>% as.data.frame) %>%
rbindlist %>% .[,.N,by = c("V1","V2")],c("new.from","new.to","N")),by=from][,sum(N),by = c("new.from","new.to")]
Result[,`:=`(new.from = V(graph)$name[Result$new.from],
new.to = V(graph)$name[Result$new.to])]
# new.from new.to V1
# 1: 1 2 320
# 2: 2 10 161
# 3: 1 9 224
# 4: 9 8 73
# 5: 10 11 146
# 6: 11 4 102
# 7: 2 1 167
# 8: 9 12 262
# 9: 4 3 44
#10: 9 1 286
#11: 12 6 83
#12: 12 11 24
#13: 11 5 20
#14: 10 2 16
#15: 11 12 35
#16: 12 7 439
#17: 8 9 485
#18: 7 8 406
#19: 6 12 202
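The "one shortest_paths call per origin" idea can also be written as a plain loop that asks igraph for edge paths directly, which avoids building and matching "from:to" strings. The following is only a minimal sketch on a hypothetical three-link toy network (the names net, trips and flow are made up for illustration). It assumes that graph_from_data_frame keeps the edges in data-frame row order and that the "weight" column is picked up as the edge weight.

```r
library(igraph)

# Hypothetical toy network and trip table (not the data from the question)
net   <- data.frame(from = c("a", "b", "a"),
                    to   = c("b", "c", "c"),
                    weight = c(1, 1, 5))
trips <- data.frame(from = c("a", "a", "b"),
                    to   = c("c", "b", "c"),
                    N    = c(10, 5, 2))

g <- graph_from_data_frame(net, directed = TRUE)

# One flow counter per edge, in the same order as the rows of 'net'
flow <- numeric(ecount(g))

for (o in unique(trips$from)) {
  sub <- trips[trips$from == o, ]
  # One call per origin; output = "epath" returns edge sequences
  sp <- shortest_paths(g, from = o, to = sub$to,
                       mode = "out", output = "epath")
  for (k in seq_len(nrow(sub))) {
    e <- as.integer(sp$epath[[k]])   # edge ids along the k-th path
    flow[e] <- flow[e] + sub$N[k]    # add this trip count to every edge used
  }
}

net$flow <- flow
net
```

Here the direct link a -> c (weight 5) is never used because a -> b -> c costs 2, so all trips from a to c are loaded onto the first two links.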
I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write code that does the following: whenever the "switch" column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. keep only the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put the averages in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points and I would like R to do it automatically. Does anyone have any ideas that could help me?
Please find my edits below, which should work for both regular and irregular switching:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
# First, turn the switch column into a grouping column: (1,2,1,2) -> (1,2,3,4)
grouplength <- rle(Data[,"switch"])$lengths
# Repeat each group id as many times as its run is long.
# rep() with 'times' always returns a plain vector, for regular and irregular
# switching alike, so no list/matrix conversion is needed.
groups <- rep(seq_along(grouplength), times = grouplength)
Data$group <- groups
#
# First row of each run of identical switch values (row 1 starts the first run)
switchRow <- c(0,which(diff(SwitchData) != 0))+1
# "ToRemove" holds the row numbers of the first UnreliableRows rows after every switch,
# including the first rows of the data set. as.vector() flattens the matrix returned by mapply.
ToRemove <- as.vector(mapply(seq, switchRow, switchRow+UnreliableRows-1))
# Create the new data without those rows (in case you don't know: data[<ROW>,<COLUMN>])
newdat <- Data[-ToRemove,]
# return the results
newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try several values for the number of removed rows with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
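The run-based grouping can also be sketched compactly with rle() and sequence(), which give every row its run id and its position inside the run. This is a minimal illustration under assumed data: the irregular run lengths 10, 8, 12, 8, 2 and the names d, runs, kept and means are made up for the example.

```r
set.seed(1)

# Hypothetical irregular switching pattern: runs of length 10, 8, 12, 8 and 2
sw <- rep(c(1, 2, 1, 2, 1), times = c(10, 8, 12, 8, 2))
d  <- data.frame(switch = sw,
                 O2  = runif(40, min = 21.02, max = 21.06),
                 CO2 = runif(40, min = 0.076, max = 0.080))

runs    <- rle(d$switch)$lengths
d$group <- rep(seq_along(runs), times = runs)  # run id: 1, 2, 3, ...
d$pos   <- sequence(runs)                      # position within each run

# Keep only rows after the first 5 of every run, then average per run
kept  <- d[d$pos > 5, ]
means <- aggregate(cbind(O2, CO2) ~ group, data = kept, FUN = mean)
means
```

Because the filter works on the position within each run, a run shorter than six rows simply contributes nothing and never removes rows from the following run.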
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows of each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
if (nrow(x) <= 5) return(NULL)
x <- x[-(1:5),]
d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)
rbind(x, cbind(date=NA, switch=-1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
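If the averages should sit next to the data as the avg.O2 and avg.CO2 columns mentioned in the question, ave() is one way to broadcast a group mean back onto every row of its group. A minimal sketch, assuming regular blocks of 10 rows and using made-up names (d, kept):

```r
set.seed(1)

d <- data.frame(switch = rep(rep(1:2, each = 10), times = 2),
                O2  = runif(40, min = 21.02, max = 21.06),
                CO2 = runif(40, min = 0.076, max = 0.080))

runs    <- rle(d$switch)$lengths
d$group <- rep(seq_along(runs), times = runs)

# Drop the first 5 rows of every run
kept <- d[sequence(runs) > 5, ]

# ave() returns a vector as long as its input, holding each row's group mean
kept$avg.O2  <- ave(kept$O2,  kept$group)
kept$avg.CO2 <- ave(kept$CO2, kept$group)
head(kept)
```

Each kept row then carries the mean of its own block, so no separate summary rows have to be bound into the data frame.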
I am looking for a fast and efficient way to compute the problem described below. Any help would be appreciated, thanks in advance!
I have a couple of very large csv files that hold different information about the same objects, and my final calculation needs the attributes from all of these tables. I am trying to calculate the load of a large number of electrical substations. First, I have a list of unique electrical substations;
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations;
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers' roofs (for different years);
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is for each substation for each year. I have a household load and a solar panel load, normalised for each type of household or solar panel;
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists;
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches and match() only returns the first match.)
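To illustrate the difference between a first match and all matches, here is a small sketch on a made-up vector (subst and idx are hypothetical names). It also shows split() as one way to collect all positions for every key in a single pass, instead of calling which() once per key.

```r
# Hypothetical substation name per customer
subst <- c("SubA", "SubB", "SubA", "SubC")

match("SubA", subst)        # position of the first match only
which(subst %in% "SubA")    # positions of all matches

# All positions for every distinct name at once
idx <- split(seq_along(subst), subst)
idx$SubA
```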
And now we come to my big question (but probably not my only problem with this method) at last. I want to calculate the maximum load on each substation for each year. To this end I had first written a for loop that looped through the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer() but I don't think I have properly vectorized my function. My maximum function looks as follows (I only wrote it out for the solar panel part to keep it simple);
GetMax <- function(i, Yr) {
max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]], use.names= FALSE),Yr])*SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use an outer function;
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my example all of these data frames are much much larger (40000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer I have another problem. My actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say, this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: set.seed() when generating random data! I did set.seed(1000) before your code for these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load") %>%
select(HouseHoldType=variable, Load) %>%
mutate(HouseHoldType=gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars="Customer",
value.name="SPYearVal") %>%
select(Customer, SolarPanelYear=variable, SPYearVal) %>%
mutate(SolarPanelYear=gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>% group_by(SubSt_Nm, SolarPanelYear) %>%
summarise(mx=max(sum(SPYearVal)*SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
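As a rough idea of what the data.table route could look like, here is a minimal sketch on hypothetical toy tables (cust, solar, long, dat and res are made-up names, and only the per-substation sum is computed, not the multiplication with SolarLoad). It reshapes with data.table's melt(), joins with the on= syntax instead of merge(), and aggregates by group.

```r
library(data.table)

# Hypothetical toy tables, much smaller than the real data
cust  <- data.table(Customer = 1001:1004,
                    SubSt_Nm = c("SubA", "SubA", "SubB", "SubB"))
solar <- data.table(Customer = c(1001L, 1002L, 1002L, 1004L),
                    SolarPanelYear1 = c(10, 12, 14, 16),
                    SolarPanelYear2 = c(15, 16, 17, 18))

# Wide to long: one row per customer, panel and year
long <- melt(solar, id.vars = "Customer",
             variable.name = "Year", value.name = "SPYearVal")

# Join: look up the substation of every panel row
dat <- cust[long, on = "Customer"]

# Sum per substation and year
res <- dat[, .(total = sum(SPYearVal)), by = .(SubSt_Nm, Year)]
res
```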
UPDATE For those who cannot install dplyr, this just uses reshape2 (hopefully that is installable)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
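For the larger tables mentioned in the question's edit, one possible direction that avoids merge() altogether is a vectorised lookup with match() followed by rowsum(). This is only a sketch on hypothetical toy data (cust, solar, sub_of_panel and totals are made-up names). It assumes every customer belongs to exactly one substation, so each solar panel row needs just one match, and it computes only the per-substation sums.

```r
# Hypothetical toy tables
cust  <- data.frame(Customer = 1001:1004,
                    SubSt_Nm = c("SubA", "SubA", "SubB", "SubB"))
solar <- data.frame(Customer = c(1001, 1002, 1002, 1004),
                    SolarPanelYear1 = c(10, 12, 14, 16),
                    SolarPanelYear2 = c(15, 16, 17, 18))

# Substation of every solar panel row (many-to-one lookup, no merge needed)
sub_of_panel <- cust$SubSt_Nm[match(solar$Customer, cust$Customer)]

# Column sums per substation, one row per substation
totals <- rowsum(solar[, c("SolarPanelYear1", "SolarPanelYear2")],
                 group = sub_of_panel)
totals
```

The result has one row per substation and one column per year, which is the same shape as the matrices printed above.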