My problem:
I have a street network (df.net) and a list containing the Origins and Destinations of trips (df.trips).
I need to find the flow on every link, i.e. the total number of trips whose shortest path uses that link.
library(dplyr)
df.net = tribble(
  ~from, ~to, ~weight,
  1, 2, 1,    2, 1, 1,    1, 9, 3,    9, 1, 2,    2, 10, 1,
  10, 2, 2,   9, 10, 8,   10, 9, 15,  9, 8, 1,    8, 9, 2,
  7, 8, 2,    12, 7, 3,   9, 12, 10,  12, 9, 9,   12, 6, 2,
  6, 12, 5,   11, 12, 3,  12, 11, 3,  5, 6, 1,    11, 5, 4,
  5, 11, 3,   11, 4, 3,   4, 3, 5,    3, 10, 4,   10, 11, 10)
df.trips = tribble(
  ~from, ~to, ~N,
  1, 2, 45,   1, 4, 24,   1, 5, 66,   1, 9, 12,   1, 11, 54,
  2, 3, 63,   2, 4, 22,   2, 7, 88,   2, 12, 44,  3, 2, 6,
  3, 8, 43,   3, 10, 20,  3, 11, 4,   4, 1, 9,    4, 5, 7,
  4, 6, 35,   4, 9, 1,    5, 7, 55,   5, 8, 21,   5, 1, 23,
  5, 7, 12,   5, 2, 18,   6, 2, 31,   6, 3, 6,    6, 5, 15,
  6, 8, 19,   7, 1, 78,   7, 2, 48,   7, 3, 92,   7, 6, 6,
  8, 2, 77,   8, 4, 5,    8, 5, 35,   8, 6, 63,   8, 7, 22)
This is my solution:
library(igraph)
# I construct a directed igraph network:
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
plot(graph)
# I make a vector of edge_ids:
edges = paste0(df.net$from,":",df.net$to)
# and an empty vector of same length to fill with the flow afterwards:
N = integer(length(edges))
# I loop through all Origin-Destination-pairs:
for(i in 1:nrow(df.trips)){
  # get one shortest path between one Origin and one Destination
  # (weights = NULL makes igraph use the edge attribute "weight"):
  path = shortest_paths(graph = graph,
                        from = as.character(df.trips$from[i]),
                        to = as.character(df.trips$to[i]),
                        mode = "out",
                        weights = NULL)
  # extract the names of the vertices on the path:
  a = names(path$vpath[[1]])
  # make a vector of the edge_ids:
  a2 = a[2:length(a)]
  a = a[1:(length(a)-1)]
  a = paste0(a, ":", a2)
  # fill a vector with the trips of this Origin-Destination pair
  v = integer(length(edges))
  v[edges %in% a] = pull(df.trips[i, 3])
  # and add it to the running total
  N = N + v
}
# attach vector to network-dataframe:
df.net = data.frame(df.net, N)
It works in principle; it just takes approx. 8 hours for my real network to finish (about 500,000 Origin-Destination pairs on a network with a bit less than 50,000 links).
I am pretty sure my for-loop is the culprit.
So my questions concerning optimization are:
1) Is there an igraph function which simply does what I want to do? I could not find it...
2) Maybe there is another package better suited to my needs which I haven't stumbled upon?
3) If not, should I go for loop-performance improvement by rewriting it with the Rcpp-package?
Anyway, I am grateful for any help you can provide.
Thanks in advance!
I have what I hope is a faster solution, although I get slightly different results from you (where several shortest paths are tied on weight, we may each pick a different one).
This approach multithreads via data.table, calls igraph::shortest_paths only once per from vertex, and avoids using the graph's name attribute until the trivial last step.
library(igraph)
library(tibble)
library(data.table)
library(zoo)
library(purrr)
df.net = tribble(
  ~from, ~to, ~weight,
  1, 2, 1,    2, 1, 1,    1, 9, 3,    9, 1, 2,    2, 10, 1,
  10, 2, 2,   9, 10, 8,   10, 9, 15,  9, 8, 1,    8, 9, 2,
  7, 8, 2,    12, 7, 3,   9, 12, 10,  12, 9, 9,   12, 6, 2,
  6, 12, 5,   11, 12, 3,  12, 11, 3,  5, 6, 1,    11, 5, 4,
  5, 11, 3,   11, 4, 3,   4, 3, 5,    3, 10, 4,   10, 11, 10)
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
df.trips = tribble(
  ~from, ~to, ~N,
  1, 2, 45,   1, 4, 24,   1, 5, 66,   1, 9, 12,   1, 11, 54,
  2, 3, 63,   2, 4, 22,   2, 7, 88,   2, 12, 44,  3, 2, 6,
  3, 8, 43,   3, 10, 20,  3, 11, 4,   4, 1, 9,    4, 5, 7,
  4, 6, 35,   4, 9, 1,    5, 7, 55,   5, 8, 21,   5, 1, 23,
  5, 7, 12,   5, 2, 18,   6, 2, 31,   6, 3, 6,    6, 5, 15,
  6, 8, 19,   7, 1, 78,   7, 2, 48,   7, 3, 92,   7, 6, 6,
  8, 2, 77,   8, 4, 5,    8, 5, 35,   8, 6, 63,   8, 7, 22)
l.trips <- split(df.trips,1:nrow(df.trips))
setDT(df.trips)
Result <- df.trips[, setnames(
  # one shortest_paths() call per origin covers all of its destinations;
  # rollapply() turns each vertex path into a two-column matrix of edges:
  lapply(shortest_paths(graph = graph, from = from, to = to,
                        weights = NULL, mode = "out")$vpath,
         function(x){zoo::rollapply(x, width = 2, c)}) %>%
    # replicate each edge once per trip of that OD pair, then count:
    map2(., N, ~ {.x %x% rep(1, .y)} %>% as.data.frame) %>%
    rbindlist %>% .[, .N, by = c("V1", "V2")],
  c("new.from", "new.to", "N")), by = from][, sum(N), by = c("new.from", "new.to")]
Result[,`:=`(new.from = V(graph)$name[Result$new.from],
new.to = V(graph)$name[Result$new.to])]
# new.from new.to V1
# 1: 1 2 320
# 2: 2 10 161
# 3: 1 9 224
# 4: 9 8 73
# 5: 10 11 146
# 6: 11 4 102
# 7: 2 1 167
# 8: 9 12 262
# 9: 4 3 44
#10: 9 1 286
#11: 12 6 83
#12: 12 11 24
#13: 11 5 20
#14: 10 2 16
#15: 11 12 35
#16: 12 7 439
#17: 8 9 485
#18: 7 8 406
#19: 6 12 202
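As an aside, if you keep a loop, a leaner igraph-only variant is possible: shortest_paths() can return edge paths directly via output = "epath", so the edge indices index straight into a flow vector and the paste()/%in% matching disappears. A sketch of that idea, untested at scale:
library(igraph)
flow <- numeric(ecount(graph))
# one shortest_paths() call per origin covers all of its destinations:
for (v in unique(df.trips$from)) {
  dest  <- df.trips[df.trips$from == v, ]
  paths <- shortest_paths(graph,
                          from   = as.character(v),
                          to     = as.character(dest$to),
                          mode   = "out",
                          output = "epath")$epath
  for (j in seq_along(paths)) {
    e <- as.integer(paths[[j]])   # edge indices along this path
    flow[e] <- flow[e] + dest$N[j]
  }
}
# graph_from_data_frame() keeps the edge order of df.net:
df.net$N <- flow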
I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: Each of these histograms has a different number of "bars" (i.e. bins). For example, the first histogram has 8 bars and the third histogram has 4 bars. For each of these histograms, is there a way to find out which observations (from the original vector d) are located in each bar?
Right now, I am trying to do this manually, e.g. (for histogram #3):
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself can provide the solution to the question's problem of finding out which data points are in which intervals: hist returns a list whose first member, breaks, holds the interval end points.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now, save the return value of hist and let findInterval tell us which bin each data point falls in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
The first six observations fall in intervals 6 and 7, with end points 8, 10 and 12, as can be seen by indexing d by f1:
head(d[f1])
#[1] 8.07743 10.26174 10.26174 10.26174 10.26174 8.07743
As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval").
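A quick check of that closure: with the default rightmost.closed = FALSE, findInterval bins are left-closed, so a value equal to a break point starts a new bin (note that hist() itself closes bins on the right by default, which is exactly the subtlety that help page flags):
findInterval(c(8, 10), h1$breaks)
#[1] 6 7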
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
To get the interval end points for each data point, build the following data frame:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
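And if the goal is to list the observations per bar directly, a short follow-up sketch: reuse the break points hist chose, with cut()'s defaults (right-closed bins plus include.lowest) matching hist's counting:
obs_by_bar <- split(d, cut(d, breaks = h1$breaks, include.lowest = TRUE))
lengths(obs_by_bar)   # matches h1$counts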
I have data that looks like this (but larger):
Pos Value
0 66.81967
1 66.36885
2 65.79508
3 65.27049
4 64.88525
5 64.97541
6 65.39344
7 65.99181
8 66.63115
9 66.95901
10 66.89344
11 66.44262
12 65.90984
13 65.49181
14 65.35246
I have already determined the maxima and saved the position values of each to a vector like so:
9 19 30 42 56 69 80 92 107 118 130 143 154 164 176 188 199 211
222 234 245
I now want to split the data at the maxima, so for the sample data I'd want to split the dataset into the values for Positions 0-9 and the values for Positions 10-14, and save each of these subsets in a vector of its own.
I'm new to R (and coding) and was wondering how to best go about this.
Suppose your data frame is dat and your maxima values are in a vector maxima; you might use
split(dat, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
For your example data frame:
dat <-
structure(list(Pos = 0:14, Value = c(66.81967, 66.36885, 65.79508,
65.27049, 64.88525, 64.97541, 65.39344, 65.99181, 66.63115, 66.95901,
66.89344, 66.44262, 65.90984, 65.49181, 65.35246)), .Names = c("Pos",
"Value"), class = "data.frame", row.names = c(NA, -15L))
and taking break points in the range of your example data (with 0 prepended so the first interval starts at the lowest position):
maxima <- c(0, 10, 19)
my code gives you a list of data frames:
#$`[0,10]`
# Pos Value
#1 0 66.81967
#2 1 66.36885
#3 2 65.79508
#4 3 65.27049
#5 4 64.88525
#6 5 64.97541
#7 6 65.39344
#8 7 65.99181
#9 8 66.63115
#10 9 66.95901
#11 10 66.89344
#
#$`(10,19]`
# Pos Value
#12 11 66.44262
#13 12 65.90984
#14 13 65.49181
#15 14 65.35246
If you don't want data frames, but just Value, use
split(dat$Value, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
#$`[0,10]`
# [1] 66.81967 66.36885 65.79508 65.27049 64.88525 64.97541 65.39344 65.99181
# [9] 66.63115 66.95901 66.89344
#
#$`(10,19]`
# [1] 66.44262 65.90984 65.49181 65.35246
Thanks! How would I go about saving these as separate data frames/sets (not sure on the correct terminology) so that I can then fit them individually?
How about
lst <- split(dat, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
dir <- getwd()
lapply(seq_along(lst),
       function(i) write.csv(lst[[i]], file = paste0(dir, "/", names(lst[i]), ".csv"),
                             row.names = FALSE))
This will save each data frame into a .csv file under directory dir. I have used getwd() to test the code; you may change it to a specific folder.
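If you would rather have them as separate objects in your workspace than as files, one option (at the cost of polluting the global environment) is:
names(lst) <- paste0("segment_", seq_along(lst))
list2env(lst, envir = .GlobalEnv)
# segment_1, segment_2, ... are now individual data frames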
Not sure if that is the best approach, but I would work with a list and use a for loop like this (untested):
maxpos <- c(9, 19, 30)
ans <- list()
prev <- 1
for (i in seq_along(maxpos)) {
  ans[[i]] <- dataset[seq(prev, maxpos[i]), ]
  prev <- maxpos[i] + 1
}
ans[[length(maxpos) + 1]] <- dataset[seq(maxpos[length(maxpos)] + 1, nrow(dataset)), ]
I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
df:
  eth ZIP1
    A    1
    B    1
    B    2
    A    3
    C    5

Inc:
  ZIP2  A  B  C
     1 56 49 69
     2 98 10  2
     3 43 69 59
     4  4 30  8
     5 90 10 17
     6 19  4 84
     7 59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is the entry of Inc at the intersection of that observation's eth (a column of Inc) and ZIP (a row of Inc). In my example, it would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or other brute force could solve it, but that takes too long on my dataset; I'm looking for a more subtle way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed) is maybe why I haven't found any similar question when searching the forum.
Thanks!
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[ melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc"),
Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
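For illustration, the long format that melt produces looks like this (first rows shown; with a recent data.table you may need melt(as.data.table(Inc), ...) since Inc is a plain data frame):
head(melt(Inc, id.var = "ZIP2", variable.name = "eth", value.name = "Inc"))
#   ZIP2 eth Inc
# 1    1   A  56
# 2    2   A  98
# 3    3   A  43
# 4    4   A   4
# 5    5   A  90
# 6    6   A  19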
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
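Either way, df ends up with the desired column:
df
#    eth ZIP1 Inc
# 1:   A    1  56
# 2:   B    1  49
# 3:   B    2  10
# 4:   A    3  43
# 5:   C    5  17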
We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
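A side note, assuming a current tidyr: gather() is superseded, and the same reshape can be written with pivot_longer():
Inc %>%
  pivot_longer(-ZIP2, names_to = "eth", values_to = "value") %>%
  left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))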
My solution (which may seem awkward; note it indexes Inc by row number, which only works here because ZIP2 happens to equal the row number):
for (i in seq_along(df$eth)) {
  df$Inc[i] <- Inc[as.character(df$eth[i])][df$ZIP1[i], ]
}
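For what it's worth, the same lookup can be done without the explicit loop, in the spirit of the indexing answer above (a sketch):
df$Inc <- mapply(function(e, z) Inc[match(z, Inc$ZIP2), as.character(e)],
                 df$eth, df$ZIP1)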
I am looking for a fast and efficient way to compute the problem described below. Any help would be appreciated; thanks in advance!
I have a couple of very large csv files holding different information about the same objects, and my final calculation needs attributes from all of the different tables. I am trying to calculate the load on a large number of electrical substations. First, I have a list of unique electrical substations:
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations:
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers' roofs (for different years):
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is for each substation for each year. I have a household load and a solar panel load, normalised per household type and per solar panel:
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists:
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches, and match() only returns the first match.)
And now we come to my big question (though probably not my only problem with this method). I want to calculate the maximum load on each substation for each year. To this end I first wrote a for loop over the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer(), but I don't think I have properly vectorized my function. My maximum function looks as follows (written out only for the solar panel part, to keep it simple):
GetMax <- function(i, Yr) {
  max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]], use.names = FALSE), Yr]) * SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use an outer() call:
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my real data all of these data frames are much, much larger (40,000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer, I have another problem: my actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say, this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: set.seed() when generating random data! I ran set.seed(1000) before your code to get these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load") %>%
select(HouseHoldType=variable, Load) %>%
mutate(HouseHoldType=gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars="Customer",
value.name="SPYearVal") %>%
select(Customer, SolarPanelYear=variable, SPYearVal) %>%
mutate(SolarPanelYear=gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>% group_by(SubSt_Nm, SolarPanelYear) %>%
summarise(mx=max(sum(SPYearVal)*SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
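As a hedged sketch of that data.table suggestion (the same group-and-summarise step, assuming dat from the merge() above):
library(data.table)
setDT(dat)
dat[, .(mx = max(sum(SPYearVal) * SolarLoad$Load)),
    by = .(SubSt_Nm, SolarPanelYear)]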
UPDATE: For those who cannot install dplyr, this version uses just reshape2 (hopefully that is installable):
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
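Regarding the EDIT about merge() choking at 188k rows: a hedged sketch of the same inner join done as a data.table keyed join, which scales much better than merge() on data frames:
library(data.table)
setDT(Customer_Information)
setDT(Solar_Panels)   # the melted version from above
dat <- Solar_Panels[Customer_Information, on = "Customer", nomatch = 0L]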
Here is my question: I have data with 3000 obs. and 5000 features, and the 3000 obs. have numeric names like 100.1, 100.3, 100.5, 100.7. I turned the names into integers with segs <- as.integer(names); segs has 300 unique values, and I want to use it as a factor to sum each of the 5000 features within each segment, so the final data frame is 300 by 5000. I know tapply can compute the sum by factor for one variable, but I would have to use a for loop to cover all 5000 features, which is really time-consuming. Is there a clean way in R to solve this, or a package that handles this kind of problem?
This is the dirty code; df0 is the data while df is what I want:
df <- data.frame()
for(i in 1:(ncol(df0)-1)){   # every column except the grouping column
  temp <- tapply(df0[,i], df0$seq, sum)
  df <- cbind(df, temp)
}
Thanks!
=====
Thanks, Roland. A demo dataset is shown as follows:
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
Try this...
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
library(data.table)
dt = data.table(df0)
dt[,lapply(.SD, sum), by=seq ]
seq X Y Z
1: 23 164.8144774 1.293768670 -3.74807730
2: 4 8.9247301 1.909529066 -0.06277254
3: 13 40.2090180 -2.036599633 0.88836392
4: 16 147.8571697 -2.571487358 -1.35542918
5: 14 72.1640142 0.432493959 -1.49983832
6: 8 42.8498355 -0.582031919 -1.35989852
7: 25 75.9995653 0.896369560 -1.08024329
8: 9 27.5244048 0.833429855 -1.19363017
9: 30 30.1842371 0.188193035 -0.64574372
10: 32 32.8664539 0.108072728 2.03697217
11: -3 -7.5714175 -0.899304085 -1.27286230
12: 7 29.6254908 -0.929790177 2.75906514
... (rows 13 to 26 omitted) ...
27: 12 50.2535374 -0.620793351 -3.80900436
28: 24 24.4410126 -0.433169033 -0.02671746
29: -19 -19.9309008 -0.533492330 -1.01759612
30: 11 11.8523056 -1.071782384 0.96954501
31: 19 38.5407490 -0.751408534 -4.81312992
32: 0 -0.9642319 1.453325156 2.20977601
33: -1 -4.3685646 -0.834654913 -0.24624546
34: 18 18.2177311 -1.594588162 0.27369527
35: -4 -4.5921400 0.586487537 0.86256338
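For reference, base R can also do the whole thing in one call: rowsum() sums every column of a matrix or data frame by a grouping vector (a sketch against the demo data above):
rowsum(df0[, c("X", "Y", "Z")], group = df0$seq)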