Neo4j shortest path with rels in both directions - graph

I have a graph set up with the following Cypher statement...
create (a:station {name:"a"}),
(b:station {name:"b"}),
(c:station {name:"c"}),
(d:station {name:"d"}),
(e:station {name:"e"}),
(f:station {name:"f"}),
(a)-[:CONNECTS_TO {time:8}]->(b),
(a)-[:CONNECTS_TO {time:4}]->(c),
(a)-[:CONNECTS_TO {time:10}]->(d),
(b)-[:CONNECTS_TO {time:3}]->(c),
(b)-[:CONNECTS_TO {time:9}]->(e),
(c)-[:CONNECTS_TO {time:40}]->(f),
(d)-[:CONNECTS_TO {time:5}]->(e),
(e)-[:CONNECTS_TO {time:3}]->(f)
and I am running the following query
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p =(startStation)-[r*]->(endStation)
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
and it returns the results...
+-------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+-------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
+-------------------------------------------------------------+
This is fine, except that because the graph is directed and there is no relationship from c to b, it doesn't return (for instance) [a, c, b, e, f], which is a valid path of length 4.
So, if I add the inverse relationships...
MATCH (START)-[r:CONNECTS_TO]->(END )
CREATE UNIQUE (START)<-[:CONNECTS_TO { time:r.time }]-(END )
and run the query again (for paths of length 1..4), I get...
+---------------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+---------------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","c","b","e","f"] | [4,3,9,3] | 19 | 4 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","c","b","c","f"] | [4,3,3,40] | 50 | 4 |
| ["a","c","f","e","f"] | [4,40,3,3] | 50 | 4 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
| ["a","b","a","c","f"] | [8,8,4,40] | 60 | 4 |
| ["a","d","a","c","f"] | [10,10,4,40] | 64 | 4 |
+---------------------------------------------------------------------+
This does include the path [a, c, b, e, f], but it also includes [a, c, b, c, f], which uses c twice, and [a, c, f, e, f], which uses f (the destination?!) twice.
Is there a way of filtering the paths so each path only includes the same node once?

You could filter after the fact, but it might not be the fastest approach.
Something like this:
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p = (startStation)-[r*..4]->(endStation)
WHERE length(reduce (a=[startStation], n IN nodes(p) | CASE WHEN n IN a THEN a ELSE a + n END)) = length(nodes(p))
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
I created a GraphGist with your question and answer as an executable, live document.
See here: Neo4j shortest path with rels in both directions
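As an aside, if travel times are symmetric you may not need to duplicate the relationships at all: Cypher can traverse CONNECTS_TO in either direction with an undirected pattern, and the same node-uniqueness filter still applies. A sketch of that variant (assuming the original single-direction data and the same auto-index setup):
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p = (startStation)-[r:CONNECTS_TO*..4]-(endStation)
WHERE length(reduce(a = [startStation], n IN nodes(p) | CASE WHEN n IN a THEN a ELSE a + n END)) = length(nodes(p))
WITH extract(x IN rels(p) | x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p) | totalTime + x.time) AS `Total Time`, extract(x IN nodes(p) | x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`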

Related

WGCNA package: value matching function output contains wrong NAs

I use the WGCNA package for analyzing co-expressed genes. Here I try to form a data frame, analogous to the expression data, that will hold the clinical traits, and I use the following code:
table for traitData
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (only some of the genes are shown):
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv"); # this csv file contains 3 columns: the first holds irrelevant information, the second the sample names, and the third the measured trait values.
# remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr) #dfxpr is a data frame containing 9 observations (i.e. samples) and 6398 variables (i.e. genes)
traitRows = match(lowNoduleSamples, allTraits$sample); # this is the line where I get wrong values (NAs), although I know they should all match
datTraits = allTraits[traitRows, -1]; # then this line results in NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
how can I fix the problem?
I added drop = FALSE to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that my allTraits contains only two columns; when I remove the first one, I'm left with just one column, and R drops it down to a plain vector unless I add the drop = FALSE argument.
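For what it's worth, a minimal sketch of why drop = FALSE matters here (the data frame below is made up for illustration):
# Hypothetical two-column data frame analogous to allTraits
allTraits <- data.frame(sample = c("s1", "s2", "s3"), NoduleperPlant = c(2, 3, 1))
# With the default drop = TRUE, removing the first column of a two-column
# data frame returns a plain vector...
class(allTraits[, -1])               # "numeric"
# ...whereas drop = FALSE keeps a one-column data frame, so rownames()
# can still be assigned on the result.
class(allTraits[, -1, drop = FALSE]) # "data.frame"
Separately, in the tables as posted, rownames(dfxpr) carry an X prefix (e.g. X1021_verbena_rep_1) while allTraits$sample does not (1021_verbena_rep_1); if that reflects the real data, match() would also return NAs for that reason, and the names may need to be harmonised before matching.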

Is there a KQL query to limit the number of sub results I get per a particular category?

I'm trying to write a query that limits the number of sub-results I get per category, and could use some help on whether there is a good function for this.
Quick Example:
| ID | Category | Value | A bunch of other important columns |
|-----------|-----------------|--------------|-------------------------------------------|
| 1 | A | GUID | |
| 2 | A | GUID | |
| 3 | A | GUID | |
| 4 | A | GUID | |
| 5 | B | GUID | |
| 6 | B | GUID | |
I want to return only N GUIDs per category (largely because I'm hitting the 64 MB Kusto query result limit for some categories that won't be useful anyway).
The top-nested operator looks good at first, BUT I don't want to do any aggregation, and it drops the other important columns. Per the note on its documentation page, I can use Ignore=max(1) to suppress the aggregation, then serialize all my other columns into a single value and unpack after the filter, but that feels like I'm doing something very wrong.
I've also tried something like:
| partition by Category ( top 3 by Value)
But it's limited to 64 partitions, and I need closer to 500.
Any idea of a good pattern to do this?
Here you go:
let NumItemsPerCategory = 3;
datatable(ID:long, Category:string, Value:guid)
[
1, "A", guid(40b73f8f-78d2-4eae-bd5b-b3e00f38ac33),
2, "A", guid(043ee507-aadf-4453-bcc6-d8f4f541b043),
3, "A", guid(f71d3cc0-ce46-474f-9dcd-f3883fa08859),
4, "A", guid(bf259fc8-e9fe-4a99-a296-ca81e1fa250a),
5, "B", guid(d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2),
6, "B", guid(282e74ff-3b71-407c-a2a7-92bb1cb17b27),
]
| summarize PackedItems = make_list(pack_all(), NumItemsPerCategory) by Category
| project-away Category
| mv-expand PackedItem = PackedItems
| evaluate bag_unpack(PackedItem)
| project-away PackedItems
Result:
| ID | Category | Value |
|----|----------|--------------------------------------|
| 1 | A | 40b73f8f-78d2-4eae-bd5b-b3e00f38ac33 |
| 2 | A | 043ee507-aadf-4453-bcc6-d8f4f541b043 |
| 3 | A | f71d3cc0-ce46-474f-9dcd-f3883fa08859 |
| 5 | B | d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2 |
| 6 | B | 282e74ff-3b71-407c-a2a7-92bb1cb17b27 |
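The same pipeline applies unchanged to a real table; only the source changes (MyTable below is a hypothetical table name):
let NumItemsPerCategory = 3;
MyTable
| summarize PackedItems = make_list(pack_all(), NumItemsPerCategory) by Category
| project-away Category
| mv-expand PackedItem = PackedItems
| evaluate bag_unpack(PackedItem)
| project-away PackedItems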

Generating table from dataframe with proportions of 20 variables, for each row, for each possible combination of said variable in R

I have a dataframe with 1000 rows, each representing a different species; for each of these rows there are 20 columns with the proportions of a single variable (amino acids).
For each row (species), I would like to calculate the proportion of each possible combination of the single-letter variables (amino acids).
So each species should have roughly a million (2^20 - 1) calculated combinations of the amino acids.
My code for generating all possible combinations of amino acids is this:
S <- c('G','A','L','M','F','W','K','Q','E','S','P','V','I','C','Y','H','R','N','D','T')
allCombs <- function(x) c(x,
                          lapply(seq_along(x)[-1L],
                                 function(y) combn(x, y, collapse = "")),
                          recursive = TRUE)
Scombi <- allCombs(S)
My dataframe looks like this:
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.011179368033588 | 0.052093758404379 | 0.056116688487831 | 0.033311792369428 | 0.074719969063287 | 0.021456955206517 | 0.062874293719234 | 0.046629846831622 | 0.105160548187069 | 0.023372745414207 | 0.034667218445279 | 0.050847279968411 | 0.052372091362254 | 0.054393907299958 | 0.058415776607691 | 0.059282788930956 | 0.075786041807662 | 0.012266709932789 | 0.025246090272826 |
| Acetobacter_pasteurianus | Bacteria | 26 | 0.113635842586218 | 0.009802006063102 | 0.053600553080754 | 0.058133056353357 | 0.036903783608575 | 0.085210142094237 | 0.021833316616858 | 0.053123968429941 | 0.045353753818743 | 0.096549489115246 | 0.025913145427995 | 0.027225003296464 | 0.052562918173042 | 0.033342785074972 | 0.072705595398914 | 0.049908591821467 | 0.056094207383391 | 0.079084190962059 | 0.010144168305489 | 0.018873482389179 |
| Acetobacterium_woodii | Bacteria | 30 | 0.074955804625209 | 0.011863137047001 | 0.058166310295556 | 0.071786218284636 | 0.03424697521635 | 0.075626240308253 | 0.018397399287915 | 0.087245372635541 | 0.078978610001876 | 0.087790924875632 | 0.03068806687375 | 0.046498124583435 | 0.036120348133785 | 0.031790536900726 | 0.045179171055634 | 0.050727609439901 | 0.055617806111571 | 0.069643619533744 | 0.005984048340735 | 0.028693676448754 |
| Acetohalobium_arabaticum | Bacteria | 37 | 0.07294006171749 | 0.008402092275195 | 0.063388830763099 | 0.094174357919767 | 0.032968396601359 | 0.074335444399095 | 0.014775170057021 | 0.081175614650614 | 0.068173658934912 | 0.096191143631822 | 0.023591084039018 | 0.042176390239929 | 0.036535950562554 | 0.032690297143697 | 0.045929769851454 | 0.05201834344653 | 0.049098780255464 | 0.079225589949997 | 0.004923023531168 | 0.027286000029819 |
| Acholeplasma_laidlawii | Bacteria | 37 | 0.067353087090147 | 0.002160134400001 | 0.056809775441953 | 0.065310218890485 | 0.038735792072418 | 0.069508395797039 | 0.018942086187746 | 0.081435757342441 | 0.084786245636216 | 0.096181862610799 | 0.026545056054257 | 0.045549913713558 | 0.038323250930165 | 0.033008924859672 | 0.047150659509282 | 0.054698408656138 | 0.059971572823796 | 0.072199395290938 | 0.005926270925023 | 0.03540319176793 |
| Achromobacter_xylosoxidans | Bacteria | 30 | 0.120974236639852 | 0.008469732379263 | 0.054028585828065 | 0.055476991380945 | 0.035048667997051 | 0.086814010110846 | 0.02243157894653 | 0.050520668283285 | 0.039296015271673 | 0.099074202941835 | 0.028559018986725 | 0.025845147774914 | 0.049701994138614 | 0.034808403369533 | 0.073998251525545 | 0.050072992977641 | 0.051695040348985 | 0.080314177991249 | 0.011792085285623 | 0.021078197821829 |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
So you can see, each row has the proportion of each amino acid (A, G, I, etc.) over the entire set of amino acids (all 20 add up to 1), but I would like to generate the proportion for each possible combination, so something that looks like the following:
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | AC | AD | AE |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.191179368033588 | 0.1782093758404379 | 0.186116688487831 |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
So for each species there would be roughly a million columns, each representing one of the possible combinations of amino acids without repetition, the largest being the string of all 20.
Apologies for being unclear; does anyone have any ideas on how to create this dataset (or the best way of asking/explaining what I should be looking up)?
Species <- structure(list(Species = c("Acaryochloris_marina",
"Acetobacter_pasteurianus",
"Acetobacterium_woodii", "Acetohalobium_arabaticum", "Acholeplasma_laidlawii",
"Achromobacter_xylosoxidans"), Domain = c("Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria"), Actual.OGT = c(25,
26, 30, 37, 37, 30), A = c(0.089806129655016, 0.113635842586218,
0.074955804625209, 0.07294006171749, 0.067353087090147, 0.120974236639852
), C = c(0.011179368033588, 0.009802006063102, 0.011863137047001,
0.008402092275195, 0.002160134400001, 0.008469732379263), D = c(0.052093758404379,
0.053600553080754, 0.058166310295556, 0.063388830763099, 0.056809775441953,
0.054028585828065), E = c(0.056116688487831, 0.058133056353357,
0.071786218284636, 0.094174357919767, 0.065310218890485, 0.055476991380945
), F = c(0.033311792369428, 0.036903783608575, 0.03424697521635,
0.032968396601359, 0.038735792072418, 0.035048667997051), G = c(0.074719969063287,
0.085210142094237, 0.075626240308253, 0.074335444399095, 0.069508395797039,
0.086814010110846), H = c(0.021456955206517, 0.021833316616858,
0.018397399287915, 0.014775170057021, 0.018942086187746, 0.02243157894653
), I = c(0.062874293719234, 0.053123968429941, 0.087245372635541,
0.081175614650614, 0.081435757342441, 0.050520668283285), K = c(0.046629846831622,
0.045353753818743, 0.078978610001876, 0.068173658934912, 0.084786245636216,
0.039296015271673), L = c(0.105160548187069, 0.096549489115246,
0.087790924875632, 0.096191143631822, 0.096181862610799, 0.099074202941835
), M = c(0.023372745414207, 0.025913145427995, 0.03068806687375,
0.023591084039018, 0.026545056054257, 0.028559018986725), N = c(0.034667218445279,
0.027225003296464, 0.046498124583435, 0.042176390239929, 0.045549913713558,
0.025845147774914), P = c(0.050847279968411, 0.052562918173042,
0.036120348133785, 0.036535950562554, 0.038323250930165, 0.049701994138614
), Q = c(0.052372091362254, 0.033342785074972, 0.031790536900726,
0.032690297143697, 0.033008924859672, 0.034808403369533), R = c(0.054393907299958,
0.072705595398914, 0.045179171055634, 0.045929769851454, 0.047150659509282,
0.073998251525545), S = c(0.058415776607691, 0.049908591821467,
0.050727609439901, 0.05201834344653, 0.054698408656138, 0.050072992977641
), T = c(0.059282788930956, 0.056094207383391, 0.055617806111571,
0.049098780255464, 0.059971572823796, 0.051695040348985), V = c(0.075786041807662,
0.079084190962059, 0.069643619533744, 0.079225589949997, 0.072199395290938,
0.080314177991249), W = c(0.012266709932789, 0.010144168305489,
0.005984048340735, 0.004923023531168, 0.005926270925023, 0.011792085285623
), Y = c(0.025246090272826, 0.018873482389179, 0.028693676448754,
0.027286000029819, 0.03540319176793, 0.021078197821829)), .Names = c("Species",
"Domain", "Actual.OGT", "A", "C", "D", "E", "F", "G", "H", "I",
"K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"), row.names = c(NA,
-6L), class = "data.frame")
I'm not entirely sure that R is the right tool for this job. It's going to take a very, very long time. You may be able to reduce that time using the parallel package if you have sufficient cores, however.
I've put together a process that will accomplish what you want. For each species, it takes my computer about eight minutes to generate the "joint proportion." If you run on a single thread, as R will do inherently, you're looking at close to an hour just to process the six species in your sample data.
I wrote my script to run in parallel, and using seven cores, it took about 11 minutes to complete all six. Extending this over all 1000 species, I wouldn't be surprised if it took as long as two days to do all this (on seven cores). If you have a large cluster, you may be able to cut it down some.
Please note that this will not give you your results as described in your question. I posted a comment that I wasn't sure what formula you were using to get the joint proportions. I am just taking the sum here for ease of demonstration. You will need to adjust your code appropriately.
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
# Reshape data. This will make it easier to split and access proportion
# within each species.
SpeciesLong <-
  Species %>%
  gather(protein, proportion,
         A:Y) %>%
  arrange(Species)
# Get the unique amino acid (protein) names
S <- unique(SpeciesLong$protein)
# Build the combination list
# Note, this is different than your code, I added FUN = paste0
Scombi <- unlist(lapply(seq_along(S),
                        function(x) combn(S, x, FUN = paste0, collapse = "")))
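# (With all 20 amino acids this yields 2^20 - 1 = 1,048,575 combinations in
#  total, from the single letters up to the full 20-letter string.)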
# Function to get the joint proportion
# I took the sum, for convenience. You'll need to replace this
# with whatever function you use to get the joint proportion.
# The key part is getting the correct proteins, which happens within
# the `sum` call.
joint_protein <- function(protein_combo, data){
  sum(data$proportion[vapply(data$protein,
                             grepl,
                             logical(1),
                             protein_combo)])
}
# make a list of data frames, one for each species
SplitSpecies <-
  split(SpeciesLong,
        SpeciesLong$Species)
# Make a cluster of processors to run on
cl <- makeCluster(detectCores() - 1)
# export Scombi and joint_protein to all processes in the cluster
clusterExport(cl, c("Scombi", "joint_protein"))
# Get the aggregate values for each species in a one-row data frame.
SpeciesAggregate <-
  parLapply(cl,
            X = SplitSpecies,
            fun = function(data){
              X <- lapply(Scombi,
                          joint_protein,
                          data)
              names(X) <- Scombi
              as.data.frame(X)
            })
# Join the results to the Species data
# You may want to save your data before this step. I'm not entirely
# sure I did this right to match the rows correctly.
Species <- cbind(Species, SpeciesAggregate)
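One housekeeping step worth adding once the parallel work is finished (not in the original answer): shutting the cluster down releases the worker processes.
# Stop the worker processes started by makeCluster() above
stopCluster(cl)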

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the calculated-field functionality in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(My actual table has many more columns, both dimensions and measures, time, etc., and I need to define multiple such "calculated columns".)
If I want to calculate the defect rate (which would be Defective Units / Total Units) and aggregate by any of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there any way I can declare a calculated field, i.e. a formula that is evaluated after aggregation?
You're lucky: the creator of pivottable.js foresaw cases like yours (and mine, earlier today) and implemented an aggregator called "Sum over Sum", along with a few similar ones; cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll pass "Sum over Sum" as the aggregatorName parameter, and the columns whose quotient we want as the vals parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,rows="gear", cols=c("cyl","carb"),
aggregatorName = "Sum over Sum",
vals =c("mpg","disp"),
width="100%", height="400px")

By group: sum of variable values under condition

I want the sum of variable values by group, with certain values excluded conditional on another variable.
How can I do this elegantly without transposing?
So in the table below, for each (fTicker, DATE_f), I seek to sum the values of wght, excluding from the sum the wght value belonging to the given sTicker.
For example, excl_val for sTicker = A given (fTicker = XLK, DATE_f = 6/20/2003) equals wght(AAPL, 6/20/2003, XLK) + wght(AA, 6/20/2003, XLK), but not the wght for sTicker = A itself.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker values in them (10 to 70), and some sTicker values may belong to several fTicker groups. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS, with a resulting file of about 6 GB, but the same approach in R blew memory up to 40 GB and is basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and there is a mismatch in row lengths. If I could condition the sum to exclude the given sTicker's wght observation from the summation, I think it might work.
About the excl_val length: I computed it in Excel, for just two cells, which is why the column is mostly empty.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains into the data.frame, things become easier.
# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each=k),
  DATE_f  = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
  wght    = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
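If what you need is specifically the leave-one-out sum from the question (for each row, the group total of wght minus that row's own wght), here is a sketch with data.table, assuming your data is in a data frame called weights with the columns shown above:
library(data.table)
setDT(weights)
# For each (fTicker, DATE_f) group, subtract the row's own wght from the group total
weights[, excl_val := sum(wght, na.rm = TRUE) - wght, by = .(fTicker, DATE_f)]
With the sample rows above, this reproduces excl_val = 1.980834016 for sTicker A on 6/20/2003 (0.070080002 + 1.910754014).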
