SAS: recursively process dataset to get all connections between entities

I have a bridging table in01 that contains relations, e.g. Customer A is related to Customer B (which is then stored as two duplicate records, A/B and B/A).
/* 1. Test data */
data in01(keep=primary: secondary:);
  infile datalines;
  length primary_party_no secondary_party_no $ 15;
  input primary_party_no secondary_party_no;
datalines;
A B
B A
A C
C A
B D
D B
W Z
Z W
Y Z
Z Y
X Y
Y X
;
run;
Task: My task is to group all customers with a connection and create an ID, regardless of the number of "links" required to connect the customers.
In the example above Group 1 would consist of A, B, C, D, while Group 2 would consist of W, X, Y, Z.
I reckon the data has to be looped over recursively; I am, however, unable to figure out how to design a data step or macro that calls itself in each loop.
What would be a good starting point to tackle this problem?

Looks like you are trying to calculate the connected subgraphs of the network that your connections build. See this previous question: Identifying groups/networks of customers.
In your case the node ID variables are character instead of numeric. The macro mentioned in that other question has been updated to handle character node IDs: https://github.com/sasutils/macros/blob/master/subnet.sas
It sounds like your connections are not one-way, so disable the directed option.
filename src url 'https://raw.githubusercontent.com/sasutils/macros/master/subnet.sas';
%include src;
%subnet
(in=in01
,out=want
,from=primary_party_no
,to=secondary_party_no
,directed=0
);
proc print data=want;
run;
Results:

Obs    primary_party_no    secondary_party_no    subnet
  1           A                    B                 1
  2           A                    C                 1
  3           B                    A                 1
  4           B                    D                 1
  5           C                    A                 1
  6           D                    B                 1
  7           W                    Z                 2
  8           X                    Y                 2
  9           Y                    X                 2
 10           Y                    Z                 2
 11           Z                    W                 2
 12           Z                    Y                 2


Duplicate each row in a data frame a number of times equal to how many times a value in that row shows up in another data frame?

I apologize as I wasn't quite sure how to word my question without making it extremely lengthy, as the duplicate rows also need to have some altered values from the original.
I have two data frames. The first, df1, records all paths actually taken from source to destination, while the second, df2, contains all possible paths. Some sample data is below:
df1

Row  Source  Destination  Payload
1    A       B            10010101
2    A       D            11101011
3    A       B            10111111
4    E       B            01100110

df2

Row  Source  Destination
1    A       B
2    B       A
3    B       C
4    B       E
5    B       F
6    A       D
7    D       A
8    D       C
9    D       H
For my data, it is assumed that if an object takes a path A -> B, for example, it also takes every possible path stemming from B that isn't back to the original source (think of a networking hub: in one way, and out every other). So since we have a payload that goes from A -> B, I also need to record that same payload going from B to C, E, and F. I'm currently accomplishing this in the for loop below, but I would like to know if there is a better way to do it, preferably one that doesn't use looping. I'm also somewhat new to R, so even simple corrections to my code are appreciated.
library(dplyr)  # count() used below comes from dplyr

for (row in 1:nrow(df1)) {
  initialSource <- df1$Source[row]   # saves the initial source
  paths <- df1[row, ]                # saves the current row for duplication
  # duplicates the row once per connection leaving the hub
  paths <- paths[rep(1, times = count(df2[df2$Source %in% df1$Destination[row], ])[[1]]), ]
  paths$Source <- paths$Destination  # replaces the source values with the location of the hub
  # replaces the destination values with every connection from the hub
  paths$Destination <- df2$Destination[df2$Source %in% paths$Destination]
  # removes the row that would indicate data being sent back to the source
  paths <- paths[!(paths$Destination %in% initialSource), ]
  # saves the new data to a larger data frame that df1 is actually a sample of
  masterdf <- rbind(masterdf, paths)
}
By the end of the first loop iteration with the above data, the data frame paths would look like:
Row  Source  Destination  Payload
1    B       C            10010101
2    B       E            10010101
3    B       F            10010101
Maybe you could try merging your two dataframes. With base R merge you could do the following (using "Destination" from df1 and "Source" from df2). You would need to remove rows to exclude the "original source" as you described. Renaming and selecting the columns gives you the final output. Please let me know if this is what you had in mind.
d <- subset(
  merge(df1, df2, by.x = "Destination", by.y = "Source", all = TRUE),
  Source != Destination.y
)
data.frame(
  Source = d$Destination,
  Destination = d$Destination.y,
  Payload = d$Payload
)
Output
Source Destination Payload
1 B C 10010101
2 B E 10010101
3 B F 10010101
4 B C 10111111
5 B E 10111111
6 B F 10111111
7 B C 1100110
8 B F 1100110
9 B A 1100110
10 D C 11101011
11 D H 11101011
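For reference, here is a dplyr sketch of the same join; this assumes the dplyr package and the column names from the tables above, and Hub/NextHop are just illustrative renames to avoid ambiguous suffixes in the join result. An inner join also mirrors the loop's behaviour of silently skipping destinations that have no outgoing paths in df2.
library(dplyr)

# Rename df2 up front so the join result has no ambiguous column names
# (Hub and NextHop are illustrative names, not from the original question)
hops <- df2 %>% rename(Hub = Source, NextHop = Destination)

df1 %>%
  inner_join(hops, by = c("Destination" = "Hub")) %>%  # match each taken path to the hub's outgoing paths
  filter(NextHop != Source) %>%                        # drop the path back to the original source
  transmute(Source = Destination,                      # the hub becomes the new source
            Destination = NextHop,
            Payload = Payload)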

How to transpose a long data frame every n rows

I have a data frame like this:
x = data.frame(type  = c('a','b','c','a','b','a','b','c'),
               value = c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, c, but some items may have missing records, i.e. only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g, assuming each occurrence of a starts a new group; use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, which is slightly more complex but would be needed if some groups have a missing, is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)  # factor() needed if type is character rather than factor
Note that ultimately there must be some restriction on missingness; otherwise the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then whether b and c in the second group actually form a second group or are part of the first group is not determinable.
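Note also that the two answers use different grouping conventions: rowid(type) numbers each occurrence of a type, while cumsum(x$type == "a") starts a new item at every a, which is the convention the desired y follows. For completeness, a tidyverse sketch of that second convention, assuming the dplyr and tidyr packages:
library(dplyr)
library(tidyr)

x %>%
  mutate(item = cumsum(type == "a")) %>%            # each "a" starts a new item
  pivot_wider(names_from = type, values_from = value) %>%
  as.data.frame()
#   item a  b  c
# 1    1 5  2  3
# 2    2 2 10 NA
# 3    3 6  7  8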

Given set of column values, create data.frame with known number of rows

I'm trying to make test datasets with a fixed number of rows; however, I'm writing to a destination that requires known keys for each column. For my example, assume that these keys are lowercase letters, uppercase letters and numbers respectively.
I need to make a function which, provided only the required number of rows, combines keys such that the number of combinations equals the required number. Naturally there will be some impossible cases, such as prime numbers larger than the largest key set and values larger than the product of the key-set sizes.
A sample output dataset of 10 rows could look like the following:
data.frame(col1 = rep("a", 10),
           col2 = rep(LETTERS[1:5], 2),
           col3 = rep(1:2, 5))
col1 col2 col3
1 a A 1
2 a B 2
3 a C 1
4 a D 2
5 a E 1
6 a A 2
7 a B 1
8 a C 2
9 a D 1
10 a E 2
Note here that I had to manually specify the keys to get the desired number of rows. How can I arrange things so that R can do this for me?
Things I've already considered
optim - The equation I'm trying to solve is effectively x * y * z = n where all of them must be integers. optim doesn't seem to support that constraint
expand.grid and then subset - almost 500 million combinations, eats up all my memory - not an option.
lpSolve - Has the integer option, but only seems to support linear equations. Could use logs to make it linear, but then I can't use the integer option.
factorize from gmp to get factors - Thought about this, but I can't think of a way to distribute the prime factors back into the keys. EDIT: Maybe a bin packing problem?
For integer optimisation at a small scale you can use a grid search. Other possibilities are described here.
This should work for your example.
N <- 10
fr <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  (x1 * x2 * x3 - N)^2
}
library(NMOF)
gridSearch(fr, list(seq(0, 5), seq(0, 5), seq(0, 5)))$minlevels
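Once gridSearch (or any factorisation) has produced integer dimensions whose product is the required row count, the data frame itself can be built with expand.grid over just the first few keys of each set; unlike the full grid over all keys, this only ever materialises N rows. A sketch, assuming the lowercase/uppercase/number key sets from the question (build_keys is an illustrative helper, not part of NMOF):
# dims: integer vector c(x, y, z) with x * y * z == N
build_keys <- function(dims) {
  expand.grid(col1 = letters[seq_len(dims[1])],
              col2 = LETTERS[seq_len(dims[2])],
              col3 = seq_len(dims[3]))
}
build_keys(c(1, 5, 2))  # 1 * 5 * 2 == 10 distinct rows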
I am a bit reluctant, but we can work things out:
a1 <- 2
a2 <- 5
eval(parse(text = paste0(
  "data.frame(col1 = rep(LETTERS[1], ", a1 * a2,
  "), col2 = rep(LETTERS[1:", a2, "], ", a1,
  "), col3 = rep(1:", a1, ", ", a2, "))"
)))
col1 col2 col3
1 A A 1
2 A B 2
3 A C 1
4 A D 2
5 A E 1
6 A A 2
7 A B 1
8 A C 2
9 A D 1
10 A E 2
Is this something similar to what you are asking?
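The same construction also works without eval(parse()), since a1 and a2 can be used directly inside rep(); a minimal sketch:
a1 <- 2
a2 <- 5
data.frame(col1 = rep(LETTERS[1], a1 * a2),
           col2 = rep(LETTERS[1:a2], a1),
           col3 = rep(1:a1, a2))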

Counting repetition in r

I want to count the number of specific repetitions in my dataframe. Here is a reproducible example
df <- data.frame(Date   = c('5/5', '5/5', '5/5', '5/6', '5/7'),
                 First  = c('a','b','c','a','c'),
                 Second = c('A','B','C','D','A'),
                 Third  = c('q','w','e','w','q'),
                 Fourth = c('z','x','c','v','z'))
This gives:
Date First Second Third Fourth
1 5/5 a A q z
2 5/5 b B w x
3 5/5 c C e c
4 5/6 a D w v
5 5/7 c A q z
I read a big file that holds 400,000 instances, and I want to know different statistics about specific attributes. As an example, here I'd like to know how many times a happens on 5/5. I tried using sum(df$Date == '5/5' & df$First == 'a', na.rm = TRUE), which gave me the right result here, but when I run it on the big data set, the numbers are not accurate.
Any idea why?
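Hard to say without seeing the big file, but one way to cross-check counts like this is to tabulate every Date/First combination in one pass and compare the "5/5"/"a" cell against the sum() result. A sketch; the trimws()/as.character() calls are an assumption on my part, in case stray whitespace or factor levels in the raw file are what make the pairwise sums drift:
with(df, table(Date  = trimws(as.character(Date)),
               First = trimws(as.character(First))))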

Aggregate with trimmed means in R

I am trying to aggregate data like this in R:
df <- data.frame(c("a","a","a","a","a","b","b","b","b","b","c","c","c"))
colnames(df) <- "f"
set.seed(10)
df$e <- rnorm(13, 20, 5)
f e
1 a 20.09373
2 a 19.07874
3 a 13.14335
4 a 17.00416
5 a 21.47273
6 b 21.94897
7 b 13.95962
8 b 18.18162
9 b 11.86664
10 b 18.71761
11 c 25.50890
12 c 23.77891
13 c 18.80883
I would like to aggregate this by the column f and compute a trimmed mean of e for each unique f type (i.e. produce 3 rows of data).
I tried:
df2=data.frame(0)
df2=aggregate(df$e, by = "f",mean(df$e, trim=0.1))
got the following error:
Error in match.fun(FUN) :
'mean(df$e, trim = 0.1)' is not a function, character or symbol
Tried a few searches online and came up empty. My actual data consists of around 30 values of e per f, so I am not concerned that trim = 0.1 won't actually trim the means in the example (because no points lie outside the upper and lower 5th percentiles); it will with the real data. This is just to get the aggregate function working as intended. Thanks!
Try this
df2 = aggregate(e ~ f, data = df, mean, trim = 0.1)
f e
1 a 18.15854
2 b 16.93489
3 c 22.69888
The function to use for the calculation can in this case be given just by its name, for example mean, and additional parameters needed for that function are set after a comma.
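Equivalently, the trim argument can be made explicit by wrapping mean in an anonymous function (a minimal sketch):
df2 <- aggregate(e ~ f, data = df, FUN = function(v) mean(v, trim = 0.1))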
