I'm really new to Spark and GraphX. My question: if I have a graph in which some nodes have mutual (reciprocal) edges between them, how can I select those edges with good performance? An example:
Source Dst.
1 2
1 3
1 4
1 5
2 1
2 5
2 6
2 7
3 1
I want to get the result:
1 2
2 1
1 3
3 1
The order may be arbitrary. Does anyone have an idea how I can get this?
Try:
edges.intersection(edges.map(e => Edge(e.dstId, e.srcId)))
Note that this compares the Edge.attr values as well. If you want to ignore attr values, then do this:
edges.map(e => (e.srcId, e.dstId)).intersection(edges.map(e => (e.dstId, e.srcId)))
I have a dataset called restrictions that records whether people can do certain actions (eat with a fork, get out of bed, ...).
Each number represents the level of difficulty with which each individual can do an action (1: no difficulty, 2: some difficulty, 3: high difficulty, 4: cannot do the action at all).
I am mostly interested in level 4.
The dataset looks like this (with many more variables)
> head(restrictions)
RATOI_I RAHAB_I RANOU_I RAELI_I RAACH_I RAREP_I RAMEN_I RAADM_I RAMED_I RADPI_I RADPE_I RABUS_I
1 4 4 1 1 4 4 4 4 1 1 4 4
2 4 3 3 1 4 4 4 4 4 2 4 4
I would like to know how many people are at level 4 in RATOI_I (that I can do), and, among the people at level 4 in RATOI_I, how many are at level 4 in RAHAB_I and in each of the other variables.
I looked at the sapply() function, but I am completely lost; I do not know how to use it or with which function.
Or must I maybe use the group_by() function?
Thanks in advance!
You can use apply with sum on restrictions == 4 to count the number of values equal to 4 in each column.
apply(restrictions==4, 2, sum)
#colSums(restrictions==4) #Alternative
#RATOI_I RAHAB_I RANOU_I RAELI_I RAACH_I RAREP_I RAMEN_I RAADM_I RAMED_I RADPI_I RADPE_I RABUS_I
# 2 1 0 0 2 2 2 2 1 0 2 2
Or only for the rows having restrictions$RATOI_I == 4 (thanks to @Daniel-o for pointing this out); note the comma, which selects rows rather than columns:
apply(restrictions[restrictions$RATOI_I == 4, ] == 4, 2, sum)
#colSums(restrictions[restrictions$RATOI_I == 4, ] == 4)
#RATOI_I RAHAB_I RANOU_I RAELI_I RAACH_I RAREP_I RAMEN_I RAADM_I RAMED_I RADPI_I RADPE_I RABUS_I
# 2 1 0 0 2 2 2 2 1 0 2 2
We can also do this in base R by recoding the values first, working on a copy so the original data is not overwritten:
df <- restrictions
df[df < 4] <- 0
df[df == 4] <- 1
colSums(df)
#RATOI_I RAHAB_I RANOU_I RAELI_I RAACH_I RAREP_I RAMEN_I RAADM_I RAMED_I RADPI_I RADPE_I RABUS_I
# 2 1 0 0 2 2 2 2 1 0 2 2
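Since you also asked about group_by(), here is a minimal dplyr sketch of the same counts. It assumes restrictions is a plain data frame with the columns shown in the head() output above:
library(dplyr)

restrictions %>%
  filter(RATOI_I == 4) %>%                        # keep only people at level 4 in RATOI_I
  summarise(across(everything(), ~ sum(. == 4)))  # count level-4 answers in every column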
I want to use t-SNE to visualize users' variables, but I want to be able to join the result back to each user's social information.
Unfortunately, the output of Rtsne doesn't seem to include the user id.
The data looks like this:
client_id recency frequen monetary
1 2 1 1 1
2 3 3 1 2
3 4 1 1 2
4 5 3 1 1
5 6 4 1 2
6 7 5 1 1
and the Rtsne output:
x y
1 -6.415009 -0.4726438
2 -16.027732 -9.3751709
3 17.947615 0.2561859
4 1.589996 13.8016613
5 -9.332319 -13.2144419
6 10.545698 8.2165265
and the code:
tsne = Rtsne(rfm[, -1], dims=2, check_duplicates=F)
Rtsne preserves the input order of the dataframe you pass to it.
Try:
Tsne_with_ID = cbind.data.frame(rfm[,1], tsne$Y)
and then just fix the first column name:
colnames(Tsne_with_ID)[1] <- paste(colnames(rfm)[1])
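Putting it together, a minimal sketch assuming rfm is laid out as in your example (client_id in the first column, numeric RFM features after it):
library(Rtsne)

set.seed(42)  # t-SNE is stochastic; fix the seed for reproducible coordinates
tsne <- Rtsne(rfm[, -1], dims = 2, check_duplicates = FALSE)

# tsne$Y holds the 2-D coordinates in the same row order as rfm,
# so binding the first column back on re-attaches each client_id
Tsne_with_ID <- cbind.data.frame(rfm[, 1], tsne$Y)
colnames(Tsne_with_ID) <- c(colnames(rfm)[1], "x", "y")
From there you can join Tsne_with_ID to the social-information table on client_id.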
I want to create a count variable for another variable (innings). Additionally, I want the count variable to reset every time the innings variable changes (from 1 to 2, or back again). For example:
innings count
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
2 1
2 2
2 3
2 4
2 5
2 6
2 7
2 8
2 9
2 10
2 11
1 1
1 2
1 3
1 4
1 5
1 6
I have tried the following code:
data T20_SCORECARD_data_innings;
set T20_SCORECARD_data_innings;
count + 1;
by innings;
if first.innings then count = 0;
run;
But it doesn't seem to work.
Any help would be greatly appreciated.
Ankit
If your data is truly not sorted and is simply grouped into runs of 1s and 2s, then you can use your code but add the NOTSORTED option to your BY statement; also reset the count before incrementing it, so each group starts at 1.
data T20_SCORECARD_data_innings;
set T20_SCORECARD_data_innings;
by innings NOTSORTED;
if first.innings then count = 0;
count + 1;
run;
In a data step, when you use a BY statement the data needs to be sorted (or you must add NOTSORTED). In your case it isn't sorted. If you changed the third group (the second group of 1s) to 3s, your original code would work.
I'm trying to analyze some data using R, but I'm not very familiar with R (yet) and therefore I'm totally stuck.
What I try to do is manipulate my input data so I can use it to calculate Cohen's Kappa.
Now the problem is that for rater_2 I have several ratings for some of the items, and I need to select one. If rater_2 has given the same rating on an item as rater_1, then that rating should be chosen; if not, any rating from the list can be used.
I tried
unique(merge(rater_1, rater_2, all.x=TRUE))
which brings me close, but when the ratings of the two raters diverge I end up with more than one row per item.
So, my question is, how do I get from
item rating_1
1 3
2 5
3 4
item rating_2
1 2
1 3
2 4
2 1
2 2
3 4
3 2
to
item rating_1 rating_2
1 3 3
2 5 4
3 4 4
?
There are some fancy ways to do this, but I thought it might be helpful to combine a few basic techniques to accomplish this task. Usually, in your question, you should include some easy way to generate your data, like this:
# Create some sample data
set.seed(1)
id<-rep(1:50)
rater_1<-sample(1:5,50,replace=TRUE)
df1<-data.frame(id,rater_1)
id<-rep(1:50,each=2)
rater_2<-sample(1:5,100,replace=TRUE)
df2<-data.frame(id,rater_2)
Now, here is one simple technique for doing this.
# Merge together the data frames.
all.merged<-merge(df1,df2)
# id rater_1 rater_2
# 1 1 2 3
# 2 1 2 5
# 3 2 2 3
# 4 2 2 2
# 5 3 3 1
# 6 3 3 1
# Find the ones that are equal.
same.rating<-all.merged[all.merged$rater_2==all.merged$rater_1,]
# Consider id 44, sometimes they match twice.
# So remove duplicates.
same.rating<-same.rating[!duplicated(same.rating),]
# Find the ones that never matched.
not.same.rating<-all.merged[!(all.merged$id %in% same.rating$id),]
# Pick one. I chose to pick the maximum.
picked.rating<-aggregate(rater_2~id+rater_1,not.same.rating,max)
# Stick the two together.
result<-rbind(same.rating,picked.rating)
result<-result[order(result$id),] # Sort
# id rater_1 rater_2
# 27 1 2 5
# 4 2 2 2
# 33 3 3 1
# 44 4 5 3
# 281 5 2 4
# 11 6 5 5
A fancy way to do this would be like this:
same.or.random <- function(x) {
  matched <- which(x$rater_1 == x$rater_2)
  if (length(matched) > 0) x[matched[1], ]
  else x[sample(1:nrow(x), 1), ]
}
merged <- merge(df1, df2)
do.call(rbind, by(merged, merged$id, same.or.random))
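For comparison, a compact dplyr sketch of the same pick-the-matching-row-per-id idea, using the df1/df2 sample data above (when the raters never agree on an id, it simply keeps the first rating rather than a random one):
library(dplyr)

merge(df1, df2) %>%
  group_by(id) %>%
  arrange(desc(rater_1 == rater_2), .by_group = TRUE) %>%  # rows where the raters agree come first
  slice(1) %>%                                             # keep one row per id
  ungroup()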
I have lots of little vectors like this:
id<-c(2,2,5,2,1,9,4,4,3,9,5,5)
and I want to create another vector of the same length that has the cumulative number of occurrences of each id thus far. For the above example data this would look like:
> id.count
[1] 1 2 1 3 1 1 1 2 1 2 2 3
I can not find a function that does this easily, maybe because I do not really know how to fully articulate in words what it is that I actually want (hence the slightly awkward question title). Any suggestions?
Here is another way:
ave(id,id,FUN=seq_along)
Gives:
[1] 1 2 1 3 1 1 1 2 1 2 2 3
> sapply(1:length(id), function(i) sum(id[1:i] == id[i]))
[1] 1 2 1 3 1 1 1 2 1 2 2 3
Or if you have NAs in id you should use id[1:i] %in% id[i] instead.
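If you already use data.table or dplyr, here are two alternatives that give the same running count, shown only as a sketch with the id vector above:
id <- c(2, 2, 5, 2, 1, 9, 4, 4, 3, 9, 5, 5)

# data.table: rowid() numbers each value's occurrences in order of appearance
data.table::rowid(id)
# [1] 1 2 1 3 1 1 1 2 1 2 2 3

# dplyr: the same idea inside a data frame
library(dplyr)
data.frame(id) %>%
  group_by(id) %>%
  mutate(id.count = row_number()) %>%
  ungroup()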