I want to identify networks in which all people in the same network are directly or indirectly connected through friendship nominations, while no students from different networks are connected.
I am using the Add Health data. Each student nominates up to 10 friends.
A sample of the data may look like this:
ID FID_1 FID_2 FID_3 FID_4 FID_5 FID_6 FID_7 FID_8 FID_9 FID_10
1 2 6 7 9 10 NA NA NA NA NA
2 5 9 12 45 13 90 87 6 NA NA
3 1 2 4 7 8 9 10 14 16 18
100 110 120 122 125 169 178 190 200 500 520
500 100 110 122 125 169 178 190 200 500 520
700 800 789 900 NA NA NA NA NA NA NA
1000 789 2000 820 900 NA NA NA NA NA NA
There are around 85,000 individuals. Could anyone please tell me how I can assign a network ID to each individual?
So, I would like the data to look like the following:
ID network_ID ID network_ID
1 1 700 3
2 1 789 3
3 1 800 3
4 1 820 3
5 1 900 3
6 1 1000 3
7 1 2000 3
8 1
9 1
10 1
12 1
13 1
14 1
16 1
18 1
90 1
87 1
100 2
110 2
120 2
122 2
125 2
169 2
178 2
190 2
200 2
500 2
520 2
So, everyone directly or indirectly connected to ID 1 belongs to network 1. 2 is a friend of 1, so everyone directly or indirectly connected to 2 is also in 1's network, and so on. 700 is not connected to 1, is not a friend of 1, nor a friend of a friend of 1, and so on. Thus 700 is in a different network, which is network 3.
Any help will be much appreciated...
Update
library(igraph)
library(dplyr)
library(data.table)
setDT(df) %>%
  # reshape to long format: one row per (ID, nominated friend) pair
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  # drop empty nomination slots
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  # build the friendship graph and find its connected components
  graph_from_data_frame() %>%
  components() %>%
  membership() %>%
  # turn the named membership vector into a two-column data frame
  stack() %>%
  setNames(c("Network_ID", "ID")) %>%
  rev() %>%
  type.convert(as.is = TRUE) %>%
  arrange(Network_ID, ID)
gives
ID Network_ID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 12 1
12 13 1
13 14 1
14 16 1
15 18 1
16 45 1
17 87 1
18 90 1
19 100 2
20 110 2
21 120 2
22 122 2
23 125 2
24 169 2
25 178 2
26 190 2
27 200 2
28 500 2
29 520 2
30 700 3
31 789 3
32 800 3
33 820 3
34 900 3
35 1000 3
36 2000 3
Data
> dput(df)
structure(list(ID = c(1L, 2L, 3L, 100L, 500L, 700L, 1000L), FID_1 = c(2L,
5L, 1L, 110L, 100L, 800L, 789L), FID_2 = c(6L, 9L, 2L, 120L,
110L, 789L, 2000L), FID_3 = c(7L, 12L, 4L, 122L, 122L, 900L,
820L), FID_4 = c(9L, 45L, 7L, 125L, 125L, NA, 900L), FID_5 = c(10L,
13L, 8L, 169L, 169L, NA, NA), FID_6 = c(NA, 90L, 9L, 178L, 178L,
NA, NA), FID_7 = c(NA, 87L, 10L, 190L, 190L, NA, NA), FID_8 = c(NA,
6L, 14L, 200L, 200L, NA, NA), FID_9 = c(NA, NA, 16L, 500L, 500L,
NA, NA), FID_10 = c(NA, NA, 18L, 520L, 520L, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
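For reference, here is a minimal igraph-only sketch of the same idea (each connected component of the undirected friendship graph becomes one network), assuming df is the data frame dput above:
library(igraph)

friend_cols <- grep("^FID_", names(df), value = TRUE)

# long edge list: one row per (student, nominated friend) pair, empty slots dropped
edges <- na.omit(data.frame(
  from = rep(df$ID, times = length(friend_cols)),
  to   = unlist(lapply(friend_cols, function(col) df[[col]]))
))

# undirected friendship graph; connected components give the network IDs
g <- graph_from_data_frame(edges, directed = FALSE)
memb <- components(g)$membership

data.frame(ID = as.integer(names(memb)), network_ID = as.vector(memb))
The membership vector plays the role of network_ID; friends who never appear in the ID column (e.g. 45) still get a network ID because they are vertices of the graph.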
Are you looking for something like this?
library(data.table)
library(dplyr)
library(igraph)
setDT(df) %>%
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  graph_from_data_frame() %>%
  # plot the friendship graph, labelling each edge with the nomination slot it came from
  plot(edge.label = E(.)$FID)
Data
structure(list(ID = 1:3, FID_1 = c(2L, 5L, 1L), FID_2 = c(6L,
9L, 2L), FID_3 = c(7L, 12L, 4L), FID_4 = c(9L, 45L, 7L), FID_5 = c(10L,
12L, 8L), FID_6 = c(NA, 90L, 9L), FID_7 = c(NA, 87L, 10L), FID_8 = c(NA,
6L, 14L), FID_9 = c(NA, NA, 16L), FID_10 = c(NA, NA, 18L)), class = "data.frame", row.names = c(NA,
-3L))
I have a data frame df with 6 columns: A, B, C, D, E and F. My requirement is to create a new column G that equals the previous value of C + the previous value of D + the previous value of G - F. This needs to be implemented at the group level using columns A and B (group by A and B). If it is the first row within the group, then the value in column G should be equal to E.
Sample Df -
A B C D E F
1 2 100 200 300 0
1 2 110 210 310 10
1 2 120 130 300 10
1 1 140 150 80 0
1 1 50 60 80 20
1 1 50 60 80 20
Output -
A B C D E F G
1 2 100 200 300 0 300
1 2 110 210 310 10 590
1 2 120 130 300 10 900
1 1 140 150 80 0 80
1 1 50 60 80 20 350
1 1 50 60 80 20 440
Please provide a suitable solution.
Here is one option with dplyr where we group by 'A' and 'B', take the lag of 'C', 'D' and 'E', add (+) them, subtract 'F', and coalesce with the 'E' column to fill the first row of each group:
library(dplyr)
df1 %>%
group_by(A, B) %>%
mutate(G = coalesce(lag(C) + lag(D) + lag(E) - F, E))
-output
# A tibble: 6 x 7
# Groups: A, B [2]
# A B C D E F G
# <int> <int> <int> <int> <int> <int> <int>
#1 1 2 100 200 300 0 300
#2 1 2 110 210 310 10 590
#3 1 2 120 130 300 10 620
#4 1 1 140 150 80 0 80
#5 1 1 50 60 80 20 350
#6 1 1 50 60 80 20 170
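Note that rows 3 and 6 above differ from the expected output in the question, because the question's "previous (G)" feeds the newly computed G back into the next row. If that recursive reading is what is intended, the recursion unrolls into a cumulative sum within each group. A sketch under that assumption, which reproduces the G values the question expects (300, 590, 900, 80, 350, 440):
library(dplyr)
df1 %>%
  group_by(A, B) %>%
  mutate(
    G = first(E) + cumsum(
      coalesce(lag(C) + lag(D), 0L) -    # previous C + previous D (0 on the first row)
      ifelse(row_number() == 1, 0L, F)   # F is the data column, not FALSE; skipped on the first row
    )
  ) %>%
  ungroup()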
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 1L, 1L), B = c(2L, 2L, 2L,
1L, 1L, 1L), C = c(100L, 110L, 120L, 140L, 50L, 50L), D = c(200L,
210L, 130L, 150L, 60L, 60L), E = c(300L, 310L, 300L, 80L, 80L,
80L), F = c(0L, 10L, 10L, 0L, 20L, 20L)), class = "data.frame",
row.names = c(NA,
-6L))
Hello all, my df looks like this:
PID V1
123 1
123 2
123 3
111 1
111 2
111 1
122 3
122 1
122 1
333 1
333 4
333 2
I want to delete the rows of any PID that has only events 1 and 2.
Expected output:
PID V1
123 1
123 2
123 3
122 3
122 1
122 1
333 1
333 4
333 2
You can do this in base R:
subset(df, !ave(V1 %in% 1:2, PID, FUN = all))
# PID V1
#1 123 1
#2 123 2
#3 123 3
#7 122 3
#8 122 1
#9 122 1
#10 333 1
#11 333 4
#12 333 2
dplyr
library(dplyr)
df %>% group_by(PID) %>% filter(!all(V1 %in% 1:2))
or with data.table:
library(data.table)
setDT(df)[, .SD[!all(V1 %in% 1:2)], PID]
The logic of all of them is the same: remove the groups (PID) that have only 1 and 2 in the V1 column.
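To see what the base R version is doing, the ave() step computes, for every row, whether its entire PID group contains only 1s and 2s; subset() then keeps the rows where that is FALSE (negated with !). The values shown assume the df from the data section below:
with(df, ave(V1 %in% 1:2, PID, FUN = all))
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE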
data
df <- structure(list(PID = c(123L, 123L, 123L, 111L, 111L, 111L, 122L,
122L, 122L, 333L, 333L, 333L), V1 = c(1L, 2L, 3L, 1L, 2L, 1L,
3L, 1L, 1L, 1L, 4L, 2L)), class = "data.frame", row.names = c(NA, -12L))
I'm trying to use ddply to find the smallest distance between two positions pos where the corresponding chrom is the same in two data frames:
head(bps, 10)
chrom pos iteration
1 1 4 1
2 1 14 1
3 1 68 1
4 1 79 1
5 1 200 1
6 1 205 1
7 1 270 1
8 1 304 1
9 2 7 1
10 2 13 1
head(flocs)
chrom pos
1 1 100
2 1 200
3 1 220
4 1 312
5 2 500
6 2 501
As an example, for the first line in bps (pos = 4), I want to find the closest pos in flocs where chrom = 1, which is 100, giving a value of 4 - 100 = -96.
The pseudocode for what I'm trying to do is:
foreach iteration (bps$iteration):
    foreach chrom (bps$chrom):
        foreach pos (bps$pos):
            features_pos = pos in dataframe flocs closest to pos on the same chromosome
            min_dist = features_pos - pos
            return features_pos, min_dist
I am trying to do this with ddply:
minDists <- ddply(bp_data, c("chrom", "pos"), function(x) {
index <- which.min(abs(flocs$pos[which(flocs$chrom==x$chrom)] - x$pos))
closestMotif <- flocs$pos[index]
chrom <- as.character(flocs$chrom[index])
dist <- (x$pos - closestMotif)
data.frame(features_pos = closestMotif, pos = x$pos, min_dist = dist, feature = feature)
})
But this doesn't constrain comparisons to the same chromosome:
head(minDists, 10)
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 100 7 -93 feature1 # bps chrom=2, flocs chrom=1
10 2 100 13 -87 feature1 # bps chrom=2, flocs chrom=1
The expected output here is:
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 500 7 -493 feature1 # bps chrom=2, flocs chrom=2
10 2 500 13 -487 feature1 # bps chrom=2, flocs chrom=2
I thought that providing the columns c("chrom", "pos") would essentially apply a group_by before the function call.
Is there any way that I can improve what I've written to achieve the desired result?
bps <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), pos = c(4L, 14L, 68L, 79L, 200L, 205L,
270L, 304L, 7L, 13L, 23L, 39L, 100L, 150L, 17L, 55L, 75L, 79L,
102L, 109L, 123L, 155L, 157L, 200L, 260L, 299L, 300L, 320L, 323L,
345L, 450L, 550L), iteration = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor")), row.names = c(NA,
-32L), class = "data.frame")
flocs <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1", "2", "3"), class = "factor"), pos = c(100L,
200L, 220L, 312L, 500L, 501L, 123L, 444L)), row.names = c(NA,
-8L), class = "data.frame")
data.table approach using a rolling join...
updated answer
(I initially forgot all about by-reference joining, which is faster and most certainly shorter ;-) )
library( data.table )
#set data as data.table
setDT( bps, key = c("chrom", "pos") )
setDT( flocs, key = c("chrom", "pos") )
#perform by-reference rolling join
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
output
# chrom pos iteration mindist
# 1: 1 4 1 -96
# 2: 1 14 1 -86
# 3: 1 68 1 -32
# 4: 1 79 1 -21
# 5: 1 200 1 0
# 6: 1 205 1 5
# 7: 1 270 1 -42
# 8: 1 304 1 -8
# 9: 2 7 1 -493
# 10: 2 13 1 -487
# 11: 2 23 1 -477
# 12: 2 39 1 -461
# 13: 2 100 1 -400
# 14: 2 150 1 -350
# 15: 3 17 1 -106
# 16: 3 55 1 -68
# 17: 3 75 1 -48
# 18: 3 79 1 -44
# 19: 3 102 1 -21
# 20: 3 109 1 -14
# 21: 3 123 1 0
# 22: 3 155 1 32
# 23: 3 157 1 34
# 24: 3 200 1 77
# 25: 3 260 1 137
# 26: 3 299 1 -145
# 27: 3 300 1 -144
# 28: 3 320 1 -124
# 29: 3 323 1 -121
# 30: 3 345 1 -99
# 31: 3 450 1 6
# 32: 3 550 1 106
# chrom pos iteration mindist
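To make the mechanics of the join visible, this small sketch returns, for every bps row, the flocs position that roll = "nearest" matched it to (the nearest_floc name is just illustrative):
# one row per bps row: its own pos plus the nearest flocs pos on the same chrom
flocs[bps, .(chrom, pos = i.pos, nearest_floc = x.pos), roll = "nearest"]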
Benchmarking the answers so far
# Unit: milliseconds
# expr min lq mean median uq max neval
# Ronak_base 2.355879 2.555768 2.973069 2.626415 2.773581 8.016016 100
# Wimpel_data.table 1.697921 2.035788 2.416199 2.209616 2.361001 17.724528 100
# Pawel_tidyverse 14.845354 15.310505 16.333158 15.814819 16.541618 24.077871 100
microbenchmark::microbenchmark(
Ronak_base = {
bps$min_dist <- unlist(mapply(return_min_value, unique(bps$chrom), split(bps$pos, bps$chrom)))
},
Wimpel_data.table = {
setDT( bps, key = c("chrom", "pos") )
setDT( flocs, key = c("chrom", "pos") )
#perform by-reference rolling join
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
},
Pawel_tidyverse = {
bps %>%
select(-iteration) %>%
unite('bps') %>%
crossing(flocs %>% unite('flocks')) %>%
separate(bps, c('chrom_bps', 'pos')) %>%
separate(flocks, c('chrom_flocks', 'features_pos')) %>%
filter(chrom_bps == chrom_flocks) %>%
select(-chrom_flocks) %>%
rename_at(1, ~'chrom') %>%
mutate_all(as.numeric) %>%
mutate(min_dist = pos - features_pos) %>%
group_by(chrom, pos) %>%
filter(abs(min_dist) == min(abs(min_dist)))
}
)
Looks like my data.table answer and the answer by Ronak Shah are pretty close together. I believe data.table will gain a clear advantage when the data sets get larger (but I haven't tested that).
My base R attempt creates a helper function (return_min_value). This function subsets flocs based on the current chrom and, for each pos, returns the signed difference to the closest flocs position (the difference with the smallest absolute value). We split the pos column by chrom and pass these values, along with the unique chrom values, to return_min_value via mapply.
return_min_value <- function(x, y) {
  # x: a single chrom value; y: the bps positions on that chromosome
  sapply(y, function(p) {
    # signed differences to every flocs position on the same chromosome
    vals = p - flocs$pos[flocs$chrom == x]
    # keep the difference with the smallest absolute value
    vals[which.min(abs(vals))]
  })
}
bps$min_dist <- unlist(mapply(return_min_value,
unique(bps$chrom), split(bps$pos, bps$chrom)))
bps
# chrom pos iteration min_dist
#1 1 4 1 -96
#2 1 14 1 -86
#3 1 68 1 -32
#4 1 79 1 -21
#5 1 200 1 0
#6 1 205 1 5
#7 1 270 1 -42
#8 1 304 1 -8
#9 2 7 1 -493
#10 2 13 1 -487
#...
Check this solution:
library(tidyverse)
bps %>%
select(-iteration) %>%
unite('bps') %>%
crossing(flocs %>% unite('flocks')) %>%
separate(bps, c('chrom_bps', 'pos')) %>%
separate(flocks, c('chrom_flocks', 'features_pos')) %>%
filter(chrom_bps == chrom_flocks) %>%
select(-chrom_flocks) %>%
rename_at(1, ~'chrom') %>%
mutate_all(as.numeric) %>%
mutate(min_dist = pos - features_pos) %>%
group_by(chrom, pos) %>%
filter(abs(min_dist) == min(abs(min_dist)))
Output:
chrom pos features_pos min_dist
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 -96
2 1 14 100 -86
3 1 68 100 -32
4 1 79 100 -21
5 1 200 200 0
6 1 205 200 5
7 1 270 312 -42
8 1 304 312 -8
9 2 7 500 -493
10 2 13 500 -487
# ... with 22 more rows
I have a dataset of game sessions (id, session count, average session length in seconds, and the dates of the sessions for each id). Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
mydat
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
For each id, I need to mark the last observation (the last date when the player was seen) with 1, and mark all other observations for that id with zero.
For example, id 96308016 has 5 observations. Therefore, we mark the fifth (last) observation with 1 and the previous 4 observations with zero.
If an id has only 1 observation, we mark it with 1 as well, like id 74385162.
To make it clearer, here is my expected output:
udid count avg_duration date diff
74385162 1 39 31.10.16 1
79599601 1 1216 15.11.16 0
79599601 1 568 15.11.16 1
91475825 1 5 11.10.16 0
91475825 3 6 11.10.16 1
91492531 1 79 8.10.16 1
92137561 1 9 12.12.16 1
96308016 2 426 28.11.16 0
96308016 2 78 28.11.16 0
96308016 1 884 28.11.16 0
96308016 1 785 29.11.16 0
96308016 1 785 29.11.16 1
96495076 1 22 21.10.16 1
97135620 2 302 18.10.16 0
97135620 1 738 15.12.16 0
97135620 1 280 15.12.16 0
97135620 1 2782 16.12.16 0
97135620 1 5 17.12.16 0
97135620 1 2284 18.12.16 0
97135620 1 144 18.12.16 0
97135620 1 234 18.12.16 0
97135620 1 231 18.12.16 0
97135620 1 539 18.12.16 1
97165942 1 450 26.10.16 1
How do I do that?
You could do the following:
library(dplyr)
mydat = mydat %>%
group_by(udid) %>%
mutate(diff=ifelse(row_number()==n(),1,0)) %>%
as.data.frame()
Output:
udid count avg_duration date diff
1 74385162 1 39 31.10.16 1
2 79599601 1 1216 15.11.16 0
3 79599601 1 568 15.11.16 1
4 91475825 1 5 11.10.16 0
5 91475825 3 6 11.10.16 1
6 91492531 1 79 8.10.16 1
7 92137561 1 9 12.12.16 1
8 96308016 2 426 28.11.16 0
9 96308016 2 78 28.11.16 0
10 96308016 1 884 28.11.16 0
11 96308016 1 785 29.11.16 0
12 96308016 1 785 29.11.16 1
13 96495076 1 22 21.10.16 1
14 97135620 2 302 18.10.16 0
15 97135620 1 738 15.12.16 0
16 97135620 1 280 15.12.16 0
17 97135620 1 2782 16.12.16 0
18 97135620 1 5 17.12.16 0
19 97135620 1 2284 18.12.16 0
20 97135620 1 144 18.12.16 0
21 97135620 1 234 18.12.16 0
22 97135620 1 231 18.12.16 0
23 97135620 1 539 18.12.16 1
24 97165942 1 450 26.10.16 1
If it's already sorted by date, then this will work:
mydat$diff = as.integer(!duplicated(mydat$udid, fromLast = TRUE))
head(mydat)
# udid count avg_duration date diff
# 1 74385162 1 39 31.10.16 1
# 2 79599601 1 1216 15.11.16 0
# 3 79599601 1 568 15.11.16 1
# 4 91475825 1 5 11.10.16 0
# 5 91475825 3 6 11.10.16 1
# 6 91492531 1 79 8.10.16 1
If it's not already sorted by date, convert to Date class, sort, and then do the above:
mydat$date = as.Date(mydat$date, format = "%d.%m.%y")
mydat = mydat[order(mydat$udid, mydat$date), ]
If you don't want to rely on sorting by date, then the logic can be achieved with:
mydat$date = as.Date(mydat$date, "%d.%m.%y")
mydat %>%
group_by(udid) %>%
mutate(diff = ifelse(date == max(date), 1L, 0L)) #Last date
udid count avg_duration date diff
<int> <int> <int> <date> <int>
1 74385162 1 39 2016-10-31 1
2 79599601 1 1216 2016-11-15 1
3 79599601 1 568 2016-11-15 1
4 91475825 1 5 2016-10-11 1
5 91475825 3 6 2016-10-11 1
6 91492531 1 79 2016-10-08 1
7 92137561 1 9 2016-12-12 1
8 96308016 2 426 2016-11-28 0
9 96308016 2 78 2016-11-28 0
10 96308016 1 884 2016-11-28 0
# ... with 14 more rows
But it seems your sample data has duplicate dates, which prevents the above logic from marking only a single row per id. The solution should work on real data, especially when date is a full date/time.
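If you do need to break such ties in the sample data, one sketch (assuming the date column has been converted with the corrected %d.%m.%y format and that the rows within each udid are already in chronological order) is to keep the 1 only on the last of the rows sharing the group's maximum date:
library(dplyr)
mydat %>%
  group_by(udid) %>%
  mutate(
    on_last_date = date == max(date),                    # rows on the id's last date
    diff = as.integer(on_last_date &
                      row_number() == max(row_number()[on_last_date]))
  ) %>%
  select(-on_last_date) %>%
  ungroup()
This reproduces the expected diff column from the question, with exactly one 1 per udid.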