I want to identify networks where all people in the same network are directly or indirectly connected through friendship nominations, while no students from different networks are connected.
I am using the Add Health data. Each student nominates up to 10 friends.
Say, sample data may look like this:
ID FID_1 FID_2 FID_3 FID_4 FID_5 FID_6 FID_7 FID_8 FID_9 FID_10
1 2 6 7 9 10 NA NA NA NA NA
2 5 9 12 45 13 90 87 6 NA NA
3 1 2 4 7 8 9 10 14 16 18
100 110 120 122 125 169 178 190 200 500 520
500 100 110 122 125 169 178 190 200 500 520
700 800 789 900 NA NA NA NA NA NA NA
1000 789 2000 820 900 NA NA NA NA NA NA
There are around 85,000 individuals. Could anyone please tell me how I can get network ID?
So, I would like the data to look like the following:
ID network_ID ID network_ID
1 1 700 3
2 1 789 3
3 1 800 3
4 1 820 3
5 1 900 3
6 1 1000 3
7 1 2000 3
8 1
9 1
10 1
12 1
13 1
14 1
16 1
18 1
90 1
87 1
100 2
110 2
120 2
122 2
125 2
169 2
178 2
190 2
200 2
500 2
520 2
So, everyone directly or indirectly connected to ID 1 belongs to network 1. 2 is a friend of 1, so everyone directly or indirectly connected to 2 is also in 1's network, and so on. 700 is not connected to 1, nor to a friend of 1, nor to a friend of a friend of 1, and so on; thus 700 is in a different network, which is network 3.
Any help will be much appreciated...
Update
library(igraph)
library(dplyr)
library(data.table)
setDT(df) %>%
  # reshape to long format: one row per (ID, friend ID) pair
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  # build the friendship graph and find its connected components
  graph_from_data_frame() %>%
  components() %>%
  membership() %>%
  # turn the named membership vector into an ID / Network_ID data frame
  stack() %>%
  setNames(c("Network_ID", "ID")) %>%
  rev() %>%
  type.convert(as.is = TRUE) %>%
  arrange(Network_ID, ID)
gives
ID Network_ID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 12 1
12 13 1
13 14 1
14 16 1
15 18 1
16 45 1
17 87 1
18 90 1
19 100 2
20 110 2
21 120 2
22 122 2
23 125 2
24 169 2
25 178 2
26 190 2
27 200 2
28 500 2
29 520 2
30 700 3
31 789 3
32 800 3
33 820 3
34 900 3
35 1000 3
36 2000 3
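As a small usage note, the membership table can then be joined back onto the wide data to tag each nominating student with a network ID. This is just a sketch; net_ids is a hypothetical name for the ID/Network_ID result of the pipeline above.
library(dplyr)

# net_ids: the ID / Network_ID data frame returned by the pipeline above (hypothetical name)
df_with_networks <- left_join(df, net_ids, by = "ID")
head(df_with_networks)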
Data
> dput(df)
structure(list(ID = c(1L, 2L, 3L, 100L, 500L, 700L, 1000L), FID_1 = c(2L,
5L, 1L, 110L, 100L, 800L, 789L), FID_2 = c(6L, 9L, 2L, 120L,
110L, 789L, 2000L), FID_3 = c(7L, 12L, 4L, 122L, 122L, 900L,
820L), FID_4 = c(9L, 45L, 7L, 125L, 125L, NA, 900L), FID_5 = c(10L,
13L, 8L, 169L, 169L, NA, NA), FID_6 = c(NA, 90L, 9L, 178L, 178L,
NA, NA), FID_7 = c(NA, 87L, 10L, 190L, 190L, NA, NA), FID_8 = c(NA,
6L, 14L, 200L, 200L, NA, NA), FID_9 = c(NA, NA, 16L, 500L, 500L,
NA, NA), FID_10 = c(NA, NA, 18L, 520L, 520L, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
Are you looking for something like this?
library(data.table)
library(dplyr)
library(igraph)

setDT(df) %>%
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  graph_from_data_frame() %>%
  plot(edge.label = E(.)$FID)
Data
structure(list(ID = 1:3, FID_1 = c(2L, 5L, 1L), FID_2 = c(6L,
9L, 2L), FID_3 = c(7L, 12L, 4L), FID_4 = c(9L, 45L, 7L), FID_5 = c(10L,
12L, 8L), FID_6 = c(NA, 90L, 9L), FID_7 = c(NA, 87L, 10L), FID_8 = c(NA,
6L, 14L), FID_9 = c(NA, NA, 16L), FID_10 = c(NA, NA, 18L)), class = "data.frame", row.names = c(NA,
-3L))
Given a data frame like below:
Name No Diff Most repeated Diff
A 24
A 35
A 39
A 41
A 42
A 43
B 32
B 35
B 36
B 37
C 34
C 40
C 42
D 34
D 39
D 44
E 35
E 36
How can I calculate the last column as the most frequently repeated difference between rows? (E.g., for each Name I want to calculate the differences between consecutive rows and then see which difference is repeated most often; in this case A would be 1, with two differences equal to 1.)
Thanks in advance.
We can use diff to calculate the differences and table to count their frequencies:
library(dplyr)
df %>%
group_by(Name) %>%
mutate(diff = c(NA, diff(No)),
#Can also use lag to get difference with previous value
#diff = No - lag(No),
most_repeated_diff = names(which.max(table(diff))))
# Name No diff most_repeated_diff
# <fct> <int> <int> <chr>
# 1 A 24 NA 1
# 2 A 35 11 1
# 3 A 39 4 1
# 4 A 41 2 1
# 5 A 42 1 1
# 6 A 43 1 1
# 7 B 32 NA 1
# 8 B 35 3 1
# 9 B 36 1 1
#10 B 37 1 1
#11 C 34 NA 2
#12 C 40 6 2
#13 C 42 2 2
#14 D 34 NA 5
#15 D 39 5 5
#16 D 44 5 5
#17 E 35 NA 1
#18 E 36 1 1
data
df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L), .Label = c("A",
"B", "C", "D", "E"), class = "factor"), No = c(24L, 35L, 39L,
41L, 42L, 43L, 32L, 35L, 36L, 37L, 34L, 40L, 42L, 34L, 39L, 44L,
35L, 36L)), class = "data.frame", row.names = c(NA, -18L))
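A base-R sketch of the same idea (assuming the df shown above), using ave to apply the diff and table/which.max steps within each Name group:
df$diff <- ave(df$No, df$Name, FUN = function(x) c(NA, diff(x)))
df$most_repeated_diff <- ave(df$diff, df$Name, FUN = function(x) {
  tab <- table(x)                    # frequency of each within-group difference (NAs dropped)
  as.numeric(names(which.max(tab)))  # most frequent difference, recycled across the group
})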
I would like to learn how to subtract one row from multiple rows by group, and save the results as a data table/matrix in R. For example, take the following data frame:
data.frame("patient" = c("a","a","a", "b","b","b","c","c","c"), "Time" = c(1,2,3), "Measure 1" = sample(1:100,size = 9,replace = TRUE), "Measure 2" = sample(1:100,size = 9,replace = TRUE), "Measure 3" = sample(1:100,size = 9,replace = TRUE))
patient Time Measure.1 Measure.2 Measure.3
1 a 1 19 5 75
2 a 2 64 20 74
3 a 3 40 4 78
4 b 1 80 91 80
5 b 2 48 31 73
6 b 3 10 5 4
7 c 1 30 67 55
8 c 2 24 13 90
9 c 3 45 31 88
For each patient, I would like to subtract the row where Time == 1 from all rows associated with that patient. The result would be:
patient Time Measure.1 Measure.2 Measure.3
1 a 1 0 0 0
2 a 2 45 15 -1
3 a 3 21 -1 3
4 b 1 0 0 0
5 b 2 -32 -60 -7
6 b 3 -70 -86 -76
7 c 1 0 0 0
....
I have tried the following code using the dplyr package, but to no avail:
raw_patient<- group_by(rawdata,patient, Time)
baseline_patient <-mutate(raw_patient,cpls = raw_patient[,]- raw_patient["Time" == 0,])
As there are multiple columns, we can use mutate_at, specifying the variables in vars and, after grouping by 'patient', subtracting from each column the value in that column where Time is 1:
library(dplyr)
df1 %>%
group_by(patient) %>%
mutate_at(vars(matches("Measure")), funs(.- .[Time==1]))
# A tibble: 9 × 5
# Groups: patient [3]
# patient Time Measure.1 Measure.2 Measure.3
# <chr> <int> <int> <int> <int>
#1 a 1 0 0 0
#2 a 2 45 15 -1
#3 a 3 21 -1 3
#4 b 1 0 0 0
#5 b 2 -32 -60 -7
#6 b 3 -70 -86 -76
#7 c 1 0 0 0
#8 c 2 -6 -54 35
#9 c 3 15 -36 33
data
df1 <- structure(list(patient = c("a", "a", "a", "b", "b", "b", "c",
"c", "c"), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Measure.1 = c(19L,
64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L), Measure.2 = c(5L, 20L,
4L, 91L, 31L, 5L, 67L, 13L, 31L), Measure.3 = c(75L, 74L, 78L,
80L, 73L, 4L, 55L, 90L, 88L)), .Names = c("patient", "Time",
"Measure.1", "Measure.2", "Measure.3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
I have two data frames containing related data. It is related to the NFL. One df has player names and receiving targets by week (player df):
Player Tm Position 1 2 3 4 5 6
1 A.J. Green CIN WR 13 8 11 12 8 10
2 Aaron Burbridge SFO WR 0 1 0 2 0 0
3 Aaron Ripkowski GNB RB 0 0 0 0 0 1
4 Adam Humphries TAM WR 5 8 12 4 2 0
5 Adam Thielen MIN WR 5 5 4 3 8 0
6 Adrian Peterson MIN RB 2 3 0 0 0 0
The other data frame has receiving targets summed by team for each week (team df):
Tm `1` `2` `3` `4` `5` `6`
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARI 37 35 50 45 26 35
2 ATL 38 34 30 37 28 41
3 BAL 32 45 40 51 47 48
4 BUF 22 30 20 33 20 26
5 CAR 31 39 36 47 28 46
6 CHI 28 29 45 36 41 49
7 CIN 30 54 28 31 39 31
8 CLE 26 33 38 38 35 42
9 DAL 43 30 24 32 24 27
10 DEN 26 32 35 31 34 47
# ... with 22 more rows
What I am trying to do is create another data frame containing the target percentage by player, by week. So I need to match the team from the "Tm" column in the player df and the week column header (1-6).
I have figured out how to do this by merging them and then creating new columns, but as I add more data (weeks) I need to write more code:
a <- merge(playertgt, teamtgt, by="Tm") #merges the two
a$Wk1 <- a$`1.x` / a$`1.y`
a$Wk2 <- a$`2.x` / a$`2.y`
a$Wk3 <- a$`3.x` / a$`3.y`
So what I am looking for is a good way to do this that will update automatically as I add new weeks to my source data, and that doesn't force me to create a data frame with a bunch of columns I don't need.
If this is answered somewhere else I apologize, but I have been looking for a good way to do this for a day now, and I can't find it. Thanks in advance for your help!
You can do this with dplyr:
library(dplyr)
## Do a left outer join to match each player with total team targets
a <- left_join(playertgt,teamtgt, by="Tm")
## Compute percentage over all weeks selecting player columns ending with ".x"
## and dividing by corresponding team columns ending with ".y"
tgt.pct <- select(a,ends_with(".x")) / select(a,ends_with(".y"))
## set the column names to week + number
colnames(tgt.pct) <- paste0("week",seq_len(ncol(teamtgt)-1))
## construct the output data frame adding back the player and team columns
tgt.pct <- data.frame(Player=playertgt$Player,Tm=playertgt$Tm,tgt.pct)
Clearly, I am only using dplyr for the convenience of ends_with in selecting the columns after the join. A base-R approach using grepl to do this selection is:
a <- merge(playertgt, teamtgt, by="Tm", all.x=TRUE)
tgt.pct <- subset(a,select=grepl(".x$",colnames(a))) / subset(a,select=grepl(".y$",colnames(a)))
colnames(tgt.pct) <- paste0("week",seq_len(ncol(teamtgt)-1))
tgt.pct <- data.frame(Player=playertgt$Player,Tm=playertgt$Tm,tgt.pct)
Data: with your limited posted data, only AJ Green will have his target percentage computed:
playertgt <- structure(list(Player = structure(1:6, .Label = c("A.J. Green",
"Aaron Burbridge", "Aaron Ripkowski", "Adam Humphries", "Adam Thielen",
"Adrian Peterson"), class = "factor"), Tm = structure(c(1L, 4L,
2L, 5L, 3L, 3L), .Label = c("CIN", "GNB", "MIN", "SFO", "TAM"
), class = "factor"), Position = structure(c(2L, 2L, 1L, 2L,
2L, 1L), .Label = c("RB", "WR"), class = "factor"), X1 = c(13L,
0L, 0L, 5L, 5L, 2L), X2 = c(8L, 1L, 0L, 8L, 5L, 3L), X3 = c(11L,
0L, 0L, 12L, 4L, 0L), X4 = c(12L, 2L, 0L, 4L, 3L, 0L), X5 = c(8L,
0L, 0L, 2L, 8L, 0L), X6 = c(10L, 0L, 1L, 0L, 0L, 0L)), .Names = c("Player",
"Tm", "Position", "X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-6L))
## Player Tm Position X1 X2 X3 X4 X5 X6
##1 A.J. Green CIN WR 13 8 11 12 8 10
##2 Aaron Burbridge SFO WR 0 1 0 2 0 0
##3 Aaron Ripkowski GNB RB 0 0 0 0 0 1
##4 Adam Humphries TAM WR 5 8 12 4 2 0
##5 Adam Thielen MIN WR 5 5 4 3 8 0
##6 Adrian Peterson MIN RB 2 3 0 0 0 0
teamtgt <- structure(list(Tm = structure(1:10, .Label = c("ARI", "ATL",
"BAL", "BUF", "CAR", "CHI", "CIN", "CLE", "DAL", "DEN"), class = "factor"),
X1 = c(37L, 38L, 32L, 22L, 31L, 28L, 30L, 26L, 43L, 26L),
X2 = c(35L, 34L, 45L, 30L, 39L, 29L, 54L, 33L, 30L, 32L),
X3 = c(50L, 30L, 40L, 20L, 36L, 45L, 28L, 38L, 24L, 35L),
X4 = c(45L, 37L, 51L, 33L, 47L, 36L, 31L, 38L, 32L, 31L),
X5 = c(26L, 28L, 47L, 20L, 28L, 41L, 39L, 35L, 24L, 34L),
X6 = c(35L, 41L, 48L, 26L, 46L, 49L, 31L, 42L, 27L, 47L)), .Names = c("Tm",
"X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-10L))
## Tm X1 X2 X3 X4 X5 X6
##1 ARI 37 35 50 45 26 35
##2 ATL 38 34 30 37 28 41
##3 BAL 32 45 40 51 47 48
##4 BUF 22 30 20 33 20 26
##5 CAR 31 39 36 47 28 46
##6 CHI 28 29 45 36 41 49
##7 CIN 30 54 28 31 39 31
##8 CLE 26 33 38 38 35 42
##9 DAL 43 30 24 32 24 27
##10 DEN 26 32 35 31 34 47
The result is:
## Player Tm week1 week2 week3 week4 week5 week6
##1 A.J. Green CIN 0.4333333 0.1481481 0.3928571 0.3870968 0.2051282 0.3225806
##2 Aaron Burbridge SFO NA NA NA NA NA NA
##3 Aaron Ripkowski GNB NA NA NA NA NA NA
##4 Adam Humphries TAM NA NA NA NA NA NA
##5 Adam Thielen MIN NA NA NA NA NA NA
##6 Adrian Peterson MIN NA NA NA NA NA NA
It would be nice if you provided a bit of data next time; that makes life a lot easier.
I think the main point is your data structure: you have to put your data into a long format (the keyword is tidy data, I guess). I made up some data and hope I understood your problem correctly.
library(tidyr)
library(dplyr)
player_df = data.frame(team = c('ARI', 'BAL', 'BAL', 'CLE', 'CLE'),
player =c('A', 'B', 'C', 'D', 'F'),
'1' = floor(runif(5, min=1, max=2)*10),
'2' = floor(runif(5, min=1, max=2)*10))
> player_df
team player X1 X2
1 ARI A 15 10
2 BAL B 16 15
3 BAL C 13 11
4 CLE D 14 19
5 CLE F 12 14
team_df = data.frame(team = c('ARI', 'BAL', 'CLE'),
'1' = floor(runif(3, min=10, max=20)*20),
'2' = floor(runif(3, min=10, max=20)*20))
> team_df
team X1 X2
1 ARI 281 205
2 BAL 362 309
3 CLE 323 238
Now, put both dataframes into a long format:
player_df = gather(player_df, week, player_value, -team, -player)
team_df = gather(team_df, week, team_value, -team)
> player_df
team player week player_value
1 ARI A X1 15
2 BAL B X1 16
3 BAL C X1 13
4 CLE D X1 14
5 CLE F X1 12
6 ARI A X2 10
7 BAL B X2 15
8 BAL C X2 11
9 CLE D X2 19
10 CLE F X2 14
> team_df
team week team_value
1 ARI X1 281
2 BAL X1 362
3 CLE X1 323
4 ARI X2 205
5 BAL X2 309
6 CLE X2 238
Now, join (or merge) them together. inner_join will by default join on common column names.
join_db = inner_join(player_df, team_df)
> join_db
team player week player_value team_value
1 ARI A X1 15 281
2 BAL B X1 16 362
3 BAL C X1 13 362
4 CLE D X1 14 323
5 CLE F X1 12 323
6 ARI A X2 10 205
7 BAL B X2 15 309
8 BAL C X2 11 309
9 CLE D X2 19 238
10 CLE F X2 14 238
I think in that format you can do a lot more.
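For example, a short sketch (assuming the join_db object built above) that computes each player's share of the team's targets per week and spreads it back into one column per week:
join_db %>%
  mutate(target_pct = player_value / team_value) %>%  # player's share of team targets
  select(team, player, week, target_pct) %>%
  spread(week, target_pct)                            # back to wide: one column per week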
HTH
Stefan
I'm trying to get a data frame (just.samples.with.shoulder.values, say) that contains only samples that have non-NA values. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can try complete.cases too, which will return a logical vector that allows you to subset the data by Shoulders:
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
There is a subtle difference between using is.na and complete.cases. Applied to the whole data frame, complete.cases removes every row that contains an NA in any column, whereas the objective here is only to filter on one variable (Shoulders), not to drop rows whose other columns contain NAs, since those could be legitimate data points. Restricted to a single column, as in both answers above, the two approaches give the same result.
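Since the question also asks about subset(), here is an equivalent sketch using the same data:
# keep only rows where Shoulders is not NA
just.samples.with.shoulder.values <- subset(data, !is.na(Shoulders))
print(just.samples.with.shoulder.values)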