I'm trying to get a data frame (just.samples.with.shoulder.values, say) that contains only the samples with non-NA Shoulders values. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
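You can use complete.cases here too; it returns a logical vector that you can use to subset the rows of the data by Shoulders. Note the comma after the logical vector: without it, as in your attempt, the vector is used to select columns instead of rows.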
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
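There is a subtle difference between using is.na and complete.cases.
is.na tests the actual NA values of a single variable, which is the objective here: filter on one variable only, without dropping rows for missing values in other columns, since those rows could be legitimate data points. complete.cases applied to the whole data frame would remove every row that has an NA anywhere.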
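As a minimal sketch of that difference, and of the subset() route the question asks about, the following rebuilds the example data (with data.frame() rather than structure(), purely for brevity) and compares filtering on Shoulders alone with dropping every incomplete row. The object names by_column, by_subset, all_complete and bad_call are made up for this illustration.

```r
# Example data from the question, rebuilt with data.frame() for brevity
data <- data.frame(
  Sample    = 1:14,
  Head      = c(1L, 0L, NA, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L),
  Shoulders = c(13L, 14L, NA, 18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L),
  Knees     = c(1L, 1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L),
  Toes      = c(324L, 5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA)
)

# Filter on one variable only: rows with NA in other columns are kept
by_column <- data[!is.na(data$Shoulders), ]

# The same filter written with subset(); column names can be used directly
by_subset <- subset(data, !is.na(Shoulders))

# complete.cases() on the whole data frame drops every row with any NA
all_complete <- data[complete.cases(data), ]

# The original attempt without the comma indexes columns, not rows;
# a length-14 logical vector does not fit 5 columns, so it fails
bad_call <- tryCatch(
  data[complete.cases(data[, "Shoulders"])],
  error = function(e) e
)

nrow(by_column)     # 12
nrow(all_complete)  # 8
```

subset() is convenient interactively; inside functions, explicit indexing with `[` is usually considered the safer choice because subset() evaluates its condition non-standardly.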
I want to identify networks in which all people in the same network are directly or indirectly connected through friendship nominations, while no students from different networks are connected.
I am using the Add Health data. Each student nominates up to 10 friends.
Say, sample data may look like this:
ID FID_1 FID_2 FID_3 FID_4 FID_5 FID_6 FID_7 FID_8 FID_9 FID_10
1 2 6 7 9 10 NA NA NA NA NA
2 5 9 12 45 13 90 87 6 NA NA
3 1 2 4 7 8 9 10 14 16 18
100 110 120 122 125 169 178 190 200 500 520
500 100 110 122 125 169 178 190 200 500 520
700 800 789 900 NA NA NA NA NA NA NA
1000 789 2000 820 900 NA NA NA NA NA NA
There are around 85,000 individuals. Could anyone please tell me how I can get network ID?
So, I would like the data to look like the following:
ID network_ID ID network_ID
1 1 700 3
2 1 789 3
3 1 800 3
4 1 820 3
5 1 900 3
6 1 1000 3
7 1 2000 3
8 1
9 1
10 1
12 1
13 1
14 1
16 1
18 1
90 1
87 1
100 2
110 2
120 2
122 2
125 2
169 2
178 2
190 2
200 2
500 2
520 2
So, everyone directly or indirectly connected to ID 1 belongs to network 1. 2 is a friend of 1, so everyone directly or indirectly connected to 2 is also in 1's network, and so on. 700 is not connected to 1, to a friend of 1, to a friend of a friend of 1, and so on. Thus 700 is in a different network, which is network 3.
Any help will be much appreciated...
Update
library(igraph)
library(dplyr)
library(data.table)
setDT(df) %>%
melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
na.omit() %>%
setcolorder(c("ID", "ID2", "FID")) %>%
graph_from_data_frame() %>%
components() %>%
membership() %>%
stack() %>%
setNames(c("Network_ID", "ID")) %>%
rev() %>%
type.convert(as.is = TRUE) %>%
arrange(Network_ID, ID)
gives
ID Network_ID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 12 1
12 13 1
13 14 1
14 16 1
15 18 1
16 45 1
17 87 1
18 90 1
19 100 2
20 110 2
21 120 2
22 122 2
23 125 2
24 169 2
25 178 2
26 190 2
27 200 2
28 500 2
29 520 2
30 700 3
31 789 3
32 800 3
33 820 3
34 900 3
35 1000 3
36 2000 3
Data
> dput(df)
structure(list(ID = c(1L, 2L, 3L, 100L, 500L, 700L, 1000L), FID_1 = c(2L,
5L, 1L, 110L, 100L, 800L, 789L), FID_2 = c(6L, 9L, 2L, 120L,
110L, 789L, 2000L), FID_3 = c(7L, 12L, 4L, 122L, 122L, 900L,
820L), FID_4 = c(9L, 45L, 7L, 125L, 125L, NA, 900L), FID_5 = c(10L,
13L, 8L, 169L, 169L, NA, NA), FID_6 = c(NA, 90L, 9L, 178L, 178L,
NA, NA), FID_7 = c(NA, 87L, 10L, 190L, 190L, NA, NA), FID_8 = c(NA,
6L, 14L, 200L, 200L, NA, NA), FID_9 = c(NA, NA, 16L, 500L, 500L,
NA, NA), FID_10 = c(NA, NA, 18L, 520L, 520L, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
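To make the idea behind the pipeline above more concrete, here is a rough base-R sketch of the same two steps: turn the wide nomination table into an edge list, then label connected components. It does not use igraph; instead it swaps in a small hand-written union-find, so it is an illustration of the technique rather than the method used in the answer. The names edges, parent, find_root and result are invented for this example, and the loop is not tuned for 85,000 individuals.

```r
# Sample data from the answer above, rebuilt with data.frame()
df <- data.frame(
  ID     = c(1L, 2L, 3L, 100L, 500L, 700L, 1000L),
  FID_1  = c(2L, 5L, 1L, 110L, 100L, 800L, 789L),
  FID_2  = c(6L, 9L, 2L, 120L, 110L, 789L, 2000L),
  FID_3  = c(7L, 12L, 4L, 122L, 122L, 900L, 820L),
  FID_4  = c(9L, 45L, 7L, 125L, 125L, NA, 900L),
  FID_5  = c(10L, 13L, 8L, 169L, 169L, NA, NA),
  FID_6  = c(NA, 90L, 9L, 178L, 178L, NA, NA),
  FID_7  = c(NA, 87L, 10L, 190L, 190L, NA, NA),
  FID_8  = c(NA, 6L, 14L, 200L, 200L, NA, NA),
  FID_9  = c(NA, NA, 16L, 500L, 500L, NA, NA),
  FID_10 = c(NA, NA, 18L, 520L, 520L, NA, NA)
)

# Step 1: edge list, one row per (ID, nominated friend) pair, NAs dropped
fid_cols <- grep("^FID_", names(df))
edges <- data.frame(
  from = rep(df$ID, times = length(fid_cols)),
  to   = unlist(df[fid_cols], use.names = FALSE)
)
edges <- edges[!is.na(edges$to), ]

# Step 2: union-find over all IDs that appear anywhere
ids <- sort(unique(c(edges$from, edges$to)))
parent <- seq_along(ids)            # every ID starts as its own root

find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

for (k in seq_len(nrow(edges))) {
  a <- find_root(match(edges$from[k], ids))
  b <- find_root(match(edges$to[k], ids))
  if (a != b) parent[max(a, b)] <- min(a, b)   # merge the two groups
}

# Number the networks in order of their smallest ID
roots <- vapply(seq_along(ids), find_root, numeric(1))
result <- data.frame(ID = ids, Network_ID = match(roots, unique(roots)))
```

For the real data, the igraph route shown above is likely the better choice, since components() is implemented in compiled code.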
Are you looking for something like this?
library(data.table)
library(dplyr)
library(igraph)
setDT(df) %>%
melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
na.omit() %>%
setcolorder(c("ID", "ID2", "FID")) %>%
graph_from_data_frame() %>%
plot(edge.label = E(.)$FID)
Data
structure(list(ID = 1:3, FID_1 = c(2L, 5L, 1L), FID_2 = c(6L,
9L, 2L), FID_3 = c(7L, 12L, 4L), FID_4 = c(9L, 45L, 7L), FID_5 = c(10L,
12L, 8L), FID_6 = c(NA, 90L, 9L), FID_7 = c(NA, 87L, 10L), FID_8 = c(NA,
6L, 14L), FID_9 = c(NA, NA, 16L), FID_10 = c(NA, NA, 18L)), class = "data.frame", row.names = c(NA,
-3L))
Given a data frame like below:
Name No Diff Most repeated Diff
A 24
A 35
A 39
A 41
A 42
A 43
B 32
B 35
B 36
B 37
C 34
C 40
C 42
D 34
D 39
D 44
E 35
E 36
How can I calculate the last column as the most frequently repeated difference between consecutive rows? (E.g., for each Name I want to calculate the differences between consecutive rows and then see which difference is repeated most often; in this case A would be 1, since two of its differences equal 1.)
Thanks in advance.
We can use diff to calculate difference and table to count their frequency
library(dplyr)
df %>%
group_by(Name) %>%
mutate(diff = c(NA, diff(No)),
#Can also use lag to get difference with previous value
#diff = No - lag(No),
most_repeated_diff = names(which.max(table(diff))))
# Name No diff most_repeated_diff
# <fct> <int> <int> <chr>
# 1 A 24 NA 1
# 2 A 35 11 1
# 3 A 39 4 1
# 4 A 41 2 1
# 5 A 42 1 1
# 6 A 43 1 1
# 7 B 32 NA 1
# 8 B 35 3 1
# 9 B 36 1 1
#10 B 37 1 1
#11 C 34 NA 2
#12 C 40 6 2
#13 C 42 2 2
#14 D 34 NA 5
#15 D 39 5 5
#16 D 44 5 5
#17 E 35 NA 1
#18 E 36 1 1
data
df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L), .Label = c("A",
"B", "C", "D", "E"), class = "factor"), No = c(24L, 35L, 39L,
41L, 42L, 43L, 32L, 35L, 36L, 37L, 34L, 40L, 42L, 34L, 39L, 44L,
35L, 36L)), class = "data.frame", row.names = c(NA, -18L))
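For comparison, the same diff-then-table idea can be sketched in base R without dplyr. The helper name most_repeated is made up for this illustration. Note one assumption that also applies to the dplyr version: when several differences are tied for the highest count, which.max() returns the first one, and because table() sorts its values, that is the smallest tied difference.

```r
# Example data from the question, rebuilt with data.frame()
df <- data.frame(
  Name = rep(c("A", "B", "C", "D", "E"), times = c(6, 4, 3, 3, 2)),
  No   = c(24, 35, 39, 41, 42, 43, 32, 35, 36, 37, 34, 40, 42, 34, 39, 44, 35, 36)
)

# Most frequent difference between consecutive values of x
# (ties are resolved in favour of the smallest difference)
most_repeated <- function(x) {
  d <- diff(x)                          # consecutive differences
  counts <- table(d)                    # frequency of each difference
  as.numeric(names(which.max(counts)))  # value with the highest count
}

# One value per group
per_group <- tapply(df$No, df$Name, most_repeated)

# Or repeated on every row of the group, as in the desired output
df$most_repeated_diff <- ave(df$No, df$Name, FUN = most_repeated)
```

Here per_group is a named vector with one entry per Name, while ave() fills the new column row by row.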
I have different groups (2 in my example, 85 in my real data) and would like to produce a table of age classes (0-10, 11-20, 21-30, 31-40, etc.) for each group:
group age
1 1 34
2 1 37
3 1 22
4 1 10
5 1 11
6 1 12
7 1 14
8 2 56
9 2 46
10 2 25
11 2 24
12 2 13
13 2 13
14 2 45
15 2 45
16 2 23
17 2 56
18 2 54
19 2 31
20 2 68
I have tried various solutions from the forum:
mydf$ageclass<-cut(mydf$age, seq(0,100,10))
This only works for the entire data frame and has no possibility of handling groups.
mydf$ageclass<-Freq(mydf$age, breaks=c(0,20,30,40,50,60,70,80))
This also only returns a solution for the entire data frame.
I have no way of integrating the "group" into these functions.
Also, both return a column with the age class given as '(30,40]' (meaning upper and lower class bound) and I would like the result to be a table like this:
group 0-10 11-20 21-30 31-40
1
2
What am I missing? perhaps a for loop? I am new to base R and really would enjoy some pointers as to how to think about the problem.
Is this what you are trying to achieve?
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10)))
with(mydf, table(group, ageclass))
ageclass
group (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Edit
cut() also has a labels argument:
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10), labels = paste0(seq(0,90,10) + 1, "-", seq(0,90,10) + 10)))
with(mydf, table(group, ageclass))
ageclass
group 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Data
mydf <- structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), age = c(37L, 22L, 10L,
11L, 12L, 14L, 56L, 46L, 25L, 24L, 13L, 13L, 45L, 45L, 23L, 56L,
54L, 31L, 68L)), row.names = c(NA, -19L), class = "data.frame")
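The question asks for a first class of 0-10, whereas cut() with its defaults produces (0,10], which excludes an age of exactly 0. A possible variation, sketched here under the assumption that ages are whole numbers, is to pass include.lowest = TRUE so that the first interval becomes [0,10] and to label it accordingly. The names breaks, labels and tab are only for this illustration.

```r
# Example data from the answer above
mydf <- data.frame(
  group = c(rep(1L, 6), rep(2L, 13)),
  age   = c(37L, 22L, 10L, 11L, 12L, 14L, 56L, 46L, 25L, 24L, 13L, 13L,
            45L, 45L, 23L, 56L, 54L, 31L, 68L)
)

breaks <- seq(0, 100, 10)

# Labels "0-10", "11-20", ..., "91-100" (assumes integer ages)
labels <- paste0(c(0, breaks[2:10] + 1), "-", breaks[-1])

# include.lowest = TRUE closes the first interval on the left: [0,10]
mydf$ageclass <- cut(mydf$age, breaks = breaks, labels = labels,
                     include.lowest = TRUE)

# One row per group, one column per age class
tab <- table(mydf$group, mydf$ageclass)
tab
```

Because table() takes the grouping variable as just another argument, no loop over the 85 groups is needed.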
I have two datasets in R, and I am trying to add values of the first dataset into a single column of the second dataset. The two datasets have matching variables, based on these variables the new column should be constructed.
The first dataset looks like this:
Experiment Subject R1 R2 R3 R4
1 1 28 29 59 55
1 3 27 24 50 50
1 5 30 30 61 50
1 7 26 30 60 60
1 10 30 30 65 65
2 2 34 34 61 61
2 4 25 25 49 48
2 8 26 26 55 48
2 9 20 20 60 60
The second dataset looks like this:
Subject Experiment R NewColumn
1 1 3
1 1 3
1 1 3
1 1 3
1 1 3
1 1 4
1 1 4
1 1 4
1 1 4
1 1 4
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 2 4
2 2 4
2 2 4
2 2 4
2 2 4
2 2 3
2 2 3
2 2 3
2 2 3
2 2 3
So, basically I am trying to create a script or use a function that copies the values of R1-R4 of the first dataset into the 'NewColumn' of the second dataset, given that Experiment, Subject, and R (1-4) match.
I have tried to create a solution using loops and if statements, but unfortunately without success.
Edit:
I think I should add that the second dataset contains (many) more variables (columns, which I left out for this example), is quite long (about 2000 rows) and is not ordered (Experiment, Subject and 'R' don't follow a logical order).
So my thought is, that the script should 'read' the variables 'Experiment' 'Subject' and 'R' from the second dataset, and paste the corresponding value from the first dataset (e.g. Experiment 1, Subject 1, R3) into the 'NewColumn' column. Many thanks for all of your input so far!
Any advice on how to solve this is very much appreciated.
We could use gather from tidyr to reshape the first dataset ('df1') from 'wide' to 'long' format. We create key/val columns ('Var', 'NewCol') from the R1:R4 columns. Then we split the 'Var' column into two new columns ('V1', 'R') using extract, left_join with 'df2' by specifying the common columns, and select the columns that are needed in the output.
library(dplyr)
library(tidyr)
gather(df1, Var, NewCol, R1:R4) %>%
extract(Var, into=c('V1', 'R'), '(.)(.)', convert=TRUE) %>%
left_join(df2, ., by=c('Subject', 'Experiment', 'R')) %>%
select(-V1)
# Subject Experiment R NewCol
#1 1 1 3 59
#2 1 1 3 59
#3 1 1 3 59
#4 1 1 3 59
#5 1 1 3 59
#6 1 1 4 55
#7 1 1 4 55
#8 1 1 4 55
#9 1 1 4 55
#10 1 1 4 55
#11 1 1 1 28
#12 1 1 1 28
#13 1 1 1 28
#14 1 1 1 28
#15 1 1 1 28
#16 1 1 2 29
#17 1 1 2 29
#18 1 1 2 29
#19 1 1 2 29
#20 1 1 2 29
#21 2 2 4 61
#22 2 2 4 61
#23 2 2 4 61
#24 2 2 4 61
#25 2 2 4 61
#26 2 2 3 61
#27 2 2 3 61
#28 2 2 3 61
#29 2 2 3 61
#30 2 2 3 61
data
df1 <- structure(list(Experiment = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), Subject = c(1L, 3L, 5L, 7L, 10L, 2L, 4L, 8L, 9L), R1 = c(28L,
27L, 30L, 26L, 30L, 34L, 25L, 26L, 20L), R2 = c(29L, 24L, 30L,
30L, 30L, 34L, 25L, 26L, 20L), R3 = c(59L, 50L, 61L, 60L, 65L,
61L, 49L, 55L, 60L), R4 = c(55L, 50L, 50L, 60L, 65L, 61L, 48L,
48L, 60L)), .Names = c("Experiment", "Subject", "R1", "R2", "R3",
"R4"), class = "data.frame", row.names = c(NA, -9L))
df2 <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Experiment = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), R = c(3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L)), .Names = c("Subject", "Experiment",
"R"), class = "data.frame", row.names = c(NA, -30L))
Maybe like this?
library(reshape)
df<- data.frame(Experiment=c(1,1),Subject=c(1,3),R1=c(28,27),R2=c(29,24),R3=c(59,50),R4=c(55,50))
> df
Experiment Subject R1 R2 R3 R4
1 1 1 28 29 59 55
2 1 3 27 24 50 50
dfc <- melt(df,id=c("Experiment","Subject"))
dfc # the reshaped (long) data
Experiment Subject variable value
1 1 1 R1 28
2 1 3 R1 27
3 1 1 R2 29
4 1 3 R2 24
5 1 1 R3 59
6 1 3 R3 50
7 1 1 R4 55
8 1 3 R4 50
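The lookup described in the question (find the row of the first dataset with the same Experiment and Subject, then take the column R1-R4 named by R) can also be sketched in base R without reshaping, using match() for the row and matrix indexing for the column. This is only an illustrative alternative to the answers above; it uses a small excerpt of the question's data, and the names row_idx and r_mat are invented here.

```r
# Excerpt of the first dataset (wide: one column per R)
df1 <- data.frame(
  Experiment = c(1, 1, 2),
  Subject    = c(1, 3, 2),
  R1 = c(28, 27, 34),
  R2 = c(29, 24, 34),
  R3 = c(59, 50, 61),
  R4 = c(55, 50, 61)
)

# Excerpt of the second dataset (long, unordered)
df2 <- data.frame(
  Subject    = c(1, 1, 2, 3),
  Experiment = c(1, 1, 2, 1),
  R          = c(3, 1, 4, 2)
)

# Row of df1 that corresponds to each row of df2
row_idx <- match(
  paste(df2$Experiment, df2$Subject),
  paste(df1$Experiment, df1$Subject)
)

# Pick the (row, column) cell for every row of df2 in one step
r_mat <- as.matrix(df1[paste0("R", 1:4)])
df2$NewColumn <- r_mat[cbind(row_idx, df2$R)]

df2
```

Rows of the second dataset with no counterpart in the first get NA, and the original row order and any extra columns of the second dataset are left untouched.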
I don't know if I will be able to explain it correctly, but what I want to achieve is really simple.
This is the first data.frame. The important value for me is in the first column, "V1".
> dput(Data1)
structure(list(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L
), V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NA", class = "factor"),
V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")
Second data.frame:
> dput(Data2)
structure(list(Names = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L,
8L), Herat = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L
), Grobpel = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "NA", class = "factor"), Hassynch = c(19L, 12L,
15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)), .Names = c("Names",
"Herat", "Grobpel", "Hassynch"), row.names = c(NA, -10L), class = "data.frame"
)
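The values in the first column (V1) of the first data.frame can be found in the first column (Names) of the second data.frame. For each match, I would like to copy the value from the fourth column (Hassynch) of the second data.frame and put it in the second column of the first data.frame.
What is the fastest way to do this?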
library(dplyr)
left_join(Data1, Data2, by=c("V1"="Names"))
# V1 V2 V3 Herat Grobpel Hassynch
# 1 10 NA 18 29 NA 12
# 2 5 NA 17 28 NA 14
# 3 3 NA 13 27 NA 16
# 4 9 NA 20 30 NA 19
# 5 1 NA 15 23 NA 18
# 6 2 NA 12 24 NA 11
# 7 6 NA 16 21 NA 15
# 8 4 NA 11 25 NA 20
# 9 8 NA 14 26 NA 17
# 10 7 NA 19 22 NA 13
# if you don't want V2 and V3, you could
left_join(Data1, Data2, by=c("V1"="Names")) %>%
select(-V2, -V3)
# V1 Herat Grobpel Hassynch
# 1 10 29 NA 12
# 2 5 28 NA 14
# 3 3 27 NA 16
# 4 9 30 NA 19
# 5 1 23 NA 18
# 6 2 24 NA 11
# 7 6 21 NA 15
# 8 4 25 NA 20
# 9 8 26 NA 17
# 10 7 22 NA 13
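If the goal is literally to overwrite the second column of the first data.frame with the matching Hassynch value, a base-R sketch with match() may be enough. It assumes each value of V1 appears at most once in Names; the data below are copied from the question, with the all-NA columns written as plain NA.

```r
Data1 <- data.frame(
  V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L),
  V2 = NA,
  V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)
)

Data2 <- data.frame(
  Names    = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L, 8L),
  Herat    = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L),
  Grobpel  = NA,
  Hassynch = c(19L, 12L, 15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)
)

# For each V1, find its position in Names and take Hassynch from that row
Data1$V2 <- Data2$Hassynch[match(Data1$V1, Data2$Names)]

Data1
```

Unlike a join, this keeps the shape and row order of the first data.frame and adds no extra columns.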
Here's a toy example that I made some time ago to illustrate merge. left_join from dplyr is also good, and data.table almost certainly has another option.
You can subset your reference dataframe so that it contains only the key variable and value variable so that you don't end up with an unmanageable dataframe.
id<-as.numeric((1:5))
m<-c("a","a","a","","")
n<-c("","","b","b","b")
dfm<-data.frame(cbind(id,m))
head(dfm)
id m
1 1 a
2 2 a
3 3 a
4 4
5 5
dfn<-data.frame(cbind(id,n))
head(dfn)
id n
1 1
2 2
3 3 b
4 4 b
5 5 b
dfm$id<-as.numeric(dfm$id)
dfn$id<-as.numeric(dfn$id)
dfm<-subset(dfm,id<4)
head(dfm)
id m
1 1 a
2 2 a
3 3 a
dfn<-subset(dfn,id!=1 & id!=2)
head(dfn)
id n
3 3 b
4 4 b
5 5 b
df.all<-merge(dfm,dfn,by="id",all=TRUE)
head(df.all)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
4 4 <NA> b
5 5 <NA> b
df.all.m<-merge(dfm,dfn,by="id",all.x=TRUE)
head(df.all.m)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
df.all.n<-merge(dfm,dfn,by="id",all.y=TRUE)
head(df.all.n)
id m n
1 3 a b
2 4 <NA> b
3 5 <NA> b
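The earlier advice about trimming the reference data frame to just the key and value columns before merging can be sketched as follows. The data frames main and ref and their columns are hypothetical, made up only to show the pattern.

```r
# Hypothetical main table and a wider reference table
main <- data.frame(id = c(3, 1, 2), x = c("c", "a", "b"))
ref  <- data.frame(
  id     = c(1, 2, 4),
  value  = c(10, 20, 40),
  extra1 = c("p", "q", "r"),   # columns we do not want to carry over
  extra2 = c(TRUE, FALSE, TRUE)
)

# Keep only the key and the one value column before merging
ref_small <- ref[c("id", "value")]

# Left join: every row of main is kept; ids missing from ref get NA
out <- merge(main, ref_small, by = "id", all.x = TRUE)

out
```

Note that merge() sorts the result by the key by default, so the row order of main is not preserved; sort = FALSE can be passed if that matters, although the resulting order is then unspecified.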