Adding data to a data.frame to complete a sequence [duplicate]

With the data included below, the first bit of which looks like
head(dat, 9)
IndID BinID Freq
1 BHS_034_A 7 20
2 BHS_034_A 8 27
3 BHS_034_A 9 67
4 BHS_034_A 10 212
5 BHS_037_A 5 1
6 BHS_037_A 7 12
7 BHS_037_A 8 65
8 BHS_037_A 9 122
9 BHS_037_A 10 301
I want to fill in missing numbers of BinID so that all individuals (IndID) have a BinID sequence from 1 to 10. Freq values should be 0 when new values of BinID are added.
I hope to accommodate many individuals, but have only included a few here.
This question is similar to another post, but here I am also trying to add 0 in the Freq column for the filled-in rows.
The data:
dat <- structure(list(IndID = c("BHS_034_A", "BHS_034_A", "BHS_034_A",
"BHS_034_A", "BHS_037_A", "BHS_037_A", "BHS_037_A", "BHS_037_A",
"BHS_037_A", "BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_068_A",
"BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_070_A", "BHS_070_A",
"BHS_070_A", "BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A",
"BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A"
), BinID = c(7L, 8L, 9L, 10L, 5L, 7L, 8L, 9L, 10L, 3L, 4L, 5L,
7L, 8L, 9L, 10L, 8L, 9L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L), Freq = c(20L, 27L, 67L, 212L, 1L, 12L, 65L, 122L, 301L,
2L, 1L, 1L, 4L, 14L, 104L, 454L, 7L, 90L, 470L, 6L, 11L, 11L,
7L, 18L, 19L, 15L, 31L, 344L)), .Names = c("IndID", "BinID",
"Freq"), row.names = c(NA, 28L), class = "data.frame")

tidyr provides the complete function, which finds the missing combinations in your dataset; its fill argument supplies the value (here Freq = 0) for the newly created rows:
tidyr::complete(dat, IndID, BinID = 1:10, fill = list(Freq = 0))

Using:
library(data.table)
setDT(dat)[CJ(BinID = 1:10, IndID = IndID, unique = TRUE), on = .(IndID, BinID)]
Or:
library(dplyr)
library(tidyr)
dat %>%
  group_by(IndID) %>%
  expand(BinID = 1:10) %>%
  left_join(., dat)
gives:
IndID BinID Freq
1: BHS_034_A 1 NA
2: BHS_037_A 1 NA
3: BHS_068_A 1 NA
4: BHS_070_A 1 NA
5: BHS_071_A 1 NA
6: BHS_034_A 2 NA
7: BHS_037_A 2 NA
8: BHS_068_A 2 NA
9: BHS_070_A 2 NA
10: BHS_071_A 2 6
11: BHS_034_A 3 NA
12: BHS_037_A 3 NA
13: BHS_068_A 3 2
14: BHS_070_A 3 NA
15: BHS_071_A 3 11
16: BHS_034_A 4 NA
17: BHS_037_A 4 NA
18: BHS_068_A 4 1
19: BHS_070_A 4 NA
20: BHS_071_A 4 11
21: BHS_034_A 5 NA
22: BHS_037_A 5 1
23: BHS_068_A 5 1
24: BHS_070_A 5 NA
25: BHS_071_A 5 7
26: BHS_034_A 6 NA
27: BHS_037_A 6 NA
28: BHS_068_A 6 NA
29: BHS_070_A 6 NA
30: BHS_071_A 6 18
31: BHS_034_A 7 20
32: BHS_037_A 7 12
33: BHS_068_A 7 4
34: BHS_070_A 7 NA
35: BHS_071_A 7 19
36: BHS_034_A 8 27
37: BHS_037_A 8 65
38: BHS_068_A 8 14
39: BHS_070_A 8 7
40: BHS_071_A 8 15
41: BHS_034_A 9 67
42: BHS_037_A 9 122
43: BHS_068_A 9 104
44: BHS_070_A 9 90
45: BHS_071_A 9 31
46: BHS_034_A 10 212
47: BHS_037_A 10 301
48: BHS_068_A 10 454
49: BHS_070_A 10 470
50: BHS_071_A 10 344
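The data.table and dplyr results above fill the new rows with NA rather than the requested 0, so one extra step converts them; a minimal sketch, assuming the data.table result is stored in res:
library(data.table)
res <- setDT(dat)[CJ(BinID = 1:10, IndID = IndID, unique = TRUE), on = .(IndID, BinID)]
res[is.na(Freq), Freq := 0]  # the NAs introduced by the join become 0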


Loop through datatable & alter values meeting a specific condition

I'm attempting to create a function that takes a data.table and a cutoff variable as arguments. The data.table will always have 4 columns (the 1st is a date column, the other 3 are numeric), but the number of rows will differ. The cutoff is an integer. The goal of the function is to output the data.table with, in each numeric column, every value greater than the first value exceeding the cutoff replaced by NA. Here is a snippet of the data.table being tested:
#datatable dt
> dput(dt[30:40])
structure(list(a = structure(c(18517, 18524, 18531, 18538, 18545,
18552, 18559, 18566, 18573, 18580, 18587), class = "Date"), b = c(14L,
16L, 18L, 21L, 23L, 26L, 29L, 32L, 35L, 39L, 42L), c = c(9L,
10L, 12L, 14L, 16L, 18L, 21L, 23L, 26L, 29L, 32L), d = c(4L,
5L, 6L, 8L, 9L, 11L, 13L, 16L, 18L, 20L, 23L)), row.names = c(NA,
-11L), class = c("data.table", "data.frame"))
> dt[30:40]
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 26 18 11
7: 2020-10-24 29 21 13
8: 2020-10-31 32 23 16
9: 2020-11-07 35 26 18
10: 2020-11-14 39 29 20
11: 2020-11-21 42 32 23
Here is the function I've come up with:
cutoff <- 21  # some integer
checkDT <- function(dt, cutoff){
  columns <- c('b','c','d')
  for (i in columns){
    for (j in dt[, ..columns]){
      if (is.infinite(min(j[which(j > cutoff)]))){
        dt <- dt
      } else {
        dt[i > min(j[which(j > cutoff)]), `:=` (i = NA)]
      }
    }
    return(dt)
  }
}
This outputs a data.table with a fifth column i that is all NA. If I use the following statement for a specific column, then the output is as expected, but I'm trying to have the function do this for every column to get rid of some repeated lines of code.
if (is.infinite(min(dt$b[which(dt$b > cutoff)]))){
  dt <- dt
} else {
  dt[b > min(dt$b[which(dt$b > cutoff)]), `:=`(b = NA)]
}
> dt[30:40]
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 NA 18 11
7: 2020-10-24 NA 21 13
8: 2020-10-31 NA 23 16
9: 2020-11-07 NA 26 18
10: 2020-11-14 NA 29 20
11: 2020-11-21 NA 32 23
This is the expected output with a cutoff value of 21:
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 NA 18 11
7: 2020-10-24 NA 21 13
8: 2020-10-31 NA 23 16
9: 2020-11-07 NA NA 18
10: 2020-11-14 NA NA 20
11: 2020-11-21 NA NA 23
Here's another way using lapply and .SDcols.
checkDT <- function(dt1, cutoff) {
  columns <- c('b','c','d')
  dt1[, (columns) := lapply(.SD, function(x)
    replace(x, x > x[x > cutoff][1], NA)), .SDcols = columns][]
}
checkDT(dt, 21)
# a b c d
# 1: 2020-09-12 14 9 4
# 2: 2020-09-19 16 10 5
# 3: 2020-09-26 18 12 6
# 4: 2020-10-03 21 14 8
# 5: 2020-10-10 23 16 9
# 6: 2020-10-17 NA 18 11
# 7: 2020-10-24 NA 21 13
# 8: 2020-10-31 NA 23 16
# 9: 2020-11-07 NA NA 18
#10: 2020-11-14 NA NA 20
#11: 2020-11-21 NA NA 23
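The core of this is replace() applied to each column: x[x > cutoff][1] picks the first value exceeding the cutoff, and every value larger than that becomes NA. A tiny standalone demo of that one-liner, using the b column's values as a plain vector:
x <- c(14, 16, 18, 21, 23, 26, 29)
cutoff <- 21
replace(x, x > x[x > cutoff][1], NA)  # first value above 21 is 23; everything above 23 turns NA
# [1] 14 16 18 21 23 NA NA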
I simplified a lot of your notation here:
In data.table you don't have to use dt$ again inside the brackets.
The which() isn't necessary because the logical vector can be used directly to indicate which rows to modify.
The key is using the get() function to translate the text to a column name.
I just used suppressWarnings() to silence the warning from taking min() of an empty vector (which returns Inf); the code doesn't replace anything in that case, and that's what you want.
checkDT <- function(dt, cutoff) {
  columns <- c('b', 'c', 'd')
  for (i in columns) {
    suppressWarnings(dt[get(i) > min(dt[get(i) > cutoff, get(i)]), (i) := NA])
  }
  dt[]
}
checkDT(dt, cutoff) gives the desired result

Create group label for chunks of rows using data.table

I have something like the following dataset:
myDT <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22), row.names = c(NA,
-21L), class = c("data.table", "data.frame"))
and I would like to create a new column L2 that creates an index for every 2 rows within domain. However, if there is a remainder, as in the case of domain=2 and id=8,9,10, those ids should be indexed together as long as they are within the same domain. Please note that the specific id values in the toy dataset are made up and not always consecutive as shown. The output would be:
structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22, L2=c(1L,1L,2L,2L,3L,3L,4L,4L,4L,
5L,5L,6L,6L,7L,7L,8L,8L,9L,9L,10L,10L)),
row.names = c(NA, -21L), class = c("data.table", "data.frame"))
Is there an efficient way to do this in data.table?
I've tried playing with .N/rowid and the integer division operator %/% (since every n rows should give the same value) inside the subset call but it got me nowhere. For example, I tried something like:
myDT[, L2 := rowid(domain)%/%2]
but clearly this doesn't address the requirements that the last 3 rows within domain=2 have the same index and that the index should continue incrementing for domain=3.
EDIT Please see revised desired output data table and corresponding description.
EDIT 2
Here is an appended version of myDT:
myDT2 <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40,
L1 = 2:40), row.names = c(NA, -39L), class = c("data.table",
"data.frame"))
When I ran #chinsoon12's code on the above, I get:
structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40,
L1 = 2:40, L2 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L,
5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 11L,
11L, 12L, 12L, 13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L,
17L, 18L, 18L)), row.names = c(NA, -39L), class = c("data.table",
"data.frame"))
There appears to be 4 values of L2=11, when two of them should be 12 because they are in a different domain.
An idea is to make a custom function that will create sequential vectors based on the length of each group and the remainder of that length when divided by two. The function is:
f1 <- function(x) {
  v1 <- length(x)
  i1 <- rep(seq(floor(v1 / 2)), each = 2)
  i2 <- c(i1, rep(max(i1), v1 %% 2))
  i2 + seq_along(i2)
}
I tried to apply it via data.table but was getting an error about a bug, so here it is with base R:
cumsum(c(TRUE, diff(with(myDT2, ave(id, domain, FUN = f1))) != 1))
#[1] 1 1 2 2 3 3 4 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19
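To attach those labels as the L2 column, a minimal sketch assuming myDT2 and f1 as above:
myDT2$L2 <- cumsum(c(TRUE, diff(with(myDT2, ave(id, domain, FUN = f1))) != 1))  # same expression, assigned to a new column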
Here is another approach, updated for the edited question (inspired by @Sotos's use of cumsum):
For each domain, create a repeated sequence 1, 0, 1, 0, 1, ..., forcing the final element to zero so that an odd-length remainder is grouped with the preceding pair.
Take the cumsum over the created sequences across all domains.
library(data.table)
setDT(myDT2)
myDT2[, L2 := c(head(rep_len(c(1, 0), .N), -1), 0), by = domain][, L2 := cumsum(L2)][]
#> domain id L1 L2
#> 1: 2 2 2 1
#> 2: 2 3 3 1
#> 3: 2 4 4 2
#> 4: 2 5 5 2
#> 5: 2 6 6 3
#> 6: 2 7 7 3
#> 7: 2 8 8 4
#> 8: 2 9 9 4
#> 9: 2 10 10 4
#> 10: 3 11 11 5
#> 11: 3 12 12 5
#> 12: 3 13 13 6
#> 13: 3 14 14 6
#> 14: 3 15 15 7
#> 15: 3 16 16 7
#> 16: 3 17 17 8
#> 17: 3 18 18 8
#> 18: 3 19 19 9
#> 19: 3 20 20 9
#> 20: 3 21 21 10
#> 21: 3 22 22 10
#> 22: 4 23 23 11
#> 23: 4 24 24 11
#> 24: 5 25 25 12
#> 25: 5 26 26 12
#> 26: 5 27 27 13
#> 27: 5 28 28 13
#> 28: 5 29 29 14
#> 29: 5 30 30 14
#> 30: 5 31 31 15
#> 31: 5 32 32 15
#> 32: 5 33 33 16
#> 33: 5 34 34 16
#> 34: 5 35 35 17
#> 35: 5 36 36 17
#> 36: 5 37 37 18
#> 37: 5 38 38 18
#> 38: 5 39 39 19
#> 39: 5 40 40 19
#> domain id L1 L2
Here is another option for variable number of repeats other than 2:
n <- 4
setDT(myDT)[, L2 :=
  myDT[, {
    x <- ceiling(seq_along(id)/n)
    if (sum(x == x[.N]) < n) x[x == x[.N]] <- floor(.N/n)
    x
  }, domain][, rleid(domain, V1)]
]
Or an approach that carries a running offset s across groups:
n <- 4
s <- 0
setDT(myDT)[, L2 :=
  myDT[, {
    x <- s + ceiling(seq_along(id)/n)
    if (sum(x == x[.N]) < n) x[x == x[.N]] <- s + floor(.N/n)
    s <- if (s < max(x)) max(x) else s + 1
    x
  }, domain]$V1
]
output for n=2:
domain id L1 L2
1: 2 2 2 1
2: 2 3 3 1
3: 2 4 4 2
4: 2 5 5 2
5: 2 6 6 3
6: 2 7 7 3
7: 2 8 8 4
8: 2 9 9 4
9: 2 10 10 4
10: 3 11 11 5
11: 3 12 12 5
12: 3 13 13 6
13: 3 14 14 6
14: 3 15 15 7
15: 3 16 16 7
16: 3 17 17 8
17: 3 18 18 8
18: 3 19 19 9
19: 3 20 20 9
20: 3 21 21 10
21: 3 22 22 10
22: 4 23 23 11
23: 4 24 24 11
24: 5 25 25 12
25: 5 26 26 12
26: 5 27 27 13
27: 5 28 28 13
28: 5 29 29 14
29: 5 30 30 14
30: 5 31 31 15
31: 5 32 32 15
32: 5 33 33 16
33: 5 34 34 16
34: 5 35 35 17
35: 5 36 36 17
36: 5 37 37 18
37: 5 38 38 18
38: 5 39 39 19
39: 5 40 40 19
domain id L1 L2
output for n=4:
domain id L1 L2
1: 2 2 2 1
2: 2 3 3 1
3: 2 4 4 1
4: 2 5 5 1
5: 2 6 6 2
6: 2 7 7 2
7: 2 8 8 2
8: 2 9 9 2
9: 2 10 10 2
10: 3 11 11 3
11: 3 12 12 3
12: 3 13 13 3
13: 3 14 14 3
14: 3 15 15 4
15: 3 16 16 4
16: 3 17 17 4
17: 3 18 18 4
18: 3 19 19 5
19: 3 20 20 5
20: 3 21 21 5
21: 3 22 22 5
22: 4 23 23 6
23: 4 24 24 6
24: 5 25 25 7
25: 5 26 26 7
26: 5 27 27 7
27: 5 28 28 7
28: 5 29 29 8
29: 5 30 30 8
30: 5 31 31 8
31: 5 32 32 8
32: 5 33 33 9
33: 5 34 34 9
34: 5 35 35 9
35: 5 36 36 9
36: 5 37 37 10
37: 5 38 38 10
38: 5 39 39 10
39: 5 40 40 10
domain id L1 L2

Create balanced data set

I am using R and have a long data set as the one outlined below:
Date ID Status
2014-10-01 12 1
2015-04-01 12 1
2015-07-01 12 1
2015-09-01 12 1
2015-11-01 12 0
2016-01-01 12 0
2016-05-01 12 0
2016-08-01 12 1
2017-03-01 12 1
2017-05-01 12 1
2014-10-01 13 1
2015-04-01 13 1
2015-07-01 13 0
2015-11-01 14 0
2016-01-01 14 0
...
My goal is to create a "balanced" data i.e. each ID should occur for each of the 10 dates. The variable "Status" for the initially non-occurring observations should be labeled as N/A. In other words, the outcome should look like this:
Date ID Status
2014-10-01 12 1
2015-04-01 12 1
2015-07-01 12 1
2015-09-01 12 1
2015-11-01 12 0
2016-01-01 12 0
2016-05-01 12 0
2016-08-01 12 1
2017-03-01 12 1
2017-05-01 12 1
2014-10-01 13 1
2015-04-01 13 1
2015-07-01 13 N/A
2015-09-01 13 N/A
2015-11-01 13 N/A
2016-01-01 13 N/A
2016-05-01 13 N/A
2016-08-01 13 N/A
2017-03-01 13 N/A
2017-05-01 13 N/A
2014-10-01 14 N/A
2015-04-01 14 N/A
2015-07-01 14 N/A
2015-09-01 14 N/A
2015-11-01 14 0
2016-01-01 14 0
2016-05-01 14 N/A
2016-08-01 14 N/A
2017-03-01 14 N/A
2017-05-01 14 N/A
...
Thank you for your help!
Here is an approach using the tidyverse:
library(tidyverse)
df %>%
  group_by(ID) %>%
  expand(Date) %>%      # within each ID, expand to all the dates
  left_join(df) -> df1  # join the original data frame and save to object df1
or save back to the original object with magrittr's compound-assignment pipe %<>% (thanks to Renu's comment):
df %<>%
  group_by(ID) %>%
  expand(Date) %>%
  left_join(df)
which is equivalent to:
df %>%
  group_by(ID) %>%
  expand(Date) %>%
  left_join(df) -> df
The result:
ID Date Status
1 12 2014-10-01 1
2 12 2015-04-01 1
3 12 2015-07-01 1
4 12 2015-09-01 1
5 12 2015-11-01 0
6 12 2016-01-01 0
7 12 2016-05-01 0
8 12 2016-08-01 1
9 12 2017-03-01 1
10 12 2017-05-01 1
11 13 2014-10-01 1
12 13 2015-04-01 1
13 13 2015-07-01 0
14 13 2015-09-01 NA
15 13 2015-11-01 NA
16 13 2016-01-01 NA
17 13 2016-05-01 NA
18 13 2016-08-01 NA
19 13 2017-03-01 NA
20 13 2017-05-01 NA
21 14 2014-10-01 NA
22 14 2015-04-01 NA
23 14 2015-07-01 NA
24 14 2015-09-01 NA
25 14 2015-11-01 0
26 14 2016-01-01 0
27 14 2016-05-01 NA
28 14 2016-08-01 NA
29 14 2017-03-01 NA
30 14 2017-05-01 NA
the data:
> dput(df)
structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 1L, 2L, 3L, 5L, 6L), .Label = c("2014-10-01", "2015-04-01",
"2015-07-01", "2015-09-01", "2015-11-01", "2016-01-01", "2016-05-01",
"2016-08-01", "2017-03-01", "2017-05-01"), class = "factor"),
ID = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
13L, 13L, 13L, 14L, 14L), Status = c(1L, 1L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Date",
"ID", "Status"), class = "data.frame", row.names = c(NA, -15L
))
The following worked for me (left_join here is from dplyr):
library(dplyr)
df_b <- data.frame(Date = rep(unique(df$Date), length(unique(df$ID))),
                   ID = rep(unique(df$ID), each = length(unique(df$Date))))
balanced_data <- left_join(df_b, df)
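Base R's expand.grid builds the same full grid a bit more compactly; a minimal sketch assuming the df from the dput above:
df_b <- expand.grid(Date = unique(df$Date), ID = unique(df$ID))  # all Date x ID combinations
balanced_data <- merge(df_b, df, all.x = TRUE)                   # rows with no match get NA Status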

Finding value in one data.frame and transferring value from other column

I don't know if I will be able to explain it correctly, but what I want to achieve is really simple.
Here is the first data.frame. The important value for me is in the first column, "V1":
> dput(Data1)
structure(list(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L
), V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NA", class = "factor"),
V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")
Second data.frame:
> dput(Data2)
structure(list(Names = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L,
8L), Herat = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L
), Grobpel = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "NA", class = "factor"), Hassynch = c(19L, 12L,
15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)), .Names = c("Names",
"Herat", "Grobpel", "Hassynch"), row.names = c(NA, -10L), class = "data.frame"
)
Each value in the first data.frame's V1 column can be found in the second data.frame's Names column. I would like to look it up there, copy the matching value from the fourth column (Hassynch), and put it in the second column of the first data.frame.
What is the fastest way to do this?
library(dplyr)
left_join(Data1, Data2, by=c("V1"="Names"))
# V1 V2 V3 Herat Grobpel Hassynch
# 1 10 NA 18 29 NA 12
# 2 5 NA 17 28 NA 14
# 3 3 NA 13 27 NA 16
# 4 9 NA 20 30 NA 19
# 5 1 NA 15 23 NA 18
# 6 2 NA 12 24 NA 11
# 7 6 NA 16 21 NA 15
# 8 4 NA 11 25 NA 20
# 9 8 NA 14 26 NA 17
# 10 7 NA 19 22 NA 13
# if you don't want V2 and V3, you could
left_join(Data1, Data2, by=c("V1"="Names")) %>%
select(-V2, -V3)
# V1 Herat Grobpel Hassynch
# 1 10 29 NA 12
# 2 5 28 NA 14
# 3 3 27 NA 16
# 4 9 30 NA 19
# 5 1 23 NA 18
# 6 2 24 NA 11
# 7 6 21 NA 15
# 8 4 25 NA 20
# 9 8 26 NA 17
# 10 7 22 NA 13
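If the goal is literally to fill V2 of Data1 with the matching Hassynch values, rather than keep every joined column, base R's match() does it in place; a minimal sketch assuming the dput data above:
Data1$V2 <- Data2$Hassynch[match(Data1$V1, Data2$Names)]  # look up each V1 in Names, take Hassynch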
Here's a toy example that I made some time ago to illustrate merge. left_join from dplyr is also good, and data.table almost certainly has another option.
You can subset your reference dataframe down to just the key variable and the value variable so that you don't end up with an unmanageable dataframe.
id<-as.numeric((1:5))
m<-c("a","a","a","","")
n<-c("","","b","b","b")
dfm<-data.frame(cbind(id,m))
head(dfm)
id m
1 1 a
2 2 a
3 3 a
4 4
5 5
dfn<-data.frame(cbind(id,n))
head(dfn)
id n
1 1
2 2
3 3 b
4 4 b
5 5 b
dfm$id<-as.numeric(dfm$id)
dfn$id<-as.numeric(dfn$id)
dfm<-subset(dfm,id<4)
head(dfm)
id m
1 1 a
2 2 a
3 3 a
dfn<-subset(dfn,id!=1 & id!=2)
head(dfn)
id n
3 3 b
4 4 b
5 5 b
df.all<-merge(dfm,dfn,by="id",all=TRUE)
head(df.all)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
4 4 <NA> b
5 5 <NA> b
df.all.m<-merge(dfm,dfn,by="id",all.x=TRUE)
head(df.all.m)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
df.all.n<-merge(dfm,dfn,by="id",all.y=TRUE)
head(df.all.n)
id m n
1 3 a b
2 4 <NA> b
3 5 <NA> b
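For reference, the three merge() calls above correspond to the dplyr join verbs; a side-by-side sketch assuming dfm and dfn as above:
library(dplyr)
full_join(dfm, dfn, by = "id")   # merge(dfm, dfn, by = "id", all = TRUE)
left_join(dfm, dfn, by = "id")   # merge(dfm, dfn, by = "id", all.x = TRUE)
right_join(dfm, dfn, by = "id")  # merge(dfm, dfn, by = "id", all.y = TRUE)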

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) that contains only the samples with non-NA values in the Shoulders column. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can use complete.cases here; it returns a logical vector that lets you subset the rows of the data by Shoulders. Note the comma before the closing bracket, which your attempt was missing (without it, the logical vector selects columns rather than rows):
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
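As for the subset() route mentioned in the question, it works too and reads cleanly; an equivalent one-liner:
subset(data, !is.na(Shoulders))  # keeps only rows with a recorded Shoulders value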
There is a subtle difference between using is.na and complete.cases. Applied to the whole data frame, complete.cases would remove every row containing any NA, whereas the objective here is only to control for one variable; NAs in the other columns can be legitimate data points, which is why both approaches above test the Shoulders column alone.
