Create group label for chunks of rows using data.table - r

I have something like the following dataset:
myDT <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22), row.names = c(NA,
-21L), class = c("data.table", "data.frame"))
and I would like to create a new column L2 that indexes every 2 rows within domain. However, if there is a remainder, as is the case for domain=2 and id=8,9,10, then those ids should be indexed together as long as they are within the same domain. Please note that the specific id values in the toy dataset are made up and not always consecutive as shown. The output would be:
structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22, L2=c(1L,1L,2L,2L,3L,3L,4L,4L,4L,
5L,5L,6L,6L,7L,7L,8L,8L,9L,9L,10L,10L)),
row.names = c(NA, -21L), class = c("data.table", "data.frame"))
Is there an efficient way to do this in data.table?
I've tried playing with .N/rowid and the integer-division operator %/% (since every n rows should give the same value) inside the data.table call, but it got me nowhere. For example, I tried something like:
myDT[, L2 := rowid(domain) %/% 2]
but clearly this doesn't address the requirements that the last 3 rows within domain=2 have the same index and that the index should continue incrementing for domain=3.
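For reference, this is what that attempt produces on the example data (shown purely for illustration, not part of the original question):
library(data.table)
setDT(myDT)
myDT[, rowid(domain) %/% 2]
# [1] 0 1 1 2 2 3 3 4 4 0 1 1 2 2 3 3 4 4 5 5 6
# the chunks are offset by one row and the index restarts at 0 in each domain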
EDIT: Please see the revised desired output data.table and the corresponding description.
EDIT 2: Here is an appended version of myDT:
myDT2 <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40,
L1 = 2:40), row.names = c(NA, -39L), class = c("data.table",
"data.frame"))
When I run @chinsoon12's code on the above, I get:
structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40,
L1 = 2:40, L2 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L,
5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 11L,
11L, 12L, 12L, 13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L,
17L, 18L, 18L)), row.names = c(NA, -39L), class = c("data.table",
"data.frame"))
There appear to be four rows with L2=11, when two of them should be 12 because they are in a different domain.

An idea is to make a custom function that creates sequential vectors based on the length of each group and the remainder of that length when divided by two. The function is:
f1 <- function(x) {
  v1 <- length(x)
  i1 <- rep(seq(floor(v1 / 2)), each = 2)
  i2 <- c(i1, rep(max(i1), v1 %% 2))
  i2 + seq_along(i2)
}
I tried to apply it via data.table but I was getting an error about a bug, so here it is with base R:
cumsum(c(TRUE, diff(with(myDT2, ave(id, domain, FUN = f1))) != 1))
#[1] 1 1 2 2 3 3 4 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19
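A rough sketch of the same idea staying inside data.table (pair_idx is a made-up helper, not from the answer above): compute a within-domain chunk index that folds a short remainder into the preceding full chunk, then turn the (domain, chunk) runs into one global index with rleid(). This assumes the rows are already grouped by domain, as in the example data.
library(data.table)
pair_idx <- function(n, size = 2) {
  idx <- ceiling(seq_len(n) / size)               # 1 1 2 2 3 3 ...
  if (n %% size != 0) {                           # a short remainder gets folded
    idx[idx == max(idx)] <- max(1L, n %/% size)   # into the preceding full chunk
  }
  idx
}
setDT(myDT2)[, chunk := pair_idx(.N), by = domain][
  , L2 := rleid(domain, chunk)][, chunk := NULL][]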

Here is another approach, updated for the edited question (inspired by @Sotos's use of cumsum):
1. For each domain, create a repeated sequence 1, 0, 1, 0, ..., forcing the final element of each domain's sequence to zero so that a trailing unpaired row stays in the preceding group.
2. Take the cumsum of the created sequence across domains.
library(data.table)
setDT(myDT2)
myDT2[, L2 := c(head(rep_len(c(1, 0), .N), -1), 0), by = domain][, L2 := cumsum(L2)][]
#> domain id L1 L2
#> 1: 2 2 2 1
#> 2: 2 3 3 1
#> 3: 2 4 4 2
#> 4: 2 5 5 2
#> 5: 2 6 6 3
#> 6: 2 7 7 3
#> 7: 2 8 8 4
#> 8: 2 9 9 4
#> 9: 2 10 10 4
#> 10: 3 11 11 5
#> 11: 3 12 12 5
#> 12: 3 13 13 6
#> 13: 3 14 14 6
#> 14: 3 15 15 7
#> 15: 3 16 16 7
#> 16: 3 17 17 8
#> 17: 3 18 18 8
#> 18: 3 19 19 9
#> 19: 3 20 20 9
#> 20: 3 21 21 10
#> 21: 3 22 22 10
#> 22: 4 23 23 11
#> 23: 4 24 24 11
#> 24: 5 25 25 12
#> 25: 5 26 26 12
#> 26: 5 27 27 13
#> 27: 5 28 28 13
#> 28: 5 29 29 14
#> 29: 5 30 30 14
#> 30: 5 31 31 15
#> 31: 5 32 32 15
#> 32: 5 33 33 16
#> 33: 5 34 34 16
#> 34: 5 35 35 17
#> 35: 5 36 36 17
#> 36: 5 37 37 18
#> 37: 5 38 38 18
#> 38: 5 39 39 19
#> 39: 5 40 40 19
#> domain id L1 L2

Here is another option that allows a chunk size other than 2:
n <- 4
setDT(myDT)[, L2 :=
  myDT[, {
    x <- ceiling(seq_along(id) / n)
    if (sum(x == x[.N]) < n) x[x == x[.N]] <- floor(.N / n)
    x
  }, domain][, rleid(domain, V1)]
]
Or an approach that carries a running offset across domains:
n <- 4
s <- 0
setDT(myDT)[, L2 :=
  myDT[, {
    x <- s + ceiling(seq_along(id) / n)
    if (sum(x == x[.N]) < n) x[x == x[.N]] <- s + floor(.N / n)
    s <- if (s < max(x)) max(x) else s + 1
    x
  }, domain]$V1
]
output for n=2:
domain id L1 L2
1: 2 2 2 1
2: 2 3 3 1
3: 2 4 4 2
4: 2 5 5 2
5: 2 6 6 3
6: 2 7 7 3
7: 2 8 8 4
8: 2 9 9 4
9: 2 10 10 4
10: 3 11 11 5
11: 3 12 12 5
12: 3 13 13 6
13: 3 14 14 6
14: 3 15 15 7
15: 3 16 16 7
16: 3 17 17 8
17: 3 18 18 8
18: 3 19 19 9
19: 3 20 20 9
20: 3 21 21 10
21: 3 22 22 10
22: 4 23 23 11
23: 4 24 24 11
24: 5 25 25 12
25: 5 26 26 12
26: 5 27 27 13
27: 5 28 28 13
28: 5 29 29 14
29: 5 30 30 14
30: 5 31 31 15
31: 5 32 32 15
32: 5 33 33 16
33: 5 34 34 16
34: 5 35 35 17
35: 5 36 36 17
36: 5 37 37 18
37: 5 38 38 18
38: 5 39 39 19
39: 5 40 40 19
domain id L1 L2
output for n=4:
domain id L1 L2
1: 2 2 2 1
2: 2 3 3 1
3: 2 4 4 1
4: 2 5 5 1
5: 2 6 6 2
6: 2 7 7 2
7: 2 8 8 2
8: 2 9 9 2
9: 2 10 10 2
10: 3 11 11 3
11: 3 12 12 3
12: 3 13 13 3
13: 3 14 14 3
14: 3 15 15 4
15: 3 16 16 4
16: 3 17 17 4
17: 3 18 18 4
18: 3 19 19 5
19: 3 20 20 5
20: 3 21 21 5
21: 3 22 22 5
22: 4 23 23 6
23: 4 24 24 6
24: 5 25 25 7
25: 5 26 26 7
26: 5 27 27 7
27: 5 28 28 7
28: 5 29 29 8
29: 5 30 30 8
30: 5 31 31 8
31: 5 32 32 8
32: 5 33 33 9
33: 5 34 34 9
34: 5 35 35 9
35: 5 36 36 9
36: 5 37 37 10
37: 5 38 38 10
38: 5 39 39 10
39: 5 40 40 10
domain id L1 L2

Related

(R) Apply running row sums in a table object with 2 variables

The following is a reproducible sample of data which records the duration of 300 absences. month is the first month of the absence and length is the number of consecutive months the absence lasted.
df <- data.frame("month" = sample(c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"),300, replace = TRUE),
"length" = sample.int(6, size = 300, replace = TRUE))
df$month <- factor(df$month, levels(df$month)[c(5,4,8,1,9,7,6,2,12,11,10,3)])
Using table(df$length) you can see how many separate absences lasted for exactly each value of length.
1 2 3 4 5 6
55 45 42 56 51 51
But because length is incremental, if I wanted to show the total number of absences that reached (though did not necessarily end at) a certain number of months, I could use rev(cumsum(rev(table(df$length)))), which gives:
1 2 3 4 5 6
300 245 200 158 102 51
I am interested in seeing this cumulative view by month, but rev(cumsum(rev(table(df$month, df$length)))) returns a vector and not a table.
The result I would like is to take this
table(df$month, df$length)
1 2 3 4 5 6
jan 5 5 4 5 3 2
feb 5 7 2 7 9 3
mar 5 3 2 2 9 4
apr 6 7 4 4 3 11
may 5 5 3 5 5 2
jun 4 4 2 7 4 5
jul 4 3 5 5 1 4
aug 4 0 5 3 6 7
sep 4 5 4 4 3 3
oct 4 2 1 6 5 4
nov 5 2 3 5 2 2
dec 4 2 7 3 1 4
and turn it into this, where the reverse cumulative count of length is calculated for each month.
1 2 3 4 5 6
jan 24 19 14 10 5 2
feb 33 28 21 19 12 3
mar 25 20 17 15 13 4
apr 35 29 22 18 14 11
may 25 20 15 12 7 2
jun 26 22 18 16 9 5
jul 22 18 15 10 5 4
aug 25 21 21 16 13 7
sep 23 19 14 10 6 3
oct 22 18 16 15 9 4
nov 19 14 12 9 4 2
dec 21 17 15 8 5 4
Is there a way to do this using table()? If not, I am open to any solution. Thanks in advance.
We can use rowCumsums from matrixStats on the columns taken in reverse order (ncol(tbl):1 as the column index), and then reverse the column index again to restore the original order.
library(matrixStats)
tbl <- table(df$month, df$length)
tbl[] <- rowCumsums(tbl[, ncol(tbl):1])[, ncol(tbl):1]
tbl
#
# 1 2 3 4 5 6
# jan 24 19 14 10 5 2
# feb 33 28 21 19 12 3
# mar 25 20 17 15 13 4
# apr 35 29 22 18 14 11
# may 25 20 15 12 7 2
# jun 26 22 18 16 9 5
# jul 22 18 15 10 5 4
# aug 25 21 21 16 13 7
# sep 23 19 14 10 6 3
# oct 22 18 16 15 9 4
# nov 19 14 12 9 4 2
# dec 21 17 15 8 5 4
Or, in base R, it would be cumsum with apply:
tbl[] <- t(apply(tbl[, ncol(tbl):1], 1, cumsum))[, ncol(tbl):1]
data
tbl <- structure(c(5L, 5L, 5L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 5L, 4L, 5L,
7L, 3L, 7L, 5L, 4L, 3L, 0L, 5L, 2L, 2L, 2L, 4L, 2L, 2L, 4L, 3L,
2L, 5L, 5L, 4L, 1L, 3L, 7L, 5L, 7L, 2L, 4L, 5L, 7L, 5L, 3L, 4L,
6L, 5L, 3L, 3L, 9L, 9L, 3L, 5L, 4L, 1L, 6L, 3L, 5L, 2L, 1L, 2L,
3L, 4L, 11L, 2L, 5L, 4L, 7L, 3L, 4L, 2L, 4L), .Dim = c(12L, 6L
), .Dimnames = structure(list(c("jan", "feb", "mar", "apr", "may",
"jun", "jul", "aug", "sep", "oct", "nov", "dec"), c("1", "2",
"3", "4", "5", "6")), .Names = c("", "")), class = "table")
If you create a data frame rather than a table-class object, you can use Reduce with + as the function and accumulate = TRUE to get a cumulative sum. Before creating the "table" (in quotes since the class is not "table"), I made a factor version of the month column so the months would stay in the same order.
df$month_fac <- with(df, factor(month, levels = unique(month)))
tbl <- data.table::dcast(df, month_fac ~ length)
tbl[ncol(tbl):2] <- Reduce('+', rev(tbl[-1]), accumulate = TRUE)
The output is the tbl object, but I didn't bother showing it because you didn't set a seed so the (random) values will be different from the output shown in the question.
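To make the Reduce() step concrete, here is a tiny illustration with made-up numbers (not from the question's data): with accumulate = TRUE it returns the running sums of the reversed columns, i.e. a column-wise reverse cumulative sum, which the line above then assigns back into columns ncol(tbl):2.
cols <- list(`1` = c(5, 4), `2` = c(3, 2), `3` = c(1, 6))
Reduce(`+`, rev(cols), accumulate = TRUE)
# [[1]] 1 6     (column 3 alone)
# [[2]] 4 8     (column 3 + column 2)
# [[3]] 9 12    (column 3 + column 2 + column 1)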

Adding data to a data.frame to complete a sequence [duplicate]

This question already has answers here:
Merge Panel data to get balanced panel data
(2 answers)
Closed 5 years ago.
With the data included below, the first bit of which looks like
head(dat, 9)
IndID BinID Freq
1 BHS_034_A 7 20
2 BHS_034_A 8 27
3 BHS_034_A 9 67
4 BHS_034_A 10 212
5 BHS_037_A 5 1
6 BHS_037_A 7 12
7 BHS_037_A 8 65
8 BHS_037_A 9 122
9 BHS_037_A 10 301
I want to fill in missing numbers of BinID so that all individuals (IndID) have a BinID sequence from 1 to 10. Freq values should be 0 when new values of BinID are added.
I hope to accommodate many individuals, but have only included a few here.
This question is similar to another post, but here I am also trying to fill the added rows with 0.
The data:
dat <- structure(list(IndID = c("BHS_034_A", "BHS_034_A", "BHS_034_A",
"BHS_034_A", "BHS_037_A", "BHS_037_A", "BHS_037_A", "BHS_037_A",
"BHS_037_A", "BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_068_A",
"BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_070_A", "BHS_070_A",
"BHS_070_A", "BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A",
"BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A", "BHS_071_A"
), BinID = c(7L, 8L, 9L, 10L, 5L, 7L, 8L, 9L, 10L, 3L, 4L, 5L,
7L, 8L, 9L, 10L, 8L, 9L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L), Freq = c(20L, 27L, 67L, 212L, 1L, 12L, 65L, 122L, 301L,
2L, 1L, 1L, 4L, 14L, 104L, 454L, 7L, 90L, 470L, 6L, 11L, 11L,
7L, 18L, 19L, 15L, 31L, 344L)), .Names = c("IndID", "BinID",
"Freq"), row.names = c(NA, 28L), class = "data.frame")
tidyr provides the complete function that allows you to find missing combinations in your dataset:
tidyr::complete(dat, IndID, BinID = 1:10)
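If 0 rather than NA is wanted for the added rows, complete() also takes a fill argument (a small addition to the one-liner above):
tidyr::complete(dat, IndID, BinID = 1:10, fill = list(Freq = 0))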
Using:
library(data.table)
setDT(dat)[CJ(BinID = 1:10, IndID = IndID, unique = TRUE), on = .(IndID, BinID)]
Or:
library(dplyr)
library(tidyr)
dat %>%
  group_by(IndID) %>%
  expand(BinID = 1:10) %>%
  left_join(., dat)
gives:
IndID BinID Freq
1: BHS_034_A 1 NA
2: BHS_037_A 1 NA
3: BHS_068_A 1 NA
4: BHS_070_A 1 NA
5: BHS_071_A 1 NA
6: BHS_034_A 2 NA
7: BHS_037_A 2 NA
8: BHS_068_A 2 NA
9: BHS_070_A 2 NA
10: BHS_071_A 2 6
11: BHS_034_A 3 NA
12: BHS_037_A 3 NA
13: BHS_068_A 3 2
14: BHS_070_A 3 NA
15: BHS_071_A 3 11
16: BHS_034_A 4 NA
17: BHS_037_A 4 NA
18: BHS_068_A 4 1
19: BHS_070_A 4 NA
20: BHS_071_A 4 11
21: BHS_034_A 5 NA
22: BHS_037_A 5 1
23: BHS_068_A 5 1
24: BHS_070_A 5 NA
25: BHS_071_A 5 7
26: BHS_034_A 6 NA
27: BHS_037_A 6 NA
28: BHS_068_A 6 NA
29: BHS_070_A 6 NA
30: BHS_071_A 6 18
31: BHS_034_A 7 20
32: BHS_037_A 7 12
33: BHS_068_A 7 4
34: BHS_070_A 7 NA
35: BHS_071_A 7 19
36: BHS_034_A 8 27
37: BHS_037_A 8 65
38: BHS_068_A 8 14
39: BHS_070_A 8 7
40: BHS_071_A 8 15
41: BHS_034_A 9 67
42: BHS_037_A 9 122
43: BHS_068_A 9 104
44: BHS_070_A 9 90
45: BHS_071_A 9 31
46: BHS_034_A 10 212
47: BHS_037_A 10 301
48: BHS_068_A 10 454
49: BHS_070_A 10 470
50: BHS_071_A 10 344
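Since the question asks for 0 rather than NA in the new rows, a possible follow-up (a sketch, assuming the data.table result above is kept in res):
library(data.table)
res <- setDT(dat)[CJ(BinID = 1:10, IndID = IndID, unique = TRUE), on = .(IndID, BinID)]
res[is.na(Freq), Freq := 0L]   # replace the NAs introduced by the join with 0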

Combine two data frames considering levels of factor of one data frame and column name of another data frame using r

I need to create a new column for an existing data frame based on the levels of a factor. I have 2 data frames called dat_group and dat_price. These data frames look like below.
dat_group
Group
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 A
10 A
11 C
12 C
13 C
14 C
15 C
16 C
17 C
18 C
19 C
20 C
21 B
22 B
23 B
24 B
25 B
26 B
27 B
28 B
29 B
30 B
dat_price
A B C
1 21 45 24
2 21 45 24
3 21 45 24
4 21 45 24
5 15 11 10
6 15 11 10
7 15 11 10
8 20 13 55
9 20 13 55
10 20 13 55
I need to fill in the values from the A, B and C columns according to the level in dat_group, keeping the rows in the same order. If I create a new column in dat_group called Price,
dat_group$Price<-NA
then the data frame should look like this:
Group Price
1 A 21
2 A 21
3 A 21
4 A 21
5 A 15
6 A 15
7 A 15
8 A 20
9 A 20
10 A 20
11 C 24
12 C 24
13 C 24
14 C 24
15 C 10
16 C 10
17 C 10
18 C 55
19 C 55
20 C 55
21 B 45
22 B 45
23 B 45
24 B 45
25 B 11
26 B 11
27 B 11
28 B 13
29 B 13
30 B 13
I tried to do this using some available examples (e.g.1, e.g.2), but it did not work.
Could anybody please help me? The two example data frames can be created with the code below. My actual data set has several thousand rows.
dat_group<- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor")), .Names = "Group", class = "data.frame", row.names = c(NA,
-30L))
dat_price<-structure(list(A = c(21L, 21L, 21L, 21L, 15L, 15L, 15L, 20L,
20L, 20L), B = c(45L, 45L, 45L, 45L, 11L, 11L, 11L, 13L, 13L,
13L), C = c(24L, 24L, 24L, 24L, 10L, 10L, 10L, 55L, 55L, 55L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -10L))
library(data.table)
dat_price <- as.data.table(dat_price)
dat_price_new <- cbind(dat_price[, c(1, 3), with = FALSE],
                       dat_price[, 2, with = FALSE])
melt(dat_price_new)
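For completeness (an assumption about the intended final step, which the answer above leaves implicit): because dat_group holds 10 A rows, then 10 C rows, then 10 B rows, the melted value column lines up with dat_group's row order once the columns are reordered to A, C, B, so the assignment would be:
dat_group$Price <- melt(dat_price_new)$value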
Here is a more defensive solution to the problem at hand. It should work even if your factor's levels do not all appear in identical multiples.
library(dplyr); library(purrr); library(magrittr)
dat_group$original_order <- seq_len(nrow(dat_group))
dat_group %<>%
  split(.$Group) %>%
  map(~ mutate(., Price = rep(na.omit(dat_price[, unique(Group)]),
                              n() / length(na.omit(dat_price[, unique(Group)]))))) %>%
  bind_rows() %>%
  arrange(original_order) %>%
  select(-original_order)
dat_group
Group Price
1 A 21
2 A 21
3 A 21
4 A 21
5 A 15
6 A 15
7 A 15
8 A 20
9 A 20
10 A 20
11 C 24
12 C 24
13 C 24
14 C 24
15 C 10
16 C 10
17 C 10
18 C 55
19 C 55
20 C 55
21 B 45
22 B 45
23 B 45
24 B 45
25 B 11
26 B 11
27 B 11
28 B 13
29 B 13
30 B 13
Original (lazy) solution (note this relies on the groups appearing in the same order as dat_price's columns, which is not the case here since dat_group runs A, C, B):
dat_group$Price <- rep(unlist(dat_price), length.out = nrow(dat_group))
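Another base R sketch (my addition, not from the answers above): index dat_price directly with a (row position within group, matching column) matrix. It assumes each group's rows line up with dat_price's rows in order, recycling if a group is longer than nrow(dat_price).
pos <- ave(seq_len(nrow(dat_group)), dat_group$Group, FUN = seq_along)  # position within each group
pos <- (pos - 1L) %% nrow(dat_price) + 1L                               # recycle if a group is longer
dat_group$Price <- dat_price[cbind(pos, match(dat_group$Group, names(dat_price)))]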

Finding value in one data.frame and transferring value from another column

I don't know if I will be able to explain it correctly, but what I want to achieve is really simple.
That's the first data.frame. The important value for me is in the first column, "V1".
> dput(Data1)
structure(list(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L
), V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NA", class = "factor"),
V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")
Second data.frame:
> dput(Data2)
structure(list(Names = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L,
8L), Herat = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L
), Grobpel = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "NA", class = "factor"), Hassynch = c(19L, 12L,
15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)), .Names = c("Names",
"Herat", "Grobpel", "Hassynch"), row.names = c(NA, -10L), class = "data.frame"
)
The value from the first data.frame can be found in the first column of the second one (Names). I would like to look it up there, copy the value from the fourth column (Hassynch) and put it in the second column of the first data.frame.
What is the fastest way to do it?
library(dplyr)
left_join(Data1, Data2, by=c("V1"="Names"))
# V1 V2 V3 Herat Grobpel Hassynch
# 1 10 NA 18 29 NA 12
# 2 5 NA 17 28 NA 14
# 3 3 NA 13 27 NA 16
# 4 9 NA 20 30 NA 19
# 5 1 NA 15 23 NA 18
# 6 2 NA 12 24 NA 11
# 7 6 NA 16 21 NA 15
# 8 4 NA 11 25 NA 20
# 9 8 NA 14 26 NA 17
# 10 7 NA 19 22 NA 13
# if you don't want V2 and V3, you could
left_join(Data1, Data2, by=c("V1"="Names")) %>%
select(-V2, -V3)
# V1 Herat Grobpel Hassynch
# 1 10 29 NA 12
# 2 5 28 NA 14
# 3 3 27 NA 16
# 4 9 30 NA 19
# 5 1 23 NA 18
# 6 2 24 NA 11
# 7 6 21 NA 15
# 8 4 25 NA 20
# 9 8 26 NA 17
# 10 7 22 NA 13
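As the following answer notes, data.table almost certainly has another option; here is a sketch of it (my addition, not from either answer): an update join that writes the matched Hassynch values into Data1 by reference, added as a new column to avoid overwriting the factor column V2.
library(data.table)
setDT(Data1); setDT(Data2)
Data1[Data2, Hassynch := i.Hassynch, on = .(V1 = Names)]   # i.Hassynch refers to Data2's column
Data1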
Here's a toy example that I made some time ago to illustrate merge. left_join from dplyr is also good, and data.table almost certainly has another option.
You can subset your reference data frame to just the key variable and the value variable so that you don't end up with an unmanageable data frame.
id<-as.numeric((1:5))
m<-c("a","a","a","","")
n<-c("","","b","b","b")
dfm<-data.frame(cbind(id,m))
head(dfm)
id m
1 1 a
2 2 a
3 3 a
4 4
5 5
dfn<-data.frame(cbind(id,n))
head(dfn)
id n
1 1
2 2
3 3 b
4 4 b
5 5 b
dfm$id<-as.numeric(dfm$id)
dfn$id<-as.numeric(dfn$id)
dfm<-subset(dfm,id<4)
head(dfm)
id m
1 1 a
2 2 a
3 3 a
dfn<-subset(dfn,id!=1 & id!=2)
head(dfn)
id n
3 3 b
4 4 b
5 5 b
df.all<-merge(dfm,dfn,by="id",all=TRUE)
head(df.all)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
4 4 <NA> b
5 5 <NA> b
df.all.m<-merge(dfm,dfn,by="id",all.x=TRUE)
head(df.all.m)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
df.all.n<-merge(dfm,dfn,by="id",all.y=TRUE)
head(df.all.n)
id m n
1 3 a b
2 4 <NA> b
3 5 <NA> b

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) containing only the samples that have non-NA Shoulders values. I've tried to accomplish this using the complete.cases function, but I imagine I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can try complete.cases too, which returns a logical vector you can use to subset the data by Shoulders:
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
There is a subtle difference between using is.na and complete.cases: applied to the whole data frame, complete.cases would drop rows with an NA in any column, whereas the objective here is only to filter on one variable (Shoulders); NAs in the other columns could be legitimate data points and should be kept.
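For completeness (not part of the answers above): the subset() route mentioned in the question also works, since !is.na(Shoulders) is just a logical filter evaluated inside the data frame:
just.samples.with.shoulder.values <- subset(data, !is.na(Shoulders))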

Resources