R: Subsetting a data.table with repeated column names with numerical positions - r

I have a data.table that looks like this
> DT
A B C A B C D
1: 1 2 3 3 5 6 7
2: 2 1 3 2 1 3 4
Here's the dput
DT <- structure(list(A = 1:2, B = c(2L, 1L), C = c(3L, 3L), A = c(3L,
2L), B = c(5L, 1L), C = c(6L, 3L), D = c(7L, 4L)), .Names = c("A",
"B", "C", "A", "B", "C", "D"), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
Basically, I want to subset them according to their headers. So for header "B", I would do this:
subset(DT,,grep(unique(names(DT))[2],names(DT)))
B B
1: 2 2
2: 1 1
As you can see, the values are wrong as the second column is simply a repeat of the first. I want to get this instead:
B B
1: 2 5
2: 1 1
Can anyone help me please?

The following alternatives work for me:
pos <- grep("B", names(DT))
DT[, ..pos]
# B B
# 1: 2 5
# 2: 1 1
DT[, .SD, .SDcols = patterns("B")]
# B B
# 1: 2 5
# 2: 1 1
DT[, names(DT) %in% unique(names(DT))[2], with = FALSE]
# B B
# 1: 2 5
# 2: 1 1
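What these alternatives have in common is that they select columns by position (or by a logical mask over positions) rather than by name, so the duplicated names never have to be resolved. A minimal sketch of the same idea, assuming an exact name match is wanted (grep("B", ...) would also pick up a column named, say, "AB"); the variable names pos and res are only illustrative:

```r
library(data.table)

# Rebuild the example table; data.table() keeps duplicated names as given
# because check.names = FALSE is its default.
DT <- data.table(A = 1:2, B = c(2L, 1L), C = c(3L, 3L),
                 A = c(3L, 2L), B = c(5L, 1L), C = c(6L, 3L), D = c(7L, 4L))

# Positions of every column whose name is exactly "B"
pos <- which(names(DT) == "B")

# Select by position; with = FALSE makes j be treated as column positions
res <- DT[, pos, with = FALSE]
print(res)
```

Swapping "B" for unique(names(DT))[2] gives the header-driven version asked for in the question.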

Related

How do I eliminate duplicates in data.table using mult

The solution to this simple problem has eluded me for several hours. I have a data table in which a value is identified by several classification variables (A, B, L). Where there are observations characterized by duplicate classification variables A & B, I want to retain the one that has the highest 'L'. So, if I have a table generated with this code
set.seed(17)
DT <- data.table(A=rep(c("a","b"),each=5),
B=c("a","b","c","d","d","a","b","b","c","d"),
L=c(1,1,1,2,1,1,1,2,1,1),
val=rnbinom(10, size=2, mu=3))
Making the following:
A B L val
1: a a 1 1
2: a b 1 10
3: a c 1 3
4: a d 1 5
5: a d 2 2
6: b a 1 8
7: b b 1 7
8: b b 2 1
9: b c 1 2
10: b d 1 2
I have tried commands such as
setkey(DT,A,B,L)
DT[ , .(A,B,L,val) , mult="last"]
but I'm just not getting something.
I want a resulting table that looks like this
A B L val
1: a a 1 1
2: a b 1 10
3: a c 1 3
5: a d 2 2
6: b a 1 8
8: b b 2 1
9: b c 1 2
10: b d 1 2
DT[, lapply(.SD, last), .(A,B)]
should also work and seems to be a bit faster than the merge solution
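Note that taking last() per group only returns the row with the highest L if the rows are already sorted by L within each (A, B) group, for example after the setkey(DT, A, B, L) call from the question. A small sketch under that assumption; the val column here is a stand-in (1:10) rather than the random draw from the question, so that the result is reproducible:

```r
library(data.table)

DT <- data.table(A = rep(c("a", "b"), each = 5),
                 B = c("a", "b", "c", "d", "d", "a", "b", "b", "c", "d"),
                 L = c(1, 1, 1, 2, 1, 1, 1, 2, 1, 1),
                 val = 1:10)   # stand-in values, not the rnbinom() draw

# Sort by A, B, then L, so the last row of each (A, B) group has the largest L
setkey(DT, A, B, L)

# Last row of every column within each (A, B) group
res_last <- DT[, lapply(.SD, last), by = .(A, B)]

# Same result: keep the last of the rows that share the same (A, B)
res_unique <- unique(DT, by = c("A", "B"), fromLast = TRUE)

print(res_last)
```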
Another option, using merge:
library(data.table)
dt <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b",
"b", "b"), B = c("a", "b", "c", "d", "d", "a", "b", "b", "c",
"d"), L = c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L), val = c(1L,
10L, 3L, 5L, 2L, 8L, 7L, 1L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-10L))
setDT(dt)
merge(dt[, list(L = last(L)), by =list(A, B)], dt)
#> A B L val
#> 1: a a 1 1
#> 2: a b 1 10
#> 3: a c 1 3
#> 4: a d 2 2
#> 5: b a 1 8
#> 6: b b 2 1
#> 7: b c 1 2
#> 8: b d 1 2
Created on 2021-03-24 by the reprex package (v1.0.0)
set.seed(17)
library(data.table)
DT <- data.table(A=rep(c("a","b"),each=5),
B=c("a","b","c","d","d","a","b","b","c","d"),
L=c(1,1,1,2,1,1,1,2,1,1),
val=rnbinom(10, size=2, mu=3))
result <- DT[DT[, .I[L == max(L)], by = list(A, B)]$V1]
> result
A B L val
1: a a 1 1
2: a b 1 1
3: a c 1 3
4: a d 2 12
5: b a 1 6
6: b b 2 2
7: b c 1 3
8: b d 1 5
Here's how I'd do it (without mult)
DT[order(-L), .SD[1], .(A,B)]
With mult, something like this would do it. Note that I'm doing an actual join here:
DT[order(L)][unique(DT[, .(A, B)]), on = c('A', 'B'), mult = 'last']
#> A B L val
#> 1: a a 1 1
#> 2: a b 1 1
#> 3: a c 1 3
#> 4: a d 2 12
#> 5: b a 1 6
#> 6: b b 2 2
#> 7: b c 1 3
#> 8: b d 1 5
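The attempt in the question returns every row because mult only takes effect in a join, that is, when i is a table whose rows can each match several rows of x; without such an i it has nothing to choose between. A hedged sketch comparing the join form with a grouped which.max(), using a stand-in val column (1:10) instead of the random draw; keys, res_join and res_max are illustrative names:

```r
library(data.table)

DT <- data.table(A = rep(c("a", "b"), each = 5),
                 B = c("a", "b", "c", "d", "d", "a", "b", "b", "c", "d"),
                 L = c(1, 1, 1, 2, 1, 1, 1, 2, 1, 1),
                 val = 1:10)   # stand-in values, not the rnbinom() draw

# One row per distinct (A, B) combination; this is the i of the join
keys <- unique(DT[, .(A, B)])

# Join the L-sorted table to the keys; mult = "last" keeps, for each key,
# the last matching row, i.e. the one with the largest L
res_join <- DT[order(L)][keys, on = c("A", "B"), mult = "last"]

# Grouped alternative without a join; which.max() picks the first maximum
res_max <- DT[, .SD[which.max(L)], by = .(A, B)]

print(res_join)
```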

Remove NA in front of one specific string but leave in front of another specific string, by group

I have this data frame:
df <- data.frame(
id = rep(1:4, each = 4),
status = c(
NA, "a", "c", "a",
NA, "b", "c", "c",
NA, NA, "a", "c",
NA, NA, "b", "b"),
stringsAsFactors = FALSE)
For each group (id), I aim to remove the rows with one or multiple leading NA in front of an "a" (in the column "status") but not in front of a "b".
The final data frame should look like this:
structure(list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
status = c("a", "c", "a", NA, "b", "c", "c", "a", "c", NA, NA, "b", "b")),
.Names = c("id", "status"), row.names = c(NA, -13L), class = "data.frame")
How do I do that?
Edit: alternatively, how would I do it to preserve other variables in the data frame such as the variable otherVar in the following example:
df2 <- data.frame(
id = rep(1:4, each = 4),
status = c(
NA, "a", "c", "a",
NA, "b", "c", "c",
NA, NA, "a", "c",
NA, NA, "b", "b"),
otherVar = letters[1:16],
stringsAsFactors = FALSE)
We can group by 'id', summarise 'status' by pasting its elements together with toString, then use gsub to remove the NAs before the 'a' and convert the result back to 'long' format with separate_rows
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
summarise(status = gsub("(NA, ){1,}(?=a)", "", toString(status),
perl = TRUE)) %>%
separate_rows(status, convert = TRUE)
# A tibble: 13 x 2
# id status
# <int> <chr>
# 1 1 a
# 2 1 c
# 3 1 a
# 4 2 NA
# 5 2 b
# 6 2 c
# 7 2 c
# 8 3 a
# 9 3 c
#10 4 NA
#11 4 NA
#12 4 b
#13 4 b
Or using data.table with the same methodology
library(data.table)
out1 <- setDT(df)[, strsplit(gsub("(NA, ){1,}(?=a)", "",
toString(status), perl = TRUE), ", "), id]
setnames(out1, 'V1', "status")[]
# id status
# 1: 1 a
# 2: 1 c
# 3: 1 a
# 4: 2 NA
# 5: 2 b
# 6: 2 c
# 7: 2 c
# 8: 3 a
# 9: 3 c
#10: 4 NA
#11: 4 NA
#12: 4 b
#13: 4 b
Update
For the updated dataset 'df2'
i1 <- setDT(df2)[, .I[seq(which(c(diff((status %in% "a") +
rleid(is.na(status))) > 1), FALSE))] , id]$V1
df2[-i1]
# id status otherVar
# 1: 1 a b
# 2: 1 c c
# 3: 1 a d
# 4: 2 NA e
# 5: 2 b f
# 6: 2 c g
# 7: 2 c h
# 8: 3 a k
# 9: 3 c l
#10: 4 NA m
#11: 4 NA n
#12: 4 b o
#13: 4 b p
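Using na.locf from zoo together with is.na. Note that this assumes your data is ordered, and that it does not group by id, so the backward fill runs across group boundaries.
library(zoo)
df[!(na.locf(df$status, fromLast = TRUE) == "a" & is.na(df$status)), ]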
id status
2 1 a
3 1 c
4 1 a
5 2 <NA>
6 2 b
7 2 c
8 2 c
11 3 a
12 3 c
13 4 <NA>
14 4 <NA>
15 4 b
16 4 b
Here's a dplyr solution and a not-as-pretty base R translation:
dplyr
library(dplyr)
df %>% group_by(id) %>%
filter(status[!is.na(status)][1]!="a" | !is.na(status))
# # A tibble: 13 x 2
# # Groups: id [4]
# id status
# <int> <chr>
# 1 1 a
# 2 1 c
# 3 1 a
# 4 2 <NA>
# 5 2 b
# 6 2 c
# 7 2 c
# 8 3 a
# 9 3 c
# 10 4 <NA>
# 11 4 <NA>
# 12 4 b
# 13 4 b
base
do.call(rbind,
lapply(split(df,df$id),
function(x) x[x$status[!is.na(x$status)][1]!="a" | !is.na(x$status),]))
# id status
# 1.2 1 a
# 1.3 1 c
# 1.4 1 a
# 2.5 2 <NA>
# 2.6 2 b
# 2.7 2 c
# 2.8 2 c
# 3.11 3 a
# 3.12 3 c
# 4.13 4 <NA>
# 4.14 4 <NA>
# 4.15 4 b
# 4.16 4 b
note
This fails if a group contains NAs that are not leading: in any group whose first non-NA value is "a", every NA row is removed, not just the leading ones.
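A sketch that avoids that limitation by dropping only the run of NAs before the first non-NA value of each group, and only when that value is "a". It is written in base R, keeps other columns such as otherVar, and assumes the rows are already in the intended order within each id; the helper name keep_rows is made up for this example:

```r
df2 <- data.frame(
  id = rep(1:4, each = 4),
  status = c(NA, "a", "c", "a",
             NA, "b", "c", "c",
             NA, NA, "a", "c",
             NA, NA, "b", "b"),
  otherVar = letters[1:16],
  stringsAsFactors = FALSE)

# For one group's status vector, return TRUE for the rows to keep
keep_rows <- function(status) {
  first <- which(!is.na(status))[1]        # index of the first non-NA value
  if (is.na(first) || status[first] != "a") {
    return(rep(TRUE, length(status)))      # nothing to drop in this group
  }
  seq_along(status) >= first               # drop only the leading NA run
}

# ave() applies keep_rows per id and returns the results in the original row
# order; it stores the logicals as 0/1 integers, hence as.logical().
keep <- as.logical(ave(seq_len(nrow(df2)), df2$id,
                       FUN = function(i) keep_rows(df2$status[i])))
res <- df2[keep, ]
print(res)
```

An NA that appears after the first "a" in a group is kept, for example keep_rows(c(NA, "a", NA, "a")) keeps the last three positions.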

Counting successor combinations in a data.frame

I got a data.frame that looks like the following one:
OBJECT ID TASK
1 A
1 C
1 D
1 E
2 A
2 B
2 C
2 D
2 F
Now I would like to count the unique successive combinations within the data.frame in order to get following result:
PREDECESSOR SUCCESSOR COUNT
A C 1
C D 2
D E 1
A B 1
B C 1
D F 1
I've already figured out to extract the successive values with the help of two for loops, but I'm failing the assignment and counting task within a new data.frame (or list).
aggregate(COUNT~.,
data.frame(PREDECESSOR = head(df1$TASK, -1),
SUCCESSOR = tail(df1$TASK, -1),
COUNT = 1),
length)
# PREDECESSOR SUCCESSOR COUNT
#1 E A 1
#2 A B 1
#3 A C 1
#4 B C 1
#5 C D 2
#6 D E 1
#7 D F 1
You could use a similar approach even if you want to first split by OBJECT.ID
temp = do.call(rbind, lapply(split(df1, df1$OBJECT.ID), function(X){
aggregate(COUNT~., data.frame(PREDECESSOR = head(X$TASK, -1),
SUCCESSOR = tail(X$TASK, -1),
COUNT = 1),
length)
}))
aggregate(COUNT~., temp, sum)
# PREDECESSOR SUCCESSOR COUNT
#1 A C 1
#2 B C 1
#3 C D 2
#4 D E 1
#5 A B 1
#6 D F 1
DATA
df1 = structure(list(OBJECT.ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT.ID",
"TASK"), class = "data.frame", row.names = c(NA, -9L))
Solution using data.table:
Code:
library(data.table)
setDT(df)
df[, TASK0 := shift(TASK), OBJECT]
df[!is.na(TASK0), .N, .(TASK, TASK0)][, .(
COUNT = sum(N)), .(PREDECESSOR = TASK0, SUCCESSOR = TASK)]
Result:
PREDECESSOR SUCCESSOR COUNT
1: A C 1
2: C D 2
3: D E 1
4: A B 1
5: B C 1
6: D F 1
Explanation:
setDT(df): turns data.frame into a data.table object
[, TASK0 := shift(TASK), OBJECT]: gets previous letter for each OBJECT
!is.na(TASK0): gets rid of first row for each OBJECT (they don't have PREDECESSOR)
.N, .(TASK, TASK0): counts occurrences of TASK and TASK0 (previous letter combinations)
sum(N): sums counts
Data (df):
structure(list(OBJECT = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT",
"TASK"), row.names = c(NA, -9L), class = c("data.table", "data.frame"
))
To get just the counts, the following two lines suffice:
cc <- cbind(df$TASK, c(df$TASK[-1], "LAST"))
table(paste(cc[, 1], cc[, 2], sep = "-"))
The result is
A-B A-C B-C C-D D-E D-F E-A F-LAST
1 1 1 2 1 1 1 1
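Both this table() approach and the first aggregate() call pair the last task of one object with the first task of the next (the E-A entry). A base R sketch that drops those cross-object pairs, assuming the rows of each object are contiguous and already in task order; the names same_obj, pairs and counts are illustrative:

```r
df1 <- data.frame(OBJECT.ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                  TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F"),
                  stringsAsFactors = FALSE)

# TRUE where a row and the row after it belong to the same object
same_obj <- head(df1$OBJECT.ID, -1) == tail(df1$OBJECT.ID, -1)

# Consecutive (predecessor, successor) pairs, restricted to within-object pairs
pairs <- data.frame(PREDECESSOR = head(df1$TASK, -1)[same_obj],
                    SUCCESSOR   = tail(df1$TASK, -1)[same_obj],
                    stringsAsFactors = FALSE)

# Cross-tabulate, reshape to long format, and drop combinations never observed
counts <- as.data.frame(table(pairs), responseName = "COUNT",
                        stringsAsFactors = FALSE)
counts <- counts[counts$COUNT > 0, ]
print(counts)
```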

How to append group row into dataframe

I have this df1:
A B C
1 2 3
5 7 9
where A B C are columns names.
I have another df2 with one column:
A
1
2
3
4
I would like to append df2 for each column of df1, creating this final dataframe:
A B C
1 2 3
5 7 9
1 1 1
2 2 2
3 3 3
4 4 4
is it possible to do it?
data.frame(sapply(df1, c, unlist(df2)), row.names = NULL)
# A B C
#1 1 2 3
#2 5 7 9
#3 1 1 1
#4 2 2 2
#5 3 3 3
#6 4 4 4
DATA
df1 = structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
df2 = structure(list(A = 1:4), .Names = "A", class = "data.frame", row.names = c(NA,
-4L))
We can replicate df2 for the number of columns of df1, unname it, then rbind it.
rbind(df1, unname(rep(df2, ncol(df1))))
# A B C
# 1 1 2 3
# 2 5 7 9
# 3 1 1 1
# 4 2 2 2
# 5 3 3 3
# 6 4 4 4
Data:
df1 <- structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(A = 1:4), .Names = "A", row.names = c(NA, -4L), class = "data.frame")
We can use base R methods
rbind(df1, setNames(as.data.frame(do.call(cbind, rep(list(df2$A), 3))), names(df1)))
# A B C
#1 1 2 3
#2 5 7 9
#3 1 1 1
#4 2 2 2
#5 3 3 3
#6 4 4 4
data
df1 <- structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(A = 1:4), .Names = "A", class = "data.frame",
row.names = c(NA, -4L))
Here is a base R method with rbind, rep, and setNames:
rbind(dat, setNames(data.frame(rep(dat1, ncol(dat))), names(dat)))
A B C
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
Edit: turns out data.frame isn't necessary:
rbind(dat, setNames(rep(dat1, ncol(dat)), names(dat)))
will work.
data
dat <-
structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
dat1 <-
structure(list(A = 1:4), .Names = "A", row.names = c(NA, -4L),
class = "data.frame")
I just love R, here is yet another Base R solution but with mapply:
data.frame(mapply(c, df1, df2))
Result:
A B C
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
Note:
No need to deal with colnames, as almost all the other solutions do. The key to why this works is that "mapply calls FUN for the values of ... [each element] (re-cycled to the length of the longest...[element])" (see ?mapply). In other words, df2$A is recycled to however many columns df1 has.
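To make the recycling concrete: a data frame is a list of columns, so df1 has length 3 and df2 has length 1, and mapply() reuses the single column of df2 for each column of df1. One caveat worth keeping in mind is that mapply() simplifies its result to a matrix here, so columns of mixed types would be coerced to a common type, whereas an lapply() form keeps each column's own type. A short sketch (res_mapply and res_lapply are illustrative names):

```r
df1 <- data.frame(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L))
df2 <- data.frame(A = 1:4)

# df1 is a list of 3 columns, df2 a list of 1 column
length(df1)
length(df2)

# mapply() pairs each column of df1 with the (recycled) column of df2
res_mapply <- data.frame(mapply(c, df1, df2))

# Equivalent column-wise append that does not go through a matrix
res_lapply <- data.frame(lapply(df1, c, df2$A))

print(res_mapply)
```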
Data:
df1 = structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
df2 = structure(list(A = 1:4), .Names = "A", row.names = c(NA, -4L), class = "data.frame")
Data:
df1 <- data.frame(A=c(1,5),
B=c(2,7),
C=c(3,9))
df2 <- data.frame(A=c(1,2,3,4))
Solution:
df2 <- matrix(rep(df2$A, ncol(df1)), ncol=ncol(df1))
colnames(df2) <- colnames(df1)
rbind(df1,df2)
Result:
A B C
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
A solution from purrr, which uses map_dfc to loop through all columns in df1 to combine all the elements with df2$A.
library(purrr)
map_dfc(df1, ~c(., df2$A))
# A tibble: 6 x 3
A B C
<int> <int> <int>
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
Data
df1 <- structure(list(A = c(1L, 5L), B = c(2L, 7L), C = c(3L, 9L)), .Names = c("A",
"B", "C"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(A = 1:4), .Names = "A", class = "data.frame",
row.names = c(NA, -4L))
By analogy with @useR's excellent Base R answer, here's a tidyverse solution:
library(purrr)
map2_df(df1, df2, c)
A B C
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
Here are a few other (less desirable) options from when I first answered this question.
library(dplyr)
bind_rows(df1, df2 %>% mutate(B=A, C=A))
Or, if we want to dynamically get the number of columns and their names from df1:
bind_rows(df1,
df2[,rep(1,ncol(df1))] %>% setNames(names(df1)))
And one more Base R method:
rbind(df1, setNames(df2[,rep(1,ncol(df1))], names(df1)))
For the sake of completeness, here is a data.table approach that doesn't require handling column names:
library(data.table)
setDT(df1)[, lapply(.SD, c, df2$A)]
A B C
1: 1 2 3
2: 5 7 9
3: 1 1 1
4: 2 2 2
5: 3 3 3
6: 4 4 4
Note that the OP has described df2 to consist only of one column.
There is also a base R version of this approach:
data.frame(lapply(df1, c, df2$A))
A B C
1 1 2 3
2 5 7 9
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
This is similar to d.b's approach but doesn't require dealing with column names.

Preserving missing fields when row-wise binding elements of a list in R

I have a named list, and I want to bind its elements. I am a big fan of data.table::rbindlist() but it drops the empty elements instead of keeping them as NA entries. Is there any way I can preserve them as NA?
Here's my code:
dput(Result)
structure(list(a = c(1L, 3L), b = c(2L, 4L), c = 4L, d = integer(0),
e = integer(0), f = integer(0)), .Names = c("a", "b", "c",
"d", "e", "f"))
Here's what I tried for data.table
Attempt1 : Using data.table
Result1<-data.table::rbindlist(lapply(Result, as.data.frame),use.names=TRUE, fill=TRUE, idcol="Name")
However, I lost d, e, and f.
Attempt2 : Using dplyr
dplyr::bind_rows(lapply(Result, as.data.frame))
Again, I lost d, e, and f.
Expected Output:
Result1
Name X[[i]]
1: a 1
2: a 3
3: b 2
4: b 4
5: c 4
6: d NA
7: e NA
8: f NA
I'd appreciate any help.
Here you go:
Result = structure(list(a = c(1L, 3L), b = c(2L, 4L), c = 4L, d = integer(0),
e = integer(0), f = integer(0)), .Names = c("a", "b", "c",
"d", "e", "f"))
Result2 = lapply(Result, function(x){
if(length(x)==0){NA}else{x}
})
Result3 = data.table::rbindlist(lapply(Result2,
as.data.frame),use.names=TRUE, fill=TRUE, idcol="Name")
The problem is that integer(0) is not NA, so you must convert them to NA as shown for Result2.
Result:
> Result3
Name X[[i]]
1: a 1
2: a 3
3: b 2
4: b 4
5: c 4
6: d NA
7: e NA
8: f NA
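The reason the empty elements disappear is that a zero-length vector becomes a zero-row data frame, which contributes no rows when the pieces are bound together. A small base R sketch of that, and of the padding step, using NA_integer_ so the value column stays integer (the names padded and out are illustrative):

```r
Result <- list(a = c(1L, 3L), b = c(2L, 4L), c = 4L,
               d = integer(0), e = integer(0), f = integer(0))

# A zero-length element turns into a data frame with no rows
nrow(as.data.frame(Result$d))

# Replace each empty element with a single integer NA
padded <- lapply(Result, function(x) if (length(x) == 0L) NA_integer_ else x)

# One row per value, labelled with the name of the list element it came from
out <- data.frame(Name = rep(names(padded), lengths(padded)),
                  V1 = unlist(padded, use.names = FALSE),
                  stringsAsFactors = FALSE)
print(out)
```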
Replace the zero length elements with NA, then use rbindlist.
Result[!lengths(Result)] <- NA
## or
## is.na(Result) <- !lengths(Result)
rbindlist(lapply(Result, as.data.table), idcol = "Name")
# Name V1
# 1: a 1
# 2: a 3
# 3: b 2
# 4: b 4
# 5: c 4
# 6: d NA
# 7: e NA
# 8: f NA
You could also do this in base R with
is.na(Result) <- !lengths(Result)
data.frame(
Name = rep(names(Result), lengths(Result)),
V1 = unlist(Result, use.names = FALSE)
)
# Name V1
# 1 a 1
# 2 a 3
# 3 b 2
# 4 b 4
# 5 c 4
# 6 d NA
# 7 e NA
# 8 f NA
