Select multiple ranges of columns using column names in data.table - r

Let say I have a data table,
dt = data.table(matrix(1:50, nrow = 5));
colnames(dt) = letters[1:10];
> dt
a b c d e f g h i j
1: 1 6 11 16 21 26 31 36 41 46
2: 2 7 12 17 22 27 32 37 42 47
3: 3 8 13 18 23 28 33 38 43 48
4: 4 9 14 19 24 29 34 39 44 49
5: 5 10 15 20 25 30 35 40 45 50
I want to select several discontinuous ranges of columns like: a, c:d, f:h and j. This can be done easily via dplyr's select():
dt %>% select(a, c:d, f:h, j)
I am looking for a data.table way of achieving the same.
Right now, I can either select columns individually in any order: dt[ , .(a, c)] or giving just one sequence of column names on the form startcol:endcol:
dt[ , c:f]
However, I can't combine the above two methods to select several column ranges in one shot in .SDcols, like I did in dplyr::select

We can use the range part in .SDcols and then append the other column by concatenating
dt[, c(list(a= a), .SD) , .SDcols = c:d]
If there are multiple ranges, we create a sequence of ranges by match, and then get the corresponding column names
i1 <- match(c("c", "f"), names(dt))
j1 <- match(c("d", "h"), names(dt))
nm1 <- c("a", names(dt)[unlist(Map(`:`, i1, j1))], "j")
dt[, ..nm1]
# a c d f g h j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50
Also, the dplyr methods can be used within the data.table
dt[, select(.SD, a, c:d, f:h, j)]
# a c d f g h j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50

Here is a workaround with cbind and two or more selections.
cbind(dt[, .(a)], dt[, c:d])
# a c d
# 1: 1 11 16
# 2: 2 12 17
# 3: 3 13 18
# 4: 4 14 19
# 5: 5 15 20

Related

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How to I subset the first data frame so that only rows where a, b, and c are all greater than the value in the second data frame for each class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame is the lowest acceptable value for the the first data frame.
As commented by Frank this can be done with non-equi joins.
# coerce to data.table
tmp <- setDT(df1)[
# non-equi join to find which rows of df1 fulfill conditions in df2
setDT(df2), on = .(class, a > a, b > b, c > c), rn, nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to #Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]

How to properly combined columns into one column using R

I have 3 sets of data. Each one is a column of variables:
A B C
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
I have been trying to use rbind to combine these three data sets into one dataset with one column.
Combine<-rbind(A,B,C)
Instead I get something this, where not only do I end up with a series of shorter columns, the numbers all change. How do I stop this from happening?
V1 V2 V3 V4
14 9 9 5
19 15 14 5
# example data frames
dt1 = data.frame(A = 1:5)
dt2 = data.frame(B = 3:10)
dt3 = data.frame(C = 5:7)
# change to a common column name
names(dt1) = "x"
names(dt2) = "x"
names(dt3) = "x"
# bind rows
rbind(dt1, dt2, dt3)
# x
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 3
# 7 4
# 8 5
# 9 6
# 10 7
# 11 8
# 12 9
# 13 10
# 14 5
# 15 6
# 16 7

Selecting rows based on index positions, when the index positions are passed through variables in data.table R

I can select one column by index position in data.table by passing the index position through a variable like this:
DT <- data.table(a = 1:6, b=10:15, c=20:25, d=30:35, e = 40:45)
i <- 1
j <- 5
DT[, ..i]
But how can I select columns i : i+2 and j in one line of code using data.table syntax?
Your advice will be appreciated.
If you don't want to use lukeA's approach using the with = FALSE parameter you have other choices as well:
DT[, .SD, .SDcols = c(i:(i+2), j)]
# a b c e
#1: 1 10 20 40
#2: 2 11 21 41
#3: 3 12 22 42
#4: 4 13 23 43
#5: 5 14 24 44
#6: 6 15 25 45
Note the parantheses around (i+2) because the colon operator takes precendence.
This one is a modification of OP's code and not exactly a one-liner as requested:
icol <- c(i:(i+2), j); DT[, ..icol]
a b c e
1: 1 10 20 40
2: 2 11 21 41
3: 3 12 22 42
4: 4 13 23 43
5: 5 14 24 44
6: 6 15 25 45

How to join tables and generate sums of columns?

I have several tables( two in particular example) with the same structure. I would like to join on ID_Position & ID_Name and generate the sum of January and February in the output table (There might be some NAs in both columns)
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,23,2,22,24,26,28)
feb<-c(10,30,20,12,NA,3,NA,22,24,26)
df1 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,NA,NA,22,24,26,28)
feb<-c(10,30,20,12,23,3,3,22,24,26)
df2 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
I tried the inner and the full join. But that seems to work as I desire:
library(plyr)
test<-join(df1, df2, by =c("ID_Position","ID_Name") , type = "inner", match = "all")
Desired output:
ID_Position Position ID_Name Name jan feb
1 A 11 Michael 20 20
2 B 12 Tobi 40 60
3 C 13 Chris 60 40
4 D 14 Hans 44 24
5 E 15 Likas 23 23
6 H 16 Martin 2 6
7 I 17 Seba 44 22
8 J 18 Li 48 44
9 X 19 Sha 52 48
10 W 20 Susi 56 52
Your desired output doesn't seem entirely correct, but here's an example of how you can do this efficiently using data.table binary join which allows you to efficiently run functions while joining using the by = .EACHI option
library(data.table)
setkey(setDT(df1), ID_Position, ID_Name, Name)
setkey(setDT(df2), ID_Position, ID_Name, Name)
df2[df1, .(jan = sum(jan, i.jan, na.rm = TRUE),
feb = sum(feb, i.feb, na.rm = TRUE)),
by = .EACHI]
# ID_Position ID_Name Name jan feb
# 1: 1 11 Michael 20 20
# 2: 2 12 Tobi 40 60
# 3: 3 13 Chris 60 40
# 4: 4 14 Hans 44 24
# 5: 5 15 Likas 46 0
# 6: 6 16 Martin 0 6
# 7: 7 17 Seba 44 0
# 8: 8 18 Li 48 44
# 9: 9 19 Sha 52 48
# 10: 10 20 Susi 56 52

Enumerate instances of a factor level

I have a data frame with 150000 lines in long format with multiple occurences of the same id variable. I'm using reshape (from stat, rather than package=reshape(2)) to convert this to wide format. I am generating a variable to count each occurence of a given level of id to use as an index.
I've got this working with a small dataframe using plyr, but it is far too slow for my full df. Can I programme this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would want to ultimately convert your dataset from wide to long, because to me, that seems like it would be an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
v = runif(150000, 0, 10),
w = runif(150000, 0, 100),
x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
count=rank(u, ties.method="first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and it's speed is blazing!
Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.
g <- df$a
x <- t(as.matrix(df[,-1]))
k <- split(seq_along(g), g)
n <- max(sapply(k, length))
out <- matrix(ncol=n*nrow(x), nrow=length(k))
for(idx in seq_along(k)) {
out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[,k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each=nrow(x)), sep=".")
out
# b.1 c.1 d.1 b.2 c.2 d.2 b.3 c.3 d.3 b.4 c.4 d.4 b.5 c.5 d.5 b.6 c.6 d.6
# a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
# b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
# c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
# d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA

Resources