Create multiple variables in data.table based on other variables' names [duplicate] - r

I am trying to create a series of variables, c1, c2, and c3, based on the values of two sets of variables: a1, a2, and a3, and b1, b2, and b3. The code below shows a hard-coded solution, but in reality I don't know the total number of sets of variables (say, up to an and bn). As you can see, the names of the c variables depend on the names of the a and b variables.
Is there a way in data.table to do this? I tried to do it by using purrr::map2 within data.table but I could not make it work. I would highly appreciate your help.
Thanks.
library(data.table)
DT <- data.table(
a1 = c(1, 2, 3),
a2 = c(1, 2, 3)*2,
a3 = c(1, 2, 3)*3,
b1 = c(5, 6, 7),
b2 = c(5, 6, 7)*4,
b3 = c(5, 6, 7)*5
)
DT[]
#> a1 a2 a3 b1 b2 b3
#> 1: 1 2 3 5 20 25
#> 2: 2 4 6 6 24 30
#> 3: 3 6 9 7 28 35
DT[,
`:=`(
c1 = a1 + b1,
c2 = a2 + b2,
c3 = a3 + b3
)
]
DT[]
#> a1 a2 a3 b1 b2 b3 c1 c2 c3
#> 1: 1 2 3 5 20 25 6 22 28
#> 2: 2 4 6 6 24 30 8 28 36
#> 3: 3 6 9 7 28 35 10 34 44
Created on 2020-08-26 by the reprex package (v0.3.0)

This first part is mostly defensive, guarding against: a* variables without matching b* variables; vice versa; and different order of each:
anames <- grep("^a[0-9]+$", colnames(DT), value = TRUE)
bnames <- grep("^b[0-9]+$", colnames(DT), value = TRUE)
# keep only the numeric suffixes that occur in both the a* and b* sets
numnames <- intersect(sub("^a", "", anames), sub("^b", "", bnames))
# rebuild the names from the shared suffixes so all three sets are in the same order
anames <- paste0("a", numnames)
bnames <- paste0("b", numnames)
cnames <- paste0("c", numnames)
If you know the number ranges a priori and want something less dynamic but more straightforward, then:
anames <- paste0("a", 1:3)
bnames <- paste0("b", 1:3)
cnames <- paste0("c", 1:3)
Now the magic:
DT[, (cnames) := Map(`+`, mget(anames), mget(bnames)) ]
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
# 1: 1 2 3 5 20 25 6 22 28
# 2: 2 4 6 6 24 30 8 28 36
# 3: 3 6 9 7 28 35 10 34 44
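To see what the defensive name matching buys you, here is a minimal sketch on a made-up table where a3 has no partner and the columns are out of order. DT2, the helper suffixes(), and the vectors common, an, bn and cn are hypothetical names used only for this illustration; the final line is the same Map/mget pattern as above.

```r
library(data.table)

# Hypothetical table: a3 has no matching b3, and the columns are out of order
DT2 <- data.table(
  a1 = 1:3, a3 = 7:9, a2 = 4:6,
  b2 = c(10, 20, 30), b1 = c(5, 6, 7)
)

# Numeric suffixes of the columns that start with a given prefix
suffixes <- function(prefix, nms) {
  hits <- grep(paste0("^", prefix, "[0-9]+$"), nms, value = TRUE)
  sub(paste0("^", prefix), "", hits)
}

# Keep only suffixes present in both sets, ordered numerically
common <- intersect(suffixes("a", names(DT2)), suffixes("b", names(DT2)))
common <- common[order(as.integer(common))]

an <- paste0("a", common)
bn <- paste0("b", common)
cn <- paste0("c", common)

# Same pattern as above: add each a* column to its matching b* column
DT2[, (cn) := Map(`+`, mget(an), mget(bn))]
DT2[]
```

Only c1 and c2 are created here, because a3 has no b3 to pair with.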

You could tackle this by first splitting DT column-wise by the pattern of the names and then aggregating:
# remove the trailing numbers from the column names
(ptn <- sub("\\d+$", "", names(DT)))
# [1] "a" "a" "a" "b" "b" "b"
# get the unique numbers contained in the column names (as strings, but it doesn't matter here)
(nmb <- unique(sub("^\\D+", "", names(DT))))
# [1] "1" "2" "3"
The next step is to split DT and do the aggregation:
DT[, paste0("c", nmb) := do.call(`+`, split.default(DT, f = ptn))]
Result:
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44

We can melt to long format, create the column 'c', dcast back to wide format, and then cbind:
library(data.table)
cbind(DT, dcast(melt(DT, measure = patterns('^a', '^b'))[,
c := value1 + value2], rowid(variable) ~ paste0('c', variable),
value.var = 'c')[, variable := NULL])
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44

A base R option
u <- split.default(DT, gsub("\\D", "", names(DT)))
cbind(DT, do.call(cbind, Map(rowSums, setNames(u, paste0("c", names(u))))))
which gives
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 2 3 5 20 25 6 22 28
2: 2 4 6 6 24 30 8 28 36
3: 3 6 9 7 28 35 10 34 44

Related

Downsample to equalize the counts for pairs of factor levels?

Suppose you have a factor variable whose level labels come in pairs
(such as 'a1' and 'a2', 'b1' and 'b2', etc.), and these pairs have unequal n-sizes.
x <- factor(c(rep("a1", 10), rep("a2", 15),rep("b1", 5), rep("b2", 30),rep("c1", 33), rep("c2", 22)))
> table(x)
a1 a2 b1 b2 c1 c2
10 15 5 30 33 22
Now suppose you want to randomly downsample the larger level of each pair so that
both levels end up with the same count. Here's the desired outcome:
a1 a2 b1 b2 c1 c2
10 10 5 5 22 22
I have found that caret::downSample() can downsample to equalize all the levels of
a factor:
x_ds <- caret::downSample(1:115, x)
table(x_ds$Class)
a1 a2 b1 b2 c1 c2
5 5 5 5 5 5
And I have the notion to use split() in conjunction with downSample(), but I'm having trouble figuring out a way to split on the level pairs. How could this be done?
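One possible way to split on the level pairs, sketched in base R without caret: derive a pair label by stripping the trailing digits from each level name, split the row indices by that label, and sample the smaller count from each level of the pair. The names pair, keep and by_level are hypothetical, and the sketch assumes the pairs can be recognised purely from the level names.

```r
set.seed(1)  # only to make the random draw repeatable

x <- factor(c(rep("a1", 10), rep("a2", 15), rep("b1", 5),
              rep("b2", 30), rep("c1", 33), rep("c2", 22)))

# Pair label = level name without its trailing digits ("a1" -> "a")
pair <- sub("[0-9]+$", "", as.character(x))

keep <- unlist(lapply(split(seq_along(x), pair), function(idx) {
  by_level <- split(idx, droplevels(x[idx]))  # indices per level within the pair
  n <- min(lengths(by_level))                 # size of the smaller level
  # Draw n positions from each level; sample.int avoids sample()'s
  # special case for length-one vectors
  lapply(by_level, function(i) i[sample.int(length(i), n)])
}), use.names = FALSE)

x_ds <- x[sort(keep)]
table(x_ds)
```

The resulting counts are 10, 10, 5, 5, 22, 22, matching the desired outcome; which rows survive depends on the seed.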

How to identify the indexes of pairs of numbers from a data.frame?

I have a large data.frame:
t1 t2 t3 t4 t5 t6 t7 t8
7 15 30 37 4 11 30 37
4 31 44 30 37 39 44 18
3 49 39 34 44 43 26 24
4 31 26 33 12 47 37 15
3 27 34 23 30 30 37 4
9 46 39 34 8 43 26 24
For each row, I would like to identify specific sequences of numbers (e.g. entered by the user) in columns t1 to t8.
A sequence consists of numbers that follow each other in chronological order (time is defined by t1...t8).
Example of a sequence:
30, 37 occurring at [t3, t4] as well as [t7, t8]
As the example shows, I want the indexes of the start and end columns (i.e. times t1...t8) and the number of times the sequence occurs.
Desired input:
Please specify your sequence: 30 37
Desired output:
'The timing of 30 37 is:
[t3] to [t4]
[t7] to [t8]
[t4] to [t5]
My question is how to write a function that identifies the indexes of a specific sequence. Any help is welcome.
Below the code that I want to improve:
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
An alternative solution with plyr package and without do.call:
library(plyr)
obs = read.table(text=
"t1 t2 t3 t4 t5 t6 t7 t8
7 15 30 37 4 11 30 37
4 31 44 30 37 39 44 18
3 49 39 34 44 43 26 24
4 31 26 33 12 47 37 15
3 27 34 23 30 30 37 4
9 46 39 34 8 43 26 24",
header=TRUE)
# Find target in one row
f = function(v, target) {
n = length(v)
m = length(target)
res = {}
for (i in 1:(n-m+1)) {
if (all(target==v[i:(i+m-1)])) res = c(res,i)
}
data.frame(From=res, To=res+m-1)
}
# Find target in all rows
find_matches = function(df, target) {
df$Row = 1:nrow(df)
M = adply(df, 1, f, target=target)
M[, (ncol(M)-2):ncol(M)]
}
# Test
find_matches(obs, c(30,37))
# Row From To
#1 1 3 4
#2 1 7 8
#3 2 4 5
#4 5 6 7
Here is one function that can be helpful. For every row, we paste every element with its next element and check whether the result matches the numbers passed. The function returns a data frame with the row number and the column names where a match is found.
return_match <- function(df, x, y) {
#Paste the numbers to match
concat_str <- paste(x, y, sep = "-")
#For every row in dataframe
do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
#Subset the row
x <- df[i, ]
#Paste every value with its next value and compare it with concat_str
inds = paste(x[-length(x)], x[-1L], sep = "-") == concat_str
if(any(inds)) {
#Get the column numbers to match
row <- which(inds)
#subset the column name and add row number
transform(as.data.frame(t(sapply(row, function(y)
names(df)[c(y, y + 1)]))), row = i)
}
}))
}
return_match(df, 30, 37)
# V1 V2 row
#1 t3 t4 1
#2 t7 t8 1
#3 t4 t5 2
#4 t6 t7 5
return_match(df, 39, 34)
# V1 V2 row
#1 t3 t4 3
#2 t3 t4 6
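The pair matching above can be extended to a target of any length by comparing whole columns at once instead of looping over rows. This is only a sketch: find_seq and its arguments are hypothetical names, it assumes all columns are numeric, and it assumes the target is no longer than the number of columns.

```r
obs <- read.table(text = "
t1 t2 t3 t4 t5 t6 t7 t8
7 15 30 37 4 11 30 37
4 31 44 30 37 39 44 18
3 49 39 34 44 43 26 24
4 31 26 33 12 47 37 15
3 27 34 23 30 30 37 4
9 46 39 34 8 43 26 24", header = TRUE)

find_seq <- function(df, target) {
  m <- as.matrix(df)
  k <- length(target)
  starts <- seq_len(ncol(m) - k + 1)            # candidate start columns
  hit <- matrix(TRUE, nrow(m), length(starts))
  # A start column is a hit if every element of target matches in turn
  for (j in seq_len(k)) {
    hit <- hit & (m[, starts + j - 1, drop = FALSE] == target[j])
  }
  idx <- which(unname(hit), arr.ind = TRUE)     # (row, col) of each hit
  idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]
  data.frame(row  = idx[, "row"],
             from = colnames(m)[idx[, "col"]],
             to   = colnames(m)[idx[, "col"] + k - 1],
             stringsAsFactors = FALSE)
}

res <- find_seq(obs, c(30, 37))
res
```

For the target 30 37 this reports the same four matches as the answers above: row 1 at t3 to t4 and t7 to t8, row 2 at t4 to t5, and row 5 at t6 to t7.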

Use the levels of a dataframe column to add a new column with an incrementing number unique to each level

I'm trying to create a new column in a dataframe that contains an incrementing number based on the levels of a different column. That is, I want to rename the levels of a column so that each level has a unique, incrementing number.
df <- data.frame(y1 = c(100, 100, 100, 200, 200, 500, 500, 500),
y2 = c(6, 5, 4, 2, 5, 4, 3, 2))
df$y1 <- as.factor(df$y1)
levels(df$y1) ## [1] "100" "200" "500"
Expected output: a new y3 column with new level names based on the levels of y1. The "b" isn't necessary, I can add that on later.
y1 y2 y3
100 6 b1
100 5 b1
100 4 b1
200 2 b2
200 5 b2
500 4 b3
500 3 b3
500 2 b3
I've messed around with lapply and various for loops, but I don't really know what I'm doing here... stuff like this:
for (i in levels(df$y1)){
batchnum <- 1
if (i == df$y1){
df$y3 <- paste0("b", batchnum)
batchnum <- batchnum + 1
}
}
This just labels y3 with "b1" for each row, I guess because if is not vectorized or something?
## Warning messages:
1: In if (i == df$y1) { :
the condition has length > 1 and only the first element will be used
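For reference, the loop goes wrong in two places: batchnum is reset to 1 on every iteration, and if() receives a whole logical vector (from R 4.2.0 onward that is an error rather than a warning). A minimal corrected sketch of the same loop, keeping the question's variable names and using lv as a hypothetical loop variable, indexes the rows with the comparison instead:

```r
df <- data.frame(y1 = c(100, 100, 100, 200, 200, 500, 500, 500),
                 y2 = c(6, 5, 4, 2, 5, 4, 3, 2))
df$y1 <- as.factor(df$y1)

df$y3 <- NA_character_
batchnum <- 1                         # initialised once, outside the loop
for (lv in levels(df$y1)) {
  # df$y1 == lv is a logical vector: use it to pick rows, not as an if() condition
  df$y3[df$y1 == lv] <- paste0("b", batchnum)
  batchnum <- batchnum + 1
}
df
```

The vectorized answers below do the same thing without an explicit loop.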
Using data.table:
library(data.table)
setDT(df)
df[, y3 := .GRP, by = y1]
df[, y3 := paste0("b", y3)] # you can change "b" with whatever you want
y1 y2 y3
1: 100 6 b1
2: 100 5 b1
3: 100 4 b1
4: 200 2 b2
5: 200 5 b2
6: 500 4 b3
7: 500 3 b3
8: 500 2 b3
The most direct and simple approach (taking advantage of the fact that as.numeric will generate numbers corresponding to the factor levels):
df$y3 <- paste0('b', as.numeric(df$y1))
If it's not clear why this works, look at the following code on its own:
as.numeric(df$y1)
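A small illustration of the point, using a standalone factor with the same values as y1: converting a factor with as.numeric returns the position of each value's level, not the label itself.

```r
y1 <- factor(c(100, 100, 100, 200, 200, 500, 500, 500))

levels(y1)                    # "100" "200" "500"
as.numeric(y1)                # 1 1 1 2 2 3 3 3  (position of each value's level)
as.numeric(as.character(y1))  # 100 100 100 200 200 500 500 500  (the labels)
```

Note that the numbering follows the order of the levels, which is not necessarily the order in which the values first appear in the column.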
A dplyr approach:
library(dplyr)
df %>% mutate(y3 = paste0("b", as.numeric(y1)))
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
Or you can also do:
df %>% mutate(y3 = paste0("b", cumsum(!duplicated(y1))))
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
Here's one way:
x <- c(100,100,100,200,200,500,500,500)
paste0("b",rep(seq_along(table(x)),table(x)))
[1] "b1" "b1" "b1" "b2" "b2" "b3" "b3" "b3"
One can use the group_indices function from dplyr to create the new column:
library(dplyr)
df %>% mutate(y3 = paste0("b", group_indices(.,y1)))
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3

Multiply columns in a data frame by a vector

What I want to do is multiply all the values in column 1 of a data.frame by the first element in a vector, then multiply all the values in column 2 by the 2nd element in the vector, etc...
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
c1 c2 c3
1 1 4 7
2 2 5 8
3 3 6 9
v1 <- c(1,2,3)
So the result is this:
c1 c2 c3
1 1 8 21
2 2 10 24
3 3 12 27
I can do this one column at a time but what if I have 100 columns? I want to be able to do this programmatically.
You can simply diagonalize the vector, so that every entry in column j is multiplied by the j-th element of v1:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- as.matrix(cbind(c1,c2,c3))
v1 <- c(1,2,3)
d1%*%diag(v1)
[,1] [,2] [,3]
[1,] 1 8 21
[2,] 2 10 24
[3,] 3 12 27
Transposing the dataframe works.
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
v1 <- c(1,2,3)
t(t(d1)*v1)
# c1 c2 c3
#[1,] 1 8 21
#[2,] 2 10 24
#[3,] 3 12 27
EDIT: If not all columns are numeric, you can do the following
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
# Adding a column of characters for demonstration
d1$c4 <- c("rr", "t", "s")
v1 <- c(1,2,3)
#Choosing only numeric columns
index <- which(sapply(d1, is.numeric) == TRUE)
d1_mat <- as.matrix(d1[,index])
d1[,index] <- t(t(d1_mat)*v1)
d1
# c1 c2 c3 c4
#1 1 8 21 rr
#2 2 10 24 t
#3 3 12 27 s
We can also replicate the vector to make the lengths equal and then multiply
d1*v1[col(d1)]
# c1 c2 c3
#1 1 8 21
#2 2 10 24
#3 3 12 27
Or use sweep
sweep(d1, 2, v1, FUN="*")
Or with mapply to multiply the corresponding columns of 'data.frame' and elements of 'vector'
mapply(`*`, d1, v1)
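The last two options differ in what they return, which matters if you need a data.frame back. The sketch below uses the question's d1 and v1; by_sweep, by_mapply and by_map are hypothetical names, and the Map variant is an extra option not shown above that writes the result back into a copy of the data.frame.

```r
d1 <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
v1 <- c(1, 2, 3)

by_sweep  <- sweep(d1, 2, v1, FUN = "*")  # stays a data.frame
by_mapply <- mapply(`*`, d1, v1)          # simplified to a numeric matrix

by_map   <- d1
by_map[] <- Map(`*`, d1, v1)              # column-wise, keeps the data.frame

class(by_sweep)
class(by_mapply)
by_map
```

All three hold the same numbers; only the container differs.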

How to get a numbered list renumbering when a value changes

I have 2 lists of numbers (col1 & col2) below.
I'd like to add 2 columns (col3 & col4) that do the following.
col3 numbers the rows, restarting at 1 every time col2 changes (e.g. from b2 to b3).
col4 is TRUE on the last occurrence of each value in col2.
The data is sorted by col1, then col2, to begin with.
Note: values in col2 can occur for different values of col1 (i.e. I can have b1 for every value of col1: a, b, c).
I can get this working fine for ~5000 rows (~6 sec), but when scaling to ~1 million rows it hangs.
Here is my code
df$col3 <- 0
df$col4 <- FALSE
stopHere <- nrow(df)
c1 <- 'xxx'
c2 <- 'xxx'
for (i in 1:stopHere) {
if (df[i, "col1"] != c1) {
c2 <- 0
c3 <- 1
c1 <- df[i, "col1"]
}
if (df[i, "col2"] != c2) {
df[i - 1, "col4"] <- TRUE
c3 <- 1
c2 <- df[i, "col2"]
}
df[i, "col3"] <- c3
c3 <- c3 + 1
}
This is my desired output.
1 a b1 1 FALSE
2 a b1 2 FALSE
3 a b1 3 TRUE
4 a b2 1 FALSE
5 a b2 2 TRUE
6 a b3 1 FALSE
7 a b3 2 FALSE
8 a b3 3 FALSE
9 a b3 4 FALSE
10 a b3 5 TRUE
11 b b1 1 FALSE
12 b b1 2 FALSE
13 b b1 3 FALSE
14 b b1 4 TRUE
15 b b2 1 FALSE
16 b b2 2 FALSE
17 b b2 3 FALSE
18 b b2 4 TRUE
19 c b1 1 TRUE
20 c b2 1 FALSE
21 c b2 2 FALSE
22 c b2 3 TRUE
23 c b3 1 FALSE
24 c b3 2 TRUE
25 c b4 1 FALSE
26 c b4 2 FALSE
27 c b4 3 FALSE
28 c b4 4 FALSE
Here is a vectorized solution that works for your sample data:
dat <- data.frame(
V1 = rep(letters[1:3], c(10, 8, 10)),
V2 = rep(paste("b", c(1:3, 1:2, 1:4) ,sep=""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)
Create columns 3 and 4
zz <- rle(as.character(dat$V2))$lengths
dat$V3 <- sequence(zz)
dat$V4 <- FALSE
dat$V4[head(cumsum(zz), -1)] <- TRUE
The results:
dat
V1 V2 V3 V4
1 a b1 1 FALSE
2 a b1 2 FALSE
3 a b1 3 TRUE
4 a b2 1 FALSE
5 a b2 2 TRUE
6 a b3 1 FALSE
7 a b3 2 FALSE
8 a b3 3 FALSE
9 a b3 4 FALSE
10 a b3 5 TRUE
11 b b1 1 FALSE
12 b b1 2 FALSE
13 b b1 3 FALSE
14 b b1 4 TRUE
15 b b2 1 FALSE
16 b b2 2 FALSE
17 b b2 3 FALSE
18 b b2 4 TRUE
19 c b1 1 TRUE
20 c b2 1 FALSE
21 c b2 2 FALSE
22 c b2 3 TRUE
23 c b3 1 FALSE
24 c b3 2 TRUE
25 c b4 1 FALSE
26 c b4 2 FALSE
27 c b4 3 FALSE
28 c b4 4 FALSE
Some example data would be helpful. Nevertheless, this should be a good place to start. With 3 unique values in col1 and 4 in col2, it only takes a second for 10^6 rows:
n = 10^6
col1 = sample(c('a', 'b', 'c'), n, replace=TRUE)
col2 = sample(paste('b', 1:4, sep=''), n, replace=TRUE)
data = data.frame(col1, col2, col3=0, col4=FALSE)
data = data[do.call(order, data), ]
# number the rows within each col1/col2 group; t() puts the groups in the sorted row order
data$col3 = unlist(t(tapply(seq_len(nrow(data)), data[,1:2], seq_along)))
# a group ends wherever the counter fails to increase (<= 0 also catches groups of length 1)
data$col4[c(diff(data$col3), -1) <= 0] = TRUE
First, make your starting data reproducible, and make col1 and col2 columns in a dataframe.
dat <- read.table(textConnection(
"a b1
a b1
a b1
a b2
a b2
a b3
a b3
a b3
a b3
a b3
b b1
b b1
b b1
b b1
b b2
b b2
b b2
b b2
c b1
c b2
c b2
c b2
c b3
c b3
c b4
c b4
c b4
c b4"), stringsAsFactors=FALSE)
names(dat) <- c("col1", "col2")
Run length encoding gives the lengths of your sequences, since everything is starting out sorted.
runs <- rle(dat$col2)
Now manipulate that info. For each element in the lengths component, create a sequence of that length and put them all together. The indices of the TRUE values for col4 can be obtained from the cumsum of the lengths.
dat$col3 <- unlist(sapply(runs$lengths, function(l) seq(length.out=l)))
dat$col4 <- FALSE
dat$col4[cumsum(runs$lengths)] <- TRUE
For the result:
> dat
col1 col2 col3 col4
1 a b1 1 FALSE
2 a b1 2 FALSE
3 a b1 3 TRUE
4 a b2 1 FALSE
5 a b2 2 TRUE
6 a b3 1 FALSE
7 a b3 2 FALSE
8 a b3 3 FALSE
9 a b3 4 FALSE
10 a b3 5 TRUE
11 b b1 1 FALSE
12 b b1 2 FALSE
13 b b1 3 FALSE
14 b b1 4 TRUE
15 b b2 1 FALSE
16 b b2 2 FALSE
17 b b2 3 FALSE
18 b b2 4 TRUE
19 c b1 1 TRUE
20 c b2 1 FALSE
21 c b2 2 FALSE
22 c b2 3 TRUE
23 c b3 1 FALSE
24 c b3 2 TRUE
25 c b4 1 FALSE
26 c b4 2 FALSE
27 c b4 3 FALSE
28 c b4 4 TRUE
Note that the last line has col4 TRUE, which matches your written description (last of a set is TRUE), but does not match your example output. I don't know which you want.
This solution needs no loops, rle, or other clever functions; just the merge and aggregate functions.
Preparing your data (used Andrie's code) first:
df <- data.frame(
x = rep(letters[1:3], c(10, 8, 10)),
y = rep(paste("b", c(1:3, 1:2, 1:4) ,sep=""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)
The solution:
minmax <- with(df, merge(
aggregate(seq(x), by = list(x = x, y = y), min),
aggregate(seq(x), by = list(x = x, y = y), max)
))
names(minmax)[3:4] = c("min", "max") # unique pairs with min/max global order
result <- with(merge(df, minmax),
data.frame(x, y, count = seq(x) - min + 1, last = seq(x) == max))
This solution assumes that the input is sorted as you said, but can be easily modified to work on unsorted tables (and keep them unsorted).
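The rle-based answers look at col2 alone, so they rely on one col1 block never ending with the same col2 value that the next block starts with. A hedged alternative sketch that groups on both columns explicitly, using base R's ave and duplicated, is shown below. It follows the written description, so the last row is marked TRUE, unlike the example output in the question.

```r
dat <- data.frame(
  col1 = rep(letters[1:3], c(10, 8, 10)),
  col2 = rep(paste0("b", c(1:3, 1:2, 1:4)), c(3, 2, 5, 4, 4, 1, 3, 2, 4)),
  stringsAsFactors = FALSE
)

# Running count within each col1/col2 combination
dat$col3 <- ave(seq_len(nrow(dat)), dat$col1, dat$col2, FUN = seq_along)

# A row is the last of its group if no later row has the same col1/col2 pair
dat$col4 <- !duplicated(dat[c("col1", "col2")], fromLast = TRUE)

head(dat, 6)
```

Both steps are vectorized, so they should scale far better than the row-by-row loop in the question, though no timing was measured here.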
