How to reorder data.table columns (without copying) - r

I'd like to reorder columns in my data.table x, given a character vector of column names, neworder:
library(data.table)
x <- data.table(a = 1:3, b = 3:1, c = runif(3))
neworder <- c("c", "b", "a")
Obviously I could do:
x[ , neworder, with = FALSE]
# or
x[ , ..neworder]
# c b a
# 1: 0.8476623 3 1
# 2: 0.4787768 2 2
# 3: 0.3570803 1 3
but that would require copying the entire dataset again. Is there another way to do this?

Use setcolorder():
library(data.table)
x <- data.table(a = 1:3, b = 3:1, c = runif(3))
x
# a b c
# [1,] 1 3 0.2880365
# [2,] 2 2 0.7785115
# [3,] 3 1 0.3297416
setcolorder(x, c("c", "b", "a"))
x
# c b a
# [1,] 0.2880365 3 1
# [2,] 0.7785115 2 2
# [3,] 0.3297416 1 3
From ?setcolorder:
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column.
So it should be pretty efficient. See ?setcolorder for details.
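One way to check the by-reference claim yourself is to compare memory addresses before and after the call. This is only a sketch: it uses data.table's address() helper, the names addr_tbl and addr_col are made up for illustration, and fixed values replace runif() so the result is reproducible.

```r
library(data.table)

x <- data.table(a = 1:3, b = 3:1, c = c(0.1, 0.2, 0.3))

# Record where the table and one of its columns live in memory
addr_tbl <- address(x)
addr_col <- address(x$c)

setcolorder(x, c("c", "b", "a"))

# Same table object and same column vector: only the column order changed
same_tbl <- identical(address(x), addr_tbl)
same_col <- identical(address(x$c), addr_col)

# For contrast, subsetting with ..neworder builds a new table
neworder <- c("a", "b", "c")
y <- x[, ..neworder]
copied <- !identical(address(y), address(x))
```

If same_tbl and same_col are both TRUE, the reorder touched neither the table object nor the column data.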

setcolorder() also accepts column positions instead of names, which some may find easier. For example:
library(data.table)
x <- data.table(a = 1:3, b = 3:1, c = runif(3))
x
# a b c
# [1,] 1 3 0.2880365
# [2,] 2 2 0.7785115
# [3,] 3 1 0.3297416
setcolorder(x, c(3, 2, 1))
x
# c b a
# [1,] 0.2880365 3 1
# [2,] 0.7785115 2 2
# [3,] 0.3297416 1 3

Related

R: reorder a data frame with groups while preserving order within groups

R coders! I have a data frame, plan, with two columns. One column has group labels, lab, and the other, tr has only two distinct values in it.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
> plan
lab tr
1 a 1
2 a 2
3 a 2
4 a 1
5 b 1
6 b 2
7 b 1
8 b 2
I have another vector, order_new, which is a reordered version of lab.
order_new <- lab[sample(1:8)]
> order_new
[1] "b" "b" "a" "a" "b" "a" "b" "a"
I want to reorder the data frame above so the tr values are sorted in the order given by order_new but with the order within the original lab groups preserved. The result I want is:
plan_new <- data.frame(order_new = order_new, tr = c(1, 2, 1, 2, 1, 2, 2, 1))
> plan_new
order_new tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
The first row in the new data frame is a "b" value and so takes the first "b" value in the original data frame. Row 2, also a "b", takes the second "b" value in the original. The third row, an "a", takes the first "a" value in the original etc.
I can't find anything close enough in past answers to work this out and am really looking forward to someone helping me out with this!
If you don't mind a loop
order_new=c("b", "b", "a", "a", "b", "a", "b", "a")
tmp=split(plan$tr,plan$lab)
res=list()
for (x in 1:length(order_new)) {
res[[x]]=tmp[[order_new[x]]][1]
tmp[[order_new[x]]]=tail(tmp[[order_new[x]]],-1)
}
data.frame(
"lab"=order_new,
"tr"=unlist(res)
)
lab tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
Here is a data.table approach. It can easily be adapted into a dplyr or base R solution following the same logic.
I included all intermediate results to show what each line does.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
#hard coded, since sample is not reproducible without set.seed()
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")
library( data.table )
#make plan a data.table
setDT(plan)
#set row_id's by group (lab)
plan[, row_id := rowid( lab ) ]
# lab tr row_id
# 1: a 1 1
# 2: a 2 2
# 3: a 2 3
# 4: a 1 4
# 5: b 1 1
# 6: b 2 2
# 7: b 1 3
# 8: b 2 4
#make a new data.table for the new ordering
plan_new <- data.table( order_new = order_new )
#also add rownumbers by group
plan_new[, row_id := rowid( order_new ) ][]
# order_new row_id
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 3
# 6: a 3
# 7: b 4
# 8: a 4
#now join the tr-value from data.table 'plan' to 'plan_new', based on the rowid
plan_new[ plan, tr := i.tr, on = .(order_new = lab, row_id) ]
# order_new row_id tr
# 1: b 1 1
# 2: b 2 2
# 3: a 1 1
# 4: a 2 2
# 5: b 3 1
# 6: a 3 2
# 7: b 4 2
# 8: a 4 1
#drop the row_id column if needed
plan_new[, row_id := NULL ][]
# order_new tr
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 1
# 6: a 2
# 7: b 2
# 8: a 1
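For anyone who prefers base R, the same "number the rows within each group, then match on group plus number" logic can be sketched as follows. The helper names occ_old, occ_new, idx and tr_alt are made up for this example; unsplit() is shown at the end as a shorter route that should give the same vector.

```r
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")

# Occurrence number of each row within its group (1st "a", 2nd "a", ...)
occ_old <- ave(seq_along(plan$lab), plan$lab, FUN = seq_along)
occ_new <- ave(seq_along(order_new), order_new, FUN = seq_along)

# Find the k-th "b" of order_new among the rows of plan, and so on
idx <- match(paste(order_new, occ_new), paste(plan$lab, occ_old))
plan_new <- data.frame(order_new = order_new, tr = plan$tr[idx])

# unsplit() puts each group's values back at that group's positions in order_new
tr_alt <- unsplit(split(plan$tr, plan$lab), order_new)
```

Both plan_new$tr and tr_alt should reproduce the tr column the question asks for.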

How to make this code more efficient in R?

I know this is a stupid question, but I'm kinda frustrated with my code because it takes so much time. Here is one part of my code.
basically I have a matrix called "distance"...
a b c
1 2 5 7
2 6 8 4
3 9 2 3
and then let's say I have a column in a data frame that contains values from {a,b,c}
c1 c2 c3
c ... ...
a
a just another column
b
c ... ...
I want to do a lookup: build another matrix with ncol = nrow(distance) and nrow = the number of rows in c1, where each factor value is replaced by its distance value. Here's an example for the first column of the matrix I'm going to make:
a will be replaced by 2
b will be replaced by 5
c will be replaced by 7
For the second column I take row 2 of the distance matrix, and so on, so the result looks like this:
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just an easy example, and I'm running this code, but when it deals with large iterations, it's kinda stressful for me.
for(l in 1:ncol(d.cat)){
get.unique = sort(unique(d.cat[, l]))
for(j in 1:nrow(d.cat)){
value = as.character(d.cat[j, l])
index = which(get.unique == value)
d2[j,l] = (d[[l]][i, index])
}
}
d.cat is the categorical data, and d[[...]] is the list of distance matrices, one for every column in d.cat.
Try to store the indices and do the updating in one go. Let's say your distance matrix is dmat, your data frame is df, and you want to create a matrix named newmat:
a.ind = which(df$c1=="a")
b.ind = which(df$c1=="b")
c.ind = which(df$c1=="c")
newmat = matrix(0,nrow=length(df$c1),ncol=3)
# rep(..., each = n) repeats each value n times, so every selected row gets the whole column
newmat[a.ind,] = rep(dmat[,1], each=length(a.ind))
newmat[b.ind,] = rep(dmat[,2], each=length(b.ind))
newmat[c.ind,] = rep(dmat[,3], each=length(c.ind))
Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
[,1] [,2] [,3]
a 1 2 3
c 7 8 9
b 4 5 6
c 7 8 9
c 7 8 9
a 1 2 3
b 4 5 6
c 7 8 9
b 4 5 6
b 4 5 6
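Applied to the small example from the question, the same match() idea can be sketched like this. The c1 vector is typed in by hand from the question's table, and the m1..m3 column names are just labels copied from the desired output.

```r
# Distance matrix from the question: rows 1..3, columns a, b, c
distance <- matrix(c(2, 6, 9, 5, 8, 2, 7, 4, 3), nrow = 3,
                   dimnames = list(NULL, c("a", "b", "c")))
c1 <- c("c", "a", "a", "b", "c")

# Pick one column of `distance` per element of c1, then transpose so that
# each element of c1 becomes a row of the result
m <- t(distance[, match(c1, colnames(distance))])
colnames(m) <- paste0("m", 1:3)
```

The rows of m should line up with the m1, m2, m3 table in the question, with no explicit loop over rows.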
Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace=TRUE), nrow=4)) %>%
setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate through columns of df and columns of tmat
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.
Here is my attempt using the tidyverse :
library(tidyverse)
# Let's create some example data
distance <- tibble(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- tibble(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))
# First rearrange your data a little to make it tidier
distance2 <- distance %>%
mutate(i = seq_len(n())) %>%
gather(col, value, -i)
c2 <- c1 %>%
mutate(i = seq_len(n())) %>%
gather(col, value, -i)
# Now just join the data and spread it again
c2 %>%
left_join(distance2, by = c("i", "value" = "col")) %>%
select(i, col, value.y) %>%
spread(col, value.y)

Convert a matrix to columns

Suppose I have a symmetric matrix like the one below: the values above and below the diagonal are the same. In other words, element [1,2] and element [2,1] are both 2.
> m = cbind(c(1,2,3),c(2,4,5),c(3,5,6))
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 5
[3,] 3 5 6
I also have real names for 1, 2, and 3:
> Real_name
A B C # A represents 1, B represents 2, and C represents 3.
I would like to convert the matrix to 3 columns containing the corresponding real names for each pair. Each pair must be unique: A x B is the same as B x A, so we keep only A x B. How can I achieve this in R?
A A 1
A B 2
A C 3
B B 4
B C 5
C C 6
The following is straightforward:
m <- cbind(c(1,2,3), c(2,4,5), c(3,5,6))
## read `?lower.tri` and try `v <- lower.tri(m, diag = TRUE)` to see what `v` is
## read `?which` and try `which(v, arr.ind = TRUE)` to see what it gives
ij <- which(lower.tri(m, diag = TRUE), arr.ind = TRUE)
Real_name <- LETTERS[1:3]
data.frame(row = Real_name[ij[, 1]], col = Real_name[ij[, 2]], val = c(m[ij]))
# row col val
#1 A A 1
#2 B A 2
#3 C A 3
#4 B B 4
#5 C B 5
#6 C C 6
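The output above lists the pairs as B x A rather than A x B. If the orientation matters, a small variation is to take the upper triangle instead and sort the result. This is a sketch of that variation, not part of the original answer; res is just an illustrative name.

```r
m <- cbind(c(1, 2, 3), c(2, 4, 5), c(3, 5, 6))
Real_name <- LETTERS[1:3]

# Row/column indices of the upper triangle, diagonal included
ij <- which(upper.tri(m, diag = TRUE), arr.ind = TRUE)

res <- data.frame(row = Real_name[ij[, 1]],
                  col = Real_name[ij[, 2]],
                  val = m[ij])

# which() walks the matrix column by column, so sort to get A-A, A-B, A-C, ...
res <- res[order(res$row, res$col), ]
rownames(res) <- NULL
```

Since the matrix is symmetric, the values are the same as with lower.tri(); only the row/col labels are swapped.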
colnames(m) <- c("A", "B", "C")
rownames(m) <- c("A", "B", "C")
m[lower.tri(m)] = NA # replace lower triangular elements with NA
reshape2::melt(m, na.rm = TRUE) # melt and remove NA (the matrix method lives in reshape2)
# Var1 Var2 value
#1 A A 1
#4 A B 2
#5 B B 4
#7 A C 3
#8 B C 5
#9 C C 6
Or you can do it in a single line: melt(replace(m, lower.tri(m), NA), na.rm = TRUE)
This will also work:
g <- expand.grid(1:ncol(m), 1:ncol(m))
g <- g[g[,2]>=g[,1],]
cbind.data.frame(sapply(g, function(x) Real_name[x]), Val=m[as.matrix(g)])
Var1 Var2 Val
1 A A 1
2 A B 2
3 B B 4
4 A C 3
5 B C 5
6 C C 6

how can i melt a data.table with concatenated column names

I'm using dcast.data.table to convert a long data.table to a wide data.table
library(data.table)
library(reshape2)
set.seed(1234)
dt.base <- data.table(A = rep(c(1:3),2), B = rep(c(1:2),3), C=c(1:4,1,2),thevalue=rnorm(6))
#from long to wide using dcast.data.table()
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum)
#now some stuff happens e.g., please do not bother what happens between dcast and melt
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
Now I want to melt the data.table back to the original column layout, and here I'm stuck: how do I separate the concatenated column names from the cast data.table? This is my problem:
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
I need two columns instead of one
the result that i'm looking for can be produced with this code
#update
dt.base[A==2 & B == 1 & C == 1, thevalue :=1]
dt.base[A==2 & B == 2 & C == 2, thevalue :=1]
#insert (2,1,3 was not there in the base data.table)
dt.newrow <- data.table(A=2, B=1, C=3, thevalue = 1)
dt.base <-rbindlist(list(dt.base, dt.newrow))
dt.base
As always any help is appreciated
Would that work for you?
colnames <- c("B", "C")
dt.melt[, (colnames) := (colsplit(variable, "_", colnames))][, variable := NULL]
subset(dt.melt, thevalue != 0)
# or dt.melt[thevalue != 0, ]
# A thevalue B C
#1: 1 -1.2070657 1 1
#2: 2 1.0000000 1 1
#3: 2 1.0000000 1 3
#4: 3 1.0844412 1 3
#5: 2 1.0000000 2 2
#6: 3 0.5060559 2 2
#7: 1 -2.3456977 2 4
If your data set isn't representative and there could be zeros in valid rows, here's an alternative approach
colnames <- c("B", "C")
setkey(dt.melt[, (colnames) := (colsplit(variable, "_",colnames))][, variable := NULL], A, B, C)
setkey(dt.base, A, B, C)
dt.base <- dt.melt[rbind(dt.base, data.table(A = 2, B = 1, C = 3), fill = T)]
dt.base[, thevalue.1 := NULL]
## A B C thevalue
## 1: 1 1 1 -1.2070657
## 2: 1 2 4 -2.3456977
## 3: 2 1 1 1.0000000
## 4: 2 2 2 1.0000000
## 5: 3 1 3 1.0844412
## 6: 3 2 2 0.5060559
## 7: 2 1 3 1.0000000
Edit
As suggested by @Arun, the most efficient way would be to use @AnandaMahto's cSplit function from the splitstackshape package, as it uses data.table too, i.e.,
cSplit(dt.melt, "variable", "_")
Second Edit
To avoid the manual merges, you can set fill = NA (for example) while dcasting and then do everything in one go with cSplit, e.g.
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum, fill = NA)
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
dt.cast <- cSplit(dt.melt, "variable", "_")[!is.na(thevalue)]
setnames(dt.cast, 3:4, c("B","C"))
# A thevalue B C
# 1: 1 -1.2070657 1 1
# 2: 2 1.0000000 1 1
# 3: 2 1.0000000 1 3
# 4: 3 1.0844412 1 3
# 5: 2 1.0000000 2 2
# 6: 3 0.5060559 2 2
# 7: 1 -2.3456977 2 4
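A third option, not used in the answers above, is data.table's own tstrsplit(), which avoids the extra packages. The sketch below runs on a tiny hand-made stand-in for dt.melt (the values are invented for illustration); with the real dt.melt only the two := lines should be needed.

```r
library(data.table)

# Hypothetical stand-in for the melted table: `variable` holds "B_C" pairs
dt.melt <- data.table(A = c(1, 2, 3),
                      variable = factor(c("1_1", "2_4", "1_3")),
                      thevalue = c(-1.2, 1, 1.1))

# Split "B_C" on the underscore into two new columns, converting to integers
dt.melt[, c("B", "C") := tstrsplit(variable, "_", fixed = TRUE, type.convert = TRUE)]

# The concatenated column is no longer needed
dt.melt[, variable := NULL]
```

Like the other set*/:= operations, this adds the B and C columns by reference rather than copying the table.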

Create a vector from repetitions of items from a matrix

I have a data frame m
A 2
B 3
C 4
and I want to create a data frame like
A 1
A 2
B 1
B 2
B 3
C 1
C 2
C 3
C 4
Any help? Thanks a lot in advance
Your original question can be answered by:
text <- LETTERS[1:3]
n <- 2:4
rep(text, times=n)
[1] "A" "A" "B" "B" "B" "C" "C" "C" "C"
Your new question is quite different:
df <- data.frame(
text = LETTERS[1:3],
n = 2:4
)
data.frame(
text = rep(df$text, times=df$n),
seq = sequence(df$n)
)
text seq
1 A 1
2 A 2
3 B 1
4 B 2
5 B 3
6 C 1
7 C 2
8 C 3
9 C 4
rep accepts vectors. Try this:
dat <- data.frame(V1 = letters[1:3], V2 = 2:4)
rep(dat[, 1], dat[, 2])
> rep(dat[, 1], dat[, 2])
[1] a a b b b c c c c
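To make the "rep accepts vectors" point concrete, here is a small sketch contrasting a vector passed to times with the scalar each argument, plus sequence() for the within-group counter. The variable names are only for illustration.

```r
x <- c("A", "B", "C")
n <- c(2, 3, 4)

# times as a vector: element i of x is repeated n[i] times
by_times <- rep(x, times = n)

# each as a single number: every element is repeated the same number of times
by_each <- rep(x, each = 2)

# sequence() concatenates 1:n[1], 1:n[2], ... and so numbers the repeats
counter <- sequence(n)
```

Pairing by_times with counter gives the two columns the edited question asks for.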
Assuming m is a data frame:
m <- data.frame(V1 = LETTERS[1:3], V2 = 2:4, stringsAsFactors = FALSE)
This will do what you want:
with(m, rep(V1, times = V2))
e.g.
> with(m, rep(V1, times = V2))
[1] "A" "A" "B" "B" "B" "C" "C" "C" "C"
Edit: To address the edit made by the OP, try the following:
with(m, data.frame(X1 = rep(V1, times = V2),
X2 = unlist(lapply(V2, seq_len))))
Which produces:
> with(m, data.frame(X1 = rep(V1, times = V2),
+ X2 = unlist(lapply(V2, seq_len))))
X1 X2
1 A 1
2 A 2
3 B 1
4 B 2
5 B 3
6 C 1
7 C 2
8 C 3
9 C 4
Or more succinctly via sequence(), as per @Andrie's answer (which I also keep forgetting about):
with(m, data.frame(X1 = rep(V1, times = V2), X2 = sequence(V2)))
@Andrie's answer is the only one so far that answers your new question. There may be a better way to do this but:
m <- data.frame(V1 = LETTERS[1:3], V2 = 2:4, stringsAsFactors = FALSE)
library(plyr)
ddply(m,"V1",function(x) data.frame(V2=seq(x[,2])))
