Paste two data.table columns in R

dt <- data.table(L=1:5,A=letters[7:11],B=letters[12:16])
L A B
1: 1 g l
2: 2 h m
3: 3 i n
4: 4 j o
5: 5 k p
Now I want to paste columns "A" and "B" to get a new one, let's call it "new":
dt2
L A B new
1: 1 g l gl
2: 2 h m hm
3: 3 i n in
4: 4 j o jo
5: 5 k p kp

I had a similar issue, but with many columns, and didn't want to type each one manually.
New version (based on a comment from @mnel)
dt[, new:=do.call(paste0,.SD), .SDcols=-1]
This is roughly twice as fast as the old version, and it sidesteps the quirks noted below. Note the use of .SDcols to identify the columns to pass to paste0; the -1 selects all columns but the first, since the OP wanted to paste columns A and B but not L.
If you would like to use a different separator:
dt[ , new := do.call(paste, c(.SD, sep = ":"))]
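.SDcols also accepts a character vector, so if you prefer to name the columns explicitly rather than exclude by position, something like this should work as well:
dt[, new := do.call(paste0, .SD), .SDcols = c("A", "B")]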
Old version
You can use .SD and by to handle multiple columns:
dt[,new:=paste0(.SD,collapse=""),by=seq_along(L)]
I added seq_along in case L was not unique. (You can check this using dt<-data.table(L=c(1:4,4),A=letters[7:11],B=letters[12:16])).
Also, in my actual instance for some reason I had to use t(.SD) in the paste0 part. There may be other similar quirks.

Arun's comment answered this question:
dt[,new:=paste0(A,B)]

This should do it:
dt <- data.table(dt, new = paste(dt$A, dt$B, sep = ""))

If you want to paste strictly by column index (when you may not know the column names), say pasting columns 6 and 4 into a new column:
dt$new <- apply( dt[,c(6,4)], 1, function(row){ paste(row[1],row[2],sep="/") })
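A data.table-idiomatic alternative, sketched under the same assumption of columns 6 and 4: .SDcols also accepts numeric positions, which avoids the row-wise apply() entirely.
dt[, new := do.call(paste, c(.SD, sep = "/")), .SDcols = c(6, 4)]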

Related

R perform summary operation and subset result by data.table column

I want to use a list external to my data.table to determine the values of a new column in that data.table: in this case, the new value should be the length of the list element corresponding to a data.table attribute.
# dummy list. I am interested in extracting the vector length of each list element
l <- list(a=c(3,5,6,32,4), b=c(34,5,6,34,2,4,6,7), c = c(3,4,5))
# dummy dt; the number after the underscore in Attri2 is the list element I want the length of
dt <- data.table(Attri1 = c("t","y","h","g","d","e","d"),
Attri2 = c("fghd_1","sdafsf_3","ser_1","fggx_2","sada_2","sfesf_3","asdas_2"))
# extract that number to a new attribute, just for clarity
dt[, list_gp := tstrsplit(Attri2, "_", fixed=TRUE, keep=2)]
# then calculate the lengths of the vectors in the list, and attempt to subset by the index taken above
dt[,list_len := '[['(lapply(l, length),list_gp)]
Error in lapply(l, length)[[list_gp]] : no such index at level 1
I envisaged the list_len column to be 5,3,5,8,8,3,8
A couple of things: tstrsplit gives you a string, so convert it to a number. I'm also not quite sure about the [[ construct there; see the proposed solution:
dt[, list_gp := as.numeric( tstrsplit(Attri2, "_", fixed=TRUE, keep=2)[[1]] )]
dt[, list_len := sapply( l[ list_gp ], length ) ]
Output:
> dt
Attri1 Attri2 list_gp list_len
1: t fghd_1 1 5
2: y sdafsf_3 3 3
3: h ser_1 1 5
4: g fggx_2 2 8
5: d sada_2 2 8
6: e sfesf_3 3 3
7: d asdas_2 2 8

One to Many Join in data.table

I am using data.table to do a one-to-many merge. Instead of matching all the rows, the output shows only the last matched row for each unique value of the key.
a <- data.table(x = 1:2L, y = letters[1:4])
b <- data.table(x = c(1L,3L))
setkey(a,x)
setkey(b,x)
I want to do a many to one (b to a) join based on column x.
c <- a[b,on=.(x)]
c
# x y
# 1: 1 a
# 2: 1 c
# 3: 3 NA
However, this approach creates a new data.table called c. To avoid making a new data.table, I use the following code to add the column y to b by reference:
b[a,y:=i.y]
Now b looks like,
b
# x y
# 1: 1 c
# 2: 3 NA
The desired output is the one produced by the first method (c). Is there a way of using := that outputs all the matching rows instead of the last matched row alone?
PS: The reason I want to use method 2 using := is because my data is huge and I do not want to make copies. The example I showed reflects what happens in my data.
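A note on why this happens (my reading of data.table's documented behavior, not from the thread itself): := updates a table in place and cannot change its number of rows, so b[a, y := i.y] has nowhere to put a second match for x == 1 and keeps only the last one. To keep every matched row, the join itself has to build the result, as in the first method:
c <- a[b, on = .(x)]  # materializes all matches; only the matched rows are copied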

Pass a column name as an object and not a string for data.table

I'm using data.table to aggregate: collapse one column while grouping by others. I know how to do this when referring to the column by name, but I can't make it work when referring to the column by position. I know this method:
dt[,X := list(paste(X, collapse = ";")),by = list(Y,Z)]
What I want to do now is:
dt[,names(dt)[1] := list(paste(names(dt)[1], collapse = ";")),by = list(Y,Z)]
But this code just writes the literal string "X" in every row, because names(dt)[1] is a character value, not the column itself.
here is an example:
X <- c("a","b","c","d","e","f","g")
Y <- c(1,2,3,4,4,6,4)
Z <- c(10,11,23,8,8,1,3)
dt <- data.table(X,Y,Z)
This is the desired output, but I need to know how to do this by position because I'm trying to apply it to multiple columns (I have a data frame with 400 columns):
X Y Z
1: a 1 10
2: b 2 11
3: c 3 23
4: d;e 4 8
5: f 6 1
6: g 4 3
You should wrap names(dt)[1] inside get():
dt[,names(dt)[1] := list(paste(get(names(dt)[1]), collapse = ";")),by = list(Y,Z)]
Additionally, if you want to deduplicate your data you can use unique(dt).
To apply your functions to multiple columns, you can use .SD in combination with lapply(). For example, pasting together the first two columns, grouped by Z:
dt[, lapply(.SD, function(x) paste(x, collapse=";")), by=list(Z),.SDcols=names(dt)[1:2]]
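And if the goal is to update the many columns in place rather than build an aggregated copy, here is a sketch of the same idea with := (assuming every column except the grouping columns Y and Z should be collapsed; follow with unique(dt) as above to deduplicate):
cols <- setdiff(names(dt), c("Y", "Z"))
dt[, (cols) := lapply(.SD, paste, collapse = ";"), by = .(Y, Z), .SDcols = cols]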

R data.table column names not working within a function

I am trying to use a data.table within a function, and I am trying to understand why my code is failing. I have a data.table as follows:
DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
> DT
my_name my_id
1: A 2
2: B 2
3: C 3
4: D 3
5: E 4
6: F 4
I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:
Var1 Var2
A C
A D
A E
A F
B C
B D
B E
B F
C E
C F
D E
D F
I have a function to return all pairs of "my_name" for a given pair of values of "my_id" which works as expected.
get_pairs <- function(id1,id2,tdt) {
return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
}
> get_pairs(2,3,DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Now, I want to execute this function for all pairs of ids, which I try to do by finding all pairs of ids and then using mapply with the get_pairs function.
> combn(unique(DT$my_id),2)
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
tid1 <- combn(unique(DT$my_id),2)[1,]
tid2 <- combn(unique(DT$my_id),2)[2,]
mapply(get_pairs, tid1, tid2, DT)
Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) :
object 'my_id' not found
Again, if I try to do the same thing without mapply, it works.
get_pairs(tid1[1],tid2[1],DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Why does this function fail only when used within mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
Alternatively, is there a different/more efficient way to accomplish this task? I have a large data.table with a third id "sample" and I need to get all of these pairs for each sample (e.g. operating on DT[sample=="sample_id",] ). I am new to the data.table package, and I may not be using it in the most efficient way.
The function debugonce() is extremely useful in these scenarios.
debugonce(mapply)
mapply(get_pairs, tid1, tid2, DT)
# Hit enter twice
# from within BROWSER
debugonce(FUN)
# Hit enter twice
# you'll be inside your function, and then type DT
DT
# [1] "A" "B" "C" "D" "E" "F"
Q # (to quit debugging mode)
which is wrong. Basically, mapply() takes the first element of each input argument and passes it to your function. In this case you've provided a data.table, which is also list. So, instead of passing the entire data.table, it's passing each element of the list (columns).
So, you can get around this by doing:
mapply(get_pairs, tid1, tid2, list(DT))
But mapply() simplifies the result by default, and therefore you'd get a matrix back. You'll have to use SIMPLIFY = FALSE.
mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE)
Or simply use Map:
Map(get_pairs, tid1, tid2, list(DT))
Use rbindlist() to bind the results.
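For example, assuming the Map() call above:
res <- Map(get_pairs, tid1, tid2, list(DT))
rbindlist(res)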
Enumerate all possible pairs
u_name <- unique(DT$my_name)
all_pairs <- CJ(u_name,u_name)[V1 < V2]
Enumerate observed pairs
obs_pairs <- unique(
DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id"]
)
Take the difference
all_pairs[!J(obs_pairs)]
CJ is like expand.grid except that it creates a data.table with all of its columns as its key. A data.table X must be keyed for a join X[J(Y)] or a not-join X[!J(Y)] (like the last line) to work. The J is optional, but makes it more obvious that we're doing a join.
Simplifications. @CathG pointed out that there is a cleaner way of constructing obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) in place of CJ(un,un)[V1 < V2].
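Under that assumption, the simplified construction would look something like:
obs_pairs <- unique(
  DT[, {un <- unique(my_name); as.list(un)}, by = my_id][, !"my_id"]
)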
Why does this function fail only when used within mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
The reason the function is failing has nothing to do with scoping in this case. mapply vectorizes the function: it takes each element of each parameter and passes it to the function. In your case, the elements of a data.table are its columns, so mapply is passing the column my_name instead of the complete data.table.
If you want to pass the complete data.table to mapply, you should use the MoreArgs parameter. Then your function will work:
res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
do.call("rbind", res)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
5 A E
6 B E
7 A F
8 B F
9 C E
10 D E
11 C F
12 D F

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file.
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
unique.obs <- length(unique(data[, x]))
if (unique.obs == 1) {
data[, x] <- NULL
}
}
Any insight as to how I may more efficiently count the number of unique observations in each column of a data.table would be much appreciated. Alternatively, a recommendation on how to drop columns that contain only one unique observation would be even better.
Update: uniqueN
As of version 1.9.6, there is a built-in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
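In recent versions (data.table 1.12.0 or later, if I remember the version correctly), .SDcols also accepts a filter function, so dropping the constant columns becomes a one-liner:
dt <- dt[, .SD, .SDcols = function(x) uniqueN(x) > 1]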
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to count rows with nrow() when selecting with with=FALSE within [.data.table (length() on a one-column data.table returns 1, as you observed), or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work.
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names:
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
d1 = "",
c = rep(1, times = 10),
d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
length(unique(x))==1 && x[1]==""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it... (one possible simplification is sketched after the output below).
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
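One possible simplification, assuming the "ghost" columns are exactly the character columns containing only empty strings: iterate over the columns as a list with vapply, which avoids the character-matrix coercion that apply(dt, 2, ...) performs.
only_space_cols <- vapply(dt, function(x) is.character(x) && all(x == ""), logical(1))
dt[, which(!only_space_cols), with = FALSE]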
There is an easy way to do that using the dplyr library and its select function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)  # placeholder column names
Note that you can select as many variables as you like, and the result will contain only those columns.
