R Data table - add vector of values as a column - r

I would like to create a new column inside my data table, this column being a vector of values; but I am getting the following error:
> DT = data.table(x=rep(c("a","b"),c(2,3)),y=1:5)
> DT
x y
1: a 1
2: a 2
3: b 3
4: b 4
5: b 5
> DT[, my_vec := rep(0,y)]
Error in rep(0, y) : invalid 'times' argument
My expected result is:
> DT
x y my_vec
1: a 1 0
2: a 2 0 0
3: b 3 0 0 0
4: b 4 0 0 0 0
5: b 5 0 0 0 0 0
Is there a way to do that?

The syntax is a little cumbersome, but you can do this:
DT[, my_vec := list(list(rep(0, y))), by = y]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0,0
#3: b 3 0,0,0
#4: b 4 0,0,0,0
#5: b 5 0,0,0,0,0
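A variant without grouping, offered as a minimal sketch rather than as the answerer's method: build one vector per row with lapply() and wrap the result in list() so that := stores it as a single list column. The anonymous function and its argument name n are illustrative only.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

# lapply() returns one numeric vector per row; the outer list() makes :=
# treat the result as one list column rather than as several columns.
DT[, my_vec := list(lapply(y, function(n) rep(0, n)))]

# Each cell holds a vector whose length equals y in that row
lengths(DT$my_vec)
```

Individual cells are then read with double brackets, for example DT$my_vec[[3]].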

It is not clear whether you need my_vec to be a list column or a single string per row. If it is the latter, we group by row number, replicate 0 'y' times, and paste the elements together within each group.
DT[, my_vec := paste(rep(0, y), collapse=' ') , 1:nrow(DT)]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0 0
#3: b 3 0 0 0
#4: b 4 0 0 0 0
#5: b 5 0 0 0 0 0
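If only the string form is needed, a vectorised alternative may avoid the per-row grouping. This is a sketch that assumes base R's strrep() is available (R 3.3.0 or later); it is not part of the original answer.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

# strrep() repeats "0 " y times for every row at once;
# trimws() drops the trailing space left by the last repetition.
DT[, my_vec := trimws(strrep("0 ", y))]

DT$my_vec
```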

Related

How to remove rows where all columns are zero using data.table

I have the following data frame:
# the original dataset
dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
It looks like this:
> dat
a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3
What I want to do is to remove rows with all zero, resulting in :
a b c
0 1 0
2 0 1
3 0 3
How can I do that with data.table?
In reality the data has far more rows and columns, so the solution needs to be fast.
I tried this, but it is still slow:
dat <- dat[Reduce(`|`, dat), ]
You can try with rowSums -
library(data.table)
setDT(dat)
dat[rowSums(dat != 0) != 0]
# a b c
#1: 0 1 0
#2: 2 0 1
#3: 3 0 3
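Another option, again using rowSums(), which I find more readable. Note that it is only safe when no values are negative, since a row such as 1, -1, 0 also sums to zero.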
library(data.table)
dat[(rowSums(dat) !=0),]
a b c
1 0 1 0
3 2 0 1
4 3 0 3
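The two rowSums() variants above differ once negative values appear. The sketch below uses made-up data (not the question's) to show the difference: summing the values drops a row whose entries cancel out, while counting non-zero entries does not.

```r
library(data.table)

# Hypothetical data: row 1 has non-zero entries that sum to zero
dat <- data.table(a = c(1, 0, 2), b = c(-1, 0, 0), c = c(0, 0, 1))

# Summing the values drops row 1 as well as the all-zero row 2
by_sum <- dat[rowSums(dat) != 0]

# Counting non-zero entries keeps row 1 and drops only row 2
by_count <- dat[rowSums(dat != 0) > 0]

nrow(by_sum)
nrow(by_count)
```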

How to choose multiple columns as condition for row selection

For example,
set.seed(1984)
d <- data.table(name=letters[1:26],a=rbinom(26,1,0.5),b=rbinom(26,1,0.5),c=rbinom(26,1,0.5))
I can remove the rows where columns a, b and c are all 0 with:
d[,if(sum(a,b,c) != 0) .SD,by=.(a,b,c)]
the result is:
a b c name
1: 1 1 1 a
2: 1 1 1 u
3: 1 1 1 x
4: 0 1 0 b
5: 0 1 0 d
6: 0 1 0 h
7: 0 1 1 c
8: 0 1 1 g
9: 0 1 1 o
10: 0 1 1 q
11: 0 1 1 t
12: 1 1 0 e
13: 1 1 0 k
14: 1 1 0 y
15: 1 0 0 f
16: 1 0 0 i
17: 1 0 0 r
18: 1 0 0 s
19: 1 0 0 w
20: 0 0 1 j
21: 0 0 1 v
22: 1 0 1 m
23: 1 0 1 n
a b c name
Now, I have two questions:
1. How can I keep the "name" column as the first column?
2. How can I select columns a, b, c with a short expression (something like a:c, although a:c itself does not mean the columns a, b, c)? With hundreds of columns I can't type every name into sum() or into by.
Additional question:
If the function is not sum (which has a rowSums version for working across rows) but something else such as max, how can questions 1 and 2 be resolved without the apply family? (The apply family is designed for data frames, and I am afraid it will slow down data.table.)
We could use Reduce with + to create a logical vector based on the columns specified in the .SDcols
d[d[, Reduce(`+`, .SD) != 0, .SDcols = a:c]]
Other options include (@nicola's)
d[Reduce("+",d[,a:c])!=0]
Or, as suggested by @Frank, use pmax to create a column ('keep') holding the maximum value of each row, convert it from binary to logical, and subset the rows and columns based on it
d[, keep := as.logical(do.call(pmax, .SD)), .SDcols=!"name"][(keep), !"keep"]
You could also use rowSums function:
d[rowSums(d[,2:4])!=0,]
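For the follow-up about max, one possible approach is sketched below. It assumes the value columns are simply every column except name, and it uses a small hand-written table instead of the question's random data. The variable names cols, keep and res are illustrative.

```r
library(data.table)

# Small deterministic stand-in for the question's random data
d <- data.table(name = c("p", "q", "r", "s"),
                a = c(0, 1, 0, 0),
                b = c(0, 0, 1, 0),
                c = c(0, 0, 0, 0))

# Select the value columns programmatically instead of typing them out
cols <- setdiff(names(d), "name")

# pmax() is vectorised across its arguments, so do.call() gives the row-wise maximum
keep <- d[, do.call(pmax, .SD) != 0, .SDcols = cols]

# Subsetting rows leaves the column order untouched, so name stays first
res <- d[keep]
res
```

Because the rows are filtered with a logical vector rather than regrouped with by, the original column order is preserved.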

how to select subset only by [] in r?

a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to set the elements of q1 and q2 that are not equal to 1 to 0, and I want to use only []. I believe any subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
but it doesn't work, what's the problem?
We create a numeric index of the column names that start with 'q' followed by digits ('nm1'), use it to subset the columns of 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of character class by using stringsAsFactors = FALSE in the data.frame call.
The above replacement is based on a logical matrix (a[nm1]!=1) which may create memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace with 0
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
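As for why the attempt in the question does not work: the row index there is built from column positions, not from cell values. The sketch below evaluates the pieces of that expression one at a time; idx and row_sel are names introduced here for illustration.

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

# grep() returns the positions of the matching columns, not cell values
idx <- grep("q\\d", colnames(a), perl = TRUE)
idx

# Comparing those positions with 1 gives one logical per matched column
row_sel <- idx != 1
row_sel

# Used as a row index, the length-2 logical is recycled over all 12 rows,
# so it picks every second row regardless of what the cells contain
which(rep(row_sel, length.out = nrow(a)))
```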
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
data.table approach:
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
a[,grep("^q", colnames(a), value = T):=lapply(a[,grep("^q", colnames(a), value = T), with = F], function(x) ifelse(x == 1, 1, 0))]
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C

How to change multiple columns' data type in R?

I'd like to convert all fields ending with _FL from character to numeric. I thought this code would work, but it does not: all these fields are filled with NAs. What's wrong with it?
library(data.table)
#s = fread('filename.csv',header = TRUE,sep = ";",dec = ".")
s=data.table(ID=(1:10), B=rnorm(10), C_FL=c("I","N"), D_FL=(0:1), E_FL=c("N","I"))
cn=colnames(s)
# Change all fields ending with _FL from "N"/"I" to numeric 0/1
for (i in cn){
if(substr(i,nchar(i)-2,nchar(i))=='_FL'){
s[,i] = as.numeric(gsub("I",1,gsub("N",0,s[,i])))
}
}
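Another option is to find the character columns which contain "_FL" by using intersect(), and convert these to binary columns based on the condition == "N" (so "N" becomes 1 and "I" becomes 0; use == "I" instead to get the mapping used in the question's code):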
library(data.table)
# Find relevant columns
chr.cols <- names(s)[intersect(which(sapply(s,is.character)),
grep("_FL", names(s)))]
# Convert to numeric
for(col in chr.cols) set(s, j = col, value = as.numeric(s[[col]] == "N"))
# See result
> s
ID B C_FL D_FL E_FL
1: 1 0.6175364 0 0 1
2: 2 -0.9500318 1 1 0
3: 3 -0.6341547 0 0 1
4: 4 -0.8055696 1 1 0
5: 5 -0.3139938 0 0 1
6: 6 0.4676558 1 1 0
7: 7 1.6455591 0 0 1
8: 8 -0.4544377 1 1 0
9: 9 0.3512442 0 0 1
10: 10 0.3828367 1 1 0
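If the mapping from the question's gsub() calls is wanted ("N" to 0, "I" to 1), a named lookup vector is one way to express it. This is a sketch under that assumption; lookup, flag_cols and chr_cols are hypothetical names, and the flag columns are written out in full rather than recycled.

```r
library(data.table)

# Flag columns written out in full so no recycling is needed
s <- data.table(ID = 1:10,
                C_FL = rep(c("I", "N"), 5),
                D_FL = rep(0:1, 5),
                E_FL = rep(c("N", "I"), 5))

# Assumed mapping, taken from the gsub() calls in the question
lookup <- c(N = 0, I = 1)

# Only character columns whose name ends in _FL need recoding
flag_cols <- grep("_FL$", names(s), value = TRUE)
chr_cols <- Filter(function(col) is.character(s[[col]]), flag_cols)

# set() replaces each column by reference; unname() drops the lookup names
for (col in chr_cols) {
  set(s, j = col, value = unname(lookup[s[[col]]]))
}

s
```

D_FL is already 0/1 and is left untouched here; it can be converted separately with as.numeric() if a double column is required.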
One way to do it,
library(data.table)
#create function to change the values
f1 <- function(x){ifelse(x == 'N', 1, ifelse(x == 'I', 0, x))}
#get the columns to apply the function (names were lower-cased here; ignore.case also matches _FL)
ind <- names(s)[grepl('_fl$', names(s), ignore.case = TRUE)]
s[, (ind) := lapply(.SD, f1), .SDcols = ind]
#to convert to numeric then,
s[, (ind) := lapply(.SD, as.numeric), .SDcols = ind]
s
# id b c_fl d_fl e_fl
# 1: 1 0.20818371 0 0 1
# 2: 2 -0.06470128 1 1 0
# 3: 3 -1.03487884 0 0 1
# 4: 4 1.38119541 1 1 0
# 5: 5 -0.67924124 0 0 1
# 6: 6 0.84424732 1 1 0
# 7: 7 -0.65531266 0 0 1
# 8: 8 0.44867938 1 1 0
# 9: 9 0.15805731 0 0 1
#10: 10 -0.42541642 1 1 0
str(s)
Classes ‘data.table’ and 'data.frame': 10 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ b : num 0.7464 -0.7491 -0.7144 0.561 0.0243 ...
$ c_fl: num 0 1 0 1 0 1 0 1 0 1
$ d_fl: num 0 1 0 1 0 1 0 1 0 1
$ e_fl: num 1 0 1 0 1 0 1 0 1 0
- attr(*, ".internal.selfref")=<externalptr>
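With tidyverse (mutate_at takes .funs; the mapping of "I" to 1 and "N" to 0 follows the question's code):
library("tidyverse")
s %>%
mutate_at(.vars = vars(ends_with("_FL")),
.funs = ~ as.numeric(ifelse(.x == "I", 1, ifelse(.x == "N", 0, .x))))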

Populating data from one data.table to another

I have a distance matrix (as data.table) showing pairwise distances between a number of items, but not all items are in the matrix. I need to create a larger data.table that has all the missing items populated. I can do this with matrices fairly easily:
items=c("a", "b", "c", "d")
small_matrix=matrix(c(0, 1, 2, 3), nrow=2, ncol=2,
dimnames=list(c("a", "b"), c("a", "b")))
# create zero matrix of the right size
full_matrix <- matrix(0, ncol=length(items), nrow=length(items),
dimnames=list(items, items))
# populate items from the small matrix
full_matrix[rownames(small_matrix), colnames(small_matrix)] <- small_matrix
full_matrix
# a b c d
# a 0 2 0 0
# b 1 3 0 0
# c 0 0 0 0
# d 0 0 0 0
What is the equivalent of that in data.table? I can create an 'id' column in small_DT and use it as the key, but I'm not sure how to overwrite items in full_DT that has the same id/column pair.
Let's convert to data.table and keep the row names as an extra column:
dts = as.data.table(small_matrix, keep.rownames = TRUE)
# rn a b
#1: a 0 2
#2: b 1 3
dtf = as.data.table(full_matrix, keep.rownames = TRUE)
# rn a b c d
#1: a 0 0 0 0
#2: b 0 0 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Now just join on the rows, and assuming small matrix is always a subset you can do the following:
dtf[dts, names(dts) := dts, on = 'rn']
dtf
# rn a b c d
#1: a 0 2 0 0
#2: b 1 3 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Above assumes version 1.9.5+. Otherwise you'll need to set the key first.
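The same update join can also be written with the i. prefix, which makes it explicit that the new values come from the smaller table. This is a sketch of that common idiom, not the answerer's code; cols is a name introduced here, and the join key is assumed to be the rn column created by keep.rownames.

```r
library(data.table)

items <- c("a", "b", "c", "d")
small_matrix <- matrix(c(0, 1, 2, 3), nrow = 2,
                       dimnames = list(c("a", "b"), c("a", "b")))
full_matrix <- matrix(0, nrow = 4, ncol = 4, dimnames = list(items, items))

dts <- as.data.table(small_matrix, keep.rownames = TRUE)
dtf <- as.data.table(full_matrix, keep.rownames = TRUE)

# Columns to overwrite: everything in the small table except the join key
cols <- setdiff(names(dts), "rn")

# Inside a join, i.<name> refers to column <name> of the table passed as i (dts)
dtf[dts, (cols) := mget(paste0("i.", cols)), on = "rn"]
dtf
```

Rows of dtf without a match in dts keep their existing values.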
Suppose you have these two data.table:
dt1 = as.data.table(small_matrix)
# a b
#1: 0 2
#2: 1 3
dt2 = as.data.table(full_matrix)
# a b c d
#1: 0 0 0 0
#2: 0 0 0 0
#3: 0 0 0 0
#4: 0 0 0 0
You can't operate as you would with a data.frame or matrix, e.g. by doing:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1), with=F] <- dt1
This code raises an error, because to assign new values you need to use the := operator:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1):=dt1][]
# a b c d
#1: 0 2 0 0
#2: 1 3 0 0
#3: 0 0 0 0
#4: 0 0 0 0
