How to run a function inside data.table?

I've created an example data.table
library(data.table)
set.seed(1)
siz <- 10
my <- data.table(
AA=c(rep(NA,siz-1),"11/11/2001"),
BB=sample(c("wrong", "11/11/2001"),siz, prob=c(1000000,1), replace=T),
CC=sample(siz),
DD=rep("11/11/2001",siz),
EE=rep("HELLO", siz)
)
my[2,AA:=1]
NA wrong 3 11/11/2001 HELLO
1 wrong 2 11/11/2001 HELLO
NA wrong 6 11/11/2001 HELLO
NA wrong 10 11/11/2001 HELLO
NA wrong 5 11/11/2001 HELLO
NA wrong 7 11/11/2001 HELLO
NA wrong 8 11/11/2001 HELLO
NA wrong 4 11/11/2001 HELLO
NA wrong 1 11/11/2001 HELLO
11/11/2001 wrong 9 11/11/2001 HELLO
If I run this code
patt <- "^\\d\\d?/\\d\\d?/\\d{4}$"
sapply(my, function(x) (grepl(patt,x )))
I get a table with TRUE whenever there is a date.
AA BB CC DD EE
[1,] FALSE FALSE FALSE TRUE FALSE
[2,] FALSE FALSE FALSE TRUE FALSE
[3,] FALSE FALSE FALSE TRUE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE
[5,] FALSE FALSE FALSE TRUE FALSE
[6,] FALSE FALSE FALSE TRUE FALSE
[7,] FALSE FALSE FALSE TRUE FALSE
[8,] FALSE FALSE FALSE TRUE FALSE
[9,] FALSE FALSE FALSE TRUE FALSE
[10,] TRUE FALSE FALSE TRUE FALSE
But if I do it like this:
my[,lapply(.SD, grepl, patt)]
I just get this result:
AA BB CC DD EE
1: NA FALSE FALSE FALSE FALSE
Why?
How can I get the same result writing everything inside the brackets?

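lapply passes each column of .SD as the first argument of grepl, which is pattern, so patt ends up matched to x. grepl uses only the first element of a pattern vector, which is why you get a single value per column. We need to name the pattern argument if we are not using an anonymous function call: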
my[,lapply(.SD, grepl, pattern = patt)]
Or otherwise with an anonymous function call
my[,lapply(.SD, function(x) grepl(patt, x))]
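To see the argument matching in isolation, here is a minimal base R sketch. `cols` is a made-up stand-in for the columns of .SD (not part of the original example); since .SD is itself a list of columns, the same matching should apply inside my[, ...].

```r
patt <- "^\\d\\d?/\\d\\d?/\\d{4}$"

# Hypothetical stand-in for two of the columns in .SD
cols <- list(DD = rep("11/11/2001", 3), EE = rep("HELLO", 3))

# Positional: each column fills grepl's first formal (pattern) and patt becomes x.
# grepl then uses only the first element of the column as the pattern,
# so each column yields a single value (warnings suppressed here).
wrong <- suppressWarnings(lapply(cols, grepl, patt))

# Named: pattern is fixed to patt, so each column is matched to x.
right <- lapply(cols, grepl, pattern = patt)

lengths(wrong)  # one value per column
lengths(right)  # one value per row of each column
```

Naming the argument keeps the call short; the anonymous function does the same thing but makes the order explicit.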

Related

NA incorrectly appearing in selected/subsetted data

I'm stumped by the following:
z <- data.frame(a=c(1,2,3,4,5,6), b=c("Yes","Yes","No","No","",NA))
is.na(z$b)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
z$a[z$b=="Yes"]
[1] 1 2 NA
is.na(z$a[z$b=="Yes"])
[1] FALSE FALSE TRUE
Why is it that when I select z$b=="Yes", NA appears as a third value for the subsetted z$a?
When I subset, however, this isn't a problem:
subset(z, b=="Yes")$a
[1] 1 2
Many thanks in advance.
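No answer was captured with this question, so the following is only a sketch of the usual explanation: comparing NA with "Yes" gives NA rather than FALSE, and indexing a vector with a logical NA returns NA. subset() treats NA in its condition as FALSE, which is why it does not show the problem. Wrapping the condition in which(), or using %in%, drops the NA in the same way.

```r
z <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                b = c("Yes", "Yes", "No", "No", "", NA))

cond <- z$b == "Yes"   # the last element is NA, not FALSE

# Indexing with a logical NA produces an NA in the result
with_na <- z$a[cond]

# which() keeps only the positions that are TRUE
via_which <- z$a[which(cond)]

# %in% never returns NA, so no extra element appears
via_in <- z$a[z$b %in% "Yes"]
```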

How many elements of a vector are smaller or equal to each element of this vector?

I am interested in writing a program that gives the number of elements of vector x that are smaller or equal to any given value within vector x.
Let's say
x <- c(1, 3, 8, 7, 6, 4, 3, 10, 12)
I want to calculate the number of elements of x that are smaller than or equal to 1, to 3, to 8, and so on. For example, the fifth element, x[5], is 6, and the number of elements smaller than or equal to 6 is 5. However, I only know how to do an element-wise comparison, e.g. x[1] <= x[3].
I suppose that I will be using the for loop and have something like this here:
for (i in length(x)){
if (x[i]<=x[i]){
print(x[i])}
# count number of TRUEs
}
However, this code obviously does not do what I want.

Use outer to make all comparisons at once:
outer(x, x, "<=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [2,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
# [4,] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
# [5,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
# [6,] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
# [7,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
colSums(outer(x, x, "<="))
#[1] 1 3 7 6 5 4 3 8 9

You can also use the *apply family as follows,
sapply(x, function(i) sum(x <= i))
#[1] 1 3 7 6 5 4 3 8 9

We can use findInterval
findInterval(x, sort(x))
#[1] 1 3 7 6 5 4 3 8 9

Another alternative is to use rank, which ranks the values. Setting the ties.method argument to "max" retrieves the inclusive value ("<=" versus "<").
rank(x, ties.method="max")
#[1] 1 3 7 6 5 4 3 8 9
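As a quick sanity check, the four approaches above can be run side by side on the example vector. This is just a sketch; the names `counts` and `agree` are made up for illustration.

```r
x <- c(1, 3, 8, 7, 6, 4, 3, 10, 12)

# One entry per approach from the answers above
counts <- list(
  outer        = colSums(outer(x, x, "<=")),
  sapply       = sapply(x, function(i) sum(x <= i)),
  findInterval = findInterval(x, sort(x)),
  rank         = rank(x, ties.method = "max")
)

# TRUE if every approach returns the same counts as the outer() version
agree <- all(sapply(counts, function(v) all(v == counts$outer)))
```

Note that outer() builds a full length(x) by length(x) matrix, so for long vectors findInterval or rank is likely the lighter choice.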

Compare each element of a variable within each group

Consider the data frame in R:
set.seed(36)
y <- runif(10,0,200)
group <- sample(rep(1:2, each=5))
d <- data.frame(y, group)
I want to compare all y against all y within each group. The following code does this correctly:
d_split <- split(d, d$group)
a <- with(d_split[[1]],outer(y, y, "<="))
b <- with(d_split[[2]],outer(y, y, "<="))
But I am doing this inside a function where the number of groups varies (group will be an argument of that function), so I cannot proceed in this manner. How can I elegantly rewrite the last three lines to compare all y against all y within each group?

To perform the same operation for multiple groups we can use lapply and perform the outer operation for every group.
lapply(split(d, d$group), function(x) outer(x[["y"]], x[["y"]], "<="))
#$`1`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE TRUE
#[4,] TRUE TRUE TRUE TRUE TRUE
#[5,] TRUE TRUE FALSE FALSE TRUE
#$`2`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE TRUE FALSE
#[2,] FALSE TRUE FALSE TRUE FALSE
#[3,] TRUE TRUE TRUE TRUE TRUE
#[4,] FALSE FALSE FALSE TRUE FALSE
#[5,] TRUE TRUE FALSE TRUE TRUE
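Since the question asks for this inside a function with the grouping passed as an argument, the lapply call above could be wrapped along these lines. `compare_within` and its parameter names are hypothetical, and the small data frame is made up so the result is easy to check by hand.

```r
# Hypothetical helper: compare all values against all values within each group.
# `value` and `group` are column names given as strings.
compare_within <- function(data, value, group) {
  lapply(split(data[[value]], data[[group]]),
         function(v) outer(v, v, "<="))
}

# Small made-up example with two groups of two values each
d_small <- data.frame(y = c(1, 3, 2, 5), group = c(1, 2, 1, 2))
res <- compare_within(d_small, "y", "group")

res[["1"]]  # comparisons among y = 1, 2
res[["2"]]  # comparisons among y = 3, 5
```

The result is a named list with one logical matrix per group, so it works for any number of groups.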

Here is an option without splitting
library(data.table)
setDT(d)[, as.data.table(outer(y, y, "<=")), group]
# group V1 V2 V3 V4 V5
#1: 1 TRUE TRUE FALSE FALSE FALSE
#2: 1 FALSE TRUE FALSE FALSE FALSE
#3: 1 TRUE TRUE TRUE FALSE TRUE
#4: 1 TRUE TRUE TRUE TRUE TRUE
#5: 1 TRUE TRUE FALSE FALSE TRUE
#6: 2 TRUE TRUE FALSE TRUE FALSE
#7: 2 FALSE TRUE FALSE TRUE FALSE
#8: 2 TRUE TRUE TRUE TRUE TRUE
#9: 2 FALSE FALSE FALSE TRUE FALSE
#10: 2 TRUE TRUE FALSE TRUE TRUE
Or in a 'long' format with CJ
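# name the CJ columns explicitly; recent data.table versions name unnamed inputs after the variable (both would be y)
setDT(d)[, CJ(V1 = y, V2 = y), group][, V1 <= V2, group]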

Create column based on unique values [duplicate]

This question already has answers here:
Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
I'd like to create columns in a data frame based on the unique values from a single column.
E.g.
Column1
A
B
C
Into
A B C
True False False
False True False
False False True

We can use table
!!table(1:nrow(df1), df1$Column1)
# A B C
# 1 TRUE FALSE FALSE
# 2 FALSE TRUE FALSE
# 3 FALSE FALSE TRUE
Or using mtabulate from qdapTools
library(qdapTools)
mtabulate(df1$Column1)!=0
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
Or using model.matrix
model.matrix(~Column1-1, df1)!=0
# Column1A Column1B Column1C
#1 TRUE FALSE FALSE
#2 FALSE TRUE FALSE
#3 FALSE FALSE TRUE

You could also use a loop,
sapply(df$Column1, function(i) grepl(i, df$Column1))
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE

You can also use dcast from reshape2 package
library(reshape2)
!is.na(dcast(df, Column1 ~ Column1))[, -1]
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE

find whether string is matched in many columns at once and return a matrix of logicals?

Given a data structure like the following:
set.seed(10)
fruits <- c("apple", "orange", "pineapple")
fruits2 <- data.frame(id = 1:10, fruit1 = sample(fruits, 10, replace = T), fruit2 = sample(fruits, 10, replace = T), fruit3 = sample(fruits, 10, replace = T))
> fruits2
id fruit1 fruit2 fruit3
1 1 orange orange pineapple
2 2 apple orange orange
3 3 orange apple pineapple
4 4 pineapple orange orange
5 5 apple orange orange
6 6 apple orange pineapple
7 7 apple apple pineapple
8 8 apple apple apple
9 9 orange orange pineapple
10 10 orange pineapple orange
I can easily test whether any location in the data.frame is exactly equal to a given string with fruits2 == "mystring", and it will return a very convenient format. For example:
fruits2 == "orange"
id fruit1 fruit2 fruit3
[1,] FALSE TRUE TRUE FALSE
[2,] FALSE FALSE TRUE TRUE
[3,] FALSE TRUE FALSE FALSE
[4,] FALSE FALSE TRUE TRUE
[5,] FALSE FALSE TRUE TRUE
[6,] FALSE FALSE TRUE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE TRUE TRUE FALSE
[10,] FALSE TRUE FALSE TRUE
However, what I would really like to do is search for a pattern (e.g. "apple") and have the same format returned. That is, I would like to be able to test whether every item in the data.frame contains (but is not necessarily equal to) the string "apple" and have the same matrix of logicals returned. In this case, I would like it to produce:
id fruit1 fruit2 fruit3
[1,] FALSE FALSE FALSE TRUE
[2,] FALSE TRUE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE TRUE FALSE FALSE
[6,] FALSE TRUE FALSE TRUE
[7,] FALSE TRUE TRUE TRUE
[8,] FALSE TRUE TRUE TRUE
[9,] FALSE FALSE FALSE TRUE
[10,] FALSE FALSE TRUE FALSE
Is there any easy way to do this in R without specifying multiple patterns (I know in this case fruits2 == "apple" | fruits2 == "pineapple" would do it, but in my real dataset enumerating all possible strings to exactly match is not possible)?
I figure there are workarounds and I could write a function to do it using grepl() but I am wondering if there is a simpler solution.

In base R,
> apply(fruits2,2,function(x){grepl("apple",x)})
id fruit1 fruit2 fruit3
[1,] FALSE FALSE FALSE TRUE
[2,] FALSE TRUE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE TRUE FALSE FALSE
[6,] FALSE TRUE FALSE TRUE
[7,] FALSE TRUE TRUE TRUE
[8,] FALSE TRUE TRUE TRUE
[9,] FALSE FALSE FALSE TRUE
[10,] FALSE FALSE TRUE FALSE
n = 10000
fruits2 <- data.frame(id = 1:n, fruit1 = sample(fruits, n, replace = T), fruit2 = sample(fruits, n, replace = T), fruit3 = sample(fruits, n, replace = T))
> system.time(apply(fruits2,2,function(x){grepl("apple",x)}))
user system elapsed
0.016 0.000 0.019
> system.time(colwise(myfun)(fruits2))
user system elapsed
0.016 0.000 0.017
> system.time(sapply(fruits2,function(x) grepl('apple',x)))
user system elapsed
0.032 0.000 0.034
As @eddi points out, lapply is indeed the fastest:
> system.time(do.call("cbind",lapply(colnames(fruits2),function(x) grepl('apple',fruits2[,x]))))
user system elapsed
0.016 0.000 0.016
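The sapply call can also be written without an anonymous function by naming the pattern argument. The sketch below uses a small made-up data frame (`fruits_small`) so the result is easy to verify; fixed = TRUE is an optional extra that treats "apple" as a literal string rather than a regular expression.

```r
# Small made-up data frame in the same shape as fruits2
fruits_small <- data.frame(
  id     = 1:3,
  fruit1 = c("orange", "apple", "pineapple"),
  fruit2 = c("apple", "orange", "orange"),
  stringsAsFactors = FALSE
)

# Each column is passed as x because pattern is named explicitly
res <- sapply(fruits_small, grepl, pattern = "apple", fixed = TRUE)

res
```

The result is a logical matrix with one column per column of the data frame, the same shape that fruits2 == "orange" returns.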

Dunno if you count this as simpler, but you can use colwise from the plyr package:
library(plyr)
myfun <- function(x) grepl('apple', x)
colwise(myfun)(fruits2)
id fruit1 fruit2 fruit3
1 FALSE FALSE FALSE TRUE
2 FALSE TRUE FALSE FALSE
3 FALSE FALSE TRUE TRUE
4 FALSE TRUE FALSE FALSE
5 FALSE TRUE FALSE FALSE
6 FALSE TRUE FALSE TRUE
7 FALSE TRUE TRUE TRUE
8 FALSE TRUE TRUE TRUE
9 FALSE FALSE FALSE TRUE
10 FALSE FALSE TRUE FALSE
