Best way to delete data.table rows in a function call in R

I am looking for the best way to subset the iris dataset inside a function call. Here is the code:
library(data.table)
data(iris)
remove_rows <- function(x)
{
x = setDT(x)[Species == "virginica"]
}
remove_rows(iris)
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
146: 6.7 3.0 5.2 2.3 virginica
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica
As you can see, none of the rows are deleted after running the remove_rows function. This is understandable, since data.table does not have the functionality to remove rows by reference.
The workaround I have used is to update the remove_rows function so that it returns the new object:
library(data.table)
remove_rows <- function(x)
{
x= setDT(x)[Species == "virginica"]
return(x)
}
iris = remove_rows(iris)
This has solved the problem, but since the data.table is huge in my case (iris is just a toy example), it takes a long time to run this function and copy the subset back into iris.
Is there a workaround to this situation?

This feature is not yet implemented, although it is highly requested. You can track its progress at https://github.com/Rdatatable/data.table/issues/635
The setsubset function that you are about to test is not complete. It lacks the C code that sets the true length of the object to something shorter than the original, so without that missing piece it won't help you much. As it stands, it will return the subset at the beginning of the data.table, and the remaining rows will be garbage.
For now, you have to return a new object from the function and assign it to (possibly) the same variable that you passed in. If you really don't want to do this, you can always use assign in the parent frame, but it is less elegant.
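The two options described above could look roughly like this. It is a minimal sketch: the function names keep_virginica and keep_virginica_assign are made up for illustration, and both variants still build a new subset rather than deleting rows by reference.

```r
library(data.table)

# Option 1: return the subset and rebind it at the call site.
keep_virginica <- function(x) {
  x[Species == "virginica"]
}

dt <- as.data.table(iris)
dt <- keep_virginica(dt)

# Option 2 (less elegant): overwrite the caller's variable by name.
# Assumes the argument is passed as a plain variable name, not an expression.
keep_virginica_assign <- function(x) {
  var_name <- deparse(substitute(x))   # name of the variable in the caller
  assign(var_name, x[Species == "virginica"], envir = parent.frame())
  invisible(NULL)
}

dt2 <- as.data.table(iris)
keep_virginica_assign(dt2)

nrow(dt)
nrow(dt2)
```

Option 2 hides the reassignment from the reader of the calling code, which is why returning the object is usually the clearer choice.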

Related

Is it possible to combine parameters to a subset function that is generated programmatically in R?

Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. The way I am working is that I have a parameter file where I can pass this subsetting criteria into my functions (so I don't need to directly edit my main library). The way I do this is as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to work. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine together these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But it does not seem like simply doing what I list above is valid. I get the wrong solution. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the combination of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in a non-interactive setting (see the warning in its help page) and eval(parse()). That said, here we go.
You can convert the expression to a string and append whatever you want to it. The trick is to convert the string back into an expression, which is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
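If you would rather not go through strings at all, the same combination can be sketched with bquote, which builds the combined call directly. This is an alternative to the parse approach shown above, not part of it; the names extra, combined and res are illustrative.

```r
# Existing criteria, stored as an expression (as in the question)
sub1 <- expression(Species == "setosa")

# Additional condition, captured as an unevaluated call
extra <- quote(Petal.Width > 0.2)

# Build the call `<sub1> & <extra>` without converting anything to text
combined <- bquote(.(sub1[[1]]) & .(extra))

# Evaluate the combined condition inside the data frame and index with it
res <- iris[eval(combined, iris), ]
nrow(res)
```

Indexing with `[` instead of subset also sidesteps the warning about using subset in non-interactive code.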

Apply R function to multiple objects and rewrite object

I'm trying to do the following:
define a function which creates an additional column based on existing columns in a data frame
apply said function to multiple objects (data frames), rewriting the original data frame
For example, say the function is to divide the Petal.Length by Petal.Width in iris.
divvy <- function(mydataframe){mydataframe$divvy <- mydataframe$Petal.Length/mydataframe$Petal.Width; mydataframe}
This part is easy.
Now imagine I have three (or three thousand) iris dataframes:
iris2 <- iris
iris4 <- iris
iris5 <- iris
What I am trying to avoid is this:
iris <- divvy(iris)
iris2 <- divvy(iris2)
iris4 <- divvy(iris4)
iris5 <- divvy(iris5)
times infinity for the number of iris data frames that I have
... with something along the lines of
lapply(c(iris,iris2,iris4,iris5), function(x) divvy(x))
And end up with iris, iris2, iris4, and iris5 having the new divvy column. How do I do this?
Please note: I do NOT want to create a meta-object that has all of the irises within it.
We could use data.table to do this:
library(data.table)
divvy <- function(x){x[,divvy := Petal.Length/Petal.Width]}
iris2 <- data.table(iris)
iris4 <- data.table(iris)
iris5 <- data.table(iris)
test <- lapply(list(iris2,iris4,iris5), function(x) divvy(x))
Where test looks like this (just showing the first 2 elements of the list):
> test
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species divvy
1: 5.1 3.5 1.4 0.2 setosa 7.000000
2: 4.9 3.0 1.4 0.2 setosa 7.000000
3: 4.7 3.2 1.3 0.2 setosa 6.500000
4: 4.6 3.1 1.5 0.2 setosa 7.500000
5: 5.0 3.6 1.4 0.2 setosa 7.000000
---
146: 6.7 3.0 5.2 2.3 virginica 2.260870
147: 6.3 2.5 5.0 1.9 virginica 2.631579
148: 6.5 3.0 5.2 2.0 virginica 2.600000
149: 6.2 3.4 5.4 2.3 virginica 2.347826
150: 5.9 3.0 5.1 1.8 virginica 2.833333
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species divvy
1: 5.1 3.5 1.4 0.2 setosa 7.000000
2: 4.9 3.0 1.4 0.2 setosa 7.000000
3: 4.7 3.2 1.3 0.2 setosa 6.500000
4: 4.6 3.1 1.5 0.2 setosa 7.500000
5: 5.0 3.6 1.4 0.2 setosa 7.000000
---
146: 6.7 3.0 5.2 2.3 virginica 2.260870
147: 6.3 2.5 5.0 1.9 virginica 2.631579
148: 6.5 3.0 5.2 2.0 virginica 2.600000
149: 6.2 3.4 5.4 2.3 virginica 2.347826
150: 5.9 3.0 5.1 1.8 virginica 2.833333
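Because := adds the column by reference, the original tables should also carry the new column after the lapply call, without any reassignment. A small check, assuming the tables are created with data.table(iris) as above:

```r
library(data.table)

divvy <- function(x) x[, divvy := Petal.Length / Petal.Width]

iris2 <- data.table(iris)
iris4 <- data.table(iris)
iris5 <- data.table(iris)

# The return value is not needed; invisible() just suppresses printing.
invisible(lapply(list(iris2, iris4, iris5), divvy))

# Each original object now has the divvy column.
"divvy" %in% names(iris2)
"divvy" %in% names(iris4)
"divvy" %in% names(iris5)
```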
EDIT*** In response to OP updating questions specs:
You could try this:
for(i in c("iris2", "iris4", "iris5")){
x <- divvy(get(i))
assign(paste0(i,"divvied"), x)
}
That said, I'd recommend against assign, especially for a lot of objects. You could instead extract the elements from the test list that I made in the first half of the answer. You'd get the same result, just cleaner and with less clutter.
What the code does is pulls in the iris data tables as a string, and then reads them using get. This is passed to your divvy function, creating a data.table x. I then use assign to create the data.table with the suffix divvied.
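For plain data frames, a similar by-name rewrite can be sketched with mget and list2env from base R. This is a variation on the get/assign loop above and shares its drawbacks. The function add_ratio is a hypothetical stand-in for divvy that returns the whole data frame.

```r
# Hypothetical helper: adds the ratio column and returns the data frame
add_ratio <- function(df) {
  df$divvy <- df$Petal.Length / df$Petal.Width
  df
}

iris2 <- iris
iris4 <- iris
iris5 <- iris

nms <- c("iris2", "iris4", "iris5")
env <- environment()   # the environment that holds the data frames

# Fetch the objects by name, transform them, and write them back
# under the same names.
invisible(list2env(lapply(mget(nms, envir = env), add_ratio), envir = env))

names(iris2)
```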

using 'ifelse' in R: variable taking static value

I am trying to create new variable in a dataset based on the value of an indicator. The following is the code for the same:
library(dplyr)
prac_data <- head(iris,10)
COPY_IND='Y' ##declaring the indicator to be 'Y'
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
I get the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New_Var
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 5.1
3 4.7 3.2 1.3 0.2 setosa 5.1
4 4.6 3.1 1.5 0.2 setosa 5.1
5 5.0 3.6 1.4 0.2 setosa 5.1
6 5.4 3.9 1.7 0.4 setosa 5.1
7 4.6 3.4 1.4 0.3 setosa 5.1
8 5.0 3.4 1.5 0.2 setosa 5.1
9 4.4 2.9 1.4 0.2 setosa 5.1
10 4.9 3.1 1.5 0.1 setosa 5.1
I actually want to copy the variable 'Sepal.Length' in the 'New_Var' for every observation if indicator(COPY_IND) is Yes('Y').
If I do the following, I get the desired response:
if (COPY_IND=='Y')
{
prac_data$New_Var <- prac_data$Sepal.Length
} else {prac_data$New_Var <- 'N'}
I just want to understand why R treats the two 'if-else' approaches differently.
Is there a better, more elegant way to do the same?
Thanks in advance!!
Actually, this might be easier to read as an answer.
From ifelse() help: "ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE".
Your test is just a single value, so ifelse() returns a single value, either Sepal.Length[1] or N, which is then duplicated across the whole column.
You need rowwise() along the way: prac_data <- prac_data %>% rowwise() %>% mutate(New_Var = ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
COPY_IND is always "Y" in your case, so the code could be simplified to prac_data$New_Var = prac_data$Sepal.Length. Even if you want to apply ifelse row by row, it is better to follow the advice in its help page:
"Further note that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test is a simple true/false result, i.e., when length(test) == 1."
I guess the intended COPY_IND is a column of the data frame (or a vector) rather than a single fixed value. In that case, your code generates the right answer. For example, to keep only the first five numbers:
library(dplyr)
prac_data <- head(iris,10)
prac_data$COPY_IND=c(rep('Y',5),rep('N',5))
#COPY_IND=c(rep('Y',5),rep('N',5)) works too
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
generates the right column.
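The difference between the two forms can be seen without dplyr. In this sketch (variable names scalar_result and New_Var are illustrative), ifelse returns something shaped like its test, while if/else returns whichever branch is chosen, whole:

```r
prac_data <- head(iris, 10)
COPY_IND <- "Y"

# ifelse() is vectorised over `test`: a length-1 test gives a length-1 result,
# which is then recycled when assigned to a column.
scalar_result <- ifelse(COPY_IND == "Y", prac_data$Sepal.Length, "N")
length(scalar_result)

# if/else evaluates the condition once and returns the entire chosen branch.
prac_data$New_Var <- if (COPY_IND == "Y") prac_data$Sepal.Length else "N"
prac_data$New_Var
```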

In R, why does is.na cause data.table to display the data.table as output? Version 1.9.4

The data.table package (which is amazingly useful) still prints the data.table output in the following scenario. Is this a known issue? It seems to occur when is.na is used.
Earlier Posting for Reference
library(data.table)
di <- data.table(iris)
di[is.na(Sepal.Length),Color := "Blue"]
packageVersion("data.table")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
146: 6.7 3.0 5.2 2.3 virginica
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica
> packageVersion("data.table")
[1] ‘1.9.4’
6/14/2015 Edit:
Thanks for the responses. Indeed it seems that the issue is that no records meet the criteria, whereas my is.na example is just an example of the general issue. To confirm, this line also causes the data.table to display:
di[Sepal.Length > 100,Color := "Blue"]
By the way, even if the column already exists the data.table still gets displayed if no records are found. As so:
d2 <- data.table(iris)
d2[,Clr := NA]
d2[Sepal.Length > 100, Clr := "Blue"]
Sounds like the authorities are already aware of this and have it tackled. I can work around the issue in the meantime.
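One possible workaround in the meantime, as a sketch: wrapping the call in invisible() suppresses auto-printing whether or not any rows match. This is a general R idiom and not the fix made in data.table itself.

```r
library(data.table)

di <- data.table(iris)

# No rows satisfy the condition; invisible() keeps the table from printing.
invisible(di[Sepal.Length > 100, Color := "Blue"])

# Capturing the output confirms that nothing is printed.
printed <- capture.output(invisible(di[Sepal.Length > 100, Color := "Blue"]))
length(printed)
```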

splitting a data.table, then modifying by reference

I have a use-case where I need to split a data.table, then apply different modify-by-reference operations to each partition. However, splitting forces copying of each table.
Here's a toy example on the iris dataset:
#split the data
library(data.table)
DT <- data.table(iris)
out <- split(DT, DT$Species)
#assign partitions to global environment
NAMES <- as.character(unique(DT$Species))
lapply(seq_along(out), function(x) {
assign(NAMES[x], out[[x]], envir=.GlobalEnv)})
#modify by reference, same function applied to different columns for different partitions
#would do this programmatically in real use case
virginica[ ,summ:=sum(Petal.Length)]
setosa[ ,summ:=sum(Petal.Width)]
#rbind all (again, programmatic)
do.call(rbind, list(virginica, setosa))
Then I get the following warning:
Warning message:
In `[.data.table`(out$virginica, , `:=`(cumPedal, cumsum(Petal.Width))) :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference.
I know this is related to putting data.tables in lists. Is there any workaround for this use case, or a way to avoid using split? Note that in the real case, I want to modify by reference programmatically, so hardcoding a solution won't work.
Here's an example of using .EACHI to achieve what it sounds like you're trying to do:
## Create a data.table that indicates the pairs of keys to columns
New <- data.table(
Species = c("virginica", "setosa", "versicolor"),
FunCol = c("Petal.Length", "Petal.Width", "Sepal.Length"))
## Set the key of your original data.table
setkey(DT, Species)
## Now use .EACHI
DT[New, temp := cumsum(get(FunCol)), by = .EACHI][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species temp
# 1: 5.1 3.5 1.4 0.2 setosa 0.2
# 2: 4.9 3.0 1.4 0.2 setosa 0.4
# 3: 4.7 3.2 1.3 0.2 setosa 0.6
# 4: 4.6 3.1 1.5 0.2 setosa 0.8
# 5: 5.0 3.6 1.4 0.2 setosa 1.0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica 256.9
# 147: 6.3 2.5 5.0 1.9 virginica 261.9
# 148: 6.5 3.0 5.2 2.0 virginica 267.1
# 149: 6.2 3.4 5.4 2.3 virginica 272.5
# 150: 5.9 3.0 5.1 1.8 virginica 277.6
## Basic verification
head(cumsum(DT["setosa", ]$Petal.Width), 5)
# [1] 0.2 0.4 0.6 0.8 1.0
tail(cumsum(DT["virginica", ]$Petal.Length), 5)
# [1] 256.9 261.9 267.1 272.5 277.6
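A simpler variant of the same idea, closer to the sum() used in the question, is to skip split entirely and update each partition in place by filtering in i. This is a sketch: the lookup vector cols is an assumed stand-in for whatever drives the choice of column in the real use case.

```r
library(data.table)

DT <- data.table(iris)

# Hypothetical lookup: which column to summarise for which species
cols <- c(virginica = "Petal.Length", setosa = "Petal.Width")

# Update each partition by reference; no split, no copies of the partitions.
for (sp in names(cols)) {
  DT[Species == sp, summ := sum(get(cols[[sp]]))]
}

# Species not listed in the lookup (here versicolor) keep NA in summ.
DT[, .(summ = summ[1]), by = Species]
```

Since the table is never placed in a list, the .internal.selfref warning should not come up, and there is nothing to rbind afterwards.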

Resources