I am new to data.table and think that this is an easy question, but can't seem to find the answer anywhere.
I want to subset a table based on the value of two columns, whose names I know. But I want to compare against a value which I don't know in advance. That is, I want to use a variable for the i portion of DT[]. But I can't seem to figure out how to do it. Everything I see explains how to use a variable for j (i.e. column names), but not for i.
When I just put the name of the variable in, i.e.
setkey(dtpredictions, colA, colB)
nextweek = dtpredictions[J(uservar, weekvar)]
it returns the entire table. Trying to apply the answer to FAQ 1.6, I tried:
nextweek = dtpredictions[J(eval(quote(uservar)), eval(quote(weekvar)))]
and
nextweek = dtpredictions[J(eval(user), eval(week))]
but both still returned the whole table.
I am pretty sure this is very simple, but I am stuck.
EDIT
I apologize for not clarifying earlier: I would like to do a binary search, since I need the speedup. I know that I can do a vector scan using ==, but I would prefer not to.
Found the problem - one of my variables had the same name as a column in the table. I actually saw a question about a similar problem here, but didn't even realize that I had that issue. (It was another column in the table, not the one I was subsetting on.)
I changed the name of the variable I was using to subset and now it works.
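For anyone hitting the same thing, here is a minimal sketch of the name clash. The toy table `dt`, its values, and the unrelated column `week` are hypothetical and only stand in for the real `dtpredictions`. The sketch assumes, as the fix above suggests, that names inside `dt[...]` resolve to columns of the table before variables in the calling scope.

```r
library(data.table)

# Hypothetical toy table: colA/colB are the key columns, 'week' is an unrelated column
dt <- data.table(colA = c("u1", "u1", "u2"),
                 colB = c(1L, 2L, 1L),
                 week = c(10L, 20L, 30L))
setkey(dt, colA, colB)

uservar <- "u1"
week    <- 2L   # same name as a column of dt

# Inside dt[...], 'week' is looked up among the columns first,
# so this joins on the column values rather than on 2L
clash <- dt[J(uservar, week)]

# A variable name that is not a column is found in the calling scope
weekvar <- 2L
ok <- dt[J(uservar, weekvar)]
print(ok)
```

Choosing variable names that cannot collide with column names (a suffix such as `var`, for instance) avoids the problem altogether.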
hmmm...interesting. Does this code seem to work for you? I am not getting the same error. I am using data.table 1.9.3.
require(data.table)
iris <- data.table(iris)
#Create new categorical variable
set.seed(1)
iris[ , new.var := sample(letters[1:5],150,replace=TRUE)]
#Set keys
setkey(iris,Species,new.var)
#Create variables to reference
check1 <- "setosa"
check2 <- "b"
#Return matches
iris[J(check1,check2)]
And the resulting table:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new.var
1: 5.1 3.5 1.4 0.2 setosa b
2: 4.9 3.0 1.4 0.2 setosa b
3: 5.0 3.6 1.4 0.2 setosa b
4: 5.4 3.7 1.5 0.2 setosa b
5: 4.3 3.0 1.1 0.1 setosa b
6: 5.7 3.8 1.7 0.3 setosa b
7: 5.1 3.7 1.5 0.4 setosa b
8: 4.8 3.4 1.9 0.2 setosa b
9: 5.0 3.0 1.6 0.2 setosa b
10: 5.2 3.5 1.5 0.2 setosa b
11: 4.7 3.2 1.6 0.2 setosa b
Is this what you are looking for?
setkey(dtpredictions, colA, colB)
nextweek <- dtpredictions[colA == uservar & colB == weekvar]
I am trying to add columns with the value "" to a data.table, but I get the following error. Can anyone help me here?
Since the data has been converted to a data.table, the data.frame-style assignment no longer works.
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[c("new", "New1")] <- ""
Error in `[.data.table`(x, i, which = TRUE) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
data.table uses a different syntax for this; please have a look at the documentation. In this case you can assign the new columns like this:
library(data.table)
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[j = c("new", "New1") := ""]
head(iris_sam, 5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new New1
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. The way I am working is that I have a parameter file where I can pass this subsetting criteria into my functions (so I don't need to directly edit my main library). The way I do this is as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to work. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine together these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But it does not seem like simply doing what I list above is valid. I get the wrong solution. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the combination of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in a non-interactive setting (see the warning in its help page) and eval(parse()). That said, here we go.
You can turn the expression into a string and append whatever you want to it. The trick is converting the string back into an expression, which is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
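As an alternative to pasting strings, the two conditions can be combined as language objects, which sidesteps eval(parse()). This is only a sketch: the names `extra`, `combined` and `keep` are made up for illustration, and it assumes the stored criteria is a length-one expression as in the question.

```r
# Criteria as it would come from the parameter file
subset_criteria <- expression(Species == "setosa")

# Extra condition built inside the function (hypothetical example)
extra <- quote(Petal.Width > 0.2)

# bquote() splices both pieces into a single call: <criteria> & <extra>
combined <- bquote(.(subset_criteria[[1]]) & .(extra))

# Evaluate the combined condition with the data frame's columns in scope
keep <- eval(combined, iris)
result <- iris[keep, ]
print(nrow(result))
```

Because the pieces are joined as a call tree rather than as text, a criteria that itself contains `|` keeps its grouping, which is not guaranteed when strings are pasted together. Note that plain `[` indexing, unlike subset, does not drop rows where the condition is NA.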
I'm following this brief tutorial on data.table
https://www.r-bloggers.com/r-data-table-tutorial-with-50-examples/
but I get stuck when the author talks about setkey()
Here is my example. I use the iris dataset so that it can be easily replicated.
library(data.table)
mydata <- as.data.table(iris)
# Change variable names (setnames modifies mydata by reference)
setnames(mydata, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
         c("sepal_length", "sepal_width", "petal_length", "petal_width", "species"))
Now I will use a factor variable and a numeric variable as keys:
setkey(mydata, species, petal_length)
Using this works perfectly:
> mydata[.("setosa", 1.4)]
sepal_length sepal_width petal_length petal_width species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 5.0 3.6 1.4 0.2 setosa
4: 4.6 3.4 1.4 0.3 setosa
5: 4.4 2.9 1.4 0.2 setosa
6: 4.8 3.0 1.4 0.1 setosa
7: 5.1 3.5 1.4 0.3 setosa
8: 5.2 3.4 1.4 0.2 setosa
9: 5.5 4.2 1.4 0.2 setosa
10: 4.9 3.6 1.4 0.1 setosa
11: 4.8 3.0 1.4 0.3 setosa
12: 4.6 3.2 1.4 0.2 setosa
13: 5.0 3.3 1.4 0.2 setosa
But this throws an error:
mydata[.("setosa", <1.4)]
Error: unexpected '<' in "mydata[.("setosa", <"
So my question is if it is possible to include >, <, >=, <= when searching using setkey because that function is supposed to work on variables of any type. If yes, what will be the correct form to call something such as mydata[.("setosa", <1.4)]
I looked at:
R data.table setkey with numeric column
R data.table 1.9.2 issue on setkey
but found nothing useful to answer my question.
I also read data.table documentation but there are no useful examples.
Any comment will be much appreciated.
It looks like you are subsetting on a condition rather than extracting exact matches. The following feels like the more natural syntax
mydata[species=="setosa" & petal_length < 1.4]
or a non-equi join like this
mydata[.(species="setosa", i.petal_length=1.4), on=.(species, petal_length < i.petal_length)]
I found something that can be useful, using the seq function.
Suppose I want to retrieve the observations for setosa that have a petal_length between 1.4 and 2.
Following the example in my original question, we can use:
na.omit(mydata[.("setosa", seq(1.4,2, 0.1))])
Which returns the observations we wanted.
seq(1.4, 2, 0.1)
returns a sequence from 1.4 to 2 in steps of 0.1. The join looks up each of these values in the data.table and returns a row of NAs for every value it does not find (1.6, 1.8 and 1.9 here). That is why the whole call is wrapped in na.omit.
Hope this can be useful for somebody.
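The same range can also be requested without enumerating the values. The sketch below assumes the renamed columns from the question and uses data.table's `between()` (bounds inclusive by default); the name `in_range` is made up for illustration. One possible advantage is that it does not rely on the values produced by `seq()` being exactly equal to the stored doubles, which floating-point arithmetic does not always guarantee.

```r
library(data.table)

# Rebuild the table from the question with lower-case, underscore-separated names
mydata <- as.data.table(iris)
setnames(mydata, old = names(mydata),
         new = tolower(gsub(".", "_", names(mydata), fixed = TRUE)))
setkey(mydata, species, petal_length)

# All setosa rows with 1.4 <= petal_length <= 2
in_range <- mydata[species == "setosa" & between(petal_length, 1.4, 2)]
print(nrow(in_range))
```

No na.omit is needed here, because a filter only returns rows that exist in the table.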
I am trying to create new variable in a dataset based on the value of an indicator. The following is the code for the same:
library(dplyr)
prac_data <- head(iris, 10)
COPY_IND <- 'Y'  # declare the indicator as 'Y'
prac_data <- prac_data %>% mutate(New_Var = ifelse(COPY_IND == 'Y', Sepal.Length, 'N'))
I get the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New_Var
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 5.1
3 4.7 3.2 1.3 0.2 setosa 5.1
4 4.6 3.1 1.5 0.2 setosa 5.1
5 5.0 3.6 1.4 0.2 setosa 5.1
6 5.4 3.9 1.7 0.4 setosa 5.1
7 4.6 3.4 1.4 0.3 setosa 5.1
8 5.0 3.4 1.5 0.2 setosa 5.1
9 4.4 2.9 1.4 0.2 setosa 5.1
10 4.9 3.1 1.5 0.1 setosa 5.1
What I actually want is to copy 'Sepal.Length' into 'New_Var' for every observation if the indicator (COPY_IND) is 'Y'.
If I do the following, I get the desired result:
if (COPY_IND == 'Y') {
  prac_data$New_Var <- prac_data$Sepal.Length
} else {
  prac_data$New_Var <- 'N'
}
I just want to understand why R treats the two 'if-else' approaches differently.
Is there a better, more elegant way to do the same thing?
Thanks in advance!!
Actually, this might be easier to read as an answer.
From ifelse() help: "ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE".
Your test is just a single value, so ifelse() returns a single value, either Sepal.Length[1] or N, which is then duplicated across the whole column.
For your approach to work you need rowwise(): prac_data <- prac_data %>% rowwise() %>% mutate(New_Var = ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
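The difference can be seen in base R without dplyr. This is a small sketch using the question's data; the name `single` is made up for illustration.

```r
prac_data <- head(iris, 10)
COPY_IND <- "Y"

# ifelse() returns something shaped like the test; the test has length 1,
# so only the first element of Sepal.Length is returned
single <- ifelse(COPY_IND == "Y", prac_data$Sepal.Length, "N")
print(length(single))

# if/else evaluates the scalar test once and returns the chosen value whole
prac_data$New_Var <- if (COPY_IND == "Y") prac_data$Sepal.Length else "N"
print(prac_data$New_Var)
```

With mutate(), that single value is then recycled down the column, which is the repeated 5.1 in the output above.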
COPY_IND is always "Y" in your case, so the code could be simplified to prac_data$New_Var = prac_data$Sepal.Length. Even if you want to apply ifelse row by row, it is better to follow the advice in its help page:
Further note that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test is a simple true/false result, i.e., when length(test) == 1.
I guess COPY_IND was meant to be a column of the data frame (or a vector with one element per row) rather than a single fixed value. In that case your code generates the right answer, e.g. to keep only the first five numbers:
library(dplyr)
prac_data <- head(iris,10)
prac_data$COPY_IND=c(rep('Y',5),rep('N',5))
#COPY_IND=c(rep('Y',5),rep('N',5)) works too
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
generates the right column.
I have a data frame (let's call it "years") with 5 rows and 10 columns. I need to build a new one by computing (x1-x2)/x1, where x1 is the first element and x2 the second element of a column in "years", then (x2-x3)/x2, and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such a function to pass to rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
You can use transform, diff and length, no need to use rollapply
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
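Note that diff() computes x2 - x1, so the column above is (x2 - x1)/x1, the opposite sign of the (x1 - x2)/x1 stated in the question. Below is a sketch that follows the question's definition and applies it to every column. The data frame `years` and its values are hypothetical, and all columns are assumed to be numeric.

```r
# Hypothetical stand-in for the "years" data frame (numeric columns only)
years <- data.frame(a = c(10, 8, 6, 3, 3),
                    b = c(1, 2, 4, 8, 16))

# For each column: (x[i] - x[i + 1]) / x[i], which gives one row fewer than the input
rel_change <- as.data.frame(
  lapply(years, function(x) -diff(x) / head(x, -1))
)
print(rel_change)
```

If the result should keep the same number of rows as the input, an NA can be prepended to each column as in the transform call above.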
diff.zoo in the zoo package with the arithmetic=FALSE argument will divide each number by the prior in each column:
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))