I am trying to add columns with the value "" to a data.table, but I get the following error. Can anyone help me here?
Since the object has been converted to a data.table, the assignment no longer works.
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[c("new", "New1")] <- ""
Error in `[.data.table`(x, i, which = TRUE) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
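data.table uses a different syntax; please look into the documentation. As the error message says, a character vector passed as i is interpreted as a join, not as a column selection. New columns are assigned in j with :=, so for this case you can add them like this: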
library(data.table)
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[j = c("new", "New1") := ""]
head(iris_sam, 5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new New1
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
I am importing a dataset from a third party and would like to validate that all of the columns in the incoming dataset are named as agreed and expected. To do this, I intended to use the verify statement from the assertr package in R together with has_all_names. This works with no problem if I manually enter the column names to be verified, but I can't seem to do it by passing in a vector that contains the names of the columns to be verified. For example, using the built-in iris dataset, I can verify the existence of all the column names if I enter them manually as arguments to has_all_names, but if I have the names stored in a vector and attempt to use it for verification, it does not work:
#Create a sample list of column names to be verified
#In my real work, I obtain this list from a database
(names(iris)->expected_variable_names)
Which outputs:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
But then when I run the following:
#This works:
iris %>% verify(has_all_names("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
#But this does not:
iris %>% verify(has_all_names(expected_variable_names))
When I attempt to run the line that does not work, this generates:
verification [has_all_names(expected_variable_names)] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA has_all_names(expected_variable_names) NA 1 NA
Error: assertr stopped execution
Obviously, the failed attempt is indicating that not all of the column names are found in the dataframe, but since I'm passing in all the variable names that are indeed on the dataset, it should succeed. How can I pass into verify a vector or possibly even a list of column names to validate? I've tried a number of different variations of this last attempt with no success.
Thanks.
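We may use invoke from purrr (note that invoke() has been deprecated since purrr 1.0.0 in favor of exec(), which is shown further down)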
library(purrr)
library(dplyr)
library(assertr)
iris %>%
verify(invoke(has_all_names, expected_variable_names))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
...
Or with exec from rlang
library(rlang)
iris %>%
verify(exec(has_all_names, !!!expected_variable_names))
Or with do.call from base R
iris %>%
verify(do.call(has_all_names,
as.list(expected_variable_names)))
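The vector version presumably fails because has_all_names collects its arguments through ..., so a character vector arrives as one argument instead of five separate names. Here is a minimal base R sketch of that behaviour. The function has_names_demo is a hypothetical stand-in written for this illustration and is not assertr's actual implementation:

```r
# Hypothetical stand-in for a variadic name checker; NOT assertr's real code.
has_names_demo <- function(df, ...) {
  wanted <- list(...)  # each argument becomes one list element
  all(vapply(wanted,
             function(nm) length(nm) == 1 && nm %in% names(df),
             logical(1)))
}

expected_variable_names <- names(iris)

# One argument holding five names: the check fails.
as_vector <- has_names_demo(iris, expected_variable_names)

# do.call spreads the vector into five separate arguments: the check passes.
as_separate <- do.call(has_names_demo,
                       c(list(iris), as.list(expected_variable_names)))
```

This is the same spreading step that exec with !!! and do.call with as.list perform in the answers above.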
I'm using the Ionosphere dataset in R and am trying to write a loop that will create new columns that are standardized iterations of existing columns and name them accordingly.
I've got the "cname" as the new column name and c as the original. The code is:
install.packages("mlbench")
library(mlbench)
data('Ionosphere')
library(robustHD)
col <- colnames(Ionosphere)
for (c in col[1:length(col)-1]){
cname <- paste(c,"Std")
Ionosphere$cname <- standardize(Ionosphere$c)
}
But get the following error:
"Error in `$<-.data.frame`(`*tmp*`, "cname", value = numeric(0)) :
replacement has 0 rows, data has 351
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA"
I feel like there's something super-simple I'm missing but I just can't see it.
Any help gratefully received.
We can use lapply, a custom-made standardization function, setNames, and cbind.
I do not have access to your dataset, so I am using the iris dataset as an example:
df <- iris
# setNames() is base R, so no extra package is needed
cbind(df, setNames(lapply(df[1:4],
                          \(x) (x - mean(x)) / sd(x)),
                   paste0(names(df[1:4]), "_Std")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_Std Sepal.Width_Std Petal.Length_Std Petal.Width_Std
1 5.1 3.5 1.4 0.2 setosa -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 4.9 3.0 1.4 0.2 setosa -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 4.7 3.2 1.3 0.2 setosa -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 4.6 3.1 1.5 0.2 setosa -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 5.0 3.6 1.4 0.2 setosa -1.01843718 1.24503015 -1.33575163 -1.3110521482
...
I feel these transformations get easier with dplyr:
library(dplyr)
iris %>% mutate(across(where(is.numeric),
~ (.x - mean(.x))/sd(.x),
.names = "{col}_Std"))
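As for why the original loop fails: the $ operator takes the name after it literally, so Ionosphere$c looks for a column called "c" (which does not exist and returns NULL) and Ionosphere$cname would create a column literally named "cname". Indexing with [[ ]] evaluates the variable instead. Here is a minimal sketch of the loop written that way, using iris and a hand-written standardization rather than robustHD::standardize, so the variable names are assumptions for illustration only:

```r
df <- iris

# Pick the numeric columns by type instead of by position.
num_cols <- names(df)[sapply(df, is.numeric)]

for (col in num_cols) {
  # paste0 avoids the space that paste() would insert into the new name.
  new_name <- paste0(col, "_Std")
  # [[ ]] evaluates `col` and `new_name`; $ would treat them as literal names.
  df[[new_name]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
}

names(df)
```

Each new column should then have mean 0 and standard deviation 1, up to floating-point error.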
When I run the following code, I expect the value of the Sepal_Width_2 column to be Sepal_Width + 1, but it is in fact Sepal_Width + 2. What gives?
require(dplyr)
require(sparklyr)
Sys.setenv(SPARK_HOME='/usr/lib/spark')
sc <- spark_connect(master="yarn")
# for this example these variables are hard coded
# but in my actual code these are named dynamically
sw_name <- as.name('Sepal_Width')
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
ir <- copy_to(sc, iris)
print(head(ir %>% mutate(!!sw2 := sw_name))) # so far so good
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 3.5
# 4.9 3 1.4 0.2 setosa 3
# 4.7 3.2 1.3 0.2 setosa 3.2
# 4.6 3.1 1.5 0.2 setosa 3.1
# 5 3.6 1.4 0.2 setosa 3.6
# 5.4 3.9 1.7 0.4 setosa 3.9
print(head(ir %>% mutate(!!sw2 := sw_name) %>% mutate(!!sw2 := sw2_name + 1))) # i guess 2+2 != 4?
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 5.5
# 4.9 3 1.4 0.2 setosa 5
# 4.7 3.2 1.3 0.2 setosa 5.2
# 4.6 3.1 1.5 0.2 setosa 5.1
# 5 3.6 1.4 0.2 setosa 5.6
# 5.4 3.9 1.7 0.4 setosa 5.9
My use case requires that I use the dynamic variable naming you see above. In this example it is rather silly (compared to just using the variables directly), but in my use case I'm running the same function across hundreds of different spark tables. They all have the same "schema" in terms of the number of columns and what each column is (outputs from some machine learning models), but the names differ because each table contains the output for a different model. The names are predictable, but since they vary, I construct them dynamically as you see here instead of hardcoding them.
It appears that Spark knows how to add 2 and 2 together when the names are hardcoded, but when the names are dynamic it suddenly freaks out.
You might be misusing as.name which is leading sparklyr to misinterpret your input.
Note that your code errors when just working on a local table:
sw_name <- as.name('Sepal.Width') # swap "_" to "." to match variable names
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
data(iris)
print(head(iris %>% mutate(!!sw2 := sw_name)))
# Error: Problem with `mutate()` input `Sepal_Width_2`.
# x object 'Sepal.Width' not found
# i Input `Sepal_Width_2` is `sw_name`.
Note that you are using both the !! operator from rlang with as.name from base R. But you are not using them together as demonstrated in this question.
I recommend you use sym and !! from the rlang package instead of as.name, and that you apply both to character strings that are column names. The following works locally, and is consistent with the non-standard evaluation guidance. So it should translate to spark:
library(dplyr)
data(iris)
sw <- 'Sepal.Width'
sw2 <- paste0(sw, "_2")
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)))
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)) %>% mutate(!!sym(sw2) := !!sym(sw2) + 1))
I'm not sure which package was the culprit (sparklyr, dplyr, R, who knows), but this was fixed when I upgraded from R 3.6.3/sparklyr 1.5 to R 4.0.2/sparklyr 1.7.0.
I am new to data.table and think that this is an easy question, but can't seem to find the answer anywhere.
I want to subset a table based on the value of two columns, whose names I know. But I want to compare against a value which I don't know in advance. That is, I want to use a variable for the i portion of DT[]. But I can't seem to figure out how to do it. Everything I see explains how to use a variable for j (i.e. column names), but not for i.
When I just put the name of the variable in, i.e.
setkey(dtpredictions, colA, colB)
nextweek = dtpredictions[J(uservar, weekvar)]
it returns the entire table. Trying to apply the answer to FAQ 1.6, I tried:
nextweek = dtpredictions[J(eval(quote(uservar)), eval(quote(weekvar)))]
and
nextweek = dtpredictions[J(eval(user), eval(week))]
but both still returned the whole table.
I am pretty sure this is very simple, but I am stuck.
EDIT
I apologize for not clarifying earlier: I would like to do a binary search, since I need the speedup. I know that I can do a vector scan using ==, but I would prefer not to.
Found the problem - one of my variables had the same name as a column in the table. I actually saw a question about a similar problem here, but didn't even realize that I had that issue. (It was another column in the table, not the one I was subsetting on.)
I changed the name of the variable I was using to subset and now it works.
hmmm...interesting. Does this code seem to work for you? I am not getting the same error. I am using data.table 1.9.3.
require(data.table)
iris <- data.table(iris)
#Create new categorical variable
set.seed(1)
iris[ , new.var := sample(letters[1:5],150,replace=TRUE)]
#Set keys
setkey(iris,Species,new.var)
#Create variables to reference
check1 <- "setosa"
check2 <- "b"
#Return matches
iris[J(check1,check2)]
And the resulting table:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new.var
1: 5.1 3.5 1.4 0.2 setosa b
2: 4.9 3.0 1.4 0.2 setosa b
3: 5.0 3.6 1.4 0.2 setosa b
4: 5.4 3.7 1.5 0.2 setosa b
5: 4.3 3.0 1.1 0.1 setosa b
6: 5.7 3.8 1.7 0.3 setosa b
7: 5.1 3.7 1.5 0.4 setosa b
8: 4.8 3.4 1.9 0.2 setosa b
9: 5.0 3.0 1.6 0.2 setosa b
10: 5.2 3.5 1.5 0.2 setosa b
11: 4.7 3.2 1.6 0.2 setosa b
Is this what you are looking for?
setkey(dtpredictions, colA, colB)
nextweek <- dtpredictions[colA == uservar & colB == weekvar]
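Since the question asks for a binary search and the root cause was a variable sharing its name with a column, another option is to build the lookup values outside of the brackets and join with on=. This is only a sketch under assumptions: it uses iris with a made-up deterministic new.var column, the names dt, lookup and res are illustrative, and it assumes a reasonably recent data.table that supports on= and nomatch = NULL.

```r
library(data.table)

dt <- as.data.table(iris)
dt[, Species := as.character(Species)]               # plain character key column
dt[, new.var := rep(letters[1:5], length.out = .N)]  # deterministic demo column

check1 <- "setosa"
check2 <- "b"

# Build the lookup table outside of dt[...] so that column names
# in dt cannot mask the variables holding the values.
lookup <- data.table(Species = check1, new.var = check2)

# Join on the two columns; nomatch = NULL drops lookup rows without a match.
res <- dt[lookup, on = c("Species", "new.var"), nomatch = NULL]
res
```

Because the values are placed into lookup before the join, it does not matter whether check1 or check2 happen to share a name with a column of dt.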
I have a dataset with many missing values. Some of the missing values are NAs, some are Nulls, and others have varying lengths of blank spaces. I would like to utilize the fread function in R to be able to read all these values as missing.
Here is an example:
#Find fake data
iris <- data.table(iris)[1:5]
#Add missing values non-uniformly
iris[1,Species:='   ']
iris[2,Species:=' ']
iris[3,Species:='NULL']
#Write to csv and read back in using fread
write.csv(iris,file="iris.csv")
fread("iris.csv",na.strings=c("NULL"," "))
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2
2: 2 4.9 3.0 1.4 0.2 NA
3: 3 4.7 3.2 1.3 0.2 NA
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
From the above example, we see that the first missing value is not read as NA, because it contains several blank spaces. Does anyone know of a way to account for this?
Thanks so much for the wonderful answer from @eddi.
fread("sed 's/ *//g' iris.csv",na.strings=c("",NA,"NULL"))
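If sed is not available (for example on Windows), a possible alternative is to read the file as usual and clean up afterwards in R. Here is a minimal base R sketch, assuming the data have already been read into a data frame; the small df below is made up to mimic the example, and the set of strings treated as missing is an assumption:

```r
# Made-up stand-in for the data after reading it in.
df <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0),
  Species = c("   ", " ", "NULL", "setosa", "setosa"),
  stringsAsFactors = FALSE
)

missing_markers <- c("", "NULL")  # compared after trimming whitespace

# Check every character column, whatever its name.
char_cols <- names(df)[sapply(df, is.character)]
for (col in char_cols) {
  # trimws() strips leading/trailing whitespace, so "   " and " " both become "".
  blank <- trimws(df[[col]]) %in% missing_markers
  df[[col]][blank] <- NA
}

df
```

Unlike the sed approach, this only trims for the comparison, so spaces inside legitimate values are left untouched.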