Append a column of NA values: lit() and withColumn() giving error

I am trying to append a column of null values to a SparkR DataFrame with the following code:
w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")
x <- rbind(3, 3, 3, 3, 3, 3)
d <- cbind.data.frame(w, z, x)
B <- as.DataFrame(sqlContext, d)
B1 <- sample(B, withReplacement = FALSE, fraction = 0.5)
B2 <- except(B, B1)
col_sub <- c("z", "x")
B2 <- select(B2, col_sub)
B2 <- withColumn(B2, "w", lit(NA))
But the last expression returns the error: Error in FUN(X[[i]], ...) : Unsupported data type: null. I have used the lit operation to produce a column of null values before, but I'm not sure why it won't work this time.
Also, this has been discussed on SE before; see this question. I'm completely clueless as to why my expression yields that error. For reference, I'm using SparkR 1.6.1.

Whether or not it works, adding a column this way is not good practice. Since the only practical reason to add a column containing nothing but undefined values is to enforce a specific schema for unions or external writes, you should always use a column of a specific type.
For example:
withColumn(B2, "w", cast(lit(NULL), "double"))
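To verify that the cast produced a properly typed column, the schema can be inspected afterwards (a sketch assuming the B2 DataFrame from the question and a live Spark context; printSchema and dtypes are SparkR's schema accessors):

```r
# cast the null literal to an explicit type, as above
B2 <- withColumn(B2, "w", cast(lit(NULL), "double"))

# inspect the resulting column types; "w" should now be reported as double
printSchema(B2)
dtypes(B2)
```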

Spark columns can have types such as numeric and character. My understanding is that columns of other R data types are intentionally disallowed.
SparkR does not treat NA the way base R does, as a marker for a missing value; SparkR sees NA as a value of type logical. For example:
dtypes(NA)
unable to find an inherited method for function ‘dtypes’ for signature ‘"logical"’
If you try to add a column of NA's, Spark tries to create a column of type logical, which is not a valid data type for a column. Hence the error.
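You can see this from plain R (a minimal check; NA_real_ and friends are R's typed missing values, which is one way to sidestep the logical default):

```r
class(NA)            # "logical"   - the default NA
class(NA_real_)      # "numeric"   - a typed missing value
class(NA_integer_)   # "integer"
class(NA_character_) # "character"
```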
There are a couple of places where SparkR (1.6.2) is inconsistent in trapping errors around creating illegal column types. As you found, SparkR throws an error if you use lit(NA), but SparkR will let you convert an R data.frame with a column of NAs, and it successfully creates an illegal column of type "logical":
x <- c(NA,NA,NA, NA, NA)
dfX <- data.frame(x)
colnames(dfX) <- c("Empty")
sdfX <- createDataFrame(sqlContext, dfX)
str(sdfX)
'DataFrame': 1 variables:
$ Empty: logi NA NA NA NA NA

Accessing List Value by looking up value with same index

I was following this blog post:
https://www.robert-hickman.eu/post/dixon_coles_1/
And in a number of places, he gets a value from a list by putting in a value with the equivalent index, rather like using it as a key in a python dictionary. This would be an example in the link:
What I understand he's done is basically this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
example_list$values["a"]
But I get an NA value for this - am I missing something?
Thanks in advance!
The way a list works in R makes it impractical to address it like that, because the values in a list aren't associated with each other in that way.
This leads to the following:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
#Gives NULL
example_list[example_list$teams == "a"]$values
#Gives 1, 2, 3
example_list[example_list$teams == "b"]$values
#Gives NULL
example_list[example_list$teams == "c"]$values
You can see that this doesn't work, because the syntax you would expect to work in this case throws the error "incorrect number of dimensions":
example_list[example_list$teams == "b", ]$values
However, it is really easy to address a data frame, or any matrix-like structure, in the way you want to:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
#Gives 1
example_df[example_df$teams == "a", ]$values
#Gives 2
example_df[example_df$teams == "b", ]$values
#Gives 3
example_df[example_df$teams == "c", ]$values
What I think is happening in the tutorial you shared is something else. As far as I can see, no names are passed to the list, only indices held in variables. It is not looking up a value in a higher-dimensional structure, but simply indexing the list itself.
That also makes more sense, because that is exactly what the syntax does: teams[1] simply returns the first element of the list called teams (even if that element is a vector). Of course teams[i], where i is a variable, also works. What I mean is this:
teams = list(A = 1, B = 2, C = 3, D = 4)
#Gives $A (a one-element list containing 1)
teams[1]
If you want to understand why one of them works and the other doesn't, here are both together. Throw it into RStudio and look through the Environment pane.
## One dimensional
teams = list(A = "a", B = "very", C = "good", D = "example")
#Gives "very"
teams[2]
## Two dimensional
teams <- c("a","b","c")
values <- c(1,2,3)
teams2 <- list(teams, values)
#Gives a one-element list containing c("a", "b", "c")
teams2[1]
#Gives NULL
teams2[3]
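Incidentally, if the goal really is a Python-dictionary-style lookup by key, the idiomatic R tool is a named vector or named list rather than positional indexing; a small sketch:

```r
values <- c(a = 1, b = 2, c = 3)  # named vector

values["b"]    # named lookup, keeps the name attached to the result
values[["b"]]  # [[ drops the name and returns just the value, 2

# the same works for lists
teams <- list(a = 1, b = 2, c = 3)
teams[["b"]]   # 2
```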

Issue: class coercion in a filtered R data.table with ifelse / if_else / if...else

I am encountering problems with R data.table while converting a character variable to a numeric variable based on some conditions:
library(data.table)
DT1 <- data.table(a = "A", b = "B")
DT2 <- data.table(a = "A", b = "B")
DT1[a == "A", b := ifelse(b == "B", 1, 0)] #option 1: incorrect behavior
DT2[, b := ifelse(b == "B", 1, 0)] #option 2: correct behavior
Expected correct output:
a b
1: A (character) 1 (numeric)
However, with option 1 I am getting the following output (with a warning):
a b
1: A (character) 1 (character)
Warning message:
In [.data.table(DT1, a == "A", :=(b, ifelse(b == "B", 1, 0))) :
Coerced double RHS to character to match the type of the target column (column 2 named 'b'). If the target column's type character is correct, it's best for efficiency to avoid the coercion and create the RHS as type character. To achieve that consider R's type postfix: typeof(0L) vs typeof(0), and typeof(NA) vs typeof(NA_integer_) vs typeof(NA_real_). You can wrap the RHS with as.character() to avoid this warning, but that will still perform the coercion. If the target column's type is not correct, it's best to revisit where the DT was created and fix the column type there; e.g., by using colClasses= in fread(). Otherwise, you can change the column type now by plonking a new column (of the desired type) over the top of it; e.g. DT[, b:=as.double(b)]. If the RHS of := has nrow(DT) elements then the assignment is called a column plonk and is the way to change a column's type. Column types can be observed with sapply(DT,typeof).
Q: Can somebody explain to me why option 1 does not work? Does this seem like a bug to you?
Extras:
It is obviously also possible to do the following:
DT3 <- data.table(a = "A", b = "B")
DT3[, b := ifelse(a == "A" & b == "B", 1, 0)] #option 3: correct behavior
However, I prefer option 1 over option 3 because I would like to keep the variable logic & filter logic separate.
Note: the issue also arises when replacing ifelse with dplyr::if_else or base::if...else
Classes have a hierarchy: character is more general than numeric. If you assign a character value to (part of) a numeric vector, it's safe to convert the whole vector to character, because any numeric can be represented as a character string.
In this case, you assign a numeric to part of a character vector, and data.table has the option to either
(a) check the whole vector (column) to see if it's safe to convert to numeric (expensive, and perhaps unexpected and surprising to users)
(b) convert the numeric value to character.
My guess is that when you use DT1[a == "A", ...], the internals assume that you are assigning to only part of the vector, even when your condition happens to match every row. So data.table performs the efficient and safe (b) option above and converts your numeric to a character.
On the other hand, the syntax DT2[, b := ifelse(b == "B", 1, 0)] overwrites the entire b column - it doesn't matter what was there before, you're putting a numeric there now.
I think the real lesson is that, if you want to change the class of a column you should do it explicitly rather than relying on automatic conversion based on assigning new values to a part of the column.
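A minimal sketch of that explicit approach (assuming, as in the question, that the target type for b is numeric): plonk a column of the desired type over the old one first, then do the filtered update.

```r
library(data.table)

DT1 <- data.table(a = "A", b = "B")

# 1. change the column type explicitly (a whole-column "plonk")
DT1[, b := as.numeric(b == "B")]

# 2. filtered assignments are now type-consistent, no coercion warning
DT1[a == "A", b := ifelse(b == 1, 1, 0)]
```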

Search a list of columns for a set of values in R

I have a dataset where I am trying to check, by row, about 25 columns to see if they contain a value from a list. I am not having a problem referencing the list of values to search for, but I am having trouble searching multiple columns at once. I initially thought to create a list of columns to reference, but that doesn't seem to be working because subsetting can't use a list.
Right now, I am checking each column individually for a set of values, but I was hoping to do this with less code because I will want to reference this set of columns more than once while cleaning these data. This is what I am currently using:
Dx.Elem<-list(c("DX1", "DX2", "DX3", "DX4", "DX5", "DX6", "DX7", "DX8", "DX9", "DX10", "DX11", "DX12", "DX13", "DX14", "DX15", "DX16", "DX17", "DX18",
"DX19", "DX20", "DX21", "DX22", "DX23", "DX24", "DX25"))
Dx.Panc9<-list("86384", "86394", "86382", "86392", "86381", "86391", "86383", "86393")
mydata2$Panc9<-0
mydata2$Panc9[mydata2$DX1 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX2 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX3 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX4 %in% Dx.Panc9]<-1
The assignment of 1s actually goes to referencing mydata2$DX25, I just cut it off here to spare redundancy.
I have tried substituting a reference to the list of column names, but that doesn't work because subsetting can't use a list.
mydata2$Panc9[mydata2[, Dx.Elem] %in% Dx.Panc9]<-1
and I get this error
Error in .subset(x, j) : invalid subscript type 'list'
Is there a way to use a list to achieve what I am trying to achieve?
Thank you for any help.
For your specific case (note that the column index needs to be a character vector, so unlist Dx.Elem if you keep it as a list):
lapply(mydata2[unlist(Dx.Elem)], `%in%`, Dx.Panc9)
With some example data:
# create example data
set.seed(1234)
df <- data.frame(
x1 = round(runif(100, 1, 10)),
x2 = round(runif(100, 1, 10)),
x3 = round(runif(100, 1, 10)),
x4 = round(runif(100, 1, 10)),
x5 = round(runif(100, 1, 10))
)
# vector of numbers to search for (like Dx.Panc9)
numcheck <- c(2, 4)
# columns of data.frame in which to search (like Dx.Elem)
mycols <- c("x2", "x3", "x4", "x5")
# perform the check
result_list <- lapply(df[mycols], `%in%`, numcheck)
This returns a list where each element is a logical vector of length nrow(df). If your question is whether any column contains any of the desired numbers, you can do something like this:
result_df <- data.frame(result_list)
rowSums(result_df) > 0
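Putting the two steps together on self-contained example data (a sketch; `mycols` and `numcheck` stand in for the question's Dx.Elem and Dx.Panc9):

```r
set.seed(1)
df <- data.frame(x1 = sample(1:10, 8, replace = TRUE),
                 x2 = sample(1:10, 8, replace = TRUE),
                 x3 = sample(1:10, 8, replace = TRUE))

mycols   <- c("x1", "x2", "x3")  # columns to search
numcheck <- c(2, 4)              # values to search for

# one 0/1 flag per row: does any checked column hold any target value?
df$flag <- as.integer(rowSums(sapply(df[mycols], `%in%`, numcheck)) > 0)
```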

as.formula does not like equivalence '=' (object not found)

Consider the following example:
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
transform(df1,c=a+b)
a b c
1 1 2 3
2 2 4 6
3 3 6 9
So far, so good. Now I would like to code this dynamically, using as.formula:
transform(df1,as.formula("c=a+b"))
However, R says
Error in eval(expr, envir, enclos) : object 'b' not found
This error does not occur using "~" as separator of left hand and right hand side. Can I somehow delay the evaluation of the formula? Is it possible at all to use as.formula on an assignment? I have tried fiddling around with 'with' but to no avail.
I've solved the problem you mentioned in your comment, since that seems to be your real goal. This avoids messing with the formulae from your original question.
A reproducible version of your dataset.
group_names <- apply(
expand.grid("X", c("X", "O", "Y"), c("A", "B", "C"), "_", 0:9, 0:9),
1,
paste,
collapse = ""
)
n_groups <- 50
n_points_per_group <- 10
df1 <- as.data.frame(matrix(
runif(n_points_per_group * n_groups),
ncol = n_groups
))
colnames(df1) <- sample(group_names, n_groups)
Now convert the data frame to long format. (Using the reshape package here; you could also use stats::reshape.)
library(reshape)
melted_df1 <- melt(df1)
Define a grouping based upon your criteria that the second character and the number match.
melted_df1$group <- with(melted_df1, paste(
substring(variable, 2, 2),
substring(variable, 5, 6),
sep = ""
))
Now call tapply (or plyr::ddply if you prefer) to get the summary stats.
with(melted_df1, tapply(value, group, mean))
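As for the original as.formula question: in transform(df1, c = a + b), the `=` is an argument name, not an equation, so there is no formula to build. One sketch for doing it dynamically (not from the answer above) is to evaluate a parsed expression with the data frame's columns in scope:

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))

# build the right-hand side as text, then evaluate it inside the data frame
rhs <- "a + b"
df1[["c"]] <- eval(parse(text = rhs), envir = df1)
df1
#   a b c
# 1 1 2 3
# 2 2 4 6
# 3 3 6 9
```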

R: How to replace elements of a data.frame?

I'm trying to replace elements of a data.frame containing "#N/A" with "NULL", and I'm running into problems:
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
indices_of_NAs <- which(foo == "#N/A")
replace(foo, indices_of_NAs, "NULL")
Error in [<-.data.frame(*tmp*, list, value = "NULL") :
new columns would leave holes after existing columns
I think the problem is that my index is treating the data.frame as a vector, while the replace function treats it differently somehow, but I'm not sure what the issue is.
NULL really means "nothing", not "missing", so it cannot take the place of an actual value; for missing values R uses NA.
You can use the replacement method of is.na to directly update the selected elements; this works with a logical index. (Using which for indices will only work via the is.na replacement; direct use of [ on a data.frame invokes list access, which is the cause of your error.)
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
NAs <- foo == "#N/A"
## by replace method
is.na(foo)[NAs] <- TRUE
## or directly
foo[NAs] <- NA
But you are already dealing with strings (actually a factor, by default) in your od column, through forced coercion when it was created with c(), and you might need to treat columns individually. A numeric column will never match the string "#N/A", for example.
Why not
x$col[is.na(x$col)] <- value
? You won't have to change your data frame.
The replace function expects a vector and you're supplying a data.frame.
You should really try to use NA and NULL instead of the character values that you're currently using. Otherwise you won't be able to take advantage of all of R's functionality to handle missing values.
Edit
You could use an apply function, or do something like this:
foo <- data.frame(day= c(1, 3, 5, 7), od = c(0.1, NA, 0.4, 0.8))
idx <- which(is.na(foo), arr.ind=TRUE)
foo[idx[1], idx[2]] <- "NULL"
You cannot assign a real NULL value in this case, because it has length zero. It is important to understand the difference between NA and NULL, so I recommend that you read ?NA and ?NULL.
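A few one-liners that make the NA/NULL distinction concrete:

```r
length(NULL)   # 0 - NULL is "nothing at all"
length(NA)     # 1 - NA is a missing value that still occupies a slot

c(1, NULL, 3)  # length 2: NULL simply vanishes from the vector
c(1, NA, 3)    # length 3: NA is kept as a placeholder
```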
