I'm trying to replace elements of a data.frame containing "#N/A" with "NULL", and I'm running into problems:
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
indices_of_NAs <- which(foo == "#N/A")
replace(foo, indices_of_NAs, "NULL")
Error in `[<-.data.frame`(`*tmp*`, list, value = "NULL") :
  new columns would leave holes after existing columns
I think the problem is that my index treats the data.frame as a vector while the replace function treats it differently, but I'm not sure what the actual issue is.
NULL really means "nothing", not "missing", so it cannot take the place of an actual value. For missing values, R uses NA.
You can use the replacement method of is.na to directly update the selected elements; this works with a logical result. (Using which for indices will only work with is.na; direct use of [ invokes list access, which is the cause of your error.)
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
NAs <- foo == "#N/A"
## by replace method
is.na(foo)[NAs] <- TRUE
## or directly
foo[NAs] <- NA
But you are already dealing with strings in your od column (a factor by default in R versions before 4.0.0), because c() coerced the mixed values to character when the column was created, so you might need to treat columns individually. A numeric column will never match the string "#N/A", for example.
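A minimal sketch of the column-by-column approach, assuming od is meant to end up numeric. stringsAsFactors = FALSE is set explicitly here (an assumption, not part of the original code) so the sketch behaves the same on older R versions:

```r
foo <- data.frame(day = c(1, 3, 5, 7),
                  od = c(0.1, "#N/A", 0.4, 0.8),
                  stringsAsFactors = FALSE)

# od is character because c() coerced the mixed values to strings
foo$od[foo$od == "#N/A"] <- NA   # mark the placeholder as missing
foo$od <- as.numeric(foo$od)     # the remaining strings convert cleanly

str(foo)
```

Restricting the comparison to the one column that can contain the marker avoids comparing numeric columns against a string at all.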
Why not
x$col[is.na(x$col)]<-value
?
You won't have to change your data frame.
The replace function expects a vector and you're supplying a data.frame.
You should really try to use NA and NULL instead of the character values that you're currently using. Otherwise you won't be able to take advantage of all of R's functionality to handle missing values.
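To illustrate the point about replace, here is a small sketch in which the function is handed a single column (a vector) instead of the whole data.frame. The explicit stringsAsFactors = FALSE is an assumption added so the column is character on any R version:

```r
foo <- data.frame(day = c(1, 3, 5, 7),
                  od = c(0.1, "#N/A", 0.4, 0.8),
                  stringsAsFactors = FALSE)

# replace() takes a vector, an index (here a logical vector), and the new value
foo$od <- replace(foo$od, foo$od == "#N/A", NA)

is.na(foo$od)
```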
Edit
You could use an apply function, or do something like this:
foo <- data.frame(day= c(1, 3, 5, 7), od = c(0.1, NA, 0.4, 0.8))
idx <- which(is.na(foo), arr.ind=TRUE)
foo[idx] <- "NULL"  # idx is a two-column (row, col) matrix, so every NA cell is replaced
You cannot assign a real NULL value in this case, because it has length zero. It is important to understand the difference between NA and NULL, so I recommend that you read ?NA and ?NULL.
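The length difference can be checked directly. This is only a small sketch; with_null and with_na are names made up for the illustration:

```r
length(NULL)   # NULL holds no element at all
length(NA)     # NA is a placeholder that occupies a slot

with_null <- c(1, NULL, 3)   # NULL vanishes when combined
with_na   <- c(1, NA, 3)     # NA keeps its position

length(with_null)
length(with_na)
is.na(with_na)
```

Because NULL disappears when combined, it cannot mark a missing cell inside a column, whereas NA can.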
Related
I wish to give names to the values in a vector. I know how to do that, but in this case I have many names and many values, both within vectors within lists, and typing them by hand would be suicide.
This method:
> values <- c('jessica' = 1, 'jones' = 2)
> values
jessica   jones 
      1       2 
obviously works. However, this method:
> names <- c('jessica', 'jones')
> values <- c(names[1] = 1, names[2] = 2)
Error: unexpected '=' in "values <- c(names[1] ="
Well... I cannot understand why R refuses to read these as pure characters to assign them as names.
I realize I can create values and names separately and then assign names as names(values) but again, my actual case is far more complex. But really I would just like to know why this particular issue occurs.
EDIT I: The ACTUAL data I have is a list of vectors, each is a different combination of amounts of ingredients, and then a giant vector of ingredient names. I cannot just set the name vector as names, because the individual names need to be placed by hand.
EDIT II: Example of my data structure.
ingredients <- c('ing1', 'ing2', 'ing3', 'ing4') # this vector is much longer in reality
amounts <- list(c('ing1' = 1, 'ing2' = 2, 'ing4' = 3),
c('ing2' = 2, 'ing3' = 3),
c('ing1' = 12, 'ing2' = 4, 'ing3' = 3),
c('ing1' = 1, 'ing2' = 1, 'ing3' = 2, 'ing4' = 5))
# this list too is much longer
I could type each numeric value's name individually as presented, but there are many more, and so I tried instead to input the likes of:
c(ingredients[1] = 1, ingredients[2] = 2, ingredients[4] = 3)
But this throws an error:
Error: unexpected '=' in "amounts <- list(c(ingredients[1] ="
We can use setNames
setNames(1:2, names)
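As for why the original attempt fails: in a call such as c(a = 1), whatever sits to the left of = is parsed as an argument name and is never evaluated, so it has to be a plain name or a string, not an expression like names[1]. A sketch of setNames applied to the ingredient data from the question follows; recipe_values and recipe_index are hypothetical helper lists introduced only for this illustration:

```r
ingredients <- c('ing1', 'ing2', 'ing3', 'ing4')

# names are looked up by position in 'ingredients' instead of being typed out
first_amounts <- setNames(c(1, 2, 3), ingredients[c(1, 2, 4)])
first_amounts

# the same idea for several recipes at once (hypothetical helper lists)
recipe_values <- list(c(1, 2, 3), c(2, 3))
recipe_index  <- list(c(1, 2, 4), c(2, 3))
amounts <- Map(function(v, i) setNames(v, ingredients[i]),
               recipe_values, recipe_index)
amounts
```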
Another option is deframe if we have a two column dataset
library(tibble)
tibble(names, val = 1:2) %>%
deframe
I have more of a conceptual question. I am looking for a way of deleting the entire row out of a dataframe if it contains a reference to data that doesn't exist in a second dataframe. The code below will produce you a data set for this problem.
v1 <- c(1, 2, 3, 4, 5, 6, 8)
v2 <- 100
nodedf <- data.frame(v1, v2)
colnames(nodedf) <- c("nid", "extra_variable")
v3 <- c(1, 2)
v4 <- c(1, 5)
v5 <- c(2, 6)
v6 <- c(3, 7)
v7 <- c(4, 9)
elementdf <- data.frame(v3, v4, v5, v6, v7)
colnames(elementdf) <- c("eid", "n1", "n2", "n3", "n4")
Basically, I want any row from elementdf deleted if it references a node ID (n1, n2, n3, n4) that does not exist in nodedf. I know it's probably a rather simple problem, but I am really not so great at this kind of stuff. Thanks.
EDIT: now I am looking to do the reverse, where I want to delete rows of nodedf that make reference to nodes that do not exist in elementdf.
At first I tried to just re-arrange the old code chunk like so:
orphannodesbye<- nodedf[apply(nodedf[,1], 1, function(x) all(x %in% elementdf[,2:5])),]
However, I get an error message:
Error in apply(nodedf[, 1], 1, function(x) all(x %in% elementdf[, 2:5])) :
dim(X) must have a positive length
I would like the output to be the whole df with both fields (or more, as my actual dataset has more) nid and extra_variable.
Here's a base R solution
elementdf[apply(elementdf[,-1], 1, function(x) all(x %in% nodedf$nid)),]
Explanation:
The apply works by "applying" a function (a custom one in this case) to each row (the variable x in the function) of the object elementdf. If we wanted to do this by columns we would change the 1 to a 2.
The function we are using looks at each element in x (a row in elementdf) and tests whether it is also in nodedf$nid. %in% is a binary operator that returns a logical vector with one element for each element of x. The all function returns TRUE if all elements are TRUE (meaning all of them are in nodedf) and FALSE otherwise.
So in the end, the apply statement will return a vector of logicals, depending on whether each row has elements found in nodedf.
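To make the intermediate step visible, here is a sketch that rebuilds the example data from the question and stores the logical vector before using it as a row index. The name keep is introduced only for this illustration:

```r
nodedf <- data.frame(nid = c(1, 2, 3, 4, 5, 6, 8), extra_variable = 100)
elementdf <- data.frame(eid = c(1, 2), n1 = c(1, 5), n2 = c(2, 6),
                        n3 = c(3, 7), n4 = c(4, 9))

# one TRUE/FALSE per row of elementdf: are all of its node ids known?
keep <- apply(elementdf[, -1], 1, function(x) all(x %in% nodedf$nid))
keep

# use the logical vector as a row index
elementdf[keep, ]
```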
To get the values in each row that are not in nodedf, you could do
apply(elementdf[,-1], 1, function(x) x[!(x %in% nodedf$nid)])
which you'll notice is already pretty similar to the line of code above, except that in this case the apply statement will return a list. From the example you gave, it will be a list of length 2 where the first element is numeric(0) and the second element is a vector containing 7 and 9. If you have multiple offenders in one row, each will be shown.
To remove the rows in nodedf which do not have references in elementdf, you could do
nodedf[nodedf$nid %in% unique(unlist(elementdf[,-1])),]
The unique(unlist(...)) part just grabs all the unique values in elementdf[,-1], converting them to a numeric vector.
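The error in the rearranged apply call comes from nodedf[, 1]: extracting a single column drops the result to a plain vector, which has no dim for apply to iterate over. Since %in% is already vectorised, apply is not needed here. A sketch using the example data from the question, where referenced and orphan_free are names made up for the illustration:

```r
nodedf <- data.frame(nid = c(1, 2, 3, 4, 5, 6, 8), extra_variable = 100)
elementdf <- data.frame(eid = c(1, 2), n1 = c(1, 5), n2 = c(2, 6),
                        n3 = c(3, 7), n4 = c(4, 9))

dim(nodedf[, 1])   # NULL: single-column extraction drops to a vector

# every node id mentioned anywhere in n1..n4
referenced <- unique(unlist(elementdf[, -1]))

# keep whole rows of nodedf (all columns) whose nid is referenced
orphan_free <- nodedf[nodedf$nid %in% referenced, ]
orphan_free
```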
I have a dataset where I am trying to, by row, check about 25 columns to see if they contain a value from a list. I am not having a problem referencing the list of values to search for, but I am having trouble searching multiple columns at once. I initially thought to create a list of columns to reference, but that doesn't seem to be working because you can't use a list.
Right now, I am checking each column individually for a set of values, but I was hoping to do this with less code because I will want to reference this set of columns more than once while cleaning these data. This is what I am currently using:
Dx.Elem<-list(c("DX1", "DX2", "DX3", "DX4", "DX5", "DX6", "DX7", "DX8", "DX9", "DX10", "DX11", "DX12", "DX13", "DX14", "DX15", "DX16", "DX17", "DX18",
"DX19", "DX20", "DX21", "DX22", "DX23", "DX24", "DX25"))
Dx.Panc9<-list("86384", "86394", "86382", "86392", "86381", "86391", "86383", "86393")
mydata2$Panc9<-0
mydata2$Panc9[mydata2$DX1 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX2 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX3 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX4 %in% Dx.Panc9]<-1
The assignment of 1s actually continues through mydata2$DX25; I just cut it off here to avoid redundancy.
I have tried substituting referencing a list, but that doesn't work because it can't use a list.
mydata2$Panc9[mydata2[, Dx.Elem] %in% Dx.Panc9]<-1
and I get this error
Error in .subset(x, j) : invalid subscript type 'list'
Is there a way to use a list to achieve what I am trying to achieve?
Thank you for any help.
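For your specific case (with Dx.Elem defined as a plain character vector via c(...), not wrapped in list(), since a list cannot be used as a column subscript):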
lapply(mydata2[Dx.Elem], `%in%`, Dx.Panc9)
With some example data:
# create example data
set.seed(1234)
df <- data.frame(
x1 = round(runif(100, 1, 10)),
x2 = round(runif(100, 1, 10)),
x3 = round(runif(100, 1, 10)),
x4 = round(runif(100, 1, 10)),
x5 = round(runif(100, 1, 10))
)
# vector of numbers to search for (like Dx.Panc9)
numcheck <- c(2, 4)
# columns of data.frame in which to search (like Dx.Elem)
mycols <- c("x2", "x3", "x4", "x5")
# perform the check
result_list <- lapply(df[mycols], `%in%`, numcheck)
This returns a list where each element is a logical vector of length nrow(df). If your question is whether, for each row, any column contains any of the desired numbers, you can do something like this:
result_df <- data.frame(result_list)
rowSums(result_df) > 0
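Put together for the diagnosis-code case, the approach might look like the sketch below. The data frame toy is a made-up stand-in for mydata2 with only three DX columns, and dx_cols and dx_codes are illustrative names holding character vectors rather than lists:

```r
# small deterministic stand-in for mydata2; the values are made up
toy <- data.frame(DX1 = c("86384", "11111", "22222"),
                  DX2 = c("33333", "44444", "86391"),
                  DX3 = c("55555", "66666", "77777"),
                  stringsAsFactors = FALSE)
dx_cols  <- c("DX1", "DX2", "DX3")   # character vector, not a list
dx_codes <- c("86384", "86391")

# one logical vector per column, then flag rows with at least one hit
hits <- lapply(toy[dx_cols], `%in%`, dx_codes)
toy$Panc9 <- as.integer(rowSums(data.frame(hits)) > 0)
toy$Panc9
```

Keeping the column names in one vector means the same set can be reused for other code lists without repeating the per-column assignments.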
I am trying to append a column of null values to a SparkR DataFrame with the following code:
w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")
x <- rbind(3, 3, 3, 3, 3, 3)
d <- cbind.data.frame(w, z, x)
B <- as.DataFrame(sqlContext, d)
B1 <- sample(B, withReplacement = FALSE, fraction = 0.5)
B2 <- except(B, B1)
col_sub <- c("z", "x")
B2 <- select(B2, col_sub)
B2 <- withColumn(B2, "w", lit(NA))
But, the last expression returns the error: Error in FUN(X[[i]], ...) : Unsupported data type: null. I have used the lit operation to produce a column of null values before, but I'm not sure why it won't work this time.
Also, this has been discussed on SE before, see this question. I'm completely clueless as to why my expression yields that error. For reference, I'm using SparkR 1.6.1.
Whether or not it works, adding a column this way is not good practice. Since the only practical reason to add a column that contains only undefined values is to enforce a specific schema for unions or external writes, you should always use columns of a specific type.
For example:
withColumn(B2, "w", cast(lit(NULL), "double"))
Spark columns can have types numeric, character. My understanding is that it is intended that columns of other data types are illegal.
NA is not recognized by SparkR in the same way that R recognizes it as being an indicator of a missing value. SparkR sees NA as being a value of type logical. For example:
dtypes(NA)
unable to find an inherited method for function ‘dtypes’ for signature ‘"logical"’
If you try to add a column of NA's, Spark tries to create a column of type logical, which is not a valid data type for a column. Hence the error.
There are a couple of places where SparkR (1.6.2) is inconsistent in trapping errors around creating illegal column types. As you found, SparkR throws an error if you use lit(NA), but SparkR will let you convert an R data.frame with a column of NAs and it successfully creates an illegal column of type "logical"
x <- c(NA,NA,NA, NA, NA)
dfX <- data.frame(x)
colnames(dfX) <- c("Empty")
sdfX <- createDataFrame(sqlContext, dfX)
str(sdfX)
'DataFrame': 1 variables:
$ Empty: logi NA NA NA NA NA
I want to make the value of a list element equal to another list, like so...
list_one <- as.list(c(A = NA, B = NA))
list_two <- as.list(c(C = 4, D = 5))
list_one['A'] <- list_two
This is throwing the following warning:
Warning message:
In list_one["A"] <- list_two :
number of items to replace is not a multiple of replacement length
How do I properly make list_two a sub-list of list_one so that I don't get this warning?
Use [[ instead of [
list_one[['A']] <- list_two
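The reason is that single brackets replace a sub-list, so R tries to spread the two elements of list_two over the one selected slot and warns, while double brackets address the single element itself. A short sketch of both forms; list_three is a name added only for the comparison:

```r
list_one <- as.list(c(A = NA, B = NA))
list_two <- as.list(c(C = 4, D = 5))

# [[ ]] targets the single element "A", so the whole of list_two is stored there
list_one[['A']] <- list_two
str(list_one)

# with single brackets the replacement must itself be a list of length one
list_three <- as.list(c(A = NA, B = NA))
list_three['A'] <- list(list_two)

identical(list_one, list_three)
```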