R typecast NA vs string

I had a problem with some of my code and fixed it, but I don't fully understand why the original version failed.
The code looked like:
for(i in 1:3){df = rbind.fill(z, data.frame(id=i,
data=if(is.null(x$results[[i]]$synopsis$data))
{NA}else{x$results[[i]]$synopsis$data}))}
The issue was this: if the first data value was indeed NULL, I would get NA, but for the second and third rows I would get either another NA or, where there was data, 1 instead of the data.
If the first value was data, then I would get the data, and for the other two I would get either NA or the correct data.
I'm not a computer scientist, but a dev who sits near me (but doesn't know R) suggested it had something to do with different typecasts of NA and a string. To solve the issue I changed the NA to "0" (I suppose "NA" would work too).
I'd just like a more thorough explanation of what was happening. My layman's understanding is that if NA is the first result, then every result is in that "format", where something is either NA or not, and "not" is handled as 1, which is kind of like a Boolean response?
Example:
my.list <- list(list(),structure(
list(
experience = structure(
list(
start = "Hi"
),.Names = c("start")),
`_meta` = structure(
list(weight = 1L, `_sources` = list(structure(
list(`_origin` = "a"), .Names = "_origin"
))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))
my.list[[1]]$experience$start
NULL
my.list[[2]]$experience$start
[1] "Hi"
df <- NULL
for(i in 1:2){df = rbind.fill(df, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start))
{NA}else{my.list[[i]]$experience$start}))}
Then
df2 <- NULL
for(i in 1:2){df2 = rbind.fill(df2, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start))
{"NA"}else{my.list[[i]]$experience$start}))}
Results:
df:
id data
1  NA
2  1
df2:
id data
1  NA
2  Hi

Olivia, thanks for clarifications.
You are nearly there. As you loop, indeed the first iteration will determine the class of the column data of your output data.frame df.
In scenario 1, you can have a better idea by going through the loop step by step:
df <- NULL
i=1
df = rbind.fill(df, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start)) {NA}
else{my.list[[i]]$experience$start}))
df
id data
1 1 NA
Then have a look at the class of df$data:
class(df$data)
[1] "logical"
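This is derived from mode(NA), which is "logical". The 1 in the second row is then most likely the integer code of the factor that data.frame() creates from "Hi" when stringsAsFactors = TRUE, pushed into a column that was not set up to hold strings.
As an alternative, when you store the data from your set of experiments in a list, you should try to manipulate this list in an "R-ish" way.
For instance, you can try: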
sapply(my.list, FUN=function(element)element$experience$start)
[[1]]
NULL
[[2]]
[1] "Hi"
This highlights that you are trying to gather incompatible contents: the result cannot be simplified beyond this list, and if you unlist it you drop the meaningful NULL.
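One way to keep the column type stable, sketched here under the assumption that every non-missing value is a single string, is to use the typed missing value NA_character_ instead of plain NA and to build the column in one go with vapply(). The shortened my.list below is a stand-in for the one in the question.

```r
# Shortened stand-in for the list in the question
my.list <- list(list(), list(experience = list(start = "Hi")))

class(NA)             # "logical"
class(NA_character_)  # "character"

# Pull out one string per element, using a character NA where it is missing
start <- vapply(
  my.list,
  function(element) {
    value <- element$experience$start
    if (is.null(value)) NA_character_ else value
  },
  character(1)
)

df <- data.frame(id = seq_along(my.list), data = start,
                 stringsAsFactors = FALSE)
df
```

Because the column is character from the first row onwards, nothing has to be converted when the second row arrives.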

Related

Getting last character/number of data frame column

I'm trying to get the last character or number of a series of symbols in a data frame column so I can filter some categories afterwards. But I'm not getting the expected result.
names = as.character(c("ABC Co","DEF Co","XYZ Co"))
code = as.character(c("ABCN1","DEFMO2","XYZIOIP4")) #variable length
my_df = as.data.frame(cbind(names,code))
First Approach:
my_df[,3] = substr(my_df[,2],length(my_df[,2]),length(my_df[,2]))
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("C","F","Z")
Then, I realized that length(my_df[,2]) is the number of rows of my data frame, and not the length of each cell. So, I decided to create this loop:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],length(my_df[i,2]),length(my_df[i,2]))
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("A","F","Z")
So then I tried:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],-1,-1)
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("","F","Z")
I'm not having any luck. Any thoughts on what I am missing? Thank you very much!
length is a vector (or list) property, whereas in substr you probably need a string property. Base R's nchar works.
my_df = as.data.frame(cbind(names, code), stringsAsFactors = FALSE)
substr(my_df[,2], nchar(my_df[,2]), nchar(my_df[,2]))
# [1] "1" "2" "4"
(I added stringsAsFactors = FALSE, otherwise you'll need to add as.character.)
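To see the difference concretely, here is a small sketch on the question's own codes. As a side note, it also shows why the loops in the question only ever changed the first row: length(nrow(my_df)) is always 1, so the loop body runs once with i = 1.

```r
code <- c("ABCN1", "DEFMO2", "XYZIOIP4")

length(code)  # number of elements in the vector: 3
nchar(code)   # number of characters in each string: 5 6 8

# nrow() returns a single number, so its length is always 1
length(nrow(data.frame(code)))

# substr() is vectorised, so no loop is needed
last_char <- substr(code, nchar(code), nchar(code))
last_char
```

If a loop is really wanted, seq_len(nrow(my_df)) iterates over every row.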
If the last character is always a number you can do:
library(stringr)
str_extract(my_df$code, "\\d$")
[1] "1" "2" "4"
If the last character can be anything you can do this:
str_extract(my_df$code, ".$")
You can use substr:
my_df$last_char <- substr(code, nchar(code), nchar(code))
# or my_df$last_char <- substr(my_df$code, nchar(my_df$code), nchar(my_df$code))
Output
my_df
# names code last_char
# 1 ABC Co ABCN1 1
# 2 DEF Co DEFMO2 2
# 3 XYZ Co XYZIOIP4 4
We can use sub
sub(".*(\\d+$)", "\\1", my_df$code)

Complex sorting by single column in R

I am trying to sort a data frame by codes contained in one column.
The logic behind these codes is:
S/number/number/number/letter (e.g. S120B). There are not always three digits (e.g. S10K) and the letter is not always present (e.g. S2).
The first code is S1, and the list goes until S999, where it turns to S1A. Then it goes to S999A and then turns to S1B, and so on.
Furthermore, there are also codes in there that are totally different, such as W23, E100, etc., which should go together.
How can I order the dataframe according to this pretty sick ordering scheme?
MWE: codes <- c("S1", "S20D", "S550C", "S88A", "S420K", "E44", "W22")
Following your directions, this is a customized function:
codes <- c("S1", "S20D", "E44", "S550C", "S88A", "S420K", "W22")
complex_order <- function(codes) {
  # Codes that do not match the S convention go last, in their original order
  not_in_convention <- tolower(substr(codes, 1, 1)) != "s"
  # Then check the ones that have a letter at the end
  letter_at_end <- tolower(substr(codes, nchar(codes), nchar(codes))) %in% letters & !not_in_convention
  for (idx in which(letter_at_end)) {
    lettr <- tolower(substr(codes[idx], nchar(codes[idx]), nchar(codes[idx])))
    lettr_value <- which(lettr == letters) * 1000 # Every letter means 1000 positions ahead
    # Keep the numeric part too, so that e.g. S20A sorts before S88A
    number <- as.numeric(substr(codes[idx], 2, nchar(codes[idx]) - 1))
    codes[idx] <- paste0("S", lettr_value + number)
  }
  # Now that all S codes are plain numbers, order them by value
  values <- as.numeric(substr(codes[!not_in_convention], 2, nchar(codes[!not_in_convention])))
  c(which(!not_in_convention)[order(values)], which(not_in_convention))
}
codes[complex_order(codes)]
[1] "S1" "S88A" "S550C" "S20D" "S420K" "E44" "W22"
Hope it helps!
1. Create a minimal reproducible example ;)
mre <- data.frame(ID = c("S1", "S20D", "S550C", "S88A", "S420K", "E44", "W22"),
stringsAsFactors = FALSE)
Now, I am not sure what you mean by:
Furthermore, there are also codes in there that are totally different, such as W23, E100, etc., which should go together.
If you mean that "W23" should be read and sorted totally differently from "S999", we need some additional information on how to distinguish between the two cases. Otherwise this should work:
2. Suggested solution: alphabetical sorting
library(dplyr)
mre %>%
arrange(ID)
ID
1 E44
2 S1
3 S20D
4 S420K
5 S550C
6 S88A
7 W22
Or using only base R:
mre[order(mre$ID),]
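The ordering rule described in the question can also be sketched with several sort keys passed to order(). This is only one reading of the rule: the S codes come first, ordered by letter block (no letter, then A, B, ...) and then by number, and all other codes are grouped at the end alphabetically. The pattern and the helper names (is_s, number, letter_rank) are assumptions made for illustration.

```r
codes <- c("S1", "S20D", "E44", "S550C", "S88A", "S420K", "W22")

# TRUE for codes of the form S<digits><optional capital letter>
is_s <- grepl("^S[0-9]+[A-Z]?$", codes)

# Numeric part and letter rank (none = 0, A = 1, B = 2, ...) for the S codes
number <- rep(0, length(codes))
letter_rank <- rep(0, length(codes))
number[is_s] <- as.numeric(gsub("[^0-9]", "", codes[is_s]))
letter_rank[is_s] <- match(sub("^S[0-9]+", "", codes[is_s]), LETTERS, nomatch = 0)

# S codes first, then by letter block, then by number;
# the remaining codes fall back to alphabetical order
sorted <- codes[order(!is_s, letter_rank, number, codes)]
sorted
```

For a data frame, the same order() call can be used as the row index, e.g. mre[order(...), , drop = FALSE].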

Extraction operator `$`() returns zero-length vectors within function

I am encountering an issue when I use the extraction operator `$`() inside of a function. The problem does not exist if I follow the same logic outside of the function, so I assume there might be a scoping issue that I'm unaware of.
The general setup:
## Make some fake data for your reproducible needs.
set.seed(2345)
my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
cat_2 = sample(c("c", "d"), 100, replace = TRUE),
continuous = rnorm(100),
stringsAsFactors = FALSE)
head(my_df)
This process I am trying to dynamically reproduce:
index <- which(`$`(my_df, "cat_1") == "a")
my_df$continuous[index]
But once I program this logic into a function, it fails:
## Function should take a string for the following:
## cat_var - string with the categorical variable name as it appears in df
## level - a level of cat_var appearing in df
## df - data frame to operate on. Function assumes it has a column
## "continuous".
extract_sample <- function(cat_var, level, df = my_df) {
index <- which(`$`(df, cat_var) == level)
df$continuous[index]
}
## Does not work.
extract_sample(cat_var = "cat_1", level = "a")
This is returning numeric(0). Any thoughts on what I'm missing? Alternative approaches are welcome as well.
The problem isn't the function, it's the way $ handles the input.
cat_var = "cat_1"
length(`$`(my_df,"cat_1"))
#> [1] 100
length(`$`(my_df,cat_var))
#> [1] 0
You can instead use [[ to achieve your desired outcome.
cat_var = "cat_1"
length(`[[`(my_df,"cat_1"))
#> [1] 100
length(`[[`(my_df,cat_var))
#> [1] 100
UPDATE
It's been noted that using [[ this way is ugly. And it is. It's useful when you want to write something like lapply(stuff,'[[',1)
Here, you should probably be writing it as my_df[[cat_var]].
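Applied to the function from the question, that change might look like the sketch below. The data setup is copied from the question; only the indexing line differs.

```r
set.seed(2345)
my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
                    cat_2 = sample(c("c", "d"), 100, replace = TRUE),
                    continuous = rnorm(100),
                    stringsAsFactors = FALSE)

extract_sample <- function(cat_var, level, df = my_df) {
  # [[ evaluates cat_var, so the column is looked up by the string it holds
  index <- which(df[[cat_var]] == level)
  df$continuous[index]
}

res <- extract_sample(cat_var = "cat_1", level = "a")
length(res)
```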
Also, this question/answer goes into a little more detail about why $ doesn't work the way you want it to.
The problem is that the $ is non-standard, in the sense that when you don't quote the parameter input, it still tries to parse it and use what you typed, even if that was meant to refer to another variable.
Or more simply, as #42 put it in the first comment in the linked question:
The "$" function does not evaluate its arguments, whereas "[[" does.
Here's a much simpler data set as an example.
my_df <- data.frame(a=c(1,2))
v <- "a"
Compare the usual usage: the first two give the same result, because if you don't quote the name, $ still takes what you typed literally. So the third one clearly can't work, since it looks for a column named v.
my_df$"a"
## [1] 1 2
my_df$a
## [1] 1 2
my_df$v
## NULL
That's exactly what's happening to you:
`$`(my_df, "a")
## [1] 1 2
`$`(my_df, v)
## NULL
Instead we need to evaluate v before sending to $ by using do.call.
do.call(`$`, list(my_df, v))
## [1] 1 2
Or, more appropriately, use the [[ version which does evaluate the parameters first.
`[[`(my_df, v)
## [1] 1 2
The problem lies in the way you are indexing the column. This works with just a slight tweak to yours:
extract_sample <- function(cat_var, level, df = my_df) {
index <- df[, cat_var] == level
df$continuous[index]
}
Using it dynamically:
> extract_sample(cat_var = "cat_2", level = "d")
[1] -0.42769207 -0.75650031 0.64077840 -1.02986889 1.34800344 0.70258431 1.25193247
[8] -0.62892048 0.48822673 0.10432070 1.11986063 -0.88222370 0.39158408 1.39553002
[15] -0.51464283 -1.05265106 0.58391650 0.10555913 0.16277385 -0.55387829 -1.07822831
[22] -1.23894422 -2.32291394 0.11118881 0.34410388 0.07097271 1.00036812 -2.01981056
[29] 0.63417799 -0.53008375 1.16633422 -0.57130500 0.61614135 1.06768285 0.74182293
[36] 0.56538633 0.16784205 -0.14757303 -0.70928924 -1.91557732 0.61471302 -2.80741967
[43] 0.40552376 -1.88020372 -0.38821089 -0.42043745 1.87370600 -0.46198139 0.10788358
[50] -1.83945868 -0.11052531 -0.38743950 0.68110902 -1.48026285

Comparing two lists [R]

I have two rather long lists (both are 232000 rows). When trying to run analyses using both, R gives me an error that some elements in one list are not in the other (for a particular piece of code to run, both lists need to be exactly the same). I have done the following to try to decipher this:
#In Both
both <- varss %in% varsg
length(both)
#What is in Both
int <- intersect(varss,varsg)
length(int)
#What is different in varss
difs <- setdiff(varss,varsg)
length(difs)
#What is different in varsg
difg <- setdiff(varsg,varss)
length(difg)
I think I have the code right, but my problem is that the results from the code above are not yielding what I need. For instance, for both <- varss %in% varsg I only get a single FALSE. Do both my lists need to be in a specific class in order for this to work? I've tried data.frame, list and character. Not sure whether anything major like a function needs to be applied.
Just to give a little bit more information about my lists, both are a list of SNP names (genetic data)
Edit:
I loaded these two files with readRDS() and am not sure whether this might be causing some problems. When I just use varss[1:10,] I get the following output:
[1] rs41531144 rs41323649 exm2263307 rs41528348 exm2216184 rs3901846
[7] exm2216185 exm2216186 exm2216191 exm2216198
232334 Levels: exm1000006 exm1000025 exm1000032 exm1000038 ... rs9990343
I have little experience with RData files, so not sure whether this is a problem or not...
Same happens with using varsg[1:10,] :
[1] exm2268640 exm41 exm1916089 exm44 exm46 exm47
[7] exm51 exm53 exm55 exm56
232334 Levels: exm1000006 exm1000025 exm1000032 exm1000038 ... rs999943
None of the functions you have shown play well with lists or data.frames, e.g.:
varss <- list(a = 1:8)
varsg <- list(a = 2:9)
both <- varss %in% varsg
both
# [1] FALSE
#What is in Both
int <- intersect(varss,varsg)
int
# list()
#What is different in varss
difs <- setdiff(varss,varsg)
difs
# [[1]]
# [1] 1 2 3 4 5 6 7 8
#What is different in varsg
difg <- setdiff(varsg,varss)
difg
# [[1]]
# [1] 2 3 4 5 6 7 8 9
I suggest you switch to vectors by doing:
varss <- unlist(varss)
varsg <- unlist(varsg)
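As a sketch of what the comparison looks like once both objects are plain vectors: the printed Levels in the question suggest that varss and varsg are one-column data frames of factors, so the example below assumes that shape and converts the column to character. The column name snp and the handful of SNP names (picked from the question's output) are for illustration only.

```r
# Assumed shape: one-column data frames holding factors
varss <- data.frame(snp = factor(c("rs41531144", "rs41323649", "exm2263307")))
varsg <- data.frame(snp = factor(c("exm2263307", "exm41", "rs41531144")))

# Work on character vectors rather than on the data frames themselves
s <- as.character(varss[[1]])
g <- as.character(varsg[[1]])

both <- s %in% g          # one TRUE/FALSE per element of s
int  <- intersect(s, g)   # names present in both
difs <- setdiff(s, g)     # names only in varss
difg <- setdiff(g, s)     # names only in varsg
same <- setequal(s, g)    # TRUE only if both hold exactly the same names
```

On vectors, %in% returns one logical value per element, which is why length(both) then matches the number of SNP names rather than being 1.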

Assigning NULL to a list element in R?

I found this behaviour odd and wanted more experienced users to share their thoughts and workarounds.
On running the code sample below in R:
sampleList <- list()
d<- data.frame(x1 = letters[1:10], x2 = 1:10, stringsAsFactors = FALSE)
for(i in 1:nrow(d)) {
sampleList[[i]] <- d$x1[i]
}
print(sampleList[[1]])
#[1] "a"
print(sampleList[[2]])
#[1] "b"
print(sampleList[[3]])
#[1] "c"
print(length(sampleList))
#[1] 10
sampleList[[2]] <- NULL
print(length(sampleList))
#[1] 9
print(sampleList[[2]])
#[1] "c"
print(sampleList[[3]])
#[1] "d"
The list elements get shifted up.
Maybe this is as expected, but I am trying to implement a function where I merge two elements of a list and drop one. I basically want to lose that list index or have it as NULL.
Is there any way I can assign NULL to it and not see the above behaviour?
Thank you for your suggestions.
Good question.
Check out the R-FAQ:
In R, if x is a list, then x[i] <- NULL and x[[i]] <- NULL remove the specified elements from x. The first of these is incompatible with S, where it is a no-op. (Note that you can set elements to NULL using x[i] <- list(NULL).)
Consider the following example:
> t <- list(1,2,3,4)
> t[[3]] <- NULL # removing the 3rd element (the following elements shift up)
> t[2] <- list(NULL) # setting the 2nd element to NULL
> t
[[1]]
[1] 1
[[2]]
NULL
[[3]]
[1] 4
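For the use case in the question (merge two elements and leave the dropped slot in place), a minimal sketch could look like this. Merging with paste0() is only a placeholder for whatever the real merge does.

```r
sampleList <- as.list(letters[1:10])

# Merge element 2 into element 1 (paste0 stands in for the real merge)
sampleList[[1]] <- paste0(sampleList[[1]], sampleList[[2]])

# Single brackets with list(NULL) keep the slot and store NULL in it
sampleList[2] <- list(NULL)

length(sampleList)         # still 10
is.null(sampleList[[2]])   # TRUE
sampleList[[3]]            # "c", nothing has shifted
```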
UPDATE:
As the author of the R Inferno commented, there can be more subtle situations when dealing with NULL. Consider this fairly general code structure:
# x is some list(), now we want to process it.
> for (i in 1:n) x[[i]] <- some_function(...)
Now be aware that if some_function() returns NULL, you may not get what you want: some elements will simply disappear. You should use lapply instead.
Take a look at this toy example:
> initial <- list(1,2,3,4)
> processed_by_for <- list(0,0,0,0)
> processed_by_lapply <- list(0,0,0,0)
> toy_function <- function(x) {if (x%%2==0) return(x) else return(NULL)}
> for (i in 1:4) processed_by_for[[i]] <- toy_function(initial[[i]])
> processed_by_lapply <- lapply(initial, toy_function)
> processed_by_for
[[1]]
[1] 0
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
> processed_by_lapply
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
Your question is a bit confusing to me.
Assigning NULL to an existing object essentially deletes that object (this can be very handy, for instance, if you have a data frame and wish to delete specific columns). That's what you've done. I am unable to determine what it is that you want though. You could try
sampleList[[2]] <- NA
instead of NULL, but if by "I want to lose" you mean delete it, then you've already succeeded. That's why, "The list elements get shifted up."
obj = list(x = "Some Value")
obj = c(obj,list(y=NULL)) #ADDING NEW VALUE
obj['x'] = list(NULL) #SETTING EXISTING VALUE
obj
If you need to create a list of NULL values which you can later populate with values (data frames, for example), this works without complaint:
B <-vector("list", 2)
a <- iris[sample(nrow(iris), 10), ]
b <- iris[sample(nrow(iris), 10), ]
B[[1]]<-a
B[[2]]<-b
The above answers are similar, but I thought this was worth posting.
Took me a while to figure this one out for a list of lists. My solution was:
mylist[[i]][j] <- list(double())

Resources