Change default NA from logical to character - r

Is there a way to change the default NA (missing) from logical to character (NA_character_) for an entire R session?
For example, if you load a CSV where one column is empty, it will be filled with NA, and the class of that NA will be logical. For this question, we want a way to ensure that it will always be NA_character_. Not to be confused with the literal string "NA".
More examples:
> class(NA)
"logical" # No!
> class(NA_character_)
"character" # Yes! but for NA!

Not sure if I understand, but you could specify the na.strings argument.
Ex:
df <- read.table(text='
a b c d e
1 56 43.0 12 1 NA
2 23 NA 7 2 45
3 15 90.7 10 3 2
4 10 30.5 2 4 NA', na.strings="", as.is=T)
And:
> class(df$b)
[1] "character"
>

As far as I can see, the answer is no:
from the documentation of NA
Details
The NA of character type is distinct from the string "NA". Programmers who need to specify an explicit missing string should use NA_character_ (rather than "NA") or set elements to NA using is.na<-.
I browsed through the list of input parameters to the 'options' function and nothing seems to apply here.
I think the best and safest way is to explicitly define the NAs where they are likely to be encountered. For the CSV case in your example I would recommend the readr package, where 'col_types' is used to define the column classes.
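As a rough base R counterpart to the col_types suggestion (a sketch, not the readr API): read.csv accepts colClasses to fix a column's class and na.strings to decide which fields count as missing. The inline CSV text and column names below are made up for illustration.

```r
# Inline CSV with an entirely empty column 'b' (made-up data)
csv_text <- "a,b\n1,\n2,"

# Default behaviour, as described in the question: inspect the class of 'b'
df_default <- read.csv(text = csv_text)
class(df_default$b)

# Force 'b' to character and treat empty fields as missing values
df_chr <- read.csv(text = csv_text,
                   colClasses = c("integer", "character"),
                   na.strings = c("", "NA"))
class(df_chr$b)   # character column
is.na(df_chr$b)   # the entries are NA_character_, not the string "NA"
```

This has to be repeated per read call; it does not change the session-wide default.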

Related

strsplit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 2 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries. Note in the example that the 2nd element doesn't have time-seconds, so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I get back a list of vectors that is just as difficult to deal with.
You've got xml there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.
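On the backslash problem from the question: a small sketch suggesting that the backslashes are only how R prints embedded double quotes, so there is nothing for gsub to remove. The string x is a shortened, made-up version of one element of test.

```r
# A shortened, made-up element in the same style as 'test'
x <- "<play quarter=\"1\" id=\"1\"/>"

print(x)       # print() shows the quotes escaped with backslashes
cat(x, "\n")   # cat() writes the raw string, without backslashes

# Search for a literal backslash; fixed = TRUE avoids regex escaping
has_backslash <- grepl("\\", x, fixed = TRUE)
has_backslash
```

If has_backslash is FALSE, the string itself holds plain quotes and can be passed to the XML parser unchanged.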

Determining NA and Infinite (INF) values from a list that is split and making elements containing INF, NA values NULL

Hi I have a list which has been split by user. Structure of List is
> lst
$A
timestamp user value
2011-01-01 A 1184437
2011-02-01 A 1197000
2011-03-01 A 1483965
2011-04-01 A 1248051
2011-05-01 A 1285838
$B
timestamp user value
2011-01-01 B 12315
2011-02-01 B 12325345
2011-03-01 B 1235223
2011-04-01 B Inf
2011-05-01 B Inf
$C
timestamp user value
2011-01-01 C NA
2011-02-01 C NA
2011-03-01 C 1181080
2011-04-01 C 1326289
2011-05-01 C 1264455
During runtime I want to determine whether any element in the list contains an Inf or NA value. If it does, I want to store the name of that element somewhere else and set that element of the list to NULL. I have been trying to use is.infinite() to catch the Inf values, but it fails with the error
invalid subscript type 'list'
Code Used:
NA_names <- names(lst)[sapply(lst, function(x) sum(is.na(x)) > 0)]
inf_names <- names(lst)[sapply(lst, function(x) sum(is.infinite(x)) > 0)]
Any help or suggestion regarding this? Since sapply works with data frame I'm not sure which method to use.
Something like a nested sapply should work assuming that the list elements are composed of data.frames.
# flag the list elements that have no infinite value within (these are kept)
keepers <- !sapply(myList, function(i) any(sapply(i, is.infinite)))
keepers
a b c
TRUE FALSE TRUE
# get new list
myNewList <- myList[keepers]
# print names of dropped list items (the ones where keepers is FALSE)
names(keepers)[!keepers]
[1] "b"
You can do this with the purrr package:
library(purrr)
drops <- map(lst, 'value') %>% # extract the 'value' column from each data.frame
keep(~ any(!is.finite(.))) %>% # keep only items with non-finite values
names() # get the names of the remaining list items
lst[drops] <- NULL
purrr::map works just like lapply except that it gives you convenient shortcuts for extracting elements in your list (like using a string to extract columns from a data.frame like in the example). purrr::keep iterates over the list and only keeps the elements which satisfy the logical condition you specify.
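For comparison, a base R sketch of the same idea that catches NA and Inf in one pass, since is.finite() is FALSE for both. The small list below is made up and much shorter than the one in the question, and it assumes every data frame has a numeric 'value' column.

```r
# Made-up list of data frames, one per user
lst <- list(
  A = data.frame(user = "A", value = c(1184437, 1197000)),
  B = data.frame(user = "B", value = c(12315, Inf)),
  C = data.frame(user = "C", value = c(NA, 1181080))
)

# TRUE for every element whose 'value' column holds NA, NaN, Inf or -Inf
bad <- vapply(lst, function(d) any(!is.finite(d$value)), logical(1))

dropped <- names(lst)[bad]   # names stored for later use
lst <- lst[!bad]             # same effect as setting those elements to NULL

dropped
names(lst)
```

Applying the check to the column rather than to the whole data frame avoids calling is.infinite() on a list, which is what triggered the error in the question.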

R function to sort column by string length then by alphabet?

I would like to sort one column in my data frame by string length first then by alphabet, I tried code below:
#sort column by string length then alphabet
GSN[order(nchar(GSN[,3]),GSN[,3]),]
But I got error
Error in nchar(GSN[, 3]) : 'nchar()' requires a character vector
My data looks like:
Flowcell Lane barcode sample plate row column
314 NA NA AACAGACATT LD06_7620SDS GSN1_Hind384D B 4
307 NA NA AACAGCACT LG10_2688SDS GSN1_Hind384D C 3
289 NA NA AACCTC U09_105007SDS GSN1_Hind384D A 1
232 NA NA AACGACCACC 13_232 GSN1_Hind384C H 5
10 NA NA AACGCACATT 13_10 GSN1_Hind384A B 2
165 NA NA AACGG 13_165 GSN1_Hind384B E 9
I would like to sort "barcode" column.
Thanks for your time.
You can add another column to your data frame that contains the number of characters in the barcode, then sort in the usual way.
GSN <- transform(GSN, n=nchar(as.character(barcode)))
GSN[with(GSN, order(n, barcode)), ]
It appears that the issue you were having is because R thinks that barcode is a factor rather than a character vector, so nchar() is invalid. Converting it to character via as.character() solves this.
I wish to add a tidyverse solution
library(tidyverse)
GSN_sorted = GSN %>%
mutate(barcode = as.character(barcode)) %>%
arrange(str_length(barcode), barcode)
Note the factor to character conversion originally pointed out by Alex A.
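A self-contained base R sketch of the same sort without adding a helper column, using only the barcodes from the question. Storing them as a factor here is an assumption meant to mimic what read.table may have produced.

```r
# Barcodes from the question, stored as a factor on purpose
GSN <- data.frame(
  barcode = factor(c("AACAGACATT", "AACAGCACT", "AACCTC",
                     "AACGACCACC", "AACGCACATT", "AACGG"))
)

# Convert once, then order by length first and alphabetically second
bc <- as.character(GSN$barcode)
GSN_sorted <- GSN[order(nchar(bc), bc), , drop = FALSE]

sorted_barcodes <- as.character(GSN_sorted$barcode)
sorted_barcodes
```

drop = FALSE keeps the result a data frame even though this toy example has a single column.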

R accessing variable column names for subsetting

The following works and does what I want it to do:
dat<-subset(data,NLI.1 %in% NLI)
However, I may need to subset via a different column (i.e. NLI.2 and NLI.3). I've tried
NLI_col<-"NLI.1"
NLI_col<-subset(data,select=NLI_col)
dat<-subset(data,NLI_col %in% NLI)
Unsurprisingly this doesn't work. How do I use NLI_col to achieve the result from the code that does work?
It was requested that I give an example of what data looks like. Here:
NLI.1<-c(NA,NA,NA,NA,NA,1,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,2,2,1,2,2,2)
NLI.2<-c(NA,NA,NA,NA,NA,NA,2,2,2,NA,NA,2,2,2,2,2,2,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,1,2,2,2,2,2,2,2)
NLI.3<-c(NA,35,40,NA,10,NA,31,NA,14,NA,NA,15,17,NA,NA,16,10,15,14,39,17,35,14,14,22,10,15,0,34,23,13,35,32,2,14,10,14,10,10,10,40,10,13,13,10,10,10,13,13,25,10,35,NA,13,NA,10,40,0,0,20,40,10,14,40,10,10,10,10,13,10,8,NA,NA,14,NA,10,28,10,10,15,15,16,10,10,35,16,NA,NA,NA,NA,30,19,14,30,10,10,8,10,21,10,10,35,15,34,10,39,NA,10,10,6,16,10,10,10,10,34,10)
other<-c(NA,NA,511,NA,NA,NA,NA,NA,849,NA,NA,NA,NA,1324,1181,832,1005,166,204,1253,529,317,294,NA,514,801,534,1319,272,315,572,96,666,236,842,980,290,843,904,528,27,366,540,560,659,107,63,20,1184,1052,214,46,139,310,872,891,651,687,434,1115,1289,455,764,938,1188,105,757,719,1236,982,710,NA,NA,632,NA,546,747,941,1257,99,133,61,249,NA,NA,1080,NA,645,19,107,486,1198,276,777,738,1073,539,1096,686,505,104,5,55,553,1023,1333,NA,NA,969,691,1227,1059,358,991,1019,NA,1216)
data<-cbind(NLI.1,NLI.2,NLI.3,other)
NLI<-c(10,13)
With this, after sub-setting I should get all the rows with tens and thirteens in data$NLI.3 if NLI_col <- "NLI.3"
Since this is relatively trivial I am guessing this is a duplicate question (my apologies), but the hours drag on and I still can't find a solution.
Seems like you are unnecessarily using subset. Try this:
NLI_col <- 'NLI.3'
head(data[,NLI_col] %in% NLI)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
head(data[data[,NLI_col] %in% NLI, ])
## NLI.1 NLI.2 NLI.3 other
## 5 NA NA 10 NA
## 17 2 2 10 1005
## 26 2 2 10 801
## 31 2 2 13 572
## 36 2 2 10 980
## 38 2 2 10 843
I'm not sure I am following the question exactly. Are you asking to just subset the rows of NLI.3 that contain a 10 or a 13? Is it more complicated than that?
If you just want to get those rows....
df[ which(df$NLI.3==10 | df$NLI.3==13 ),]
Assuming your data is in a dataframe. Also, I changed the name of the dataframe from 'data' to 'df' - calling it 'data' can lead to issues.
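If the data is a data frame rather than the matrix built with cbind above, a minimal sketch of selecting the column by a name held in a variable could look like this. The tiny data frame is made up; only the names NLI, NLI_col and NLI.3 come from the question.

```r
# Made-up data frame with the same column names as the question
df <- data.frame(
  NLI.1 = c(NA, 1, 2, 2),
  NLI.3 = c(10, 35, 13, NA),
  other = c(5, 6, 7, 8)
)

NLI <- c(10, 13)
NLI_col <- "NLI.3"   # change to "NLI.1" or "NLI.2" to filter on another column

# [[ ]] looks the column up by the string stored in NLI_col;
# %in% returns FALSE for NA, so rows with missing values are dropped
dat <- df[df[[NLI_col]] %in% NLI, ]
dat
```

Unlike the == comparison, %in% needs no which() wrapper to cope with NA values.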

Changing values when converting column type to numeric

I have a data file with the format from above.
I loaded it into R and tried to plot a histogram with the values from the dist column, but I got the error "x must be numeric". Therefore I tried to change the format.
> head(data)
V1 V2
1 type gene_dist
2 A 64667
3 A 76486
4 A 97416
5 A 30876
6 A 88018
> summary(data)
V1 V2
A : 67 100 : 1
B :122 100906 : 1
type: 1 102349 : 1
1033 : 1
10544 : 1
10745 : 1
(Other):184
I tried to set the format for the column using sapply but the values are changed:
> data[,2]<-sapply(data[,2],as.numeric)
> head(data)
V1 V2
1 type 190
2 A 146
3 A 166
4 A 189
summary(data)
V1 V2
A : 67 Min. : 1.00
B :122 1st Qu.: 48.25
type: 1 Median : 95.50
Mean : 95.50
3rd Qu.:142.75
Max. :190.00
Does anyone know why this is happening?
It looks like your second column is a factor. You need to use as.character before as.numeric. This is because factors are stored internally as integers with a table to give the factor level labels. Just using as.numeric will only give the internal integer codes. There is no need to use sapply since these functions are vectorized.
data[,2] <- as.numeric(as.character(data[,2]))
It is likely that the column is a factor because there are some non-numeric characters in some of the entries. Any such entries will be converted to NA with the appropriate warning, but you may want to investigate this in your raw data.
As a side note, data is a poor (though not invalid) choice for a variable name since there is a base function of the same name.
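A small sketch of the difference, using four of the values from the question in a made-up factor:

```r
# Numbers stored as factor labels
f <- factor(c("64667", "76486", "97416", "30876"))

# Direct conversion returns the internal integer codes, which follow
# the sorted order of the levels, not the original numbers
codes <- as.numeric(f)

# Going through character recovers the labels as numbers
values <- as.numeric(as.character(f))

codes
values
```

Here the levels are sorted as strings, so "30876" gets code 1 even though it is the last element of the vector.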
I had the same issue, but as I found, the root cause was different, so I am sharing this as an answer rather than a comment.
df <- read.table("doc.csv", header = TRUE, sep = ",", dec = ".")
df$value
# Results in
[1] 2254 1873 2201 2147 2456 1785
# So..
as.numeric(df$value)
[1] 26 14 22 20 32 11
In my case, the reason was that there were spaces with the values in the original csv document. Removing the spaces fixed the issue.
From the dput(df)
" 1178 ", " 1222 ", " 1223 ", " 1314 ", " 1462 ",
I had this same issue for a matrix containing 'list' values, where the object data was read in with read.csv. as.character() does not work here, and as.numeric() and data.matrix() changed the values in the matrix. Instead you need to use the following:
matrix_numeric[1:m,1:n] <- as.numeric(as.matrix(data[1:m,1:n]))
This first converts the values to a character matrix and then to double, for data of dimensions m by n. You need to create the object matrix_numeric before assigning values to it, e.g. matrix_numeric <- matrix(0,m,n).
For a vector vec1 in list format, I use the following:
out1 <- as.numeric(unlist(vec1));
It's probably much better to fix it when reading the file than by using as.numeric() or as.character(). When reading your file, make sure to have:
header=TRUE if first row is header
NA and not Na or NaN (ctrl+H and replace by NA in your datafile)
no other character strings in your numeric columns
Then R will automatically consider them as numeric.
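A sketch of the first two points with made-up inline data in the shape of the question's file. Passing na.strings is an alternative to editing the file by hand, not something the answer above prescribes.

```r
# Made-up file contents: a header row plus one value spelled "Na"
csv_text <- "type,gene_dist\nA,64667\nA,Na\nB,76486"

# Without header = TRUE the header row is read as data,
# so the second column cannot be numeric
raw <- read.table(text = csv_text, sep = ",", header = FALSE)
is.numeric(raw$V2)

# With the header declared and "Na" listed as a missing-value marker,
# the column is converted to a numeric type on its own
dat <- read.table(text = csv_text, sep = ",", header = TRUE,
                  na.strings = c("NA", "Na"))
is.numeric(dat$gene_dist)
dat$gene_dist
```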
