apply function yields the wrong answer - r

I am trying to replace the NAs in those columns that otherwise contain only 0 or 1. However, I found that apply fails to handle the NAs. If I first replace the NAs with an arbitrary string, e.g. "Unknown", then lapply and apply yield the same result. Any explanation would be greatly appreciated.
Here is an example.
df<-data.frame(a=c(0,1,NA),b=c(0,1,0),c=c('d',NA,'c'))
apply(df,2,function(x){all(x %in% c(0,1,NA)) })
unlist(lapply(df,function(x){all(x %in% c(0,1,NA))}))

Using apply on a data.frame whose columns have different classes is not recommended; lapply is the better option. The issue is that apply first converts the data frame to a matrix. With mixed column types the result is a character matrix, and the conversion can cause problems, especially when missing values are involved: numeric columns may pick up leading spaces.
apply(df, 2, I)
# a b c
#[1,] " 0" "0" "d"
#[2,] " 1" "1" NA
#[3,] NA "0" "c"
If the first column were already character, the conversion from NA_real_ to NA_character_ (and the padding that comes with it) would not occur, i.e.
df1 <- df
df1$a <- as.character(c(0, 1, NA))
apply(df1, 2, I)
# a b c
#[1,] "0" "0" "d"
#[2,] "1" "1" NA
#[3,] NA "0" "c"
An option is to wrap with trimws to remove the leading spaces
apply(df,2,function(x){all(trimws(x) %in% c(0,1,NA)) })
# a b c
# TRUE TRUE FALSE
NOTE: For testing the presence of NA, it is recommended to use is.na instead of %in%
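The two recommendations above (iterate column by column, and test missing values with is.na) can be sketched together as follows. This is only an illustration: vapply is used here as a type-checked relative of sapply/lapply and is not part of the original answer, and only_01_or_na is a made-up name.

```r
# Same example data as in the question
df <- data.frame(a = c(0, 1, NA), b = c(0, 1, 0), c = c("d", NA, "c"))

# Test each column in its own class, without going through a character matrix:
# a value is acceptable if it is NA or equals 0 or 1
only_01_or_na <- vapply(
  df,
  function(x) all(is.na(x) | x %in% c(0, 1)),
  logical(1)
)
only_01_or_na
```

Because each column is checked separately, no padding is introduced and trimws is not needed.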

Related

sapply not applying a function created to all rows in R dataframe

I have the following dataframe in R and am trying to apply a string-splitting function to it to yield a different dataframe
DF
A       B    C
"1,2,3" ""   "1,2"
"2"     "1"  ""
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when I use it on a single column
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" "NA"
strsplit returns a list of vectors, one per input element. If we subset only the first list element with [[1]], the rest of the elements are skipped; here the first element corresponds to the first row. When sapply runs over a single column, it loops through each element and calls strsplit on each one separately, so taking [[1]] does no harm because each list has length 1. When sapply runs over the data frame, the case is different: the function receives a whole column, so the list has as many elements as the column has rows. We therefore need to loop through that list too, with either sapply or lapply (the former returns a vector or matrix when it can simplify, while the latter always returns a list):
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look at this more closely by splitting the code into chunks. For each column, strsplit gives a list of split vectors:
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
With [[1]], only the first list element is extracted, i.e. the first row of 'A', 'B', 'C':
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we subset the above again with [1], i.e. take the first element of each vector, the output is "1" NA "1".
Instead, we want to loop through the list and get the first element of each list element.
As you only want to extract the part before the first comma, you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" NA "1"
This extracts the first capture group (\\1), which is the part of the pattern enclosed in parentheses: ([^,]*)
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
Here is another version of this:
lapply(X = DF, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfunc should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
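The clean-up step recommended in this answer (recoding empty strings as NA before splitting) can be sketched end to end like this. DF_raw, DF_clean and first_piece are illustrative names, and the raw data is an assumption reconstructed from the question's description of blank cells.

```r
# Assumed raw data: blanks stored as empty strings, as described in the question
DF_raw <- data.frame(A = c("1,2,3", "2"), B = c("", "1"), C = c("1,2", ""),
                     stringsAsFactors = FALSE)

# Recode empty strings as missing values
DF_clean <- DF_raw
DF_clean[DF_clean == ""] <- NA

# First comma-separated piece of every cell in a column
first_piece <- function(x) sapply(strsplit(x, ","), head, 1)

res <- sapply(DF_clean, first_piece)
res
```

Without the recoding step, strsplit("", ",") yields character(0) for the blank cells, so the inner sapply can no longer simplify its result to a plain character vector.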

Extracting one column based on max of other columns of a Dataframe in R

I am trying to fetch the value in column 'a' corresponding to the max values of columns 'c', 'd' and 'e', and then store it in a vector.
I have written the code below, which gives the column 'a' data along with two NA.
Can somebody help me fetch the exact data using sapply?
a<-c('A','B','C','D','E')
b<-c(10,30,45,25,40)
c<-c(19,23,25,37,39)
d<-c(43,21,17,14,26)
e<-c(NA,23,45,32,NA)
df<-data.frame(a,b,c,d,e)
A1<-vector("character",3)
for (i in 3:5){
A1[i]<-c(df[which(df[,i]==max(df[,i],na.rm = TRUE)),1])
A1
}
Actual Result: > A1
[1] "" "" "E" "A" "C"
Expected Result: A1 should have "E" "A" "C"
Please suggest a solution using sapply.
Thanks
We can use mapply
unname(mapply(function(x, y) x[which(y == max(y, na.rm = TRUE))], df[1], df[3:5]))
#[1] "E" "A" "C"
In the loop, the index i runs over 3:5, which are the column positions, while the 'A1' vector is initialized with 3 elements. Because the assignment starts at the 3rd element, the vector is extended with new elements while the first 2 elements are left untouched as empty strings.
A1<-vector("character",3)
A1
#[1] "" "" ""
A2 <- A1
A2[3:5] <- 15
A2
#[1] "" "" "15" "15" "15" #### this is the same thing happening in the loop
Instead, we can loop over the sequence and then assign
i1 <- 3:5
for(i in seq_along(i1)) {
A1[i] <- df[which(df[,i1[i]]==max(df[,i1[i]],na.rm = TRUE)),1]
}
A1
#[1] "E" "A" "C"
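Since the question asks specifically for sapply, here is one possible sketch. It swaps the which(... == max(...)) idiom for which.max, which skips NA values and returns only the first position of the maximum, so it assumes there are no ties you need to keep.

```r
# Example data from the question
df <- data.frame(
  a = c("A", "B", "C", "D", "E"),
  b = c(10, 30, 45, 25, 40),
  c = c(19, 23, 25, 37, 39),
  d = c(43, 21, 17, 14, 26),
  e = c(NA, 23, 45, 32, NA),
  stringsAsFactors = FALSE
)

# For each of columns 3 to 5, pick the value of 'a' in the row holding the maximum
A1 <- unname(sapply(df[3:5], function(col) df$a[which.max(col)]))
A1
```

Because sapply builds the result itself, there is no pre-allocated vector to index into, which avoids the off-by-two problem of the original loop.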

How can I keep NA when I change levels

I build a vector of factors containing NA.
my_vec <- factor(c(NA,"a","b"),exclude=NULL)
levels(my_vec)
# [1] "a" "b" NA
I change one of those levels.
levels(my_vec)[levels(my_vec) == "b"] <- "c"
NA disappears.
levels(my_vec)
# [1] "a" "c"
How can I keep it ?
EDIT
@rawr gave a nice solution that works most of the time; it works for my previous specific example, but not for the one I'll show below.
@Hack-R had a pragmatic option using addNA. I could make it work with that, but I'd rather have a fully general solution.
See this generalized issue
my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
[1] "a" NA "b1" "b2"
levels(my_vec)[levels(my_vec) %in% c("b1","b2")] <- "c"
levels(my_vec)
[1] "a" "c" # NA disappeared
#rawr's solution:
my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
[1] "a" NA "b1" "b2"
attr(my_vec, 'levels')[levels(my_vec) %in% c("b1","b2")] <- "c"
levels(my_vec)
droplevels(my_vec)
[1] "a" NA "c" "c" # c is duplicated
#Hack-R's solution:
my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
[1] "a" NA "b1" "b2"
levels(my_vec)[levels(my_vec) %in% c("b1","b2")] <- "c"
my_vec <- addNA(my_vec)
levels(my_vec)
[1] "a" "c" NA # NA is in the end
I want levels(my_vec) == c("a",NA,"c")
You have to quote NA, otherwise R treats it as a missing value rather than a factor level. Factor levels sort alphabetically by default, but obviously that's not always useful, so you can specify a different order by passing the levels in the order you want to factor()
require(plyr)
my_vec <- factor(c("NA","a","b1","b2"))
vec2 <- revalue(my_vec,c("b1"="c","b2"="c"))
#now reorder levels
my_vec2 <- factor(vec2, levels(vec2)[c(1,3,2)])
Levels: a NA c
I finally created a function that first replaces the NA value with a temporary one (inspired by @lmo), then does the replacement I wanted the standard way, then puts NA back in its place using @rawr's suggestion.
my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
my_vec <- level_sub(my_vec,c("b1","b2"),"c")
my_vec
# [1] <NA> a c c
# Levels: a <NA> c
As a bonus level_sub can be used with na_rep = NULL which will remove the NA, and it will look good in pipe chains :).
level_sub <- function(x,from,to,na_rep = "NA"){
if(!is.null(na_rep)) {levels(x)[is.na(levels(x))] <- na_rep}
levels(x)[levels(x) %in% from] <- to
if(!is.null(na_rep)) {attr(x, 'levels')[levels(x) == na_rep] <- NA}
x
}
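A short usage sketch of both modes of level_sub. The function body is repeated from above only so that the example runs on its own; kept and dropped are illustrative names.

```r
# level_sub as defined above, repeated so the example is self-contained
level_sub <- function(x, from, to, na_rep = "NA") {
  if (!is.null(na_rep)) levels(x)[is.na(levels(x))] <- na_rep
  levels(x)[levels(x) %in% from] <- to
  if (!is.null(na_rep)) attr(x, "levels")[levels(x) == na_rep] <- NA
  x
}

my_vec <- factor(c(NA, "a", "b1", "b2"),
                 levels = c("a", NA, "b1", "b2"), exclude = NULL)

# Default: the NA level is kept in its original position
kept <- level_sub(my_vec, c("b1", "b2"), "c")
levels(kept)

# na_rep = NULL: the NA level is dropped
dropped <- level_sub(my_vec, c("b1", "b2"), "c", na_rep = NULL)
levels(dropped)
```

Note that the default placeholder "NA" would collide with a genuine level spelled "NA"; in that case pass a different na_rep.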
Nevertheless, it seems that R really doesn't want you to add NA to factors.
levels(my_vec) <- c(NA,"a") behaves strangely, and it doesn't stop there. While subset will keep NA levels in your columns, rbind will quietly remove them! I wouldn't be surprised if further investigation revealed that half of R's functions remove NA factor levels, making them very unsafe to work with...

Converting a list of lists of strings to a data frame of numbers in R

I have a list of lists of strings as follows:
> ll
[[1]]
[1] "2" "1"
[[2]]
character(0)
[[3]]
[1] "1"
[[4]]
[1] "1" "8"
The longest list is of length 2, and I want to build a data frame with 2 columns from this list. Bonus points for also converting each item in the list to a number or NA for character(0). I have tried using mapply() and data.frame to convert to a data frame and fill with NA's as follows.
# Find length of each list element
len = sapply(ll, length)
# Number of NAs to fill for column shorter than longest
len = 2 - len
df = data.frame(mapply( function(x,y) c( x , rep( NA , y ) ) , ll , len))
However, I do not get a data frame with 2 columns (and NA's as fillers) using the code above.
Thanks for the help.
We can use stri_list2matrix from stringi. As the list elements are all character vectors, it seems okay to use this function
library(stringi)
t(stri_list2matrix(ll))
# [,1] [,2]
#[1,] "2" "1"
#[2,] NA NA
#[3,] "1" NA
#[4,] "1" "8"
If we need to convert to data.frame, wrap it with as.data.frame
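For the numeric conversion the question also asks about, here is a base R sketch that does not rely on stringi. It is an alternative to the approach above, not a restatement of it; pad_num and n are illustrative names.

```r
# The list from the question
ll <- list(c("2", "1"), character(0), "1", c("1", "8"))

# Length of the longest element decides the number of columns
n <- max(lengths(ll))

# Convert to numeric and pad with NA by indexing past the end of the vector
pad_num <- function(x) as.numeric(x)[seq_len(n)]

# Stack the padded vectors as rows and convert to a data frame
df <- as.data.frame(do.call(rbind, lapply(ll, pad_num)))
df
```

Indexing a vector beyond its length returns NA, so character(0) becomes a row of NAs without any special casing. The columns get the default names V1 and V2.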

R: are there built-in functions to sort lists?

In R, I have produced the following list L:
>L
[[1]]
[1] "A" "B" "C"
[[2]]
[1] "D"
[[3]]
[1] NULL
I would like to manipulate the list L to arrive at a data frame df like
>df
df[,1] df[,2]
"A" 1
"B" 1
"C" 1
"D" 2
where the 2nd column gives the position in the list L of the corresponding element in column 1.
My question is: are there built-in R functions that can do this manipulation quickly? I can do it using "brute force", but my solution does not scale well to much bigger lists.
I thank you all!
You'll get a warning because of your NULL value, but you can use stack if you give your list items names:
L <- list(c("A", "B", "C"), "D", NULL)
stack(setNames(L, seq_along(L)))
# values ind
# 1 A 1
# 2 B 1
# 3 C 1
# 4 D 2
# Warning message:
# In stack.default(setNames(L, seq_along(L))) :
# non-vector elements will be ignored
If the warning displeases you, you can, of course, run stack on the non-NULL elements, but do it after you name your list elements so that the "ind" column reflects the correct value.
I'll show in 2 steps just for clarity:
names(L) <- seq_along(L)
stack(L[!sapply(L, is.null)])
Similarly, if you've gotten rid of the NULL list elements, you can use melt from "reshape2". You don't gain anything in brevity, and I'm not sure that you gain anything in efficiency either, but I thought I'd share it as an option.
library(reshape2)
names(L) <- seq_along(L)
melt(L[!sapply(L, is.null)])
Ananda's answer is seemingly better than this, but I'll put it up anyway:
> cbind(unlist(L), rep(1:length(L), sapply(L, length)))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "1"
[3,] "C" "1"
[4,] "D" "2"
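The cbind call above returns a character matrix, so the positions come back as strings. A small variation, sketched here with illustrative column names value and pos, builds a data frame instead so that the position stays an integer.

```r
L <- list(c("A", "B", "C"), "D", NULL)

# unlist() drops the NULL element; lengths() reports 0 for it,
# so its position is simply never repeated
res <- data.frame(
  value = unlist(L),
  pos   = rep(seq_along(L), lengths(L)),
  stringsAsFactors = FALSE
)
res
```

This needs no names on the list and raises no warning for the NULL element.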

Resources