Is there a faster/shorter way to set the first match and all values after it to NA?
vec <- 1:10;vec[c(3,5,7)]<-c(NA,NaN,"remove")
#"1" "2" NA "4" "NaN" "6" "remove" "8" "9" "10"
Desired Outcome:
#"1" "2" NA "4" "NaN" "6" NA NA NA NA
My code:
vec[{grep("^remove$",vec)[1]}:length(vec)]<-NA
Please note:
In this case, we assume a "remove" element is always present. So the solution does not have to handle the case where there isn't one.
You can use match to stop searching after the first match is found:
m = match("remove", vec) - 1L
if (is.na(m)){
vec
} else {
c(head(vec, m), rep(vec[NA_integer_], length(vec)-m))
}
You'd have to have a pretty large vector to notice a speed difference, though, I guess. Alternately, this might prove faster:
m = match("remove", vec)
if (!is.na(m)){
vec[m:length(vec)] <- NA
}
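The second variant can be wrapped in a small helper. This is only a sketch: the name na_from_match is made up for illustration and is not a base R function.

```r
# Hypothetical helper (name is an assumption): set the first occurrence of
# `value` and everything after it to NA; return `x` unchanged if absent.
na_from_match <- function(x, value) {
  m <- match(value, x)        # index of the first match, or NA if none
  if (!is.na(m)) {
    x[m:length(x)] <- NA
  }
  x
}

vec <- 1:10
vec[c(3, 5, 7)] <- c(NA, NaN, "remove")
res <- na_from_match(vec, "remove")
res
```

Because match() returns NA when nothing is found, the same helper also covers the case without a "remove" element.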
Not sure if this is shorter or faster, but here is one alternative:
vec[which.max(vec == "remove"):length(vec)] <- NA
vec
#[1] "1" "2" NA "4" "NaN" "6" NA NA NA NA
Here, we find the first occurrence of "remove" using which.max and then set everything from there to the end of the vector to NA.
The OP has mentioned that a "remove" element is always present, so we need not handle the other case. However, if we still want a check, we can add a condition. Since vec == "remove" is NA wherever vec is NA, pass na.rm = TRUE to any() so the test never evaluates to NA.
inds <- vec == "remove"
if (any(inds, na.rm = TRUE)) {
vec[which.max(inds) : length(vec)] <- NA
}
We can use cumsum on a logical vector
vec[cumsum(vec %in% "remove") > 0] <- NA
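To see why the cumsum approach works, it may help to look at the intermediate steps. The names hit and running below are only illustrative.

```r
vec <- 1:10
vec[c(3, 5, 7)] <- c(NA, NaN, "remove")

hit <- vec %in% "remove"   # TRUE only at the match; %in% never returns NA
running <- cumsum(hit)     # 0 before the first match, >= 1 from it onwards
running

vec[running > 0] <- NA     # blank out the match and everything after it
vec
```

Using %in% rather than == matters here: with ==, the NA already in vec would propagate through cumsum.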
We can also just extend the vec to the desired length:
`length<-`(vec[1:(which(vec=="remove")-1)],length(vec))
[1] "1" "2" NA "4" "NaN" "6" NA NA NA NA
I am trying to replace all NAs in those columns that contain only 0 or 1. However, I found that apply fails to deal with the NAs. If I replace the NAs with an arbitrary string, e.g. "Unknown", then lapply and apply yield the same result. Any explanation would be greatly appreciated.
Here is an example.
df<-data.frame(a=c(0,1,NA),b=c(0,1,0),c=c('d',NA,'c'))
apply(df,2,function(x){all(x %in% c(0,1,NA)) })
unlist(lapply(df,function(x){all(x %in% c(0,1,NA))}))
Using apply on a data.frame with columns of different classes is not recommended; lapply is the better option. The issue is that apply first converts the data.frame to a matrix. A matrix can hold only one type, so numeric columns are formatted as character, and when missing values are involved this can pad the values with extra spaces.
apply(df, 2, I)
# a b c
#[1,] " 0" "0" "d"
#[2,] " 1" "1" NA
#[3,] NA "0" "c"
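The padding can be checked directly on the matrix that apply works on. A small sketch, using the df from the question (col_a is just an illustrative name):

```r
df <- data.frame(a = c(0, 1, NA), b = c(0, 1, 0), c = c("d", NA, "c"))

m <- as.matrix(df)            # the conversion apply() performs first
col_a <- unname(m[, "a"])     # character values, padded to a common width
col_a

all(col_a %in% c(0, 1, NA))           # FALSE: " 0" is not "0"
all(trimws(col_a) %in% c(0, 1, NA))   # TRUE once the padding is removed
```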
If instead the first column were already character, no numeric-to-character formatting (and no conversion of NA_real_ to NA_character_) would take place, so the padding wouldn't occur, i.e.
df1 <- df
df1$a <- as.character(c(0, 1, NA))
apply(df1, 2, I)
# a b c
#[1,] "0" "0" "d"
#[2,] "1" "1" NA
#[3,] NA "0" "c"
An option is to wrap with trimws to remove the leading spaces
apply(df,2,function(x){all(trimws(x) %in% c(0,1,NA)) })
# a b c
# TRUE TRUE FALSE
NOTE: For testing the presence of NA, it is recommended to use is.na instead of %in%
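Following the note above, here is a sketch that tests for NA with is.na and iterates column by column, so no matrix conversion takes place. vapply is used in place of unlist(lapply(...)); the name only_01_or_na is made up for illustration.

```r
df <- data.frame(a = c(0, 1, NA), b = c(0, 1, 0), c = c("d", NA, "c"))

# For each column: is every value either missing or one of 0 / 1?
only_01_or_na <- vapply(
  df,
  function(x) all(is.na(x) | x %in% c(0, 1)),
  logical(1)
)
only_01_or_na
```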
I have a list named mylist that has character elements in it which I'm trying to merge and save in another object.
The following piece of code:
result <- c()
for (i in length(mylist)) {
temp <- paste(mylist[[i]][2], mylist[[i]][3], mylist[[i]][4], sep="")
result[i] <- temp
}
result
Results in the following output:
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
Why am I getting NA's instead of the merged characters for EVERY result[i]?
The reason for the unexpected result has already been explained by Brent and damir.
However, I suggest using seq_along(mylist), as it is safer than 1:length(mylist) in case mylist is empty for some reason.
result <- c()
for (i in seq_along(mylist)) {
result[i] <- paste(mylist[[i]][2:4], collapse = "")
}
result
[1] "BCD" "CDE" "DEF" "EFG" "FGH"
If mylist is empty, length(mylist) is 0, but the loop would still be executed twice, because 1:0 yields c(1, 0).
In addition, the collapse parameter tells paste() to concatenate the elements of a vector, thereby saving a lot of typing.
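Both points are easy to verify on an empty list. A minimal sketch (the counter names are only illustrative):

```r
empty <- list()

1:length(empty)     # counts down from 1 to 0
seq_along(empty)    # integer(0), so a loop over it never runs

n_colon <- 0
for (i in 1:length(empty)) n_colon <- n_colon + 1   # runs for i = 1 and i = 0

n_seq <- 0
for (i in seq_along(empty)) n_seq <- n_seq + 1      # body is never executed

c(n_colon, n_seq)

# collapse joins the elements of a single vector into one string
joined <- paste(c("B", "C", "D"), collapse = "")
joined
```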
By the way, the same result can be achieved by using sapply():
sapply(mylist, function(x) paste(x[2:4], collapse = ""))
[1] "BCD" "CDE" "DEF" "EFG" "FGH"
Data
The OP has not provided a reproducible example but says that he has "a list named mylist that has character elements". So, here are some made-up data:
mylist <- lapply(1:5, function(i) LETTERS[i + (0:3)])
mylist
[[1]]
[1] "A" "B" "C" "D"
[[2]]
[1] "B" "C" "D" "E"
[[3]]
[1] "C" "D" "E" "F"
[[4]]
[1] "D" "E" "F" "G"
[[5]]
[1] "E" "F" "G" "H"
length(mylist) is just a scalar value (i.e., a single value). You need to add the starting value like this: 1:length(mylist). I am guessing that the last value in mylist has some result?
Your loop starts and ends at length(mylist), because the value of i is always length(mylist). So, your loop "loops" only once.
To change this, you need to specify the start and end value of i. Something like,
for (i in 1:length(mylist))
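A small made-up list shows where the NAs come from: the loop body runs once, for the last index only, and R pads the earlier positions of result with NA.

```r
# Made-up data, only for illustration
mylist <- list(c("A", "B"), c("C", "D"), c("E", "F"))

result <- c()
for (i in length(mylist)) {   # length(mylist) is 3, so i only takes the value 3
  result[i] <- paste(mylist[[i]], collapse = "")
}
result   # positions 1 and 2 were never assigned
```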
I have the following dataframe in R and am trying to apply a string-split function to it to yield a different dataframe
DF
A        B    C
"1,2,3"       "1,2"
"2"      "1"
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when I use it on a single column
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" "NA"
strsplit returns a list of vectors, one per input element. If we subset only the first list element with [[1]], the remaining elements are dropped; here the first element corresponds to the first row. When the function is applied to a single column with sapply, it loops through each element and calls strsplit on each one separately, so taking [[1]] does no harm because each list has length 1. With sapply(DF, sepfunc) the case is different: the function receives a whole column, so the list has as many elements as the column has rows. So we need to loop through the list (with sapply or lapply; the former returns a vector or matrix where it can simplify, while the latter always returns a list)
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look at this more closely by splitting the code into chunks. For each column, strsplit returns a list of split vectors
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
When we do [[1]], the first element is extracted i.e. the first row of 'A', 'B', 'C'
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we again subset the above, i.e. take the first element, the output will be 1 NA 1.
Instead, we want to loop through the list and get the first element of each list element.
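On a single character vector, the difference between the two kinds of subsetting looks like this. The vector A is made up and includes an NA and an empty string to show both edge cases.

```r
A <- c("1,2,3", "2", NA, "")

pieces <- strsplit(A, ",")   # a list with one element per input string
pieces[[1]][1]               # first piece of the first string only

# `[` with index 1 is applied to every list element in turn;
# character(0)[1] and NA[1] both come out as NA
firsts <- sapply(pieces, `[`, 1)
firsts
```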
As you only want to extract the first part before the comma, you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
This extracts the first group (\\1), which is marked here with parentheses: ([^,]*).
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
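If loading a package is not an option, a similar result can be sketched in base R by deleting everything from the first comma onwards with sub(). The DF below is an assumed version of the question's data with NA for the blank cells.

```r
# Assumed data: blanks from the question represented as NA
DF <- data.frame(A = c("1,2,3", "2"), B = c(NA, "1"), C = c("1,2", NA),
                 stringsAsFactors = FALSE)

# Drop the first comma and everything after it; NA stays NA
first_part <- sapply(DF, function(x) sub(",.*$", "", x))
first_part
```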
Here is another version of this
lapply(X = DF, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfunc should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
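As a sketch of that clean-up step on in-memory data (raw is a made-up stand-in for what read.csv might return, with empty strings for the missing cells):

```r
# Made-up stand-in for freshly read data: missing cells are empty strings
raw <- data.frame(A = c("1,2,3", "2"), B = c("", "1"), C = c("1,2", ""),
                  stringsAsFactors = FALSE)

raw[raw == ""] <- NA   # turn every empty string into a proper missing value
raw
```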
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
I have a vector,
myvector <- c("a","b","c","cat","4","dog","cat","f"). I would like to select out those elements that immediately follow elements containing the string "cat".
I.e., I want myvector2 containing only "4" and "f". I'm not sure where to begin.
myvector <- c("a","b","c","cat","4","dog","cat","f")
where_is_cat <- which(myvector == "cat")
# [1] 4 7
myvector[where_is_cat + 1]
# [1] "4" "f"
myvector2 <- myvector[where_is_cat + 1]
Try this:
myvector[grep('cat', myvector) + 1]
#[1] "4" "f"
You can subset the vector minus its first element (myvector[-1]) by the positions where the vector minus its last element (myvector[-length(myvector)]) equals "cat"
myvector[-1][myvector[-length(myvector)]=="cat"]
# [1] "4" "f"
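One edge case worth keeping in mind with the index-plus-one approaches: if "cat" is the last element, the shifted index points past the end and yields NA. A minimal sketch of a guard, where the helper name after_match is hypothetical:

```r
# Hypothetical helper (name is an assumption): elements that immediately
# follow an exact match of `value`, ignoring a match in the last position
after_match <- function(x, value) {
  idx <- which(x == value) + 1L
  x[idx[idx <= length(x)]]
}

v <- c("a", "cat", "4", "dog", "cat")   # made-up vector ending in "cat"
unguarded <- v[which(v == "cat") + 1]   # contains a trailing NA
guarded <- after_match(v, "cat")
guarded
```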
I have some data that is currently in character form, and I need to put it into numeric form so that I can get the mean. I'm new to R so any help will be much appreciated. My initial thought was that the missing data is causing it to not be read as num, but could it be because the numbers are "3" instead of 3?
Here's what I have:
X
chr [1:1964] "3", "4", "4", "1", NA
I've tried different methods of converting X from chr to num:
X <- na.omit(Y, Z, as.numeric)
mean(X)
# [1] NA
# Warning message:
# In mean.default(X) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
mean(X, na.rm=TRUE)
# [1] NA
# Warning message:
# In mean.default(X, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
str(X)
# Named chr [1:1965] "3" "4" "4" "1" "5" "7" NA "6" NA "5" ...
# - attr(*, "names")= chr [1:1965] "" "" "" "" ...
As always, an example of your actual data is helpful. I think I can answer anyway, though. If your data are character data, then converting to numeric like this will work most of the time:
X2 <- as.numeric(X)
If you have missing values, are they showing up as NA? Or did you write something else there to indicate missingness such as "missing"? If you've got something other than NA in your original data, then when you do the as.numeric(X) conversion, R will convert those values to NA and give you a warning message.
To take the mean of a numeric object that has missing values, use:
mean(X2, na.rm=TRUE)
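To find out which entries were turned into NA by the conversion, something like the following may help. The data are made up, and "missing" stands for whatever non-numeric marker the real data might contain.

```r
# Made-up data: numbers stored as text, a real NA, and a text marker
X <- c("3", "4", "4", "1", NA, "missing")

X2 <- suppressWarnings(as.numeric(X))    # "missing" becomes NA
not_numeric <- X[is.na(X2) & !is.na(X)]  # values that failed to convert
not_numeric

avg <- mean(X2, na.rm = TRUE)
avg
```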
This should work:
mean(as.numeric(X), na.rm=TRUE)
Doing the as.numeric() will introduce an NA for values like "X", and many of the summary functions have an na.rm parameter to ignore NA values in the vector.
But of course taking the mean of a list of chromosomes is a pretty weird operation.