I have the following dataframe in R and am trying to use a stringsplit function to the same to yield a different dataframe
DF
A B C
"1,2,3" "1,2"
"2" "1"
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when i use it on a single column
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" "NA"
When we do the strsplit, the output is a list of vectors. If we just subset the first list element with [[1]], then the rest of the elements are skipped. Here the first element corresponds to the first row. But, when we do the same on a single column, it is looping through each element and then do the strsplit. It will not hurt by taking the first element [[1]] because the list is of length 1. Here, the case is different. The number of list elements are the same as the number of rows for each of the columns. So, we need to loop through the list (either with sapply/lapply - former gives a vector depends on the case, while latter always return list)
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look this more closely by splitting the codes into chunks. On each column, we can find the output as list of splitted vectors
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
When we do [[1]], the first element is extracted i.e. the first row of 'A', 'B', 'C'
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we again subset on the above, i.e. the first element, the output will be 1 NA 1.
Instead we want to loop through the list and get the first element of each list
As you only want to extract the first part before the , you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" NA "1"
This extracts the the first group (\\1) which is here marked with brackets. ([^,]*)
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
Here is another version of this
lapply(X = df, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfun should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Related
I am trying to replace all NAs for those columns with 0 or 1 only. However, I found that apply failed to deal with the NAs. If I replace the NAs with an arbitrary string i.e. "Unknown". Then lapply and apply yield the same result. Any explanation would be greatly appreciated.
Here is an example.
df<-data.frame(a=c(0,1,NA),b=c(0,1,0),c=c('d',NA,'c'))
apply(df,2,function(x){all(x %in% c(0,1,NA)) })
unlist(lapply(df,function(x){all(x %in% c(0,1,NA))}))
It is not recommended to use apply on a data.frame with different classes. The recommended option is lapply. Issue is that with apply, it converts to matrix and this can result in some issues especially when there are missing values involved i.e. creating extra spaces.
apply(df, 2, I)
# a b c
#[1,] " 0" "0" "d"
#[2,] " 1" "1" NA
#[3,] NA "0" "c"
If instead if the first column was already character, then the NA conversion from NA_real_ to NA_character_ wouldn't occur i.e.
df1 <- df
df1$a <- as.character(c(0, 1, NA))
apply(df1, 2, I)
# a b c
#[1,] "0" "0" "d"
#[2,] "1" "1" NA
#[3,] NA "0" "c"
An option is to wrap with trimws to remove the leading spaces
apply(df,2,function(x){all(trimws(x) %in% c(0,1,NA)) })
# a b c
# TRUE TRUE FALSE
NOTE: For testing the presence of NA, it is recommended to use is.na instead of %in%
I am trying to fetch the value in column in 'a' corresponding to the max values od columns 'c','d' and 'e' and then store it in a vector.
I have written below code which gives column 'a' data along with two NA.
Can somebody help me to fetch the exact data using sapply.
a<-c('A','B','C','D','E')
b<-c(10,30,45,25,40)
c<-c(19,23,25,37,39)
d<-c(43,21,17,14,26)
e<-c(NA,23,45,32,NA)
df<-data.frame(a,b,c,d,e)
A1<-vector("character",3)
for (i in 3:5){
A1[i]<-c(df[which(df[,i]==max(df[,i],na.rm = TRUE)),1])
A1
}
Actual Result: > A1
[1] "" "" "E" "A" "C"
Expected Result: A1 should have "E" "A" "C"
Please suggest a solution using sapply.
Thanks
We can use mapply
unname(mapply(function(x, y) x[which(y == max(y, na.rm = TRUE))], df[1], df[3:5]))
#[1] "E" "A" "C"
In the loop, the indexing starts from 3:5 which is the index for the columns while the 'A1' vector object is initialized to 3 elements. If the assignment starts from the 3rd element onwards, the vector just appends new elements while keeping the first 2 elements untouched.
A1<-vector("character",3)
A1
#[1] "" "" ""
A2 <- A1
A2[3:5] <- 15
A2
#[1] "" "" "15" "15" "15" #### this is the same thing happening in the loop
Instead, we can loop over the sequence and then assign
i1 <- 3:5
for(i in seq_along(i1)) {
A1[i] <- df[which(df[,i1[i]]==max(df[,i1[i]],na.rm = TRUE)),1]
}
A1
#[1] "E" "A" "C"
I have a list of lists of strings as follows:
> ll
[[1]]
[1] "2" "1"
[[2]]
character(0)
[[3]]
[1] "1"
[[4]]
[1] "1" "8"
The longest list is of length 2, and I want to build a data frame with 2 columns from this list. Bonus points for also converting each item in the list to a number or NA for character(0). I have tried using mapply() and data.frame to convert to a data frame and fill with NA's as follows.
# Find length of each list element
len = sapply(awards2, length)
# Number of NAs to fill for column shorter than longest
len = 2 - len
df = data.frame(mapply( function(x,y) c( x , rep( NA , y ) ) , ll , len))
However, I do not get a data frame with 2 columns (and NA's as fillers) using the code above.
Thanks for the help.
We can use stri_list2matrix from stringi. As the list elements are all character vectors, it seems okay to use this function
library(stringi)
t(stri_list2matrix(ll))
# [,1] [,2]
#[1,] "2" "1"
#[2,] NA NA
#[3,] "1" NA
#[4,] "1" "8"
If we need to convert to data.frame, wrap it with as.data.frame
NOTE: I have updated the question to reflect specific patterns in the data.
Say that I have two vectors.
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1','A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')
I want to know how to get a vector, a data frame, a list, or whatever, that checks on the levels vector and returns which levels of which variables where selected. Something that says:
A: 1, 3
B: low, high
D: (4.88e+03,9.18e+04]
Ultimately, there is a data frame X for which names_data = names(data) and levels_selected are some, but not all, of the levels in each of the variables. In the end what I want to do is to make a matrix (for, say for example, a random forest) using model.matrix where I want to include only the variables AND levels in levels_selected. Is there a straightforward way of doing so?
We can create a grouping variable after keeping the substring that contains the "names_data" in the "levels_selected" ('grp'), split the substring with prefix removed using the 'grp' to get a list.
grp <- sub(paste0("^(", paste(names_data, collapse="|"), ").*"), "\\1", levels_selected)
value <- gsub(paste(names_data, collapse="|"), "",
levels_selected)
lst <- split(value, grp)
lst
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "x"
If we meant something like
library(qdapTools)
mtabulate(lst)
# 1 3 high low x
#A 1 1 0 0 0
#B 0 0 1 1 0
#D 0 0 0 0 1
Or another option is using strsplit
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
aggregate(V2~V1, d1, FUN= toString)
# V1 V2
#1 A 1, 3
#2 B low, high
#3 D x
and possibly the model.matrix would be
model.matrix(~V1+V2-1, d1)
Update
By using the OP's new example
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
split(d1$V2, d1$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
It is also working with the first method.
Update2
If there are no characters that succeed the elements in 'names_data', we can filter them out
lst <- strsplit(levels_selected, paste0("(?<=(", paste(names_data,
collapse="|"), "))"), perl = TRUE)
d2 <- as.data.frame(do.call(rbind,lst[lengths(lst)==2]), stringsAsFactors=FALSE)
split(d2$V2, d2$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
An option that returns a list with levels as a vector stored under each corresponding name:
> setNames(lapply(names_data, function(x) gsub(x, "", levels_selected[grepl(x, levels_selected)])), names_data)
$A
[1] "1" "3"
$B
[1] "low" "high"
$C
character(0)
$D
[1] "x"
$E
character(0)
So this is a handy little function I extended from the regexpr help example, using perl-style regex
parseAll <- function(data, pattern) {
result <- gregexpr(pattern, data, perl = TRUE)
do.call(rbind,lapply(seq_along(data), function(i) {
if(any(result[[i]] == -1)) return("")
st <- data.frame(attr(result[[i]], "capture.start"))
le <- data.frame(attr(result[[i]], "capture.length") - 1)
mapply(function(start,leng) substring(data[i], start, start + leng), st, le)
}))
}
EDIT: It's extended because this one will find multiple matches of the patterns, allowing you to look for say, multiple patterns per line. so a pattern like: "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" (from the original regexpr help) finds all instances of the pattern in each string, rather than just one.
suppose I had data that looked like this:
dat <- c('A1','A2','A3','B3')
I could then search for this data via
parseAll(z,'A(?<A>.*)|B(?<B>.*)') to get a data.frame with the levels selected:
parseAll(dat,'A(?<A>.*)|B(?<B>.*)')
A B
[1,] "1" ""
[2,] "2" ""
[3,] "3" ""
[4,] "" "3"
and which selection had each level (though that may not be useful to you), I can programmatically generate these patterns as well from your vectors:
pattern <- paste(paste0(names_data,'(?<',names_data,'>.*)'),collapse = '|')
then your selected levels are the unique elements of each column, (it's in data.frame, so the conversion to list is easy enough)
This is my omnitool for this kinda stuff, hope it's handy
I have a vector,
myvector <- c("a","b","c","cat","4","dog","cat","f"). I would like to select out those elements that immediately follow elements containing the string "cat".
I.e., I want myvector2 containing only "4" and "f". I'm not sure where to begin.
myvector <- c("a","b","c","cat","4","dog","cat","f")
where_is_cat <- which(myvector == "cat")
# [1] 4 7
myvector[where_is_cat + 1]
# [1] "4" "f"
myvector2 <- myvector[where_is_cat + 1]
Try this:
x[grep('cat',x)+1]
#[1] "4" "f"
You can subset list minus its first element (list[-1]) by indices where list minus its last element (list[-length(list)]) equals "cat"
list[-1][list[-length(list)]=="cat"]
# [1] "4" "f"