I want to convert a character variable to a list of lists. My character string looks as follows:
"[["a",2],["b",5]]"
The expected result is a list containing two lists, each with a character and a number.
Looks like a JSON list to me, which will make your parsing job pretty simple:
x <- '[["a",2],["b",5]]'
library(jsonlite)
fromJSON(x, simplifyVector=FALSE)
#[[1]]
#[[1]][[1]]
#[1] "a"
#
#[[1]][[2]]
#[1] 2
#
#
#[[2]]
#[[2]][[1]]
#[1] "b"
#
#[[2]][[2]]
#[1] 5
If you want it combined back to columns instead, just let the simplification occur by default:
fromJSON(x)
# [,1] [,2]
#[1,] "a" "2"
#[2,] "b" "5"
Here is one possibility via base R,
xx <- '[[a, 2], [b, 5]]'
lapply(split(matrix(gsub('[[:punct:][:space:]]', '', unlist(strsplit(xx, ','))),
                    nrow = 2, byrow = TRUE), 1:2),
       function(i) list(i[[1]], as.numeric(i[[2]])))
#$`1`
#$`1`[[1]]
#[1] "a"
#$`1`[[2]]
#[1] 2
#$`2`
#$`2`[[1]]
#[1] "b"
#$`2`[[2]]
#[1] 5
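For the quoted form of the string from the question, another base R route is to pull out each inner bracket group with a regular expression. This is only a sketch: the helper name parse_pair is made up for illustration, and it assumes every inner group is a flat "name, number" pair with no nested brackets and no commas inside the strings.

```r
x <- '[["a",2],["b",5]]'

# Extract every innermost "[...]" group (the character class excludes brackets)
inner <- regmatches(x, gregexpr("\\[[^\\[\\]]*\\]", x, perl = TRUE))[[1]]

# parse_pair is a hypothetical helper: strip the surrounding brackets,
# split on the comma, unquote the first field, convert the second to numeric
parse_pair <- function(s) {
  body  <- substr(s, 2, nchar(s) - 1)
  parts <- trimws(strsplit(body, ",", fixed = TRUE)[[1]])
  list(gsub('"', "", parts[1], fixed = TRUE), as.numeric(parts[2]))
}

res <- lapply(inner, parse_pair)
str(res)
```

For anything beyond this simple shape, a real JSON parser such as jsonlite is the safer choice.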
I have a data frame "dfx" like below. I need to convert values in "COUNTY_ID" to a vector to provide to function.
dfx:
STATE COUNTY_ID
KS 15,21,33,101
OH 133,51,12
TX 15,21,37,51,65
I have converted the STATE to a vector like below:
st = as.vector(as.character(dfx$STATE))
But, I need to convert each row in "COUNTY_ID" column to a number/numeric vector. For example, c(15,21,33,101)
How can I achieve this in R?
Any help is appreciated.
cty_id <- lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)
DOES NOT work:
mclapply(cty_id[1], FUN = each_cty, st = st[1], mc.cores = detectCores() - 1)
DOES work:
mclapply(c(15,21,33,101), FUN = each_cty, st = st[1], mc.cores = detectCores() - 1)
Is this what you're after?
strsplit(as.character(dfx$COUNTY_ID), ",")
#[[1]]
#[1] "15" "21" "33" "101"
#
#[[2]]
#[1] "133" "51" "12"
#
#[[3]]
#[1] "15" "21" "37" "51" "65"
Explanation: strsplit(..., ",") splits every entry based on ",", and stores the result in a list of character vectors.
Or to get a list of numeric vectors:
lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)
#[[1]]
#[1] 15 21 33 101
#
#[[2]]
#[1] 133 51 12
#
#[[3]]
#[1] 15 21 37 51 65
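A likely reason the mclapply call with cty_id[1] in the question fails is the indexing: single brackets return a list of length one, while double brackets return the numeric vector itself. The sketch below rebuilds the example data (the data frame is reconstructed from the question) and compares the two. each_cty is not shown in the question, so it is not called here.

```r
# Example data reconstructed from the question
dfx <- data.frame(STATE = c("KS", "OH", "TX"),
                  COUNTY_ID = c("15,21,33,101", "133,51,12", "15,21,37,51,65"),
                  stringsAsFactors = FALSE)

cty_id <- lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)

# Single brackets keep the list wrapper: one element holding a vector
one_element_list <- cty_id[1]
length(one_element_list)

# Double brackets extract the numeric vector itself: four county ids
first_vector <- cty_id[[1]]
length(first_vector)
```

Iterating over cty_id[[1]] visits each county id in turn, which matches the call that works with c(15,21,33,101).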
How do you want to handle situations like the one in your example data, when KS has four distinct values of county_id, but OH has only three? If you seek to get one column per county_id, and you're ok with missing values in some of the cells, then the easiest thing is to use stringr::str_split_fixed().
> result <- stringr::str_split_fixed(dfx$COUNTY_ID, ",", n=5)
> result
[,1] [,2] [,3] [,4] [,5]
[1,] "15" "21" "33" "101" ""
[2,] "133" "51" "12" "" ""
[3,] "15" "21" "37" "51" "65"
Note that you need to know the maximum number of county_ids per row and pass it as the argument n above. You can be conservative, pick a larger n, and drop the all-empty columns later.
What you get out of this is a character matrix. You can convert it to numeric with class(result) <- 'numeric'; the empty strings become NA. After that, each row of the result matrix gives you the vector of interest; you may have to wrap it in na.omit() to be sure you only get numbers.
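If stringr is not available, or the maximum number of ids is not known in advance, roughly the same padded matrix can be built in base R. This is a sketch under the assumption that the input is a plain character vector like the COUNTY_ID column; the names county, parts and n_max are only illustrative.

```r
county <- c("15,21,33,101", "133,51,12", "15,21,37,51,65")

# Split and convert each row to a numeric vector
parts <- lapply(strsplit(county, ","), as.numeric)

# Longest row determines the number of columns
n_max <- max(lengths(parts))

# Pad every vector with NA up to n_max, then put one input row per matrix row
result <- t(vapply(parts,
                   function(v) c(v, rep(NA_real_, n_max - length(v))),
                   numeric(n_max)))
result

# Recover the ids of the second row without the padding
as.vector(na.omit(result[2, ]))
```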
Consider the following list:
temp <- list(id = 1, grade = "a", alive = TRUE)
We can use sapply to replicate the list:
> ts <- sapply(1:5, function(x) temp)
> ts
[,1] [,2] [,3] [,4] [,5]
id 1 1 1 1 1
grade "a" "a" "a" "a" "a"
alive TRUE TRUE TRUE TRUE TRUE
If I inspect the result using typeof, I obtain list. However, if I inspect it with sapply, I get this:
> sapply(ts, function(x) print(x))
[1] 1
[1] "a"
[1] TRUE
[1] 1
[1] "a"
[1] TRUE
[1] 1
[1] "a"
[1] TRUE
[1] 1
[1] "a"
[1] TRUE
[1] 1
[1] "a"
[1] TRUE
That is, when I inspect the same result with sapply, this vector of lists is treated as a matrix. Is there any workaround, or does R disallow a vector of lists in general? If the latter is the case, why do I get "list" from typeof?
PS: For my specific question, I understand the obvious solution of using lapply to switch to a list of lists. I am just curious and confused by R’s behavior.
ts is still a list. More precisely, it is a list of 15 elements with a dim attribute (3 x 5): sapply simplified the five length-3 results into a matrix whose cells are list elements. Iterating over it with sapply therefore visits all 15 elements one by one (3 members of temp times 5 iterations). If you want lapply-like output, try:
> ts <- sapply(1:5, function(x) temp, simplify = FALSE)
> ts
#[[1]]
#[[1]][[1]]
#[1] 1
#
#[[1]][[2]]
#[1] "a"
#
#[[1]][[3]]
#[1] TRUE
#.......
#.......
Or you can even try:
> ts <- sapply(1:5, function(x) as.data.frame(temp))
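To see why typeof reports "list" while the object prints like a matrix, it can help to inspect the simplified result directly. The sketch below assumes temp is a named list, as the printed row names in the question suggest.

```r
temp <- list(id = 1, grade = "a", alive = TRUE)

# sapply simplifies five length-3 lists into a 3 x 5 matrix of list elements
ts <- sapply(1:5, function(x) temp)

typeof(ts)     # storage type of the object
dim(ts)        # the dim attribute makes it print as a matrix
is.matrix(ts)
length(ts)     # number of elements visited when iterating over ts

# A single cell is extracted with double brackets and matrix indices
ts[[2, 3]]

# lapply never simplifies, so this is a plain list of five lists
ts_list <- lapply(1:5, function(x) temp)
length(ts_list)
```

So R does allow a "matrix of list elements"; it is a list with a dim attribute, not an atomic matrix.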
Given a list of matrices with different number of columns:
set.seed(123)
a <- replicate(5, matrix(runif(25*30), ncol=25) , simplify=FALSE)
b <- replicate(5, matrix(runif(30*30), ncol=30) , simplify=FALSE)
list.of.matrices <- c(a,b)
How can I apply functional programming principles (e.g. using the purrr package) to operate on a specific range of each matrix (e.g. the 8th row, from the 2nd column to the last)?
map(list.of.matrices[8, 2:ncol(list.of.matrices)], mean)
The above returns:
Error in 2:ncol(list.of.matrices) : argument of length 0
map_dbl makes sure the returned values are numeric (double). ~ together with . is shorthand for an anonymous function whose argument is referred to as .
library(purrr)
map_dbl(list.of.matrices, ~mean(.[8, 2:ncol(.)]))
[1] 0.4377532 0.5118923 0.5082115 0.4749039 0.4608980 0.4108388 0.4832585 0.4394764 0.4975212 0.4580137
The base R equivalent is
sapply(list.of.matrices, function(x) mean(x[8, 2:ncol(x)]))
[1] 0.4377532 0.5118923 0.5082115 0.4749039 0.4608980 0.4108388 0.4832585 0.4394764 0.4975212 0.4580137
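The error in the question comes from calling ncol() on the list itself, which returns NULL, so the indexing has to happen inside the function applied to each matrix. As a base R counterpart to map_dbl, vapply also checks the type and length of each result. A minimal sketch using the data from the question (row8_means is just an illustrative name):

```r
set.seed(123)
a <- replicate(5, matrix(runif(25 * 30), ncol = 25), simplify = FALSE)
b <- replicate(5, matrix(runif(30 * 30), ncol = 30), simplify = FALSE)
list.of.matrices <- c(a, b)

# ncol() of a plain list is NULL, which is why 2:ncol(list.of.matrices) fails
is.null(ncol(list.of.matrices))

# vapply insists that every result is exactly one double
row8_means <- vapply(list.of.matrices,
                     function(m) mean(m[8, 2:ncol(m)]),
                     numeric(1))
row8_means
```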
Another base R solution, using the Map function:
Map(function(x){mean(x[8,2:ncol(x)])},list.of.matrices)
#[[1]]
#[1] 0.4377532
#[[2]]
#[1] 0.5118923
#[[3]]
#[1] 0.5082115
#[[4]]
#[1] 0.4749039
#[[5]]
#[1] 0.460898
#[[6]]
#[1] 0.4108388
#[[7]]
#[1] 0.4832585
#[[8]]
#[1] 0.4394764
#[[9]]
#[1] 0.4975212
#[[10]]
#[1] 0.4580137
NOTE: I have updated the question to reflect specific patterns in the data.
Say that I have two vectors.
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1','A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')
I want to know how to get a vector, a data frame, a list, or whatever, that checks the levels vector and returns which levels of which variables were selected. Something that says:
A: 1, 3
B: low, high
D: (4.88e+03,9.18e+04]
Ultimately, there is a data frame X for which names_data = names(X), and levels_selected are some, but not all, of the levels in each of the variables. In the end, I want to build a matrix (for, say, a random forest) using model.matrix that includes only the variables AND levels in levels_selected. Is there a straightforward way of doing so?
We can create a grouping variable ('grp') by keeping only the prefix of each element of 'levels_selected' that matches an element of 'names_data', and then split the remainder of each element (with the prefix removed) by 'grp' to get a list.
grp <- sub(paste0("^(", paste(names_data, collapse="|"), ").*"), "\\1", levels_selected)
value <- gsub(paste(names_data, collapse="|"), "",
levels_selected)
lst <- split(value, grp)
lst
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "x"
If we meant something like
library(qdapTools)
mtabulate(lst)
# 1 3 high low x
#A 1 1 0 0 0
#B 0 0 1 1 0
#D 0 0 0 0 1
Or another option is using strsplit
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
aggregate(V2~V1, d1, FUN= toString)
# V1 V2
#1 A 1, 3
#2 B low, high
#3 D x
and possibly the model.matrix would be
model.matrix(~V1+V2-1, d1)
Update
By using the OP's new example
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
split(d1$V2, d1$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
The first method also works with the new example.
Update2
If some elements consist only of a name from 'names_data' with no characters following it, we can filter them out:
lst <- strsplit(levels_selected, paste0("(?<=(", paste(names_data,
collapse="|"), "))"), perl = TRUE)
d2 <- as.data.frame(do.call(rbind,lst[lengths(lst)==2]), stringsAsFactors=FALSE)
split(d2$V2, d2$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
An option that returns a list with levels as a vector stored under each corresponding name:
> setNames(lapply(names_data, function(x) gsub(x, "", levels_selected[grepl(x, levels_selected)])), names_data)
$A
[1] "1" "3"
$B
[1] "low" "high"
$C
character(0)
$D
[1] "x"
$E
character(0)
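Because grepl and gsub match the name anywhere in the string, a level that merely contains one of the letters could be picked up by mistake. A prefix-only variant is sketched below. It assumes that every level starts with its variable name and that no variable name is a prefix of another; res is just an illustrative name.

```r
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1', 'A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')

res <- lapply(setNames(names_data, names_data), function(nm) {
  # Keep only entries that start with this variable name
  hit <- levels_selected[startsWith(levels_selected, nm)]
  # Drop the name prefix, then drop entries with nothing left (e.g. a bare 'F')
  lev <- substring(hit, nchar(nm) + 1)
  lev[nzchar(lev)]
})

# Remove variables for which no level was selected
res <- res[lengths(res) > 0]
res
```

No regular expression is involved, so special characters such as "(" or "+" in the levels need no escaping.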
So this is a handy little function I extended from the regexpr help example, using perl-style regex
parseAll <- function(data, pattern) {
result <- gregexpr(pattern, data, perl = TRUE)
do.call(rbind,lapply(seq_along(data), function(i) {
if(any(result[[i]] == -1)) return("")
st <- data.frame(attr(result[[i]], "capture.start"))
le <- data.frame(attr(result[[i]], "capture.length") - 1)
mapply(function(start,leng) substring(data[i], start, start + leng), st, le)
}))
}
EDIT: It's extended because this one will find multiple matches of the pattern, allowing you to look for, say, multiple matches per line. So a pattern like "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" (from the original regexpr help) finds all instances of the pattern in each string, rather than just one.
Suppose I had data that looked like this:
dat <- c('A1','A2','A3','B3')
I could then search this data via
parseAll(dat,'A(?<A>.*)|B(?<B>.*)') to get a character matrix with the levels selected:
parseAll(dat,'A(?<A>.*)|B(?<B>.*)')
A B
[1,] "1" ""
[2,] "2" ""
[3,] "3" ""
[4,] "" "3"
The result also shows which selection had each level (though that may not be useful to you). I can programmatically generate these patterns from your vectors as well:
pattern <- paste(paste0(names_data,'(?<',names_data,'>.*)'),collapse = '|')
Then your selected levels are the unique non-empty elements of each column (the result is a matrix with named columns, so the conversion to a list is easy enough).
This is my omnitool for this kind of stuff; I hope it's handy.
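Putting the pieces together on the vectors from the question might look like the sketch below. parseAll is repeated from above so the block runs on its own; the final step that collects unique non-empty entries per column is an illustrative addition, and selected is a made-up name.

```r
# parseAll as defined in this answer
parseAll <- function(data, pattern) {
  result <- gregexpr(pattern, data, perl = TRUE)
  do.call(rbind, lapply(seq_along(data), function(i) {
    if (any(result[[i]] == -1)) return("")
    st <- data.frame(attr(result[[i]], "capture.start"))
    le <- data.frame(attr(result[[i]], "capture.length") - 1)
    mapply(function(start, leng) substring(data[i], start, start + leng), st, le)
  }))
}

names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1', 'A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')

# One named capture group per variable name
pattern <- paste(paste0(names_data, '(?<', names_data, '>.*)'), collapse = '|')
m <- parseAll(levels_selected, pattern)

# Unique non-empty entries of each column, keeping only variables with a hit
selected <- lapply(setNames(colnames(m), colnames(m)), function(nm) {
  col <- m[, nm]
  unique(col[nzchar(col)])
})
selected <- selected[lengths(selected) > 0]
selected
```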
UPDATE: FIXED
This is fixed in the upcoming release of R 3.1.0. From the CHANGELOG:
combn(x, simplify = TRUE) now gives a factor result for factor input
x (previously user error).
Related to PR#15442
I just noticed a curious thing. Why does combn appear to unclass factor variables to their underlying numeric values for all except the first combination?
x <- as.factor( letters[1:3] )
combn( x , 2 )
# [,1] [,2] [,3]
#[1,] "a" "1" "2"
#[2,] "b" "3" "3"
This doesn't occur when x is a character:
x <- as.character( letters[1:3] )
combn( x , 2 )
# [,1] [,2] [,3]
#[1,] "a" "a" "b"
#[2,] "b" "c" "c"
Reproducible on R64 on OS X 10.7.5 and Windows 7.
I think it is due to the conversion to matrix done by the simplify parameter. If you don't use it you get:
combn( x , 2 , simplify=FALSE)
[[1]]
[1] a b
Levels: a b c
[[2]]
[1] a c
Levels: a b c
[[3]]
[1] b c
Levels: a b c
The fact that the first column is OK is due to the way combn works: the first column is specified separately and the other columns are then changed from the existing matrix using [<-. Consider:
m <- matrix(x,3,3)
m[,2] <- sample(x)
m
[,1] [,2] [,3]
[1,] "a" "1" "a"
[2,] "b" "3" "b"
[3,] "c" "2" "c"
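The same effect can be isolated without sample(), which makes it easier to see that the assignment stores the factor's integer codes rather than its labels. A small sketch (m and m2 are illustrative names):

```r
x <- factor(letters[1:3])

# Character matrix built from the labels
m <- matrix(as.character(x), 3, 3)

# Assigning the factor directly: the integer codes are coerced to character
m[, 2] <- x
m[, 2]

# Converting to character first keeps the labels
m2 <- matrix(as.character(x), 3, 3)
m2[, 2] <- as.character(x)
m2[, 2]
```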
I think the offending function is therefore [<-.
As Konrad said, the treatment of factors is often odd, or at least inconsistent. In this case I think the behaviour is weird enough to constitute a bug. Try submitting it, and see what the response is.
Since the result is a matrix, and there is no factor matrix type, I think that the correct behaviour would be to convert factor inputs to character somewhere near the start of the function.
I had the same problem. Coercing back to a character vector inside the combn command seems to work:
> combn(as.character(x),2)
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "b" "c" "c"
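Another way to sidestep the issue, regardless of R version, is to take combinations of positions and look up the labels afterwards. This is a sketch of that idea rather than anything combn does internally; res and pairs are illustrative names.

```r
x <- factor(letters[1:3])

# Combine positions, then convert each pair of positions to labels
res <- combn(seq_along(x), 2, FUN = function(i) as.character(x[i]))
res

# If factors are wanted, keep a list and subset the factor by position
pairs <- combn(seq_along(x), 2, simplify = FALSE)
lapply(pairs, function(i) x[i])
```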