Pivot table of concatenated strings in R

I have the following dataset:
mydata<- data.frame(Factors= c("a,b" , "c,d" , "a,c"), Valu = c ("2,3" , "7,8" , "9,1"))
Factors Valu
1 a,b 2,3
2 c,d 7,8
3 a,c 9,1
and I wish to convert it to the following, which lists all the values that occurred with each factor:
My ideal output
a b c d
2 2 7 7
3 3 8 8
9 9
1 1
I need a pivot table. However, I first need to prepare the data and then use melt and dcast to get the desired output. One of my failed attempts at preparing the data is:
mydata2 <- cSplit(mydata, c("Factors","Valu") , ",", "long")
But they lose their connections.

Here is a one-liner with cSplit:
library(splitstackshape)
with(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"), split(Valu, Factors))
#$a
#[1] 2 3 9 1
#$b
#[1] 2 3
#$c
#[1] 7 8 9 1
#$d
#[1] 7 8
If we need a data.table/data.frame, use dcast to convert the 'long' format to 'wide'.
dcast(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"),
rowid(Factors)~Factors, value.var="Valu")[, Factors := NULL][]
# a b c d
#1: 2 2 7 7
#2: 3 3 8 8
#3: 9 NA 9 NA
#4: 1 NA 1 NA
NOTE: splitstackshape loads data.table. Here, we used data.table_1.10.0. The dcast from data.table is also very fast.

Using a couple of *applys, strsplit and grep
## convert columns to characters so you can use strsplit
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
## get all the unique factor values by splitting them
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time)
l <- sapply(f, function(x) mydata[grep(x, mydata$Factors), "Valu"])
This gives a list, where each element is named by the 'Factor' value, and it contains all the 'Valu' values associated with it
l
# $a
# [1] "2,3" "9,1"
#
# $b
# [1] "2,3"
#
# $c
# [1] "7,8" "9,1"
#
# $d
# [1] "7,8"
Another lapply on this list will split the 'Valu's
result <- lapply(l, function(x) unlist(strsplit(x, split = ",")))
result
# $a
# [1] "2" "3" "9" "1"
#
# $b
# [1] "2" "3"
#
# $c
# [1] "7" "8" "9" "1"
#
# $d
# [1] "7" "8"
Edit
To get the result in a data.frame, you can make each list element the same length (by filling with NA), then call data.frame on the result
## the number of rows required for each column
maxLength <- max(sapply(result, length))
## append 'NA's to list elements with fewer than maxLength elements
result <- data.frame(sapply(result, function(x) c(x, rep(NA, maxLength - length(x))) ))
result
# a b c d
# 1 2 2 7 7
# 2 3 3 8 8
# 3 9 <NA> 9 <NA>
# 4 1 <NA> 1 <NA>
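The padding step can also be written with the replacement function `length<-`, which extends a vector with NA when the new length is larger than the old one. A minimal sketch, assuming the named list from the lapply step (rebuilt by hand here so the block runs on its own); the names `padded` and `wide` are only illustrative:

```r
# the split values per factor, as produced by the lapply() step
result <- list(a = c("2", "3", "9", "1"),
               b = c("2", "3"),
               c = c("7", "8", "9", "1"),
               d = c("7", "8"))

# number of rows needed: the length of the longest element
maxLength <- max(lengths(result))

# `length<-` pads each shorter vector with NA up to maxLength
padded <- lapply(result, `length<-`, maxLength)

# every element now has the same length, so data.frame() accepts the list
wide <- data.frame(padded, stringsAsFactors = FALSE)
wide
```

The values stay character here; wrap them in as.numeric() first if numbers are needed.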
Edit
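In response to the comment, if you have 'similar' strings (e.g. "a" and "ao"), you can make your grep regex explicit by using ( ) groups that anchor each value to the start of the string or a comma on the left, and to a comma or the end of the string on the right (see any regex cheatsheet for explanations). Parentheses on their own would not help, because the pattern would still match anywhere in the string.
mydata <- data.frame(Factors = c("a,b", "c,d", "a,c", "bo,ao"), Valu = c("2,3", "7,8", "9,1", "x,y"))
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time); the anchors stop "a" from matching "ao"
l <- sapply(f, function(x) mydata[grep(paste0("(^|,)", x, "(,|$)"), mydata$Factors), "Valu"])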
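A regular expression is matched anywhere in the string, so a short factor name such as "a" also hits "ao" unless the pattern is tied to the comma delimiters. The sketch below compares the two behaviours; `match_token` is a hypothetical helper, and it assumes the factor names contain no regex metacharacters.

```r
# hypothetical helper: TRUE where `token` occurs as a complete
# comma-separated entry of `strings`
match_token <- function(token, strings) {
  # (^|,) = start of string or a comma; (,|$) = a comma or end of string
  grepl(paste0("(^|,)", token, "(,|$)"), strings)
}

Factors <- c("a,b", "c,d", "a,c", "bo,ao")

grepl("a", Factors)        # plain substring match: also TRUE for "bo,ao"
match_token("a", Factors)  # whole-token match: FALSE for "bo,ao"
```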

Another base R attempt:
# character conversion first
mydata[] <- lapply(mydata, as.character)
long <- do.call(rbind,
do.call(Map, c(expand.grid, lapply(mydata, strsplit, ","), stringsAsFactors=FALSE))
)
split(long$Valu, long$Factors)
#$a
#[1] "2" "3" "9" "1"
#
#$b
#[1] "2" "3"
#
#$c
#[1] "7" "8" "9" "1"
#
#$d
#[1] "7" "8"
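To see why this keeps every value attached to every factor, it may help to look at what expand.grid does for a single row. A small sketch, using the first row of the data after splitting; the variable names are only illustrative:

```r
# one row of the data after strsplit(): two factors and two values
factors <- c("a", "b")
values  <- c("2", "3")

# expand.grid() returns every factor/value combination for that row
pairs <- expand.grid(Factors = factors, Valu = values,
                     stringsAsFactors = FALSE)
pairs

# grouping the values by factor shows that each factor keeps both values
split(pairs$Valu, pairs$Factors)
```

Map applies the same step to each row, and rbind stacks the per-row results into one long table.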

I misunderstood in my comment above; if you want every Factor to match every Valu, you need to separate the columns independently to get the combinations. If you add indices to spread by, it's not too bad:
library(tidyverse)
mydata %>%
separate_rows(Factors) %>% separate_rows(Valu, convert = TRUE) %>%
# add indices to give row order when spreading
group_by(Factors) %>% mutate(i = row_number()) %>%
spread(Factors, Valu) %>%
select(-i) # clean up extra column
## # A tibble: 4 × 4
## a b c d
## * <int> <int> <int> <int>
## 1 2 2 7 7
## 2 3 3 8 8
## 3 9 NA 9 NA
## 4 1 NA 1 NA
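In current tidyr, spread is superseded by pivot_wider. The same pipeline might look like the sketch below; it assumes dplyr and tidyr are installed, spells out sep = "," explicitly, and adds an ungroup() step that the answer above does not have.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(Factors = c("a,b", "c,d", "a,c"),
                     Valu = c("2,3", "7,8", "9,1"),
                     stringsAsFactors = FALSE)

wide <- mydata %>%
  # split each column independently to get all factor/value combinations
  separate_rows(Factors, sep = ",") %>%
  separate_rows(Valu, sep = ",", convert = TRUE) %>%
  # add indices to give row order when widening
  group_by(Factors) %>%
  mutate(i = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Factors, values_from = Valu) %>%
  select(-i)   # clean up the index column
wide
```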

Related

How to change the class of a column in a list of a list from character to numeric in r?

The code for producing a sample dataset and converting from character to numeric is below:
ff = data.frame(a = c('1','2','3'),b = 1:3, c = 5:7)
#data.frame is a type of list.
fff = list(ff,ff,ff,ff)
k = fff %>% map(~map(.x,function(x){x['a'] %<>% as.numeric
return(x)}))
However, the result looks like this:
Three lists appear in each of the nested lists ==> 3 * 3 = 9, which is very strange.
I think the result should have 3 lists in a nested list ==> 3 * 1 = 3.
What I want is to convert every a in each data frame to numeric.
> k
[[1]]
[[1]]$a
a
"1" "2" "3" NA
[[1]]$b
a
1 2 3 NA
[[1]]$c
a
5 6 7 NA
[[2]]
[[2]]$a
a
"1" "2" "3" NA
[[2]]$b
a
1 2 3 NA
[[2]]$c
a
5 6 7 NA
[[3]]
[[3]]$a
a
"1" "2" "3" NA
[[3]]$b
a
1 2 3 NA
[[3]]$c
a
5 6 7 NA
[[4]]
[[4]]$a
a
"1" "2" "3" NA
[[4]]$b
a
1 2 3 NA
[[4]]$c
a
5 6 7 NA
I cannot understand why I cannot convert a into numeric...
Like this, with mutate:
fff %>%
map(~ mutate(.x, a = as.numeric(a)))
Or, more base R style:
fff %>%
map(\(x) {x$a <- as.numeric(x$a); x})
You should use map only once, because you don't have a nested list. With the first map, you access each data frame, and that is where you can convert the column to numeric. With a second map, you access the columns of each data frame (which you don't want).
With two maps, it's also preferable to use \(x) or function(x) rather than ~, because using .x and x for different objects becomes confusing. In your question, .x is the data frame, while x is each of its columns.
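The same single-level iteration can be sketched without purrr. This is only an illustration of the point above (one pass over the outer list, each element being a whole data frame), using lapply in place of map:

```r
ff <- data.frame(a = c("1", "2", "3"), b = 1:3, c = 5:7,
                 stringsAsFactors = FALSE)
fff <- list(ff, ff, ff, ff)

# one pass over the outer list: each element `df` is a whole data frame
k <- lapply(fff, function(df) {
  df$a <- as.numeric(df$a)  # convert only column `a`
  df                        # return the modified data frame
})

# check the class of column `a` in every data frame
sapply(k, function(df) class(df$a))
```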

Automatically strip trailing whitespace when fetching data with `DBI::dbGetQuery` in R?

I work with a database (of which I am not the DBA) that has character columns of length greater than the actual data.
Is it possible to automatically strip trailing whitespace when fetching data with DBI::dbGetQuery? (i.e. something similar to utils::read.table(*, strip.white = TRUE))
# connect
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# generate fake data
mytable <- data.frame(x = 1, y = LETTERS[1:3], z = paste(LETTERS[1:3], "   "))
dbWriteTable(con, "mytable", mytable)
# fetch data
(a <- dbGetQuery(con, "select * from mytable"))
# x y z
# 1 1 A A
# 2 1 B B
# 3 1 C C
# trailing spaces are kept
sapply(a, nchar)
# x y z
# [1,] 1 1 5
# [2,] 1 1 5
# [3,] 1 1 5
I hope I can avoid something like:
idx <- sapply(a, is.character)
a[idx] <- lapply(a[idx], trimws, which = "right", whitespace = "[ ]")
sapply(a, nchar)
# x y z
# [1,] 1 1 1
# [2,] 1 1 1
# [3,] 1 1 1
If not, is it a good approach?
As long as you're using select *, there is nothing SQL is going to do for this. If you select them by-name (which is a "best practice" and in many areas the industry-standard), you can use TRIM:
sqldf::sqldf("select x, y, trim(z) as z from mytable") |>
str()
# 'data.frame': 3 obs. of 3 variables:
# $ x: num 1 1 1
# $ y: chr "A" "B" "C"
# $ z: chr "A" "B" "C"
There are also rtrim and ltrim, which trim only trailing or only leading blank space, respectively.
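With the DBI connection from the question, the trailing-only variant could look like the following sketch. It assumes the RSQLite backend used in the question (SQLite provides rtrim); other databases may name or define the function differently.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# fake data with four trailing spaces in column z
mytable <- data.frame(x = 1, y = LETTERS[1:3],
                      z = paste0(LETTERS[1:3], "    "))
dbWriteTable(con, "mytable", mytable)

# rtrim() removes trailing spaces only; leading spaces would be kept
a <- dbGetQuery(con, "select x, y, rtrim(z) as z from mytable")
nchar(a$z)

dbDisconnect(con)
```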

How to extract the minimum values from a list of lists and give them a name in R

I have a list of lists with highly complicated data. I would like to compare the values of each list and extract the smallest values. For simplicity, I provide a similar example.
s <- c(1,2,3)
ss <- c(4,5,6)
S <- list(s,ss)
h <- c(4,8,7)
hh <- c(0,3,4)
H <- list(h,hh)
HH <- list(S,H)
I would like to compare the element of each list with the element of the corresponding list and extract the smallest values. For example, the following are the values of HH list.
> HH
[[1]]
[[1]][[1]]
[1] 1 2 3
[[1]][[2]]
[1] 4 5 6
[[2]]
[[2]][[1]]
[1] 4 8 7
[[2]][[2]]
[1] 0 3 4
Now, I would like to compare
[[1]]
[[1]][[1]]
[1] 1 2 3
with
[[2]]
[[2]][[1]]
[1] 4 8 7
For example, 1 < 4, so I will select 1. For the second element, 2 < 8, so I will select 2. So, I would like to compare the elements of [[1]][[1]] with the elements of [[2]][[1]], and [[1]][[2]] with [[2]][[2]].
Then, I would like to print the name of the list. For example,
I expect output similar to the following:
1 < 4, the first element of the first model is selected.
For a general solution (i.e. if there are many list elements), we could use transpose from purrr to rearrange the list elements, and then use max.col to get the index
library(magrittr)
library(purrr)
HH %>%
transpose %>%
map(~ .x %>%
invoke(cbind, .) %>%
multiply_by(-1) %>%
max.col )
#[[1]]
#[1] 1 1 1
#[[2]]
#[1] 2 2 2
Or using base R
do.call(Map, c(f = function(...) max.col(-1 * cbind(...)), HH))
#[[1]]
#[1] 1 1 1
#[[2]]
#[1] 2 2 2
Maybe you can try this -
Map(function(x, y) as.integer(x > y) + 1, HH[[1]], HH[[2]])
#[[1]]
#[1] 1 1 1
#[[2]]
#[1] 2 2 2
This gives the position of the element selected.
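If the minimum values themselves and a label are wanted rather than the position, pmin can be combined with Map. A minimal sketch for the two-model case; the labels "model 1" and "model 2" and the names `min_vals` and `min_from` are assumptions for illustration, and ties are assigned to the first model:

```r
S <- list(c(1, 2, 3), c(4, 5, 6))   # first model
H <- list(c(4, 8, 7), c(0, 3, 4))   # second model
HH <- list(S, H)

# element-wise minimum of corresponding vectors
min_vals <- Map(pmin, HH[[1]], HH[[2]])

# label of the model that supplied each minimum (ties go to "model 1")
min_from <- Map(function(x, y) ifelse(x <= y, "model 1", "model 2"),
                HH[[1]], HH[[2]])

min_vals
min_from
```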

Split to list() based on condition, omitting the FALSE elements

What is the most elegant way to split a vector into n elements based on a condition?
Every separate TRUE block should go into its own list element. All the FALSE elements get thrown away.
example1:
vec <- c(1:3,NA,NA,NA,4:6,NA,NA,NA,7:9,NA)
cond <- !is.na(vec)
result = list(1:3,4:6,7:9)
example2:
vec_2 <- c(3:1,11:13,6:4,14:16,9:7,20)
cond_2 <- vec_2 < 10
results_2 = list(3:1,6:4,9:7)
It would be great to have a general solution for a vector vec and a relating condition cond.
My best try:
res <- split(vec,data.table::rleidv(cond))
odd <- as.logical(seq_along(res)%%2)
res[if(cond[1])odd else !odd]
I guess this should work generally:
> split(vec[cond], data.table::rleid(cond)[cond])
$`1`
[1] 1 2 3
$`3`
[1] 4 5 6
$`5`
[1] 7 8 9
Let's make it a function:
> f <- function(vec, cond) split(vec[cond], data.table::rleid(cond)[cond])
> f(vec_2, cond_2)
$`1`
[1] 3 2 1
$`3`
[1] 6 5 4
$`5`
[1] 9 8 7
Here is a base R option with rle
grp <- with(rle(cond), rep(seq_along(values) * NA^ !values, lengths))
split(vec[cond], grp[cond])
#$`1`
#[1] 1 2 3
#$`3`
#[1] 4 5 6
#$`5`
#[1] 7 8 9
Similarly with 'vec_2'
grp <- with(rle(cond_2), rep(seq_along(values) * NA^ !values, lengths))
split(vec_2[cond_2], grp[cond_2])
#$`1`
#[1] 3 2 1
#$`3`
#[1] 6 5 4
#$`5`
#[1] 9 8 7
Or create a grouping variable with cumsum and diff (this version relies on the dropped elements of vec being NA), and split on it
grp <- cumsum(c(TRUE, diff(cond) < 0)) * NA^ is.na(vec)
split(vec, grp)
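A dependency-free variant that works for any logical condition can be sketched by marking where each TRUE run starts. `split_true_runs` is a hypothetical helper name, and the sketch assumes `cond` contains no NA:

```r
# hypothetical helper: one list element per consecutive run of TRUE in `cond`
split_true_runs <- function(vec, cond) {
  # a run starts where cond is TRUE and the previous element was not
  starts <- cond & !c(FALSE, head(cond, -1))
  # cumsum(starts) numbers the runs; keep only the positions where cond is TRUE
  unname(split(vec[cond], cumsum(starts)[cond]))
}

vec <- c(1:3, NA, NA, NA, 4:6, NA, NA, NA, 7:9, NA)
split_true_runs(vec, !is.na(vec))

vec_2 <- c(3:1, 11:13, 6:4, 14:16, 9:7, 20)
split_true_runs(vec_2, vec_2 < 10)
```

Unlike the rleid-based versions, the list is returned without names, which matches the expected results in the question.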

Split a vector by its sequences

The following vector x contains the two sequences 1:4 and 6:7, among other non-sequential digits.
x <- c(7, 1:4, 6:7, 9)
I'd like to split x by its sequences, so that the result is a list like the following.
# [[1]]
# [1] 7
#
# [[2]]
# [1] 1 2 3 4
#
# [[3]]
# [1] 6 7
#
# [[4]]
# [1] 9
Is there a quick and simple way to do this?
I've tried
split(x, c(0, diff(x)))
which gets close, but I don't feel like appending 0 to the differenced vector is the right way to go. Using findInterval didn't work either.
split(x, cumsum(c(TRUE, diff(x)!=1)))
#$`1`
#[1] 7
#
#$`2`
#[1] 1 2 3 4
#
#$`3`
#[1] 6 7
#
#$`4`
#[1] 9
Just for fun, you can make use of Carl Witthoft's seqle function from his "cgwtools" package. (It's not going to be anywhere near as efficient as Roland's answer.)
library(cgwtools)
## Here's what seqle does...
## It's like rle, but for sequences
seqle(x)
# Run Length Encoding
# lengths: int [1:4] 1 4 2 1
# values : num [1:4] 7 1 6 9
y <- seqle(x)
split(x, rep(seq_along(y$lengths), y$lengths))
# $`1`
# [1] 7
#
# $`2`
# [1] 1 2 3 4
#
# $`3`
# [1] 6 7
#
# $`4`
# [1] 9
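The lengths and values that seqle reports can also be reproduced in base R, which shows how the cumsum/diff grouping works step by step. A small sketch; the names `starts`, `run_id`, `run_lengths` and `run_values` are only illustrative:

```r
x <- c(7, 1:4, 6:7, 9)

# TRUE wherever a new run starts (the step from the previous value is not 1)
starts <- c(TRUE, diff(x) != 1)

# running count of starts = run identifier for every element
run_id <- cumsum(starts)

run_lengths <- tabulate(run_id)  # how many elements fall in each run
run_values  <- x[starts]         # first value of each run

split(x, run_id)
```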
