Automatically strip trailing whitespace when fetching data with `DBI::dbGetQuery` in R?

I work with a database (of which I am not the DBA) that has character columns of length greater than the actual data.
Is it possible to automatically strip trailing whitespace when fetching data with DBI::dbGetQuery? (i.e. something similar to utils::read.table(*, strip.white = TRUE))
# connect
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# generate fake data
mytable <- data.frame(x = 1, y = LETTERS[1:3], z = paste(LETTERS[1:3], "   "))
dbWriteTable(con, "mytable", mytable)
# fetch data
(a <- dbGetQuery(con, "select * from mytable"))
# x y z
# 1 1 A A
# 2 1 B B
# 3 1 C C
# trailing spaces are kept
sapply(a, nchar)
# x y z
# [1,] 1 1 5
# [2,] 1 1 5
# [3,] 1 1 5
I hope I can avoid something like:
idx <- sapply(a, is.character)
a[idx] <- lapply(a[idx], trimws, which = "right", whitespace = "[ ]")
sapply(a, nchar)
# x y z
# [1,] 1 1 1
# [2,] 1 1 1
# [3,] 1 1 1
If that is not possible, is this a good approach?

As long as you use select *, SQL cannot do this for you. If you select the columns by name (which is a best practice and, in many areas, the industry standard), you can use TRIM:
sqldf::sqldf("select x, y, trim(z) as z from mytable") |>
str()
# 'data.frame': 3 obs. of 3 variables:
# $ x: num 1 1 1
# $ y: chr "A" "B" "C"
# $ z: chr "A" "B" "C"
There are also rtrim and ltrim, which strip only trailing or only leading blank space, respectively.
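If listing every column in every query is impractical, the R-side cleanup from the question can be wrapped in a small helper and applied to whatever dbGetQuery returns. The following is only a sketch in base R: the name strip_trailing_ws is hypothetical, and the data frame below merely stands in for a query result.

```r
# Hypothetical helper: trim trailing whitespace from every character column.
strip_trailing_ws <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], trimws, which = "right")
  }
  df
}

# Stand-in for the result of dbGetQuery(con, "select * from mytable")
a <- data.frame(x = 1, y = LETTERS[1:3],
                z = paste0(LETTERS[1:3], "    "),
                stringsAsFactors = FALSE)

b <- strip_trailing_ws(a)
nchar(b$z)
# [1] 1 1 1
```

Non-character columns are left untouched, so the helper can be applied to any query result without knowing the column names in advance.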

Related

R - identify sequences in a vector

Suppose I have a vector ab containing A's and B's. I want to identify sequences and create a vector v with length(ab) that indicates the sequence length at the beginning and end of a given sequence and NA otherwise.
I have however the restriction that another vector x with 0/1 will indicate that a sequence ends.
So for example:
rep("A", 6)
"A" "A" "A" "A" "A" "A"
x <- c(0,0,1,0,0,0)
0 0 1 0 0 0
should give
v <- c(3, NA, 3, 3, NA, 3)
An example could be the following:
ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0
Here the output should be:
4 NA NA 4 1 1 3 NA 3
(without the restriction it would be)
5 NA NA NA 5 1 3 NA 3
So far, my code without the restriction looks like this:
ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))
cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?
# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v
Does anyone have an idea how I can implement the restriction? (And since my vector has 2 million observations, I wonder whether there is a more efficient way than my approach.)
I would appreciate any comments! Many thanks in advance!
You may do something like this
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
#creating result of lengths
res <- as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
> res
[1] 4 4 4 4 1 1 1 3 3 3
#creating intermediate NAs
replace(res, with(rle(res), setdiff(seq_along(res), c(length(res) + 1 - cumsum(rev(lengths)),
cumsum(lengths),
which(res == 1)))), NA)
[1] 4 NA NA 4 1 1 1 3 NA 3
As per edited scenario
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'
as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
[1] 2 2 1 1 1 1 1 3 3 3
ab
[1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"
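A fully vectorised alternative is sketched below. It assumes that x[i] == 1 means a sequence ends at position i, as in the question's examples, and it avoids per-group calls to rle, which may help with millions of observations. The function name mark_run_ends is hypothetical.

```r
# Hypothetical helper: sequence length at the start and end of each run,
# NA elsewhere. A run is broken by a change in ab or by x == 1.
mark_run_ends <- function(ab, x) {
  n <- length(ab)
  # a new run starts at position 1, after a change in ab,
  # or immediately after a position where x == 1
  starts_new <- c(TRUE, ab[-1L] != ab[-n] | x[-n] == 1)
  run_id <- cumsum(starts_new)
  len    <- tabulate(run_id)        # length of each run
  first  <- which(starts_new)       # first position of each run
  last   <- first + len - 1L        # last position of each run
  v <- rep(NA_integer_, n)
  v[first] <- len
  v[last]  <- len
  v
}

ab <- c(rep("A", 5), "B", rep("A", 3))
x  <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
mark_run_ends(ab, x)
# [1]  4 NA NA  4  1  1  3 NA  3
```

Passing a vector of zeros for x reproduces the unrestricted result from the question.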

Find all combinations of the numbers in a vector. R programming

Are there any direct functions that can be used to get the combinations of all the items in the vector?
myVector <- c(1,2,3)
for (i in myVector)
for (j in myVector)
for (k in myVector)
print(paste(i,j,k,sep=","))
As there are three values (1, 2, 3), the output has
3 * 3 * 3 = 27 lines.
I tried to get the permutations using the function permn() as,
permn(myVector)
But it gives only 9 different values.
Is there any direct function that can produce a result like the one from the nested loops above?
Using RcppAlgos::permuteGeneral.
r <- RcppAlgos::permuteGeneral(myVector, length(myVector), repetition=TRUE)
head(r, 3)
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 1 3
If you want the comma separated strings, do
apply(r, 1, paste, collapse=",")
# [1] "1,1,1" "1,1,2" "1,1,3" "1,2,1" "1,2,2" "1,2,3" "1,3,1"
# [8] "1,3,2" "1,3,3" "2,1,1" "2,1,2" "2,1,3" "2,2,1" "2,2,2"
# [15] "2,2,3" "2,3,1" "2,3,2" "2,3,3" "3,1,1" "3,1,2" "3,1,3"
# [22] "3,2,1" "3,2,2" "3,2,3" "3,3,1" "3,3,2" "3,3,3"
Or, for the list output you've also shown:
RcppAlgos::permuteGeneral(myVector, length(myVector), FUN=function(x)
paste(x, collapse=","), repetition=TRUE)
# [[1]]
# [1] "1,1,1"
#
# [[2]]
# [1] "1,1,2"
#
# [[3]]
# [1] "1,1,3"
#
# [[4]]
# [1] "1,2,1"
# ...
You may decide on your own :)
Use expand.grid :
tmp <- expand.grid(myVector, myVector, myVector)
tmp
# Var1 Var2 Var3
#1 1 1 1
#2 2 1 1
#3 3 1 1
#4 1 2 1
#5 2 2 1
#6 3 2 1
#...
#...
If you want to do this automatically for the length of myVector without manually specifying it 3 times you can use replicate.
tmp <- do.call(expand.grid, replicate(length(myVector),
myVector, simplify = FALSE))
To paste the values together you can do :
do.call(paste, c(tmp, sep = ','))
# [1] "1,1,1" "2,1,1" "3,1,1" "1,2,1" "2,2,1" "3,2,1" "1,3,1" "2,3,1"
# [9] "3,3,1" "1,1,2" "2,1,2" "3,1,2" "1,2,2" "2,2,2" "3,2,2" "1,3,2"
#[17] "2,3,2" "3,3,2" "1,1,3" "2,1,3" "3,1,3" "1,2,3" "2,2,3" "3,2,3"
#[25] "1,3,3" "2,3,3" "3,3,3"
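Note that expand.grid varies its first column fastest, so the strings above come out in a different order than the nested loops in the question. If the loop order matters, reversing the columns before pasting should recover it. This is a small sketch using base R only; the variable names are illustrative.

```r
myVector <- c(1, 2, 3)

# one copy of myVector per position, expanded to all 27 rows
grid <- do.call(expand.grid,
                replicate(length(myVector), myVector, simplify = FALSE))

# reverse the column order so the last position varies fastest,
# as in the nested for loops
loop_order <- do.call(paste, c(grid[rev(seq_along(grid))], sep = ","))
head(loop_order, 4)
# [1] "1,1,1" "1,1,2" "1,1,3" "1,2,1"
```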
Note that there is a permutations function in the gtools package that allows you to generalize permutation outputs:
library(gtools)
permutations(3, 3, 1:3, repeats.allowed = TRUE)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 2
[3,] 1 1 3
[4,] 1 2 1
[5,] 1 2 2
[6,] 1 2 3
[7,] 1 3 1
[8,] 1 3 2
[9,] 1 3 3
[10,] 2 1 1
The function help describes the parameter settings.
It appears that pracma::combs does exactly this. That, and pracma::perms generate output sets which treat every element of the input as distinct, regardless of whether a value is repeated.

Pivot table of concatenated string in r

I have the following dataset:
mydata<- data.frame(Factors= c("a,b" , "c,d" , "a,c"), Valu = c ("2,3" , "7,8" , "9,1"))
Factors Valu
1 a,b 2,3
2 c,d 7,8
3 a,c 9,1
and I wish to convert it to the following, which lists all the values that occurred with each factor:
My ideal output
a b c d
2 2 7 7
3 3 8 8
9 9
1 1
I need a pivot table. However, I first need to prepare the data and then use melt and dcast to get my desired output. One of my failed attempts at preparing the data is:
mydata2 <- cSplit(mydata, c("Factors","Valu") , ",", "long")
But they lose their connections.
Here is a one-liner with cSplit:
library(splitstackshape)
with(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"), split(Valu, Factors))
#$a
#[1] 2 3 9 1
#$b
#[1] 2 3
#$c
#[1] 7 8 9 1
#$d
#[1] 7 8
If we need a data.table/data.frame, use dcast to convert the 'long' format to 'wide'.
dcast(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"),
rowid(Factors)~Factors, value.var="Valu")[, Factors := NULL][]
# a b c d
#1: 2 2 7 7
#2: 3 3 8 8
#3: 9 NA 9 NA
#4: 1 NA 1 NA
NOTE: splitstackshape loads data.table. Here, we used data.table_1.10.0. The dcast from data.table is also very fast.
Using a couple of *applys, strsplit and grep
## convert columns to characters so you can use strsplit
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
## get all the unique factor values by splitting them
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time)
l <- sapply(f, function(x) mydata[grep(x, mydata$Factors), "Valu"])
This gives a list, where each element is named by the 'Factor' value, and it contains all the 'Valu' values associated with it
l
# $a
# [1] "2,3" "9,1"
#
# $b
# [1] "2,3"
#
# $c
# [1] "7,8" "9,1"
#
# $d
# [1] "7,8"
Another lapply on this list will split the 'Valu's
result <- lapply(l, function(x) unlist(strsplit(x, split = ",")))
result
# $a
# [1] "2" "3" "9" "1"
#
# $b
# [1] "2" "3"
#
# $c
# [1] "7" "8" "9" "1"
#
# $d
# [1] "7" "8"
Edit
To get the result in a data.frame, you can make each list element the same length (by filling with NA), then call data.frame on the result
## the number of rows required for each column
maxLength <- max(sapply(result, length))
## append 'NA's to list elements with fewer than maxLength elements
result <- data.frame(sapply(result, function(x) c(x, rep(NA, maxLength - length(x))) ))
result
# a b c d
# 1 2 2 7 7
# 2 3 3 8 8
# 3 9 <NA> 9 <NA>
# 4 1 <NA> 1 <NA>
Edit
In response to the comment: if some factor names are contained in others (for example "a" and "ao"), you can make your grep regex explicit by using ( ) (see any regex cheatsheet for explanations)
mydata<- data.frame(Factors= c("a,b" , "c,d" , "a,c", "bo,ao"), Valu = c ("2,3" , "7,8" , "9,1", "x,y"))
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time)
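## anchor each value on the start/end of the string or a comma so that
## "a" does not also match "ao"
l <- sapply(f, function(x) mydata[grep(paste0("(^|,)", x, "(,|$)"), mydata$Factors), "Valu"])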
Another base R attempt:
# character conversion first
mydata[] <- lapply(mydata, as.character)
long <- do.call(rbind,
do.call(Map, c(expand.grid, lapply(mydata, strsplit, ","), stringsAsFactors=FALSE))
)
split(long$Valu, long$Factors)
#$a
#[1] "2" "3" "9" "1"
#
#$b
#[1] "2" "3"
#
#$c
#[1] "7" "8" "9" "1"
#
#$d
#[1] "7" "8"
I misunderstood in my comment above; if you want every Factor to match every Valu, you need to separate the columns independently to get the combinations. If you add indices to spread by, it's not too bad:
library(tidyverse)
mydata %>%
separate_rows(Factors) %>% separate_rows(Valu, convert = TRUE) %>%
# add indices to give row order when spreading
group_by(Factors) %>% mutate(i = row_number()) %>%
spread(Factors, Valu) %>%
select(-i) # clean up extra column
## # A tibble: 4 × 4
## a b c d
## * <int> <int> <int> <int>
## 1 2 2 7 7
## 2 3 3 8 8
## 3 9 NA 9 NA
## 4 1 NA 1 NA

Generalize R %in% operator to match tuples

I spent a while the other day looking for a way to check if a row vector is contained in some set of row vectors in R. Basically, I want to generalize the %in% operator to match a tuple instead of each entry in a vector. For example, I want:
row.vec = c("A", 3)
row.vec
# [1] "A" "3"
data.set = rbind(c("A",1),c("B",3),c("C",2))
data.set
# [,1] [,2]
# [1,] "A" "1"
# [2,] "B" "3"
# [3,] "C" "2"
row.vec %tuple.in% data.set
# [1] FALSE
for my made-up operator %tuple.in% because the row vector c("A",3) is not a row vector in data.set. Using the %in% operator gives:
row.vec %in% data.set
# [1] TRUE TRUE
because "A" and 3 are in data.set, which is not what I want.
I have two questions. First, are there any good existing solutions to this?
Second, since I couldn't find them (even if they exist), I tried to write my own function to do it. It works for an input matrix of row vectors, but I'm wondering if any experts can propose improvements:
is.tuple.in <- function(matrix1, matrix2){
# Apply rbind() so that matrix1 has columns even if it is a row vector.
matrix1 = rbind(matrix1)
if(ncol(matrix1) != ncol(matrix2)){
stop("Matrices must have the same number of columns.") }
# Now check for the first row and handle other rows recursively
row.vec = matrix1[1,]
tuple.found = FALSE
for(i in 1:nrow(matrix2)){
# If we find a match, then this row exists in matrix 2 and we can break the loop
if(all(row.vec == matrix2[i,])){
tuple.found = TRUE
break
}
}
# If there are more rows to be checked, use a recursive call
if(nrow(matrix1) > 1){
return(c(tuple.found, is.tuple.in(matrix1[2:nrow(matrix1),],matrix2)))
} else {
return(tuple.found)
}
}
I see a couple of problems with this that I'm not sure how to fix. First, I'd like the base case to be clear at the start of the function. I didn't manage to do this because I pass matrix1[2:nrow(matrix1),] in the recursive call, which produces an error if matrix1 has one row. So instead of getting to a case where matrix1 is empty, I have an if condition at the end deciding if more iterations are necessary.
Second, I think the use of rbind() at the start is sloppy, but I needed it for when matrix1 had been reduced to a single row. Without using rbind(), ncol(matrix1) produced an error in the 1-row case. I figure my trouble here has to do with a lack of knowledge about R data types.
Any help would be appreciated.
I'm wondering if you have made this a bit more complicated than it is. For example,
set.seed(1618)
vec <- c(1,3)
mat <- matrix(rpois(1000,3), ncol = 2)
rownames(mat) <- 1:nrow(mat)
mat[sapply(1:nrow(mat), function(x) all(vec %in% mat[x, ])), ]
# gives me this
# [,1] [,2]
# 6 3 1
# 38 3 1
# 39 3 1
# 85 1 3
# 88 1 3
# 89 1 3
# 95 3 1
# 113 1 3
# ...
you could subset this further if you care about the order
or you could modify the function slightly:
mat[sapply(1:nrow(mat), function(x)
all(paste(vec, collapse = '') %in% paste(mat[x, ], collapse = ''))), ]
# [,1] [,2]
# 85 1 3
# 88 1 3
# 89 1 3
# 113 1 3
# 133 1 3
# 139 1 3
# 187 1 3
# ...
another example with a longer vector
set.seed(1618)
vec <- c(1,4,5,2)
mat <- matrix(rpois(10000, 3), ncol = 4)
rownames(mat) <- 1:nrow(mat)
mat[sapply(1:nrow(mat), function(x) all(vec %in% mat[x, ])), ]
# [,1] [,2] [,3] [,4]
# 57 2 5 1 4
# 147 1 5 2 4
# 279 1 2 5 4
# 303 1 5 2 4
# 437 1 5 4 2
# 443 1 4 5 2
# 580 5 4 2 1
# ...
I see a couple that match:
mat[sapply(1:nrow(mat), function(x)
all(paste(vec, collapse = '') %in% paste(mat[x, ], collapse = ''))), ]
# [,1] [,2] [,3] [,4]
# 443 1 4 5 2
# 901 1 4 5 2
# 1047 1 4 5 2
but only three
for your single row case:
vec <- c(1,4,5,2)
mat <- matrix(c(1,4,5,2), ncol = 4)
rownames(mat) <- 1:nrow(mat)
mat[sapply(1:nrow(mat), function(x)
all(paste(vec, collapse = '') %in% paste(mat[x, ], collapse = ''))), ]
# [1] 1 4 5 2
here is a simple function with the above code
is.tuplein <- function(vec, mat, exact = TRUE) {
rownames(mat) <- 1:nrow(mat)
if (exact)
tmp <- mat[sapply(1:nrow(mat), function(x)
all(paste(vec, collapse = '') %in% paste(mat[x, ], collapse = ''))), ]
else tmp <- mat[sapply(1:nrow(mat), function(x) all(vec %in% mat[x, ])), ]
return(tmp)
}
is.tuplein(vec = vec, mat = mat)
# [1] 1 4 5 2
seems to work, so let's make our own %in% operator:
`%tuple%` <- function(x, y) is.tuplein(vec = x, mat = y, exact = TRUE)
`%tuple1%` <- function(x, y) is.tuplein(vec = x, mat = y, exact = FALSE)
and try her out
set.seed(1618)
c(1,2,3) %tuple% matrix(rpois(1002,3), ncol = 3)
# [,1] [,2] [,3]
# 133 1 2 3
# 190 1 2 3
# 321 1 2 3
set.seed(1618)
c(1,2,3) %tuple1% matrix(rpois(1002,3), ncol = 3)
# [,1] [,2] [,3]
# 48 2 3 1
# 64 2 3 1
# 71 1 3 2
# 73 3 1 2
# 108 3 1 2
# 112 1 3 2
# 133 1 2 3
# 166 2 1 3
Does this do what you want (even for more than 2 columns)?
paste(row.vec,collapse="_") %in% apply(data.set,1,paste,collapse="_")
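The paste idea can be turned into the operator from the question. One caveat: collapsing with an empty string, as in the earlier answer, is ambiguous for multi-digit values (c(1, 23) and c(12, 3) both become "123"), so a separator is safer. The sketch below assumes that no value contains a tab character; the choice of separator is mine, not from the original answers.

```r
# Sketch of a row-wise %in%: returns one logical per row of x.
`%tuple.in%` <- function(x, table) {
  x <- rbind(x)  # a plain vector becomes a one-row matrix
  stopifnot(ncol(x) == ncol(table))
  # build one key per row; assumes values never contain a tab
  key <- function(m) apply(m, 1, paste, collapse = "\t")
  key(x) %in% key(table)
}

data.set <- rbind(c("A", 1), c("B", 3), c("C", 2))

c("A", 3) %tuple.in% data.set
# [1] FALSE
c("B", 3) %tuple.in% data.set
# [1] TRUE
```

Because the left-hand side may itself be a matrix, several candidate rows can be checked in a single call, which removes the need for the recursion in the question.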

Naming elements of matrix dimensions one at a time, when dimname is NULL

When dimnames is currently NULL, is it possible to rename a matrix's dimensions one at a time?
For example, this fails:
mtx <- matrix(1:16,4)
dimnames(mtx)[[2]][1] <- 'col1'
with Error in dimnames(mtx)[[2]][1] <- "col1" : 'dimnames' must be a list
However this works:
mtx <- matrix(1:16,4)
dimnames(mtx)[[1]] <- letters[1:4]
dimnames(mtx)[[2]] <- LETTERS[1:4]
dimnames(mtx)[[2]][1] <- 'col1'
dimnames(mtx)[[2]][2] <- 'col2'
My objective is to separately replace dimnames(mtx)[[2]][1] and dimnames(mtx)[[2]][2] etc. If this is not possible, I can rewrite the loop.
Thanks folks, I have ended up with the below -- I pass the names in via prepend:
mtxNameSticker <- function(mtx, prepend = NULL, MARGIN=2)
{
if (MARGIN == 1) max <- nrow(mtx) else
max <- ncol(mtx)
if (is.null(prepend)) if (MARGIN == 2) prepend <- 'C' else
prepend <- 'R'
if (length(prepend) == 1) prepend <- paste0(prepend, 1:dim(mtx)[[MARGIN]])
dimnames(mtx)[[MARGIN]] <- seq(from=1, by=1, length.out=dim(mtx)[[MARGIN]])
for (i in 1:max){
dimnames(mtx)[[MARGIN]][i] <- prepend[i]
}
return(mtx)
}
For as long as dimnames is NULL and not an appropriate list, you cannot make assignments to it at particular positions. One easy way to create a dummy but complete list of dimnames is to run:
dimnames(mtx) <- lapply(dim(mtx), seq_len)
mtx
# 1 2 3 4
# 1 1 5 9 13
# 2 2 6 10 14
# 3 3 7 11 15
# 4 4 8 12 16
Then, you can make assignments one at a time like you were wishing:
dimnames(mtx)[[2]][1] <- 'col1'
mtx
# col1 2 3 4
# 1 1 5 9 13
# 2 2 6 10 14
# 3 3 7 11 15
# 4 4 8 12 16
You are assigning a vector even though you are asked to supply a list.
Try this:
R> M <- matrix(1:4,2,2)
R> M
[,1] [,2]
[1,] 1 3
[2,] 2 4
R>
Columns:
R> M1 <- M; dimnames(M1) <- list(NULL, c("a","b")); M1
a b
[1,] 1 3
[2,] 2 4
R>
Rows:
R> M2 <- M; dimnames(M2) <- list(c("A","B"), NULL); M2
[,1] [,2]
A 1 3
B 2 4
R>
In response to your comment: @DirkEddelbuettel is correct, you are assigning a vector to what should be a list.
The reason for this is that you are assigning into dimnames while the dimnames are still NULL (not assigned yet).
Consider how R evaluates the following:
x <- NULL
x[[2]][1] <- 'col1'
str(x)
## chr [1:2] NA "col1"
R returns a vector of length 2, not a list of length 2.
For your assignment to work, R would have to evaluate
x <- NULL
x[[2]][1] <- 'col1'
str(x)
to give
## List of 2
## $ : NULL
## $ : chr "col1"
Which is what would happen if x was originally defined as x <- list(NULL,NULL)
however, the dimnames must be NULL or a list of appropriate length vectors
The following does work (and is really @flodel's solution)
dimnames(mtx) <- list(character(nrow(mtx)), character(ncol(mtx)))
# or
# dimnames(mtx) <- lapply(dim(mtx), character)
dimnames(mtx)[[2]][1] <- 'col1'
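The same idea can be packaged so that only the margin being named is initialised and the other margin stays NULL. This is a minimal sketch; the function name set_dimname and its arguments are hypothetical, and filling the missing names with sequence numbers follows the seq_len approach shown earlier.

```r
# Hypothetical helper: set a single dimname, creating the name vector
# for that margin first if it does not exist yet.
set_dimname <- function(m, margin, i, value) {
  dn <- dimnames(m)
  if (is.null(dn)) dn <- vector("list", length(dim(m)))
  if (is.null(dn[[margin]])) {
    # placeholder names "1", "2", ... for the chosen margin only
    dn[[margin]] <- as.character(seq_len(dim(m)[margin]))
  }
  dn[[margin]][i] <- value
  dimnames(m) <- dn
  m
}

mtx <- matrix(1:16, 4)
mtx <- set_dimname(mtx, margin = 2, i = 1, value = "col1")
colnames(mtx)
# [1] "col1" "2"    "3"    "4"
```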
It seems you are allowed to set the name of the dimension without actually having any names for the dimension:
dimnames(mtx) = list(NULL,col1=NULL)
mtx
# col1
# [,1] [,2] [,3] [,4]
# [1,] 1 5 9 13
# [2,] 2 6 10 14
# [3,] 3 7 11 15
# [4,] 4 8 12 16
