Spacing a vector by a regular pattern in R

I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical, prefixed with the symbol #).
However, there are cases where no # term appears: in the example above, between mp and jq, and again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
  j <- 1
  while( j < length(x) ){
    if( grepl(pattern, x[j+1]) ){
      j <- j + 2
    } else {
      x <- c( x[1:j], replacement, x[(j+1):length(x)] )
      j <- j + 2
    }
  }
  if( !grepl(pattern, tail(x, 1)) ){ x <- c(x, replacement) }
  return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", . ) %>%
matrix(ncol = 2, byrow = TRUE) %>%
as.data.frame

Starting with your vec, we can create your expected data frame directly with some functions from dplyr, tidyr, and stringr.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
dat <- data_frame(Value = vec)
dat2 <- dat %>%
  mutate(String = !str_detect(vec, "#"),
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String) %>%
  spread(Key, Value, fill = "#") %>%
  select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #
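To see what spread() is pivoting here, this is roughly the intermediate tibble after the mutate()/select(-String) steps, before spread() fills the missing V2 cells with "#":
dat %>%
  mutate(String = !str_detect(vec, "#"),
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String)
# # A tibble: 8 x 3
#   Value Key   Row
#   <chr> <chr> <int>
# 1 ab    V1    1
# 2 #4    V2    1
# 3 gw    V1    2
# 4 #29   V2    2
# 5 mp    V1    3
# 6 jq    V1    4
# 7 #35   V2    4
# 8 ez    V1    5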

Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, take the cumulative sum, and split the original vector by this grouping variable into a list ('lst'). List elements that don't have the maximum length (two elements) are padded with NA at the end by assignment with length<-. Then rbind the list elements into a two-column matrix. If needed, convert the NAs to #.
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if we need a data.frame output
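For example, wrapping the matrix gives a data frame like this:
as.data.frame(out)
#   V1   V2
# 1 ab   #4
# 2 gw  #29
# 3 mp    #
# 4 jq  #35
# 5 ez    #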

You can use base R:
First collapse the vector into a string while inserting # where needed.
Then just read it back with read.csv.
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Explanation:
First collapse the vector into a string with toString.
Then, if there are letters on either side of a comma (i.e. [a-z],\s*[a-z]) or at the end of the string (i.e. |$), insert a #.
Then create line breaks after numbers or # and read the data in as a table. The intermediate strings are shown below.
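To illustrate, the intermediate strings should look roughly like this:
toString(vec)
# [1] "ab, #4, gw, #29, mp, jq, #35, ez"
vec1
# [1] "ab, #4, gw, #29, mp,#,jq, #35, ez,#,"
The second gsub then replaces the comma after each # term with a line break, so read.csv sees one row per pair.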
You can also do:
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
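Here sub() inserts a line break before each alphabetical element, so the text handed to read.csv looks roughly like:
toString(sub("([a-z]+)", "\n\\1", vec))
# [1] "\nab, #4, \ngw, #29, \nmp, \njq, #35, \nez"
The na = c(" ", "") argument then turns the blank second fields into NA, which replace() converts to "#".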

Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
Check if the elements in vec start with #, and negate it: !grepl("^#", vec); this creates a logical vector.
Create a grouping variable by applying cumsum to the logical vector (note: steps 1 & 2 are similar to @akrun's answer; its values on the example vector are shown below).
Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad with a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x
Bind the resulting list together by row: do.call(rbind, ...)
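On the example vector, the grouping variable looks like this:
cumsum(!grepl("^#", vec))
# [1] 1 1 2 2 3 4 4 5
Groups 3 and 5 ("mp" and "ez") have length 1, so they get padded with a trailing "#".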
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
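For the example vector, the two index vectors are:
ri
# [1] 1 1 2 2 3 4 4 5
ci
# [1] 1 2 1 2 1 1 2 1
# so e.g. "mp" lands at m[3, 1] and "jq" at m[4, 1], while m[3, 2] stays NA until it is replaced by "#"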
Using data.table: create a grouping variable as above, then reshape from long to wide with dcast.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #

Related

If string has a certain character, fill an empty cell in the same row with a certain value

Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
  mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+", "\\1\\2"),
                         "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
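To see the two replacements step by step on a single string:
str_replace("T1-T3", "^([A-Z])\\d+-([A-Z])\\d+", "\\1\\2")
# [1] "TT"
str_replace("TT", "(.)\\1+", "\\1")
# [1] "T"
The first call keeps only the two captured capital letters; the second collapses any repeated letter to a single one.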
Or, another option is str_extract_all to extract the upper case letters, then loop over the list with map and paste the unique elements together:
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
                      b = c(NA, NA, NA)),
                 class = "data.frame", row.names = c(NA, -3L))

Subset a dataframe by matching it to a list and include non-match value too in the output using R

I have a dataframe (myDF) with two columns, "A" and "B", and a function (myfunc) which takes a list as input; if it finds a match in column "A", it returns a new dataframe that is a subset of myDF containing the matched values and the corresponding "B" column.
But I want the function to also return the non-matching values in column A, with the string "NULL" in column B.
myDF:
A B
1 11
2 22
3 33
myfunc:
myfunc <- function(x) {
  r <- with(myDF, myDF[A %in% x, c("A", "B")])
  return(data.frame(r))
}
Input: mylist = c(1,2,"E")
Expected Output:
A B
1 11
2 22
E NULL
We create a logical index and assign
i1 <- with(myDF, !A %in% mylist)
myDF$B[i1] <- "NULL"
myDF$A[i1] <- mylist[i1]
myDF
# A B
#1 1 11
#2 2 22
#3 E NULL
Note: by assigning a character string to the 'B' column, we effectively change its type from numeric to character. A better option would be to assign NA:
myDF$B[i1] <- NA
Or
data.frame(A= mylist, B = myDF$B[match(mylist, myDF$A)])
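Here match() returns the row positions in myDF (with NA where there is no match), so B is picked up positionally:
match(mylist, myDF$A)
# [1]  1  2 NA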
This is a join operation, which can be done in base R with merge, if you make the list a data.frame first. The all.y = T argument includes rows of mylistDF with no matching rows in myDF in the output.
mylistDF <- data.frame(A = mylist, stringsAsFactors = F)
merge(myDF, mylistDF, by = 'A', all.y = T)
# A B
# 1 1 11
# 2 2 22
# 3 E NA
Since you tagged tidyr, here's a tidyverse solution (same output)
library(tidyverse)
mylistDF <- tibble(A = mylist)
myDF %>%
mutate_at('A', as.character) %>%
right_join(mylistDF, by = 'A')

integer type split into two columns

I have a column with alphanumeric strings separated by '->', which I'm trying to split into separate columns.
Df:
column e
1. asd1->ref2
2. fde4 ->fre4
3. dfgt-fgr ->frt5
4. ftr5 -> lkh-oiut
5. rey6->usre-lynng->usre-lkiujh->kiuj-bunny
6. dge1->fgt4->okiuj-dfet
Desired output
col 1 col 2
1. asd1 ref2
2. fde4 fre4
3. frt5
4. ftr5
5. rey6
6. dge1 fgt4
I tried out <- strsplit(as.character(Df$column e), '_->_') with no output, str_extract(m1$column e, "(?<=\\[)[[:alnum:]]") -> m1$column f, and also strsplit(as.character(Df$column e), ' -> ', fixed = T)[[1]][[1]], but I am not getting the desired output.
The column is of integer type and all are capital letters (I'm not sure if this is important).
Here is one way with tidyverse
library(tidyverse)
df1 %>%
separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
mutate_all(funs(replace(., str_detect(., '-'), "")))
# col1 col2
#1 asd1 ref2
#2 fde4 fre4
#3 frt5
#4 ftr5
#5 rey6
#6 dge1 fgt4
A base R solution as well, though a fair bit less concise than @akrun's tidyverse one:
# split as appropriate
out <- strsplit( as.character( Df$column.e ), '->' )
out <- lapply( out, function(x) {
  # I assume you don't want the white space
  y <- trimws( x )
  # take the first two "columns"
  y <- y[1:2]
  # remove any items containing a hyphen
  y[ grepl( "-", y ) ] <- ""
  y
} )
# then bind it all rowwise
out <- do.call( rbind, out )
data.frame( out )
X1 X2
1 asd1 ref2
2 fde4 fre4
3 frt5
4 ftr5
5 rey6
6 dge1 fgt4

sorting list of dataframes by date

I have a list of dataframes in a structure similar to this:
`ID1_01/05/10` <- data.frame(c(1,1))
`ID1_21/02/10` <- data.frame(c(2,1))
`ID2_01/05/10` <- data.frame(c(3,1))
`ID2_21/02/10` <- data.frame(c(4,1))
lst <- list(mget ( ls ( pattern = 'ID\\d+')))
I'd like to order them in the list first by identity and then by date. I.e.:
`ID1_21/02/10`
`ID1_01/05/10`
`ID2_21/02/10`
`ID2_01/05/10`
Is there a way of doing this easily?
We extract the names, get the numeric part ('v1') and the Date part, and order based on it
nm1 <- sapply(lst, names)[,1]
v1 <- as.numeric(sub(".*(\\d+)_.*", "\\1", nm1))
d1 <- as.Date(sub(".*_", "", nm1), "%d/%m/%y")
nm1[order(v1, d1)]
#[1] "ID1_21/02/10" "ID1_01/05/10" "ID2_21/02/10" "ID2_01/05/10"
lapply(lst, function(x) x[order(v1, d1)])
#[[1]]
#[[1]]$`ID1_21/02/10`
# c.2..1.
#1 2
#2 1
#[[1]]$`ID1_01/05/10`
# c.1..1.
#1 1
#2 1
#[[1]]$`ID2_21/02/10`
# c.4..1.
#1 4
#2 1
#[[1]]$`ID2_01/05/10`
# c.3..1.
#1 3
#2 1
Update
In the OP's example, mget was wrapped in list, which creates a list of lists. Instead it would be
lst <- mget ( ls ( pattern = 'ID\\d+'))
and if that is the case, then
nm1 <- names(lst)
lst[order(v1, d1)]
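With the flat list, the intermediate pieces work out roughly like this:
nm1 <- names(lst)
nm1
# [1] "ID1_01/05/10" "ID1_21/02/10" "ID2_01/05/10" "ID2_21/02/10"
v1 <- as.numeric(sub(".*(\\d+)_.*", "\\1", nm1))
v1
# [1] 1 1 2 2
d1 <- as.Date(sub(".*_", "", nm1), "%d/%m/%y")
d1
# [1] "2010-05-01" "2010-02-21" "2010-05-01" "2010-02-21"
order(v1, d1)
# [1] 2 1 4 3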

Expand a list in an R dataframe into additional rows of the dataframe?

In a separate question earlier today I asked how to flatten nested lists into a row in a dataframe. I wish to further understand how to manipulate lists within a dataframe, this time by expanding a list vertically within the dataframe, adding new rows to accommodate the data.
In this case I wish to go from this structure:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5 Treat-c,Treat-d,Treat-e
To this structure:
CAT COUNT TREAT
A 1 Treat-a
A 2 Treat-b
A 3 NA
B 4 Treat-c
B 5 Treat-d
B NA Treat-e
Code to generate the test (source) data:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5) # as numbers
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
I tried using cbind in an approach similar to the answer supplied to my earlier question, but it failed due to the different number of values between the lists. Thank you for your patience as I attempt to grasp these basic manipulation tasks.
Here's what I came up with, using a helper function cbind.fill (from "cbind a df with an empty df (cbind.fill?)"):
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function (x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}
#Split df by CAT
df.split <- split(df, df$CAT)
#Apply cbind.fill to make a matrix filled with NA where needed
rawlist <- lapply(df.split, function(x) cbind(as.character(x$CAT), cbind.fill(unlist(x$COUNT), unlist(x$TREAT) ) ))
#Bind rows and convert matrix to data.frame
df.new <- as.data.frame(do.call(rbind, rawlist))
#Column names
colnames(df.new) <- names(df)
df.new
CAT COUNT TREAT
1 A 1 Treat-a
2 A 2 Treat-b
3 A 3 <NA>
4 B 4 Treat-c
5 B 5 Treat-d
6 B <NA> Treat-e
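As a quick check of the helper on its own, unequal-length inputs are padded with NA (illustrative):
cbind.fill(1:3, c("Treat-a", "Treat-b"))
#      [,1] [,2]
# [1,] "1"  "Treat-a"
# [2,] "2"  "Treat-b"
# [3,] "3"  NA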
Extending my answer from your previous question, you can create another function, let's call it flattenLong, that combines flatten and melt from "data.table" to get your desired output.
The function looks like this:
flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)
  x <- flatten(indt, cols, TRUE)
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
Usage is simply:
flattenLong(df, c("COUNT", "TREAT"))
## CAT variable COUNT TREAT
## 1: A 1 1 Treat-a
## 2: A 2 2 Treat-b
## 3: A 3 3 NA
## 4: B 1 4 Treat-c
## 5: B 2 5 Treat-d
## 6: B 3 NA Treat-e
For convenience, here's the flatten function again:
flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
  if (isTRUE(drop)) {
    indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE),
         .SDcols = cols][, (cols) := NULL]
  }
  indt[]
}
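For reference, the wide intermediate that flatten() produces here (before melt() reshapes it to long form) should look roughly like this:
flatten(df, c("COUNT", "TREAT"), drop = TRUE)
#    CAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1:   A       1       2       3 Treat-a Treat-b      NA
# 2:   B       4       5      NA Treat-c Treat-d Treat-e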
