Concatenate two strings with common elements - r

I am working on a simple problem in R (but I have not yet figured it out though;p):
Given a vector vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", ..., "Amada + Steven", "Steven + Henry"). I want to create a new vector vect2 that contains all the elements in vect1 and new elements that share the following property: for every two strings "A+B" and "B+C", we concatenate it into "A+C" and add this new element into vect2. Can anyone please help me do this?
Also, I want to get all the elements standing in front of + in each string, is the following code correct?
for (i in length(vect1)){
vect3[i] <- regexpr(".*+", vect1[i])
}
3rd question: if I have a dataframe d with a Date column in the format %d-%b (for example, 01-Apr), how do I order this dataframe in an increasing order based on Date?? Let's just say d <- c(01-Apr,01-Mar,02-Jan,31-June,30-May).

I think you could (should) avoid both for loops and the use of external lib if not required.
So this might be a solution:
# create data
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")
# create a matrix of pairs with white space removed
pairsMatrix <- do.call(rbind, sapply(vect1, function(v) strsplit(gsub(pattern = " ", replacement = "", x = v), "\\+")))
# remove dimnames (not necessary though)
dimnames(pairsMatrix) <- NULL
# for each row of pairsMatrix, check whether its second element appears elsewhere as a first element, and bind the resulting pairs to the original ones
allPairs <- do.call(rbind, c(list(pairsMatrix), apply(pairsMatrix, 1, function(names) c(names[1], pairsMatrix[names[2]==pairsMatrix[,1], 2]))))
# filter out self-relationships
allPairs[allPairs[,1]!=allPairs[,2],]
[,1] [,2]
[1,] "Andy" "Pete"
[2,] "Mary" "Pete"
[3,] "Pete" "Amada"
[4,] "Amada" "Steven"
[5,] "Steven" "Henry"
[6,] "Andy" "Amada"
[7,] "Mary" "Amada"
[8,] "Pete" "Steven"
[9,] "Amada" "Henry"
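The same chaining step can also be written as a join. This is only a minimal sketch, not the method used above: it assumes base R's merge() and uses illustrative column names (from, to, next_to) to match the second name of each pair against the first name of another pair.

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Strip spaces, then split each string at the literal "+"
parts <- strsplit(gsub(" ", "", vect1, fixed = TRUE), "+", fixed = TRUE)
pairs <- data.frame(
  from = vapply(parts, `[`, character(1), 1L),
  to   = vapply(parts, `[`, character(1), 2L),
  stringsAsFactors = FALSE
)

# Second copy of the pairs, renamed so that its first name lines up with `to`
# (these column names are illustrative, not from the answer above)
nxt <- setNames(pairs, c("to", "next_to"))

# Every "A+B" joined with every "B+C" gives one row (B, A, C)
joined <- merge(pairs, nxt, by = "to")
new_pairs <- paste(joined$from, joined$next_to, sep = "+")

vect2 <- c(vect1, new_pairs)
```

For this input new_pairs holds the four chained pairs (Andy+Amada, Mary+Amada, Pete+Steven, Amada+Henry), though merge() does not promise the same row order as the matrix approach. If the data could contain "A+B" and "B+A", a filter such as joined$from != joined$next_to would still be needed.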
Concerning your last point, I think a simple sort with proper Date object will do it.

I think this should do it but I did things I probably shouldn't do... like growing objects and nesting for loops. If you want to access all elements in front of the '+', just use name.matrix[,1].
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada","Amada + Steven", "Steven + Henry")
library(stringr)
name.matrix <- matrix(do.call('rbind',str_split(vect1, pattern = "\\s?[+]\\s?")), ncol = 2)
new.stuff <- c()
for(x in unique(name.matrix[,2])){
sub.mat.1 <- matrix(name.matrix[name.matrix[,2] == x,], ncol = 2)
sub.mat.2 <- matrix(name.matrix[name.matrix[,1] == x,], ncol = 2)
if(length(sub.mat.1) && length(sub.mat.2)){
for(y in seq_along(sub.mat.1[,2])){
new.add <- paste0(sub.mat.1[y,1],'+', sub.mat.2[,2])
new.stuff <- c(new.stuff, new.add)
}
}
}
vect2 <- c(vect1, new.stuff)
vect2
#[1] "Andy+Pete" "Mary + Pete" "Pete+ Amada" "Amada + Steven" "Steven + Henry" "Andy+Amada"
#[7] "Mary+Amada" "Pete+Steven" "Amada+Henry"
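On the second question: the loop in the question would not work as written, because for (i in length(vect1)) visits only the last index, regexpr() returns match positions rather than text, and an unescaped + is a regex quantifier. A loop-free base R sketch, assuming each string contains exactly one "+":

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Drop everything from the first "+" onward, then trim surrounding spaces.
# "\\+" escapes the plus sign so it is matched literally.
vect3 <- trimws(sub("\\+.*$", "", vect1))
vect3
# "Andy" "Mary" "Pete" "Amada" "Steven"
```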
Update:
Third question. There are only 30 days in June, so '31-June' is not a valid date and you're going to get an NA there. If it's a data.frame that you're trying to sort based on date, you'll need to use the form df[order(df$Date),]. The lubridate package might also be helpful when working with dates.
d <- c('01-Apr','01-Mar','02-Jan','31-June','30-May')
d.new <- as.Date(d, format = '%d-%b')
d.new <- d.new[order(d.new)]
d.new
#[1] "2018-01-02" "2018-03-01" "2018-04-01" "2018-05-30" NA
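For the data.frame case, the df[order(df$Date),] form might look like the sketch below. The data frame, its value column, and the fixed year 2000 are assumptions made for illustration; the invalid '31-June' is replaced by '30-Jun', and %b relies on the session locale using English month abbreviations.

```r
# Hypothetical data frame; "31-June" from the question is replaced by a valid date
d <- data.frame(Date = c("01-Apr", "01-Mar", "02-Jan", "30-Jun", "30-May"),
                value = 1:5, stringsAsFactors = FALSE)

# Append a fixed, arbitrary year so parsing does not depend on the current year
key <- as.Date(paste0(d$Date, "-2000"), format = "%d-%b-%Y")

# Reorder the rows by the parsed dates; the Date column keeps its original text
d_sorted <- d[order(key), ]
d_sorted$Date
# "02-Jan" "01-Mar" "01-Apr" "30-May" "30-Jun"
```

Keeping the parsed dates in a separate key leaves the original %d-%b strings untouched in the sorted result.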

Related

Extract numbers after a pattern in vector of characters

I'm trying to extract values from a vector of strings. Each string in the vector, (there are about 2300 in the vector), follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567"
We can use str_match_all from stringr to get the number that follows each "Sent." entry (here text holds the input string).
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
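To get the single string the question asks for ("1311531" for the full example), the extracted digits can be pasted together. A small base R sketch on a shortened, made-up version of the sample string; the variable names are illustrative:

```r
# Shortened stand-in for the sample data (three "Sent." fields instead of seven)
x <- "733|Overall (-2 to 2): ___2___|Sent. 1 (ANALYSIS BY...): ___1___|Sent. 2 (Bail is...): __3____|Sent. 3 (2) A...): ___1___|"

# Split into fields at the literal "|" and keep only the "Sent. " fields
fields <- strsplit(x, "|", fixed = TRUE)[[1]]
sent <- fields[startsWith(fields, "Sent. ")]

# Capture the digits sitting between the underscores after ": "
digits <- sub("^.*: _+(\\d+)_+$", "\\1", sent)

# Collapse to one string
answer <- paste(digits, collapse = "")
answer
# "131"
```

For a vector of about 2300 such strings, the same steps could be wrapped in a function and applied with vapply().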

How to rename multiple files in R

I need to rename over 100 images to a format like SITE_T001_L001.jpg, where SITE is CGS1, T is the tube number, and L is the image number.
All of the images are contained in a single folder named CGS1 (the site), subdivided into folders named according to their tube number. Within each folder the images are ordered by date, and this order gives the image number: the first one is 1, the second one is 2 (alphabetical order is not correct).
I found how to do it manually in R
file.rename("Snap_029.jpg",
paste0(paste("CGS1", "T001", "L003", sep = "_"), ".jpg"))
but is there any way to automate it with a loop?
In more detail, as requested in the response:
I have this series of input filenames (including leading path), ordered by date of modification (important).
file_list
[1] "CGS1/1/Snap_001.jpg" "CGS1/1/Snap_002.jpg" "CGS1/1/Snap_005.jpg" "CGS1/2/Snap_006.jpg" "CGS1/2/Snap_007.jpg" "CGS1/2/Snap_082.jpg"
I want to rename each image using the main folder (CGS1), the subfolder (T001 to T002), and the order by date of modification (L001 to L003), giving these output filenames:
new_file_list
[1] "CGS1_T001_L001.jpg" "CGS1_T001_L002.jpg" "CGS1_T001_L003.jpg" "CGS1_T002_L001.jpg" "CGS1_T002_L002.jpg" "CGS1_T002_L003.jpg"
Try this:
file_list <- list.files(path = "...", recursive = TRUE, pattern = "\\.jpg$")
### for testing
file_list <- c(
"CGS1/1/Snap_001.jpg", "CGS1/1/Snap_005.jpg", "CGS1/1/Snap_002.jpg",
"CGS1/2/Snap_006.jpg", "CGS1/2/Snap_007.jpg", "CGS1/2/Snap_0082.jpg"
)
spl <- strsplit(file_list, "[/\\\\]")
# ensure that all files are exactly two levels down
stopifnot(all(lengths(spl) == 3))
m <- do.call(rbind, spl)
m
# [,1] [,2] [,3]
# [1,] "CGS1" "1" "Snap_001.jpg"
# [2,] "CGS1" "1" "Snap_005.jpg"
# [3,] "CGS1" "1" "Snap_002.jpg"
# [4,] "CGS1" "2" "Snap_006.jpg"
# [5,] "CGS1" "2" "Snap_007.jpg"
# [6,] "CGS1" "2" "Snap_0082.jpg"
From this, we'll update the second/third columns to be what you expect.
# either one (not both), depending on if you are guaranteed integers
m[,2] <- sprintf("T%03.0f", as.integer(m[,2]))
# ... or if you may have non-numbers
m[,2] <- paste0("T", strrep("0", pmax(0, 3 - nchar(m[,2]))), m[,2])
# since we really don't care about 'Snap_001.jpg' (etc), we can discard the third column
new_file_list <- apply(m[,1:2], 1, paste, collapse = "_")
# back-street way of applying sequences to each CGS/T combination while preserving order
for (prefix in unique(new_file_list)) {
new_file_list[new_file_list == prefix] <- sprintf("%s_L%03d.jpg",
new_file_list[new_file_list == prefix],
seq_len(sum(new_file_list == prefix)))
}
new_file_list
# [1] "CGS1_T001_L001.jpg" "CGS1_T001_L002.jpg" "CGS1_T001_L003.jpg"
# [4] "CGS1_T002_L001.jpg" "CGS1_T002_L002.jpg" "CGS1_T002_L003.jpg"
Now it's a matter of renaming. Note that this will move all files into the current directory.
file.rename(file_list, new_file_list)
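The question stresses that the image number should follow the modification date, whereas list.files() returns paths in alphabetical order. One possible sketch of ordering by modification time before numbering is shown below. The helper plan_renames() and the numeric mtimes are hypothetical; with real files the times could come from file.info(paths)$mtime.

```r
# Hypothetical helper: build a rename plan from paths like "CGS1/1/Snap_001.jpg"
# and their modification times
plan_renames <- function(paths, mtimes) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  site <- vapply(parts, `[`, character(1), 1L)
  tube <- as.integer(vapply(parts, `[`, character(1), 2L))

  # Sort by site, then tube, then modification time
  ord <- order(site, tube, mtimes)
  paths <- paths[ord]
  site <- site[ord]
  tube <- tube[ord]

  # Running image number within each site/tube group
  img <- ave(seq_along(paths), site, tube, FUN = seq_along)

  data.frame(from = paths,
             to = sprintf("%s_T%03d_L%03d.jpg", site, tube, img),
             stringsAsFactors = FALSE)
}

# Made-up modification times, for illustration only
paths  <- c("CGS1/1/Snap_001.jpg", "CGS1/1/Snap_005.jpg",
            "CGS1/1/Snap_002.jpg", "CGS1/2/Snap_006.jpg")
mtimes <- c(10, 30, 20, 10)
plan <- plan_renames(paths, mtimes)
```

Inspecting plan before calling file.rename(plan$from, plan$to) gives a chance to catch surprises. As above, the target names carry no directory, so the files would land in the working directory.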

How to access variable length lists inside a list in R

When I call strsplit() on a column of a data frame, depending on the results of the strsplit(), I sometimes get one or two "sublists" as a result of splitting. For example,
v <- c("50", "1 h 30 ", "1 h", NA)
split <- strsplit(v, "h")
[[1]]
[1] "50"
[[2]]
[1] "1 " " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
I know I can access the individual lists of split using '[]', and that '[[]]' gives me the contents of those sublists, so I think I understand that. I also know that I can access the " 30 " in [[2]] by doing split[[2]][2].
Unfortunately, I don't know how to access this programmatically over the entire column that I have. I am trying to convert the column to numeric data. But that "1 h 30" case is giving me a lot of trouble.
func1 <- function(x){
split.l <- strsplit(x, "h")
len <- lapply(split.l, length)
total <- ifelse(len == 2, as.numeric(split.l[2]) + as.numeric(split.l[1]) * 60, as.numeric(split.l[2]))
return(total)
}
v <- ifelse(grepl("h", v), func1(v), as.numeric(v))
I know len returns the vector of the length of the splits. But when it comes to actually accessing that individual sublist's second element, I simply don't know how to do it properly. This will generate an error because split.l[1] and split.l[2] will only return the first two elements of the entire original dataframe column every time. [[1]] and [[2]] won't work either. I need something like [[i]][1] and [[i]][2]. But I'm trying not to use a for loop and iterate.
To make a long story short, how do I access the inner list element programmatically
For reference, I did look at this which helped. But I still haven't been able to solve it. apply strsplit to specific column in a data.frame
I'm really struggling with lists and list processing in R so any help is appreciated.
A common idiom is lapply(l, `[`, 2), which applied to your example gives:
> lapply(split, `[`, 2)
[[1]]
[1] NA
[[2]]
[1] " 30 "
[[3]]
[1] NA
[[4]]
[1] NA
sapply() will collapse this to a vector if it can.
What is being done is lapply() takes each component of split in turn (this is the [[i]] bit of your pseudo code) and from each of those we want to extract the nth element. We do this by applying the `[` function with argument n, in this case 2L.
If you want the first element unless there is a second element, in which case take the second, you could just write a wrapper instead of using [ directly:
wrapper <- function(x) {
if(length(x) > 1L) {
x[2L]
} else {
x[1L]
}
}
lapply(split, wrapper)
which gives
> lapply(split, wrapper)
[[1]]
[1] "50"
[[2]]
[1] " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
or perhaps
lens <- lengths(split)
out <- lapply(split, `[`, 2L)
ind <- lens == 1L
out[ind] <- lapply(split[ind], `[`, 1L)
out
but that loops over the output from strsplit() twice.
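Putting the idiom to work on the original goal (a numeric column in minutes) might look like the sketch below. It assumes a value without "h" is already in minutes and that "h" separates hours from minutes; the variable names are illustrative.

```r
v <- c("50", "1 h 30 ", "1 h", NA)

has_h <- grepl("h", v, fixed = TRUE)    # FALSE for NA
parts <- strsplit(v, "h", fixed = TRUE)

# First and second piece of every split; missing pieces come back as NA
first  <- as.numeric(trimws(vapply(parts, `[`, character(1), 1L)))
second <- as.numeric(trimws(vapply(parts, `[`, character(1), 2L)))

# "1 h" has hours but no minutes part, so treat its minutes as 0
second[has_h & is.na(second)] <- 0

# Hours and minutes when "h" is present, otherwise the number as given
total <- ifelse(has_h, first * 60 + second, first)
total
# 50 90 60 NA
```

vapply() is used instead of lapply() so that each extraction returns a plain character vector, which as.numeric() can convert in one call.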

R programming : select element from split string based on value in another column

I have a data frame having one column of words, with syllables separated by hyphens. I want to extract the nth syllable, where n is given in another column. Like this:
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl)
mydf$word <- as.character(mydf$word)
desired: a vector containing
ma
cheese
ta
If this were, say, awk, I would just do
'{split($1,a,"-"); print a[$2]}'
The words don't always have the same number of syllables.
It seems likely that there is a straightforward way to do this, but I'm not seeing it. Thanks
You can use mapply and strsplit to get,
mapply('[', strsplit(mydf$word, '-'), whichSyl)
#[1] "ma" "cheese" "ta"
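To spell out what the mapply call does: it walks the list of split words and the vector of positions in parallel and indexes each word's pieces with its own position. A sketch with an explicit function in place of '[', using a made-up out-of-range position (3 for "ta-co") to show that it yields NA rather than an error:

```r
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 3)   # 3 is deliberately out of range for "ta-co"

# parts is one word's syllables, n is that word's requested position
syl <- mapply(function(parts, n) parts[n],
              strsplit(word, "-", fixed = TRUE), whichSyl)
syl
# "ma" "cheese" NA
```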
Here I wrote a function that does one row at a time, and then uses lapply() to iterate over all rows and do.call(rbind()) to bind all of those responses together.
getSyl <- function(i){
strsplit(mydf$word[i], '-')[[1]][mydf$whichSyl[i]]
}
do.call(rbind, lapply(1:nrow(mydf), getSyl))
[,1]
[1,] "ma"
[2,] "cheese"
[3,] "ta"
We can use read.table and row/column indexing
read.table(text=mydf$word, sep="-", header=FALSE,
fill=TRUE)[cbind(1:nrow(mydf), mydf$whichSyl)]
#[1] "ma" "cheese" "ta"

Passing Along Column Values to Paste

This is a follow-up to Paste/Collapse in R
I assume it's preferable to start a new question than to endlessly edit a previous question with new questions.
What I've got going on is some vectors that I want to simulate playing a game against. The goal is to randomly pick two strategies to play against each other; afterwards, once the results matrix is made, a magical for loop will assign each strategy a score.
###Sample Strategies
whales <- c("C","D","C","D","D")
quails <- c("D","D","D","D","D")
snails <- c("C", "C", "C", "C", "C")
bales <- c("D", "D", "C", "D", "C")
####Combining into a matrix
gameboard<-cbind(whales, quails, bales, snails, deparse.level = 1)
####All of the names of the strategies/columns
colnames(gameboard)
####Randomly pick two random column names
game1<- colnames(gameboard)[sample(1:ncol(gameboard), 2, replace= FALSE)]
results <-paste(game1[1], game1[2], sep='')
Now this does work, except for I am actually accessing the column names, not the data in the columns. So I end up with results like 'whalesbales' when I want the actual concatenation of CD DD CC DD DC.
Maybe 'apply' or 'lapply'...apply here?
The inevitable follow up question is how can I get the last line where it says 'results' to instead say 'results_whalesVbales'?
because I assume
results"game1[1]", sep='V',game1[2]"
is not going to cut it, and there is some ugly way to do this with lots of parentheses and block quotes.
#
FOLLOW UP
Thanks in advance for advice.
Thanks Ferdinand for the response and thorough explanation-
A couple of follow ups:
(1) Is there a way to get the
paste(.Last.value, collapse=" ")
[1] "DC DD CC DD CD"
result to be a new object (vector?) that is named result_balesVwhales based on
paste0("results_", paste(colnames(gameboard)[randompair], collapse="V"))
[1] "results_balesVwhales"
everything I've tried so far makes the vector have a value of results_balesVwhales.
(2) Can I force the new results_balesVwhales to have the "long" (columnar) format that bales and whales each have individually, w/o reshape?
Ferdinand has the first question answered. In regard to your second question, the function you are asking about is assign.
x = 'foo'
assign(x, 2)
foo
# [1] 2
However, there be dragons... Instead, the R way of doing that would be to assign into an element of a named list:
game1 <- sample(colnames(gameboard), 2)
result <- list()
list_name <- paste0("results_", paste(game1, collapse="V"))
result[[list_name]] <- paste(gameboard[, game1[1]],
gameboard[, game1[2]],
sep='',
collapse=' ')
If you want the vector of pasted elements, just remove the collapse as I've done below.
Eventually, once I sorted out how I wanted this all to work, I would wrap it in a function or two. These can be much cleaner, but the verbosity illustrates the idea.
my_game <- function(trial, n_samples, dat) {
# as per my comment, generate the game result and name using the colnames directly
game <- sample(colnames(dat), n_samples)
list_name <- paste0("results_", paste(game, collapse="V"))
game_result <- paste(dat[, game[1]],
dat[, game[2]],
sep='')
# return both the name and the data in the format desired to parse out later
return(list(list_name, game_result))
}
my_game_wrapper <- function(trials, n_samples, dat) {
# for multiple trials we create a list of lists of the results and desired names
results <- lapply(1:trials, my_game, n_samples, dat)
# extract the names and data
result_names <- sapply(results, '[', 1)
result_values <- sapply(results, '[', 2)
# and return them once the list is pretty.
names(result_values) <- result_names
return(result_values)
}
my_game_wrapper(3, 2, gameboard)
# $results_quailsVwhales
# [1] "DC" "DD" "DC" "DD" "DD"
# $results_quailsVbales
# [1] "DD" "DD" "DC" "DD" "DC"
# $results_balesVquails
# [1] "DD" "DD" "CD" "DD" "CD"
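The named-list approach can be checked on a fixed pair, which makes the result reproducible. This is a small sketch that swaps sample() for a hard-coded pair (whales against bales, the example from the question) and uses a two-column gameboard for brevity:

```r
whales <- c("C", "D", "C", "D", "D")
bales  <- c("D", "D", "C", "D", "C")
gameboard <- cbind(whales, bales)

# Fixed pair instead of sample(), so the result is reproducible
game1 <- c("whales", "bales")

result <- list()
list_name <- paste0("results_", paste(game1, collapse = "V"))

# Pair up the column values (not the column names) element by element
result[[list_name]] <- paste0(gameboard[, game1[1]], gameboard[, game1[2]])

names(result)
# "results_whalesVbales"
collapsed <- paste(result[["results_whalesVbales"]], collapse = " ")
collapsed
# "CD DD CC DD DC"
```

Because the results live in one list, they can later be processed together with lapply() or sapply(), which is harder when each result is a separately assign()-ed variable.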
