Paste value after certain delimiter - r

I have data in the following format:
In Column A:
String1__String2__String3
In Column B:
Value
I would like to paste the Value into the String after the first delimiter like this:
String1__Value__String2__String3
The crucial part of the code I am using now (where I paste the value) is the following line:
df2 <-cbind(df[1],apply(df[,2:ncol(df)],2,function(i)ifelse(is.na(i), NA, paste(df[,1],i,sep="_"))))
With this code, the value is appended after the string, like this:
String1__String2__String3__Value
Is there an easy way to rearrange this so that the values are pasted in the correct place, or do I have to redo the complete code?
Thanks
Update, Example:
Column A:
Jennifer__DoesSomething__inaCity
Column B:
2
Result now:
Jennifer__DoesSomething__inaCity__2
Desired result:
Jennifer__2__DoesSomething__inaCity
The strings Jennifer, DoesSomething, inaCity change and are not the same length. Only the delimiter stays the same. I want to paste after the first delimiter.
Thanks !

Here is an idea. sub() replaces only the first match of a pattern, so mapply can walk over both columns in parallel and insert each value from the second column after the first delimiter of the corresponding string in the first column.
mapply(function(x, y) sub('__', paste0('__', y, '__'), x), df$v1, df$v2)
# atsfs__dsfgg__sdgsdg eeee__FFFF__GGGG
#"atsfs__3__dsfgg__sdgsdg" "eeee__5__FFFF__GGGG"
DATA
dput(df)
structure(list(v1 = c("atsfs__dsfgg__sdgsdg", "eeee__FFFF__GGGG"
), v2 = c(3, 5)), .Names = c("v1", "v2"), row.names = c(NA, -2L
), class = "data.frame")
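As a minimal sketch of the same idea applied to the example from the question, the helper below wraps the sub() call in a function. The name insert_after_first and the NA handling are assumptions added for illustration; they are not part of the original answer.

```r
# Example data: the first row is the "Jennifer" case from the question.
df <- data.frame(
  v1 = c("Jennifer__DoesSomething__inaCity", "atsfs__dsfgg__sdgsdg"),
  v2 = c(2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: insert `value` after the first occurrence of `delim`.
insert_after_first <- function(x, value, delim = "__") {
  unname(mapply(function(s, v) {
    # Keep missing values missing instead of pasting the text "NA".
    if (is.na(v)) return(NA_character_)
    # sub() replaces only the first match; fixed = TRUE treats delim literally.
    sub(delim, paste0(delim, v, delim), s, fixed = TRUE)
  }, x, value))
}

df$v3 <- insert_after_first(df$v1, df$v2)
print(df$v3)
```

Wrapping the result in unname() drops the names that mapply would otherwise copy from the input strings, which makes the result easier to store as a new column.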

Related

start and end positions of a character; R

I am trying to get
start and end positions of "-" character in column V1
and its corresponding characters at these positions in column V2
Then length of it
Any help will be appreciated!
ip <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz")), class = "data.frame", row.names = c(NA,
-3L))
I tried stringi_locate, but it reports each position individually. For example, for "ab---cdef" it outputs 3-3, 4-4, 5-5 instead of 3-5.
Expected output:
op <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"), output = c("x:x-3:5-3",
"x:y-5:7-3", "x:x-2:3-2; y-z:6:7-2")), class = "data.frame", row.names = c(NA,
-3L))
the output column must have
The characters in V2 column with respect to start and end of "-" in V1
Then start and end position
Then its length
V1 V2 output
ab---cdef xxxxxxxyy x:x-3:5-3
Thanks!
Here's an example using gregexpr to get all the matches in a string.
x <- gregexpr("-+", ip$V1)
mapply(function(m, s, r) {
  start <- m
  len <- attr(m, "match.length")
  end <- start + len - 1
  part <- mapply(substr, r, start, end)
  paste0(part, "-", start, ":", end, "-", len, collapse=";")
}, x, ip$V1, ip$V2)
# [1] "xxx-3:5-3"
# [2] "xyy-5:7-3"
# [3] "xx-2:3-2;yz-6:7-2"
I'm not sure what your logic was for turning xxx into x:x or xyy into x:y, or how that generalizes to other sequences, so feel free to change that part. You can get the start and length of each match from the attributes of the returned match object. The important part is to use -+ as the pattern so that you match a run of dashes rather than a single dash.
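To make the point about the match object concrete, here is a small sketch that inspects a gregexpr result for one of the strings from the question and compares the -+ pattern with a single -. The variable names are illustrative only.

```r
# Match runs of dashes in a single string.
m <- gregexpr("-+", "a--bc--def")[[1]]

starts <- as.integer(m)           # start position of each run
lens <- attr(m, "match.length")   # length of each run
ends <- starts + lens - 1L        # end position of each run

print(data.frame(start = starts, end = ends, length = lens))

# With a single "-" as the pattern, every dash is reported on its own,
# which is the behaviour described in the question.
single <- as.integer(gregexpr("-", "ab---cdef")[[1]])
print(single)
```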

Search and replace file names according to strings in dataframe. In R

I have several files in a folder that look like "blabla_A1_bla.txt", "blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd"...and all the way to H12.
Then I have a df that indicates which sample is each one.
well
sample
A1
F32-1
B1
F13-3
C1
B11-4
...
...
I want to rename the files in the folder according to the table, so that A1 gets replaced by F32-1, B1 by F13-3 and so on.
I have created a list of all the files in the directory with files<-list.files(directory). I know how to use the str_replace function of the stringr package to change them one by one, but I don't know how to make it automatic. I guess I need a loop that reads cell 1,1 of the dataframe, searches that string in "files" and replaces it with the value in cell 1,2. And then moves to cell 2,1 and so on. But I don't know how to code this. (Or if there is a better way to do it).
I'll appreciate your help with this.
You can create a named vector of patterns and replacements and use it in str_replace_all:
files <- list.files(directory)
files <- stringr::str_replace_all(files, setNames(df$sample, df$well))
Using a reproducible example -
df <- structure(list(well = c("A1", "B1", "C1"), sample = c("F32-1",
"F13-3", "B11-4")), class = "data.frame", row.names = c(NA, -3L))
files <- c("blabla_A1_bla.txt", "blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd")
stringr::str_replace_all(files, setNames(df$sample, df$well))
#[1] "blabla_F32-1_bla.txt" "blabla_F32-1_bla.phd" "blabla_F13-3_bla.txt"
#[4] "blablabla_F13-3_bla.phd"
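Two things may be worth checking before relying on this. Replacing the strings only changes the character vector, so the files on disk still need file.rename. Also, since the wells run up to H12, a bare pattern such as A1 also matches inside A12. The following is a base R sketch that assumes the well ID is always surrounded by underscores, as in the example file names; the A12 row and its sample name B07-2 are invented for the demonstration, and the renaming is done in a temporary directory.

```r
# Hypothetical lookup table; the A12 / "B07-2" row is made up for this demo.
df <- data.frame(well = c("A1", "A12"), sample = c("F32-1", "B07-2"),
                 stringsAsFactors = FALSE)
files <- c("blabla_A1_bla.txt", "blabla_A12_bla.txt")

# Build the new names, including the underscores so "A1" cannot match "A12".
new_names <- files
for (i in seq_len(nrow(df))) {
  new_names <- sub(paste0("_", df$well[i], "_"),
                   paste0("_", df$sample[i], "_"),
                   new_names, fixed = TRUE)
}
print(new_names)

# Rename real files inside a temporary directory.
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, files))
ok <- file.rename(file.path(demo_dir, files), file.path(demo_dir, new_names))
print(list.files(demo_dir))
```

For a real folder, the file.path calls would use the directory that was passed to list.files.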
I would first create a vector of new names and then use the function file.rename:
files = c("blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd")
patterns = c('A1', 'B1')
replace = c('F22', 'G22')
new.name = c()
for (f in files){
  # first identify which pattern corresponds to file f (is it A1, B1, ...)
  which.pattern = which(sapply(patterns, grepl, x = f))
  # and then replace it by the correct string
  new.name = c(new.name, gsub(patterns[which.pattern], replace[which.pattern], f))
}
file.rename(files, new.name)
Replacing patterns and replace with df$well and df$sample should work for your case.

Write csv file with non-numeric columns quoted and no row names

I'm trying to write a csv file from a data frame, i.e:
Col_A Col_B Col_C
Hello World 4
Once More 21
Hi Data 23
So far I use this code:
ds = dataf
write.csv(ds,"test.csv", row.names = FALSE, quote = c(1,2), sep = ",")
However, the result is:
Col_A,"Col_B","Col_C"
Hello,"World",4
Once,"More",21
Hi,"Data",23
But I really need to have something like this:
"Col_A","Col_B","Col_C"
"Hello","World",4
"Once","More",21
"Hi","Data",23
Note that everything is between double quotes except the numeric values, and the fields are separated by commas. I can do that if I also write the rownames, but I really don't want them.
There is no point in setting sep to "," because it's the default for write.csv.
Anyway, are you sure of your data.frame design ?
This seems to work :
df <- rbind(c("Hello", "Once", "Hi"), c("World", "More", "Data"), c(4,21,23))
df <- as.data.frame(t(df))
write.csv(df,"test.csv", row.names = FALSE, quote = c(1,2))
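As a possible alternative, if Col_C is stored as a numeric column, the default quote = TRUE of write.csv quotes the column names and the character columns and leaves the numbers bare. The sketch below rebuilds the data frame from the question under that assumption and writes to a temporary file.

```r
# Data from the question, assuming Col_C is numeric rather than character.
ds <- data.frame(
  Col_A = c("Hello", "Once", "Hi"),
  Col_B = c("World", "More", "Data"),
  Col_C = c(4, 21, 23),
  stringsAsFactors = FALSE
)

out <- tempfile(fileext = ".csv")
# Default quote = TRUE: quotes headers and character/factor columns only.
write.csv(ds, out, row.names = FALSE)

csv_lines <- readLines(out)
cat(csv_lines, sep = "\n")
# "Col_A","Col_B","Col_C"
# "Hello","World",4
# "Once","More",21
# "Hi","Data",23
```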

odd behavior when substituting parts of a string within a for loop

I'm trying to replace a series of numbers in a character string with information that comes from a dataframe.
My string comes from a text file that I imported using the readr package as follows: read_file("Human.txt")
I've checked the class, it is character. The string contains the following information (I've named it treeString):
"(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
My dataframe (labels.csv) was originally in factor format, but I changed the format of the second column to character using the following command: labels[,2] = as.character(labels[,2]). It looks like this
v1 v2
1 1 name1
2 2 name2
3 3 name3
My goal is to substitute every number in the string with the corresponding name (i.e. V2) in the dataframe. This should result in the following:
"(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Here is the code I am using to accomplish this:
for(i in 1:nrow(labels)){
gsub(as.character(i), labels[i,2], treeString)
}
The weird thing is that if I run the gsub() command on its own (with specified numbers - eg. 2) it does the substitution, however, when I run it in a loop it does not substitute the numbers.
As pointed out by Kumar Manglam in the comments, you forgot to assign the result of gsub() back to treeString.
There is something else you should be aware of: with the regular expression as specified in your question, the loop will also replace digits inside longer numbers, turning "(241)" into "(name24name1)". To avoid this, check that each number you replace is preceded by a comma or opening parenthesis and followed by a comma or closing parenthesis:
# Option1
for(i in 1:nrow(labelnames)){
  reg_pattern <- paste0("(?<=[(,])(", i, ")(?=[),])")
  treeString <- gsub(reg_pattern, labelnames$v2[i], treeString, perl=TRUE)
}
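Another, nicer, option is to drop the for-loop and do it all at once. Note that the character class below only covers single-digit labels, and the replacement relies on every name being "name" followed by its number: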
# Option2
reg_pattern <- paste0("(?<=[(,])([1-", nrow(labelnames), "])(?=[),])")
treeString <- gsub(reg_pattern, "name\\1", treeString, perl=T)
# Result
treeString
# "(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Data
treeString <- "(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
labelnames <- structure(list(v1 = 1:3, v2 = c("name1", "name2", "name3")), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA, -3L))
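For label tables with more than nine rows, or names that are not simply "name" plus the number, one possible approach is to extract every whole number, look it up in the table, and write the replacements back with regmatches<-. This is a sketch rather than part of the original answer; it uses a shortened tree string and leaves numbers without a label unchanged.

```r
# Shortened example string and the label table from the answer.
treeString <- "(1,2,((((3),884),(519,(516,517))))"
labelnames <- data.frame(v1 = 1:3, v2 = c("name1", "name2", "name3"),
                         stringsAsFactors = FALSE)

# Locate every run of digits, so "884" is treated as one number.
m <- gregexpr("[0-9]+", treeString)
nums <- regmatches(treeString, m)[[1]]

# Look each number up in v1; numbers without a label keep their text.
idx <- match(as.integer(nums), labelnames$v1)
repl <- ifelse(is.na(idx), nums, labelnames$v2[idx])

# Write the replacements back into a copy of the string.
result <- treeString
regmatches(result, m) <- list(repl)
print(result)
# "(name1,name2,((((name3),884),(519,(516,517))))"
```

Because whole numbers are matched, no lookbehind or lookahead is needed to protect longer numbers such as 241.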

How can I use purrr to limit rows where column element is not a list

I have a data.frame,df, where one of the columns has entries which are either a character or list
I would like to use the purrr package, or other means, to eliminate the second row
df <- structure(list(member_id = c("1715", "2186", "2187"), date_of_birth = list(
"1953-12-15T00:00:00", structure(list(`#xsi:nil` = "true",
`#xmlns:xsi` = "http://www.w3.org/2001/XMLSchema-instance"), .Names = c("#xsi:nil",
"#xmlns:xsi")), "1941-02-16T00:00:00")), .Names = c("member_id",
"date_of_birth"), row.names = c(1L, 8L, 9L), class = "data.frame")
TIA
If you are looking to drop any row whose date_of_birth field is of type list, the following should be a decent solution:
df[sapply(df$date_of_birth, function(x) typeof(x)!="list"),]
Edit:
Imo's comment should shorten the above solution as follows:
df[!sapply(df$date_of_birth, is.list),]
I hope this helps.
Here is a base R method using lengths and subsetting. Any element in the date_of_birth column that has more than one element is dropped:
dfNew <- df[lengths(df$date_of_birth) < 2,]
which returns
dfNew
member_id date_of_birth
1 1715 1953-12-15T00:00:00
9 2187 1941-02-16T00:00:00
Note that dfNew$date_of_birth will still be of type list, which might cause problems down the line. You can fix this with unlist.
dfNew$date_of_birth <- unlist(dfNew$date_of_birth)
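Putting the two answers together, a type-safe variant could look like the sketch below. It uses vapply from base R so that it runs without extra packages; purrr::map_lgl(df$date_of_birth, is.list) should give the same logical vector if purrr is preferred. The data frame is a simplified stand-in for the one in the question.

```r
# Simplified stand-in for the question's data: the second entry is a list.
df <- data.frame(member_id = c("1715", "2186", "2187"),
                 stringsAsFactors = FALSE)
df$date_of_birth <- list("1953-12-15T00:00:00",
                         list(`#xsi:nil` = "true"),
                         "1941-02-16T00:00:00")

# TRUE for rows whose date_of_birth entry is not itself a list.
keep <- !vapply(df$date_of_birth, is.list, logical(1))

dfNew <- df[keep, ]
# Flatten the remaining list column into a plain character vector.
dfNew$date_of_birth <- unlist(dfNew$date_of_birth)
str(dfNew)
```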
