I asked a question and I received a great answer which solved my problem. However, I want to modify the code (here is my previous question).
finding similar strings in each row of two different data frame
I try to explain again the problem and how I tried to deal with it
The answer by Karsten W. gave me a normalised data (assign each string in each element a number of its position) as follow (I did not change it)
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
The second part was to apply the above function on each column separately, so if i need to do that on 1000 columns it is very time consuming. instead I do the following in comment , I tried to use lappy
# s1 <- normalize(df1[,1], ";")
# s2 <- normalize(df1[,2], ";")
I do like this
myS <- lapply(df1, normalize,";")
I keep the other part as it is
lookup <- normalize(df2[,1], ",")
Then to check between the two, I modified the function to only keep the row numbers of df2 (I removed (s[found] from it)
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(pos, sep=""))
}
then whatever I do, I cannot get the output
process(myS$sample1) ...
At the end I need to have the data in a txt file or something which I can read. I used write.table but this does not work.
Is there any better way to do this? How to do it automatically?
It is a typo. process(myS$sample_1) instead of ...(myS$sample1)
I get:
> process(myS$sample_1)
[1] "4" "1" "4"
and
> lapply(myS, process)
$sample_1
[1] "4" "1" "4"
$sample_2
[1] "4" "15" "16"
IMHO for the function process() it would be better to return an integer vector:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
names(pos) <- NULL
pos
}
For putting the result in a dataframe:
r <- lapply(myS, process)
m <- max(sapply(r, length))
r.matrix <- matrix(NA, m, length(r))
for (j in 1:length(r)) {
x <- r[[j]]
length(x) <- m
r.matrix[,j] <- x
}
colnames(r.matrix) <- names(r)
r.df <- as.data.frame(r.matrix)
Related
I have the following set of string:
core_string <- "AFFVQTCRE"
mask_string <- "*KKKKKKKK"
What I want to do is to mask core_string with mask_string.
Whenever the * coincide with character in core_string, we will keep that character,
otherwise replace it.
So the desired result is:
AKKKKKKKK
Other example
core_string <- "AFFVQTCRE"
mask_string <- "*KKKK*KKK"
# result AKKKKTKKK
The length of both strings is always the same.
How can I do that with R?
Here's a helper function that will do just that
apply_mask <- function(x, mask) {
unlist(Map(function(z, m) {
m[m=="*"] <- z[m=="*"]
paste(m, collapse="")
}, strsplit(x, ""), strsplit(mask, "")))
}
basically you just split up the string into characters and replace the characters that have a "*" then paste the strings back together.
I used the Map to make sure the function is still vectorized over the inputs. For example
core_string <- c("AFFVQTCRE", "ABCDEFGHI")
mask_string <- "*KKKK*KKK"
apply_mask(core_string, mask_string)
# [1] "AKKKKTKKK" "AKKKKFKKK"
regmatches in replacement form <- can be handy here:
regmatches(core_string, gregexpr("K", mask_string)) <- "K"
core_string
#[1] "AKKKKKKKK"
If it's a 1:1 match of characters rather than a constant, then it has to be changed up a little:
ss <- strsplit(mask_string, "")[[1]]
regmatches(core_string, gregexpr("[^*]", mask_string)) <- ss[ss != "*"]
I am trying to combine a few words so that they count as one.
In this example I want val and valuatin to be counted as valuation.
The code I have been using to try and do this is below:
#load in package
library(tm)
replaceWords <- function(x, from, keep){
regex_pat <- paste(from, collapse = "|")
gsub(regex_pat, keep, x)
}
oldwords <- c("val", "valuati")
newword <- c("valuation")
TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)
However this does not work as expected. Any time there is val in a word it is now being replaced with valuation. For example equivalent becomes equivaluation. How do I get around this error and achieved my desired result?
Try this function -
replaceWords <- function(x, from, keep){
regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
gsub(regex_pat, keep, x)
}
val matches with equivalent. Adding word boundaries stop that from happening.
grepl('val', 'equivalent')
#[1] TRUE
grepl('\\bval\\b', 'equivalent')
#[1] FALSE
I have an issue about replacing strings with the new ones conditionally.
I put short version of my real problem so far its working however I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basicly I want to replace strings with replace_strings. First item in the strings replaced with the first item in the replace_strings.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I write some simple function assign_new
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Ok it seems we have solution. But lets say if I have A1000 to A1 and want to replace them from A1 to A1000 I need to do 1000 of rows of ifelse statement. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches.First obtain all the characters that are followed by _ using regexpr then replace as shown below
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastests but very tractable and easy to maintain:
for (i in 1:length(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per #Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_"))
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
Using the dplyr package:
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
I have a few large character vectors of varying lengths that I need to break into smaller lengths for processing within spacyr. I'm currently using substr() within lapply() to split into a list where each list item is 500K characters long.
However, I would like to instead split on the next space after about 500K characters so as to avoid chopping a word in half. Not sure how to amend what I've come up with thus far. My current code sort of like so:
#Pretend 'text' is my list of words
chars = c(letters, " ", ".")
text<-paste0(sample(chars, 3000000, replace=TRUE), collapse="")
#split to list of smaller vectors
text_segments<-laply(seq(1,nchar(text),500000), function(i) substr(text, i, i+499999))
#do something with each
for(i in unique(text_segments)){
parsedtxt <- spacy_parse(i)
...
}
Each fake word in the above example is 3 letters long, but in my real files the words vary in length.
Any suggestions about approaching the space problem would be greatly appreciated. Code speed is not a concern, but I do appreciate efficiency suggestions nonetheless.
Maybe you can get inspiration from the code below and adapt it to your dataset.
First I will make up some data.
set.seed(1) # Make the results reproducible
y1 <- paste(sample(c(letters, " "), 1e3, TRUE), collapse = "")
y2 <- paste(sample(c(letters, " "), 1e3, TRUE), collapse = "")
str_list <- list(y1, y2)
Now, the function fun does the job. It uses gregexpr to get the locations of the spaces and then returns everything from the beginnig of the input string to the first space found.
fun <- function(x, threshold){
blanks <- gregexpr(" +", x)[[1]]
substr(x, 1, blanks[which(blanks > threshold)[1]] - 1)
}
thresh <- 100
sub <- lapply(str_list, fun, thresh)
lapply(sub, nchar)
#[[1]]
#[1] 103
#
#[[2]]
#[1] 154
I've gotten hold of some really messy data and I wrote a function to do some conversions (string to numeric), and I would love to improve it. Basically the function takes a vector of messy character data and converts the data to numeric.
for example:
## say you had this
df1 <- data.frame ( V1 = c(" $25.25", "4,828", " $7,253"), V2 = c( "THIS is bad data", "725", "*error"))
numconv <- function(vec){
vec <- str_trim(vec)
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print("!!ERROR STRANGE CHARACTERS!!")
}
}
df1$V1recode <- numconv(df1$V1)
df1$V2recode <- numconv(df1$V2)
[1] "!!ERROR STRANGE CHARACTERS!!"
How do can I assign the name of the original column name within the function so I can paste it to the error message within the function, so it instead reads:
!!ERROR STRANGE CHARACTER IN V2!!
I've tried calling names() and colnames() within the function, but this doesn't seem to work.
Thanks in advance,
C
The old deparse(substitute(.)) trick seems to work.
numconv <- function(vec){nam <- deparse(substitute(vec))
vec <- gsub(" ","", vec)
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print(paste("!!ERROR STRANGE CHARACTERS!!", nam) )
}
}
df1$V2recode <- numconv(df1$V2)
# [1] "!!ERROR STRANGE CHARACTERS!! df1$V2"
(I didn't load stringr since I thought a gsub call would be more efficient.)
I feel this is a somewhat hacky way to do this, but you could use substitue and then strsplit on the $, but this assumes you always call a column using its name with $. Anyway, you can get the column name using this and paste it into an error message as you wish...
x <- strsplit(as.character( substitute(vec) ) ,"$" )[[3]]
The key is to wrap the recoding up into the function as well. That way you can keep track of which columns you're working on and so get the column names to put in your warning message. The following function recodes whatever columns of a data frame are listed in the 'col_names' argument (if left null the function applies to all of them). The function returns the original data frame, plus the recoded columns with the string in flag added to the column names.
require(stringr)
df1 <- data.frame (
V1 = c(" $25.25", "4,828", " $7,253"),
V2 = c( "THIS is bad data", "725", "*error"))
numconv <- function(df, col_names = NULL, flag = "recode"){
if(is.null(col_names)) {
col_names <- colnames(df)
}
out <- lapply(1:length(col_names), function(i) {
vec <- str_trim(df[,col_names[i]])
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print(paste("!!ERROR STRANGE CHARACTERS in", col_names[i], "!!"))
}
vec
})
out <- data.frame(out, stringsAsFactors = FALSE)
colnames(out) <- paste(col_names, flag, sep = "")
cbind(df, out)
}
numconv(df1)
[1] "!!ERROR STRANGE CHARACTERS in V2 !!"
V1 V2 V1recode V2recode
1 $25.25 THIS is bad data 25.25 THIS is bad data
2 4,828 725 4828.00 725
3 $7,253 *error 7253.00 *error