simple character splitting is baffling me [duplicate] - r

The split shown below is driving me crazy... I need some help to spot where the problem is.
> p5<-Data$poorcoverageusers[5]
> p5
[1] "405874050693761|405874004853834|405874056470063|405874055308702"
> strsplit(p5,"|")
[[1]]
[1] "4" "0" "5" "8" "7" "4" "0" "5" "0" "6" "9" "3" "7" "6" "1" "|" "4" "0" "5" "8" "7" "4" "0" "0" "4" "8" "5" "3" "8" "3" "4" "|" "4" "0" "5"
[36] "8" "7" "4" "0" "5" "6" "4" "7" "0" "0" "6" "3" "|" "4" "0" "5" "8" "7" "4" "0" "5" "5" "3" "0" "8" "7" "0" "2"
> typeof(Data$poorcoverageusers[5])
[1] "character"
I wanted it to be split by "|", so the output should have been 405874050693761 405874004853834 405874056470063 405874055308702.
What mistake am I making?
Thanks for the help.

library(stringr)
s <- "405874050693761|405874004853834|405874056470063|405874055308702"
str_split(s, fixed("|")) # returns a list of character vectors
# [[1]]
# [1] "405874050693761" "405874004853834" "405874056470063" "405874055308702"
str_split(s, fixed("|"), simplify = TRUE) # returns a character matrix
# [,1] [,2] [,3] [,4]
# [1,] "405874050693761" "405874004853834" "405874056470063" "405874055308702"
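
For context on why the original call misbehaves: strsplit treats its split argument as a regular expression by default, and an unescaped | is the alternation operator, so "|" matches the empty string and the input is cut into single characters. A minimal base R sketch of three ways around that, assuming the same input string as above (the variable names are only illustrative):

```r
s <- "405874050693761|405874004853834|405874056470063|405874055308702"

# Option 1: treat the delimiter as a literal string, not a regex
parts_fixed <- strsplit(s, "|", fixed = TRUE)[[1]]

# Option 2: keep regex matching but escape the pipe
parts_escaped <- strsplit(s, "\\|")[[1]]

# Option 3: put the pipe in a character class, where it is literal
parts_class <- strsplit(s, "[|]")[[1]]

print(parts_fixed)
```

All three should give the same four IDs; fixed = TRUE is probably the easiest to read when the delimiter is a plain string.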


How can I create a new txt file by matching two different txt file and finding the same values in R?

I have 2 text files: File A and File B.
I want to match the first column of File A against the first row (the header) of File B.
Where a value from the first column of File A appears in the first row of File B, I want to keep that column of File B with all of its values, along with File B's first column of row labels.
File A:
"...1" "AZD5153" "I-BET-762" "I-BRD9" "JQ1" "OTX-015" "PFI-1" "RVX-208"
"1" "697" 0.155445 1.328728 7.6345 7.553337 0.496983 1.776878 24.540592
"2" "5637" 11.767517 66.561037 314.672133 3.891947 17.54448 10.27559 261.520227
"3" "22RV1" 2.144765 9.04165 193.4228 4.448654 19.315063 9.55938 72.036416
"4" "23132-87" 1.882177 41.26784 33.482054 10.959235 9.025218 19.621473 75.332425
"5" "42-MG-BA" 2.252297 26.56874 54.934795 7.92924 10.276993 7.937254 64.873664
"6" "639-V" 6.412568 16.979172 30.882936 12.444024 21.915518 6.449247 96.50391
File B:
"...1" "1321N1" "143B" "22RV1" "23132-87" "42-MG-BA"
"1" "100009676_at" 61161 62052 61249 66154 54236
"2" "10000_at" 81556 66152 45676 43519 66723
"3" "10001_at" 97864 99699 8872 91376 10029
"4" "10002_at" 37977 40304 38455 37085 36431
"5" "10003_at" 35458 38504 40458 39508 41589
"6" "100048912_at" 40034 37959 41465 39271 39157
"7" "100049716_at" 42744 46775 52087 47239 42522
Expected File:
"...1" "22RV1" "23132-87" "42-MG-BA"
"1" "100009676_at" 61249 66154 54236
"2" "10000_at" 45676 43519 66723
"3" "10001_at" 8872 91376 10029
"4" "10002_at" 38455 37085 36431
"5" "10003_at" 40458 39508 41589
"6" "100048912_at" 41465 39271 39157
"7" "100049716_at" 52087 47239 42522
First of all, ensure you have the correct paths to FILEA.txt and FILEB.txt, as well as the desired path to FILEC.txt. In my case, I did:
path_to_file_A <- path.expand("~/FILEA.txt")
path_to_file_B <- path.expand("~/FILEB.txt")
path_to_file_C <- path.expand("~/FILEC.txt")
Now the following code should work:
A <- read.table(path_to_file_A, header = TRUE, check.names = FALSE)
B <- read.table(path_to_file_B, header = TRUE, check.names = FALSE)
result <- cbind(B[1], B[na.omit(match(A[[1]], names(B)))])
write.table(result, path_to_file_C)
Which results in:
FILEC.txt
"...1" "22RV1" "23132-87" "42-MG-BA"
"1" "100009676_at" 61249 66154 54236
"2" "10000_at" 45676 43519 66723
"3" "10001_at" 8872 91376 10029
"4" "10002_at" 38455 37085 36431
"5" "10003_at" 40458 39508 41589
"6" "100048912_at" 41465 39271 39157
"7" "100049716_at" 52087 47239 42522

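A variant of the same idea, sketched with intersect() instead of match(). This is only an illustration: the two inputs are cut-down inline copies of File A and File B read via the text argument, and a_text, b_text and shared are made-up names. Note that intersect() keeps the columns in File B's order, whereas the match() version above follows File A's order.

```r
# Cut-down inline stand-ins for File A and File B (assumed sample data)
a_text <- '"...1" "AZD5153" "JQ1"
"1" "697" 0.155445 7.553337
"2" "22RV1" 2.144765 4.448654
"3" "23132-87" 1.882177 10.959235'

b_text <- '"...1" "1321N1" "22RV1" "23132-87"
"1" "100009676_at" 61161 61249 66154
"2" "10000_at" 81556 45676 43519'

# The header has one field fewer than the data rows, so read.table
# uses the first field of each row as the row names
A <- read.table(text = a_text, header = TRUE, check.names = FALSE,
                stringsAsFactors = FALSE)
B <- read.table(text = b_text, header = TRUE, check.names = FALSE,
                stringsAsFactors = FALSE)

# Column names of B that also occur in the first column of A
shared <- intersect(names(B), A[[1]])

# Keep B's label column plus the shared columns
result <- B[c(names(B)[1], shared)]
print(result)
```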
Separating A String Into Characters

I have some ordered test results encoded in a character string. The string can be of arbitrary length. Each digit in the string represents a test result. In the following, for example, there are four test results represented:
2069
I want to tidy these up in R by splitting the string into individual observations. No problem with strsplit or stringr::str_split, which returns four values that will become my observations.
strsplit("2069" %>% as.character(), split = "") %>% unlist()
[1] "2" "0" "6" "9"
Now, however, I have realized that some results are values greater than 9. These two-digit values have been encoded with parentheses to make clear they are not individual results.
For example, in the following case I still have four values, but some have been enclosed in parentheses to group the values larger than 9.
2(10)1(12)
I'm struggling with a way to break these up so that I get
[1] "2" "10" "1" "12"
Appreciate any guidance. Thanks.
Updated: a pattern match based on the OP's new pattern shown in the comments. Here, we use str_extract_all to extract either one or more digits that follow an opening parenthesis (a regex lookbehind), or (|) any single character that is not a parenthesis ([^()]).
library(stringr)
str_extract_all(str1, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
Testing on the OP's extra pattern:
str_extract_all(str2, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
[[5]]
[1] "10" "0" "2" "0" "1"
Earlier solutions (based on the assumption that all numbers greater than 9 are wrapped in parentheses)
We may split on the parentheses in base R:
unlist(strsplit(str1[1], "\\(|\\)"))
[1] "2" "10" "1" "12"
If both cases occur, one option is to get the index of the elements that contain parentheses and handle the two groups separately:
i1 <- grepl("\\(|\\)", str1)
lst1 <- vector('list', length(str1))
lst1[i1] <- strsplit(str1[i1], "\\(|\\)")
lst1[!i1] <- strsplit(str1[!i1], "")
unlist(lst1)
[1] "2" "10" "1" "12" "2" "0" "6" "9" "2" "15" "2" "1" "3" "1"
Another option is to use ifelse with grepl to create a single delimiter and then call strsplit:
lst1 <- strsplit(trimws(ifelse(grepl("\\(|\\)", str1),
  gsub("\\(|\\)", ",", str1),
  gsub("(?<=.)(?=.)", ",", str1, perl = TRUE)), whitespace = ","), ",")
lst1
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
data
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str2 <- c(str1, "(10)0201")
Maybe we can do it like below (borrowing str1 from @akrun):
> mapply(strsplit, str1, ifelse(grepl("[()]", str1), "\\(|\\)", ""))
$`2(10)1(12)`
[1] "2" "10" "1" "12"
$`2069`
[1] "2" "0" "6" "9"
$`2(15)`
[1] "2" "15"
$`2131`
[1] "2" "1" "3" "1"
Use
(?<=\()\d+(?=\))|\d
See regex proof.
EXPLANATION
(?<=    look behind to see if there is:
  \(      '('
)       end of look-behind
\d+     digits (0-9) (1 or more times, matching as many as possible)
(?=     look ahead to see if there is:
  \)      ')'
)       end of look-ahead
|       OR
\d      a single digit (0-9)
R code:
library(stringr)
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str_extract_all(str1, "(?<=\\()\\d+(?=\\))|\\d")
Results:
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
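
If stringr is not available, the same lookaround pattern should also work in base R with regmatches and gregexpr, provided perl = TRUE is set (lookbehind is not supported by the default regex engine). A minimal sketch, using the str2 vector from the earlier answer; converting to integer at the end is an extra step that the question did not ask for:

```r
str2 <- c("2(10)1(12)", "2069", "2(15)", "2131", "(10)0201")

# Digits wrapped in parentheses, or else a single digit
pattern <- "(?<=\\()\\d+(?=\\))|\\d"

# gregexpr finds every match; regmatches pulls out the matched text
matches <- regmatches(str2, gregexpr(pattern, str2, perl = TRUE))

# Optional: turn each character vector into integers
results <- lapply(matches, as.integer)
print(results[[1]])
```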

How to create a dataframe with various length of rows in R?

I have a list of lists, paths, shown below.
The code is:
for (each in paths)
{
print (each)
}
The output is:
[1] "1" "2"
[1] "1" "2" "3"
[1] "1" "2" "3" "5"
[1] "1" "2" "4"
[1] "1" "2" "4" "5"
[1] "1" "3"
[1] "1" "3" "5"
[1] "1" "4"
[1] "1" "4" "5"
[1] "1" "5"
[1] "2" "3"
[1] "2" "3" "5"
[1] "2" "4"
[1] "2" "4" "5"
[1] "3" "5"
[1] "4" "5"
How can I append all of these as rows of a data frame? as.data.frame fails because the rows have unequal lengths.
A data frame is rectangular by definition, with the same number of columns in each row. You could set the length of each of your rows to be the same (they will be filled in with NA), and then rbind them together:
maxlength = max(lengths(paths))
paths2 = lapply(paths, function(x) {length(x) = maxlength; return(x)})
paths_df = do.call(rbind, args = paths2)
That will give a matrix, but you can easily convert to data frame from there.
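
A rough end-to-end sketch of the padding approach, including the conversion to a data frame. The three-element paths list is a shortened stand-in for the real data, and the step1, step2, ... column names are made up for illustration:

```r
# Shortened stand-in for the paths list from the question
paths <- list(c("1", "2"), c("1", "2", "3"), c("1", "2", "3", "5"))

maxlength <- max(lengths(paths))

# Extending a vector's length pads it with NA
padded <- lapply(paths, function(x) {
  length(x) <- maxlength
  x
})

# Stack the padded vectors into a character matrix, one path per row
paths_matrix <- do.call(rbind, padded)

# Convert to a data frame and give the columns readable (hypothetical) names
paths_df <- as.data.frame(paths_matrix, stringsAsFactors = FALSE)
names(paths_df) <- paste0("step", seq_len(maxlength))
print(paths_df)
```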
A data.frame needs to be rectangular, and all elements of a given column need to be the same type of object. You can, however, have a data.frame column of type list, whose elements can vary in size.
paths=list(1,c(1,2))
df=data.frame("pathNumber"= 1:length(paths))
df$path=paths
The result looks like this
pathNumber path
1 1 1
2 2 1, 2
One option is to have the list as a column of a data frame. This may be desirable if you want to have some other columns.
df <- data.frame(paths = I(paths))
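
Once the paths sit in a list column, each row still holds a full vector, so per-row summaries can be computed directly. A small sketch, assuming a shortened paths list; the id and n_steps columns are illustrative additions, not part of the question:

```r
# Shortened stand-in for the paths list from the question
paths <- list(c("1", "2"), c("1", "2", "3"), c("1", "4", "5"))

# Build the frame first, then attach the list as a column
df <- data.frame(id = seq_along(paths))
df$path <- paths

# lengths() works element-wise on the list column
df$n_steps <- lengths(df$path)

# Individual paths are retrieved with [[ ]]
third <- df$path[[3]]
print(df)
```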

list of dates without commas

I have this txt file
"","x"
"1","2005-01-31"
"2","2005-03-31"
"3","2005-03-31"
"4","2005-05-31"
"5","2005-05-31"
"6","2005-07-31"
"7","2005-07-31"
"8","2005-08-31"
"9","2005-10-31"
"10","2005-10-31"
It is a list of monthly dates. How can I get the same list but without commas, like this one:
"x"
"1" "2005-01-31"
"2" "2005-02-28"
"3" "2005-03-31"
"4" "2005-04-29"
"5" "2005-05-31"
"6" "2005-06-30"
"7" "2005-07-29"
"8" "2005-08-31"
"9" "2005-09-30"
"10" "2005-10-31"
Thank you!
All you have to do is specify the separator when loading the file:
read.csv(file, header = TRUE, sep = ",")
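
Reading the file only gets the data into R; to get a file without commas, it would presumably need to be written back out with a different separator. A minimal sketch, assuming a two-row sample written to a temporary file (csv_path and out_path are placeholders, not paths from the question):

```r
# Write a small sample in the same layout as the original file
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"","x"', '"1","2005-01-31"', '"2","2005-03-31"'), csv_path)

# Read it back, using the first (unnamed) column as row names
dates <- read.csv(csv_path, header = TRUE, row.names = 1)

# Write it out again, separated by spaces instead of commas
out_path <- tempfile(fileext = ".txt")
write.table(dates, out_path, sep = " ", quote = TRUE)

out_lines <- readLines(out_path)
print(out_lines)
```

Setting quote = FALSE in write.table would drop the double quotes as well, if those are not wanted.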

recommenderlab, Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix

I am trying to convert a matrix into a structure that I can use with the functions of the recommenderlab package.
datafile1 <- as(datafile1,"matrix")
datafile1
name1 name2 rating1 rating2 rating3 rating4 rating5 rating6
[1,] "1" "a" "0" "0" "1" "0" "0" "0"
[2,] "2" "d" "0" "0" "1" "0" "0" "0"
[3,] "3" "x" "1" "0" "1" "0" "0" "0"
[4,] "4" "b" "0" "1" "1" "0" "0" "0"
library(recommenderlab)
datafile1 <- as(datafile1, "realRatingMatrix")
This is the result:
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
Does anyone have an idea about what's going wrong here?
The problem is that the realRatingMatrix class extends Matrix, and Matrix has not implemented matrices with characters in them. Convert your matrix to numeric first, then convert it to a realRatingMatrix.
# Recreate data
datafile1<-read.table(textConnection('
name1 name2 rating1 rating2 rating3 rating4 rating5 rating6
"1" "a" "0" "0" "1" "0" "0" "0"
"2" "d" "0" "0" "1" "0" "0" "0"
"3" "x" "1" "0" "1" "0" "0" "0"
"4" "b" "0" "1" "1" "0" "0" "0"
'),header=TRUE)
datafile1<-as.matrix(datafile1)
# Convert to numeric (this arbitrarily maps the characters to numbers)
datafile1<-sapply(data.frame(datafile1),as.numeric)
# Create real rating matrix
as(datafile1, "realRatingMatrix")
# 4 x 8 rating matrix of class ‘realRatingMatrix’ with 32 ratings.
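
An alternative sketch that avoids mapping the name columns to arbitrary numbers: keep only the rating columns and convert those to numeric. This assumes that name1 and name2 are identifiers rather than ratings, which the question does not state; the matrix below is a shortened copy of the data, and the final recommenderlab call is left as a comment so the block runs without the package installed.

```r
# Shortened copy of the character matrix from the question
datafile1 <- matrix(
  c("1", "a", "0", "0", "1",
    "2", "d", "0", "0", "1",
    "3", "x", "1", "0", "1",
    "4", "b", "0", "1", "1"),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("name1", "name2", "rating1", "rating2", "rating3"))
)

# Keep only the rating columns (assumption: the others are identifiers)
rating_cols <- grep("^rating", colnames(datafile1))
ratings <- datafile1[, rating_cols, drop = FALSE]

# Convert the character matrix to a numeric one, keeping its dimensions
storage.mode(ratings) <- "double"

# Use name2 as row labels so rows remain identifiable
rownames(ratings) <- datafile1[, "name2"]
print(ratings)

# With recommenderlab loaded, the conversion from the answer would then be:
# as(ratings, "realRatingMatrix")
```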
