I am using R to try to separate a long string of numbers, all separated by the ";" character. The string looks like this:
";0,38;0,33;0,24;0,28; 0,33;0,33;0,38;0,23; 0,33;0,33; 0,38; 0,43; 0,51;0,56;0,33;0,56;0,33;0,43;0,51;0,56;\n\n0,61; 0,66;0,56; 0,66;0,56; 0,61; 0,66;0,61; 0,63; 0,66; 0,71;0,81;0,86; 0,99;0,86; 0,99; 1,12;1,27; 1,54; 1,57"
I have tried to do
strsplit(string,";")
and
str(string,";")
What is a quick way to do this so that I end up with all the numbers as a numeric vector? Is there a way to do this with the tidyverse?
The scan function allows using semicolons as separators and commas as decimal points (at least for input).
> vals <- scan(text=string, sep=";", dec=",")
Read 42 items
> vals
[1] NA 0.38 0.33 0.24 0.28 0.33 0.33 0.38 0.23 0.33 0.33 0.38 0.43 0.51 0.56 0.33 0.56 0.33
[19] 0.43 0.51 0.56 NA 0.61 0.66 0.56 0.66 0.56 0.61 0.66 0.61 0.63 0.66 0.71 0.81 0.86 0.99
[37] 0.86 0.99 1.12 1.27 1.54 1.57
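Since the question also asks about the tidyverse: one possible route (a sketch; the string is shortened here, and the comma-to-dot swap is an extra step that scan's dec="," handles for you) is to split on ";", trim whitespace, and convert:

```r
library(stringr)

string <- ";0,38;0,33;0,24"   # shortened version of the string above

vals <- string |>
  str_split(";") |>           # split on the semicolons (returns a list)
  unlist() |>                 # flatten to a character vector
  str_trim() |>               # drop stray spaces/newlines around entries
  str_replace(",", ".") |>    # decimal comma -> decimal point
  as.numeric()                # "" from the leading ";" becomes NA, as with scan
vals
#> [1]   NA 0.38 0.33 0.24
```

As with scan, the leading ";" produces an NA as the first element, which you can drop with vals[!is.na(vals)] if you don't want it.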
I have been struggling with creating a very simple grouped boxplot. My data looks as follows
> data
Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82
If I simply execute boxplot(data) I obtain almost what I want. One plot with three boxes, each for one of the variables in my data.
Boxplot, almost
What I want is to separate these into two boxes per variable (one for the P-indexed and one for the N-indexed observations), for a total of six boxes.
I began by introducing a new variable
data$Gruppe <- c(rep("P",9), rep("N",5))
> data
Wörter Sätze Text Gruppe
P.01 0.15 0.24 0.34 P
P.02 0.10 0.15 0.08 P
P.03 0.05 0.18 0.16 P
P.04 0.55 0.60 0.44 P
P.05 0.00 0.06 0.26 P
P.06 0.20 0.65 0.68 P
P.07 0.15 0.31 0.47 P
P.08 0.35 0.87 0.69 P
P.09 0.35 0.75 0.76 P
N.01 0.40 0.78 0.59 N
N.02 0.55 0.95 0.76 N
N.03 0.65 0.96 0.83 N
N.04 0.60 0.90 0.77 N
N.05 0.50 0.95 0.82 N
Now that the data contains a non-numerical variable I cannot simply execute the boxplot() function as before. What would be a minimal alteration to make here to obtain the six plots that I want? (colour coding for the two groups would be nice)
I have encountered some solutions for a grouped boxplot; however, the data others start from tends to be organised differently from my (very simple) data.
Many thanks!
As @teunbrand already mentioned in the comments, you can use pivot_longer to reshape your data into long format, keeping Gruppe as a grouping column. Mapping fill to Gruppe then gives two boxplots per variable, six in total, like this:
library(tidyr)
library(dplyr)
library(ggplot2)
data$Gruppe <- c(rep("P",9), rep("N",5))
data %>%
  pivot_longer(cols = -Gruppe) %>%
  ggplot(aes(x = name, y = value, fill = Gruppe)) +
  geom_boxplot()
Created on 2023-01-10 with reprex v2.0.2
Data used:
data <- read.table(text = " Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82", header = TRUE)
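If you'd prefer to stay in base R, the "minimal alteration" the question asks for can be done with stack() plus boxplot()'s formula interface. A sketch, assuming data is the data frame built above with the Gruppe column already added (the colours are arbitrary choices):

```r
## stack() puts the three numeric columns into long form with columns
## `values` and `ind`; Gruppe (length 14) recycles over the 42 stacked
## rows in the right order because stack() keeps row order within each
## column.
long <- data.frame(stack(data[1:3]), Gruppe = data$Gruppe)

## Two boxes per variable, grouped and coloured by Gruppe: six in total.
boxplot(values ~ Gruppe + ind, data = long,
        col = c("tomato", "steelblue"))
```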
I would expect gsub and stringr::str_replace_all to return the same result in the following, but only gsub returns the intended result. I am developing a lesson to demonstrate str_replace_all so I would like to know why it returns a different result here.
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
gsub(".*2017|2018.*", "", txt)
stringr::str_replace_all(txt, ".*2017|2018.*", "")
gsub returns the intended output (everything before and including 2017, and after and including 2018, has been removed).
output of gsub (intended)
[1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
However, str_replace_all replaces only the 2017 and 2018 substrings and leaves the rest, even though the same pattern is used for both.
output of str_replace_all (not intended)
[1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
Why is this the case?
Base R relies on two regex libraries:
By default, R uses TRE.
We can specify perl = TRUE to use PCRE (Perl-like regular expressions).
The {stringr} package uses ICU (Java-like regular expressions).
In your case the problem is that the dot . doesn't match line breaks in PCRE and ICU, while it does match them in TRE:
library(stringr)
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
(base_tre <- gsub(".*2017|2018.*", "", txt))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
We can use modifiers to change the behavior of PCRE and ICU regex so that the dot also matches line breaks. This produces the same output as with base R's TRE:
(base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
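In stringr specifically, the same (?s) behaviour can be requested through the dotall argument of regex(), which may read more clearly than the inline flag:

```r
library(stringr)

txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"

## dotall = TRUE makes . match line breaks for the whole pattern,
## just like prefixing it with (?s).
str_replace_all(txt, regex(".*2017|2018.*", dotall = TRUE), "")
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
```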
Finally, unlike TRE, PCRE and ICU allow us to use lookarounds, which are another option for solving the problem:
str_match(txt, "(?<=2017).*.(?=\\n2018)")
#> [,1]
#> [1,] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50"
Created on 2021-08-10 by the reprex package (v0.3.0)
fg = read.table("fungus.txt", header=TRUE, row.names=1); fg
names(dimnames(fg)) = c("Temperature", "Area"); names(dimnames(fg))  # doesn't work
dimnames(fg) = list("Temperature"=row.names(fg), "Area"=colnames(fg)); dimnames(fg)  # doesn't work
You can look at the picture of data I used below:
Using dimnames() to assign dimension names to the data.frame doesn't work. Neither command has any effect: the dimnames of fg don't change, and names(dimnames(fg)) is still NULL.
Why does this happen? How can I change the dimnames of this data.frame?
Finally, I found that converting the data frame to a matrix works well.
fg = as.matrix(read.table("fungus.txt", header=TRUE, row.names=1))
dimnames(fg) = list("Temp"=row.names(fg), "Isolate"=1:8);fg
And got the output:
Isolate
Temp 1 2 3 4 5 6 7 8
55 0.66 0.67 0.43 0.41 0.69 0.63 0.46 0.52
60 0.82 0.81 0.80 0.79 0.85 0.91 0.53 0.66
65 0.91 1.09 0.81 0.86 0.95 0.93 0.64 1.10
70 1.02 1.22 1.03 1.08 1.10 1.13 0.80 1.17
75 1.06 1.17 0.89 1.02 1.06 1.29 0.94 1.01
80 0.80 0.81 0.73 0.77 0.80 0.79 0.59 0.95
85 0.26 0.40 0.36 0.53 0.67 0.53 0.57 0.18
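A short demonstration of why the data.frame attempts fail: a data frame has no stored dimnames attribute. Calling dimnames() on a data frame rebuilds the list from row.names() and names() on every call, so any names attached to the assigned list are silently dropped (the toy df below is hypothetical, not the fungus data):

```r
df <- data.frame(a = 1:2, b = 3:4)
dimnames(df) <- list(Temp = c("r1", "r2"), Area = c("a", "b"))

# The row and column names themselves do get applied...
rownames(df)
#> [1] "r1" "r2"
# ...but the names of the dimnames list are discarded:
names(dimnames(df))
#> NULL
```

A matrix, by contrast, stores its dimnames (including their names) as a real attribute, which is why as.matrix() makes the assignment stick.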
Reply to the comments: if you do not know anything about the code, then do not ask me why I post such a question.
I am a beginner at R and I'm just trying to read a text file that contains values and create a stem display, but I keep getting an error. Here is my code:
setwd("C:/Users/Michael/Desktop/ch1-ch9 data/CH01")
gravity=read.table("C:ex01-11.txt", header=T)
stem(gravity)
**Error in stem(gravity) : 'x' must be numeric**
The File contains this:
'spec_gravity'
0.31
0.35
0.36
0.36
0.37
0.38
0.4
0.4
0.4
0.41
0.41
0.42
0.42
0.42
0.42
0.42
0.43
0.44
0.45
0.46
0.46
0.47
0.48
0.48
0.48
0.51
0.54
0.54
0.55
0.58
0.62
0.66
0.66
0.67
0.68
0.75
If you can help, I would appreciate it! Thanks!
gravity is a data frame, but stem expects a numeric vector. You need to select a column of your data set and pass it to stem, i.e.
## The first column
stem(gravity[,1])
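Equivalently, you can select the column by name, or skip the data frame entirely with scan() (a sketch; this assumes read.table stripped the single quotes from the header line, so the column is called spec_gravity):

```r
# By name instead of by position:
stem(gravity$spec_gravity)

# Or read the file straight into a numeric vector, skipping the header line:
gravity_vec <- scan("C:ex01-11.txt", skip = 1)
stem(gravity_vec)
```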
I have a program which pulls data out of a MySQL database, decodes a pair of
binary columns, and then sums together a subset of the rows within the pair
of binary columns. Running the program on a sample data set takes 12-14 seconds,
with 9-10 of those taken up by unlist. I'm wondering if there is any way to
speed things up.
Structure of the table
The rows I'm getting from the database look like:
| array_length | mz_array | intensity_array |
|--------------+-----------------+-----------------|
| 98 | 00c077e66340... | 002091c37240... |
| 74 | c04a7c7340... | db87734000... |
where array_length is the number of little-endian doubles in the two arrays
(they are guaranteed to be the same length). So the first row has 98 doubles in
each of mz_array and intensity_array. array_length has a mean of 825 and a
median of 620 with 13,000 rows.
Decoding the binary arrays
Each row gets decoded by being passed to the following function. Once the binary
arrays have been decoded, array_length is no longer needed.
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
sapply(list(mz_array=mz_array, intensity_array=intensity_array),
readBin,
what="double",
endian="little",
n=array_length)
}
Summing the arrays
The next step is to sum the values in intensity_array, but only if their
corresponding entry in mz_array is within a certain window. The arrays are
ordered by mz_array, ascending. I am using the following function to sum up
the intensity_array values:
SumInWindow <- function(spectrum, lower, upper) {
sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}
Where spectrum is the output from DecodeSpectrum, a matrix.
Operating over list of rows
Each row is handled by:
ProcessSegment <- function(spectra, window_bounds) {
lower <- window_bounds[1]
upper <- window_bounds[2]
## Decode a single spectrum and sum the intensities within the window.
SumDecode <- function (...) {
SumInWindow(DecodeSpectrum(...), lower, upper)
}
do.call("mapply", c(SumDecode, spectra))
}
And finally, the rows are fetched and handed off to ProcessSegment with this
function:
ProcessAllSegments <- function(conn, window_bounds) {
nextSeg <- function() odbcFetchRows(conn, max=batchSize, buffsize=batchSize)
while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
print(ProcessSegment(res$data, window_bounds))
}
}
I'm doing the fetches in segments so that R doesn't have to load the entire data
set into memory at once (it was causing out of memory errors). I'm using the
RODBC driver because the RMySQL driver isn't able to return unsullied binary
values (as far as I could tell).
Performance
For a sample data set of about 140MiB, the whole process takes around 14 seconds
to complete, which is not that bad for 13,000 rows. Still, I think there's room
for improvement, especially when looking at the Rprof output:
$by.self
self.time self.pct total.time total.pct
"unlist" 10.26 69.99 10.30 70.26
"SumInWindow" 1.06 7.23 13.92 94.95
"mapply" 0.48 3.27 14.44 98.50
"as.vector" 0.44 3.00 10.60 72.31
"array" 0.40 2.73 0.40 2.73
"FUN" 0.40 2.73 0.40 2.73
"list" 0.30 2.05 0.30 2.05
"<" 0.22 1.50 0.22 1.50
"unique" 0.18 1.23 0.36 2.46
">" 0.18 1.23 0.18 1.23
".Call" 0.16 1.09 0.16 1.09
"lapply" 0.14 0.95 0.86 5.87
"simplify2array" 0.10 0.68 11.48 78.31
"&" 0.10 0.68 0.10 0.68
"sapply" 0.06 0.41 12.36 84.31
"c" 0.06 0.41 0.06 0.41
"is.factor" 0.04 0.27 0.04 0.27
"match.fun" 0.04 0.27 0.04 0.27
"<Anonymous>" 0.02 0.14 13.94 95.09
"unique.default" 0.02 0.14 0.06 0.41
$by.total
total.time total.pct self.time self.pct
"ProcessAllSegments" 14.66 100.00 0.00 0.00
"do.call" 14.50 98.91 0.00 0.00
"ProcessSegment" 14.50 98.91 0.00 0.00
"mapply" 14.44 98.50 0.48 3.27
"<Anonymous>" 13.94 95.09 0.02 0.14
"SumInWindow" 13.92 94.95 1.06 7.23
"sapply" 12.36 84.31 0.06 0.41
"DecodeSpectrum" 12.36 84.31 0.00 0.00
"simplify2array" 11.48 78.31 0.10 0.68
"as.vector" 10.60 72.31 0.44 3.00
"unlist" 10.30 70.26 10.26 69.99
"lapply" 0.86 5.87 0.14 0.95
"array" 0.40 2.73 0.40 2.73
"FUN" 0.40 2.73 0.40 2.73
"unique" 0.36 2.46 0.18 1.23
"list" 0.30 2.05 0.30 2.05
"<" 0.22 1.50 0.22 1.50
">" 0.18 1.23 0.18 1.23
".Call" 0.16 1.09 0.16 1.09
"nextSeg" 0.16 1.09 0.00 0.00
"odbcFetchRows" 0.16 1.09 0.00 0.00
"&" 0.10 0.68 0.10 0.68
"c" 0.06 0.41 0.06 0.41
"unique.default" 0.06 0.41 0.02 0.14
"is.factor" 0.04 0.27 0.04 0.27
"match.fun" 0.04 0.27 0.04 0.27
$sample.interval
[1] 0.02
$sampling.time
[1] 14.66
I'm surprised to see unlist taking up so much time; this says to me that there
might be some redundant copying or rearranging going on. I'm new at R, so it's
entirely possible that this is normal, but I'd like to know if there's anything
glaringly wrong.
Update: sample data posted
I've posted the full version of the program
here and the sample data I use
here. The sample data is the
gziped output from mysqldump. You need to set the proper environment
variables for the script to connect to the database:
MZDB_HOST
MZDB_DB
MZDB_USER
MZDB_PW
To run the script, you must specify the run_id and the window boundaries. I
run the program like this:
Rscript ChromatoGen.R -i 1 -m 600 -M 1200
These window bounds are pretty arbitrary, but select roughly a half to a third
of the range. If you want to print the results, put a print() around the call
to ProcessSegment within ProcessAllSegments. Using those parameters, the
first 5 should be:
[1] 7139.682 4522.314 3435.512 5255.024 5947.999
You probably want to limit the number of results, unless you want 13,000 numbers filling your screen :) The simplest way is to just add LIMIT 5 at the end of the query.
I've figured it out!
The problem was in the sapply() call. sapply does a fair amount of renaming and attribute setting which slows things down massively for arrays of this size. Replacing DecodeSpectrum with the following code brought the sample time from 14.66 seconds down to 3.36 seconds, a more than 4-fold speedup!
Here's the new body of DecodeSpectrum:
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
## needed to tell `vapply` how long the result should be. No, there isn't an
## easier way to do this.
resultLength <- rep(1.0, array_length)
vapply(list(mz_array=mz_array, intensity_array=intensity_array),
readBin,
resultLength,
what="double",
endian="little",
n=array_length,
USE.NAMES=FALSE)
}
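The attribute overhead is easy to see on a toy example (hypothetical data, not the original table: bin here is just five little-endian doubles). sapply builds dimnames from the list's names, while vapply with USE.NAMES = FALSE skips that work:

```r
# Five doubles serialized to little-endian bytes, standing in for one
# of the database's binary columns.
bin <- writeBin(as.double(1:5), raw(), endian = "little")
spectra <- list(mz_array = bin, intensity_array = bin)

s <- sapply(spectra, readBin, what = "double", endian = "little", n = 5)
v <- vapply(spectra, readBin, rep(1.0, 5),
            what = "double", endian = "little", n = 5, USE.NAMES = FALSE)

colnames(s)  # sapply copied the list names onto the result...
#> [1] "mz_array"        "intensity_array"
colnames(v)  # ...vapply with USE.NAMES = FALSE did not
#> NULL
identical(unname(s), v)  # the values themselves are the same
#> [1] TRUE
```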
The Rprof output now looks like:
$by.self
self.time self.pct total.time total.pct
"<Anonymous>" 0.64 19.75 2.14 66.05
"DecodeSpectrum" 0.46 14.20 1.12 34.57
".Call" 0.42 12.96 0.42 12.96
"FUN" 0.38 11.73 0.38 11.73
"&" 0.16 4.94 0.16 4.94
">" 0.14 4.32 0.14 4.32
"c" 0.14 4.32 0.14 4.32
"list" 0.14 4.32 0.14 4.32
"vapply" 0.12 3.70 0.66 20.37
"mapply" 0.10 3.09 2.54 78.40
"simplify2array" 0.10 3.09 0.30 9.26
"<" 0.08 2.47 0.08 2.47
"t" 0.04 1.23 2.72 83.95
"as.vector" 0.04 1.23 0.08 2.47
"unlist" 0.04 1.23 0.08 2.47
"lapply" 0.04 1.23 0.04 1.23
"unique.default" 0.04 1.23 0.04 1.23
"NextSegment" 0.02 0.62 0.50 15.43
"odbcFetchRows" 0.02 0.62 0.46 14.20
"unique" 0.02 0.62 0.10 3.09
"array" 0.02 0.62 0.04 1.23
"attr" 0.02 0.62 0.02 0.62
"match.fun" 0.02 0.62 0.02 0.62
"odbcValidChannel" 0.02 0.62 0.02 0.62
"parent.frame" 0.02 0.62 0.02 0.62
$by.total
total.time total.pct self.time self.pct
"ProcessAllSegments" 3.24 100.00 0.00 0.00
"t" 2.72 83.95 0.04 1.23
"do.call" 2.68 82.72 0.00 0.00
"mapply" 2.54 78.40 0.10 3.09
"<Anonymous>" 2.14 66.05 0.64 19.75
"DecodeSpectrum" 1.12 34.57 0.46 14.20
"vapply" 0.66 20.37 0.12 3.70
"NextSegment" 0.50 15.43 0.02 0.62
"odbcFetchRows" 0.46 14.20 0.02 0.62
".Call" 0.42 12.96 0.42 12.96
"FUN" 0.38 11.73 0.38 11.73
"simplify2array" 0.30 9.26 0.10 3.09
"&" 0.16 4.94 0.16 4.94
">" 0.14 4.32 0.14 4.32
"c" 0.14 4.32 0.14 4.32
"list" 0.14 4.32 0.14 4.32
"unique" 0.10 3.09 0.02 0.62
"<" 0.08 2.47 0.08 2.47
"as.vector" 0.08 2.47 0.04 1.23
"unlist" 0.08 2.47 0.04 1.23
"lapply" 0.04 1.23 0.04 1.23
"unique.default" 0.04 1.23 0.04 1.23
"array" 0.04 1.23 0.02 0.62
"attr" 0.02 0.62 0.02 0.62
"match.fun" 0.02 0.62 0.02 0.62
"odbcValidChannel" 0.02 0.62 0.02 0.62
"parent.frame" 0.02 0.62 0.02 0.62
$sample.interval
[1] 0.02
$sampling.time
[1] 3.24
It's possible that some additional performance could be squeezed out of messing
with the do.call('mapply', ...) call, but I'm satisfied enough with the
performance as is that I'm not willing to waste time on that.