I have a vector of strings like this:
"1111111221111122111111UUUUUUUUUUUUUUUUUU"
"---1-1---1--111111"
"1111112111 1111" (with blank spaces)
Each string has a different length, and I want to extract the maximum value of each string; for the three examples above the max values would be (2, 1, 2). But I don't know how to handle the letters, the dash, or the blank spaces. All three of these count as the minimum, i.e., 1 is bigger than "U", "-", and " ", and those three rank equally among themselves.
Any advice?
Best regards
Decompose the problem into independent, solvable steps:
Transform the input into a suitable format
Find the maximum
Then we get:
# Separate each string into individual characters
digits_str = strsplit(input, '')
# Convert to integer; non-digit characters ("U", "-", " ") become NA
digits = lapply(digits_str, as.integer)
# Perform the actual logic on each input string in turn, ignoring the NAs
result = vapply(digits, max, integer(1L), na.rm = TRUE)
This uses the lapply and vapply functions which allow you to perform an operation (here first as.integer and then max) on all values in a vector/list.
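For the three example strings from the question, this gives (as.integer emits coercion warnings for the non-digit characters, which are safe to suppress here):

input <- c("1111111221111122111111UUUUUUUUUUUUUUUUUU",
           "---1-1---1--111111",
           "1111112111 1111")
digits <- suppressWarnings(lapply(strsplit(input, ''), as.integer))
vapply(digits, max, integer(1L), na.rm = TRUE)
# [1] 2 1 2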
Is it possible to pass column indices to read_csv?
I am passing many CSV files to read_csv with different header names, so rather than specifying names I wish to use column indices.
Is this possible?
df.list <- lapply(myExcelCSV, read_csv, skip = headers2skip[i]-1)
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or ‘_’/‘-’ to skip the column.
If you know the total number of columns in the file you could do it like this:
library(readr)

my_read <- function(..., tot_cols, skip_cols = numeric(0)) {
  # Start with "guess" ("?") for every column, then mark skipped columns with "_"
  csr <- rep("?", tot_cols)
  csr[skip_cols] <- "_"
  # Collapse into the compact col_types string, e.g. "?_??_"
  csr <- paste(csr, collapse = "")
  read_csv(..., col_types = csr)
}
If you don't know the total number of columns in advance you could add code to this function to read just the first line of the file and count the number of columns returned ...
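A minimal sketch of that idea, assuming readr is loaded (the file name and the get_tot_cols helper are made up for illustration):

get_tot_cols <- function(file) {
  # Read a single row with everything as character, purely to count the columns
  ncol(read_csv(file, n_max = 1, col_types = cols(.default = "c")))
}
df <- my_read("data.csv", tot_cols = get_tot_cols("data.csv"), skip_cols = c(2, 5))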
FWIW, the skip argument might not do what you think it does: it skips rows rather than selecting/deselecting columns. As I read ?readr::read_csv(), there doesn't seem to be any convenient way to skip and/or include particular columns (by name or by index) except by some ad hoc mechanism such as the one suggested above. This might be worth a feature request/discussion on the readr issues list (e.g. add cols_include and/or cols_exclude arguments that could be specified by name or position).
I have a vector of strings ("sentences"), where each sentence has different number of different words in them:
sentences <- c("word01 word02",
"word01 word04 word03",
"word10",
"",
"word02 word07 word08 word09",
...)
I also have a vector of words of interest:
wordsOfInterest <- c("word01", "word02", ...)
I want to know if at least one of the wordsOfInterest is found in each of the sentences. The output should be a logical vector with length identical to that of the sentences vector. Thus, given the vectors above, the output vector should have values
TRUE TRUE FALSE FALSE TRUE ...
The number of sentences depends on the dataset and can be anything from few to around one hundred thousand, the number of words in each sentence can be anything from zero to around one hundred, and the number of wordsOfInterest can be anything from one to around one hundred.
Furthermore, I have several datasets to analyze, each with several individual sentence vectors. Then there are several sets of wordsOfInterest vectors that I need to apply to each sentence vector in each dataset, so the cumulative computational requirements start to add up.
The only successful solution I've come up with so far is to use str_detect one by one for each of the wordsOfInterest and apply it to the various sentences vectors, but of course I'd like to find a better solution. I've tried to get my head around this using native vectorization as well as for loops in R, but to no avail. So I have two problems: how to do it in the first place, and then how to do it as fast as possible (both computation- and typing-wise). I appreciate all help.
You can use grepl() and collapse your wordsOfInterest into a single alternation pattern, wrapping the alternation in \\b word-boundary checks. This prevents partial matches, such as finding "then" when the word of interest is "the". (Each alternative needs a boundary on both sides, which grouping the alternation in parentheses takes care of.)

matchString <- paste0("\\b(", paste(wordsOfInterest, collapse = "|"), ")\\b")
grepl(pattern = matchString, x = sentences)
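Applied to the example vectors from the question:

sentences <- c("word01 word02", "word01 word04 word03", "word10", "",
               "word02 word07 word08 word09")
wordsOfInterest <- c("word01", "word02")
matchString <- paste0("\\b(", paste(wordsOfInterest, collapse = "|"), ")\\b")
grepl(pattern = matchString, x = sentences)
# [1]  TRUE  TRUE FALSE FALSE  TRUE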
Confirmed using the following:
wordsOfInterest <- sample(1:1000000, 100)
sentences <- ""
# Build a random number of sentences, each containing 0 to 100 random "words"
for (i in 1:sample(1:100, 1)) {
  sentences <- c(sentences, paste(sample(1:1000000, sample(0:100, 1)), collapse = " "))
}
matchString <- paste0("\\b(", paste(wordsOfInterest, collapse = "|"), ")\\b")
grepl(pattern = matchString, x = sentences)
Regarding the throughput of the grepl() call: For 64,000 sentences of the lengths you specified, it took ~1.36 seconds.
> length(sentences)
[1] 63470
> microbenchmark::microbenchmark(grepl(pattern = matchString, x = sentences), times = 10)
Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval
 grepl(pattern = matchString, x = sentences) 1.280757 1.317157 1.357845 1.337714 1.374004 1.554918    10
I need to read a bunch of zip codes into R, but they have to be of double type. I also need to keep the leading zeros for the ones that start with zero. I tried
for (i in 1:length(df$region)){
if (nchar(df$region[i])==4) {
df$region[i] <- paste0("0", df$region[i])
}
}
This converts the way I want to, but it changes them all to character type, and then I can't pass the region column to another function that requires numeric or double. If I convert to numeric or double, it gets rid of the leading zeros again. Any ideas?
Why not store them as numeric and just add the zeros when needed through formatC? For example,
tst <- 345
class(tst)
formatC(tst, width = 5, format = "d", flag = "0")
gives,
#[1] "numeric"
#[1] "00345"
For brevity, you could even write a wrapper:
zip <- function(z)formatC(z, width = 5, format = "d", flag = "0")
zip(tst)
#[1] "00345"
And this only adds leading zeroes when needed.
zip(12345)
#[1] "12345"
I would recommend keeping two columns, one in which the ZIP code appears as text, and the other as a double. You would have to first read in the ZIP codes as character data, then create the double column from that, e.g.
# given df$zip_code
df$zip_as_double <- as.double(df$zip_code)
Double variables don't normally maintain the number of leading zeroes, because those digits are not significant anyway. So I think storing your ZIP codes as character data is the only option here.
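A minimal sketch of the two-column approach (the column names here are hypothetical):

df <- data.frame(zip_code = c("00345", "12345"), stringsAsFactors = FALSE)
df$zip_as_double <- as.double(df$zip_code)
df
#   zip_code zip_as_double
# 1    00345           345
# 2    12345         12345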
I have a dataset with a "value (in millions of USD)" column that I want to manipulate. Entries are strings in different formats: either with a dollar sign and followed by an M, e.g. "$1.3M", or followed by a K, e.g. "$450K", or some that I've already turned into proper numerical entries (e.g. 40 for 40 million USD).
I want to: get rid of the $ and extract only the numerical value for each row in millions.
I'm probably looking at some kind of column split based on whether the value contains M or K, with an ifelse resembling something like: ifelse(PL$`VALUE (M)` contains "M", extract the numeric part of PL$`VALUE (M)`, PL$`VALUE (M)` * 10^-3).
Haven't quite figured out the easiest way to do this on R though. Help would be appreciated!
You can use gsubfn to specify how to match the currency to numeric.
x <- c("$1.3M", "$450K")
library(gsubfn)
as.numeric(
  gsubfn("\\D", list("$" = "", "M" = "e6", "K" = "e3"), x)
)
# [1] 1300000  450000
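Since the question asks for values in millions of USD rather than raw dollars, a small variant of the same idea works (a sketch; unmatched characters such as "." are left in place, as the output above already shows):

as.numeric(gsubfn("\\D", list("$" = "", "M" = "", "K" = "e-3"), x))
# [1] 1.30 0.45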
One of the columns in my data frame is a character vector with time span values represented as number+suffix, as so:
c("16.14ms", "7.58ms", "8.38ms", "7.29ms", "6.40ms", "5.76ms",
"5.56ms", "5.27us", "5.12ms", "5.03us", "4.91ms", "4.76ms", "16.12ms",
"7.56ms", "8.59ms", "7.16ms", "6.59ms", "5.91s", "5.62ms", "5.44ms"
)
The units are limited to micro (us), milli (ms), and full seconds (s).
Is there a simple way to make this into a numeric column with all values being either in milliseconds or seconds?
Here are some approaches. We suppose x is the input vector shown in the question.
1) Remove the s, replace m with e-3 and replace u with e-6. Then convert to numeric:
as.numeric(sub("u", "e-6", sub("m", "e-3", sub("s", "", x))))
2) This could also be done neatly using gsubfn. First we match the suffix and then use a replacement list as shown:
library(gsubfn)
as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", s = "e0"), x))
This would be particularly convenient if it were desired to extend the problem to many time units as it would just be a matter of extending the list.
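For instance, supporting nanoseconds (a hypothetical extra unit, not present in the question's data) would just be one more list entry:

as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", ns = "e-9", s = "e0"), x))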
Note that at the top of page 4 of the gsubfn vignette there is an example which is very close to this one.