Get total occurrences based on exactly 15 number in sequence - r

I currently have a messy DF where I'd like to scan everything at once for exactly 15 digit sequences. The problem with the code below is that it returns all sequences that are at least 15 digits - so I get the total occurrences returned when they also exceed 15.
Can I return occurrences based on exactly 15 and omit anything less than 15 or greater from the aggregation?
sum(str_count(list_all_df2,pattern = "[0-9]{15?}"))

Text extraction is always difficult without a firm example, but we can create a vector of text containing random-length strings of digits like this:
set.seed(12345)
string <- apply(
replicate(30, sample(c(0:9, " "), 50, TRUE, prob = c(rep(10, 10), 7))),
2, paste, collapse = "")
head(string)
#> [1] "849452168 028055552 51 875863690381144729549310007"
#> [2] " 91386805393 9 3 27 861107 86246002196904907868925"
#> [3] "17 7647433759594 701660889 2390898 4822968372 641"
#> [4] "9691398547874956 295 3915984992533 91 229411 03935"
#> [5] "74268900671853 516722206484567176886 465 4978 619"
#> [6] "2 03440226 9948029 8212 95601429203509668901919360"
To get all the length-15 strings of digits from this vector we can do:
unlist(stringr::str_extract_all(string, "\\b\\d{15}\\b"))
#> [1] "013567419835491" "607222319557192" "742113985775821" "928244409745755"
Created on 2022-05-04 by the reprex package (v2.0.1)
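Since the question asks for a total count rather than the matches themselves, the same idea can be turned into a count. The following is a minimal base R sketch, not part of the original answer. The vector txt is made up for illustration, and the pattern uses lookarounds instead of \\b so that a 15-digit run glued to letters is still counted (with \\b, a digit next to a letter is not a word boundary). Whether that is the behaviour you want depends on your data.

```r
# Made-up input: runs of 15, 16, 15 (glued to letters) + 14, and 15 + 15 digits
txt <- c(
  paste("id", strrep("1", 15), "ok"),
  strrep("2", 16),
  paste0("a", strrep("3", 15), "b ", strrep("4", 14)),
  paste0(strrep("5", 15), ",", strrep("6", 15))
)

# Exactly 15 digits: not preceded and not followed by another digit
pattern <- "(?<![0-9])[0-9]{15}(?![0-9])"

# One element per string, holding the matches found in that string
matches <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

counts <- lengths(matches)  # per-string counts: 1 0 1 2
total <- sum(counts)        # overall count: 4
total
```

If the data live in a data frame, applying this to unlist(lapply(df, as.character)) should give one count across all columns.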

Related

Randomly select strings based on multiple criteria in R

I'm trying to select strings based on multiple criteria but so far no success.
My vector contains the following strings (a total of 48 strings): (1_A, 1_B, 1_C, 1_D, 2_A, 2_B, 2_C, 2_D... 12_A, 12_B, 12_C, 12_D)
I need to randomly select 12 strings. The criteria are:
I need one string containing each number
I need exactly three strings that contain each letter.
I need the final output to be something like: 1_A, 2_A, 3_A, 4_B, 5_B, 6_B, 7_C, 8_C, 9_C, 10_D, 11_D, 12_D.
Any help will be appreciated.
All the best,
Angelica
The trick here is not to use your vector at all, but to create the sample strings from their components, which are randomly chosen according to your criteria.
sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_'))
#> [1] "12_D" "8_C" "7_B" "1_B" "6_D" "5_A" "4_B" "10_A" "2_C" "3_A" "11_D" "9_C"
This will give a different result each time.
Note that all 4 letters are always represented exactly 3 times since we use rep(LETTERS[1:4], 3), all numbers 1 to 12 are present exactly once but in a random order since we use sample(12), and the final result is shuffled so that the order of the letters and the order of the numbers is not predictable.
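If you want to convince yourself that both criteria hold for any draw, a quick check like the one below may help. It is only a sketch: the seed is arbitrary, and the names picked, nums and lets are introduced here for illustration.

```r
set.seed(1)  # arbitrary seed, only so the draw is repeatable

picked <- sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = "_"))

# Split each string into its number and its letter
parts <- strsplit(picked, "_", fixed = TRUE)
nums <- as.integer(vapply(parts, `[`, character(1), 1))
lets <- vapply(parts, `[`, character(1), 2)

# Every number 1..12 appears exactly once
stopifnot(identical(sort(nums), 1:12))

# Every letter A..D appears exactly three times
stopifnot(all(table(lets) == 3), length(unique(lets)) == 4)
```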
If you want the result to give you the indices of your original vector where the samples are from, then it's easy to do that using match. We can recreate your vector by doing:
vec <- paste(rep(1:12, each = 4), rep(LETTERS[1:4], 12), sep = "_")
vec
#> [1] "1_A" "1_B" "1_C" "1_D" "2_A" "2_B" "2_C" "2_D" "3_A" "3_B"
#> [11] "3_C" "3_D" "4_A" "4_B" "4_C" "4_D" "5_A" "5_B" "5_C" "5_D"
#> [21] "6_A" "6_B" "6_C" "6_D" "7_A" "7_B" "7_C" "7_D" "8_A" "8_B"
#> [31] "8_C" "8_D" "9_A" "9_B" "9_C" "9_D" "10_A" "10_B" "10_C" "10_D"
#> [41] "11_A" "11_B" "11_C" "11_D" "12_A" "12_B" "12_C" "12_D"
And to find the location of the random samples we can do:
samp <- match(sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_')), vec)
samp
#> [1] 30 26 37 43 46 20 8 3 33 24 15 9
So that, for example, you can retrieve an appropriate sample from your vector with:
vec[samp]
#> [1] "8_B" "7_B" "10_A" "11_C" "12_B" "5_D" "2_D" "1_C" "9_A" "6_D"
#> [11] "4_C" "3_A"
Created on 2022-04-10 by the reprex package (v2.0.1)

How can I get my time strings in a right format?

I am stuck with converting strings to times. I am aware that there are many topics on Stack regarding converting strings-to-times, however I couldn't fix this problem with the solutions.
Situation
I have a file with times like this:
> dput(df$Time[1:50])
c("1744.3", "2327.54", "1718.51", "2312.3200000000002", "1414.16",
"2046.15", "1442.5", "1912.22", "2303.2199999999998", "2146.3200000000002",
"1459.02", "1930.15", "1856.23", "2319.15", "1451.05", "25.460000000000036",
"1453.25", "2309.02", "2342.48", "2322.5300000000002", "2101.5",
"2026.07", "1245.04", "1945.15", "5.4099999999998545", "1039.5",
"1731.37", "2058.41", "2030.36", "1814.31", "1338.18", "1858.33",
"1731.36", "2343.38", "1733.27", "2304.59", "1309.47", "1916.11",
"1958.3", "1929.54", "1756.4", "1744.23", "1731.26", "1844.47",
"1353.25", "1958.3", "1746.44", "1857.53", "2047.15", "2327.2199999999998", "1915"
)
In this example, the times should be like this:
"1744.3" = 17:44:30
"2327.54" = 23:27:54
"1718.51" = 17:18:51
"2312.3200000000002" = 23:12:32
...
"25.460000000000036" = 00:25:46 # as you can see, the first two 00 are missing.
"1915" = 19:15:00
However, I tried multiple things (and now I am even stuck with str_replace()). Hopefully someone knows how I can transform this.
What have I tried?
format(df$Time, "%H%M.%S") # Yes I know...
# So therefore I thought, lets replace the strings to get them in a proper format
# like HH:MM:SS. First step was to replace the "." for a ":"
str_replace("." , ":", df$Time) # this was leading to "." (don't know why)
And that was the point that I was so frustrated that I posted it on Stack. Hope that you guys can help me.
Many thanks in advance!
Here is a way to do this, storing the output from dput in x.
library(magrittr)
#Remove all the dots
gsub('\\.', '', x) %>%
#Select only first 6 characters
substr(1, 6) %>%
#Pad 0's at the end
stringr::str_pad(6,pad = '0', side = 'right') %>%
#Add colon (:) separator
sub('(.{2})(.{2})', '\\1:\\2:', .)
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15"
# [7] "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
#[13] "18:56:23" "23:19:15" "14:51:05" "25:46:00" "14:53:25" "23:09:02"
#...
Note that this can be done without pipes as well; they are used here for clarity. From here you can convert the result to POSIXct if needed.
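One thing to watch in the output above: "25.460000000000036" comes out as "25:46:00" rather than the "00:25:46" the question asks for, because the missing leading zeros are never restored. A possible workaround, sketched here in base R under the assumption that the part before the dot is always HHMM with leading zeros dropped, is to pad that part on the left and the seconds on the right. The helper name to_hms is made up for this example.

```r
to_hms <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)

  # Part before the dot: left-pad to 4 digits (HHMM)
  hhmm <- vapply(parts, function(p) sprintf("%04d", as.integer(p[1])),
                 character(1))

  # Part after the dot: keep the first two digits, right-pad with zeros
  ss <- vapply(parts, function(p) {
    frac <- if (length(p) > 1) p[2] else ""
    substr(paste0(frac, "00"), 1, 2)
  }, character(1))

  paste(substr(hhmm, 1, 2), substr(hhmm, 3, 4), ss, sep = ":")
}

to_hms(c("1744.3", "25.460000000000036", "1915", "5.4099999999998545"))
# [1] "17:44:30" "00:25:46" "19:15:00" "00:05:40"
```

Like the pipeline above, this truncates the seconds instead of rounding them.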
The main problem is the time "25.460000000000036". But I think I found a clear though somewhat verbose solution:
library(tidyverse)
df %>%
mutate(hours = formatC(as.numeric(Time), width = 4, format = "d", flag = "0"),
seconds = as.numeric(str_extract(Time, "[.].+")) * 100) %>%
mutate(Time_new = stringi::stri_datetime_parse(paste0(hours, seconds), format = "HHmm.ss"))
#> # A tibble: 51 x 4
#> Time hours seconds Time_new
#> <chr> <chr> <dbl> <dttm>
#> 1 25.460000000000036 0025 46. 2020-02-19 00:25:46 # I changed the order of the times so the weird format is on top
#> 2 1744.3 1744 30 2020-02-19 17:44:30
#> 3 2327.54 2327 54 2020-02-19 23:27:54
#> 4 1718.51 1718 51 2020-02-19 17:18:51
#> 5 2312.3200000000002 2312 32. 2020-02-19 23:12:32
#> 6 1414.16 1414 16 2020-02-19 14:14:16
#> 7 2046.15 2046 15 2020-02-19 20:46:15
#> 8 1442.5 1442 50 2020-02-19 14:42:50
#> 9 1912.22 1912 22 2020-02-19 19:12:22
#> 10 2303.2199999999998 2303 22.0 2020-02-19 23:03:21
#> # ... with 41 more rows
If you also have times without fractions (i.e., without the dot) you could use this approach:
normalize_time <- function(t) {
formatC(as.numeric(t) * 100, width = 6, format = "d", flag = "0")
}
df %>%
mutate(Time_new = as.POSIXct(normalize_time(Time), format = "%H%M%S"))
A roundabout way of doing it
tmp=as.numeric(lapply(strsplit(as.character(df$Time),"\\."),function(x){nchar(x[1])}))
ifelse(tmp>2,
substr(as.POSIXct(df$Time,format="%H%M.%S"),12,19),
substr(as.POSIXct(df$Time,format="%M.%S"),12,19))
A data.table way
First, convert the strings in your vector to numeric, multiply by 100 (to move the seconds in front of the decimal separator) and convert to integer. Then use sprintf() to add leading zeros to get a 6-digit string. Finally, convert to time.
data.table::as.ITime( sprintf( "%06d",
as.integer( as.numeric(time) * 100 ) ),
format = "%H%M%S" )
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15" "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
# [13] "18:56:23" "23:19:15" "14:51:05" "00:25:46" "14:53:25" "23:09:02" "23:42:48" "23:22:53" "21:01:50" "20:26:07" "12:45:04" "19:45:15"
# [25] "00:05:40" "10:39:50" "17:31:37" "20:58:41" "20:30:36" "18:14:31" "13:38:18" "18:58:33" "17:31:36" "23:43:38" "17:33:27" "23:04:59"
# [37] "13:09:47" "19:16:11" "19:58:30" "19:29:54" "17:56:40" "17:44:23" "17:31:26" "18:44:47" "13:53:25" "19:58:30" "17:46:44" "18:57:53"
# [49] "20:47:15" "23:27:21"
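Note that as.integer() truncates, which is why "2303.2199999999998" ends up as 23:03:21 above. If the long decimals are just floating-point noise (the question's own example maps "2312.3200000000002" to 23:12:32, which suggests so), rounding before the conversion may be closer to what is intended. The following base R sketch shows that variant on a few values from the question. It stops at an "HH:MM:SS" string, and the names hhmmss and times are only illustrative.

```r
x <- c("1744.3", "2303.2199999999998", "25.460000000000036", "1915")

# Multiply by 100, round away the floating-point noise, pad to 6 digits
hhmmss <- sprintf("%06d", as.integer(round(as.numeric(x) * 100)))

# Insert the colon separators
times <- sub("(..)(..)(..)", "\\1:\\2:\\3", hhmmss)
times
# [1] "17:44:30" "23:03:22" "00:25:46" "19:15:00"
```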

How do I add a leading numeric identifier (not necessarily zero) to a character string in r

I apologize if this is a duplicate, I've searched through all of the "add leading zero" content I can find, and I'm struggling to find a solution I can work with. I have the following:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier)
and I want a modified siteid that is always six (6) characters long with zeroes to fill the gaps. The Site ID can vary in nchar from 1-3, the modifier is always a length of 2, and the number of zeroes can vary depending on the length of the site ID (so that 6 is always the final modified length).
I would like the following final output:
df
# siteid modifier mod.siteid
#1 1 44 440001
#2 11 22 220011
#3 111 11 110111
Thanks for any suggestions or direction. This could also be numeric, but it seems like character manipulation has more options...?
The vocabulary here is left pad and paste. Here is one way using sprintf():
df$mod.siteid <- with(df, sprintf("%s%04d", modifier, as.integer(siteid)))
# Note:
# code simplified thanks to suggestion by Maurits.
Output:
siteid modifier mod.siteid
1 1 44 440001
2 11 22 220011
3 111 11 110111
Data:
df <- data.frame(
siteid = c("1", "11", "111"),
modifier = c("44", "22", "11"),
stringsAsFactors = FALSE
)
Extra: If you don't want to left pad with 0, then using the stringi package is one option: with(df, paste0(modifier, stringi::stri_pad_left(siteid, 4, "q")))
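If the total width or the modifier length might change later, the padding width can be computed instead of hard-coded. This is a small base R sketch of that idea. The function name make_id and its total argument are made up for illustration, and it assumes siteid and modifier are character vectors.

```r
make_id <- function(siteid, modifier, total = 6L) {
  # Number of zeros needed between modifier and siteid (never negative)
  pad <- pmax(total - nchar(modifier) - nchar(siteid), 0L)
  paste0(modifier, strrep("0", pad), siteid)
}

ids <- make_id(c("1", "11", "111"), c("44", "22", "11"))
ids
# [1] "440001" "220011" "110111"
```

Because it never converts to a number, this version keeps site IDs that are not purely numeric intact.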
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier, stringsAsFactors = FALSE)
df$mod.siteid = paste0( df$modifier,
formatC( as.numeric(df$siteid), width = 4, format = "d", flag="0") )
df
# siteid modifier mod.siteid
# 1 1 44 440001
# 2 11 22 220011
# 3 111 11 110111

Remove dots from data column

I'm a beginner dealing with R and working with strings.
I've been trying to remove periods from data but unfortunately I can't find a solution.
This is the data I'm working on in a dataframe df:
df <- read.table(text = " n mesAno receita
97 1/2009 3.812.819.062,06
98 2/2009 4.039.362.599,36
99 3/2009 3.652.885.587,18
100 4/2009 3.460.247.960,02
101 5/2009 3.465.677.403,12
102 6/2009 3.131.903.622,55
103 7/2009 3.204.983.361,46
104 8/2009 3.811.786.009,24
105 9/2009 3.180.864.095,05
106 10/2009 3.352.535.553,88
107 11/2009 5.214.148.756,95
108 12/2009 4.491.795.201,50
109 1/2010 4.333.557.619,30
110 2/2010 4.808.488.277,86
111 3/2010 4.039.347.179,81
112 4/2010 3.867.676.530,69
113 5/2010 6.356.164.873,94
114 6/2010 3.961.793.391,19
115 7/2010 3797656130.81
116 8/2010 4709949715.37
117 9/2010 4047436592.12
118 10/2010 3923484635.28
119 11/2010 4821729985.03
120 12/2010 5024757038.22",
header = TRUE,
stringsAsFactors = TRUE)
My objective is to transform the receita column to numeric, as it is being stored as a factor. But applying conversion functions like as.numeric(as.factor(x)) does not work for rows 97:114 (the values are coerced to NA).
I suppose that this is because of the periods separating billion/million/thousands in this column.
The mentioned conversion functions will work only if I have something like 3812819062.06 as in 115:120.
I tried mutating the dataset adding another column and modelling.
I don't really know if what I'm doing is fine, but I also tried extracting the anomalous numbers to a variable and applying sub/gsub on them, without success.
Is there some straightforward way of doing this, that is, instructing it to remove the first occurrences of '.' and then replace the comma with a '.'?
I'm fairly confident that the function I need is gsub, but I'm having a hard time finding the correct usage. Any help will be appreciated.
Edit: My approach using dplyr::mutate(). Ugly but works.
df <- df %>%
mutate(receita_temp = receita) %>%
mutate(dot_count = str_count(receita, '\\.')) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\.', '', as.factor(receita_temp)),
gsub('\\,', '.',as.factor(receita_temp))
)) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\,', '.',as.factor(receita_temp)),
receita_temp)) %>%
select(-c(dot_count, receita)) %>%
rename(., receita = receita_temp)
I'm using a regex and some stringr functions to remove all the periods except one that is followed by two or more digits and then the end of the string. That way, periods acting as separators, as in 3.811.786.009,24, are removed, but a period that starts the decimal part, as in 4821729985.03, is not. Using str_remove_all rather than str_remove means I don't have to remove the matches repeatedly or worry about how well it will scale. Then replace the remaining commas with periods, and make it numeric.
library(tidyverse)
df2 <- df %>%
mutate(receita = str_remove_all(receita, "\\.(?!\\d{2,}$)") %>%
str_replace_all(",", ".") %>%
as.numeric())
print(head(df2), digits = 12)
#> n mesAno receita
#> 1 97 1/2009 3812819062.06
#> 2 98 2/2009 4039362599.36
#> 3 99 3/2009 3652885587.18
#> 4 100 4/2009 3460247960.02
#> 5 101 5/2009 3465677403.12
#> 6 102 6/2009 3131903622.55
Created on 2018-09-04 by the reprex package (v0.2.0).
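To see how the negative lookahead behaves on individual values, here is a small base R sketch of the same pattern wrapped in a helper. The name parse_receita and the last two test values are made up for illustration and do not come from the question's data.

```r
parse_receita <- function(x) {
  x <- as.character(x)  # also works if the column is a factor
  # Drop every "." that is NOT followed by 2+ digits and the end of the string
  x <- gsub("\\.(?!\\d{2,}$)", "", x, perl = TRUE)
  # Whatever comma is left is the decimal mark
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

vals <- parse_receita(c("3.812.819.062,06", "3797656130.81", "1.234,5", "1.234"))
print(vals, digits = 12)
```

The last value shows a limitation of the pattern: a lone "1.234" is ambiguous, and this approach keeps the period, reading it as 1.234 rather than 1234.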
You can use the following:
first create a function that will be used for replacement:
repl = function(x)setNames(c("","."),c(".",","))[x]
This function takes in either "." or "," and returns "" or '.' respectively
Now use this function to replace
stringr::str_replace_all(as.character(df[,3]), "[.](?!\\d+$)|,", repl)
[1] "3812819062.06" "4039362599.36" "3652885587.18" "3460247960.02" "3465677403.12" "3131903622.55"
[7] "3204983361.46" "3811786009.24" "3180864095.05" "3352535553.88" "5214148756.95" "4491795201.50"
[13] "4333557619.30" "4808488277.86" "4039347179.81" "3867676530.69" "6356164873.94" "3961793391.19"
[19] "3797656130.81" "4709949715.37" "4047436592.12" "3923484635.28" "4821729985.03" "5024757038.22"
You can then do the rest yourself, i.e. calling as.numeric() and so on.
To do this in base R:
sub(',', '.', gsub('[.](?!\\d+$)', '', as.character(df[,3]), perl = TRUE))
Or, if you know the exact number of . and , in your data, you could do
a = as.character(df[,3])
regmatches(a,gregexpr('[.](?!\\d+$)|,',df[,3],perl = T)) = list(c("","","","."))
a
df$num <- as.numeric(sapply(as.character(si), function(x) gsub("\\,","\\.",ifelse(grepl("\\,", x), gsub("\\.","",x),x))))
should do the trick.
First, the function searches for rows with ",", removes "." in these rows, and last it converts all occurring "," into ".", so that it can be converted without problems to numeric.
Use print(df$num, digits = 12) to see your data with 2 decimals.

Replace specific characters in a variable in data frame in R

I want to replace every ",", "-", ")", "(" and " " (space) with "." in the variable DMA.NAME in the example data frame. I referred to three posts and tried their approaches, but all failed:
Replacing column values in data frame, not included in list
R replace all particular values in a data frame
Replace characters from a column of a data frame R
Approach 1
> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."
Approach 2
> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
argument 'pattern' has length > 1 and only the first element will be used
Approach 3
> c[c == c(" ", ",", "(", ")", "-")] <- "."
Sample data frame
> df
DMA.CODE DATE DMA.NAME count
111 22 8/14/2014 12:00:00 AM Columbus, OH 1
112 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
79 18 7/30/2014 12:00:00 AM Boston (Manchester) 1
99 22 8/20/2014 12:00:00 AM Columbus, OH 1
112.1 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
208 27 7/31/2014 12:00:00 AM Minneapolis-St. Paul 1
I know the problem: gsub uses only the first element of a pattern vector. The other two approaches compare each whole value against the listed characters instead of searching within the value for them.
You can use the special groups [:punct:] and [:space:] inside of a pattern group ([...]) like this:
df <- data.frame(
DMA.NAME = c(
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Boston (Manchester)",
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Minneapolis-St. Paul"),
stringsAsFactors = FALSE)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
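If you only want the five characters listed in the question replaced, rather than all punctuation and whitespace, an explicit bracket expression is one option. This is a sketch on three of the sample values. Note that the "-" is placed first inside the brackets so it is taken literally, and that without the "+" each character becomes its own ".".

```r
x <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")

# One "." per matched character
one_to_one <- gsub("[-,() ]", ".", x)
one_to_one
# [1] "Columbus..OH"         "Boston..Manchester."  "Minneapolis.St..Paul"

# Runs of matched characters collapsed into a single "."
merged <- gsub("[-,() ]+", ".", x)
merged
# [1] "Columbus.OH"          "Boston.Manchester."   "Minneapolis.St..Paul"
```

The existing "." in "St." is not in the set, so it is kept as-is in both versions.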
If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a given class with another. Here the character class is L, letters (inside {}), and the capital P (before {}) means we are looking for the complement of that set, i.e. every non-letter character (digits included). merge = TRUE means that consecutive matches are merged into a single one.
require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
And some benchmarks:
x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
gsub("[[:punct:][:space:]]+","\\.",x)
}
striFun <- function(x){
stri_replace_all_charclass(x, "\\P{L}",".", T)
}
require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
expr min lq median uq max neval
gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984 100
striFun(x) 877.259 893.3945 907.769 929.8065 3189.017 100
