I'm a beginner with R and am working with strings.
I've been trying to remove periods from my data, but I can't find a solution.
This is the data I'm working on in a dataframe df:
df <- read.table(text = " n mesAno receita
97 1/2009 3.812.819.062,06
98 2/2009 4.039.362.599,36
99 3/2009 3.652.885.587,18
100 4/2009 3.460.247.960,02
101 5/2009 3.465.677.403,12
102 6/2009 3.131.903.622,55
103 7/2009 3.204.983.361,46
104 8/2009 3.811.786.009,24
105 9/2009 3.180.864.095,05
106 10/2009 3.352.535.553,88
107 11/2009 5.214.148.756,95
108 12/2009 4.491.795.201,50
109 1/2010 4.333.557.619,30
110 2/2010 4.808.488.277,86
111 3/2010 4.039.347.179,81
112 4/2010 3.867.676.530,69
113 5/2010 6.356.164.873,94
114 6/2010 3.961.793.391,19
115 7/2010 3797656130.81
116 8/2010 4709949715.37
117 9/2010 4047436592.12
118 10/2010 3923484635.28
119 11/2010 4821729985.03
120 12/2010 5024757038.22",
header = TRUE,
stringsAsFactors = TRUE)
My objective is to convert the receita column to numeric, as it is currently stored as a factor. But applying conversion functions like as.numeric(as.factor(x)) does not work for rows 97:114 (the values are coerced to NA).
I suppose that this is because of the periods separating billion/million/thousands in this column.
The mentioned conversion functions will work only if I have something like 3812819062.06 as in 115:120.
I tried mutating the dataset adding another column and modelling.
I don't really know if what I'm doing is right, but I also tried extracting the anomalous numbers to a variable and applying sub/gsub to them, without success.
Is there some straightforward way of doing this, that is, instruct it to remove the 2 first occurrences of '.' and then replace the comma with a '.'?
I'm fairly confident that the function I need is gsub, but I'm having a hard time finding the correct usage. Any help will be appreciated.
Edit: My approach using dplyr::mutate(). Ugly but works.
df <- df %>%
mutate(receita_temp = receita) %>%
mutate(dot_count = str_count(receita, '\\.')) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\.', '', as.factor(receita_temp)),
gsub('\\,', '.',as.factor(receita_temp))
)) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\,', '.',as.factor(receita_temp)),
receita_temp)) %>%
select(-c(dot_count, receita)) %>%
rename(., receita = receita_temp)
I'm using a regex and some stringr functions to remove all the periods except those followed by two or more digits and the end of the string. That way, periods used as separators, as in 3.811.786.009,24, are removed, but periods marking the start of the decimal part, as in 4821729985.03, are not. Using str_remove_all rather than str_remove means I don't have to remove matches repeatedly or worry about how well that would scale. Then I replace the remaining commas with periods and convert the result to numeric.
library(tidyverse)
df2 <- df %>%
mutate(receita = str_remove_all(receita, "\\.(?!\\d{2,}$)") %>%
str_replace_all(",", ".") %>%
as.numeric())
print(head(df2), digits = 12)
#> n mesAno receita
#> 1 97 1/2009 3812819062.06
#> 2 98 2/2009 4039362599.36
#> 3 99 3/2009 3652885587.18
#> 4 100 4/2009 3460247960.02
#> 5 101 5/2009 3465677403.12
#> 6 102 6/2009 3131903622.55
Created on 2018-09-04 by the reprex package (v0.2.0).
You can use the following:
first create a function that will be used for replacement:
repl = function(x)setNames(c("","."),c(".",","))[x]
This function takes in either "." or "," and returns "" or '.' respectively
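As a quick check of what the helper does, the sketch below recreates repl and calls it on both characters. The unname() call is only there to drop the names for display and is not part of the original answer.

```r
# Same helper as above: a named lookup vector indexed by the matched character
repl <- function(x) setNames(c("", "."), c(".", ","))[x]

# "." maps to the empty string, "," maps to "."
res <- unname(repl(c(".", ",")))
print(res)
```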
Now use this function to replace
stringr::str_replace_all(as.character(df[,3]), "[.](?!\\d+$)|,", repl)
[1] "3812819062.06" "4039362599.36" "3652885587.18" "3460247960.02" "3465677403.12" "3131903622.55"
[7] "3204983361.46" "3811786009.24" "3180864095.05" "3352535553.88" "5214148756.95" "4491795201.50"
[13] "4333557619.30" "4808488277.86" "4039347179.81" "3867676530.69" "6356164873.94" "3961793391.19"
[19] "3797656130.81" "4709949715.37" "4047436592.12" "3923484635.28" "4821729985.03" "5024757038.22"
Of course you can do the rest, i.e. calling as.numeric() etc.
To do this in base R:
sub(',','.',gsub('[.](?!\\d+$)','',as.character(df[,3]),perl=T))
Or, if you know the exact number of . and , in your data, you could do:
a = as.character(df[,3])
regmatches(a,gregexpr('[.](?!\\d+$)|,',df[,3],perl = T)) = list(c("","","","."))
a
df$num <- as.numeric(sapply(as.character(df$receita), function(x) gsub("\\,","\\.",ifelse(grepl("\\,", x), gsub("\\.","",x),x))))
should do the trick.
First, the function searches for rows with ",", removes "." in these rows, and last it converts all occurring "," into ".", so that it can be converted without problems to numeric.
Use print(df$num, digits = 12) to see your data with 2 decimals.
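The same two steps can also be written without sapply, since grepl, gsub and sub are already vectorised. This is only a sketch of that idea: the helper name to_numeric is made up for illustration, and the two sample values are taken from the question's data.

```r
# Hypothetical helper: strip "." only from values that contain a decimal comma,
# then turn the comma into a decimal point and convert to numeric.
to_numeric <- function(x) {
  x <- as.character(x)                              # factors -> character first
  has_comma <- grepl(",", x, fixed = TRUE)          # rows in the 3.812.819.062,06 style
  x[has_comma] <- gsub(".", "", x[has_comma], fixed = TRUE)
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

receita <- factor(c("3.812.819.062,06", "3797656130.81"))
num <- to_numeric(receita)
print(num, digits = 12)
```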
I need to extract the mg quantity from data that looks like this:
(100) x 10mg zepose valium ..(cipla in strips)
-- 20x2mg -- diclazepam
(10) clonazepam 2mg / roche rivotril
...
In R, I use this regex to remove all text after "mg":
dataset$quantity <- gsub('mg.+?$','mg',dataset$quantity)
The output is like this:
(100) x 10mg
-- 20x2mg
(10) clonazepam 2mg
How can I remove the text before 'mg' while also keeping the quantity? The range is from 1mg to 200mg, so from single to three digits.
Sometimes there's whitespace immediately before the mg digit(s), but not always. One pattern, however, is that there's never a number immediately before the mg quantity (unless separated by whitespace).
Based on my limited understanding of regex I'm thus looking for a code that can delete all characters before 1-3 digits and 'mg'. I've looked around and can't find what I need.
Edit:
My desired output is:
10mg
2mg
2mg
Please ignore that the text refers to 1000mg (100 x 10mg), 40mg (20 x 2mg) etc. I will have to do those calculations by hand I think.
With gsub/sub (Doesn't matter which one in this case since you only have one mg per row):
dataset$quantity <- gsub('.+?(\\d+\\s?mg).+', '\\1', dataset$quantity)
or with str_extract from stringr:
library(stringr)
dataset$quantity = str_extract(dataset$quantity, "\\d+\\s?mg")
Result:
quantity
1 10mg
2 2mg
3 2mg
Notes:
.+? matches any character one or more times, lazily (as few characters as possible).
(\\d+\\s?mg) is a capture group that matches a digit one or more times followed by a space zero or one times followed by the literal "mg".
\\1 in gsub/sub replaces the pattern with whatever is in the first capture group, in this case (\\d+\\s?mg). So the gsub/sub solution effectively removes everything except <digits>[space]mg.
str_extract is a different approach, which extracts a pattern, instead of replacing. In this case, I am extracting \\d+\\s?mg directly.
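One practical difference between the two approaches shows up when a row contains no mg quantity: sub returns the input unchanged, whereas str_extract returns NA. The base R sketch below illustrates this with an invented second row. The regexpr/regmatches part only emulates extraction and is not how stringr is implemented, and perl = TRUE is added here as an assumption to pin down the regex engine.

```r
x <- c("-- 20x2mg -- diclazepam", "no dose listed")  # second row is made up

# Replacement approach: rows without a match come back unchanged
replaced <- sub(".+?(\\d+\\s?mg).+", "\\1", x, perl = TRUE)

# Extraction approach: rows without a match become NA
m <- regexpr("\\d+\\s?mg", x, perl = TRUE)   # -1 where nothing matched
extracted <- rep(NA_character_, length(x))
extracted[m > 0] <- regmatches(x, m)

print(replaced)
print(extracted)
```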
Data:
dataset = structure(list(quantity = c("(100) x 10mg zepose valium ..(cipla in strips)",
"-- 20x2mg -- diclazepam", "(10) clonazepam 2mg / roche rivotril"
)), class = "data.frame", row.names = c(NA, -3L), .Names = "quantity")
1) sub Match everything before 1-3 digits followed by mg followed by anything else and replace that with the match to the capture group (parenthesized portion) consisting of the digits and mg:
dat <- c("(100) x 10mg zepose valium ..(cipla in strips)",
"-- 20x2mg -- diclazepam",
"(10) clonazepam 2mg / roche rivotril")
sub(".*?(\\d{1,3}mg).*", "\\1", dat)
## [1] "10mg" "2mg" "2mg"
If you don't want to return the mg part then put the right parenthesis before mg instead of after it.
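As a sketch of that variant, moving the parenthesis leaves only the digits in the capture group, so the result can be converted straight to numeric. perl = TRUE is an addition here to make the lazy quantifier's behaviour explicit.

```r
dat <- c("(100) x 10mg zepose valium ..(cipla in strips)",
         "-- 20x2mg -- diclazepam",
         "(10) clonazepam 2mg / roche rivotril")

# Capture group now ends before "mg", so only the digits are kept
mg <- as.numeric(sub(".*?(\\d{1,3})mg.*", "\\1", dat, perl = TRUE))
print(mg)
```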
2) strcapture Another possibility is
strcapture("(\\d{1,3}mg)", dat, data.frame(mg = character(0)))
giving this data frame:
mg
1 10mg
2 2mg
3 2mg
Update: There was an update to the question regarding calculating the quantity times the mg:
DF <- strcapture("(\\d+)\\D+(\\d+)", dat, data.frame(qty = numeric(0), mg = numeric(0)))
transform(DF, total = qty * mg, desc = sub(".*mg *", "", dat))
giving:
qty mg total desc
1 100 10 1000 zepose valium ..(cipla in strips)
2 20 2 40 -- diclazepam
3 10 2 20 / roche rivotril
I have a text field with different types of date formats and its IDs. I need to extract all the date strings using regex.
df <- data.frame(id=1:8,text=c("deficit based on wage statement 7/14/ to 7/17/2015.",
"Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2.",
"Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;Deficit due to wage,",
"statement: 4/22/15 thru 5/12/15,depos transcript 7/10/15 for 7/8/15 depos,",
"difference owed for 4/25/15-5/22/15 10-29-99 Feb. 25, 2009,",
"tpd 4:30:2015 - 5:22:2015--09/26/99, 7-14 1.3.99, 1.3.1999,",
"Medical TREATMENT DATES: 6/30/2015 - 30/06/2015 09/26/1999,",
"4/25/15-5/22/15,Medical 2010-01-29 **2010-01-30 February25,2009, February 25, 2009"))
So far, I have created regex using multiple OR statements.
#Different string patterns
#all day formats
day<-c(31:1,"01","02","03","04","05","06","07","08","09")
day_p<-paste(day,collapse = "|")
day_p <- paste0("(",day_p,")")
#all month formats
month<-c(12:1,"01","02","03","04","05","06","07","08","09")
month_p<-paste(month,collapse="|")
month_p <- paste0("(",month_p,")")
#all year 4 digit formats
year<-"\\d{4}"
year_p<-paste(year,collapse="|")
year_p <- paste0("(",year_p,")")
#all year 2 digit formats
year_i<-"\\d{2}"
year_i_p<-paste(year_i,collapse="|")
year_i_p <- paste0("(",year_i_p,")")
#all separator symbols
symbol_p<-paste(c("\\.","\\|","\\/","\\-","\\:","\\,"),collapse="|")
symbol_p <- paste0("(",symbol_p,")")
patterns<-paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_p,")","|",
"(",year_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,symbol_p,day_p,symbol_p,year_i_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_i_p,")","|",
"(",year_i_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,"\\-",day_p,")","|",
"(",day_p,"\\-",month_p,")","|",
"(",month_p,"\\/",day_p,")","|",
"(",day_p,"\\/",month_p,")")
#String extraction
extract= str_extract_all(df$text,patterns)
Is there an approach to put all the regex rules in a data frame, name each rule and do a string extraction?
#regex patterns in a data frame
df_patterns<-data.frame(pattern=c(paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")"),
paste0("(",day_p,symbol_p,month_p,symbol_p,year_p,")")),
rule=c(1,2))
The output data frame should include the extraction values and the rule which triggered its extraction.
#output data frame
output<-data.frame(id=c(1,1,2,3,3),string=c("7/14","7/17/2015",NA,"7/14/15","10/5/15"),rule=c(9,1,NA,2,3))
stringr has a function called str_match_all that can extract all matches as well as return the capture groups that matched in separate columns. This is convenient for this question since you can name the capture groups and associate them to each column of output from str_match_all:
#Different string patterns
#all day formats
day_p <- "[0-3]?[0-9]"
#all month formats
month_p <- "[0-1]?[0-9]"
#all year 4 digit formats
year_p <- "\\d{4}"
#all year 2 digit formats
year_i_p <- "\\d{2}"
#all separator symbols
symbol_p <- "[-/:.]"
# Patterns to match structured as combination of capture groups
patterns<-paste0("(",month_p,symbol_p,day_p,symbol_p,year_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_p,")","|",
"(",year_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,symbol_p,day_p,symbol_p,year_i_p,")","|",
"(",day_p,symbol_p,month_p,symbol_p,year_i_p,")","|",
"(",year_i_p,symbol_p,month_p,symbol_p,day_p,")","|",
"(",month_p,"[-]",day_p,")","|",
"(",day_p,"[-]",month_p,")","|",
"(",month_p,"[/]",day_p,")","|",
"(",day_p,"[/]",month_p,")","|",
"(", "\\w+[.]?[\\s]?\\d+[,]\\s?",year_p,")")
# Name the capture groups
rule_names = c("MDYYYY", "DMYYYY",
"YYYYMD", "MDYY",
"DMYY", "YYMD",
"MD_dash", "DM_dash",
"MD_slash", "DM_slash",
"MDYYYY_word")
library(stringr)
library(dplyr)
library(tidyr)
library(purrr)
df$text %>%
str_match_all(patterns) %>%
map2(df$id, function(x, y){
if(nrow(x) == 0){
x = rbind(x, NA)
}
data.frame(id = y, x)
}) %>%
do.call(rbind, .) %>%
mutate_at(vars(X2:X12), funs(ifelse(!is.na(.), 1, NA))) %>%
setNames(c("id", "string", rule_names)) %>%
gather(rule, value, -id, -string) %>%
na.omit() %>%
select(-value) %>%
arrange(id)
Notes:
This final part does all the work. str_match_all returns a list with each element a character matrix of matches and capture groups for each df$text value.
map2 binds the id's with the character matrices, so that each row refers to an id + match combination. The if statement checks if an element has no match and rbinds an NA value if it is the case. This allows id to have at least one row to bind to.
mutate_at converts each of the "capture_group" columns to dummy variables indicating whether "this capture group has a match"
Rename capture group columns with rule_names and transform all dummy into one single categorical variable.
An important note: there is no way of knowing whether "5/6/2015" is in MDYYYY or DMYYYY format, so in this case you will have to order the patterns so that one of them takes precedence (e.g. if MDYYYY comes before DMYYYY in patterns, MDYYYY will match first for "5/6/2015").
Result:
id string rule
1 1 7/17/2015 MDYYYY
2 1 7/14 MD_slash
3 3 7/14/15 MDYY
4 3 10/5/15 MDYY
5 4 4/22/15 MDYY
6 4 5/12/15 MDYY
7 4 7/10/15 MDYY
8 4 7/8/15 MDYY
9 5 4/25/15 MDYY
10 5 5/22/15 MDYY
11 5 10-29-99 MDYY
12 5 Feb. 25, 2009 MDYYYY_word
13 6 4:30:2015 MDYYYY
14 6 5:22:2015 MDYYYY
15 6 1.3.1999 MDYYYY
16 6 09/26/99 MDYY
17 6 1.3.99 MDYY
18 6 7-14 MD_dash
19 7 6/30/2015 MDYYYY
20 7 09/26/1999 MDYYYY
21 7 30/06/2015 DMYYYY
22 8 2010-01-29 YYYYMD
23 8 2010-01-30 YYYYMD
24 8 4/25/15 MDYY
25 8 5/22/15 MDYY
26 8 February25,2009 MDYYYY_word
27 8 February 25, 2009 MDYYYY_word
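A different way to keep the rules in a data frame, as the question asks, is to apply each rule on its own and stack the results. This is a base R sketch under stated assumptions and not the approach used above: the two patterns, the texts data frame and the extract_rule helper are simplified, made-up examples, and because each rule runs independently, a string matched by several rules would be reported once per rule.

```r
# Rules table: one row per named pattern (simplified, illustrative patterns)
rules <- data.frame(
  rule    = c("MDYYYY", "YYYYMD"),
  pattern = c("\\b[01]?[0-9]/[0-3]?[0-9]/\\d{4}\\b",
              "\\b\\d{4}-[01]?[0-9]-[0-3]?[0-9]\\b"),
  stringsAsFactors = FALSE
)

texts <- data.frame(id = 1:2,
                    text = c("paid 7/17/2015 and 2010-01-29", "no dates"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: all matches of one rule, tagged with id and rule name
extract_rule <- function(rule, pattern) {
  hits <- regmatches(texts$text, gregexpr(pattern, texts$text, perl = TRUE))
  n <- lengths(hits)
  data.frame(id = rep(texts$id, n),
             string = as.character(unlist(hits)),
             rule = rep(rule, sum(n)),
             stringsAsFactors = FALSE)
}

out <- do.call(rbind, Map(extract_rule, rules$rule, rules$pattern))
rownames(out) <- NULL
print(out)
```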
Answer
Brief
Correct me if I'm wrong, but I believe R does support PCRE regex (base R functions such as gsub and gregexpr accept perl = TRUE). That being the case, you can use the following regex to catch any date in the formats you specified.
Code
See this regex in use here
(?(DEFINE)
(?# Definitions )
(?<day>[12]\d|3[01]|0?[1-9])
(?<month>1[0-2]|0?[1-9])
(?<year>\d+)
(?<separator>[.|\/:,-])
(?# Date formats )
(?<mdy>(?&month)(?<mdy_1>(?&separator))(?&day)(?&mdy_1)(?&year))
(?<dmy>(?&day)(?<dmy_1>(?&separator))(?&month)(?&dmy_1)(?&year))
(?<ymd>(?&year)(?<ymd_1>(?&separator))(?&month)(?&ymd_1)(?&day))
(?<md>(?&month)(?<md_1>(?&separator))(?&day)(?&md_1)?)
(?<dm>(?&day)(?<dm_1>(?&separator))(?&month)(?&dm_1)?)
(?# Date )
(?<date>(?&mdy)|(?&dmy)|(?&ymd)|(?&md)|(?&dm))
)
(?<=\b|\s)(?&date)(?=\b|\s)
Explanation
The define block specifies all our definitions for what constitutes a day, month, year, separator. It also defines our date formats (mdy, dmy, ymd, md, dm). Finally it defines our date group which is a simple OR between all our date formats.
The final regex requires the date to be preceded and followed by a word boundary (\b) or a whitespace character (\s). The whitespace alternative is there for dates whose last character is a separator, which is not a word character: it lets the match include that final character (you can test this with the first match, 7/14/, by removing the |\s from the final regex to see the result).
Please note that this assumes the days of a month can go to 31 (a more specific check would result in a very lengthy regex and seems pointless when you can validate it through code).
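To try the DEFINE idea from R itself, the pattern has to go through the PCRE engine, which base R selects with perl = TRUE. The sketch below is a cut-down version with only day and month definitions and a single slash-separated month/day/year format; it is an illustration under those assumptions, not the full regex above.

```r
# Reduced pattern: definitions first, then the actual match using (?&name) calls
pat <- paste0(
  "(?(DEFINE)",
  "(?<day>[12]\\d|3[01]|0?[1-9])",
  "(?<month>1[0-2]|0?[1-9])",
  ")",
  "\\b(?&month)/(?&day)/\\d{4}\\b"
)

x <- "Medical TREATMENT DATES: 6/30/2015 - 30/06/2015 09/26/1999,"

# perl = TRUE is required; the default engine does not understand DEFINE
hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
print(hits)
```

Because this reduced pattern only knows the month-first order, 30/06/2015 is not reported; the full regex handles it through its dmy alternative.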
Results
Input
deficit based on wage statement 7/14/ to 7/17/2015.
Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2.
Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;Deficit due to wage,
statement: 4/22/15 thru 5/12/15,depos transcript 7/10/15 for 7/8/15 depos,
difference owed for 4/25/15-5/22/15 10-29-99 Feb. 25, 2009,
tpd 4:30:2015 - 5:22:2015--09/26/99, 7-14 1.3.99, 1.3.1999,
Medical TREATMENT DATES: 6/30/2015 - 30/06/2015 09/26/1999,
4/25/15-5/22/15,Medical 2010-01-29 **2010-01-30 February25,2009, February 25, 2009
Output
7/14/
7/17/2015
7/14/15
10/5/15
4/22/15
5/12/15
7/10/15
7/8/15
4/25/15
5/22/15
10-29-99
4:30:2015
5:22:2015
09/26/99
7-14
1.3.99
1.3.1999
6/30/2015
30/06/2015
09/26/1999
4/25/15
5/22/15
2010-01-29
2010-01-30
Edits
Code
See this code in use here
(?(DEFINE)
(?# Definitions )
(?<day>[12]\d|3[01]|0?[1-9])
(?<month>1[0-2]|0?[1-9])
(?<year>\d+)
(?<separator>[.|\/:,-])
(?# Date formats )
(?<f_mdy>(?&month)(?<mdy_1>(?&separator))(?&day)(?&mdy_1)(?&year))
(?<f_dmy>(?&day)(?<dmy_1>(?&separator))(?&month)(?&dmy_1)(?&year))
(?<f_ymd>(?&year)(?<ymd_1>(?&separator))(?&month)(?&ymd_1)(?&day))
(?<f_md>(?&month)(?<md_1>(?&separator))(?&day)(?&md_1)?)
(?<f_dm>(?&day)(?<dm_1>(?&separator))(?&month)(?&dm_1)?)
(?<f_Mdy>(?:jan(?:uary|\.)?|feb(?:ruary|\.)?|mar(?:ch|\.)?|apr(?:il|\.)?|may|jun(?:e|\.)?|jul(?:y|\.)?|aug(?:ust|\.)?|sep(?:tember|\.)?|oct(?:ober|\.)?|nov(?:ember|\.)?|dec(?:ember|\.)?)\s*(?&day)(?:\s*(?&separator)|(?&separator)\s*|\s+)(?&year))
)
(?<=\b|\s)(?:(?<mdy>(?&f_mdy))|(?<dmy>(?&f_dmy))|(?<ymd>(?&f_ymd))|(?<md>(?&f_md))|(?<dm>(?&f_dm))|(?<Mdy>(?&f_Mdy)))(?=\b|\s)
This will set captures into named capture groups. If you look at the output in the link, you'll see named groups with the content it matched.
I want to replace all ,, -, ), ( and (space) with . from the variable DMA.NAME in the example data frame. I referred to three posts and tried their approaches, but all of them failed:
Replacing column values in data frame, not included in list
R replace all particular values in a data frame
Replace characters from a column of a data frame R
Approach 1
> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."
Approach 2
> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
argument 'pattern' has length > 1 and only the first element will be used
Approach 3
> c[c == c(" ", ",", "(", ")", "-")] <- "."
Sample data frame
> df
DMA.CODE DATE DMA.NAME count
111 22 8/14/2014 12:00:00 AM Columbus, OH 1
112 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
79 18 7/30/2014 12:00:00 AM Boston (Manchester) 1
99 22 8/20/2014 12:00:00 AM Columbus, OH 1
112.1 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
208 27 7/31/2014 12:00:00 AM Minneapolis-St. Paul 1
I know the problem: gsub takes a single pattern and uses only the first element of the vector. The other two approaches compare each whole value against the listed characters instead of searching within each value for those characters.
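The difference between matching whole values and searching inside them can be seen in a small sketch. The two-element vector x is made up for illustration; the second element is a value that consists of nothing but a comma.

```r
x <- c("Columbus, OH", ",")

# %in% compares each whole value with the set, so only the bare "," is TRUE
whole <- x %in% c("-", ",", " ", "(", ")")

# grepl searches inside each value, so both elements are TRUE
within <- grepl("[-,() ]", x)

print(whole)
print(within)
```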
You can use the POSIX character classes [:punct:] and [:space:] inside a bracket expression ([...]) like this:
df <- data.frame(
DMA.NAME = c(
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Boston (Manchester)",
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Minneapolis-St. Paul"),
stringsAsFactors=F)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
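If only the five characters from the question should be replaced, the removetext vector can be collapsed into a single bracket expression, which avoids the "pattern has length > 1" warning. This is a sketch that assumes the hyphen stays first in the vector, since a hyphen is only literal at the start or end of a bracket expression. No + is used, so consecutive characters are not merged.

```r
removetext <- c("-", ",", " ", "(", ")")

# Collapse the vector into one pattern: "[-, ()]"
pattern <- paste0("[", paste(removetext, collapse = ""), "]")

x <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")
out <- gsub(pattern, ".", x)
print(out)
```

Unlike [:punct:], this leaves the existing period in "St." untouched, so the last value ends up with two periods in a row.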
If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a specific class with another. In this case the character class is L (letters, inside {}), but the uppercase P (before {}) indicates that we want the complement of this set, that is, every non-letter character (note that this also covers digits, unlike the gsub pattern). merge indicates that consecutive matches should be merged into a single one.
require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
And some benchmarks:
x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
gsub("[[:punct:][:space:]]+","\\.",x)
}
striFun <- function(x){
stri_replace_all_charclass(x, "\\P{L}",".", T)
}
require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
expr min lq median uq max neval
gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984 100
striFun(x) 877.259 893.3945 907.769 929.8065 3189.017 100