Tidying financial data with mixed decimal and grouping separators - r

Context
I need to clean financial data with mixed formats. The data has been punched in manually by different departments, some of them using "." as the decimal separator and "," as the grouping separator (e.g. US notation: $1,000,000.00), while others use "," as the decimal separator and "." as the grouping separator (e.g. the notation used in certain European countries: $1.000.000,00).
Input:
Here's a fictional example set:
df <- data.frame(
  Y2019 = c("17.530.000,03", "28000000.05", "256.000,23", "23,000",
            "256.355.855", "2565467,566", "225,453.126")
)
Y2019
1 17.530.000,03
2 28000000.05
3 256.000,23
4 23,000
5 256.355.855
6 2565467,566
7 225,453.126
Desired result:
Y2019
1 17530000.03
2 28000000.05
3 256000.23
4 23000.00
5 256355855.00
6 2565467.566
7 225453.126
My attempt:
I got pretty close by treating the first occurrence (counting from the right) of "," or "." as the decimal separator and replacing the other occurrences accordingly. However, some entries have no decimals (e.g. entries 4 and 5) or have a variable number of decimal places, which makes this strategy less useful.
Any input is greatly appreciated!
Edit:
As per request, I salvaged some of the code from my original attempt. I am sure it could be written a lot more cleanly.
library(dplyr)
library(stringr)

df %>%
  mutate(Y2019r = ifelse(str_length(Y2019) - str_locate(Y2019, ",")[, 1] == 2,
                         gsub("\\.", "", Y2019), NA)) %>%
  mutate(Y2019r = ifelse(is.na(Y2019r) & str_length(Y2019) - str_locate(Y2019, "\\.")[, 1] == 2,
                         gsub("\\.", ",", Y2019), Y2019r)) %>%
  mutate(Y2019r = gsub(",", ".", Y2019r))
Y2019 Y2019r
1 17.530.000,03 17530000.03
2 28000000.05 28000000.05
3 256.000,23 256000.23
4 23,000 <NA>
5 256.355.855 <NA>
6 2565467,566 <NA>
7 225,453.126 <NA>

Here's a functional approach to build up the logic needed to parse the strings you might come across. I suppose it is built up from thinking about how we parse these strings when we read them, and trying to emulate that.
I think the key is realising that all we really need to know is whether the value after the last delimiter is a decimal portion or not. If we could somehow label each string as having a terminal decimal portion or not, it would then be easy to parse the strings.
The following method involves splitting the character strings at the points and commas and trying to label them as having a terminal decimal or not. The split strings will be held as a list of string vectors, with each vector being composed of the "chunks" of digits between the delimiters.
First we will write two helper functions to create the final numbers from the string vectors once we have correctly labeled them as having a terminal decimal portion or not:
last_element_is_decimal <- function(x) {
  as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}

last_element_is_whole <- function(x) {
  as.numeric(paste0(x, collapse = ""))
}
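For example, here is how the helpers behave on the chunk vectors that entries 1 and 5 of the sample data split into (illustrative calls, not part of the final pipeline):
last_element_is_decimal(c("17", "530", "000", "03"))  # returns 17530000.03
last_element_is_whole(c("256", "355", "855"))         # returns 256355855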
It will be easy to decide what to do in the event of no delimiters, since we assume these are just whole numbers. Similarly, it is easy to see that any numbers containing both a comma and a point (in either order) must have a terminal decimal component.
However, it is less obvious what to do when there is only a single type of delimiter; in these cases we have to use the lengths of the digit chunks to decide. If any chunk is longer than three digits, then a thousands separator isn't in use, and the presence of a delimiter indicates we have a decimal component. If the terminal chunk contains fewer than three digits then we must also have a decimal. In all other cases, we assume a whole number.
This says the same thing in code:
decide_last_element <- function(x) {
  if (max(nchar(x)) > 3)
    return(last_element_is_decimal(x))
  if (nchar(x[length(x)]) < 3)
    return(last_element_is_decimal(x))
  return(last_element_is_whole(x))
}
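A couple of illustrative calls, using the chunk vectors that entries 4 and 6 of the sample data split into:
decide_last_element(c("23", "000"))       # no chunk longer than 3 digits, last chunk has 3 -> whole: 23000
decide_last_element(c("2565467", "566"))  # a chunk longer than 3 digits -> decimal: 2565467.566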
Now we can write our main function. It takes our strings as input and classifies each string as having two types of delimiter, one type of delimiter, or no delimiter. Then we can apply the functions above with sapply accordingly.
parse_money <- function(money_strings) {
  any_comma <- grepl(",", money_strings)
  any_point <- grepl("[.]", money_strings)

  both    <- any_comma & any_point
  neither <- !any_comma & !any_point
  single  <- (any_comma & !any_point) | (any_point & !any_comma)

  digit_groups <- strsplit(money_strings, "[.]|,")

  values <- rep(0, length(money_strings))
  values[neither] <- as.numeric(money_strings[neither])
  values[both]    <- sapply(digit_groups[both], last_element_is_decimal)
  values[single]  <- sapply(digit_groups[single], decide_last_element)

  return(format(round(values, 2), nsmall = 2))
}
So now we can just do
parse_money(df$Y2019)
#> [1] " 17530000.03" " 28000000.05" " 256000.23" " 23000.00" "256355855.00"
#> [6] " 2565467.57" " 225453.13"
Note I have output as strings so that rounding inaccuracies in the console output aren't ascribed to mistakes in the code.
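To attach the cleaned values back to the data frame, you can just assign the result (a usage sketch; Y2019_clean is an illustrative column name, and you can wrap the call in as.numeric if you want numbers rather than formatted strings):
df$Y2019_clean <- parse_money(df$Y2019)  # Y2019_clean is an illustrative name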

Related

how do I filter dataset based on "Version" column containing _________.000 decimal?

I have a dataset where I am trying to filter based on 3 different columns.
I have the 2 columns that have character values figured out by doing:
filter(TRANSACTION_TYPE != "ABC", CUSTOMER_CODE == "123"). However, I have a "VERSION" column where there will be multiple versions for each customer, which then duplicates my $ amount.
I want to filter on only the VERSION values that contain ".000" as the decimal, since .000 represents the final and most accurate version. For example, VERSION can be 20220901.000, 20220901.002, 20220901.003, etc. However, the numbers before the decimal will always change, so I can't filter on them equaling 20220901 as it changes by day.
I hope I was clear enough, thank you!
Sample data:
quux <- data.frame(VERS_chr = c("20220901.000", "20220901.002", "20220901.000", "20220901.002"),
                   VERS_num = c(20220901.000, 20220901.002, 20220901.000, 20220901.002))
If is.character(quux$VERS_chr) is true in your data (i.e. the version column is stored as character), then
dplyr::filter(quux, grepl("\\.000$", VERS_chr))
# VERS_chr VERS_num
# 1 20220901.000 20220901
# 2 20220901.000 20220901
Explanation:
"\\.000$" matches the literal period . (it needs to be escaped since it's a regex reserved symbol) followed by three literal zeroes 000, at the end of string ($). See https://stackoverflow.com/a/22944075/3358272 for more info on regex.
If it is false (and it is not a factor), then
dplyr::filter(quux, abs(VERS_num %% 1) < 1e-3)
# VERS_chr VERS_num
# 1 20220901.000 20220901
# 2 20220901.000 20220901
Explanation:
abs(.) < 1e-3 is defensive against high-precision tests of equality, where floating-point limitations (in computers in general) mean that a number very close to zero is not always seen as exactly zero. See Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.
. %% 1 applies the modulus operator, reducing a number down to its fractional component.
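Putting it together with the other filters from the question (a sketch; it assumes the character case and uses the question's column names on a hypothetical data frame df):
dplyr::filter(df,
              TRANSACTION_TYPE != "ABC",       # column names as given in the question
              CUSTOMER_CODE == "123",
              grepl("\\.000$", VERSION))       # keep only the final .000 versions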

Converting 7 or 8 digit numbers to dates in R

I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).
#8 digit date (1985-03-21):
# 21031985
#7 digit date (1985-03-01):
# 1031985
I cannot see any way to specify this type of format using col_date(format = ) as one normally would. It is easy to make a function that converts these 7/8 digit numbers into dates, but doing that means materialising the imported data and removes the speed advantage that vroom provides.
I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.
Thanks very much for any help here.
Those formats are horrible in general, but regardless I expect nothing in readr is going to work right for you because of the 1- or 2-digit day-of-month. I suggest reading that column in as col_character, then post-processing it with
vec <- c("21031985", "1031985")
as.Date(paste0(strrep("0", pmax(8 - nchar(vec), 0)), vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
Quick walk-through:
8 - nchar(vec) tells us how many 0s need to be padded to the left of each string. In this case, it should be 0 and 1, respectively. This might be a problem if you have length-6 strings; only you know if that's an issue.
strrep("0", ..) repeats the 0 string as many times as we need, including strrep("0", 0) producing "" (no zeroes).
pmax(.., 0) is the defensive programmer: if there's a length-9 string in there, we cannot do strrep("0", -1), so this keeps the count from going negative.
paste0(.., vec) to do the actual padding.
From there, all strings should be normalized and able to be converted using "%d%m%Y".
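If stringr is available, the same zero-padding can be written more compactly (an equivalent alternative, not part of the original suggestion):
# str_pad pads on the left by default and leaves longer strings untouched
as.Date(stringr::str_pad(vec, width = 8, pad = "0"), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"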
vroom can take a pipe as input. That means you can use a tool like awk to fix the format (e.g. make it always 8 digits, which is easy with sprintf). That way you still benefit from vroom streaming the file. You could even use R, but if you are after performance you need something that processes the file as a stream and is lightweight.
I used a test file test.csv:
id,date,text
1,1022020,some
2,12042020,more
3,2012020,text
I could read it via the following (of course the awk call needs to be adjusted for your data, but essentially: $2 refers to the 2nd column to adjust, and -F ',' specifies the separator):
library(vroom)
vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"%08d\",$2);print;}' test.csv"),
      col_types = cols(date = col_date(format = '%d%m%Y')))
giving
# A tibble: 3 × 3
id date text
<int> <date> <chr>
1 1 2020-02-01 some
2 2 2020-04-12 more
3 3 2020-01-02 text
If you have integer data you can left-pad the lost zeroes back on with sprintf (note that "%08d" needs an integer vector, not the character vec from above):
vec_int <- c(21031985L, 1031985L)
as.Date(sprintf("%08d", vec_int), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"

Splitting a column in a dataframe in R into two based on content

I have a column in an R dataframe that holds a product weight, e.g. 20 kg, but it mixes measurement systems, e.g. 1 lbs and 2 kg. I want to separate the value from the unit, put them in separate columns, and then convert them in a new column to a standard weight. Any thoughts on how I might achieve that? Thanks in advance.
Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame.
Note: when using as.numeric, the decimal point has to be a dot. If you have commas instead, you need to replace them with a dot first, for example via gsub(",", ".", ...).
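Putting the pieces together with a conversion to a common unit (a sketch; the 0.45359237 kg-per-pound factor is standard, and the column names are illustrative):
kg <- ifelse(units == "lbs", numbers * 0.45359237, numbers)  # convert pounds to kilograms
data.frame(value = numbers, unit = units, weight_kg = kg)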
tidyr::separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that I could get away with a single space as the separator, but I learned you can also use a regular expression here for more complex separator considerations.
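For completeness, a tidyverse sketch of the same idea (assuming a data frame df with a character column weight holding strings like "20 kg"; the weight_kg name and conversion factor are illustrative):
library(dplyr)
library(tidyr)
df %>%
  separate(weight, into = c("Value", "Metric"), sep = " ", convert = TRUE) %>%  # convert = TRUE coerces the numeric part
  mutate(weight_kg = ifelse(Metric == "lbs", Value * 0.45359237, Value))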

number some patterns in the string using R

I have a string and it has some patterns like this:
my_string <- "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] label followed by a run of [`number.num~] values, repeated.
I want to count how many [`number.num~] values fall between each [`nonnumber] label.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D", my_string)
regmatches(my_string, index)
but with this code the matches overlap at the trailing [`\D], so it can't count how many times the pattern occurs.
If you know any method for this, please leave a reply.
Using strsplit: we split at the backtick, coerce the pieces to numeric, and look at which positions yield NA (the non-numeric labels); the position differences between those NAs count the numbers in between. Note that we need to exclude the first element after strsplit (the string starts with a backtick) and append an NA at the end of the numeric vector as a sentinel. The result is a vector named with the non-numeric elements using setNames (not very good names actually, but it demonstrates what's going on).
s <- strsplit(my_string, "`", fixed = TRUE)[[1]][-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
         s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
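Wrapped up as a reusable function (same logic; the function name is just illustrative):
count_values <- function(string) {
  s <- strsplit(string, "`", fixed = TRUE)[[1]][-1]  # chunks between backticks, dropping the leading empty piece
  s.num <- suppressWarnings(as.numeric(s))           # non-numeric labels become NA
  setNames(diff(which(is.na(c(s.num, NA)))) - 1,     # gaps between labels, minus the label itself
           s[is.na(s.num)])
}
count_values(my_string)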

Split character string into unequal segments R

I have some data that I need to split into multiple elements, but there is no specific delimiter within the row to split on. I know the positions of the different variables; is there a way I can split the string into multiple uneven parts based on my prior information? Example:
String: " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Desired result:
" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000"
So, my prior information is that the first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc.
xstr <- " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Rather than use this description:
first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc. ...
I'm just going to take the desired widths from your specified answer (nchar(res)):
res <- c(" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000")
Make sure that all variables are read as character strings:
res2 <- read.fwf(textConnection(xstr), widths = nchar(res),
                 colClasses = rep("character", length(res)))
Test:
all.equal(unname(unlist(res2)),res) ## TRUE
You can also use a simple substr function over your vector of read lines:
my_lines <- readLines("your_file")     # or whatever way you read the raw lines
firstColumn <- substr(my_lines, 1, 7)  # you can also wrap in as.numeric and others if needed
secondColumn <- substr(my_lines, 8, 11)
# ..etc
rm(my_lines)  # to save memory
Sometimes this is actually faster than the various read.* approaches, especially if you don't use them correctly.
