I'm working on a project dealing with chess games. After some processing of the data I need to get the FEN (https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation) notation of a particular position. I've already written the code that encodes each piece, but I'm having a hard time encoding the character that represents the number of consecutive unoccupied squares.
As an example, take the following FEN code:
"rnbq1rk1/pppp1ppp/1b11pn11/11111111/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
Each 1 represents an unoccupied square inside the chess board. So, for example, 11111111 tells us that no square in that row is occupied by a piece.
Problem is, R packages that plot chess boards from FEN input don't like this notation; they want the more succinct, original notation where each run of 1s is represented by a single character: the sum of all these consecutive 1s. For the previous example, that would be:
"rnbq1rk1/pppp1ppp/1b2pn2/8/2PP4/5NP1/PP2PPBP/RNBQ1RK1 w KQkq c6 0 2"
Note that, for example, the 11111111 sequence was replaced by 8, the sum of all the consecutive 1s.
I've tried using mapply with gsub to get the replacements done, but it iterates over the string applying one pattern-replacement pair at a time. The result is the following:
Code:
pattern <- c("11111111","1111111","111111","111111","1111","111","11")
replacement <- c("8","7","6","5","4","3","2")
FENCodeToBeChanged <- "rnbq1rk1/pppp1ppp/1b11pn11/11111111/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
mapply(gsub, pattern, replacement, FENCodeToBeChanged)
Result:
11111111
"rnbq1rk1/pppp1ppp/1b11pn11/8/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
1111111
"rnbq1rk1/pppp1ppp/1b11pn11/71/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
111111
"rnbq1rk1/pppp1ppp/1b11pn11/611/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
111111
"rnbq1rk1/pppp1ppp/1b11pn11/511/11PP1111/11111NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
1111
"rnbq1rk1/pppp1ppp/1b11pn11/44/11PP4/41NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
111
"rnbq1rk1/pppp1ppp/1b11pn11/3311/11PP31/311NP1/PP11PPBP/RNBQ1RK1 w KQkq c6 0 2"
11
"rnbq1rk1/pppp1ppp/1b2pn2/2222/2PP22/221NP1/PP2PPBP/RNBQ1RK1 w KQkq c6 0 2"
As you can see, it does the replacements, but one at a time, and for each new pattern-replacement pair it starts again from the original string; it does not accumulate the results in the sequence I specified in the pattern and replacement vectors.
I've tried the strategies described here and here, but they also didn't work. As mentioned in the last link, I'm trying to avoid looping over gsubs at all costs, as it seems quite inefficient.
Any thoughts on how to proceed?
Thanks!
The issue with mapply is that it is looking at a fresh copy of the FEN string for each replacement, which is not what you need. I think you can use a Reduce mindset:
(BTW, your pattern for "5" has 6 ones; this fixes that.)
pattern <- c("11111111","1111111","111111","11111","1111","111","11")
Reduce(function(txt, ptn) gsub(ptn, as.character(nchar(ptn)), txt), pattern, init=FENCodeToBeChanged)
# [1] "rnbq1rk1/pppp1ppp/1b2pn2/8/2PP4/5NP1/PP2PPBP/RNBQ1RK1 w KQkq c6 0 2"
Reducing over multiple arguments takes a little bit of work, usually iterating along a list of pairs or some such. With this problem, it's easy enough to replace a pattern with its length instead of carrying a second vector of replacement strings, ergo nchar(ptn). (Technically as.character(.) is not required, as gsub will implicitly convert the number, but I wanted to be a bit "declarative" in that that's what I want. There are many tools in R that are less deterministic in this way (e.g., ifelse). Style.)
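If you'd rather avoid the pattern vector entirely, here is a minimal single-pass sketch in base R (my own variation, not part of the answer above): gregexpr locates each run of two or more 1s and regmatches<- replaces it with its length. Note that it scans the whole string, so a clock field containing consecutive 1s (e.g., a fullmove counter of 11) would also be replaced; restrict the substitution to the board field if that can occur.
fen <- FENCodeToBeChanged
m <- gregexpr("1{2,}", fen)                       # locate runs of two or more 1s
regmatches(fen, m) <- lapply(regmatches(fen, m),  # swap each run for its length
                             function(run) as.character(nchar(run)))
fen
# [1] "rnbq1rk1/pppp1ppp/1b2pn2/8/2PP4/5NP1/PP2PPBP/RNBQ1RK1 w KQkq c6 0 2"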
Related
I have a big dataset with a lot of columns, most of them holding non-numeric values. I need to find inconsistencies in the data as well as outliers; finding the inconsistencies would be easy if the dataset weren't so big (7032 rows, to be exact).
An inconsistency would be something like: an ID is supposed to be 4 letters and 4 numbers and I obtain something else (like 3 numbers and 2 letters); another example would be a number that should be 0 or 1 and I obtain a -1 or a 2.
Is there any function that I can use to obtain the inconsistencies in each column?
For the columns that don't have numeric values, I thought of writing a regex and validating whether each row of a certain column is valid, but I didn't find information on how to do that.
For the part of outliers I did a boxplot to see if I could obtain any outlier, like this:
boxplot(dataset$column)
But the plot didn't give me any outliers. Should I trust the result I see in the plot, or should I try something else to check whether there really are any outliers in the data?
For the specific examples you've given:
an ID must be four numbers and four letters:
!grepl("^[0-9]{4}[[:alpha:]]{4}$", ID)
will be TRUE for inconsistent values (^ and $ mean beginning- and end-of-string respectively; {4} means "the previous pattern repeats exactly four times"; [0-9] means "any symbol between 0 and 9", i.e., any numeral; [[:alpha:]] means "any alphabetic character"). If you only want uppercase letters you could use [A-Z] instead (assuming you are not working in some weird locale like Estonian).
If you need a numeric value to be 0 or 1, then !num_val %in% c(0, 1) will work (this works for any set of allowed values; you can use it for a specific set of allowed character values as well)
If you need a numeric value to be between a and b then !(a < num_val & num_val < b) ...
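Putting those checks together, here is a minimal sketch of flagging inconsistent rows across a data frame; dat, ID, and flag are hypothetical names standing in for your own columns:
dat <- data.frame(ID = c("1234ABCD", "12AB", "9876WXYZ"),
                  flag = c(0, 2, 1))
bad_id   <- !grepl("^[0-9]{4}[[:alpha:]]{4}$", dat$ID)  # TRUE where the ID is malformed
bad_flag <- !dat$flag %in% c(0, 1)                      # TRUE where flag is not 0/1
dat[bad_id | bad_flag, ]                                # rows with at least one inconsistency
#     ID flag
# 2 12AB    2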
I have a data.frame in which one column of numeric data is read by readr as character, at least in part because some of the values are "N/A". I don't know whether the values actually include quotation marks.
I am trying to extract all the values in that column that contain anything other than pure numbers, i.e., any character that is not a digit, 0-9. My purpose is to learn how many of these there are and to see whether there are any formats besides "N/A", in preparation for replacing them with something else and then converting the vector to numeric.
While I am confident that there are smarter ways to do this, I am trying to extract those values with a logical vector created by applying a regex to the vector with R's grepl function.
A2 <- 1:10
A3 <- sample(1000:9999, 10)
dat_df <- data.frame(A2, A3)
str(dat_df)
dat_df$A3[1:3] <- c("N/A", "", "banana")
dat_df is a simplified data set, provided for reproducibility.
Here is an example of what I've tried:
dat_df$A3[grepl(as.character(\<\d*[a-zA-Z][a-zA-Z0-9]*>\), x = dat_df$A3)]
This particular one gives the error
Error: unexpected '<' in dat_df"$A3[grepl(as.character(\<"
I have tried a lot of variants of this. These include:
Wrapping the initial data in ( ) (in case it was a precedence problem).
Defining the regex as a character string using as.character as the help file recommends, or with quotation marks.
Wrapping the central portion of the regex with ^ and $ instead of \< and >\
Doubling all the "\"s
In every case I get some variant of the syntax error shown above, varying with the version.
Error: unexpected (and then)
'^' if it starts with a '^'
'\' if it starts with a '\'
'<' if it starts with a '<'
'\d' if it is wrapped in quotation marks instead of using as.character
I cannot make heads or tails of this pattern of errors.
Any help gratefully received and acknowledged.
Firstly, as.character(\<\d*[a-zA-Z][a-zA-Z0-9]*>\) is incorrect and doesn't work: a regex in R must be a character string. For example, as.character(A) doesn't give you "A" but an error, because A is evaluated as an object. You should enclose the pattern in quotes.
Secondly, in R regex you need a double backslash to escape, so \\ instead of \.
If you only have integer data you can use grep with invert = TRUE and value = TRUE to get the values which are not numbers.
grep('^\\d+$', dat_df$A3, invert = TRUE, value = TRUE)
#[1] "N/A" "" "banana"
To change these values to NA and turn them to numeric you may do -
dat_df$A3[grep('^\\d+$', dat_df$A3, invert = TRUE)] <- NA
dat_df$A3 <- as.numeric(dat_df$A3)
dat_df
# A2 A3
#1 1 NA
#2 2 NA
#3 3 NA
#4 4 7475
#5 5 1162
#6 6 9828
#7 7 6359
#8 8 7823
#9 9 2544
#10 10 5287
You can also use grepl to do the same if you prefer it over grep, but it doesn't have the value and invert arguments, so you need to negate the result and index manually.
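A minimal sketch of the grepl equivalent, starting again from the original dat_df:
not_num <- !grepl('^\\d+$', dat_df$A3)  # TRUE for entries that are not pure digits
dat_df$A3[not_num]
#[1] "N/A" "" "banana"
dat_df$A3[not_num] <- NA
dat_df$A3 <- as.numeric(dat_df$A3)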
I'm a programming student and I'm taking a course called "Informatic Mathematics". In the exercises we are asked to convert floating-point numbers from decimal, octal, hexadecimal, or binary to another base (not necessarily base 10) and to keep 12 digits after the comma (or the dot) if possible. For example:
(135.263)b10 => b2
(100101001.11)b2 => b10
(1011110010.0101)b2 => b8
...
I know how to convert numbers. The way I convert the fractional part (after the dot) is to divide it by the highest negative power of the target base that fits, repeating until I get 0 or until I reach the 12th digit after the dot. The problem is that I don't know all the negative powers of 2 by heart, so usually I write them out on a separate sheet. Usually I don't have to keep 12 digits after the dot, though, and writing these powers on a separate sheet takes time; during an exam, time is precious and I can't waste it writing them out.
So I would like to know if there's a better way to do these conversions or if anyone has any tips.
Also, when I convert a non-decimal number to another non-decimal base (e.g., b2 => b8), I usually convert the first number to base 10 and then convert that base-10 number to the target base. I would like to know if there's a way to convert the first number directly into the target base without going through base 10 first.
BTW: Sorry if my English is a bit weird. I'm a French Canadian and I did my best, but please let me know if there is something you don't understand.
I'll start with b2 => b8.
001 011 110 010.010 100
As you see, I've separated the number into 3-digit groups (2^3 = 8), adding extra 0s to the left of the integer part and to the right of the fraction to complete the groups. Then you convert it group by group, each 3-bit group becoming one octal digit. In this case you'll receive 1362.24.
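If you want to check such a conversion in R, here is a minimal sketch of the grouping method; bin_to_oct is a hypothetical helper name:
bin_to_oct <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)[[1]]
  int  <- parts[1]
  frac <- if (length(parts) > 1) parts[2] else ""
  # pad the integer part on the left and the fraction on the right to multiples of 3
  int  <- paste0(strrep("0", (3 - nchar(int) %% 3) %% 3), int)
  frac <- paste0(frac, strrep("0", (3 - nchar(frac) %% 3) %% 3))
  # convert each 3-bit group to a single octal digit
  grp <- function(s) if (nchar(s) == 0) "" else
    paste(vapply(seq(1, nchar(s), by = 3),
                 function(i) strtoi(substr(s, i, i + 2), base = 2),
                 integer(1)),
          collapse = "")
  if (nchar(frac) > 0) paste0(grp(int), ".", grp(frac)) else grp(int)
}
bin_to_oct("1011110010.0101")
# [1] "1362.24"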
b2 => b10
Some harder math here. Mark digits in your number this way:
1 0 0 1 0 1 0 0 1 . 1 1
8 7 6 5 4 3 2 1 0 -1 -2
Then calculate the whole and fractional parts:
2^0 + 2^3 + 2^5 + 2^8 + 2^-1 + 2^-2 = 297.75
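You can verify that sum quickly in R:
sum(2^c(8, 5, 3, 0)) + sum(2^c(-1, -2))
# [1] 297.75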
b10 => b2
Multiply the fraction by 2 until the fractional part becomes 0 (or you have as many digits as you need). From each multiplication you take the whole part. Example:
0.25 * 2 = 0.5; 0.5 * 2 = 1;
Thus, 0.25 in binary is 0.01.
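Here is a minimal R sketch of that multiply-by-2 method; frac_to_bin is a hypothetical helper, capped at 12 digits as the exercises require:
frac_to_bin <- function(f, digits = 12) {
  out <- character(0)
  while (f > 0 && length(out) < digits) {
    f <- f * 2
    out <- c(out, as.integer(f))  # the whole part is the next binary digit
    f <- f - as.integer(f)
  }
  paste(out, collapse = "")
}
frac_to_bin(0.25)
# [1] "01"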
UPD: For negative numbers, look up one's and two's complement.
I am learning R and experimenting with subset() and grepl() with regex for filtering a data frame. I have created a very small data frame to play with:
x y z w
1 10 a k
2 12 b l
3 14 c m
4 16 d n
5 18 e o
My code is the following:
subset(df14, grepl('^c | [l - n]', c(df14$z , df14$w) ), grepl('[yz]', colnames(df14)) )
In my mind, the second argument should return the indices of the rows in which grepl() finds a match for the pattern in the columns named 'z' or 'w'. However, this is not what happens (it returns an empty data frame with columns y and z).
I would expect it to return rows 2, 3, 4, since column 'w' contains the letters l, m, n specified by the [l-n] regex pattern, and the columns z and w, since these names match the regex [yz] in the third argument of subset().
(I suspect that it is looking for a match in the names of the columns rather than in the contents of the columns, which is what interests me.)
Obviously, I am not interested in the result per se. This is an experiment to understand how the functions work. So, what I am looking for is an explanation and a method to correct the specific code -- not an alternative solution.
Your advice will be appreciated.
There are a variety of problems.
One issue is the extra spaces in your patterns. Drop them or use the free-spacing modifier (?x) with perl = TRUE. Either way, you have to get rid of the spaces inside the character class: [l-n] matches "m" and [l - n] does not, even with (?x). You can read more about the free-spacing modifier and its impact inside and outside character classes here. (A quick demonstration appears at the end of this answer.)
Another issue is that in your first grepl, you're searching within a vector (character vector? we can't tell from the example) of length 10. What would a TRUE in the 6th position mean for a 5-row data.frame? It doesn't make sense to return the 6th row of a 5-row data frame. Instead, you can check whether your pattern is found in column "w" or (|) in column "z": look within each column, not a concatenation of columns.
Another issue is in your second grepl: "w" is not a match for [yz]. If you want to select the columns whose names contain a "w" or a "z", one way would be with [wz]:
There is no need for the ^ anchor since all your strings contain a single character, but I'll leave it in anyway:
subset(df14,
subset = grepl('^c|[l-n]', df14$z) |
grepl('^c|[l-n]', df14$w),
select = grepl('[wz]', colnames(df14)))
# z w
#2 b l
#3 c m
#4 d n
Or with the free-spacing mode modifier and a different pattern ([wz] vs w|z) for the second grepl:
subset(df14,
subset = grepl('(?x)^c | [l-n]', df14$z, perl = TRUE) |
grepl('(?x)^c | [l-n]', df14$w, perl = TRUE),
select = grepl('w|z', colnames(df14)))
# z w
#2 b l
#3 c m
#4 d n
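As promised, a quick demonstration of how (?x) treats spaces outside versus inside a character class (a sketch; perl = TRUE enables the PCRE modifier):
grepl('(?x) ^c | [l-n]', "m", perl = TRUE)    # TRUE: spaces outside the class are ignored
grepl('(?x) ^c | [l - n]', "m", perl = TRUE)  # FALSE: spaces inside the class are literal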
The '^c | [l - n]' search expression can't find anything in those columns. Also, a more intuitive approach is to use [ , ] for this type of subsetting. See http://adv-r.had.co.nz/Subsetting.html.
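For reference, a minimal sketch of the same filter written with [ , ] indexing (equivalent to the subset() calls above):
rows <- grepl('^c|[l-n]', df14$z) | grepl('^c|[l-n]', df14$w)
cols <- grepl('[wz]', colnames(df14))
df14[rows, cols]
# z w
#2 b l
#3 c m
#4 d n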
I'm trying to read.csv thousands of CSV files into R but run into a lot of trouble when my text has commas.
Each CSV file has 16 columns with headers. Some of the text in column 1 contains commas. Column 2 is a string, and column 3 is always a number.
For instance, an entry in column 1 is:
"I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
When I try to
df <- do.call("rbind", lapply(paste(CSVpath, fileNames, sep=""), read.csv, header=TRUE, stringsAsFactors=TRUE, row.names=NULL))
I get a df with more than 16 columns, and the above text is split across 4 columns:
V1 V2 V3 V4
"I do not know Robert Kim or Douglas" - Marcus. A. Ten Inc President
when I need it all in one column as:
V1
"I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
First, if you have control over the data output format, I strongly urge you to either (a) correctly quote the fields, or (b) use another character as a delimiter (e.g., tab, pipe "|"). This is the ideal solution, as it will certainly speed up future processing and "fix the glitch", so to speak.
Lacking that, you can try to programmatically fix all rows. Assuming that only the first column is problematic (i.e., all of the other columns are perfectly defined), then on a line-by-line basis, change the true separators to a different delimiter (e.g., pipe or tab).
For this example, I have 4 columns delimited with a comma, and I'm going to change the legitimate separators to a pipe.
Some data and magic constants:
txt <- '"I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President,TRUE,0,14
"Something, else",FALSE,1,15
"Something correct",TRUE,2,22
Something else,FALSE,3,33'
nColumns <- 4 # known a priori
oldsep <- ","
newsep <- "|"
In your case, you'll read in the data:
txt <- readLines("path/to/malformed.csv")
nColumns <- 16
Do a manual (text-based, not parsing for data types) separation:
splits <- strsplit(readLines(textConnection(txt)), oldsep)
Realize that this reads, for example, the FALSE fields as the verbatim characters "FALSE", not as a logical data type. This could be avoided if we took on the magic type-detection done by read.csv and cousins, but why bother, since we're only rewriting the lines?
Per line: first set aside the last nColumns-1 fields; recombine all of the leading fields with the old separator, yielding a single field (with its commas restored); then join this field and the nColumns-1 remaining fields with the new separator. (BTW: making sure we deal with quoting of double-quotes correctly, too.)
txt2 <- sapply(splits, function(vec) {
  n <- length(vec)
  # fewer fields than expected: nothing to merge, just rejoin with the new separator
  if (n < nColumns) return(paste(vec, collapse = newsep))
  # everything before the last nColumns-1 fields belongs in column 1;
  # stitch it back together with the old separator
  vec1 <- paste(vec[1:(n - nColumns + 1)], collapse = oldsep)
  # quote the reassembled field, doubling any embedded double-quotes
  vec1 <- sprintf('"%s"', gsub('"', '""', vec1))
  paste(c(vec1,
          vec[(n - nColumns + 2):n]), collapse = newsep)
})
txt2[1]
# [1] "\"\"\"I do not know Robert, Kim, or Douglas\"\" - Marcus. A. Ten, Inc President\"|TRUE|0|14"
(The sprintf line may not be necessary if the original file has correct quoting of double-quotes ... but then again, if it had correct quoting, we wouldn't be having this problem in the first place.)
Now, either absorb the data directly into a data.frame:
read.csv(textConnection(txt2), header = FALSE, sep = newsep)
# V1 V2 V3 V4
# 1 "I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President TRUE 0 14
# 2 "Something, else" FALSE 1 15
# 3 "Something correct" TRUE 2 22
# 4 Something else FALSE 3 33
or write these back to a file (good if you want to deal with these files elsewhere), adding con = "path/to/filename" as appropriate:
writeLines(txt2)
# """I do not know Robert, Kim, or Douglas"" - Marcus. A. Ten, Inc President"|TRUE|0|14
# """Something, else"""|FALSE|1|15
# """Something correct"""|TRUE|2|22
# "Something else"|FALSE|3|33
(Two notable changes: the correct comma-delimiters are now pipes while all other commas remain commas, and there is correct quoting around the double-quotes. Yes, an escaped double-quote is just two double-quotes; that's what R expects for quotes within a field.)
NB: though this seems to work with my fabricated data (and I hope it works with yours), you don't hear people touting R's speed and efficiency at this kind of text mangling. There are certainly better tools for the job, perhaps python, awk, or sed; there are possibly faster ways to do it in R as well.