R: How to read in a PGN as a Data Frame

I have a single .pgn (Portable Game Notation) file containing a large number of chess games. The games are stored in the file like this:
[Event "4th Bayern-chI Bank Hofmann"]
[Site "?"]
[Date "2000.10.29"]
[Round "?"]
[White "Carlsen, Magnus"]
[Black "Cordts, Ingo"]
[ECO "A56"]
[WhiteElo "0"]
[BlackElo "2222"]
[Result "0-1"]
1. d4 Nf6 2. c4 c5 3. Nf3 cxd4 4. Nxd4 e5 5. Nb5 d5 6. cxd5 Bc5 7. N5c3 O-O 8. e3 e4 9. h3 Re8 10. g4 Re5 11. Bc4 Nbd7 12. Qb3 Ne8 13. Nd2 Nd6 14. Be2 Qh4 15. Nc4 Nxc4 16. Qxc4 b5 17. Qxb5 Rb8 18. Qa4 Nf6 19. Qc6 Nd7 20. d6 Re6 21. Nxe4 Bb7 22. Qxd7 Bxe4 23. Rh2 Bxd6 24. Bc4 Rd8 25. Qxa7 Bxh2 26. Bxe6 fxe6 27. Qa6 Bf3 28. Bd2 Qxh3 29. Qxe6+ Kh8 30. Qe7 Bc7
0-1
[Event "4th Bayern-chI Bank Hofmann"]
[Site "?"]
[Date "2000.10.30"]
[Round "?"]
[White "Kaiser, Guenter"]
[Black "Carlsen, Magnus"]
[ECO "A46"]
[WhiteElo "0"]
[BlackElo "0"]
[Result "0-1"]
1. d4 Nf6 2. Nf3 d6 3. Nc3 g6 4. e4 Bg7 5. Be2 O-O 6. O-O e5 7. Be3 h6 8. Qd2 Ng4 9. d5 f5 10. exf5 gxf5 11. h3 Nxe3 12. Qxe3 e4 13. Nd4 Qe7 14. Rad1 c5 15. dxc6 bxc6 16. Bc4+ Kh7 17. Nce2 d5 18. Bb3 c5 19. Nb5 d4 20. Qd2 Bb7 21. Nf4 a6 22. Nd5 Qe5 23. Nbc7 Ra7 24. Qa5 Nd7 25. g3 Rc8 26. Nb5 Raa8 27. Nbc7 Bxd5 28. Nxa8 Rxa8 29. Ba4 Be6 30. Kh2 f4 31. Qe1 Nf6 32. Bc6 Rc8 33. Bb7 Rc7 34. Ba8 Bd5 35. Bxd5 Nxd5 36. Qe2 fxg3+
0-1
I would like to read in this data as a DataFrame where the column titles are simply the word to the left of the string in quotation marks and the row values are whatever is in the quotation marks. Another column would contain a string of all the moves.
I am completely new to R and simply cannot figure out how to read in a file that is not already in some known format.
readLines() looks promising.

Try this:
pgn <- read.table("your_file.pgn", quote="", sep="\n", stringsAsFactors=FALSE)
# get column names
colnms <- sub("\\[(\\w+).+", "\\1", pgn[1:12,1])
# give columns 11 (the moves) and 12 (redundant results column) nice names
colnms[11] <- "Moves"
colnms[12] <- "Results2"
pgn.df <- data.frame(matrix(sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1", pgn[,1]),
byrow=TRUE, ncol=12))
names(pgn.df) <- colnms
This solution assumes each game is exactly 12 lines, as in your example. If games take up a variable number of lines, this solution won't work.
Explanation of the regex lines (see ?regex for more):
sub("\\[(\\w+).+", "\\1", pgn[1:12,1])
In this regex, we want the first word that follows a square bracket. We have to escape that bracket, as it's a metacharacter. There are other ways to achieve that without using escapes (\), such as by making the [ a character class by putting it inside square brackets: sub("[[](\\w+).+", "\\1", pgn[1:12,1]).
The parentheses (a capture group) go together with the \\1. The \\1 as the second argument to sub says to replace the original string with the contents of the first (and only in this case) capture group. Were there to be a 2nd capture group, you'd use \\2 to reference it.
The contents of the capture group \\w+ are one or more (that's what the + means) word characters (represented by \\w). After the () we want to match the rest of the string, which we can do by looking for any character (that's what . means) one or more times (i.e. .+).
So, the regex finds the first square bracket and the first consecutive block of word characters, which we capture, followed by one or more of any other characters.
The second regex: "\\[\\w+ \\\"(.+)\\\"\\]"
Let's look at the first entry of pgn[,1]: [1] "[Event \"4th Bayern-chI Bank Hofmann\"]". We start out the same as the first regex, but this time we don't want to capture the first word, we just want to find it followed by a space, and then we want to capture everything between the two sets of \".
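Inside an R string, a literal " must be written as \" and a literal backslash as \\, so \\\" reaches the regex engine as \", an escaped quote. (The regex-level backslash is harmless but not strictly required; \" alone would also work.) We therefore have a pair of \\\" surrounding a capture group that looks for any character one or more times (.+), and finally a closing square bracket, which we escape the same way as the opening one. If we didn't escape the ", R would treat it as the end of the first argument to sub rather than as a literal quote.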
In the case of entries like line 11 and 12, nothing is matched because neither line starts with a [, and so, nothing is substituted. We just get the original string back in its entirety.
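The fixed 12-line assumption can be relaxed by grouping lines into games instead of reshaping them into a matrix. The following is only a sketch under stated assumptions: parse_pgn is a hypothetical helper (not from any package), every tag sits on its own line in the form [Name "value"], and a new game begins at the first tag line after some move text. Tags missing from a game become NA, and the trailing result line (e.g. 0-1) is kept as part of the Moves column.

```r
# Hypothetical helper: parse PGN lines into a data frame, one row per game.
# Assumes one tag per line and that a game starts at the first tag line
# following non-tag (move) lines.
parse_pgn <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                      # drop blank lines
  is_tag <- grepl("^\\[\\w+ \".*\"\\]$", lines)
  # a game starts at a tag line whose predecessor is not a tag line
  starts <- is_tag & !c(FALSE, head(is_tag, -1))
  game_id <- cumsum(starts)
  games <- lapply(split(seq_along(lines), game_id), function(idx) {
    tags <- lines[idx][is_tag[idx]]
    movetext <- paste(lines[idx][!is_tag[idx]], collapse = " ")
    keys <- sub("^\\[(\\w+) .*", "\\1", tags)
    vals <- sub("^\\[\\w+ \"(.*)\"\\]$", "\\1", tags)
    c(setNames(vals, keys), Moves = movetext)
  })
  # align all games on the union of tag names; missing tags become NA
  all_keys <- unique(unlist(lapply(games, names)))
  rows <- lapply(games, function(g) setNames(g[all_keys], all_keys))
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# Small made-up input: the second game has no White tag
pgn_lines <- c('[Event "A"]', '[White "Carlsen, Magnus"]', '[Result "0-1"]',
               '1. d4 Nf6 2. c4 c5', '0-1',
               '[Event "B"]', '[Result "1-0"]',
               '1. e4 e5', '1-0')
pgn.df <- parse_pgn(pgn_lines)
```

For a real file, the input would come from readLines("your_file.pgn"). Move text that wraps over several lines is joined with single spaces.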

Here's what I'd try:
con <- file("pgn_file.txt", "r")
all_lines <- readLines(con)
close(con)
# tag lines look like [Name "value"]: group 1 is the name, group 2 the value
tag_re <- "^\\[\\s*([a-zA-Z]+)\\s*\"([a-zA-Z0-9\\s.?, -]+)\"\\]$"
res <- list()
for (this_line in all_lines) {
  if (grepl("^\\s*$", this_line, perl = TRUE)) {
    next  # empty line: nothing to do
  }
  if (grepl("^\\[", this_line, perl = TRUE)) {
    # perl = TRUE so that \\s inside the [...] class is read as whitespace
    field <- sub(tag_re, "\\1", this_line, perl = TRUE)
    value <- sub(tag_re, "\\2", this_line, perl = TRUE)
    key <- tolower(field)
    res[[key]] <- c(res[[key]], value)
  } else if (grepl("^1\\.", this_line, perl = TRUE)) {
    # the move text of a game starts with "1."
    res[["move_list"]] <- c(res[["move_list"]], this_line)
  }
}
# every field needs one entry per game for the conversion to succeed
res <- as.data.frame(res, stringsAsFactors = FALSE)

Related

gsub extracting string

My sample data is:
c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
For each line, I'm looking to extract (they are variable names):
Line 1: "PEMJNUM"
Line 2: "PRFAMTYP"
Line 3: "PUBUS1"
Line 4: "PEIO1COW"
My initial goal was to gsub remove the characters to the left and right of each variable name to leave just the variable names, but I was only able to grab everything to the right of the variable name and had issues with grabbing characters to the left. (as shown here https://regexr.com/67r6j).
Not sure if there's a better way to do this!

You can use sub in the following way:
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl=TRUE)
# => [1] "PEMJNUM" "PRFAMTYP" "PUBUS1" "PEIO1COW"
Details:
^ - start of string
(?:.*\b)? - an optional non-capturing group that matches any zero or more chars, as many as possible, and then a word boundary position. With perl=TRUE, . does not match line break chars; if you need to match line breaks too, add (?s) at the pattern start
(\w+) - Group 1 (\1): one or more word chars
\s* - zero or more whitespaces
\b - a word boundary
2 - a 2 digit
\b - a word boundary
.* - the rest of the line/string.
If there are always whitespaces before 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".
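If the layout is regular enough, the same names can also be pulled out by position rather than by pattern. This is a sketch, not a replacement for the regex above: extract_name is a hypothetical helper, and it assumes that the last three tab-separated fields of every line are always the width, the description, and the column range, so that the variable name is the last word of the field just before them.

```r
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
       "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
       "5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
       "2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")

# Hypothetical helper: take the 4th field from the end and keep its last word.
# Assumes the last three tab-separated fields are width, description, range.
extract_name <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  name_field <- fields[length(fields) - 3]
  words <- strsplit(trimws(name_field), " +")[[1]]
  words[length(words)]
}

var_names <- vapply(x, extract_name, character(1), USE.NAMES = FALSE)
var_names
```

On the four sample lines this gives the same names as the sub call. It would break on lines with a different number of trailing fields, which the regex handles more gracefully.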

R: gsub/replace only those occurrences following a keyword occurrence

I only want to replace occurrences of a pattern that follow a particular keyword, not those before it. In other words, do nothing until the first occurrence of the keyword, and then start to gsub to the right of it. See below:
gsub("\\[|\\]", "", "ab[ cd] ef keyword [ gh ]keyword ij ")
Actual results:
"ab cd ef keyword gh keyword ij "
Desired results:
"ab[ cd] [][asfg] ]] ef keyword gh keyword ij "
[Edited to fix the results. I don't want to remove 'keyword']
[Edited to show case of multiple occurrences of keyword]

You might use \G to get contiguous matches after keyword. Use \K to forget what was matched so far, and then match the following [ or ] to be replaced with an empty string.
(?:^.*?keyword\b|\G(?!^))[^\[\]]*\K[\[\]]
In parts
(?: Non capturing group
^.*?keyword Match until the first keyword
| Or
\G(?!^) Assert position at the end of previous match, not at the start to get continuous matches
) Close non capturing group
[^\[\]]*\K Match 0+ times not [ or ] and forget what was matched using \K
[\[\]] Match either [ or ]
Your code might look like
gsub("(?:^.*?keyword\\b|\\G(?!^))[^\\[\\]]*\\K[\\[\\]]", "", "ab[ cd] ef keyword [ gh ]keyword ij ", perl=TRUE)
Note that perl=TRUE is required, since \G and \K are only supported by the Perl-compatible (PCRE) engine.
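For comparison, the same effect can be had without \G and \K by cutting the string at the end of the first keyword and running a plain gsub on the remainder. This is a minimal sketch: strip_after_keyword is a hypothetical helper, and it treats the keyword as a fixed string (no word-boundary check, unlike the \b in the pattern above).

```r
# Hypothetical helper: remove [ and ] only to the right of the first keyword.
strip_after_keyword <- function(s, keyword = "keyword") {
  pos <- regexpr(keyword, s, fixed = TRUE)
  if (pos == -1) return(s)                        # keyword absent: leave as is
  cut <- pos + attr(pos, "match.length") - 1      # last character of the keyword
  before <- substr(s, 1, cut)
  after <- substr(s, cut + 1, nchar(s))
  paste0(before, gsub("\\[|\\]", "", after))
}

input <- "ab[ cd] ef keyword [ gh ]keyword ij "
plain <- strip_after_keyword(input)
pcre <- gsub("(?:^.*?keyword\\b|\\G(?!^))[^\\[\\]]*\\K[\\[\\]]", "", input, perl = TRUE)
```

Both approaches leave the brackets before the first keyword untouched. The helper is easier to read, while the single regex is easier to apply across a whole vector.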

Regex to match a pattern but not two specific cases

I want to match every case of "-", except these:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?

A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either between an uppercase ASCII letter and a digit, or between a digit and an uppercase ASCII letter. Items containing such a match are dropped because of invert=TRUE.
To only match - in all contexts other than in between a letter and a digit, you may use the PCRE regex based on SKIP-FAIL technique like
> grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
[1] 1 2 5
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.
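To see which hyphens the SKIP-FAIL pattern actually selects, it can help to replace them with a visible marker. The sketch below reuses the pattern and the vector x from above; the underscore marker and the extra test string from the question are only for illustration.

```r
x <- c("QUADRA 120 - ASA BRANCA", "FAZENDA LAGE -RODOVIA RIO VERDE",
       "C-15", "99-B", "A-A")
skip_fail <- "(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-"

# Replace only the hyphens that are NOT between a letter and a digit
marked <- gsub(skip_fail, "_", x, perl = TRUE)

# The problem case from the question: the hyphen in "VF-32" is skipped
has_free_hyphen <- grepl(skip_fail, "RUA VF-32 N", perl = TRUE)
```

Here "C-15" and "99-B" come back unchanged, while the hyphens in the other three items are replaced.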

Add leading zero within a character string

One column of my data.frame looks like the following:
c("BP_1_CSPP", "BP_2_GEGS", "BP_3_AEAG", "BP_4_KPAP", "BP_5_TAKP",
"BP_6_GGDR", "BP_7_MQQP", "BP_8_EEEE", "BP_9_RSDP", "BP_10_APAS",
"BP_11_KRGG", "BP_12_RSQQ", "BP_13_QQLS", "BP_14_EPEV", "BP_15_AAPS",
"BP_16_SDVT", "BP_17_GQQQ", "BP_18_AETP", "BP_19_PPSA", "BP_20_DATP",
"EpQ_1_AYAT", "EpQ_2_HEKL", "EpQ_3_SCSV", "EpQ_4_MAYV", "EpQ_5_LKDP",
"EpQ_6_ERCE", "EpQ_7_DNPA", "EpQ_8_YGIS", "EpQ_9_GMSS", "EpQ_10_AAKK",
"EpQ_11_NIRI", "EpQ_12_ERRR", "EpQ_13_MDRE", "EpQ_14_SRQM", "EpQ_15_DWSI",
"EpQ_16_VLVQ", "EpQ_17_GRTI", "EpQ_18_EKVR", "EpQ_19_PDVA", "EpQ_20_ADVT",
"LbT_1_RPGG", "LbT_2_TQGD", "LbT_3_EVKS", "LbT_4_VIEM", "LbT_5_GSAD",
"LbT_6_VRPI", "LbT_7_CELG", "LbT_8_APQQ", "LbT_9_SAEE", "LbT_10_GEAE",
"LbT_11_EELR", "LbT_12_EWAN", "LbT_13_IKEE", "LbT_14_VSDF", "LbT_15_WEDV",
"LbT_16_SGGA", "LbT_17_KATN", "LbT_18_EREG", "LbT_19_AWAS", "LbT_20_VDRD",
"abc_1_CVTQ", "abc_2_KEAP", "abc_3_TAYI", "abc_4_MITN", "abc_5_MPTV",
"abc_6_TRTG", "abc_7_KSTI", "abc_8_KEAI", "abc_9_HVYS", "abc_10_LGMG",
"abc_11_VAYQ", "abc_12_AGTG", "abc_13_TDSW", "abc_14_HKKS", "abc_15_YGLA",
"abc_16_WEEW", "abc_17_HSTI", "abc_18_EKCI", "abc_19_PAGI", "abc_20_TGTI",
"TcII")
Considering all the numbers < 10 located within the strings (e.g. "BP_1_CSPP", "BP_2_GEGS"), I want to add a leading zero to them, such that I would have:
"BP_01_CSPP", "BP_02_GEGS", "BP_03_AEAG", "BP_04_KPAP", "BP_05_TAKP",
"BP_06_GGDR"
and so on.
This question almost did the job, yet it did not work for my data because:
The "0" will not be inserted at the same position every time: some strings have 3 characters before the 0 to be inserted (e.g. BP_1_CSPP), while others have 4 (e.g. EpQ_3_SCSV).
There are still characters after the zero to be inserted, i.e. the zero goes in the middle of the string.

We can use sub to match the pattern of _ followed by a single number (([0-9])) captured as a group (inside the brackets) followed by _ and replace it with _ followed by 0, the backreference of the capture group (\\1) followed by _.
v1 <- sub("_([0-9])_", "_0\\1_", v1)
v1
#[1] "BP_01_CSPP" "BP_02_GEGS" "BP_03_AEAG" "BP_04_KPAP" "BP_05_TAKP" "BP_06_GGDR" "BP_07_MQQP" "BP_08_EEEE" "BP_09_RSDP" "BP_10_APAS" "BP_11_KRGG"
#[12] "BP_12_RSQQ" "BP_13_QQLS" "BP_14_EPEV" "BP_15_AAPS" "BP_16_SDVT" "BP_17_GQQQ" "BP_18_AETP" "BP_19_PPSA" "BP_20_DATP" "EpQ_01_AYAT" "EpQ_02_HEKL"
#[23] "EpQ_03_SCSV" "EpQ_04_MAYV" "EpQ_05_LKDP" "EpQ_06_ERCE" "EpQ_07_DNPA" "EpQ_08_YGIS" "EpQ_09_GMSS" "EpQ_10_AAKK" "EpQ_11_NIRI" "EpQ_12_ERRR" "EpQ_13_MDRE"
#[34] "EpQ_14_SRQM" "EpQ_15_DWSI" "EpQ_16_VLVQ" "EpQ_17_GRTI" "EpQ_18_EKVR" "EpQ_19_PDVA" "EpQ_20_ADVT" "LbT_01_RPGG" "LbT_02_TQGD" "LbT_03_EVKS" "LbT_04_VIEM"
#[45] "LbT_05_GSAD" "LbT_06_VRPI" "LbT_07_CELG" "LbT_08_APQQ" "LbT_09_SAEE" "LbT_10_GEAE" "LbT_11_EELR" "LbT_12_EWAN" "LbT_13_IKEE" "LbT_14_VSDF" "LbT_15_WEDV"
#[56] "LbT_16_SGGA" "LbT_17_KATN" "LbT_18_EREG" "LbT_19_AWAS" "LbT_20_VDRD" "abc_01_CVTQ" "abc_02_KEAP" "abc_03_TAYI" "abc_04_MITN" "abc_05_MPTV" "abc_06_TRTG"
#[67] "abc_07_KSTI" "abc_08_KEAI" "abc_09_HVYS" "abc_10_LGMG" "abc_11_VAYQ" "abc_12_AGTG" "abc_13_TDSW" "abc_14_HKKS" "abc_15_YGLA" "abc_16_WEEW" "abc_17_HSTI"
#[78] "abc_18_EKCI" "abc_19_PAGI" "abc_20_TGTI" "TcII"
Another option is to split each string on _ with strsplit, reformat the number with sprintf, and then paste the pieces back together:
sapply(strsplit(v1, "_"), function(x) {
  if (length(x) > 1) x[2] <- sprintf("%02d", as.numeric(x[2]))
  paste(x, collapse = "_")
})
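If the padding width may change later (say, three digits instead of two), the number can be located with regexpr and rewritten in place through regmatches<-. This is a sketch under assumptions: pad_number is a hypothetical helper, and it pads only the first run of digits that sits between two underscores, leaving strings without such a run (like "TcII") alone.

```r
# Hypothetical helper: zero-pad the number between two underscores.
pad_number <- function(v, width = 2) {
  # lookarounds keep the underscores out of the match, hence perl = TRUE
  m <- regexpr("(?<=_)[0-9]+(?=_)", v, perl = TRUE)
  fmt <- paste0("%0", width, "d")
  # replacement values are supplied only for the elements that matched
  regmatches(v, m) <- sprintf(fmt, as.integer(regmatches(v, m)))
  v
}

padded <- pad_number(c("BP_1_CSPP", "EpQ_10_AAKK", "TcII"))
padded3 <- pad_number("abc_7_KSTI", width = 3)
```

Unlike the sub call, which targets exactly one digit, this reformats whatever number it finds, so already padded values are left as they are.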

What is the best way to parse this flat text format in R?

1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771
2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403
3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757
37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547
43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
I want to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), the ID field, and skip the other stuff. All the string utilities like scan() seem geared toward more regular data. The blank line between records is effectively the record separator. I can write this to disk and read it back in with readLines() but I'd prefer to do it from memory since I downloaded it over HTTP.

Read the data in from "myfile.dat", say (or, if you already have it in memory as separate lines, start from L below). First, extract the lines that begin with digits followed by a dot and a space, that contain the word Location:, or that start with ID:, giving v. Then remove everything in those lines up to and including the last space, giving v2. Next, create a grouping vector g that identifies the record to which each component of v2 belongs. (This uses the fact that the first field of each record starts with a non-digit, while the other fields start with a digit.) Split v2 by g into the list s. Any component of s with only two elements is padded by inserting an NA in the middle, on the assumption that a short record is missing its Location: field. (We assume the gene name and the ID field are never missing.) Finally, transpose the result so that the fields are in columns and the records are in rows.
L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)
Using the sample data in the post we get this from the last line:
[,1] [,2] [,3]
1 "ZFP112" "19q13.2" "7771"
2 "SEP15" "1p31" "9403"
3 "MLL4" "19q13.1" "9757"
4 "LOC100509547" NA "100509547"
5 "LOC100509587" NA "100509587"
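Since the question mentions that the text was downloaded over HTTP and is already in memory, the readLines step can be swapped for a split of the in-memory string. The sketch below assumes the download is held in a single character string called txt (a made-up name, filled here with a shortened copy of the sample), and the column names gene, location and id are likewise only illustrative.

```r
# Assumed: txt holds the whole download as one string
txt <- paste(
  "1. ZFP112",
  "Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]",
  "Chromosome: 19; Location: 19q13.2",
  "ID: 7771",
  "",
  "37. LOC100509547",
  "hypothetical protein LOC100509547[Homo sapiens]",
  "This record was discontinued.",
  "ID: 100509547",
  sep = "\n"
)

L <- strsplit(txt, "\n", fixed = TRUE)[[1]]       # lines, without touching disk
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g <- cumsum(grepl("^\\D", v2))                    # new record at each non-digit start
s <- split(v2, g)
m <- t(sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x))
genes <- data.frame(gene = m[, 1], location = m[, 2], id = m[, 3],
                    stringsAsFactors = FALSE)
```

readLines(textConnection(txt)) would work in place of the strsplit call as well. The rest is the same pipeline as above, with a data frame built from the final matrix.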
