Extract all two-character combinations from a string in R

In order to distinguish nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to do an analysis of letter frequencies.
My idea is to calculate the relative frequencies of two-letter combinations ("te", "ex", "xt", "is", etc.) from a long text. Based on that information, I would like to calculate the probability that a given word (or sentence) is real German.
My first problem, though, is how to extract and count all the two-letter combinations. I fear that calling substring(string, start, stop) while increasing start and stop in a loop might not be a very efficient solution. Do you have any idea?
# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'
# Some cleaning:
library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words
for (word in words) {
  ???
}

Clean the text, replacing any sequence of non-alphanumeric characters with a single space:
text <- tolower(gsub("[^[:alnum:]]+", " ", text))
Find all pairs of sequential characters:
twos <- substring(text, 1:(nchar(text) - 1), 2:nchar(text))
but keep only those that do not overlap a space:
twos[nchar(trimws(twos)) == 2L]
Here's the result:
> twos[nchar(trimws(twos)) == 2L] |> table()
19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
1 1 1 6 2 2 1 2 2 2 14 2 16 1 1 10 1 15 6 1 12 1 1 24 1 2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
1 1 1 4 34 1 9 23 3 18 2 2 1 1 1 1 1 9 32 1 7 5 54 1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12 3 3 1 2 1 4 2 3 2 3 1 4 2 3 4 1 19 2 1 2 3 1 4 8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
3 1 1 3 2 3 9 11 1 1 1 2 16 1 18 2 4 2 2 3 3 28 2 1 5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19 1 1 2 1 8 1 3 3 1 6 1 7 1 1 5 3 11 1 1 4 1 12 1 8 7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
3 3 1 2 3 4 1 23 13 1 10 8 5 2 4 3 1 1 6 10 2 3 2 3 2 2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
1 1 3 3 3 8 1 7 4 1 1 1 1 1 1 3 1 1 1 3 2 5 2 3 4 2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14 3 1 1 4 2 1 1 4 3 2 9 2 11 1 1 3 1 13 17 1 1 6 5 4 2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
2 3 1 9 17 3 1 7 2 24 1 6 1 1 4 6 3 1 4 1 2 2 1 2 6 1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
2 1 5 24 3 3 8 3 1 3 3 4 3 4 1 8 9 2 1 5 2 9 6
The algorithm generalizes to sequences of any number of letters by separating words with n - 1 spaces:
chartuples <- function(text, n = 2) {
  n0 <- n - 1
  text <- tolower(gsub(
    "[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
  ))
  tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
  tuples[nchar(trimws(tuples)) == n]
}
This is also easy to use for looking up the values of any 'word':
counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()
(The NAs in the resulting vector are letter pairs not present in your original corpus.)
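To get from these counts to the relative frequencies the question asks about, one option is to score a word by the mean log relative frequency of its letter pairs. A minimal sketch, assuming the chartuples() function above (score_word() and the floor value for unseen pairs are illustrative choices, not part of the original answer):
counts <- table(chartuples(text))
freqs <- counts / sum(counts)
score_word <- function(word, freqs, floor = 1e-6) {
  p <- as.vector(freqs[chartuples(word)])
  p[is.na(p)] <- floor  # pairs never seen in the corpus get a small floor probability
  mean(log(p))
}
score_word("gesundheit", freqs)  # plausible German: comparatively high score
score_word("djsarejsl", freqs)   # nonsense: markedly lower score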

words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words,
                function(x) substring(x, len <- seq(nchar(x) - 1), len + 1),
                USE.NAMES = TRUE)
head(cmbs2)  # just to show a few words
$Es
[1] "Es"
$ist
[1] "is" "st"
$ein
[1] "ei" "in"
$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"
$ohne
[1] "oh" "hn" "ne"
$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"

If I'm not wrong, this should be pretty efficient:
tokens_char <- function(str, window = 2) {
  # remove non-word characters
  str <- stringi::stri_replace_all_regex(str, "\\W", "")
  # lowercase
  str <- tolower(str)
  # prep window variable
  win <- window - 1
  len1 <- seq_len(nchar(str) - win)
  # split into strings of length window
  stringi::stri_sub(str, from = len1, to = len1 + win)
}
The key is stringi::stri_sub, which is a vectorised version of substr. The string is split by moving a window over it one character at a time, so "This text" is turned into "th" "hi" "is" "st" "te" "ex" "xt". After doing this, we can use some tidyverse code to count occurrences of tokens:
library(tidyverse)
tibble(token = tokens_char(text, window = 2)) %>%
  count(token, sort = TRUE)
#> # A tibble: 308 × 2
#> token n
#> <chr> <int>
#> 1 en 55
#> 2 er 43
#> 3 ei 35
#> 4 ch 34
#> 5 nd 34
#> 6 in 28
#> 7 te 28
#> 8 be 24
#> 9 un 24
#> 10 de 23
#> # … with 298 more rows
Note that I also included a window argument, which I believe might be useful for your analysis.
tibble(token = tokens_char(text, window = 3)) %>%
  count(token, sort = TRUE)
#> # A tibble: 851 × 2
#> token n
#> <chr> <int>
#> 1 die 16
#> 2 ich 16
#> 3 ein 15
#> 4 end 13
#> 5 sch 13
#> 6 und 12
#> 7 eit 11
#> 8 nde 10
#> 9 cht 9
#> 10 der 9
#> # … with 841 more rows
And finally, you can also first split your string into words so that letters following each other over word boundaries do not count. For example, "This text" is turned into "th" "hi" "is" "te" "ex" "xt":
tokens_char_words <- function(str, window = 2) {
  str <- unlist(tokenizers::tokenize_words(str))
  # prep window variable
  win <- window - 1
  len1 <- lapply(nchar(str) - win, seq_len)
  # split each word into strings of length window
  unlist(stringi::stri_sub_all(str = str, from = len1,
                               to = lapply(len1, function(x) x + win)))
}
tokens_char_words("This text", window = 2)
#> [1] "th" "hi" "is" "te" "ex" "xt"
Created on 2022-02-18 by the reprex package (v2.0.1)


Remove rows containing specific strings

I want to remove rows containing specific strings, which I stored in a separate vector.
I tried everything from Delete rows containing specific strings in R and Remove Rows From Data Frame where a Row matches a String, but it always removes every row and my output is empty.
I tried it with an example and it works fine, but not with my input and my remove_list.
My input is:
My input is:
ID Aufzeichnungen
<dbl> <chr>
1 1 "Aufzeichnungen"
2 1 "07.03.22 A: stechender Schmerz"
3 1 " scharfkantig"
4 1 "D/B:"
5 1 "T:"
6 1 "pat aht an 36 üz distal"
7 1 " seit paartagen"
8 1 "36 vipr++"
9 1 " perk-"
10 1 " keine c zu entdekcne"
11 1 "üz bilfuird"
12 1 "pat aufgekläörtggf RÖ um c auszuschileßen"
13 1 " pat verweigert RÖ aus Angst vor Strahlung"
14 1 " pat"
15 1 "aufgeklärt angst nicht nötig und c unter fllg oder apprx nicht auszuschließen"
16 1 ""
17 1 "pat knirscht"
18 1 " schiene empohlen"
19 1 " pat meldet sich.."
and I want to remove every row containing strings from this list:
remove_list <- paste(c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht", "bespr", "soll",
"kein", "?", "raten", "klären", "überprüf", "erst, wenn", "verweiger",
"notwendig"), collapse = '|')
Logically it should remove rows 10, 12, 13, 15
The code I tried:
removed <- PKV[grep(remove_list, PKV$Aufzeichnungen, invert = TRUE), ]
removed <- PKV %>% filter(!grepl(remove_list, PKV$Aufzeichnungen))
and also every variant with str_detect.
But the output always looks like this:
# A tibble: 0 × 2
# Groups: ID [0]
# … with 2 variables: ID <dbl>, Aufzeichnungen <chr>
Thank you for your help!
We can first grep the indices of the rows that contain one of the remove_list words, then exclude them from your data.frame:
remove_ind <- lapply(strsplit(remove_list, "\\|")[[1]],
                     \(x) grep(x, PKV$Aufzeichnungen, fixed = TRUE)) |>
  unlist() |>
  unique()
remove_ind
#> [1] 12 15 10 13
PKV[-remove_ind, ]
Output:
ID Aufzeichnungen
1 1 Aufzeichnungen
2 1 07.03.22 A: stechender Schmerz
3 1 scharfkantig
4 1 D/B:
5 1 T:
6 1 pat aht an 36 üz distal
7 1 seit paartagen
8 1 36 vipr++
9 1 perk-
11 1 üz bilfuird
14 1 pat
16 1
17 1 pat knirscht
18 1 schiene empohlen
19 1 pat meldet sich..
Try this:
remove_list <- c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht",
                 "bespr", "soll", "kein", "?", "raten", "klären", "überprüf",
                 "erst, wenn", "verweiger", "notwendig")
mylist <- c("notwendig", "einverst", "1", "2")
mylist[!mylist %in% remove_list]
#> [1] "1" "2"
Created on 2022-08-10 by the reprex package (v2.0.1)
I see that you have a question mark in remove_list, but ? has a special meaning in regex, so I suggest you escape it, i.e.
remove_list <- paste(c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht", "bespr", "soll",
"kein", "\\?", "raten", "klären", "überprüf", "erst, wenn", "verweiger",
"notwendig"), collapse = '|')
Then select the right rows by inverting the grepl() result with !:
PKV[!grepl(remove_list, PKV$AUFZEICHNUNGEN),]
Example of escaping ?:
# first rows of your data
dt <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1), AUFZEICHNUNGEN = c("Aufzeichnungen",
  "07.03.22 A: stechender Schmerz", " scharfkantig", "D/B:",
  " keine c zu entdekcne", "pat aufgekläörtggf RÖ um c auszuschileßen",
  " seit paartagen", "36 vipr++")), class = "data.frame", row.names = c(NA, -8L))
# grepl without escaping ?: every row matches
grepl("?", dt$AUFZEICHNUNGEN)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# grepl with escaping ?: no row matches
grepl("\\?", dt$AUFZEICHNUNGEN)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
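A variant that sidesteps regex pitfalls entirely, as a sketch (not taken from the answers above): match every term literally with grepl(fixed = TRUE) and combine the per-term hits with OR, so nothing needs escaping:
terms <- c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht",
           "bespr", "soll", "kein", "?", "klären", "überprüf",
           "erst, wenn", "verweiger", "notwendig")
# one logical vector per term, combined row-wise with OR
hit <- Reduce(`|`, lapply(terms, function(t) grepl(t, PKV$Aufzeichnungen, fixed = TRUE)))
PKV[!hit, ]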

Identify Nouns using Quanteda Corpuses

I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.
My corpus contains texts with full German sentences, and I want to work with the nouns of every text only. One trick in German is to use only the upper-case words, but this fails at the beginning of a sentence.
Text1 <- c("Halle an der Saale ist die grünste Stadt Deutschlands")
Text2 <- c("In Hamburg regnet es immer, das ist also so wie in London.")
Text3 <- c("James Bond trinkt am liebsten Martini")
myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = T)
myDfm <- dfm(myCorpus, tolower = F, remove_numbers = T,
             remove = stopwords("german"), remove_punct = TRUE,
             remove_separators = T)
topfeatures(myDfm, 20)
From this minimal example, I would like to retrieve:
"Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".
I assume I need a dictionary which defines verbs/nouns/etc. and the proper names (James Bond, Hamburg, etc.), or is there a built-in function/dictionary?
Bonus Question: Does the solution work for English texts too?
You need some help from a part-of-speech tagger. Fortunately there is a great one, with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are at the spacyr page.
This code will do what you want:
txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
"In Hamburg regnet es immer, das ist also so wie in London.",
"James Bond trinkt am liebsten Martini")
library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)
head(txtparsed, 20)
#    doc_id sentence_id token_id        token        lemma   pos   tag entity
# 1   text1           1        1        Halle        halle PROPN    NE  LOC_B
# 2   text1           1        2           an           an   ADP  APPR  LOC_I
# 3   text1           1        3          der          der   DET   ART  LOC_I
# 4   text1           1        4        Saale        saale PROPN    NE  LOC_I
# 5   text1           1        5          ist          ist   AUX VAFIN
# 6   text1           1        6          die          die   DET   ART
# 7   text1           1        7      grünste      grünste   ADJ  ADJA
# 8   text1           1        8        Stadt        stadt  NOUN    NN
# 9   text1           1        9 Deutschlands deutschlands PROPN    NE  LOC_B
# 10  text2           1        1           In           in   ADP  APPR
# 11  text2           1        2      Hamburg      hamburg PROPN    NE  LOC_B
# 12  text2           1        3       regnet       regnet  VERB VVFIN
# 13  text2           1        4           es           es  PRON  PPER
# 14  text2           1        5        immer        immer   ADV   ADV
# 15  text2           1        6            ,            , PUNCT    $,
# 16  text2           1        7          das          das  PRON   PDS
# 17  text2           1        8          ist          ist   AUX VAFIN
# 18  text2           1        9         also         also   ADV   ADV
# 19  text2           1       10           so           so   ADV   ADV
# 20  text2           1       11          wie          wie  CONJ KOKOM
(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle" "Saale" "Deutschlands" "Hamburg" "London"
# [6] "James" "Bond" "Martini"
Here, you can see that the nouns you wanted are marked in the simpler pos field as "proper nouns". The tag field is a more detailed, German-language tagset that you could also select from.
The lists of selected nouns can then be used in quanteda:
library("quanteda")
myDfm <- dfm(txt, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE)
head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale grünste Stadt Deutschlands Hamburg
# text1 1 1 1 1 1 0
# text2 0 0 0 0 0 1
# text3 0 0 0 0 0 0
head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale Deutschlands Hamburg London James
# text1 1 1 1 0 0 0
# text2 0 0 0 1 1 0
# text3 0 0 0 0 0 1
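As for the bonus question: the same approach works for English, since spaCy ships English models as well. A minimal sketch, assuming an English model such as en_core_web_sm is installed (model names vary across spaCy versions):
library("spacyr")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse("James Bond drinks Martini in London.", pos = TRUE)
with(parsed, subset(token, pos %in% c("NOUN", "PROPN")))
# e.g. "James" "Bond" "Martini" "London"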

How do I create a dataset with all entries in each row in R? [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column.
Say I have a large dataset where the information is organized by type of entry and the number of occurrences of that type of entry.
Say...
Area Animal Observations
US   Cat    4
NE   Cat    9
US   Dog    2
My question is how would I create a dataset (to do analysis in R) that lists each entry once per observation, like...
Say...
Area Animal
US   Cat
US   Cat
US   Cat
US   Cat
NE   Cat
NE   Cat
...  (nine NE Cat rows in all)
US   Dog
US   Dog
I'm asking because I have a large data set and I'm trying to get one row for each occurrence, rather than having them grouped. Does anyone know how to do this?
Try
library(splitstackshape)
expandRows(df1, 'Observations')
# Area Animal
#1 US Cat
#1.1 US Cat
#1.2 US Cat
#1.3 US Cat
#2 NE Cat
#2.1 NE Cat
#2.2 NE Cat
#2.3 NE Cat
#2.4 NE Cat
#2.5 NE Cat
#2.6 NE Cat
#2.7 NE Cat
#2.8 NE Cat
#3 US Dog
#3.1 US Dog
Index the dataframe by 'rownames' repeated as many times as 'Observations':
> rep(rownames(dat), dat$Observations)
[1] "1" "1" "1" "1" "2" "2" "2" "2" "2" "2" "2" "2" "2" "3" "3"
> dat[ rep(rownames(dat), dat$Observations) , ]
Area Animal Observations
1 US Cat 4
1.1 US Cat 4
1.2 US Cat 4
1.3 US Cat 4
2 NE Cat 9
2.1 NE Cat 9
2.2 NE Cat 9
2.3 NE Cat 9
2.4 NE Cat 9
2.5 NE Cat 9
2.6 NE Cat 9
2.7 NE Cat 9
2.8 NE Cat 9
3 US Dog 2
3.1 US Dog 2
Here's an approach using lapply() and rep():
df <- data.frame(Area = c('US', 'NE', 'US'),
                 Animal = c('Cat', 'Cat', 'Dog'),
                 Observations = c(4, 9, 2))
as.data.frame(lapply(df[-3], rep, df[, 3]))
## Area Animal
## 1 US Cat
## 2 US Cat
## 3 US Cat
## 4 US Cat
## 5 NE Cat
## 6 NE Cat
## 7 NE Cat
## 8 NE Cat
## 9 NE Cat
## 10 NE Cat
## 11 NE Cat
## 12 NE Cat
## 13 NE Cat
## 14 US Dog
## 15 US Dog
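For reference, the tidyverse now has a dedicated verb for this: tidyr::uncount() duplicates each row according to a weights column (and drops that column by default):
library(tidyr)
df <- data.frame(Area = c('US', 'NE', 'US'),
                 Animal = c('Cat', 'Cat', 'Dog'),
                 Observations = c(4, 9, 2))
uncount(df, Observations)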

How to speed up the loop running process in R for huge files

I have a genetics sequencing file with 4 million rows. I'm trying to run a piece of code on the variants of each unique gene listed.
Here is an example of the data:
CHROM POS GENE IMPACT HOM
1 23455 A HIGH HET
1 23675 A HIGH HET
1 23895 A MODERATE
1 24115 B LOW HET
1 24335 B HIGH HET
1 24555 B LOW HET
2 6789 C LOW
2 12346 C LOW HET
2 17903 C MODERATE HET
2 23460 C MODERATE
2 29017 D LOW HET
2 34574 D HIGH
2 40131 D HIGH HET
3 567890 E HIGH HET
3 589076 E HIGH
3 610262 E LOW HET
3 631448 F HIGH HET
3 652634 F MODERATE HET
And here is my code:
sam <- read.csv("../sample/sample1.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
glist <- unique(sam[, 3])
for (i in glist) {
  lice <- subset(sam, GENE == i)
  lice$mut <- as.numeric(ifelse(lice[c(4)] == 'MODERATE' | lice[c(4)] == 'HIGH', c(1), c(0)))
  lice$cntmut <- sum(lice$mut, na.rm = TRUE)
  lice$het <- as.numeric(ifelse(lice$HOM == 'HET', c(1), c(0)))
  lice$cnthet <- sum(lice$het, na.rm = TRUE)
  lice$cnthetmut <- lice$mut + lice$het
  lice$lice <- ifelse(lice$mut == 1 & lice$cntmut >= 2 & lice$het == 1 &
                        lice$cnthet >= 2 & lice$cnthetmut == 2, 'lice', '')
  write.table(lice, paste0("../sample/list/", i, ".txt"), sep = "\t", quote = F, row.names = F)
}
licelist <- list.files("../sample/list/", full.names = T)
lice2 <- do.call("rbind", lapply(licelist, FUN = function(files) {
  read.table(files, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}))
lice_out <- merge(sam, lice2, by.x = c("CHROM", "POS"), by.y = c("CHROM", "POS"), all = T)
write.table(lice_out, "../sample/sample1_lice.txt", sep = "\t", quote = F, row.names = F)
I have 30,000 genes, which means running this code will take about two weeks (the original file is about 4 GB in size). Does anyone have advice on how to speed this up? I've tried writing a function to wrap all of this (some of which is repetitive), but to no avail.
Just to add:
The code in the loop is essentially doing the following:
1. adding up how many variants in each gene are moderate or high, and how many are het;
2. labelling a variant in a gene 'lice' if it is moderate/high and het, and only if there are at least two such variants in the gene.
For this result:
CHROM POS GENE IMPACT HOM LICE
1 23455 A HIGH HET lice
1 23675 A HIGH HET lice
1 23895 A MODERATE
1 24115 B LOW HET
1 24335 B HIGH HET
1 24555 B LOW HET
2 6789 C LOW
2 12346 C LOW HET
2 17903 C MODERATE HET
2 23460 C MODERATE
2 29017 D LOW HET
2 34574 D HIGH
2 40131 D HIGH HET
3 567890 E HIGH HET
3 589076 E HIGH
3 610262 E LOW HET
3 631448 F HIGH HET lice
3 652634 F MODERATE HET lice
Like I mentioned a bit further up, the steps are not all necessary, but they worked at the time when I was using a smaller data frame.
It's a bit difficult to help when you don't explain what you are trying to accomplish or provide an example of the desired result for your sample dataset, but here are a few suggestions:
(1) Use data tables. They are much faster and use memory much more efficiently.
(2) Other than the sums (cntmut, cnthet), I don't see why you split the original table. There are other ways to get the sums without splitting the dataset.
(3) I don't really see the point of the merge at the end.
Here's an option that will likely be much faster.
library(data.table)
dt <- data.table(sam)
setkey(dt, GENE)
dt[, mut := as.numeric(IMPACT == "MODERATE" | IMPACT == "HIGH")]
dt[, cntmut := sum(mut), by = GENE]
dt[, het := as.numeric(HOM == "HET")]
dt[, cnthet := sum(het), by = GENE]
dt[, cnthetmut := mut + het]
dt[, lice := ifelse(mut == 1 & cntmut >= 2 & het == 1 & cnthet >= 2 & cnthetmut == 2,
                    'lice', ''), by = GENE]
head(dt)
# CHROM POS GENE IMPACT HOM mut cntmut het cnthet cnthetmut lice
# 1: 1 23455 A HIGH HET 1 3 1 2 2 lice
# 2: 1 23675 A HIGH HET 1 3 1 2 2 lice
# 3: 1 23895 A MODERATE 1 3 0 2 1
# 4: 1 24115 B LOW HET 0 1 1 3 1
# 5: 1 24335 B HIGH HET 1 1 1 3 2
# 6: 1 24555 B LOW HET 0 1 1 3 1
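For comparison, the same grouped computation can also be written with dplyr; a sketch (data.table will generally be faster on a 4-million-row file, but this likewise avoids the split/write/read/merge cycle):
library(dplyr)
sam %>%
  group_by(GENE) %>%
  mutate(mut = as.numeric(IMPACT %in% c("MODERATE", "HIGH")),
         cntmut = sum(mut),
         het = as.numeric(HOM == "HET"),
         cnthet = sum(het),
         cnthetmut = mut + het,
         lice = ifelse(mut == 1 & cntmut >= 2 & het == 1 &
                         cnthet >= 2 & cnthetmut == 2, "lice", "")) %>%
  ungroup()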

Parsing Deeply Nested JSON Structures in R Using RJSONIO

I suspect I'm missing something obvious here, but how do I parse deeply nested structures in R using RJSONIO?
For example, suppose I want to directly reference $familyName in results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver, as grabbed using the following snippet:
require(RJSONIO)
resultsURL='http://ergast.com/api/f1/2012/1/results.json'
results.data.json=fromJSON(resultsURL)
RJSONIO doesn't appear to want to parse the ..$Results[[i]] data as structured elements?
require(RJSONIO)
somedata1 <- list(a = 1, b = 'w')
fromJSON(toJSON(somedata1))
# $a
# [1] 1
# $b
# [1] "w"
somedata2 <- list(a = 1, b = 2)
fromJSON(toJSON(somedata2))
# a b
# 1 2
somedata3 <- list(a = 'v', b = 'w')
fromJSON(toJSON(somedata3))
#   a   b
# "v" "w"
fromJSON(toJSON(somedata3), simplify = StrictNumeric)
# $a
# [1] "v"
# $b
# [1] "w"
fromJSON(toJSON(somedata2), simplify = FALSE)
# $a
# [1] 1
# $b
# [1] 2
fromJSON(toJSON(somedata3), simplifyWithNames = FALSE)
# $a
# [1] "v"
# $b
# [1] "w"
fromJSON(toJSON(somedata2), simplifyWithNames = FALSE)
# $a
# [1] 1
# $b
# [1] 2
From the examples above: by default RJSONIO simplifies "collections/arrays of homogeneous scalar elements" to R vectors. This simplification can be controlled using simplify or simplifyWithNames. In your example you can do any of the following to access the element you want:
require(RJSONIO)
resultsURL='http://ergast.com/api/f1/2012/1/results.json'
results.data.json=fromJSON(resultsURL)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver['familyName']
# familyName
# "Pérez"
results.data.json=fromJSON(resultsURL,simplify=FALSE)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
results.data.json=fromJSON(resultsURL,simplify=StrictNumeric)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
results.data.json=fromJSON(resultsURL,simplifyWithNames = FALSE)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
The jsonlite package started as a fork of RJSONIO and tries to use a smarter mapping between R and JSON structures. I think this might make your life easier:
> library(jsonlite)
> x = fromJSON('http://ergast.com/api/f1/2012/1/results.json')
> x$MRData$RaceTable$Races$Results[[1]]$Driver
driverId code url
1 button BUT http://en.wikipedia.org/wiki/Jenson_Button
2 vettel VET http://en.wikipedia.org/wiki/Sebastian_Vettel
3 hamilton HAM http://en.wikipedia.org/wiki/Lewis_Hamilton
4 webber WEB http://en.wikipedia.org/wiki/Mark_Webber
5 alonso ALO http://en.wikipedia.org/wiki/Fernando_Alonso
6 kobayashi KOB http://en.wikipedia.org/wiki/Kamui_Kobayashi
7 raikkonen RAI http://en.wikipedia.org/wiki/Kimi_R%C3%A4ikk%C3%B6nen
8 perez PER http://en.wikipedia.org/wiki/Sergio_P%C3%A9rez
9 ricciardo RIC http://en.wikipedia.org/wiki/Daniel_Ricciardo
10 resta DIR http://en.wikipedia.org/wiki/Paul_di_Resta
11 vergne VER http://en.wikipedia.org/wiki/Jean-%C3%89ric_Vergne
12 rosberg ROS http://en.wikipedia.org/wiki/Nico_Rosberg
13 maldonado MAL http://en.wikipedia.org/wiki/Pastor_Maldonado
14 glock GLO http://en.wikipedia.org/wiki/Timo_Glock
15 pic PIC http://en.wikipedia.org/wiki/Charles_Pic
16 bruno_senna SEN http://en.wikipedia.org/wiki/Bruno_Senna
17 massa MAS http://en.wikipedia.org/wiki/Felipe_Massa
18 kovalainen KOV http://en.wikipedia.org/wiki/Heikki_Kovalainen
19 petrov PET http://en.wikipedia.org/wiki/Vitaly_Petrov
20 michael_schumacher MSC http://en.wikipedia.org/wiki/Michael_Schumacher
21 grosjean GRO http://en.wikipedia.org/wiki/Romain_Grosjean
22 hulkenberg HUL http://en.wikipedia.org/wiki/Nico_H%C3%BClkenberg
23 rosa DLR http://en.wikipedia.org/wiki/Pedro_de_la_Rosa
24 karthikeyan KAR http://en.wikipedia.org/wiki/Narain_Karthikeyan
givenName familyName dateOfBirth nationality
1 Jenson Button 1980-01-19 British
2 Sebastian Vettel 1987-07-03 German
3 Lewis Hamilton 1985-01-07 British
4 Mark Webber 1976-08-27 Australian
5 Fernando Alonso 1981-07-29 Spanish
6 Kamui Kobayashi 1986-09-13 Japanese
7 Kimi Räikkönen 1979-10-17 Finnish
8 Sergio Pérez 1990-01-26 Mexican
9 Daniel Ricciardo 1989-07-01 Australian
10 Paul di Resta 1986-04-16 Scottish
11 Jean-Éric Vergne 1990-04-25 French
12 Nico Rosberg 1985-06-27 German
13 Pastor Maldonado 1985-03-09 Venezuelan
14 Timo Glock 1982-03-18 German
15 Charles Pic 1990-02-15 French
16 Bruno Senna 1983-10-15 Brazilian
17 Felipe Massa 1981-04-25 Brazilian
18 Heikki Kovalainen 1981-10-19 Finnish
19 Vitaly Petrov 1984-09-08 Russian
20 Michael Schumacher 1969-01-03 German
21 Romain Grosjean 1986-04-17 French
22 Nico Hülkenberg 1987-08-19 German
23 Pedro de la Rosa 1971-02-24 Spanish
24 Narain Karthikeyan 1977-01-14 Indian
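With jsonlite's data-frame mapping, the field the question asks for is simply a column, so the eighth family name can be pulled out directly:
x$MRData$RaceTable$Races$Results[[1]]$Driver$familyName[8]
# [1] "Pérez"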
