For some reason, stop word removal is not working for my corpus, which is entirely in French. I have been trying repeatedly over the past few days, but many words that should be filtered out simply are not. Has anyone else had a similar issue? I read somewhere that it could be because of the accents, so I tried stringi::stri_trans_general(x, "Latin-ASCII"), but I am not certain I did this correctly. I also notice that French stop words are sometimes referred to as "french" and sometimes as "fr".
Here is one example of the code I tried; I would be extremely grateful for any advice.
I also installed quanteda manually because I had difficulties downloading it, so the problem could be linked to that.
text_corp <- quanteda::corpus(data, text_field = "text")
head(stopwords("french"))
summary(text_corp)
my_dfm <- dfm(text_corp)
myStemMat <- dfm(text_corp, remove = stopwords("french"), stem = TRUE,
                 remove_punct = TRUE, remove_numbers = TRUE,
                 remove_separators = TRUE)
myStemMat[, 1:5]
topfeatures(myStemMat, 20)
In this last step, there are still words like "etre" ("to be"), "plus" ("more"), "comme" ("like"), "avant" ("before"), and "avoir" ("to have").
I also tried to filter stop words in a different way, through token creation:
tokens <- tokens(
  text_corp,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE
)
mydfm <- dfm(tokens,
tolower = TRUE,
stem = TRUE,
remove = stopwords("french")
)
topfeatures(mydfm, 20)
The stopwords are working just fine; the default Snowball list of French stopwords simply does not include the words you wish to remove.
You can see that by inspecting the vector of stopwords returned by stopwords("fr"):
library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE
This is the full list of words:
sort(stopwords("fr"))
## [1] "à" "ai" "aie" "aient" "aies" "ait"
## [7] "as" "au" "aura" "aurai" "auraient" "aurais"
## [13] "aurait" "auras" "aurez" "auriez" "aurions" "aurons"
## [19] "auront" "aux" "avaient" "avais" "avait" "avec"
## [25] "avez" "aviez" "avions" "avons" "ayant" "ayez"
## [31] "ayons" "c" "ce" "ceci" "cela" "celà"
## [37] "ces" "cet" "cette" "d" "dans" "de"
## [43] "des" "du" "elle" "en" "es" "est"
## [49] "et" "étaient" "étais" "était" "étant" "été"
## [55] "étée" "étées" "étés" "êtes" "étiez" "étions"
## [61] "eu" "eue" "eues" "eûmes" "eurent" "eus"
## [67] "eusse" "eussent" "eusses" "eussiez" "eussions" "eut"
## [73] "eût" "eûtes" "eux" "fûmes" "furent" "fus"
## [79] "fusse" "fussent" "fusses" "fussiez" "fussions" "fut"
## [85] "fût" "fûtes" "ici" "il" "ils" "j"
## [91] "je" "l" "la" "le" "les" "leur"
## [97] "leurs" "lui" "m" "ma" "mais" "me"
## [103] "même" "mes" "moi" "mon" "n" "ne"
## [109] "nos" "notre" "nous" "on" "ont" "ou"
## [115] "par" "pas" "pour" "qu" "que" "quel"
## [121] "quelle" "quelles" "quels" "qui" "s" "sa"
## [127] "sans" "se" "sera" "serai" "seraient" "serais"
## [133] "serait" "seras" "serez" "seriez" "serions" "serons"
## [139] "seront" "ses" "soi" "soient" "sois" "soit"
## [145] "sommes" "son" "sont" "soyez" "soyons" "suis"
## [151] "sur" "t" "ta" "te" "tes" "toi"
## [157] "ton" "tu" "un" "une" "vos" "votre"
## [163] "vous" "y"
That's why they are not removed. We can see this with an example I created using many of your words:
toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
remove_punct = TRUE
)
tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "avoir" "glace" "être" "heureux" "comme" "enfant"
## [8] "avant" "dîner"
How to remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.
mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")
tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "glace" "heureux" "enfant" "dîner"
You could also use one of the other stopword sources, such as "stopwords-iso", which does contain all of the words you wish to remove:
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
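For example, a sketch reusing the toks object from above, this time removing the larger ISO list at the tokens stage:
tokens_remove(toks, stopwords("fr", source = "stopwords-iso"))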
With regard to the language question, see the help for ?stopwords::stopwords, which states:
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
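In other words, "french" and "fr" should select the same Snowball list, which you can verify (assuming the default source); this should return TRUE:
identical(stopwords("fr"), stopwords("french"))
## [1] TRUE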
With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"), this would only help you if you wanted to remove "etre" and your stopword list contained only "être". In the example below, the stopword vector containing the accented character is concatenated with a version of itself in which the accents have been removed.
sw <- "être"
tokens("etre être heureux") %>%
tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre" "heureux"
tokens("etre être heureux") %>%
tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"
c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"
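If you wanted to be safe against both accented and unaccented variants across the whole list, the same trick could be applied to every stopword (a sketch, extending the approach above):
mystopwords_all <- unique(c(
  stopwords("fr"),
  stringi::stri_trans_general(stopwords("fr"), "Latin-ASCII")
))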
I have two vectors, one that contains a list of variables, and one that contains dates, such as
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
I want to merge them to get a vector with each variable indexed by my dates; that is, my desired output is
> Colonnes_Pays_Principaux
 [1] "PIB_2000"                        "PIB_2001"                        "ConsommationPrivee_2000"
 [4] "ConsommationPrivee_2001"         "ConsommationPubliques_2000"      "ConsommationPubliques_2001"
 [7] "FBCF_2000"                       "FBCF_2001"                       "ProductionIndustrielle_2000"
[10] "ProductionIndustrielle_2001"     "Inflation_2000"                  "Inflation_2001"
[13] "InflationSousJacente_2000"       "InflationSousJacente_2001"       "PrixProductionIndustrielle_2000"
[16] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2000"         "CoutHoraireTravail_2001"
Is there a simpler / more readable way than the double for loop I have tried (successfully) below?
Colonnes_Pays_Principaux <- vector()
for (Variable in seq_along(Variables_Pays)) {
  for (Annee in seq_along(Annee_Pays)) {
    Colonnes_Pays_Principaux <-
      append(Colonnes_Pays_Principaux,
             paste(Variables_Pays[Variable], Annee_Pays[Annee], sep = "_"))
  }
}
expand.grid will create a data frame with all combinations of the two vectors.
with(
expand.grid(Variables_Pays, Annee_Pays),
paste0(Var1, "_", Var2)
)
#> [1] "PIB_2000" "ConsommationPrivee_2000"
#> [3] "ConsommationPubliques_2000" "FBCF_2000"
#> [5] "ProductionIndustrielle_2000" "Inflation_2000"
#> [7] "InflationSousJacente_2000" "PrixProductionIndustrielle_2000"
#> [9] "CoutHoraireTravail_2000" "PIB_2001"
#> [11] "ConsommationPrivee_2001" "ConsommationPubliques_2001"
#> [13] "FBCF_2001" "ProductionIndustrielle_2001"
#> [15] "Inflation_2001" "InflationSousJacente_2001"
#> [17] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2001"
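Note that expand.grid varies its first argument fastest, so this output is grouped by year. To get the OP's ordering (years grouped within each variable), one option is to swap the arguments:
with(
  expand.grid(Annee_Pays, Variables_Pays),
  paste0(Var2, "_", Var1)
)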
We can use outer to build a matrix of all combinations; transposing it before flattening with c() keeps the two years for each variable adjacent, matching your desired order:
c(t(outer(Variables_Pays, Annee_Pays, paste, sep = '_')))
# [1] "PIB_2000" "PIB_2001"
# [3] "ConsommationPrivee_2000" "ConsommationPrivee_2001"
# [5] "ConsommationPubliques_2000" "ConsommationPubliques_2001"
# [7] "FBCF_2000" "FBCF_2001"
# [9] "ProductionIndustrielle_2000" "ProductionIndustrielle_2001"
#[11] "Inflation_2000" "Inflation_2001"
#[13] "InflationSousJacente_2000" "InflationSousJacente_2001"
#[15] "PrixProductionIndustrielle_2000" "PrixProductionIndustrielle_2001"
#[17] "CoutHoraireTravail_2000" "CoutHoraireTravail_2001"
No real need to go beyond the basics here! Use paste for pasting the strings and rep to repeat either Annee_Pays or Variables_Pays to get all combinations:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
# To get this in the same order as in your example:
paste(rep(Variables_Pays, rep(2, length(Variables_Pays))), Annee_Pays, sep = "_")
# Alternative order:
paste(Variables_Pays, rep(Annee_Pays, rep(length(Variables_Pays), 2)), sep = "_")
# Or, if order doesn't matter too much:
paste(Variables_Pays, rep(Annee_Pays, length(Variables_Pays)), sep = "_")
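The first call can also be written with rep's each argument, which some find easier to read (a sketch equivalent to the line above, generalizing the hard-coded 2):
paste(rep(Variables_Pays, each = length(Annee_Pays)), Annee_Pays, sep = "_")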
In base R:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
cbind(paste(Variables_Pays, Annee_Pays, sep = "_"),
      paste(Variables_Pays, rev(Annee_Pays), sep = "_"))
When I iterate over dates in a loop, R prints out the numeric coding of the dates.
For example:
dates <- as.Date(c("1939-06-10", "1932-02-22", "1980-03-13", "1987-03-17",
"1988-04-14", "1979-08-28", "1992-07-16", "1989-12-11"), tryFormats = c("%Y-%m-%d"))
for(d in dates){
print(d)
}
The output is as follows:
[1] -11163
[1] -13828
[1] 3724
[1] 6284
[1] 6678
[1] 3526
[1] 8232
[1] 7284
How do I get R to print out the actual dates, so that the output reads:
[1] "1939-06-10"
[1] "1932-02-22"
[1] "1980-03-13"
[1] "1987-03-17"
[1] "1988-04-14"
[1] "1979-08-28"
[1] "1992-07-16"
[1] "1989-12-11"
Thank you!
When you use dates as the seq in a for loop in R, they lose their attributes.
You can use as.vector to strip attributes and see for yourself (or dput to see under the hood on the full object):
as.vector(dates)
# [1] -11163 -13828 3724 6284 6678 3526 8232 7284
dput(dates)
# structure(c(-11163, -13828, 3724, 6284, 6678, 3526, 8232, 7284), class = "Date")
In R, Date objects are just numeric vectors with class Date (class is an attribute).
Hence you're seeing numbers (FWIW, these numbers count days since 1970-01-01).
To restore the Date attribute, you can use the .Date function:
for (d in dates) print(.Date(d))
# [1] "1939-06-10"
# [1] "1932-02-22"
# [1] "1980-03-13"
# [1] "1987-03-17"
# [1] "1988-04-14"
# [1] "1979-08-28"
# [1] "1992-07-16"
# [1] "1989-12-11"
This is equivalent to as.Date(d, origin = '1970-01-01'), the numeric method for as.Date.
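So, assuming the same dates vector as above, this loop prints the same output:
for (d in dates) print(as.Date(d, origin = "1970-01-01"))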
Funnily enough, *apply functions don't strip attributes:
invisible(lapply(dates, print))
# [1] "1939-06-10"
# [1] "1932-02-22"
# [1] "1980-03-13"
# [1] "1987-03-17"
# [1] "1988-04-14"
# [1] "1979-08-28"
# [1] "1992-07-16"
# [1] "1989-12-11"
There are multiple ways you can handle this:
Loop over the indices of dates:
for(d in seq_along(dates)){
print(dates[d])
}
#[1] "1939-06-10"
#[1] "1932-02-22"
#[1] "1980-03-13"
#[1] "1987-03-17"
#[1] "1988-04-14"
#[1] "1979-08-28"
#[1] "1992-07-16"
#[1] "1989-12-11"
Or convert the dates to a list and then print directly:
for(d in as.list(dates)) {
print(d)
}
I have a word and want to output into a vector all possible deviations (deletion, substitution, insertion) within a fixed distance value.
For instance, the word "cat" with a fixed distance value of 1 results in a vector with elements such as "cot", "at", ...
I'm going to assume that you want all actual words, not just permutations of the characters with an edit distance of 1 that would include non-words such as "zat".
We can do this using adist() to compute the edit distance between your target word and all eligible English words, taken from some word list. Here, I used the English syllable dictionary from the quanteda package (you did tag this question as quanteda after all) but this could have been any vector of English dictionary words from any other source as well.
To narrow things down, we first exclude all words whose length differs from that of the target word by more than your distance value.
distfn <- function(word, distance = 1) {
# select eligible words for efficiency
eligible_y_words <- names(quanteda::data_int_syllables)
wordlengths <- nchar(eligible_y_words)
eligible_y_words <- eligible_y_words[wordlengths >= (nchar(word) - distance) &
wordlengths <= (nchar(word) + distance)]
# compute Levenshtein distance
distances <- utils::adist(word, eligible_y_words)[1, ]
# return only those for the requested distance value
eligible_y_words[distances == distance]
}
distfn("cat", 1)
## [1] "at" "bat" "ca" "cab" "cac" "cad" "cai" "cal" "cam" "can"
## [11] "cant" "cao" "cap" "caq" "car" "cart" "cas" "cast" "cate" "cato"
## [21] "cats" "catt" "cau" "caw" "cay" "chat" "coat" "cot" "ct" "cut"
## [31] "dat" "eat" "fat" "gat" "hat" "kat" "lat" "mat" "nat" "oat"
## [41] "pat" "rat" "sat" "scat" "tat" "vat" "wat"
To demonstrate how this works on longer words, here it is with alternative distance values:
distfn("coffee", 1)
## [1] "caffee" "coffeen" "coffees" "coffel" "coffer" "coffey" "cuffee"
## [8] "toffee"
distfn("coffee", 2)
## [1] "caffey" "calfee" "chafee" "chaffee" "cofer" "coffee's"
## [7] "coffelt" "coffers" "coffin" "cofide" "cohee" "coiffe"
## [13] "coiffed" "colee" "colfer" "combee" "comfed" "confer"
## [19] "conlee" "coppee" "cottee" "coulee" "coutee" "cuffe"
## [25] "cuffed" "diffee" "duffee" "hoffer" "jaffee" "joffe"
## [31] "mcaffee" "moffet" "noffke" "offen" "offer" "roffe"
## [37] "scoffed" "soffel" "soffer" "yoffie"
(Yes, according to the CMU pronunciation dictionary, those are all actual words...)
EDIT: Generate all permutations of letters, not just actual words
This involves permutations from the alphabet that have the fixed edit distance from the input word. Here I've done it not particularly efficiently, by forming all permutations of letters within the eligible length range, then computing their edit distance from the target word, and selecting those at the requested distance. So it's a variation of the above, except that instead of a dictionary, it uses permuted letter sequences.
distfn2 <- function(word, distance = 1) {
result <- character()
# start with deletions
for (i in max((nchar(word) - distance), 0):(nchar(word) - 1)) {
result <- c(
result,
combn(unlist(strsplit(word, "", fixed = TRUE)), i,
paste,
collapse = "", simplify = TRUE
)
)
}
# now for changes and insertions
for (i in (nchar(word)):(nchar(word) + distance)) {
# all possible edits
edits <- apply(expand.grid(rep(list(letters), i)),
1, paste0,
collapse = ""
)
# remove original word
edits <- edits[edits != word]
# get all distances, add to result
distances <- utils::adist(word, edits)[1, ]
result <- c(result, edits[distances == distance])
}
result
}
For the OP example:
distfn2("cat", 1)
## [1] "ca" "ct" "at" "caa" "cab" "cac" "cad" "cae" "caf" "cag"
## [11] "cah" "cai" "caj" "cak" "cal" "cam" "can" "cao" "cap" "caq"
## [21] "car" "cas" "aat" "bat" "dat" "eat" "fat" "gat" "hat" "iat"
## [31] "jat" "kat" "lat" "mat" "nat" "oat" "pat" "qat" "rat" "sat"
## [41] "tat" "uat" "vat" "wat" "xat" "yat" "zat" "cbt" "cct" "cdt"
## [51] "cet" "cft" "cgt" "cht" "cit" "cjt" "ckt" "clt" "cmt" "cnt"
## [61] "cot" "cpt" "cqt" "crt" "cst" "ctt" "cut" "cvt" "cwt" "cxt"
## [71] "cyt" "czt" "cau" "cav" "caw" "cax" "cay" "caz" "cata" "catb"
## [81] "catc" "catd" "cate" "catf" "catg" "cath" "cati" "catj" "catk" "catl"
## [91] "catm" "catn" "cato" "catp" "catq" "catr" "cats" "caat" "cbat" "acat"
## [101] "bcat" "ccat" "dcat" "ecat" "fcat" "gcat" "hcat" "icat" "jcat" "kcat"
## [111] "lcat" "mcat" "ncat" "ocat" "pcat" "qcat" "rcat" "scat" "tcat" "ucat"
## [121] "vcat" "wcat" "xcat" "ycat" "zcat" "cdat" "ceat" "cfat" "cgat" "chat"
## [131] "ciat" "cjat" "ckat" "clat" "cmat" "cnat" "coat" "cpat" "cqat" "crat"
## [141] "csat" "ctat" "cuat" "cvat" "cwat" "cxat" "cyat" "czat" "cabt" "cact"
## [151] "cadt" "caet" "caft" "cagt" "caht" "cait" "cajt" "cakt" "calt" "camt"
## [161] "cant" "caot" "capt" "caqt" "cart" "cast" "catt" "caut" "cavt" "cawt"
## [171] "caxt" "cayt" "cazt" "catu" "catv" "catw" "catx" "caty" "catz"
It also works with other edit distances, although it becomes very slow for longer words.
d2 <- distfn2("cat", 2)
set.seed(100)
c(head(d2, 50), sample(d2, 50), tail(d2, 50))
## [1] "c" "a" "t" "ca" "ct" "at" "aaa" "baa"
## [9] "daa" "eaa" "faa" "gaa" "haa" "iaa" "jaa" "kaa"
## [17] "laa" "maa" "naa" "oaa" "paa" "qaa" "raa" "saa"
## [25] "taa" "uaa" "vaa" "waa" "xaa" "yaa" "zaa" "cba"
## [33] "aca" "bca" "cca" "dca" "eca" "fca" "gca" "hca"
## [41] "ica" "jca" "kca" "lca" "mca" "nca" "oca" "pca"
## [49] "qca" "rca" "cnts" "cian" "pcatb" "cqo" "uawt" "hazt"
## [57] "cpxat" "aaet" "ckata" "caod" "ncatl" "qcamt" "cdtp" "qajt"
## [65] "bckat" "qcatr" "cqah" "rcbt" "cvbt" "bbcat" "vcaz" "ylcat"
## [73] "cahz" "jcgat" "mant" "jatd" "czlat" "cbamt" "cajta" "cafp"
## [81] "cizt" "cmaut" "qwat" "jcazt" "hdcat" "ucant" "hate" "cajtl"
## [89] "caaty" "cix" "nmat" "cajit" "cmnat" "caobt" "catoi" "ncau"
## [97] "ucoat" "ncamt" "jath" "oats" "chatz" "ciatz" "cjatz" "ckatz"
## [105] "clatz" "cmatz" "cnatz" "coatz" "cpatz" "cqatz" "cratz" "csatz"
## [113] "ctatz" "cuatz" "cvatz" "cwatz" "cxatz" "cyatz" "czatz" "cabtz"
## [121] "cactz" "cadtz" "caetz" "caftz" "cagtz" "cahtz" "caitz" "cajtz"
## [129] "caktz" "caltz" "camtz" "cantz" "caotz" "captz" "caqtz" "cartz"
## [137] "castz" "cattz" "cautz" "cavtz" "cawtz" "caxtz" "caytz" "caztz"
## [145] "catuz" "catvz" "catwz" "catxz" "catyz" "catzz"
This could be sped up by forming the candidates with less brute force: rather than generating all permutations and then applying adist() to them, the deletions, changes, and insertions at a known edit distance could be generated algorithmically from the letters.
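For example, here is a minimal sketch of that idea for distance 1. The function name edits1 and the split-based construction are borrowed from the classic spelling-corrector trick, not from the answer above, so treat it as illustrative:
edits1 <- function(word, alphabet = letters) {
  n <- nchar(word)
  # every way to split the word into a left part and a right part
  splits <- lapply(0:n, function(i)
    c(substr(word, 1, i), substr(word, i + 1, n)))
  # deletions: drop the first character of the right part
  deletions <- sapply(splits[-(n + 1)], function(s)
    paste0(s[1], substr(s[2], 2, nchar(s[2]))))
  # substitutions: swap the first character of the right part for each letter
  substitutions <- unlist(lapply(splits[-(n + 1)], function(s)
    paste0(s[1], alphabet, substr(s[2], 2, nchar(s[2])))))
  # insertions: place each letter between the two parts
  insertions <- unlist(lapply(splits, function(s)
    paste0(s[1], alphabet, s[2])))
  setdiff(unique(c(deletions, substitutions, insertions)), word)
}
Each candidate is built directly at distance 1, so no 26^n grid or adist() call is needed; for distance 2, one could apply edits1 to its own output and keep only the strings at the desired distance.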
Question:
How to filter rows based on a nested data frame using dplyr::filter
Problem:
The following code sets up an example dataset to provide a working example.
Using the example code I can subset using which, but I am having a problem using dplyr due to the nested data frames.
Now, I appreciate that I could flatten the data frame using jsonlite; however, I am interested to know whether and how I might harness dplyr without flattening the data frame.
All help gratefully received and appreciated.
requiredPackages <- c("devtools","dplyr","tidyr","data.table","ggplot2","ggvis","RMySQL", "jsonlite", "psych", "plyr", "knitr")
ipak <- function(pkg)
{
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
dataDir <- "./data"
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/yelp_dataset_challenge_academic_dataset.zip"
filePath <- file.path(dataDir)
# Does the directory exist? If it doesn't, create it
if (!file.exists(dataDir)) {
dir.create(dataDir)
}
# Now we check if we have downloaded the data already into
# "./data/yelp_dataset_challenge_academic_dataset". If not, then we download the
# zip file... and extract it under the data directory as
# './data/yelp_dataset_challenge_academic_dataset'...
if (!file.exists( file.path(dataDir,"yelp_dataset_challenge_academic_dataset"))) {
temp <- tempfile()
download.file(fileUrl, temp, mode = "wb", method = "curl")
unzip(temp, exdir = dataDir)
unlink(temp)
}
if ( !exists("yelpBusinessData") )
{
if (file.exists( file.path(dataDir,"yelpBusinessData.rds"))) {
yelpBusinessData <- readRDS(file.path(dataDir,"yelpBusinessData.rds"))
} else {
yelpBusinessDataFilePath <- file.path(dataDir,
"yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json")
yelpBusinessData <- fromJSON(sprintf("[%s]",
paste(readLines(yelpBusinessDataFilePath),
collapse = ",")),
flatten = FALSE)
str(yelpBusinessData, max.level = 1)
# Fix the column name duplication issue
# If and when you flatten the data, you create two columns with the same column id
#
# i.e. yelpBusinessData$attributes.Good.for.kids
#
# This fixes the issue by renaming the first column...
#
colnames(yelpBusinessData$attributes)[6] <- "Price_Range"
colnames(yelpBusinessData$attributes)[7] <- "Good_For_Kids"
saveRDS( yelpBusinessData, file.path(dataDir, "yelpBusinessData.rds"))
}
}
The above code loads the example dataframe.
Here is an example of the problem I mentioned above. The first code example works, harnessing which to select four records. The problem is how to do the same with dplyr::filter - what am I missing? Specifically, how do you dereference nested data frames?
# Extract the Phoenix subset using `which`
yelpBusinessData.PA <- yelpBusinessData[which(yelpBusinessData$city == "Phoenix"),]
yelpBusinessData.PA.rest <- yelpBusinessData.PA[which(grepl("Restaurants",
yelpBusinessData.PA$categories)),]
Exp <- yelpBusinessData.PA.rest[which(yelpBusinessData.PA.rest$attributes$Price_Range == 4),]
dim(Exp)
Result - Four records selected :-)
> dim(Exp)
[1] 4 15
Question: How to do this with dplyr?
yelpBusinessData.PA.rest <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes$Price_Range == 4)
The above code fails... now if I flatten the data frame I can get this to work correctly, but...
Note the subtle change from: "attributes$Price_Range" to "attributes.Price_Range".
yelpBusinessData2 <- flatten(yelpBusinessData, recursive = TRUE)
dim(yelpBusinessData2)
Exp2 <- yelpBusinessData2 %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes.Price_Range == 4)
dim(Exp2)
My goal, however, is to understand how to do this without flattening the nested data frames.
I.e.: **How to use dplyr with nested data frames?**
What am I missing here? :-)
One potential answer that I have tried is to index the nested data frame using [[]]; this does work, but you lose the elegance of dplyr...
Is there a better way?
Exp2 <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes[[6]] == 4)
The above indexes into attributes$Price_Range and returns the correct result with nested data frames; Price_Range is the 6th column of the attributes data frame...
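A slightly more readable variant of the same workaround (a sketch; it should behave the same as the positional [[6]]) is to index the nested data frame by name:
Exp2 <- yelpBusinessData %>%
  filter(city == "Phoenix") %>%
  filter(grepl("Restaurants", categories)) %>%
  filter(attributes[["Price_Range"]] == 4)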
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitcitations_1.0.6 pander_0.5.2 plyr_1.8.3 jsonlite_0.9.16 ggvis_0.4.2.9000
[6] tidyr_0.2.0 devtools_1.8.0 qmap_1.0-3 fitdistrplus_1.0-4 knitr_1.11
[11] dplyr_0.4.3.9000 data.table_1.9.4 psych_1.5.6 mapproj_1.2-4 maptools_0.8-36
[16] rworldmap_1.3-1 sp_1.1-1 maps_2.3-11 ggmap_2.5.2 ggplot2_1.0.1
[21] RMySQL_0.10.5 DBI_0.3.1 setwidth_1.0-4 colorout_1.1-1 vimcom_1.2-3
loaded via a namespace (and not attached):
[1] httr_1.0.0 splines_3.2.2 shiny_0.12.2 assertthat_0.1 highr_0.5
[6] yaml_2.1.13 lattice_0.20-33 chron_2.3-47 digest_0.6.8 RefManageR_0.8.63
[11] colorspace_1.2-6 htmltools_0.2.6 httpuv_1.3.3 XML_3.98-1.3 bibtex_0.4.0
[16] xtable_1.7-4 scales_0.3.0 jpeg_0.1-8 git2r_0.11.0 lazyeval_0.1.10.9000
[21] mnormt_1.5-3 proto_0.3-10 survival_2.38-3 RJSONIO_1.3-0 magrittr_1.5
[26] mime_0.3 memoise_0.2.1 evaluate_0.7.2 MASS_7.3-43 xml2_0.1.1
[31] foreign_0.8-66 ggthemes_2.2.1 rsconnect_0.4.1.4 tools_3.2.2 geosphere_1.4-3
[36] RgoogleMaps_1.2.0.7 formatR_1.2 stringr_1.0.0 munsell_0.4.2 rversions_1.0.2
[41] grid_3.2.2 RCurl_1.95-4.7 rstudioapi_0.3.1 rjson_0.2.15 spam_1.0-1
[46] bitops_1.0-6 labeling_0.3 rmarkdown_0.7 gtable_0.1.2 curl_0.9.3
[51] reshape2_1.4.1 R6_2.1.1 lubridate_1.3.3 stringi_0.5-5 parallel_3.2.2
[56] Rcpp_0.12.0 fields_8.2-1 png_0.1-7
There are at least 3 different parts to this question, each of which has very likely been answered well (& thoroughly) elsewhere on SO.
These are:
How to work with a "messy" data.frame in R/dplyr?
The example you give here is messier than a 'nested' data.frame since it contains list-columns as well as data-frame-columns containing data-frame-columns.
How to clean up a "messy" data.frame in R/dplyr?
Is there a better way to work with these data, maintaining their hierarchy?
Working with a "messy" data frame in R/dplyr?
Generally, and particularly when starting out, I take an approach of iteratively cleaning my data. This means I first identify the columns I most need to work with and the columns that are most problematic, and clean only those at the intersection.
Specifically:
Filter out any column that is problematic but not important
Focus my effort on any column that is problematic AND important
Keep any column that is important and not problematic
Aside: This leaves a fourth group of columns that are BOTH unimportant and not problematic. What you do with these depends on the problem. For example, if I'm preparing a production database, I will exclude them and only include the "cleaned" columns (#2 and #3 above). If I'm doing an exploratory analysis I'll include them since I may change my mind about their importance down the line.
In this example you give, the problematic columns are those that contain data.frames. (These are problematic because they break compatibility with dplyr -- not because they are messy).
You can filter them out using dplyr::select_if:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::filter(city == 'Phoenix')
After that, the other dplyr operators in your example will work, provided they don't reference data in columns that are data.frames (for example, the attributes). This brings me to part II...
How to clean a "messy" data.frame in R/dplyr?
One way to handle the "messy" data-frame columns in this data would be to flatten each one & join it back to the original data frame.
Taking the attributes column as an example, we can use jsonlite::flatten on this sub-data-frame & then join it back to our original:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::bind_cols(jsonlite::flatten(yelpBusinessData$attributes, recursive = T)) %>%
dplyr::filter(city == 'Phoenix') %>%
dplyr::filter(grepl("Restaurants", categories)) %>%
dplyr::filter(Price_Range == 4)
The hours component, however, you might want to handle differently. In this example, the hours data.frame contains a data.frame for each day of the week with two fields ("open" and "close"). Here I use purrr::map to apply a function to each day's column, simplifying each data.frame into a character vector.
hours <-
yelpBusinessData$hours %>%
purrr::map(. %>%
dplyr::transmute(hours = stringr::str_c(open, close, sep = ' - ')) %>%
unlist()) %>%
tibble::as_tibble()
This produces a data.frame with the start - stop time for each day of the week:
> str(hours)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 61184 obs. of 7 variables:
$ Tuesday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Friday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Monday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Wednesday: chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Thursday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Sunday : chr NA NA NA "11:00 - 18:00" ...
$ Saturday : chr NA NA NA "10:00 - 21:00" ...
Similarly, one could use map2_dfc (which automagically calls bind_cols after mapping) to collapse the data frames of this object:
hours <- yelpBusinessData$hours %>%
purrr::map2_dfc(.x = .,
.y = names(.),
.f = ~ .x %>%
dplyr::rename_all(funs(stringr::str_c(.y, ., sep = '_'))))
This produces a single data.frame with the day-specific start & stop times:
> str(hours)
'data.frame': 61184 obs. of 14 variables:
$ Tuesday_close : chr "17:00" NA NA "21:00" ...
$ Tuesday_open : chr "08:00" NA NA "10:00" ...
$ Friday_close : chr "17:00" NA NA "21:00" ...
$ Friday_open : chr "08:00" NA NA "10:00" ...
$ Monday_close : chr "17:00" NA NA "21:00" ...
$ Monday_open : chr "08:00" NA NA "10:00" ...
$ Wednesday_close: chr "17:00" NA NA "21:00" ...
$ Wednesday_open : chr "08:00" NA NA "10:00" ...
$ Thursday_close : chr "17:00" NA NA "21:00" ...
$ Thursday_open : chr "08:00" NA NA "10:00" ...
$ Sunday_close : chr NA NA NA "18:00" ...
$ Sunday_open : chr NA NA NA "11:00" ...
$ Saturday_close : chr NA NA NA "21:00" ...
$ Saturday_open : chr NA NA NA "10:00" ...
Rather than put important information in the field names, however, you might prefer to "denormalize" some data to produce a tidier structure:
> purrr::flatten_dfr(yelpBusinessData$hours, .id = 'day')
# A tibble: 61,184 x 3
day close open
<chr> <chr> <chr>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 1 21:00 10:00
5 1 16:00 10:00
6 1 NA NA
7 1 NA NA
8 1 NA NA
9 1 NA NA
10 1 02:00 08:00
# ... with 61,174 more rows
Is there a better way to filter these data, maintaining their original hierarchy?
At the end of the day, there is a fundamental problem in your original data structure. A data.frame in R is implemented as a list of equal-length column vectors, and yet your data are stored as a data.frame of data.frames. This leads to confusion when indexing into various parts of the structure.
This is a little unorthodox, but one option is to keep your data as a list of lists rather than convert to a data.frame right away. Using the tools in purrr package, you can work with lists pretty easily to filter/flatten your data and then construct a data.frame from the filtered results.
For example:
> ## read in yelpBusinessData without converting to data.frame
> yelpBusinessData2 <- fromJSON(sprintf("[%s]",
paste(readLines(yelpBusinessDataFilePath),
collapse = ",")),
flatten = FALSE,
simplify = FALSE)
# filter to Phoenix cities _before_ converting to a data.frame
> yelpBusinessData2 %>%
purrr::keep(~ .$'city' == 'Phoenix'
&& grepl("Restaurants", .$categories)) %>%
jsonlite:::simplify(., flatten = T) %>%
dplyr::select(business_id, full_address, contains('kids')) %>%
str()
'data.frame': 8410 obs. of 5 variables:
$ business_id : chr "vcNAWiLM4dR7D2nwwJ7nCA" "x5Mv61CnZLohZWxfCVCPTQ" "2ZnCITVa0abGce4gZ6RhIw" "EmzaQR5hQlF0WIl24NxAZA" ...
$ full_address : chr "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" "2819 N Central Ave\nPhoenix, AZ 85004" "1850 N Central Ave\nPhoenix, AZ 85004" "132 E Washington St\nPhoenix, AZ 85004" ...
$ attributes.Good for Kids : logi NA FALSE TRUE FALSE NA NA ...
$ attributes.Good For Kids : logi NA NA NA NA NA NA ...
$ attributes.Hair Types Specialized In.kids: logi NA NA NA NA NA NA ...
As a final thought: if you're still left with variable-naming issues, take a look at the janitor package in R, specifically the clean_names() function. This package has some nice features for working with messy data, particularly data read in from Excel.
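For instance, here is a small sketch (with a made-up two-column data frame, not the Yelp data) of how clean_names() should handle the duplicated column names mentioned above, snake-casing them and de-duplicating with a numeric suffix:
library(janitor)
df <- data.frame(`Good for Kids` = NA, `Good For Kids` = NA,
                 check.names = FALSE)
names(clean_names(df))
## [1] "good_for_kids"   "good_for_kids_2"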