I want to combine all the files in a folder into one data frame. All the files are identically structured. An example:
{"title": "Olijvenpest duikt op", "source": "De Standaard", "source_page": "18", "date": "2018-10-16", "body": "In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden"}
R is able to see all the files with the following code:
library(jsonlite)
path <- "..."
files <- dir(path, pattern = "*", full.names = TRUE)
However, I'm unable to combine all the files. I have tried numerous options discussed on stackoverflow, but they always result in errors.
Each JSON file should become one observation with the variables title, source, date, ...
Read all these files into a character vector:
x <- character(length(files))
for (i in seq_along(files)) {
  x[i] <- readLines(files[i])
}
Create the JSON representation of the dataframe:
json <- paste0("[", paste0(x, collapse = ","), "]")
Now parse it with jsonlite: fromJSON(json) returns a data frame with one row per file.
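Putting the pieces together, here is a minimal end-to-end sketch (assuming each file contains exactly one JSON object on a single line, as in your example; the pattern "\\.json$" is only an illustration):

library(jsonlite)

path  <- "..."                                   # folder containing the .json files
files <- dir(path, pattern = "\\.json$", full.names = TRUE)

# read each one-line file, wrap the objects in [...] and parse them as one array
x    <- vapply(files, readLines, character(1), warn = FALSE)
json <- paste0("[", paste0(x, collapse = ","), "]")
df   <- fromJSON(json)   # one row per file: title, source, source_page, date, body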
First, you read in the JSON files with
files <- dir(pattern = "*.json", full.names = TRUE)
JSON <- lapply(files, function(z) rjson::fromJSON(file = z))
Note that this requires your .json files to be in your current working directory, since dir() is called here without a path.
Then you transform the read-in files with the help of the tidyverse's dplyr like so
library(dplyr)

JSON_New <- lapply(JSON, function(x) {
  cols <- names(x)
  x %>%
    unique() %>%
    unlist() %>%
    as.data.frame() %>%
    tibble::rownames_to_column("Content") %>%
    mutate(Category = cols) %>%
    select(3, 2)
})
so that you receive a list of data.frames, which you can now name according to the filenames with
names(JSON_New) <- files
For example, if I use your example file twice, then after following the steps described so far I get the following output.
[[1]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
[[2]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
Then, to combine the list of data.frames such that one JSON file becomes one row, you can do
> A <- JSON_New %>% purrr::reduce(inner_join, by = "Category") %>% data.table::transpose()
> A
V1 V2 V3 V4 V5
1 title source source_page date body
2 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
3 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
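As a side note, if all you need is the one-row-per-file data frame, a shorter route is to parse each file with jsonlite and stack the results; a sketch (the object name combined is my own):

library(jsonlite)
library(dplyr)

# parse each file into a one-row data frame and bind the rows together
combined <- files %>%
  lapply(function(f) as.data.frame(fromJSON(f), stringsAsFactors = FALSE)) %>%
  bind_rows()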
I have messy data. I want to subset the data from the row where a phrase appears in a column through to the end.
df1 <- data.frame(
  V1 = c("No. de Control Interno de", "la Partida / Concepto de Obra", "",
         "LO-009J0U004-", "E50-2021", ""),
  V2 = c("", "Descripción Breve", "Trabajos de señalamiento horizontal en puente de",
         "cuota \"El Zacatal\", consistentes en suministro y",
         "aplicación de pintura de tránsito, suministro y",
         "colocación de botones y ménsulas reflejantes."),
  V3 = c("", "ClaveCUCOP", "", "", "62502002", ""),
  V4 = c("Unidad", "Observaciones de Medida", "", "", "Obra", ""),
  V5 = c("", "Cantidad", "", "", "1", "")
)
Whenever the phrase Descripción appears in V2, the code should subset the dataframe from that row to the end. For example, in the case above this means selecting rows 2 through 6. I was trying with str_detect from the stringr package.
You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If instead you save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly uses the first value in x in case str_detect returns TRUE in more than one place.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
V1 V2 V3 V4 V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.
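If you prefer a dplyr pipeline, a sketch of the same idea (assuming dplyr and stringr are loaded) keeps every row from the first match onward:

library(dplyr)
library(stringr)

# cumsum() is 0 before the first match and >= 1 from the matching row onward
df1 %>% filter(cumsum(str_detect(V2, "Descripción")) >= 1)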
I have a dataframe called Q1Dummy consisting of 2 columns: respondent ID and the responses they gave as strings.
It looks like this:
resp_id Q1
1 Ik vind het niet helemaal netjes om je sociale huurwoning te verhuren, aangezien je dan mensen passeert die al lang op de wachtrij staan of er meer recht op hebben.
2 Ja dat vind ik heel goed omdat mensen die al heel lang op zoek zijn ook een huisje kunnen krijgen.
3 Ik vind het iets begrijpelijks. Als je in de sociale huur zit, geeft het al aan dat je een klein inkomen hebt. Het is fijn om de woning dan achter de hand te hebben als extra inkomen en uitvalsbasis in een stad als Amsterdam. Ook de huur illegaal met iemand delen, waardoor je beide geld bespaard, is een logisch gevolg van de krapte op de huizenmarkt. Ondanks dat het iets illegaals is kan ik er dus begrip voor opbrengen.
... ...
n Dat kan echt niet. Je maakt winst op een woning waar subsidie opzit. Daar is de woning niet voor bedoeld.
Now, for text mining purposes I would like to unnest the responses into n-grams (of up to 3), as I tried below:
library(dplyr)
library(tidytext)

tokensQ1Dummy <- Q1Dummy %>%
  unnest_tokens(words, Q1, token = "ngrams", n = 3, n_min = 1) %>%
  count(resp_id, words, sort = TRUE)
However, when I try this, the created 'words' column contains multiple instances of the same word. So in this case it shows the word 'de' multiple times, for multiple respondents:
resp_id words count
3 de 6
3 het 4
5 de 4
But what I want is to treat all responses as one response, so that important subjects that return in multiple responses are considered as one subject, and thus the word 'de' only comes up once (since it is the same word, just used by multiple respondents). How do I go about this?
You need to group by resp_id, then summarise and collapse to concatenate into one. It is difficult to illustrate precisely from your data example, but the code is something like:
library(tidyverse)
df %>%
  group_by(resp_id) %>%
  summarise(col = paste(Q1, collapse = " "))
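If the goal really is to treat all respondents' answers as a single document, a sketch (assuming tidytext is used for tokenizing and that you want one row per distinct n-gram with its total frequency) would collapse everything first and drop resp_id before counting:

library(dplyr)
library(tidytext)

Q1Dummy %>%
  summarise(Q1 = paste(Q1, collapse = " ")) %>%              # all responses as one string
  unnest_tokens(words, Q1, token = "ngrams", n = 3, n_min = 1) %>%
  count(words, sort = TRUE)                                  # one row per distinct n-gram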
I used pdftools to convert some PDF documents to txt. This is a part of the output (it's not too bad):
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean it up and convert it into a formatted TSV:
library(readr)

test <- read_lines("file.txt")
test2 <- test[!grepl("REP\u00daBLICA", test)]
test2 <- test2[!grepl("SERVICIO", test2)]
test2 <- test2[!grepl("NOMBRE", test2)]
test2 <- test2[!grepl("PROVINCIA", test2)]
test2 <- gsub("\\.", "", test2)
test2 <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl = TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines up to the end of the previous line, such as line 8 in the sample above, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as keywords for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and ensure they are unique.
(2) An easy solution would be to check each row's length and move values up to the end of the previous row if the row is too short.
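For example, a minimal sketch for (2), assuming every complete record contains the sex marker MUJ or VAR while continuation lines never do, and assuming lines is a hypothetical vector holding the filtered text before the whitespace-collapsing gsub (so the multi-space split from (1) still applies):

# lines: filtered lines with the original spacing kept
is_record <- grepl("\\b(MUJ|VAR)\\b", lines)
record_id <- cumsum(is_record)                 # a record and its continuations share an id
merged    <- unname(tapply(lines, record_id, paste, collapse = " "))

# reuse the split from (1), then write tab-separated output
fields <- strsplit(merged,
                   "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s",
                   perl = TRUE)
writeLines(vapply(fields, paste, character(1), collapse = "\t"), "file.tsv")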
I have more than a thousand rows in my dataframe. One of its columns should hold a single word. I want to lowercase this column:
df$precedingWord <- tolower(df$precedingWord)
But surprisingly, I get an error
Error in tolower(df$precedingWord) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #' in 'utf8towcs'
From this I gather that on a specific row, df$precedingWord doesn't hold a single word but more than a sentence, namely /home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #.
Now to debug this, I'd like to know the row ID of the sentence that is thrown. How can I find this out?
Use grep to search for the string:
x <- c("a",
'/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #')
grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
x, fixed = TRUE)
#[1] 2
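Applied to your data frame, the same idea gives the row number directly; a sketch (bad_row is just an illustrative name):

bad_row <- grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
                df$precedingWord, fixed = TRUE)
df[bad_row, ]

# if the underlying problem is invalid UTF-8 (as the utf8towcs error suggests),
# base R's validUTF8() can flag all such rows at once
which(!validUTF8(df$precedingWord))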
I want to merge/paste (paste(c(...), collapse=" ")) strings in a dataframe based on value (author) in a different column. I am looking for an efficient way to do it.
df <- data.frame(author = c("Shakespeare",
                            "Dante",
                            "Proust",
                            "Shakespeare",
                            "Dante",
                            "Proust",
                            "Shakespeare"),
                 text = c("Put the wild waters in this roar, allay them",
                          "Ma tu perche' ritorni a tanta noia?",
                          "Longtemps, je me suis couché de bonne heure",
                          "The very virtue of compassion in thee",
                          "Pensa oramai qual fu colui che degno",
                          "Quelle horreur! me disais-je",
                          "She said thou wast my daughter; and thy father"))
And the end result should be
result <- c("Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father",
"Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno",
"Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je")
names(result) <- c("Shakespeare","Dante","Proust")
result
# Shakespeare
# "Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father"
# Dante
# "Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno"
# Proust
# "Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je"
I guess I should somehow use some function from the apply family. Something like
apply( df[??? , 2 , paste , collapse = " " )
but I am not sure how to pass the condition and then obtain as result the name of the author to which the pasted strings correspond...
tapply works more or less exactly as you expected:
tapply(df$text, df$author, paste, collapse = " ")
A more en vogue solution would be to use dplyr
library(dplyr)
df %>% group_by(author) %>% summarize(passage = paste(text, collapse = " "))
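If you prefer to stay in base R but want a data frame back instead of a named vector, aggregate() is another option (a sketch):

# one row per author, with the concatenated passages in the text column
aggregate(text ~ author, data = df, FUN = paste, collapse = " ")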