I have a large JSON dataset that I would like to convert to a data frame in R.
(Sorry if this is a duplicate question, but the other answers didn't help me.)
My JSON file looks like this:
[{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.899-1", "orig_lang": "en", "new": ["- having regard to its resolution of 13", "December 20071 on Justice for the", "'Comfort Women' (sex slaves in Asia", "before and during World War II) as well", "as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993", "and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the", "Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for", "wartime victims, including victims of the", "'comfort women' system,", "_______________________", "1", "OJ C 323E, 18.12.2008, p.531"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 6 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.863v01-00", "reference": "2014/2016(INI)", "date": "2014-02-27T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.863-1", "orig_lang": "en", "new": ["- having regard to the Statement by the", "Vice-President of the Commission/ High", "Representative of the Union for Foreign", "affairs and Security Policy (VP/HR)", "Catherine Ashton of 20 March 2013 on", "the Magnitsky case in the Russian", "Federation,"], "authors": "Jacek Protasiewicz", "meps": [23782], "location": [["Motion for a resolution", "Citation 4 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.713v01-00", "reference": "2013/2149(INI)", "date": "2014-02-12T00:00:00", "committee": ["AFET"], "seq": 238, "id": "PE529.713-238", "orig_lang": "en", "old": ["A. whereas the European Neighbourhood", "Policy (ENP), in particular the Eastern", "Partnership (EaP), aims to extend the", "values and ideas of the founders of the EU;"], "new": ["A. whereas the European Neighbourhood", "Policy (ENP) embraces the values and", "ideas of the founders of the EU, notably", "the principles of Peace, Solidarity and", "Prosperity;"], "authors": "Mário David", "meps": [96973], "location": [["Motion for a resolution", "Recital A"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 2, "id": "PE529.899-2", "orig_lang": "en", "new": ["- having regard to the catastrophic", "earthquake and subsequent tsunami", "which devastated important parts of", "Japan's coast on 11 March 2011 and led", "to the destruction of the Fukushima", "nuclear power plant, causing possibly the", "greatest radiation disaster in human", "history,"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 11 a (new)"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
I would like to have a dataframe as follows:
src peid reference date committee seq id orig_lang new ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ... p.531"] ...
http://www.europarl.eu PE529.863v01-00 2014/2016(INI) 2014-02-27T00:00:00 AFET 128 PE529.899-1 en ["- having ..."Federation,"] ...
http://www.europarl.eu PE529.713v01-00 2013/2149(INI) 2014-02-12T00:00:00 AFET 238 PE529.899-1 en ["- having ..."Federation,"] ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ..."Federation,"] ...
(I didn't write the complete table above)
I have already tried the following code:
library(rjson)
# fromJSON(file = ...) is the rjson interface; jsonlite::fromJSON has no `file` argument
Data <- rjson::fromJSON(file = "data.json")
but each row is shown as below:
[[1]]
[[1]]$src
[1] "http://www.europarl.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+COMPARL+PE-529.899+01+DOC+PDF+V0//EN&language=EN"
[[1]]$peid
[1] "PE529.899v01-00"
[[1]]$reference
[1] "2014/2021(INI)"
[[1]]$date
[1] "2014-03-05T00:00:00"
[[1]]$committee
[1] "AFET"
[[1]]$seq
[1] 1
[[1]]$id
[1] "PE529.899-1"
[[1]]$orig_lang
[1] "en"
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
[[1]]$authors
[1] "Reinhard Bütikofer on behalf of the Verts/ALE Group"
[[1]]$meps
[1] 96739
[[1]]$location
[[1]]$location[[1]]
[1] "Motion for a resolution" "Citation 6 a (new)"
[[1]]$meta
[[1]]$meta$created
[1] "2019-07-03T05:06:17"
[[1]]$changes
list()
dput version is below:
list(list(src = "http://www.europarl.eu",
peid = "PE529.899v01-00", reference = "2014/2021(INI)", date = "2014-03-05T00:00:00",
committee = "AFET", seq = 1, id = "PE529.899-1", orig_lang = "en",
new = c("- having regard to its resolution of 13", "December 20071 on Justice for the",
"'Comfort Women' (sex slaves in Asia", "before and during World War II) as well",
"as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993",
"and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the",
"Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for",
"wartime victims, including victims of the", "'comfort women' system,",
"_______________________", "1", "OJ C 323E, 18.12.2008, p.531"
), authors = "Reinhard Bütikofer on behalf of the Verts/ALE Group",
meps = 96739, location = list(c("Motion for a resolution",
"Citation 6 a (new)")), meta = list(created = "2019-07-03T05:06:17"),
changes = list()))
One of my problems is with column 9, shown below: I want to put all 15 components into a single cell of the data frame.
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
How can I get the table I mentioned above?
We can either collapse the nested list elements with length greater than 1 into a single string by pasting (str_c) and then bind the named lists into rows with map_dfr:
library(purrr)
library(dplyr)
library(stringr)
map_dfr(Data, ~ map(.x, unlist) %>%
map_dfr(~ if(length(.x) > 1) str_c(.x, collapse = ";") else .x))
Or use the recursive function rrapply to bind the elements with length greater than 1 as a list column:
library(rrapply)
map_dfr(Data, ~ rrapply(.x, how = "bind"))
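If adding purrr or rrapply is not an option, the first idea (collapse every multi-element field into one string, then stack the records) can also be sketched in base R. The toy records and the helper name flatten_record below are made up for illustration; only the field names follow the question. Note that in this sketch every column ends up as character.

```r
# Toy records shaped like the parsed JSON in the question (values are made up)
Data <- list(
  list(id = "PE-1", seq = 1, new = c("line one", "line two"), changes = list()),
  list(id = "PE-2", seq = 2, old = "before", new = "after", changes = list())
)

# Hypothetical helper: collapse each field of one record to a single string
flatten_record <- function(rec) {
  vals <- lapply(rec, function(v) {
    v <- unlist(v)
    if (length(v) == 0) NA_character_ else paste(v, collapse = ";")
  })
  as.data.frame(vals, stringsAsFactors = FALSE)
}

rows <- lapply(Data, flatten_record)

# Records may lack some fields (e.g. "old"), so pad them with NA before stacking
all_cols <- unique(unlist(lapply(rows, names)))
rows <- lapply(rows, function(r) {
  for (nm in setdiff(all_cols, names(r))) r[[nm]] <- NA_character_
  r[all_cols]
})

df <- do.call(rbind, rows)
print(df)
```

Numeric columns such as seq can be converted back afterwards with as.numeric if needed.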
After a decade of lurking on Stack Overflow, I'm finally dipping my toes in to ask for help! Apologies for any mistakes!
I'm extracting tables from Word documents to build my own data frame. There are about 50 documents, all with the same table, but the data isn't mine and is a little messy, to put it mildly.
The table is 2 columns (Name, Value) by 60 rows, and the contents of df$Name are often written incorrectly, or rows are missing altogether. This is not my data, so editing it is not an option.
My problem is that I want to bind the data from each Word doc together, so they need to have the same columns. I will be transposing the data so that Name becomes the header and Value becomes row 1. Because the contents of df$Name are messy, I used grep to extract the rows I wanted. (Previously I tried extracting by row number, but the row numbers changed between Word docs.)
These are all the values that should be present in df$Name:
Col <- c("Top Film / Web code (if applicable)", "Base Film / Web code (if applicable)", "Top Label / Sleeves code", "Base Label code", "Promotional Label code", "Trays Code", "SRP code", "SRP label code", "Packing format (overwrap, MAP, VAC)", "Vac pressure (if applicable)", "Die set", "Optimal running speed (max)", "Gas mix (if applicable)", "Pressure for Leaker checks (bar)", "Frequency of checks", "Metal Detection Limits", "No. of Units per pack","Pack weight", "Claims", "Shelf Life Of Product From Pack / Slice", "Date code format", "Health Mark", "UK & EU Address","“e” mark present", "Weight present", "Top Label Placement", "Base Label Placement", "Promo Label Placement", "Barcode (if applicable)", "No. of Packs per SRP/Basket","Weight of outercase", "Max No. of SRP/Baskets per pallet")
## Use grep to search for similar words present in all Word docs
toMatch <- c("Top Film","Base Film", "Top Label", "Base Label", "Promotional", "Trays", "SRP","Packing format", "Vac pressure", "Die set", "Optimal running speed", "Gas mix",
"Pressure", "Frequency", "Metal Detection Limits", "per pack",
"Pack weight",
"Claims", "Shelf Life", "Date code", "Health", "Address",
"“e”", "Weight present", "Top Label Place", "Base Label Place", "Promo Label Place", "Barcode","No. of Packs","outercase", "Max No.")
tab_select <- unique (df[grep(paste(toMatch,collapse="|"),
df$Name, ignore.case=TRUE),])
Using grep like this is pretty successful, but if a value is missing, there's no sign of it. In this case "Trays Code" was not present, but I need a blank "Trays Code" row (with NA in Value) to be created. Adding one afterwards doesn't help, as it ends up at the bottom of the table, and I need the rows to stay in the right order.
Is there a way to get grep to match, but also create a row with NA if there are no matches?
I tried making a separate table with the correct column names and using dplyr to join, hoping any duplicates would disappear, but the slightly differing names in df$Name and Col produce more duplicates.
I'm not sure whether I should loop through each pattern and create a row if there's no match. I'm just wary of ending up with loops in loops in loops.
At the moment, this one grep call uses multiple patterns, and some patterns pick up multiple rows of data, which might complicate things.
How about this:
df <- data.frame(Name = c("Top Film / Web code (if applicable)", "Base Film / Web code (if applicable)", "Top Label / Sleeves code", "Base Label code", "Promotional Label code", "Trays Code", "SRP code", "SRP label code", "Packing format (overwrap, MAP, VAC)", "Vac pressure (if applicable)", "Die set", "Optimal running speed (max)", "Gas mix (if applicable)", "Pressure for Leaker checks (bar)", "Frequency of checks", "Metal Detection Limits", "No. of Units per pack","Pack weight", "Claims", "Shelf Life Of Product From Pack / Slice", "Date code format", "Health Mark", "UK & EU Address","“e” mark present", "Weight present", "Top Label Placement", "Base Label Placement", "Promo Label Placement", "Barcode (if applicable)", "No. of Packs per SRP/Basket","Weight of outercase", "Max No. of SRP/Baskets per pallet"))
toMatch <- c("Top Film","Base Film", "Top Label", "Base Label", "Promotional", "Trays", "SRP","Packing format", "Vac pressure", "Die set", "Optimal running speed", "Gas mix", "Pressure", "Frequency", "Metal Detection Limits", "per pack", "Pack weight", "Claims", "Shelf Life", "Date code", "Health", "Address", "“e”", "Weight present", "Top Label Place", "Base Label Place", "Promo Label Place", "Barcode","No. of Packs","outercase", "Max No.")
df$Value <- 1:nrow(df)
df$Name[6] <- "Not Matched"
out <- lapply(toMatch, function(x){
if(any(grepl(x, df$Name))){
df[grep(x, df$Name), ]
}else{
data.frame(Name = x, Value=NA)
}
})
out <- do.call(rbind, out)
head(out, n=10)
#> Name Value
#> 1 Top Film / Web code (if applicable) 1
#> 2 Base Film / Web code (if applicable) 2
#> 3 Top Label / Sleeves code 3
#> 26 Top Label Placement 26
#> 4 Base Label code 4
#> 27 Base Label Placement 27
#> 5 Promotional Label code 5
#> 110 Trays NA
#> 7 SRP code 7
#> 8 SRP label code 8
Created on 2023-01-08 by the reprex package (v2.0.1)
Note that I changed the sixth observation of Name in df to "Not Matched" to show what happens when there is no match; in the original data that row is a match for "Trays". You can see the result in the "Trays" row of the output (row name 110), where Value is NA. Patterns that match several rows, such as "Top Label", return all of them.
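If exactly one row per pattern is needed (so the transposed tables line up across documents), a variant of the same idea keeps only the first match for each pattern and records the pattern itself as a stable key. This is a sketch on made-up data; match_first and the Pattern column are names introduced here, and taking the first match is an assumption about which row is wanted.

```r
# Made-up stand-in for one extracted Word table
df <- data.frame(
  Name  = c("Top Film / Web code", "Top Label / Sleeves code",
            "SRP code", "Top Label Placement"),
  Value = c("A1", "B2", "C3", "D4"),
  stringsAsFactors = FALSE
)
patterns <- c("Top Film", "Top Label", "Trays", "SRP")

# Hypothetical helper: first matching row for a pattern, or an NA row if none
match_first <- function(pattern, data) {
  hit <- grep(pattern, data$Name, ignore.case = TRUE)
  data.frame(
    Pattern = pattern,
    Name    = if (length(hit)) data$Name[hit[1]] else NA_character_,
    Value   = if (length(hit)) data$Value[hit[1]] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# One row per pattern, in the order of `patterns`
out <- do.call(rbind, lapply(patterns, match_first, data = df))
print(out)
```

Because the result always has length(patterns) rows in the same order, the Pattern column can serve as the header after transposing, whatever the spelling in the source document.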
I have a character vector with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is one line that starts with a number (49) that is not on its own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert it at the correct position, after element 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please, check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
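Another option is stringr::str_match with a pattern that captures two groups: an optional leading number and everything after it. Take the captured groups from the match matrix (columns 2:3), flatten them row by row, and create a new vector of strings by dropping the NAs and empty strings. Note that the split-off number comes out as "49", without a leading "\n".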
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last"
#[8] "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line), str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line), '[[:blank:]]', n = 2), output.by.line)
)
Print the results:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"
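For comparison, the same split can be sketched in base R without stringr. The helper name split_leading_number and the short test vector are made up for illustration, and the sketch assumes a fused verse number is always a run of digits at the very start of a line followed by other text.

```r
# Short excerpt standing in for output.by.line
x <- c("my sword and my bow.\"", "49And Jacob called his sons and", "\n2", "days:")

# Hypothetical helper: split "49And ..." into "\n49" and "And ..."
split_leading_number <- function(line) {
  if (grepl("^[0-9]+[^0-9]", line)) {
    number <- sub("^([0-9]+).*$", "\\1", line)  # leading digits only
    rest   <- sub("^[0-9]+", "", line)          # the line without them
    c(paste0("\n", number), rest)
  } else {
    line
  }
}

# lapply keeps the order; unlist flattens the two-element results in place
result <- unlist(lapply(x, split_leading_number))
print(result)
```

Lines that already start with "\n" are left alone, because the pattern only matches digits in the first position.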
So I have a long dataset of sequences.
Every column (from t1 to t...n) has the same levels or categories.
There are more than 200 categories or levels and 144 columns (variables) in total.
id t1 t2 t3 t...n
"1" "eating" "tv" "conversation" "..."
"2" "sleep" "driving" "relaxing" "..."
"3" "drawing" "kissing" "knitting" "..."
"..." "..." "..." "..." "..."
Variable t1 has the same levels as t2, and so on.
What I need is loop-style recoding of each column (but without writing a loop).
I would like to avoid the usual
seq$t1[seq$t1== "drawing"] <- 'leisure'
seq$t1[seq$t1== "eating"] <- 'meal'
seq$t1[seq$t1== "sleep"] <- 'personal care'
seq$t1[seq$t1== "..."] <- ...
The most convenient recoding style would be something like
c('leisure') = c('drawing', 'tv', ...)
That would help me to better cluster variables into bigger categories.
Are there any newer, easier recoding methods in R?
What would you advise me to use?
This is a sample of my real dataset: 6 repeated observations (in columns) for 10 respondents (in rows).
dtaSeq = structure(c("Wash and dress", "Eating", "Various arrangements", "Cleaning dwelling", "Ironing", "Activities related to sports",
"Eating", "Eating", "Other specified construction and repairs",
"Other specified physical care & supervision of a child", "Wash and dress",
"Filling in the time use diary", "Food preparation", "Wash and dress",
"Ironing", "Travel related to physical exercise", "Eating", "Eating",
"Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Wash and dress", "Filling in the time use diary", "Food preparation",
"Wash and dress", "Food preparation", "Wash and dress", "Eating",
"Eating", "Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Wash and dress", "Filling in the time use diary", "Baking",
"Teaching the child", "Food preparation", "Wash and dress", "Eating",
"Eating", "Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Dish washing", "Unspecified TV watching", "Reading periodicals",
"Teaching the child", "Food preparation", "Reading periodicals",
"Eating", "Eating", "Other specified construction and repairs",
"Feeding the child", "Laundry", "Unspecified TV watching", "Cleaning dwelling",
"Teaching the child", "Eating", "Eating", "Eating", "Eating",
"Other specified construction and repairs", "Feeding the child"),
.Dim = c(10L, 6L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"), c("act1.050", "act1.051", "act1.052",
"act1.053", "act1.054", "act1.055")))
As far as I know, the car package can handle strings or characters in its recode-function, but I'm not sure. An alternative could be the sjmisc-package, making a detour by converting the strings to numeric values and set back value labels later:
library(sjmisc)
dtaSeq <- as.data.frame(dtaSeq)
# convert to values
dtaSeq.values <- to_value(dtaSeq)
# random recode example, use your own values for clustering here
dtaSeq.values <- rec(dtaSeq.values, "1:3=1; 4:6=2; else=3")
# set value labels, these will be added as attributes
dtaSeq.values <- set_val_labels(dtaSeq.values, c("meal", "leisure", "personal care"))
# replace numeric values with associated label attributes
dtaSeq.values <- to_label(dtaSeq.values)
Result:
> head(dtaSeq.values)
act1.050 act1.051 act1.052 act1.053 act1.054 act1.055
1 personal care personal care leisure personal care meal leisure
2 meal meal meal meal personal care personal care
3 personal care meal meal meal leisure meal
4 meal personal care leisure personal care personal care leisure
5 leisure leisure meal leisure leisure meal
6 meal personal care leisure personal care leisure meal
An advantage of the sjmisc-recode function is, if you have a data frame with variables of similar "structure", you can recode the complete data frame just with one call to rec.
Does this help you?
You don't seem to have fully specified recoding rules for your real data,
so I made some up:
recodes <- list("meals"=c("Eating"),
"leisure"=c("Reading periodicals",
"Unspecified TV watching"),
"child care"=c("Feeding the child","Teaching the child"),
"house care"=c("Food preparation","Dish washing",
"Cleaning dwelling","Ironing"))
Here's a general-purpose recoding function. car::recode does work,
but I find it a little clumsy. There's also plyr::revalue, but
it's one-to-one, not many-to-one.
recodeFun <- function(x) {
for (i in seq_along(recodes)) {
x[x %in% recodes[[i]]] <- names(recodes)[i]
}
return(x)
}
d2 <- recodeFun(dtaSeq)
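The same many-to-one rules can also be applied without the explicit for loop by turning the list into a named lookup vector. This is only a sketch on a made-up 2 x 2 matrix; the names lookup, m and recoded are introduced here, and values with no rule (such as "Ironing" below) are left unchanged.

```r
# Many-to-one rules in the "new category = old values" style
recodes <- list(
  meals   = "Eating",
  leisure = c("Reading periodicals", "Unspecified TV watching")
)

# Invert the list: names are the old values, elements are the new categories
lookup <- setNames(rep(names(recodes), lengths(recodes)),
                   unlist(recodes, use.names = FALSE))

# Made-up character matrix standing in for dtaSeq
m <- matrix(c("Eating", "Ironing", "Unspecified TV watching", "Eating"),
            nrow = 2, dimnames = list(c("1", "2"), c("t1", "t2")))

# Recode every cell that has a rule; dimensions and dimnames are preserved
recoded <- m
hit <- m %in% names(lookup)
recoded[hit] <- unname(lookup[m[hit]])
print(recoded)
```

This works on a character matrix like the dput above; for a data frame, the same indexing would have to be applied column by column, for example with lapply.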
I have the following data frame:
sent <- data.frame(words = c("just right size", "size love quality", "laptop worth price", "price amazing user",
"explanation complex what", "easy set", "product best buy", "buy priceless when"), user = c(1,2,3,4,5,6,7,8))
The sent data frame looks like this:
words user
just right size 1
size love quality 2
laptop worth price 3
price amazing user 4
explanation complex what 5
easy set 6
product best buy 7
buy priceless when 8
I need to remove the word at the beginning of a sentence when it is the same as the word at the end of the previous sentence.
For example, given the sentences "just right size" and "size love quality", I need to remove the word "size" at the second user position.
Likewise, for "laptop worth price" and "price amazing user", I need to remove the word "price" at the fourth user position.
Can anyone help me? I'd appreciate any help. Thank you very much in advance.
You could use sub to extract the first word of each succeeding row and the last word of each current row from the "words" column. If the two words are the same, remove the first word from the succeeding row; otherwise keep the row as it is (ifelse(...)).
w1 <- sub(' .*', '', sent$words[-1])
w2 <- sub('.* ', '', sent$words[-nrow(sent)])
sent$words <- as.character(sent$words)
sent$words
#[1] "just right size" "size love quality"
#[3] "laptop worth price" "price amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "buy priceless when"
sent$words[-1] <- with(sent, ifelse(w1==w2, sub('\\w+ ', '',words[-1]),
words[-1]))
sent$words
#[1] "just right size" "love quality"
#[3] "laptop worth price" "amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "priceless when"
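An alternative sketch splits each sentence into words first, which makes the comparison explicit. The function name drop_repeated_lead is made up here, and the sketch assumes words are separated by single spaces; each sentence is compared against the original (unmodified) previous sentence.

```r
# Hypothetical helper: drop a leading word that repeats the previous sentence's last word
drop_repeated_lead <- function(sentences) {
  tokens  <- strsplit(sentences, " ", fixed = TRUE)
  cleaned <- tokens
  for (i in seq_along(tokens)[-1]) {
    prev <- tokens[[i - 1]]
    cur  <- tokens[[i]]
    if (length(prev) > 0 && length(cur) > 0 && cur[1] == prev[length(prev)]) {
      cleaned[[i]] <- cur[-1]  # remove the repeated first word
    }
  }
  # Glue the remaining words of each sentence back together
  vapply(cleaned, paste, character(1), collapse = " ")
}

words <- c("just right size", "size love quality", "laptop worth price",
           "price amazing user", "easy set")
result <- drop_repeated_lead(words)
print(result)
```

It could be applied to the data frame with sent$words <- drop_repeated_lead(as.character(sent$words)).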