I have imported a movie dataset in CSV format. A few of the columns are full of special symbols along with the data I need (an example is attached below, along with an image of the movie dataset). Do I have to remove those special characters individually, or is there any way (a shortcut) to remove them while importing the file into R? Thanks.
Movie.csv Image
GENRE
[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
Spoken Languages
[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]
Related
I would like to change a field in my JSON file as specified by another JSON file. My input file is something like:
{"id": 10, "name": "foo", "some_other_field": "value 1"}
{"id": 20, "name": "bar", "some_other_field": "value 2"}
{"id": 25, "name": "baz", "some_other_field": "value 10"}
I have an external override file that specifies how name in certain objects should be overridden, for example:
{"id": 20, "name": "Bar"}
{"id": 10, "name": "foo edited"}
As shown above, the override may be shorter than the input; objects without an override entry should keep their name unchanged. Both files easily fit into available memory.
Given the above input and the override, I would like to obtain the following output:
{"id": 10, "name": "foo edited", "some_other_field": "value 1"}
{"id": 20, "name": "Bar", "some_other_field": "value 2"}
{"id": 25, "name": "baz", "some_other_field": "value 10"}
Being a beginner with jq, I wasn't really sure where to start. While there are some questions that cover similar ground (the closest being this one), I couldn't figure out how to apply the solutions to my case.
There are many possibilities, but probably the simplest efficient solution would use the built-in function INDEX/2, e.g. as follows:
jq -n --slurpfile dict f2.json '
(INDEX($dict[]; .id) | map_values(.name)) as $d
| inputs
| .name = ($d[.id|tostring] // .name)
' f1.json
This uses inputs with the -n option to read the first file so that each JSON object can be processed in turn.
Since the solution is so short, it should be easy enough to figure it out with the aid of the online jq manual.
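To see the lookup table that INDEX builds here, you can run just that step on its own (using the override file, f2.json in the command above):
$ jq -n --slurpfile dict f2.json 'INDEX($dict[]; .id) | map_values(.name)'
{
  "20": "Bar",
  "10": "foo edited"
}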
Caveat
This solution comes with a caveat: it assumes there are no "collisions" between ids in the dictionary as a result of the use of tostring (e.g. it would misbehave if both {"id": 10} and {"id": "10"} occurred).
If the dictionary does or might have such collisions, then the above solution can be tweaked accordingly, but it is a bit tricky.
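For example, one possible tweak (just a sketch, not necessarily the tweak the author had in mind) is to key the dictionary by the JSON encoding of the id rather than by its tostring form, so that 10 and "10" get distinct keys:
jq -n --slurpfile dict f2.json '
  (INDEX($dict[]; .id|tojson) | map_values(.name)) as $d
  | inputs
  | .name = ($d[.id|tojson] // .name)
' f1.json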
I've got a JSON file that looks like this
I am trying to import it into R using the jsonlite package.
#Load package for import
library(jsonlite)
df <- fromJSON("test.json")
But it throws an error
Error in parse_con(txt, bigint_as_char) : parse error: trailing
garbage
ome in at a later time." } { "id": "e5fa37f44557c62ee
(right here) ------^
I've tried looking at all the solutions on Stack Overflow, but haven't been able to figure this out.
Any input would be very helpful.
The JSON file you linked contains two top-level JSON objects rather than a single JSON document, which is why the parser stops after the first one and reports "trailing garbage". Perhaps you want an array:
[
{
"id": "71bb8883780bb152e4bb4db976bedc62",
"metadata": {
"abc_bad_date": "true",
"abc_client": "Hydra Corp",
"abc_doc_id": 1,
"abc_file": "Hydra Corp 2016.txt",
"abc_interview_type": "Post Analysis",
"abc_interviewee_role": "Director Corporate Engineering; Greater Chicago Area; Global Procurement Director Facilities and MRO",
"abc_interviewer": "Piper Thomas",
"abc_services_provided": "Food",
"section": "on_expectations"
},
"text": "Gerrit: There were a number ...."
},
{
"id": "e5fa37f44557c62eef44baafb13128f0",
"metadata": {
"abc_bad_date": "true",
"abc_client": "Hydra Corp",
"abc_doc_id": 1,
"abc_file": "Hydra Corp 2016.txt",
"abc_interview_type": "Post Analysis",
"abc_interviewee_role": "Director Corporate Engineering; Greater Chicago Area; Global Procurement Director Facilities and MRO",
"abc_interviewer": "Piper Thomas",
"abc_services_provided": "Painting",
"section": "on_relationships"
},
"text": "Gerrit: I thought the ABC ..."
}
]
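If you'd rather not edit the file by hand, one way to produce such an array (a sketch, assuming jq is installed; the output file name is just an example) is jq's slurp mode, which wraps the concatenated objects in a top-level array:
$ jq -s '.' test.json > test_array.json   # -s (--slurp) reads all inputs into one array
fromJSON("test_array.json") should then parse without the trailing-garbage error.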
In my Hive table MYTABLE I have one column "MYCOL" that contains this:
{"id": "a651b57f",
"items": {
"ITEM1": {
"code": "CODE1",
"name": "NAME1"},
"ITEM2": {
"code": "CODE2",
"name": "NAME2"}},
"myinfo": {
"c7daf1a9": {
"id": "c7daf1a9",
"name": "newname",
"type": "newtype",
"appliedto": ["ITEM1", "ITEM2"]}},
"info2": 12}
I would like to access the elements inside "myinfo", and I tried something like this:
select GET_JSON_OBJECT(t.MYCOL,'$.myinfo') FROM MYTABLE
but it doesn't work. Can someone help me? Thanks.
Make sure the data in the HDFS file has one line for each JSON record (not one record spread over multiple lines).
If a JSON record does span multiple lines, then we need to replace the newlines within each record before storing it into HDFS.
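One way to do that normalization outside Hive (a sketch, assuming jq is available; the file names raw.json and oneline.json are just placeholders, not from the question) is jq's compact-output mode, which re-prints each JSON document on a single line:
$ jq -c '.' raw.json > oneline.json   # one compact JSON record per output line
The resulting file can then be loaded into HDFS with one record per line.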
Example:
HDFS file data:
{"id": "a651b57f","items": {"ITEM1": {"code": "CODE1","name": "NAME1"},"ITEM2": {"code": "CODE2","name": "NAME2"}},"myinfo": {"c7daf1a9": {"id": "c7daf1a9","name": "newname","type": "newtype","appliedto": ["ITEM1", "ITEM2"]}},"info2": 12}
Hive:
with cte as (select string('{"id": "a651b57f","items": {"ITEM1": {"code": "CODE1","name": "NAME1"},"ITEM2": {"code": "CODE2","name": "NAME2"}},"myinfo": {"c7daf1a9": {"id": "c7daf1a9","name": "newname","type": "newtype","appliedto": ["ITEM1", "ITEM2"]}},"info2": 12}')my_col) --sample data
select get_json_object(my_col,'$.myinfo')jsn from cte;
Output:
{"c7daf1a9":{"id":"c7daf1a9","name":"newname","type":"newtype","appliedto":["ITEM1","ITEM2"]}}
Update
-- to access the name subfield we need to specify its full path within the JSON object
hive> select get_json_object(my_col,'$.myinfo.c7daf1a9.name')jsn from <table_name>;
--result
newname
hive> select get_json_object(my_col,'$.myinfo.c7daf1a9.appliedto')jsn from <table_name>;
--result
["ITEM1","ITEM2"]
hive> select get_json_object(my_col,'$.myinfo.c7daf1a9.appliedto[0]')jsn from <table_name>;
--result
ITEM1
I have a .csv file; unfortunately, one of the columns contains a dictionary that has commas in it, for example:
{"name": "Umbulharjo", "type": "Kecamatan", "level": "3", "region1": "Yogyakarta", "region2": "Yogyakarta", "region3": "Umbulharjo", "postcode": "55161"}
How can I put a " before every { and after every } in R? Then I can set " as the quote character when using read.csv, read.csv2, or read.table.
Your data looks to be JSON-ish. If you're doing a lot of JSON stuff, I suggest using a library that understands JSON.
paste('{"name": "Umbulharjo", "type": "Kecamatan", "level": "3", "region1": "Yogyakarta", "region2": "Yogyakarta", "region3": "Umbulharjo", "postcode": "55161"}', '')
#[1] "{\"name\": \"Umbulharjo\", \"type\": \"Kecamatan\", \"level\": \"3\", \"region1\": \"Yogyakarta\", \"region2\": \"Yogyakarta\", \"region3\": \"Umbulharjo\", \"postcode\": \"55161\"} "
I need to delete multiple keys at once from some JSON (using jq), and I'm trying to learn if there is a better way of doing this than calling map and del every time. Here's my input data:
test.json
[
{
"label": "US : USA : English",
"Country": "USA",
"region": "US",
"Language": "English",
"locale": "en",
"currency": "USD",
"number": "USD"
},
{
"label": "AU : Australia : English",
"Country": "Australia",
"region": "AU",
"Language": "English",
"locale": "en",
"currency": "AUD",
"number": "AUD"
},
{
"label": "CA : Canada : English",
"Country": "Canada",
"region": "CA",
"Language": "English",
"locale": "en",
"currency": "CAD",
"number": "CAD"
}
]
For each item, I want to remove the number, Language, and Country keys. I can do that with this command:
$ cat test.json | jq 'map(del(.Country)) | map(del(.number)) | map(del(.Language))'
That works fine, and I get the desired output:
[
{
"label": "US : USA : English",
"region": "US",
"locale": "en",
"currency": "USD"
},
{
"label": "AU : Australia : English",
"region": "AU",
"locale": "en",
"currency": "AUD"
},
{
"label": "CA : Canada : English",
"region": "CA",
"locale": "en",
"currency": "CAD"
}
]
However, I'm trying to understand whether there is a jq way of specifying multiple keys to delete, so that I don't need multiple map(del()) directives.
You can provide a stream of paths to delete:
$ cat test.json | jq 'map(del(.Country, .number, .Language))'
Also, consider that, instead of blacklisting specific keys, you might prefer to whitelist the ones you do want:
$ cat test.json | jq 'map({label, region, locale, currency})'
There is no need to use both map and del.
You can pass multiple paths to del, separated by commas.
Here is a solution using "dot-style" path notation:
jq 'del( .[] .Country, .[] .number, .[] .Language )' test.json
This form doesn't require quotation marks (which you may feel makes it more readable), but it doesn't group the paths (you have to retype .[] once per path).
Here is an example using "array-style" path notation, which allows you to combine paths with a common prefix like so:
jq 'del( .[] ["Country", "number", "Language"] )' test.json
This combines subpaths under their "last common ancestor" (which in this case is the top-level list iterator .[]).
peak's answer uses map and delpaths, though it seems you can also use delpaths on its own:
jq '[.[] | delpaths( [["Country"], ["number"], ["Language"]] )]' test.json
This requires both quotation marks and an array of singleton arrays, and it requires you to wrap the result back into a list (hence the enclosing square brackets).
Overall, here I'd go for the array-style notation for brevity, but it's always good to know multiple ways to do the same thing.
Here is a better compromise between the "array-style" and "dot-style" notation mentioned by Louis in his answer:
del(.[] | .Country, .number, .Language)
jqplay
This form can also be used to delete a list of keys from a nested object (see russholio's answer):
del(.a | .d, .e)
Implying that you can also pick a single index to delete keys from:
del(.[1] | .Country, .number, .Language)
Or multiple:
del(.[2,3,4] | .Country,.number,.Language)
You can delete a range using the range() function (slice notation doesn't work):
del(.[range(2;5)] | .Country,.number,.Language) # same as targeting indices 2,3,4
Some side notes:
map(del(.Country,.number,.Language))
# Is by definition equivalent to
[.[] | del(.Country,.number,.Language)]
If the key contains special characters or starts with a digit, you need to surround it with double quotes like this: ."foo$", or else .["foo$"].
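For instance (a toy example, not taken from the question's data):
$ echo '{"foo$": 1, "bar": 2}' | jq 'del(."foo$")'
{
  "bar": 2
}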
This question is very high in the Google results, so I'd like to note that at some point in the intervening years, del has apparently been altered so that you can delete multiple keys with just:
del(.key1, .key2, ...)
So don't tear your hair out trying to figure out the syntax work-arounds, assuming your version of jq is reasonably current.
In addition to #user3899165's answer, I found this way to delete a list of keys from a "sub-object":
example.json
{
"a": {
"b": "hello",
"c": "world",
"d": "here's",
"e": "the"
},
"f": {
"g": "song",
"h": "that",
"i": "I'm",
"j": "singing"
}
}
$ jq 'del(.a["d", "e"])' example.json
delpaths is also worth knowing about, and is perhaps a little less mysterious:
map( delpaths( [["Country"], ["number"], ["Language"]] ))
Since the argument to delpaths is simply JSON, this approach is particularly useful for programmatic deletions, e.g. if the key names are available as JSON strings.
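For instance, if the keys to drop are supplied at runtime as a JSON array, you could pass them in with --argjson (a sketch; the $keys variable name is just an illustration):
$ jq --argjson keys '["Country", "number", "Language"]' 'map(delpaths($keys | map([.])))' test.json
This produces the same output as the map(del(...)) approaches above.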