Split row string into multiple columns in R

I'm using R and I have this string as a single row that I need to split into columns:
'id': 1050442590754103297, 'id_str': '1050442590754103297', 'name':
'ام رودينا ', 'screen_name': 'uclkGkQ5', 'location': None,
'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات
الخاصه', 'translator_type': 'none', 'protected': False,
'verified': False, 'followers_count': 1567, 'friends_count': 4019,
'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279,
'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None,
'time_zone': None, 'geo_enabled': False, 'lang': 'ar',
'contributors_enabled': False, 'is_translator': False,
'profile_background_color': 'F5F8FA', 'profile_background_image_url':
'', 'profile_background_image_url_https': '',
'profile_background_tile': False, 'profile_link_color': '1DA1F2',
'profile_sidebar_border_color': 'C0DEED',
'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color':
'333333', 'profile_use_background_image': True, 'profile_image_url':
'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg',
'profile_image_url_https':
'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg',
'profile_banner_url':
'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015',
'default_profile': True, 'default_profile_image': False, 'following':
None, 'follow_request_sent': None, 'notifications': None
I tried the code below and it worked, but I have to specify the number of columns in advance and also rename the columns at the end, so it is difficult and time-consuming:
d <- str_split_fixed(try$user, ",", 4)
The result I got is below, and it has no column names:
'id': 1050442590754103297 'id_str': '1050442590754103297' 'name': 'ام رودينا <U+267F>'
and the fourth column contains the rest of the string:
'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه<U+267F>', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None
I need code that splits the rows on commas and uses the word before the colon (:) as the column name, as shown here:
id id_str name screen_name
1050442590754103297 1050442590754103297 ام رودينا \u267f uclkGkQ5
and the same for the rest of the string.
I hope you understand me, and thank you.

This mimics JSON (but is not legal JSON). One method (if the assumptions below hold) would be to "convert to JSON" and parse from there.
Up front: I have an issue in my R session where the Arabic letters are not stored correctly in the string. This happens before the calls to gsub, etc., so I believe it might work just fine on your machine. You will see empty strings in the output as a consequence of this. (Rather than hunting this down locally, I wanted to give you a shot at the code first.)
Assumptions:
there are no embedded double-quotes
the literal strings True/False/None never appear inside text other than as the logical values (e.g., 'screen_name': 'Is None' does not happen)
you are willing to find NULL within your data, where there previously was :None
Important conversions:
True and False to lowercase
None to null
encapsulate the entire thing in a dictionary with surrounding { and }
convert all single-quotes to double-quotes
One can use the magrittr pipe here for readability, or you can just nest all of the functions (barely faster):
out <- jsonlite::fromJSON(
  paste0(
    "{",
    gsub(":\\s*True\\s*(,?)", ":true\\1",
         gsub(":\\s*False\\s*(,?)", ":false\\1",
              gsub(":\\s*None\\s*(,?)", ":null\\1",
                   gsub("'", '"', s)))),
    "}"))
# or
library(magrittr)
out <- s %>%
  gsub(":\\s*True\\s*(,?)", ":true\\1", .) %>%
  gsub(":\\s*False\\s*(,?)", ":false\\1", .) %>%
  gsub(":\\s*None\\s*(,?)", ":null\\1", .) %>%
  gsub("'", '"', .) %>%
  paste0("{", ., "}") %>%
  jsonlite::fromJSON(.)
The results (compacted with str):
str(out)
# List of 39
# $ id : num 1.05e+18
# $ id_str : chr "1050442590754103297"
# $ name : chr " "
# $ screen_name : chr "uclkGkQ5"
# $ location : NULL
# $ url : NULL
# $ description : chr "<U+200F><U+200F> "
# $ translator_type : chr "none"
# $ protected : logi FALSE
# $ verified : logi FALSE
# $ followers_count : int 1567
# $ friends_count : int 4019
# $ listed_count : int 0
# $ favourites_count : int 6669
# $ statuses_count : int 9279
# $ created_at : chr "Thu Oct 11 17:46:44 +0000 2018"
# $ utc_offset : NULL
# $ time_zone : NULL
# $ geo_enabled : logi FALSE
# $ lang : chr "ar"
# $ contributors_enabled : logi FALSE
# $ is_translator : logi FALSE
# $ profile_background_color : chr "F5F8FA"
# $ profile_background_image_url : chr ""
# $ profile_background_image_url_https: chr ""
# $ profile_background_tile : logi FALSE
# $ profile_link_color : chr "1DA1F2"
# $ profile_sidebar_border_color : chr "C0DEED"
# $ profile_sidebar_fill_color : chr "DDEEF6"
# $ profile_text_color : chr "333333"
# $ profile_use_background_image : logi TRUE
# $ profile_image_url : chr "http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
# $ profile_image_url_https : chr "https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
# $ profile_banner_url : chr "https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015"
# $ default_profile : logi TRUE
# $ default_profile_image : logi FALSE
# $ following : NULL
# $ follow_request_sent : NULL
# $ notifications : NULL
Notes:
reminder: empty spaces here are a problem with my local "locale" setting, not necessarily the process here (sorry, I don't often process non-ASCII on my console)
the large-integer id field is converted to numeric and cannot be an integer without use of the bit64 package; I don't know whether this will be a problem for you
I tried to be "careful" with replacements, and be very specific with the regex patterns, looking for a preceding colon, no ticks, and an optional follow-on comma (because the last None is not followed by anything); being as specific as possible mitigates the possibility of incorrectly replacing these literals within a string
I stress again the assumption of only-single-quotes ... if double-quotes are present, they will mess up the whole thing, and I haven't put thought into dealing with them
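If the loss of precision on the id field matters, one lightweight option (a sketch, not part of the original answer) is jsonlite's bigint_as_char argument, which returns integers too large to represent exactly as a double as character strings instead:

```r
library(jsonlite)

# bigint_as_char = TRUE makes jsonlite return integers that do not fit
# exactly in a double as character strings rather than rounding them.
out <- fromJSON('{"id": 1050442590754103297}', bigint_as_char = TRUE)
out$id
# [1] "1050442590754103297"
```

This keeps id and id_str consistent without pulling in an extra package.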


How to read file with irregularly nested quotations?

I have a file with irregular quotes like the following:
"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"
It seems irregular because the first column is only wrapped in single quotes, whereas every subsequent column is wrapped in double quotes. I'd like to read it so that every column is imported without quotes (neither in the header, nor the data).
What I've tried is the following:
# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")
And I also tried the suggestions offered here:
# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.
I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.
library(data.table)
library(magrittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
    .SDcols = string_cols]
str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
$ INDICATOR : chr "Spi-3"
$ CTY_CODE : chr "AFG"
$ MGN_CODE : chr ""
$ EVENT_NR : int 1
$ EVENT_NR_CR : int 1
$ START_DATE : chr "1952-10-01"
$ PEAK_DATE : chr "1952-11-01"
$ END_DATE : chr "1953-06-01"
$ MAX_EXT_ON : chr "1952-11-01"
$ DURATION : int 9
$ SEVERITY : num 6.98
$ INTENSITY : num 0.78
$ AVERAGE_AREA : num 19.8
$ WIDEST_AREA_PERC: num 44.1
$ SCORE : int 5
$ GRP_ID : chr "1"
- attr(*, ".internal.selfref")=<externalptr>
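The date conversion mentioned above could be done in the same in-place style; a minimal sketch (on a toy table standing in for the cleaned DT0, using the column names from this example) with data.table's as.IDate:

```r
library(data.table)

# Toy stand-in for the cleaned DT0, with two of the date columns
DT <- data.table(START_DATE = "1952-10-01", PEAK_DATE = "1952-11-01")
date_cols <- c("START_DATE", "PEAK_DATE")

# Convert the character columns to IDate in place
DT[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]
```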

jsonlite array of arrays

When using jsonlite to import JSON that has an array inside another array, I get an undesired unnamed list. Example below:
myjson <- jsonlite::fromJSON('{
  "class" : "human",
  "type" : [{
    "shape":"thin",
    "face":[{"eyes":"blues","hair":"brown"}]
  }]
}')
str(myjson)
List of 2
$ class: chr "human"
$ type :'data.frame': 1 obs. of 2 variables:
..$ shape: chr "thin"
..$ face :List of 1
.. ..$ :'data.frame': 1 obs. of 2 variables:
.. .. ..$ eyes: chr "blues"
.. .. ..$ hair: chr "brown"
I would like to access the "eyes" field as below (however it doesn't work):
myjson[["type"]][["face"]][["eyes"]]
NULL
Instead, I need to add "[[1]]" to make it work:
myjson[["type"]][["face"]][[1]][["eyes"]]
[1] "blues"
Any ideas how could I format the json to get rid of this unnamed list?
The thing is, unnamed lists are used whenever there is a JSON vector [{}, {}, ...]. The fact that your first vector is turned into a named list while the second, inner one is turned into an unnamed list is because jsonlite::fromJSON has the argument simplifyDataFrame = TRUE by default, which produces this behavior. I haven't looked into the source code, but it seems that the simplification involved (turning a vector with only one element into a named list) only applies to top-level objects.
A workaround is to apply a function that turns any unnamed list with only a single object into the object itself.
myjson <- lapply(myjson, function(x) {
  if (is.list(x)) {
    # if the element is a list, replace each child with its first element
    lapply(x, function(y) y[[1]])
  } else {
    x
  }
})
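Putting it together with the JSON from the question, this collapsing step then allows the desired access pattern (a quick end-to-end check):

```r
library(jsonlite)

myjson <- fromJSON('{
  "class" : "human",
  "type" : [{
    "shape":"thin",
    "face":[{"eyes":"blues","hair":"brown"}]
  }]
}')

# Collapse each single-element unnamed list into its first element
myjson <- lapply(myjson, function(x) {
  if (is.list(x)) lapply(x, function(y) y[[1]]) else x
})

myjson[["type"]][["face"]][["eyes"]]
# [1] "blues"
```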

issue importing json file in r

I am trying to import the data from this API: https://api.ycombinator.com/companies/export.json?callback=true
and I am getting the following error:
Error in parse_con(txt, bigint_as_char) lexical error: invalid char in json text.
setupCompanies([{"name":"Parake
(right here) ------^
I thought the error was because of emoticons, so I downloaded the file as text and removed them manually. It didn't work.
Remove the ?callback=true from your URL, and it works without error:
aa <- jsonlite::fromJSON("https://api.ycombinator.com/companies/export.json")
str(aa)
# 'data.frame': 2055 obs. of 8 variables:
# $ name : chr "Parakey" "Dinesafe" "Pengram" "Demeanor.co" ...
# $ url : chr "http://parakey.com" "https://dinesafe.org" "http://pengramar.com" "https://demeanor.co" ...
# $ batch : chr "s2005" "s2018" "w2019" "s2018" ...
# $ vertical : chr NA "B2B" "Augmented Reality" "Media" ...
# $ description: chr "" "We crowdsource food poisoning reports and help detect and prevent outbreaks." "Pengram provides indoor navigation in augmented reality on your phone. " "Now part of thentwrk.com" ...
# $ dead : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ has_ff : logi NA FALSE FALSE FALSE FALSE FALSE ...
# $ all_ff : logi NA FALSE FALSE FALSE FALSE FALSE ...
I'm guessing that the "callback API" is setting up the return value so that it is effectively a function call (i.e., setupCompanies(...)), not just data.
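If for some reason the callback cannot be dropped from the URL, another option (a sketch, not tested against this API) is to download the response as text and strip the function-call wrapper before parsing:

```r
# JSONP-style response: JSON wrapped in a function call,
# as in the error message above
txt <- 'setupCompanies([{"name":"Parakey"}])'

# Strip everything up to the first "(" and the trailing ")"
json <- sub("^[^(]*\\(", "", txt)
json <- sub("\\)\\s*;?\\s*$", "", json)
json
# [1] "[{\"name\":\"Parakey\"}]"
```

The stripped string can then be handed to jsonlite::fromJSON as usual.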

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data; the first row contains the column names, but there is an inconsistency in the line endings which causes R to import the data improperly, shifting the data left relative to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings causes R to think the records have one more column than the header, thus treating the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of the header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), "\n", file=paste0("fixed_", file_path))
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
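An alternative that avoids rewriting the files is to read the header line yourself and hand read.table an explicit col.names vector with one throwaway column to absorb the trailing separator. A sketch on a small stand-in file (the file name and tiny header are hypothetical, just to keep the example short):

```r
# Small stand-in file: header without a trailing "$",
# data row with a trailing "$"
cat("A$B$C\n1$2$3$\n", file = "deleteme2.txt")

# Read the column names ourselves from the first line
cn <- strsplit(readLines("deleteme2.txt", n = 1), "$", fixed = TRUE)[[1]]

# Skip the header and add a dummy column for the trailing separator
df <- read.table("deleteme2.txt", skip = 1, sep = "$", quote = "",
                 comment.char = "", col.names = c(cn, "dummy"))
df$dummy <- NULL
```

Since the column names are taken from each file's own first line, this also works when the headers change across files.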

POST to API using httr in R results in error

I'm trying to pull data directly from an API into R using the httr package. The API doesn't require any authentication, and accepts JSON strings of lat, long, elevation, variable sets, and time period to estimate climate variables for any location. This is my first time using an API, but the code below is what I've cobbled together from various Stack Overflow posts.
library(jsonlite)
library(httr)
url = "http://apibc.climatewna.com/api/clmApi"
body <- data.frame(lat = c(48.98, 50.2),  ## two example locations
                   lon = c(-115.02, -120),
                   el = c(1000, 100),
                   prd = c("Normal_1961_1990.nrm", "Normal_1961_1990.nrm"),
                   varYSM = c("Y", "SST"))
requestBody <- toJSON(list("output" = body), auto_unbox = TRUE)  ## convert to JSON string
result <- POST("http://apibc.climatewna.com/api/clmApi",  ## post to API
               body = requestBody,
               add_headers(`Content-Type` = "application/json"))
content(result)
I've tried various different versions of this (e.g. writing the JSON string manually, putting the body as a list in POST with encode = "json"), and it always runs, but the content always contains the below error message:
$Message
[1] "An error has occurred."
$ExceptionMessage
[1] "Object reference not set to an instance of an object."
$ExceptionType
[1] "System.NullReferenceException"
If I use GET and specify the variables directly in the URL
url = "http://apibc.climatewna.com/api/clmApi/LatLonEl?lat=48.98&lon=-115.02&el=1000&prd=Normal_1961_1990&varYSM=Y"
result <- GET(url)
content(result)
it produces the correct output, but then I can only obtain information for one location at a time. There isn't currently any public documentation about this API as it's very new, but I've attached a draft of the section explaining it using JS below. I would very much appreciate any help/suggestions on what I'm doing wrong!
Thank you!
The main problem is that jQuery.ajax encodes the data using jQuery.param before sending it to the API, so what it's sending looks something like [0][lat]=48.98&[0][lon]=-115.02.... I don't know of a package in R that does a similar encoding as jQuery.param, so we'll have to hack something together.
Modifying your example slightly:
library(httr)
body <- data.frame(lat = c(48.98, 50.2),  ## two example locations
                   lon = c(-115.02, -120),
                   el = c(1000, 100),
                   prd = c("Normal_1961_1990", "Normal_1961_1990"),
                   varYSM = c("Y", "Y"))
Now, we do the encoding, like so:
out <- sapply(1:nrow(body), function(i) {
  paste(c(
    paste0(sprintf("[%d][lat]", i - 1), "=", body$lat[i]),
    paste0(sprintf("[%d][lon]", i - 1), "=", body$lon[i]),
    paste0(sprintf("[%d][el]", i - 1), "=", body$el[i]),
    paste0(sprintf("[%d][prd]", i - 1), "=", body$prd[i]),
    paste0(sprintf("[%d][varYSM]", i - 1), "=", body$varYSM[i])
  ), collapse = "&")
})
out <- paste(out, collapse = "&")
so now out is in a form that the API likes. Finally
result <- POST(url = "http://apibc.climatewna.com/api/clmApi",  ## post to API
               body = out,
               add_headers(`Content-Type` = "application/x-www-form-urlencoded"))
noting the Content-Type. We get
df <- do.call(rbind, lapply(content(result), as.data.frame, stringsAsFactors = FALSE))
str(df)
# 'data.frame': 2 obs. of 29 variables:
# $ lat : chr "48.98" "50.2"
# $ lon : chr "-115.02" "-120"
# $ elev : chr "1000" "100"
# $ prd : chr "Normal_1961_1990" "Normal_1961_1990"
# $ varYSM : chr "Y" "Y"
# $ MAT : chr "5.2" "8"
# $ MWMT : chr "16.9" "20.2"
# $ MCMT : chr "-6.7" "-5.6"
# $ TD : chr "23.6" "25.7"
# $ MAP : chr "617" "228"
# $ MSP : chr "269" "155"
# $ AHM : chr "24.7" "79.1"
# $ SHM : chr "62.9" "130.3"
# $ DD_0 : chr "690" "519"
# $ DD5 : chr "1505" "2131"
# $ DD_18 : chr "4684" "3818"
# $ DD18 : chr "60" "209"
# $ NFFD : chr "165" "204"
# $ bFFP : chr "150" "134"
# $ eFFP : chr "252" "254"
# $ FFP : chr "101" "120"
# $ PAS : chr "194" "34"
# $ EMT : chr "-36.3" "-32.7"
# $ EXT : chr "37.1" "41.2"
# $ Eref : chr "14.7" "13.6"
# $ CMD : chr "721" "862"
# $ MAR : chr "347" "679"
# $ RH : chr "57" "57"
# $ Version: chr "ClimateBC_API_v5.51" "ClimateBC_API_v5.51"
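Since every column comes back as character, one possible cleanup step (base R, shown on a cut-down stand-in for the data frame above) is type.convert, which re-types each column where possible:

```r
# Cut-down stand-in for the data frame built from content(result)
df <- data.frame(lat = "48.98", MAT = "5.2", prd = "Normal_1961_1990",
                 stringsAsFactors = FALSE)

# Re-type each column: numbers become numeric, the rest stay character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

After this, lat and MAT are numeric while prd remains character.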
