I've been trying to import a .csv file (comma separated), but one column is in JSON format. This causes problems when importing the data as a dataframe in one go. I've tried read.table and read.csv, but I cannot find the right solution (or similar questions on Stack Overflow).
Is there an easy way to load the dataframe and keep the JSON column as a single string column? e.g. "[{"..." : "..." , "...": "..."}]" Basically everything between the "[{ }]" should end up in one column in the final dataframe.
It's extra challenging (for me) because the ',' also appears inside the JSON column, where it should not be treated as a field separator, while it should still split the remaining columns.
Desired output:
df =
V1 V2 V3 JSONCOLUMN
x y z "[{"..." : "..." , "...": "..."}]"
Here is a somewhat hacky way:
# Read csv
df <- read.csv(text =
'"Date", "Time", "Number", "Fun", "JSON_COLUMN"
"2-2-1900", "14:09:56", 4, TRUE, "[{"message":"nothing","description": "hello", "otherField": "ciao"}])"');
# Add double quotes for keys and values
df$JSON_COLUMN <- gsub("(\\w+):(\\s*)(\\w+)", "\"\\1\":\\2\"\\3\"", df$JSON_COLUMN)
df
#Date Time Number Fun
#1 2-2-1900 14:09:56 4 TRUE
# JSON_COLUMN
#1 [{"message":"nothing","description": "hello", "otherField": "ciao"}])
Explanation: Since read.csv strips double quotes from the JSON string, we add them back in using a regular expression.
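A somewhat more robust sketch (my own, hedged: it assumes the JSON column is always the last field, as in the example above) is to split each raw line manually, so that commas inside the brackets are never touched:
# Hedged sketch: split off the bracketed JSON before parsing the rest.
# The sample line is made up to match the example above.
line <- '"2-2-1900", "14:09:56", 4, TRUE, "[{"message":"nothing","description": "hello"}]"'
# Everything between the quotes around [{ ... }] is the JSON string
json <- sub('^.*"(\\[\\{.*\\}\\])".*$', "\\1", line)
# Everything before the comma preceding "[{ is the regular CSV part
rest <- sub('\\s*,\\s*"\\[\\{.*$', "", line)
fields <- scan(text = rest, what = character(), sep = ",",
               strip.white = TRUE, quiet = TRUE)
df <- as.data.frame(t(c(fields, json)), stringsAsFactors = FALSE)
names(df) <- c("Date", "Time", "Number", "Fun", "JSON_COLUMN")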
I have a CSV with a JSON string with inconsistent ordering of the fields. So it looks like this:
Row 1: '{"name":"John", "age":30, "car":null}'
Row 2: '{"name":"Chuck", "car":black, "age":25}'
Row 3: '{"car":blue, "age":54, "name":"David"}'
I'm hoping to use R to parse this out into columns with the appropriate data. So I'd like to create a 'name' column, 'age' column, and 'car' column and have them populate with the appropriate data. Is there any way to do this using jsonlite, or would I need to figure out a way to essentially query the JSON string for each property name (car, name, age) and populate the column with the corresponding value?
You can use the jsonlite library; however, in order to parse the data you must make some "adjustments" to your string. Let's say that you have the df as follows:
my_df <- data.frame(column_1 =
  c('{"name":"John", "age":30, "car":null}',
    '{"name":"Chuck", "car":"black", "age":25}',
    '{"car":"blue", "age":54, "name":"David"}')
)
You must have valid JSON in order to parse the data properly. In this case we want a JSON array, so the data must be wrapped in [ and ], and the elements must be separated by ,. Also be careful with strings: each one must be quoted as "<string>". (Your example did not quote the blue and black values.)
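As an aside, if your raw strings really do contain unquoted values such as black, a regex in the spirit of the first answer above can quote them first. A hedged sketch (my own; the lookahead keeps the JSON literal null intact):
# Quote bare word values after ": while leaving numbers and null alone.
raw <- '{"name":"Chuck", "car":black, "age":25}'
gsub('":(\\s*)(?!null\\b)([A-Za-z]\\w*)', '":\\1"\\2"', raw, perl = TRUE)
#> [1] "{\"name\":\"Chuck\", \"car\":\"black\", \"age\":25}"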
With that in mind, we can now write some code:
# Base R
# Add "commas" to sep each element
new_json_str <- paste(my_df$column_1, collapse = ",")
# Add "brackets" to the string
new_json_str <- paste("[", new_json_str, "]")
# Parse the JSON string with jsonlite
jsonlite::fromJSON(new_json_str)
# With the dplyr library
library(dplyr)
my_df %>%
pull(column_1) %>% # Get column as "vector"
paste(collapse = ",") %>% # Add "commas"
paste("[", . ,"]") %>% # Add "bracket" (`.` represents the current value, in this case vectors sep by ",")
jsonlite::fromJSON() # Parse json to df
# Result
# name age car
# 1 John 30 <NA>
# 2 Chuck 25 black
# 3 David 54 blue
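A hedged per-row variant with jsonlite (my own sketch, same my_df as above) skips the string concatenation entirely: parse each string separately, drop the JSON nulls, and let dplyr::bind_rows() fill the missing fields with NA:
parsed <- lapply(my_df$column_1, jsonlite::fromJSON)
parsed <- lapply(parsed, function(p) Filter(Negate(is.null), p))  # drop nulls
dplyr::bind_rows(parsed)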
Alternatively, the RcppSimdJson package can be used. Depending on the format of the data file we can
either convert the data row by row using fparse()
or read and convert the file in one go using fload()
Converting the data row by row
If the data file json_data_1.csv has the format
{"name":"John", "age":30, "car":null}
{"name":"Chuck", "car":"black", "age":25}
{"car":"blue", "age":54, "name":"David"}
(note that blue and black have been enclosed in double quotes to obey JSON syntax rules)
the JSON data need to be converted row by row, e.g.
library(magrittr) # piping used to improve readability
readLines("json_data_1.csv") %>%
lapply(RcppSimdJson::fparse) %>%
data.table::rbindlist(fill = TRUE)
name age car
1: John 30 <NA>
2: Chuck 25 black
3: David 54 blue
Reading and converting the file in one go
If the data file json_data_2.csv has the format
[
{"name":"John", "age":30, "car":null},
{"name":"Chuck", "car":"black", "age":25},
{"car":"blue", "age":54, "name":"David"}
]
(note the square brackets and the commas which indicate an array in JSON syntax)
the file can be read and converted by one line of code:
RcppSimdJson::fload("json_data_2.csv")
name age car
1 John 30 <NA>
2 Chuck 25 black
3 David 54 blue
I have a character vector of long names, each consisting of several words joined by a dot as delimiter.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names vary in length, but only the first two words of each name are important.
My goal is to shorten each name to 7 characters: the first 3 characters of each of the first two words, joined by a dot.
These examples come very close to my request, but I do not know how to adapt their code to my case:
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 characters ((\\w{1,3})), then ignore anything that is not a dot ([^\\.]*), match a dot (\\.), and then again match up to 3 characters ((\\w{1,3})). Finally, .* matches anything that comes after. We then keep only the parts in the capture groups, separated by a dot: \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a solution that is less elegant than kath's, but a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
sapply(x, cleaner_fun, USE.NAMES = FALSE)
[1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
It is kind of like a database join, except my file names would take on the new field. I'm not sure how to do that; I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames (or a vector containing all the filenames). See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
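A shorter, vectorized equivalent (assuming the same filenames column):
# Everything before the first underscore is the CIK
df$CIK <- sub("_.*$", "", df$filenames)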
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filenames'
df$new_filename <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
new_filename
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filename
View(df)
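A more compact sketch of the same join (my own, assuming the same df and df2 as above) uses match() instead of sapply(); once the column exists, file.rename() can apply it on disk, assuming the files sit in the working directory:
# Vectorized join: look up each CIK's SIC and prepend it to the filename
df$new_filename <- paste(df2$SIC[match(df$CIK, df2$CIK)], df$filenames, sep = "_")
# Optionally rename the files on disk
file.rename(from = df$filenames, to = df$new_filename)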
I am using R to do some data pre-processing, and here is the problem I am faced with: I read the data in using read.csv(filename, header=TRUE), and the spaces in the variable names became ".". For example, a variable named Full Code became Full.Code in the resulting dataframe. After processing, I use write.xlsx(filename) to export the results, but the variable names have been changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes row indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names=FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.
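For example (the file name here is hypothetical):
df <- read.csv("my_data.csv", check.names = FALSE)
df$`Full Code`   # names containing spaces need back quotes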
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
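For example, assuming the xlsx package's write.xlsx() (the file names are hypothetical):
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)
write.csv(yourdata, "output.csv", row.names = FALSE)   # write.csv works the same way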
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." (or any other punctuation) in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
I have several ASCII files I need to import into R, containing return data for different asset classes. The structure of the ASCII files is as follows (with 2 sample data rows):
How can I import this? I wasn't successful with read.table, but I would like to have it in a data.frame format.
<Security Name> <Ticker> <Per> <Date> <Close>
Test Description,Test,D,19700101,1.0000
Test Description,Test,D,19700102,1.5
If you really want to force the column names into R, you could use something like that:
# Data
dat <- read.csv("/path/to/data.dat", header = FALSE, skip = 1)
dat
V1 V2 V3 V4 V5
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
# Column names
dat.names <- readLines("/path/to/data.dat", n = 1)
names(dat) <- unlist(strsplit(gsub(">", " ", gsub("<", "", dat.names)), " "))
dat
Security Name Ticker Per Date Close
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
Although I think there might be better solutions, e.g. manually adding the header...
You can easily read this data using read.csv. Since your column names are not comma separated, you will need to use the header=FALSE argument and then add the names once the data is in R, or you can manually edit the data before reading it by omitting the <> characters and adding a comma between each column name.
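A minimal sketch of the header=FALSE approach (the file name is hypothetical):
dat <- read.csv("data.dat", header = FALSE, skip = 1)  # skip the <...> header line
names(dat) <- c("Security Name", "Ticker", "Per", "Date", "Close")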