How can I parse a JSON string in a txt file? (R)

I need to parse two columns from a txt file:
first column: ID (a single value);
second column: PARAMETERS (which contains several JSON-like fields; see below).
Here is an example:
ID;PARAMETERS
Y;"""Pr""=>""22"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""99"", ""Opt1""=>""67"", ""Opt2""=>""0"",
S;"""Pr""=>""5"", ""Err""=>""255"", ""Opt1""=>""0"", ""Opt2""=>""0"", ""Opt3""=>""55"", ""Opt4""=>""0"",
K;"""Pr""=>""1"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""21"", ""Opt1""=>""0"", ""Opt2""=>""0"",
P;"""Pr""=>""90"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""20"", ""Opt1""=>""0"", ""Opt2""=>""0"",
My dataset is in CSV format, but I have also tried transforming it into a txt file and into a JSON file; either way, I cannot parse the two columns when I import it into R.
I would like to obtain each parameter in its own column, and if an ID does not have a given parameter I need NA.
Can you help me please?
I have tried this R code, but it does not work:
setwd("D:/")
df <- read.delim("file name", header = TRUE, na.strings = -1)
install.packages("jsonlite")
library(jsonlite)
filepath <- "file name"
prova <- fromJSON(filepath)
Thanks

You can import the CSV directly in RStudio with several functions:
data <- read.csv("your_data.csv")
Otherwise, you can load it through 'Import Dataset' in the Environment tab. This opens a new window in which you can browse to your file without setting a working directory.
To set 0 values to NA, here is an example:
# I create new data for this example;
# you will have your own once you succeed in importing your data.
data <- data.frame(ID = c("H1","H2","H3"), PARAMETERS_Y = c(42,0,3))
data[which(data$PARAMETERS_Y == 0), 2] <- NA
This looks for rows in the parameter column that are equal to 0 and replaces them with NA. Don't forget to change the column name to match the data you loaded previously.
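Note that the PARAMETERS column in the sample above is not valid JSON but hstore-style "key"=>"value" pairs, which is why fromJSON fails on it; a plain-text parse is one way out. A minimal sketch, assuming the doubled quotes in the sample collapse to single quotes once the file is read, and using two made-up inline rows instead of the real file:

```r
# two stand-in lines in the ID;PARAMETERS layout shown above
txt <- c(
  'Y;"Pr"=>"22", "Err"=>"0", "DryT"=>"0"',
  'S;"Pr"=>"5", "Err"=>"255", "Opt1"=>"0"'
)
rows <- lapply(txt, function(line) {
  parts <- strsplit(line, ";", fixed = TRUE)[[1]]
  id <- parts[1]
  # extract every "key"=>"value" pair from the second field
  m <- regmatches(parts[2], gregexpr('"[^"]+"=>"[^"]*"', parts[2]))[[1]]
  kv <- strsplit(gsub('"', "", m), "=>", fixed = TRUE)
  vals <- setNames(vapply(kv, `[`, "", 2), vapply(kv, `[`, "", 1))
  c(ID = id, vals)
})
# union of all keys, so every row gets every column
all_keys <- unique(unlist(lapply(rows, names)))
df <- as.data.frame(do.call(rbind, lapply(rows, function(r) r[all_keys])),
                    stringsAsFactors = FALSE)
names(df) <- all_keys
df
```

Keys missing for a given ID simply come out as NA, as requested.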

Related

How to import a CSV with a last empty column into R?

I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate introduced some changes in its database, and the exported CSV file now contains a last empty column, which breaks my script. Because of this empty last column, read.csv automatically assumes that the first column contains the row names.
As before, there is also a useless first row, which my script removes automatically with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
                  na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that the empty column doesn't have a header. If they had only included the extra comma at the end of the header line, this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example:
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the row names in the data.frame and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
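The trick is easier to see on a self-contained toy example; the header a,b,c and the values below are invented, but the trailing-comma mismatch is the same as in the JCR export:

```r
# header has 3 fields, data rows end in a comma and so have 4 fields
txt <- c("a,b,c", "1,2,3,", "4,5,6,")
dd <- read.table(textConnection(txt), sep = ",",
                 fill = TRUE, header = TRUE, row.names = NULL)
# read.table named the first data column "row.names"; shift the real header
# names one position left, then drop the spare all-NA last column
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[, -ncol(dd)]
dd
```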
Here is a way:
1. Read the data in as text lines;
2. Discard the first line;
3. Remove the trailing comma with sub;
4. Create a text connection;
5. Read the data in from the connection.
The variable fl holds the file name; on my disk I had to set the directory first.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
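The same five steps can be run on an inline toy vector (the junk first line and the columns a, b, c are made up for illustration):

```r
# stand-in for readLines(fl) on the real file
txt <- c("junk first line", "a,b,c", "1,2,3,", "4,5,6,")
txt <- txt[-1]                 # discard the first line
txt <- sub(",$", "", txt)      # strip the trailing comma from each line
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
df1
```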

Importing Excel-tables in R

Is there a way to import a named Excel-table into R as a data.frame?
I typically have several named Excel-tables on a single worksheet, that I want to import as data.frames, without relying on static row - and column references for the location of the Excel-tables.
I have tried to set namedRegion, which is an available argument in several Excel-import functions, but that does not seem to work for named Excel-tables. I am currently using the openxlsx package, which has a function getTables() that returns the Excel-table names from a single worksheet, but not the data in the tables.
Getting your named table takes a little bit of work.
First you need to load the workbook.
library(openxlsx)
wb <- loadWorkbook("name_excel_file.xlsx")
Next you need to extract the name of your named table.
# get the name and the range
tables <- getTables(wb = wb, sheet = 1)
If you have multiple named tables, they are all in tables. My named table is called Table1.
Next you need to extract the column numbers and row numbers, which you will later use to extract the named table from the Excel file.
# get the range
table_range <- names(tables[tables == "Table1"])
table_range_refs <- strsplit(table_range, ":")[[1]]
# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)
# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)
Now you re-read the Excel file with the cols and rows parameters.
# finally read it
my_df <- read.xlsx(xlsxFile = "name_excel_file.xlsx",
                   sheet = 1,
                   cols = table_range_col_num[1]:table_range_col_num[2],
                   rows = table_range_row_num[1]:table_range_row_num[2])
You end up with a data frame with only the content of your named table.
I used this a while ago. I found this code somewhere, but I don't remember where anymore.
This link might be useful for you:
https://stackoverflow.com/a/17709204/10235327
1. Install XLConnect package
2. Save a path to your file in a variable
3. Load workbook
4. Save your data to df
To get the table names you can use the function:
getTables(wb, sheet = 1, simplify = TRUE)
Where:
wb - your workbook
sheet - sheet name (the sheet number works as well)
simplify = TRUE (the default) - the result is simplified to a vector
https://rdrr.io/cran/XLConnect/man/getTables-methods.html
Here's the code (not mine, copied from the topic above and just a bit modified):
require(XLConnect)
sampleFile = "C:/Users/your.name/Documents/test.xlsx"
wb = loadWorkbook(sampleFile)
myTable <- getTables(wb,sheet=1)
df <- readTable(wb, sheet = 1, table = myTable)
You can check these packages:
library(xlsx)
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1)
library(readxl)
Data <- read_excel('YourFile.xlsx', sheet = 1)
Both options allow you to define specific regions to load the data into R.
I use read.xlsx from package openxlsx. For example:
library(openxlsx)
fileA <- paste0(some.directory,'excel.file.xlsx')
A <- read.xlsx(fileA, startRow = 3)
Hope it helps.

How to bind filename as column

The code so far looks like this:
abc <- import_list(dir("MyData/", pattern = "*.xlsx", full.names = TRUE),
                   rbind = TRUE, rbind_label = "source")
Using the rio package, this code imports many Excel files at once, stacking one table under the other. With rbind = TRUE the rows are matched by column name, which avoids data ending up in the wrong columns (e.g. when some tables have more columns than others).
I want the FIRST column to contain the name of the Excel file, so that I know where the data comes from. However, there are two problems with rbind_label = "source":
1. It creates a column, but that column contains the whole (pretty long) path of the file, not just its name.
2. The column is not at the beginning of the newly created table but somewhere in the middle.
How can I solve these two problems?
Assuming the name of the source column is source, this will make it the first column:
abc <- abc[c('source', setdiff(names(abc),'source'))]
This will change that column's value from the full path to the file name:
abc$source <- basename(abc$source)
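Both fixes together on a made-up data frame (the columns A, B and the paths are invented; only source matches the real data):

```r
# stand-in for the imported table: "source" sits in the middle and holds paths
abc <- data.frame(A = 1:2,
                  source = c("MyData/file1.xlsx", "MyData/file2.xlsx"),
                  B = 3:4)
abc <- abc[c("source", setdiff(names(abc), "source"))]  # move source first
abc$source <- basename(abc$source)                      # keep only the file name
abc
```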

Manipulating Excel files using R

I want to edit an existing Excel file using R. For example, ExcelFile_1 has the data, and I need to place the data from ExcelFile_1 into another file called ExcelFile_2, based on the column and row names.
ExcelFile_1:
Store Shipped Qty
1111 100
2222 200
ExcelFile_2:
Store Shipped Qty
1111
2222
If I am working with data frames, I generally do:
ExcelFile_2$`Shipped Qty` <-
  ExcelFile_1$`Shipped Qty`[match(ExcelFile_2$`Store #`, ExcelFile_1$`Store #`)]
The above line works for my data frames, but I do not know how to apply this while writing into a worksheet using the XLConnect package. All I see is the option below:
writeWorksheet(object,data,sheet,startRow,startCol,header,rownames)
I do not want to edit it as a data frame and save that data frame as another worksheet in an existing/new Excel file, because I want to preserve the formatting of ExcelFile_2.
For example: I want to change the value of cell "B2" in ExcelFile_2 using the values from another sheet.
Could anyone please help me with the above problem?
Assuming your files are stored in your home directory and named one.xlsx and two.xlsx, you can do the following:
library(XLConnect)
# Load content of the first sheet of one.xlsx
df1 <- readWorksheetFromFile("~/one.xlsx", 1)
# Do what you like to df1 ...
# Write df1 to the first sheet of two.xlsx
wb2 <- loadWorkbook("~/two.xlsx")
writeWorksheet(wb2, df1, sheet = 1)
saveWorkbook(wb2)
If needed, you can also use startRow and startCol in both readWorksheetFromFile() and writeWorksheet() to specify exact rows and columns and header to specify if you want to read/write the headers.

Convert XML Doc to CSV using R - Or search items in XML file using R

I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in the file. I would like to use R to do so, but because of the format I am having trouble converting it to a data frame or even a list to extract to a csv. I figure I can search easily once it is in csv format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try to convert the doc, but several errors occur with my various attempts.
## Method 1: I tried to convert to a data frame, but the columns are not all the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2: I tried to convert it to a list; the file extracts, but it only lists 2 words from the file.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or a list either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in finding entity ID numbers that match a list I have in an Excel doc.
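If the goal is only to check whether certain ID numbers occur in the file, a plain-text search may be enough and avoids the data-frame conversion entirely. A sketch in base R; the <entity id="..."> element shape and the ID values are assumptions, since the real file isn't shown:

```r
# stand-in for readLines("EJ.XML") on the real file
xml_lines <- c('<entities>',
               '  <entity id="12345">...</entity>',
               '  <entity id="67890">...</entity>',
               '</entities>')
# pull every id="..." attribute value out of the text
hits <- regmatches(xml_lines, regexpr('id="[0-9]+"', xml_lines))
ids_in_file <- gsub('[^0-9]', '', unlist(hits))
# IDs you are looking for (e.g. read from your Excel list)
wanted <- c("12345", "99999")
setNames(wanted %in% ids_in_file, wanted)
```

This tells you, per wanted ID, whether it appears anywhere in the file; for anything more structured (extracting fields per entity), a real XML parse with xmlToList would still be needed.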
