How to read file with irregularly nested quotations? - r

I have a file with irregular quotes like the following:
"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"
It seems irregular because the first column is only wrapped in single quotes, whereas every subsequent column is wrapped in double quotes. I'd like to read it so that every column is imported without quotes (neither in the header, nor the data).
What I've tried is the following:
# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")
And I also tried the suggestions offered here:
# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.

I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.
library(data.table)
library(magrittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
.SDcols = string_cols]
str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
$ INDICATOR : chr "Spi-3"
$ CTY_CODE : chr "AFG"
$ MGN_CODE : chr ""
$ EVENT_NR : int 1
$ EVENT_NR_CR : int 1
$ START_DATE : chr "1952-10-01"
$ PEAK_DATE : chr "1952-11-01"
$ END_DATE : chr "1953-06-01"
$ MAX_EXT_ON : chr "1952-11-01"
$ DURATION : int 9
$ SEVERITY : num 6.98
$ INTENSITY : num 0.78
$ AVERAGE_AREA : num 19.8
$ WIDEST_AREA_PERC: num 44.1
$ SCORE : int 5
$ GRP_ID : chr "1"
- attr(*, ".internal.selfref")=<externalptr>
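The date conversion mentioned above, and an alternative that needs no extra packages, can be sketched as follows. This is a minimal base R sketch, not the answer's method: it assumes that no field legitimately contains a comma or a quote, so that deleting every double quote is safe. The two sample lines are copied from the question; for a real file, `raw_lines` would come from `readLines("file.csv")`.

```r
# Two lines copied from the question; in practice: raw_lines <- readLines("file.csv")
raw_lines <- c(
  '"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""',
  '"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"'
)

# Assumption: quotes never carry meaning in this file, so drop all of them
clean_lines <- gsub('"', '', raw_lines, fixed = TRUE)

# Parse the cleaned text as an ordinary comma-separated file
df <- read.csv(text = clean_lines, stringsAsFactors = FALSE)

# Convert the four date columns from character to Date
date_cols <- c("START_DATE", "PEAK_DATE", "END_DATE", "MAX_EXT_ON")
df[date_cols] <- lapply(df[date_cols], as.Date)

str(df)
```

Because the quotes are removed before parsing, the column types are detected from clean values, and no second cleaning pass over the names or the data is needed.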

Related

Convert chr to time format using chron

I am trying to transform my date dataset from chr to time format.
Here is my code :
time<-f1%>%
mutate(`ymd_hms(Timestamp)`=strftime(f1$`ymd_hms(Timestamp)`, tz="GMT", format = "%H:%M:%S"))%>%
transform(as.numeric(f1$SpotPrice))
chron(times=f1$`ymd_hms(Timestamp)`)
It shows this error:
Error in chron(., times = f1$`ymd_hms(Timestamp)`) :
. and f1$`ymd_hms(Timestamp)` must have equal lengths
If I comment that line out:
time<-f1%>%
mutate(`ymd_hms(Timestamp)`=strftime(f1$`ymd_hms(Timestamp)`, tz="GMT", format = "%H:%M:%S"))%>%
transform(as.numeric(f1$SpotPrice))
#chron(times=f1$`ymd_hms(Timestamp)`)
output is
'data.frame': 10078 obs. of 4 variables:
$ InstanceType : chr " a1.2xlarge" " a1.2xlarge" " a1.2xlarge" " a1.4xlarge" ...
$ ProductDescription: chr " Linux/UNIX" " Red Hat Enterprise Linux" " SUSE Linux" " Linux/UNIX" ...
$ SpotPrice : num 0.0671 0.1971 0.2171 0.1343 0.2643 ...
$ ymd_hms(Timestamp): chr "06:17:23" "06:17:23" "06:17:23" "12:15:54" ...
But if I use this command on a single variable, as follows:
f2<-chron(times=DfUse$`ymd_hms(Timestamp)`)
it works and shows the output
'times' num [1:10078] 06:17:23 06:17:23 06:17:23 12:15:54 12:15:54 ...
- attr(*, "format")= chr "h:m:s"
But I want to apply it within the pipe on my whole dataset, not on a single variable by itself.
Thanks in advance.
There are a few syntax errors and some room for improvement in the code.
Having ymd_hms(Timestamp) = on the left-hand side works, but you don't need it, since it just creates a column with that name. Using Timestamp alone is enough.
Don't use $ in dplyr pipe.
Usually it is better to keep base R functions and dplyr functions separate. transform is in base R but you can use the same mutate pipe to turn SpotPrice to numeric.
library(dplyr)
library(lubridate)  # provides ymd_hms()
time <- f1 %>%
  mutate(Timestamp = strftime(ymd_hms(Timestamp), tz = "GMT", format = "%H:%M:%S"),
         SpotPrice = as.numeric(SpotPrice),
         Timestamp = chron::chron(times = Timestamp))
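For comparison, the first two steps (extracting the time of day and making SpotPrice numeric) can be sketched in base R without any pipe. This is only an illustration: the sample rows are made up, and it assumes Timestamp holds strings of the form "YYYY-MM-DD HH:MM:SS". The chron conversion is left out because it needs the chron package.

```r
# Made-up sample rows standing in for f1 (the real data has 10078 observations)
f1 <- data.frame(
  Timestamp = c("2020-03-01 06:17:23", "2020-03-01 12:15:54"),
  SpotPrice = c("0.0671", "0.1343"),
  stringsAsFactors = FALSE
)

# Parse the strings as date-times in GMT, then keep only the time of day
parsed <- as.POSIXct(f1$Timestamp, tz = "GMT", format = "%Y-%m-%d %H:%M:%S")
f1$Timestamp <- format(parsed, "%H:%M:%S")

# Convert the price column from character to numeric
f1$SpotPrice <- as.numeric(f1$SpotPrice)

str(f1)
```

Each column is assigned back by name, which is the base R counterpart of referring to bare column names inside mutate() rather than using f1$ there.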

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row holds the column names, but the line endings are inconsistent, which causes R to import the data improperly by shifting the data to the left with respect to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings makes R think the records have one more column than the header, so it treats the first column as row names, which is not correct. Adding row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove the trailing $ (and any whitespace after it, such as the space that cat() adds in the dummy file) from all rows:
file_content <- gsub("\\$[[:space:]]*$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
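Since the issue occurs in hundreds of files, the same fix can be wrapped in a function that parses the cleaned lines directly instead of writing a fixed copy to disk. This is a sketch under assumptions: `read_dollar_file` is a hypothetical helper name, and it assumes every data row ends with one surplus `$` while the header does not.

```r
# Hypothetical helper: read one $-separated file whose data rows end in a surplus "$"
read_dollar_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Drop one trailing "$" (plus any trailing whitespace) from each line;
  # the header has no trailing "$", so it is left unchanged
  lines <- sub("\\$[[:space:]]*$", "", lines)
  read.table(text = lines, header = TRUE, sep = "$",
             comment.char = "", quote = "", stringsAsFactors = FALSE)
}

# Demonstration with the header and first record from the question
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY",
  "7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$"
), tmp)

df <- read_dollar_file(tmp)
str(df)

# For many files (the directory name is a placeholder):
# files <- list.files("data_dir", full.names = TRUE)
# all_data <- lapply(files, read_dollar_file)
```

Because the header is read from each file, this also works when the column names differ from file to file.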

Change datatype of multiple columns in dataframe in R

I have the following dataframe:
str(dat2)
data.frame: 29081 obs. of 105 variables:
$ id: int 20 34 46 109 158....
$ reddit_id: chr "t1_cnas90f" "t1_cnas90t" "t1_cnas90g"....
$ subreddit_id: chr "t5_cnas90f" "t5_cnas90t" "t5_cnas90g"....
$ link_id: chr "t3_c2qy171" "t3_c2qy172" "t3_c2qy17f"....
$ created_utc: chr "2015-01-01" "2015-01-01" "2015-01-01"....
$ ups: int 3 1 0 1 2....
...
How can I change the datatype of reddit_id, subreddit_id and link_id from character to factor? I know how to do it one column at a time, but as this is tedious work, I am looking for a faster way to do it.
I have tried the following, without success:
dat2[2:4] <- data.frame(lapply(dat2[2:4], factor))
I took this from this approach, but it ends up giving me an error message: invalid "length" argument
Another approach was to do it this way:
dat2 <- as.factor(data.frame(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Result: Error in sort.list(y): "x" must be atomic for "sort.
After reading the error i also tried it the other way around:
dat2 <- data.frame(as.factor(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Also without success
If any information is missing, I am sorry. I am a newbie to R and Stackoverflow... Thank you for your help!
Try with:
library("tidyverse")
# Convert the three ID columns to factor and keep the result
dat2 <- dat2 %>%
  mutate_at(.vars = vars(reddit_id, subreddit_id, link_id),
            .funs = factor)
To take advantage of partial matching, use
dat2 <- dat2 %>%
  mutate_at(.vars = vars(contains("reddit"), link_id),
            .funs = factor)
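The same conversion can also be sketched in base R by selecting the columns by name. The toy data frame below is made up from the values shown in the question and stands in for the real dat2.

```r
# Toy stand-in for dat2, using values from the question's str() output
dat2 <- data.frame(
  id           = c(20L, 34L, 46L),
  reddit_id    = c("t1_cnas90f", "t1_cnas90t", "t1_cnas90g"),
  subreddit_id = c("t5_cnas90f", "t5_cnas90t", "t5_cnas90g"),
  link_id      = c("t3_c2qy171", "t3_c2qy172", "t3_c2qy17f"),
  stringsAsFactors = FALSE
)

# Columns to convert, selected by name rather than by position
cols <- c("reddit_id", "subreddit_id", "link_id")

# lapply() returns a list of factors that is assigned back column by column
dat2[cols] <- lapply(dat2[cols], factor)

str(dat2)

# With dplyr 1.0.0 or later, across() is the current replacement for mutate_at():
# dat2 <- dplyr::mutate(dat2, dplyr::across(dplyr::all_of(cols), factor))
```

Selecting by name avoids depending on the column order, which matters in a data frame with 105 variables.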

Replace row in data.frame

I have a dataframe which looks like that:
'data.frame': 3036 obs. of 751 variables:
$ X : chr "01.01.2002" "02.01.2002" "03.01.2002" "04.01.2002" ...
$ A: chr "na" "na" "na" "na" ...
$ B: chr "na" "1,827437365" "0,833922973" "-0,838923572" ...
$ C: chr "na" "1,825300613" "0,813299479" "-0,866639008" ...
$ D: chr "na" "1,820482187" "0,821374034" "-0,875963104" ...
...
I have converted the X row into a date format.
dates <- as.Date(dataFrame$X, '%d.%m.%Y')
Now I want to replace this row. The thing is, I cannot create a new dataframe, because after D there are over 1000 more rows...
What would be a possible way to do that easily?
I think what you want is simply:
dataFrame$X <- dates
if what you want to do is replace column X with dates. If you want to remove column X, simply do the following:
dataFrame$X <- NULL
(edited with more concise removal method provided by user #shujaa)

Bad interpretation of #N/A using `fread`

I am using data.table's fread() function to read some data with missing values. The data were generated in Excel, so the missing-value string is "#N/A". However, when I use the na.strings argument, the columns are still read as character, as str() shows. To replicate this, here is the code and data.
Data:
Date,a,b,c,d,e,f,g
1/1/03,#N/A,0.384650146,0.992190069,0.203057232,0.636296656,0.271766148,0.347567706
1/2/03,#N/A,0.461486974,0.500702057,0.234400718,0.072789936,0.060900352,0.876749487
1/3/03,#N/A,0.573541006,0.478062582,0.840918789,0.061495666,0.64301024,0.939575302
1/4/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/5/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/6/03,#N/A,0.66678429,0.897482818,0.569609033,0.524295691,0.132941158,0.194114347
1/7/03,#N/A,0.576835985,0.982816576,0.605408973,0.093177815,0.902145012,0.291035649
1/8/03,#N/A,0.100952961,0.205491093,0.376410642,0.775917986,0.882827749,0.560508499
1/9/03,#N/A,0.350174456,0.290225065,0.428637309,0.022947911,0.7422805,0.354776101
1/10/03,#N/A,0.834345466,0.935128099,0.163158666,0.301310627,0.273928596,0.537167776
1/11/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/12/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/13/03,#N/A,0.325914633,0.68192633,0.320222677,0.249631582,0.605508964,0.739263677
1/14/03,#N/A,0.715104989,0.639040211,0.004186366,0.351412982,0.243570606,0.098312443
1/15/03,#N/A,0.750380716,0.264929325,0.782035411,0.963814327,0.93646428,0.453694758
1/16/03,#N/A,0.282389354,0.762102103,0.515151803,0.194083842,0.102386764,0.569730516
1/17/03,#N/A,0.367802161,0.906878948,0.848538256,0.538705673,0.707436236,0.186222899
1/18/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/19/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/20/03,#N/A,0.79933188,0.214688799,0.37011313,0.189503843,0.294051763,0.503147404
1/21/03,#N/A,0.620066341,0.329949446,0.123685075,0.69027192,0.060178071,0.599825005
(data saved in temp.csv)
Code:
library(data.table)
a <- fread("temp.csv", na.strings="#N/A")
gives (I have larger dataset so neglect the number of observations):
Classes ‘data.table’ and 'data.frame': 144 obs. of 8 variables:
$ Date: chr "1/1/03" "1/2/03" "1/3/03" "1/4/03" ...
$ a : chr NA NA NA NA ...
$ b : chr "0.384650146" "0.461486974" "0.573541006" NA ...
$ c : chr "0.992190069" "0.500702057" "0.478062582" NA ...
$ d : chr "0.203057232" "0.234400718" "0.840918789" NA ...
$ e : chr "0.636296656" "0.072789936" "0.061495666" NA ...
$ f : chr "0.271766148" "0.060900352" "0.64301024" NA ...
$ g : chr "0.347567706" "0.876749487" "0.939575302" NA ...
- attr(*, ".internal.selfref")=<externalptr>
This code works fine
a <- read.csv("temp.csv", header=TRUE, na.strings="#N/A")
Is it a bug? Is there some smart workaround?
The documentation from ?fread for na.strings reads:
na.strings A character vector of strings to convert to NA_character_. By default for columns read as type character ",," is read as a blank string ("") and ",NA," is read as NA_character_. Typical alternatives might be na.strings=NULL or perhaps na.strings = c("NA","N/A","").
You should convert them to numeric yourself after, I suppose. At least this is what I understand from the documentation.
Something like this?
cbind(a[, 1], a[, lapply(.SD[, -1], as.numeric)])
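An in-place variant of that conversion can be sketched with data.table's := operator. This is only a sketch: the small table below is built by hand from a few values in the question to mimic what fread() returned (character columns with NA), and it assumes every column except Date should become numeric.

```r
library(data.table)

# Hand-built stand-in for the fread() result: character columns with NA
a <- data.table(
  Date = c("1/1/03", "1/2/03", "1/4/03"),
  a    = c(NA_character_, NA_character_, NA_character_),
  b    = c("0.384650146", "0.461486974", NA),
  c    = c("0.992190069", "0.500702057", NA)
)

# Every column except Date is assumed to hold numbers
num_cols <- setdiff(names(a), "Date")

# Convert those columns by reference; NA stays NA
a[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

str(a)
```

Updating by reference keeps the Date column untouched and avoids building a second copy of the table with cbind().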
