R converts character variables into integer variables automatically

I have a data file that contains several character variables that consist only of digits. They need to remain character variables because some of them start with a 0, and converting to integer/numeric cuts off the leading zeros. For some reason, when I use fwrite to save my data file as CSV and then read it back with fread, the character variables that consist only of digits come back as integer variables. How can I keep R from doing this?
> str(Dataset_Master)
Classes ‘data.table’ and 'data.frame': 12178669 obs. of 4 variables:
$ Date_of_goods_arrival_at_the_customer: int 20160527 20160527 20160527...
$ Sales_document : chr "0505399186" "0505435949"...
$ Warehouse : chr "8150" "8150" "8150" "8150" ...
$ Sold_to_country : chr "DE" "DE" "DE" "DE" ...
- attr(*, ".internal.selfref")=<externalptr>
> ##Save document
> fwrite(Dataset_Master, "Dataset_Master_3.csv")
> ##Load data
> Dataset_Master <- fread("Dataset_Master_3.csv")
|--------------------------------------------------|
|==================================================|
> str(Dataset_Master)
Classes ‘data.table’ and 'data.frame': 12178669 obs. of 4 variables:
$ Date_of_goods_arrival_at_the_customer: int 20160527 20160527 20160527...
$ Sales_document : int 505399186 505435949 505435949...
$ Warehouse : int 8150 8150 8150 8150 8150 8150...
$ Sold_to_country : chr "DE" "DE" "DE" "DE" ...
- attr(*, ".internal.selfref")=<externalptr>
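A CSV file stores no type information, so fread has to guess each column's type from its contents, and all-digit columns are guessed as integer. A minimal sketch of one way to keep them as character, assuming the colClasses argument of data.table's fread; the two-row table is a made-up stand-in for Dataset_Master and the temporary file name is arbitrary:

```r
library(data.table)

# Hypothetical stand-in for Dataset_Master (values taken from the str() output above)
dt <- data.table(
  Sales_document  = c("0505399186", "0505435949"),
  Warehouse       = c("8150", "8150"),
  Sold_to_country = c("DE", "DE")
)

tmp <- tempfile(fileext = ".csv")
fwrite(dt, tmp)

# Name the columns that must stay character; the remaining columns are still guessed
dt2 <- fread(tmp, colClasses = list(character = c("Sales_document", "Warehouse")))
str(dt2)
```

Listing the columns explicitly keeps the leading zeros in Sales_document and also keeps Warehouse as character, even though its values have no leading zero.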

Related

Error with ymd() in r: "All formats failed to parse. No formats found. "

I'm trying to convert a column in a table from integer to date with ymd().
The table is a large one created by merging several CSV files. Its structure is as follows:
Classes ‘data.table’ and 'data.frame': 49229 obs. of 46 variables:
PEP : chr
PN_Oper_M : chr
Desig_Oper_M : chr
Refer_M : int
Estado_M : chr
Conc_SAP : chr
Inc_SAP : chr
Conc_SICAP : chr
Incu_SICAP : chr
Avance : chr
Quemado : chr
RTD : chr
F_Ini_Plan_ : int 20200303 20201021 20210211 20210211 20210211...
F_Fin_Plan_ : int 20200424 20201021 20210211 20210211 20210211...
F_Cie_Plan_ : int 20200430 20201027 20210217 20210217 20210217...
Grupos : chr
Localiza : chr
Zona : chr
AG : chr
GNTs : chr
Empresas : chr
Hitos : chr
Req_Pdtes : int
Falta : int
CADO : chr
UDC : chr
New : int
Estr : int
N_HNC : chr
F_Libe_R_ : int 20191211 20200727 20201202 20201202 20201202...
F_Ini_R_ : int 20200303 20210308 20210216 20210204 20210218...
F_CERR_ : int 20200430 20210323 20210305 20210305 20210316...
F_Fin_Prod_ : int 20200424 20210316 20210216 20210204 20210218...
F_Fin_Cal_ : int 20200429 20210318 20210222 20210204 20210313...
Est_SICAP : chr
Est_SIPLA : chr
ZPLA : int
ZING : int
Comentarios : chr
IntExt : chr
Perfil : chr
H_Planif : chr
DOP_DI_Number : chr
DOP_DI_Status : chr
DOP_DI_Ubicacion: chr
Manual_JC : int
- attr(*, ".internal.selfref")=<externalptr>
All dates were read in as integers and I want to convert them to dates. I have two questions:
- I extract one of the columns and try to convert it with ymd() using the following code:
d1 <- all[ , 30]
d1 <- ymd(d1)
But I get the following warning:
"Warning message:
All formats failed to parse. No formats found."
There are empty values; could they be the problem?
- Is there a quick way to convert the format of several columns at once? The data frame has no headers, so I have to refer to the columns by position.
Many thanks
Hugo
I use ymd() in other contexts. I think what you want is to define the columns as dates.
Try:
df$F_Fin_Cal_ <- as.Date(as.character(df$F_Fin_Cal_), format="%Y%m%d") # df is the name of your data.frame; as.character() is needed because the column is integer
either for each column separately, or with lapply() for all columns at once:
cols <- c("col1", "col2",...) # names of all relevant columns
cols <- c(1,2,3,...) # alternative: address the columns by position
df[cols] <- lapply(df[cols], function(x) as.Date(as.character(x), format="%Y%m%d"))
You could also use the lubridate package:
df$F_Fin_Cal_ <- as.character(df$F_Fin_Cal_) # in case your cols are not char, convert to char
df$F_Fin_Cal_ <- lubridate::as_date(df$F_Fin_Cal_)
I convert the column to character first because lubridate, and even as.Date, work best with character input. I am not sure what the type of your column is.
You can also use lapply like @Clem showed.
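The lapply approach can be sketched on a small made-up data frame with integer dates and missing values. The column names come from the question, but the rows are invented for illustration. Empty values simply become NA and do not break the conversion. As a likely explanation rather than something confirmed in the question: if all is a data.table, all[ , 30] returns a one-column data.table rather than a vector, so all[[30]] may be what ymd() needs.

```r
# Made-up data: two integer date columns with gaps, plus one unrelated column
df <- data.frame(
  F_Ini_R_   = c(20200303L, NA, 20210216L),
  F_Fin_Cal_ = c(20200429L, 20210318L, NA),
  ZPLA       = 1:3
)

# Columns can be addressed by name or by position, e.g. date_cols <- c(1, 2)
date_cols <- c("F_Ini_R_", "F_Fin_Cal_")

# Convert integer -> character -> Date; NA stays NA
df[date_cols] <- lapply(df[date_cols], function(x) {
  as.Date(as.character(x), format = "%Y%m%d")
})

str(df)
```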

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row holds the column names, but the line endings are inconsistent, which causes R to import the data improperly, shifting the data to the left relative to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistent line endings make R think that the records have one more column than the header, so it treats the first column as row names, which is not correct. Adding row.names=NULL does not fix the issue.
If I manually add a $ sign to the header line in the file, the problem is solved, but this is infeasible because the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n", sep="") # sep="" avoids a stray space after the final $
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
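Since the issue affects hundreds of files, the same fix can be wrapped in a function that never rewrites anything on disk. This is only a sketch: read_dollar_file is a hypothetical helper name, the three-column test file is a shortened made-up version of the real data, and it assumes every data row carries exactly one extra trailing $.

```r
# Hypothetical helper: strip one trailing "$" (plus any trailing whitespace)
# from every line, then parse the cleaned lines directly from memory.
read_dollar_file <- function(path) {
  lines <- readLines(path)
  lines <- sub("\\$\\s*$", "", lines)
  read.table(text = lines, header = TRUE, sep = "$",
             comment.char = "", quote = "", stringsAsFactors = FALSE)
}

# Shortened made-up example: header without a trailing $, data row with one
tmp <- tempfile(fileext = ".txt")
writeLines(c("ISR$I_F_COD$REPORTER_COUNTRY",
             "7215577$I$UNITED STATES$"), tmp)

df <- read_dollar_file(tmp)
str(df)

# For many files, something like lapply(list.files(pattern = "\\.txt$"), read_dollar_file)
# could be used; the file pattern here is an assumption.
```

Because the header line has no trailing $, sub() leaves it untouched, so the header can differ from file to file.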

Error in changing data into Date in [R]

I have a problem converting a vector into dates using as.Date.
Data is as below.
> new3<-read.csv("Total Load - Day Ahead _ Actual.csv",stringsAsFactors=F)
> colnames(new3)<- c("Date","Hour","Dayahead","Actual")
> str(new3)
'data.frame': 35044 obs. of 4 variables:
$ Date : chr "01-01-2015" "01-01-2015" "01-01-2015" "01-01-2015" ...
$ Hour : chr "0:00" "0:15" "0:30" "0:45" ...
$ Dayahead: chr "42955" "42412" "41901" "41355" ...
$ Actual : int 42425 42021 42068 41874 41230 40810 40461 40160 39958
...
Here, I tried as.Date:
new3$Date<-as.Date(new3$Date,"%d/%m/%Y")
The order of d, m, Y is right. But when I do this, the Date column shows NA, as below:
> str(new3)
'data.frame': 35044 obs. of 4 variables:
$ Date : Date, format: NA NA NA NA ...
$ Hour : chr "0:00" "0:15" "0:30" "0:45" ...
$ Dayahead: chr "42955" "42412" "41901" "41355" ...
$ Actual : int 42425 42021 42068 41874 41230 40810 40461 40160 39958
...
I don't know what to do to fix it.
Can anyone help me out here? Thank you
This step doesn't look right:
new3$Date<-as.Date(new3$Date,"%d/%m/%Y")
You should try using
new3$Date<-as.Date(new3$Date,"%d-%m-%Y")
The separator in your dates is - and not /.
I'd also suggest looking into the lubridate package. It gives you easy ways to convert character data to dates.
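A small check of the point about separators, using two made-up date strings in the same layout as the Date column: when the format string does not match the input, as.Date returns NA instead of raising an error.

```r
x <- c("01-01-2015", "31-12-2015")  # made-up values, day-month-year with "-"

bad  <- as.Date(x, "%d/%m/%Y")  # format expects "/", input uses "-"
good <- as.Date(x, "%d-%m-%Y")  # format matches the input

print(bad)
print(good)
```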

How do I combine two data frames with different row lengths?

I have two data sets:
str(a)
'data.frame': 525930 obs. of 3 variables:
$ reg_code : int 11542359 10077860 10050401 10988998 11465162 10933454 11170863 11291673 12086780 10248250 ...
$ begin_date: Date, format: "2008-10-01" "1994-06-01" ...
$ pair_id : chr "115423591" "100778601" "100504011" "109889981" ...
str(b)
'data.frame': 618655 obs. of 3 variables:
$ reg_code: int 10077860 10050401 10988998 11465162 10933454 11170863 11291673 10248250 10998100 10837319 ...
$ end_date: Date, format: "2006-03-09" "2000-11-16" ...
$ pair_id : chr "100778601" "100504011" "109889981" "114651621" ...
I tried to merge them:
abc<-merge(x=df1,y=df2,by="id")
but it throws an error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 15930, 28655, 1
This might seem silly, but just to confirm, are you trying to merge based on "pair_id"? It looks like you're using "id" for the by argument.
If you're simply trying to add one to the other and they have the same columns, you can use rbind().
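If the goal is to match rows on the shared key columns, merge() does not need the inputs to have the same number of rows. A sketch on three made-up rows per table, assuming reg_code and pair_id are the intended keys (the question does not confirm this) and with some of the dates invented for illustration:

```r
a <- data.frame(
  reg_code   = c(11542359L, 10077860L, 10050401L),
  begin_date = as.Date(c("2008-10-01", "1994-06-01", "1990-01-01")),  # third date is made up
  pair_id    = c("115423591", "100778601", "100504011"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  reg_code = c(10077860L, 10050401L, 10988998L),
  end_date = as.Date(c("2006-03-09", "2000-11-16", "2010-01-01")),    # third date is made up
  pair_id  = c("100778601", "100504011", "109889981"),
  stringsAsFactors = FALSE
)

# Inner join: only keys present in both tables
inner <- merge(a, b, by = c("reg_code", "pair_id"))

# Full outer join: all keys from either table, NA where a side is missing
outer <- merge(a, b, by = c("reg_code", "pair_id"), all = TRUE)

print(inner)
print(outer)
```

Using all.x = TRUE or all.y = TRUE instead of all = TRUE keeps only the unmatched rows of one side.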

R dataframe define column names at creation

I get monthly price value for the two assets below from Yahoo:
if(!require("tseries") | !require(its) ) { install.packages(c("tseries", 'its')); require("tseries"); require(its) }
startDate <- as.Date("2000-01-01", format="%Y-%m-%d")
MSFT.prices = get.hist.quote(instrument="msft", start= startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
SP500.prices = get.hist.quote(instrument="^gspc", start=startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
I want to put these two into a single data frame with specified column names (pandas allows this now, which is a bit ironic since it took the data.frame concept from R). As below, I assign names to the two time series:
MSFTSP500.prices <- data.frame(msft = MSFT.prices, sp500= SP500.prices )
However, this does not preserve the column names [msft, sp500] I assigned. I need to define the column names in a separate line of code:
colnames(MSFTSP500.prices) <- c("msft", "sp500")
I tried to put colnames and col.names inside the data.frame() call but it doesn't work. How can I define column names while creating the data frame?
I found ?data.frame very unhelpful...
The code fails with an error message indicating no availability of as.its. So I added the missing code (which appears to have been successful after two failed attempts.) Once you issue the missing require() call you can use str to see what sort of object get.hist.quote actually returns. It is neither a dataframe nor a zoo object, although it resembles a zoo-object in many ways:
> str(SP500.prices)
Formal class 'its' [package "its"] with 2 slots
..@ .Data: num [1:180, 1] 1394 1366 1499 1452 1421 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
.. .. ..$ : chr "AdjClose"
..@ dates: POSIXct[1:180], format: "2000-01-02 16:00:00" "2000-01-31 16:00:00" ...
If you run cbind on those two objects you get a regular matrix with dimnames:
> str(cbind(SP500.prices, MSFT.prices) )
num [1:180, 1:2] 1394 1366 1499 1452 1421 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
..$ : chr [1:2] "AdjClose" "AdjClose"
You will still need to change the column names, since there does not seem to be a cbind.its that lets you assign column names. I would be cautious about using the data.frame method, since the resulting object might behave in confusing ways:
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ AdjClose :Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
$ AdjClose.1:Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
The columns are still S4 objects. I suppose that might be useful if you were going to pass them to other its methods, but it could be confusing otherwise. This might be what you were aiming for:
> MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500= as.vector(SP500.prices) ,
row.names= as.character(MSFT.prices@dates) )
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ msft : num 35.1 32 38.1 25 22.4 ...
$ sp500: num 1394 1366 1499 1452 1421 ...
> head(rownames(MSFTSP500.prices))
[1] "2000-01-02 16:00:00" "2000-01-31 16:00:00" "2000-02-29 16:00:00"
[4] "2000-04-02 17:00:00" "2000-04-30 17:00:00" "2000-05-31 17:00:00"
MSFT.prices is a zoo object, which seems to be a data-frame-alike, with its own column name which gets transferred to the object. Confer
tmp <- data.frame(a=1:10)
b <- data.frame(lost=tmp)
which loses the second column name.
If you do
MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500=as.vector(SP500.prices))
then you will get the colnames you want (though you won't get zoo-specific behaviours). Not sure why you object to renaming columns in a second command, though.
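The naming behaviour can be reproduced without the its package. In this sketch, plain one-column matrices with a column name stand in for the price objects (an assumption made for illustration; the real objects are S4 its objects), using the first few values from the output above. When a one-column argument already carries its own column name, data.frame() uses that name instead of the argument name; as.vector() drops it, so the argument name is used.

```r
# Stand-ins for the price series: one-column matrices named "AdjClose"
msft_m  <- matrix(c(35.1, 32, 38.1),    ncol = 1, dimnames = list(NULL, "AdjClose"))
sp500_m <- matrix(c(1394, 1366, 1499),  ncol = 1, dimnames = list(NULL, "AdjClose"))

# The inner column names take precedence over msft= / sp500=
df1 <- data.frame(msft = msft_m, sp500 = sp500_m)
print(names(df1))

# Plain vectors have no inner name, so the argument names are kept
df2 <- data.frame(msft = as.vector(msft_m), sp500 = as.vector(sp500_m))
print(names(df2))

# One-expression alternative that renames at creation time
df3 <- setNames(data.frame(msft_m, sp500_m), c("msft", "sp500"))
print(names(df3))
```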
