Create column in R in a large database

Apologies if this question has already been answered; I couldn't find it. Below is everything I have tried. The problem is that the data set is large and my PC (Core i7, 8 GB RAM) cannot complete the calculation. I'm using Microsoft R Open 3.3.2 and RStudio 1.0.136.
I've been trying to create a new column in a large data set in R called tcm.RData (471 MB). I need a column that divides Shape_Area by the sum of Shape_Area within each COD (which I called ShapeSum). I first tried to do it in a single formula but, as that failed, I tried again in two steps: 1) sum Shape_Area by COD and, if that succeeded, 2) divide Shape_Area by ShapeSum.
> str(tcm)
Classes ‘data.table’ and 'data.frame': 26835293 obs. of 15 variables:
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ LAT : num -15.7 -15.7 -15.7 -15.7 -15.7 ...
$ LONG : num -58.1 -58.1 -58.1 -58.1 -58.1 ...
$ UF : chr "MT" "MT" "MT" "MT" ...
$ COD : num 510562 510562 510562 510562 510562 ...
$ AREA_97 : num 1130 1130 1130 1130 1130 ...
$ Shape_Area: num 255266.7 14875 25182.2 5503.9 95.5 ...
$ TYPE : chr "2" "2" "2" "2" ...
$ Nomes : chr NA NA NA NA ...
$ NEAR_DIST : num 376104 371332 371410 371592 371330 ...
$ tc_2004 : chr "AREA_URBANA" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" ...
$ tc_2008 : chr "AREA_URBANA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" ...
$ tc_2010 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_LIMPO" ...
$ tc_2012 : chr "AREA_URBANA" "PASTO_SUJO" "PASTO_SUJO" "PASTO_SUJO" ...
$ tc_2014 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_SUJO" ...
- attr(*, ".internal.selfref")=<externalptr>
> tcm$ShapeSum <- tcm[, Shape_Area := sum(tcm$Shape_Area), by="COD"]
Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb
I also tried the following codes, but all of them failed:
> tcm$ShapeSum <- apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by="COD")
Error in apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by = "COD") :
dim(X) must have a positive length
> tcm$ShapeSum <- mutate(tcm, ShapeSum = sum(Shape_Area), by="COD", package = "dplyr")
Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb
> tcm$ShapeSum <- tcm[, transform(tcm, ShapeSum = sum(Shape_Area)), by="COD"]
> tcm$ShapeSum <- transform(tcm, aggregate(tcm$AreaShape, by=list(Category=tcm$COD), FUN=sum))
Error in aggregate.data.frame(as.data.frame(x), ...): no rows to aggregate
Thank you very much for your attention and for any suggestions to solve this problem.

We can use data.table's assignment operator (:=) to create the column. It is more efficient because the assignment happens in place:
library(data.table)
tcm[, ShapeSum := sum(Shape_Area), by = COD]
Or, as @user20650 suggested, it could be the following (based on the OP's description):
tcm[, ShapeSum := Shape_Area/sum(Shape_Area), by = COD]
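A small sketch on made-up data may help show what the grouped assignment does. The toy table and the column name ShapeShare are illustrative assumptions, not part of the original data. Inside `j`, `sum(Shape_Area)` is evaluated once per COD group, whereas `sum(tcm$Shape_Area)` in the original attempt sums the whole column. The outer `tcm$ShapeSum <- ...` is also unnecessary, because `:=` already updates the table in place.

```r
library(data.table)

# Toy stand-in for tcm: two COD groups with known totals (40 and 20)
toy <- data.table(
  COD        = c(1, 1, 2, 2, 2),
  Shape_Area = c(10, 30, 5, 5, 10)
)

# Step 1: per-group total, added by reference (no copy of the table)
toy[, ShapeSum := sum(Shape_Area), by = COD]

# Step 2: share of each row within its group
toy[, ShapeShare := Shape_Area / ShapeSum]

print(toy)
```

If only the share is needed, the two steps collapse into the single grouped expression shown in the answer above.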

library(data.table)
tcm <- fread("your_tcm_file.txt")
tcm[, newColumn:=oldColumnPlusOne+1]
More:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

Related

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row holds the column names, but the line endings are inconsistent, which causes R to import the data improperly, shifting the data to the left relative to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistent line endings make R think the records have one more column than the header, so it treats the first column as row names, which is not correct. Adding row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
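Since the issue occurs in hundreds of files, the same idea can be wrapped in a helper that skips the intermediate "fixed_" file by passing the cleaned lines to read.table through its text argument. This is only a sketch: the helper name read_dollar and the three toy lines are assumptions made for illustration.

```r
# Hypothetical helper: strip one trailing "$" from every line, then parse.
# For a real file, call it as read_dollar(readLines(path)).
read_dollar <- function(lines) {
  lines <- sub("\\$$", "", lines)  # no-op for lines without a trailing "$"
  read.table(text = lines, header = TRUE, sep = "$",
             comment.char = "", quote = "", stringsAsFactors = FALSE)
}

# Toy input with the same inconsistency: the header has no trailing "$", the records do
toy_lines <- c("ID$CODE$COUNTRY",
               "1$x$UNITED STATES$",
               "2$y$GERMANY$")

df <- read_dollar(toy_lines)
str(df)
```

Because the header is read from each file, this does not require a fixed vector of column names.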

Plotting time (HMS) with ggplot2

I'm trying to plot running sessions. I want to make a ggplot with:
x = distance (2.2 km, 5 km, 10 km, 12.8 km, Ziel)
y = time (HMS)
I have the following data:
'data.frame': 16333 obs. of 6 variables:
$ Numéro : chr "6526" "5427" "6528" "6529" ...
$ X2.2km : chr "00:10:47.4" "00:08:58.2" "00:11:10.4" "00:09:27.3" ...
$ X5km : chr "00:26:05.0" "00:21:46.1" "00:27:13.5" "00:22:35.3" ...
$ X10km : chr "00:56:30.1" "00:45:59.3" "00:58:53.1" "00:47:51.7" ...
$ X12.8km : chr "01:14:24.7" "00:59:50.7" "01:17:35.0" "01:01:42.6" ...
$ Zielzeit: chr "01:37:40.0" "01:16:38.1" "01:41:53.0" "01:19:02.5" ...
The next step is to use the melt function from the reshape2 package, then lubridate:
xx<-melt(xx,id="Numéro")
####Using lubridate ####
xx$value<-hms(xx$value)
My problem is that when I try to draw a simple plot, I get the following message:
> ggplot(xx,aes(variable,value))+geom_point()
Error in x < range[1] : cannot compare Period to Duration:
coerce with 'as.numeric' first.
> ggplot(xx,aes(variable,value))+geom_line()
Error in x < range[1] : cannot compare Period to Duration:
coerce with 'as.numeric' first.
DATASET
xx <- read.table(header=TRUE, text="
Numéro variable value
1 6526 X2.2km 10M 47.4S
2 5427 X2.2km 8M 58.2S
3 6528 X2.2km 11M 10.4S
4 6529 X2.2km 9M 27.3S
5 6530 X2.2km 8M 29.3S")
Thanks for any contributions.
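The error message itself points at a workaround: give ggplot a plain numeric value instead of a Period. One way to do that, sketched here in base R under the assumption that the times are still "HH:MM:SS.s" strings as in the str() output above, is to convert them to seconds before plotting. The helper name to_seconds is hypothetical.

```r
# Hypothetical helper: convert "HH:MM:SS.s" strings to seconds as plain numerics
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

times   <- c("00:10:47.4", "00:08:58.2", "01:14:24.7")
secs    <- to_seconds(times)
minutes <- secs / 60  # often a more readable unit for an axis

print(secs)
```

A numeric column built this way can be mapped to y like any other number. If the values have already been converted with hms(), lubridate's period_to_seconds() should give an equivalent numeric vector.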

R dataframe define column names at creation

I get monthly price value for the two assets below from Yahoo:
if(!require("tseries") | !require(its) ) { install.packages(c("tseries", 'its')); require("tseries"); require(its) }
startDate <- as.Date("2000-01-01", format="%Y-%m-%d")
MSFT.prices = get.hist.quote(instrument="msft", start= startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
SP500.prices = get.hist.quote(instrument="^gspc", start=startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
I want to put these two into a single data frame with specified column names (Pandas allows this now, which is a bit ironic since it took the data.frame concept from R). As below, I combine the two time series and give them names:
MSFTSP500.prices <- data.frame(msft = MSFT.prices, sp500= SP500.prices )
However, this does not preserve the column names [msft, sp500] I assigned. I need to define the column names in a separate line of code:
colnames(MSFTSP500.prices) <- c("msft", "sp500")
I tried to put colnames and col.names inside the data.frame() call but it doesn't work. How can I define column names while creating the data frame?
I found ?data.frame very unhelpful...
The code fails with an error message indicating no availability of as.its. So I added the missing code (which appears to have been successful after two failed attempts.) Once you issue the missing require() call you can use str to see what sort of object get.hist.quote actually returns. It is neither a dataframe nor a zoo object, although it resembles a zoo-object in many ways:
> str(SP500.prices)
Formal class 'its' [package "its"] with 2 slots
..@ .Data: num [1:180, 1] 1394 1366 1499 1452 1421 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
.. .. ..$ : chr "AdjClose"
..@ dates: POSIXct[1:180], format: "2000-01-02 16:00:00" "2000-01-31 16:00:00" ...
If you run cbind on those two objects you get a regular matrix with dimnames:
> str(cbind(SP500.prices, MSFT.prices) )
num [1:180, 1:2] 1394 1366 1499 1452 1421 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
..$ : chr [1:2] "AdjClose" "AdjClose"
You will still need to change the column names, since there does not seem to be a cbind.its that lets you assign them. I would be cautious about the data.frame approach, since the resulting object's behavior may be confusing:
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ AdjClose :Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
$ AdjClose.1:Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
The columns are still S4 objects. I suppose that might be useful if you were going to pass them to other its-methods but could be confusing otherwise. This might be what you were shooting for:
> MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500= as.vector(SP500.prices) ,
row.names= as.character(MSFT.prices@dates) )
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ msft : num 35.1 32 38.1 25 22.4 ...
$ sp500: num 1394 1366 1499 1452 1421 ...
> head(rownames(MSFTSP500.prices))
[1] "2000-01-02 16:00:00" "2000-01-31 16:00:00" "2000-02-29 16:00:00"
[4] "2000-04-02 17:00:00" "2000-04-30 17:00:00" "2000-05-31 17:00:00"
MSFT.prices is a zoo object, which seems to be data-frame-like, with its own column name that gets carried over into the new object. Compare:
tmp <- data.frame(a=1:10)
b <- data.frame(lost=tmp)
which drops the name lost: the column keeps its original name, a.
If you do
MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500=as.vector(SP500.prices))
then you will get the colnames you want (though you won't get zoo-specific behaviours). Not sure why you object to renaming columns in a second command, though.
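The naming behaviour can be reproduced in base R without the price data. In the sketch below, one-column matrices with their own column name stand in for the downloaded objects; the values and the variable names are made up for illustration. It also shows setNames() as one way to build and name the data frame in a single expression.

```r
# One-column matrices carrying their own column name, standing in for the price objects
msft_m  <- matrix(c(35.1, 32.0, 38.1), ncol = 1, dimnames = list(NULL, "AdjClose"))
sp500_m <- matrix(c(1394, 1366, 1499), ncol = 1, dimnames = list(NULL, "AdjClose"))

# The inner column names take precedence over the argument names
kept <- data.frame(msft = msft_m, sp500 = sp500_m)
print(names(kept))

# Stripping the dimensions first lets the argument names through
fixed <- data.frame(msft = as.vector(msft_m), sp500 = as.vector(sp500_m))
print(names(fixed))

# Alternative: name the columns in the same expression that creates the data frame
renamed <- setNames(data.frame(msft_m, sp500_m), c("msft", "sp500"))
print(names(renamed))
```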

Fail to create couponbonds object in termstrc package using R

I am trying to use the R package termstrc to estimate the term structure. To do that, I have to prepare the data in the couponbonds class that the package requires. I used some fake data to rule out problems with the real data. Despite many attempts, it still doesn't work.
Any idea what is going wrong?
structure of the official demo data which works
data("govbonds")
str(govbonds)
List of 3
$ GERMANY:List of 8
..$ ISIN : chr [1:52] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
..$ MATURITYDATE: Date[1:52], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ ISSUEDATE : Date[1:52], format: "2002-08-14" "2006-03-08" "2003-04-11" ...
..$ COUPONRATE : num [1:52] 0.0425 0.03 0.03 0.0325 0.0413 ...
..$ PRICE : num [1:52] 100 99.9 99.8 99.8 100.1 ...
..$ ACCRUED : num [1:52] 4.09 2.66 2.43 2.07 2.39 ...
..$ CASHFLOWS :List of 3
.. ..$ ISIN: chr [1:384] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
.. ..$ CF : num [1:384] 104 103 103 103 104 ...
.. ..$ DATE: Date[1:384], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ TODAY : Date[1:1], format: "2008-01-30"
#another two are omitted here
- attr(*, "class")= chr "couponbonds"
> ns_res <- estim_nss(govbonds, c("GERMANY"), method = "ns",tauconstr=list(c(0.2, 5, 0.1)))
[1] "Searching startparameters for GERMANY"
beta0 beta1 beta2 tau1
5.008476 -1.092510 -3.209695 2.400100
my code to prepare fake data
bond=list()
bond$CHINA=list()
n=30*12#suppose I have n bond
enddate=as.Date('2014/11/7')
isin=sprintf('DE%010d',1:n)#some fake ISIN
bond$CHINA$ISIN=isin
bond$CHINA$MATURITYDATE=enddate+(1:n)*30
bond$CHINA$ISSUEDATE=rep(enddate,n)
bond$CHINA$COUPONRATE=rep(5/100,n)
bond$CHINA$PRICE=rep(100,n)
bond$CHINA$ACCRUED=rep(0,n)
bond$CHINA$CASHFLOWS=list()
bond$CHINA$CASHFLOWS$ISIN=isin
bond$CHINA$CASHFLOWS$CF=100+(1:n)*5/12
bond$CHINA$CASHFLOWS$DATE=enddate+(1:n)*30
bond$CHINA$TODAY=enddate
class(bond)='couponbonds'
ns_res <- estim_nss(bond, c("CHINA"), method = "ns",tauconstr=list(c(0.2, 5, 0.1)))
the output
Error in `colnames<-`(`*tmp*`, value = c("DE0000000001", "DE0000000002", :
attempt to set 'colnames' on an object with less than two dimensions
The problem was finally solved by adding one cashflow with amount zero to CASHFLOWS$CF.
Put another way, at least one bond must have at least two cashflows.
You may then hit another error raised by the uniroot function. Be sure to include only cashflows dated after TODAY; termstrc does not filter the cashflows by TODAY for you.
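The filtering step can be sketched in base R as follows. The list below only mimics the ISIN/CF/DATE layout shown in the str() output of the demo data; the values are made up, and this illustrates just the date filter, not a complete couponbonds object.

```r
today <- as.Date("2014-11-07")

# Toy cashflow list in the ISIN / CF / DATE layout (values are illustrative)
cashflows <- list(
  ISIN = c("DE0000000001", "DE0000000001", "DE0000000002"),
  CF   = c(5, 105, 102.5),
  DATE = as.Date(c("2014-06-01", "2015-06-01", "2015-01-15"))
)

# Keep only the cashflows dated strictly after today, in all three vectors at once
keep <- cashflows$DATE > today
cashflows_future <- lapply(cashflows, function(v) v[keep])

str(cashflows_future)
```

Applying the same logical index to all three vectors keeps them aligned, so each remaining cashflow still matches its ISIN and date.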

read.zoo works but then as.xts fails with "currently unsupported data type"

I've a csv file of daily bars, with just two lines:
"datestamp","Open","High","Low","Close","Volume"
"2012-07-02",79.862,79.9795,79.313,79.509,48455
(That file was an xts that was converted to a data.frame then passed on to write.csv)
I load it with this:
z=read.zoo(file='tmp.csv',sep=',',header=T,format = "%Y-%m-%d")
And it is fine as print(z) shows:
Open High Low Close Volume
2012-07-02 79.862 79.9795 79.313 79.509 48455
But then as.xts(z) gives: Error in coredata.xts(x) : currently unsupported data type
Here is the str(z) output:
‘zoo’ series from 2012-07-02 to 2012-07-02
Data:List of 5
$ : num 79.9
$ : num 80
$ : num 79.3
$ : num 79.5
$ : int 48455
- attr(*, "dim")= int [1:2] 1 5
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Open" "High" "Low" "Close" ...
Index: Date[1:1], format: "2012-07-02"
I've so far confirmed it is not that 4 columns are num and one column is int, as I still get the error even after removing the Volume column. But, then, what could that error message be talking about?
As Sebastian pointed out in the comments, the problem is in the single row. Specifically the coredata is a list when read.zoo reads a single row, but something else (a matrix?) when there are 2+ rows.
I replaced the call to read.zoo with the following, and it works fine whether 1 or 2+ rows:
d=read.table(fname,sep=',',header=T)
x=as.xts(subset(d,select=-datestamp),order.by=as.Date(d$datestamp))
