I've been trying for quite some time to get my test data to split.
> FDF <- read.csv.ffdf(file='C:\\Users\\William\\Desktop\\R Data\\TestData0812.txt', header = FALSE, colClasses=c('factor','factor','numeric','numeric','numeric','numeric'), sep=',')
> names(FDF)<- c('Date','Time','Open','High','Low','Close')
>
> # ID
> FDF2 <-FDF[1:100,]
> FDF2 <- as.ffdf(FDF2)
> a <- nrow(FDF2)
> # Take section of import for testing
> FDF2[1:3,]
Date Time Open High Low Close
1 1987.08.28 12:00 1.6238 1.6240 1.6237 1.6239
2 1987.08.28 12:01 1.6239 1.6240 1.6235 1.6236
3 1987.08.28 12:02 1.6236 1.6239 1.6235 1.6238
>
> ID <- data.frame(matrix(1:a, nrow = a, ncol=1 ))
> ID <- as.ffdf(ID)
> names(ID) <- c('ID')
> FDF3 <- cbind.ffdf2(ID, FDF2)
> # Create ID column and binds together
> FDF3[1:3,]
ID Date Time Open High Low Close
1 1 1987.08.28 12:00 1.6238 1.6240 1.6237 1.6239
2 2 1987.08.28 12:01 1.6239 1.6240 1.6235 1.6236
3 3 1987.08.28 12:02 1.6236 1.6239 1.6235 1.6238
The file I will be using this on is an ffdf object because it is 700 MB. How can I split the dataset?
My current code is:
T = ffdfdply(FDF3, split(FDF3$ID, rep(1:10,each=10)))
I have tried quite a few variations of this and searched this forum and elsewhere. For simplicity, I've included only the example above.
When run, the code above gives the following error:
Error in ffdfdply(FDF3, split(FDF3$ID, rep(1:10, each = 10))) :
split needs to be the same length as the number of rows in x
I can't understand why a split of rep(1:10, each = 10) does not work on a data set with these dimensions:
> dim(FDF3)
[1] 100 7
I would also like the split to work when the rows do not divide evenly into the splits, for example: T = ffdfdply(FDF3, split(FDF3$ID, rep(1:10,each=3)))
I've been on this for at least 20 hours.
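The length mismatch in the error message can be illustrated with base R alone. split() returns a list with one element per group (10 here), not a vector with one entry per row (100 here), and the latter is what the message asks for. The sketch below uses a plain integer vector as a stand-in for FDF3$ID, which is an assumption for illustration; it only shows how to build a grouping vector of the right length, including the uneven case.

```r
n  <- 100                      # stand-in for nrow(FDF3)
id <- seq_len(n)               # stand-in for FDF3$ID (assumed, not the real ff vector)

# split() gives a list with one element per group, not one entry per row
as_list <- split(id, rep(1:10, each = 10))
length(as_list)

# A grouping vector with one entry per row, for any chunk size
chunk_size <- 3
grp <- ceiling(seq_len(n) / chunk_size)
length(grp)                    # same length as the number of rows
table(grp)[c(1, 34)]           # the final chunk is simply shorter
```

Whether ffdfdply accepts such a vector directly for this data set is not something the sketch checks; it only addresses the length requirement stated in the error.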
I couldn't figure out the correct usage of ffdfdply, and I still don't know whether it would have been the right tool here. However, I have put together a workaround and hope someone finds it useful. It is admittedly ugly, so I'm open to suggestions on how to simplify it and would appreciate your comments.
ffdfEnd <- 5
# Variable: number of rows per split
ffdfrows <- nrow(FDF3)
ffdfStart <- 1
ffdfLoop <- ffdfStart
ffdfSplitSize <- ffdfEnd
# Creates constants and variables
splitNum <- ceiling(ffdfrows / ffdfSplitSize)
# Calculates the number of splits required (rounded up so a short final split is kept)
ffdf.names <- paste('FFDF', ffdfSplitSize, ffdfLoop:splitNum, sep = '.')
# Creates names to be assigned to the resulting tables
for (i in ffdfLoop:splitNum) {
assign(ffdf.names[i], as.ffdf(FDF3[ffdfStart:min(ffdfEnd, ffdfrows), ]))
ffdfStart <- ffdfEnd + 1
ffdfEnd <- ffdfEnd + ffdfSplitSize}
# Loops until every row has been assigned to a split;
# the start moves to ffdfEnd + 1 so consecutive splits do not share a row
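A more compact variant of the same idea keeps the pieces in a named list instead of creating one variable per split with assign(). This is only a sketch on a plain data frame: `dat` is a hypothetical stand-in for FDF3, and with a real ffdf each piece would presumably still need wrapping in as.ffdf() as in the loop above.

```r
# Hypothetical stand-in for FDF3; a plain data frame is assumed here
dat <- data.frame(ID = 1:23, Close = seq(1.60, by = 0.01, length.out = 23))
split_size <- 5

# Chunk number for every row, then the row indices belonging to each chunk
grp <- ceiling(seq_len(nrow(dat)) / split_size)
rows_by_chunk <- split(seq_len(nrow(dat)), grp)

# One data frame per chunk, kept together in a named list
chunks <- lapply(rows_by_chunk, function(rows) dat[rows, , drop = FALSE])
names(chunks) <- paste("FFDF", split_size, seq_along(chunks), sep = ".")

sapply(chunks, nrow)   # rows per chunk; the last one may be shorter
```

A list is easier to loop over afterwards (for example with lapply) than a set of separately named objects.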
I am pulling in individual logs that show changes in production tanks through an API. When trying to create one data frame with all of these logs, I have been running into various issues. Below is a section of my code:
Binded_TL<-do.call(rbind,TL_JSON_TEXT1)
TL_JSON <-purrr::map(Binded_TL, jsonlite::fromJSON)
TL_JSON2 <- TL_JSON[[1]]$Data
The code above works fine: TL_JSON2 prints as a data frame with the correct headers. The problem appears when I use a for loop to try to combine all of the logs:
for (i in 1:length(TL_JSON_TEXT1)){
TL_JSON2[[i]] <- as.data.frame(TL_JSON[[i]]$Data)
}
This is where I run into trouble. I'm not sure whether a for loop is the way to go or whether I should be doing something completely different.
I have tried the following:
TL_JSON2 <- data.frame()
for (i in 1:length(TL_JSON)){
TL_JSON2[[i]] <- paste0(TL_JSON[[i]]$Data)}
But I get the error "replacement has 43 rows, data has 0".
Reproducible code
tank1 <- data.frame(TankName = c("tank1", "tank1", "tank1"), Capacity = c(100,100,100), PercentFull = c(10,13,20), Date = c("1/2/22", "1/3/22", "1/5/22"))
tank2 <- data.frame(TankName = c("tank2"), Capacity = c(200), PercentFull= c(50), Date = c("2/7/22"))
tank3 <- data.frame(TankName = c("tank3", "tank3"), Capacity = c(300, 300), PercentFull = c(80, 60), Date = c("1/3/22","1/6/22"))
Nested_DF <- list(tank1, tank2, tank3)
I have something similar to Nested_DF and I am trying to create a combined df that looks like this:
TankName Capacity PercentFull Date
1 tank1 100 10 1/2/22
2 tank1 100 13 1/3/22
3 tank1 100 20 1/5/22
4 tank2 200 50 2/7/22
5 tank3 300 80 1/3/22
6 tank3 300 60 1/6/22
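For the reproducible example, one common base R approach is to row-bind the list in a single call rather than filling a data frame inside a loop. The sketch below rebuilds Nested_DF from the question; applying it to the API data assumes every element's Data component is a data frame with the same columns, which the question suggests but does not confirm.

```r
tank1 <- data.frame(TankName = c("tank1", "tank1", "tank1"), Capacity = c(100, 100, 100),
                    PercentFull = c(10, 13, 20), Date = c("1/2/22", "1/3/22", "1/5/22"))
tank2 <- data.frame(TankName = "tank2", Capacity = 200, PercentFull = 50, Date = "2/7/22")
tank3 <- data.frame(TankName = c("tank3", "tank3"), Capacity = c(300, 300),
                    PercentFull = c(80, 60), Date = c("1/3/22", "1/6/22"))
Nested_DF <- list(tank1, tank2, tank3)

# Stack all data frames in the list; columns are matched by name
combined <- do.call(rbind, Nested_DF)
rownames(combined) <- NULL
combined
```

For the real logs, something like do.call(rbind, lapply(TL_JSON, function(x) x$Data)) follows the same pattern, under the assumption stated above.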
I have the fmatch function working in a loop, but I was wondering whether it's possible to apply it to the whole vector at once rather than looping through.
Here is the code with the loop, which currently works.
library(readxl)
library(data.table)
library(plyr)
library(tidyr)
library(dplyr)
library(tibble)
library(fastmatch)
library(stringr)
library(magrittr)
library(RcppBDT)
##library(anytime)
## Load time zone data sheet
TZData <- read_excel("TZDataFile.xlsx")
TZData <- as.data.table(TZData)
TZRange <- TZData[,1]
TZRange <- as.data.frame(TZRange)
##Bring in test data
TD <- read_excel("Test dates.xlsx", col_types = c("text", "text"))
TD <- as.data.table(TD)
####Start Time Conversion Code####
## Define variables
Station <- TD[,1] ##Station
GMT <- TD[,2] ##Date/time stamp in GMT to be converted to local
z <- nrow(TD)+0
APLDateTime <- data.frame(RawLocal = double(), RawLocalDateTime = as.Date(character()))
for (i in 1:z) {
STA <- as.character(Station[i,1]) ## Get Station
APCode <- as.integer(fmatch(STA, TZRange[,1])) ## Match station on Time Zone Data sheet
When I try to just run
STA <- as.character(Station[,1]) ## Get Station
APCode <- as.integer(fmatch(STA, TZRange[,1])) ## Match station on Time Zone Data sheet
I get NA_integer_ for APCode.
Sample Data:
> STA
[1] "c(\"LHR\", \"PHL\", \"DFW\", \"PHX\", \"LAX\", \"BCN\")"
> head(TZRange,10)
Code
1 369
2 04G
3 06A
4 06C
5 06N
6 09J
7 0A9
8 0G6
9 0G7
10 0P2
1183 DFW
2748 LHR
3809 PHL
I am looking for a result like
APCode = c(2748, 3809, 1183, etc.)
Thanks for the help.
First, I assume you want STA to be a character vector. As I do not have your full data, I will use the sample provided and convert it:
STA<-"c(\"LHR\", \"PHL\", \"DFW\", \"PHX\", \"LAX\", \"BCN\")"%>%substring(2)%>%str_replace_all("[[:punct:]]","")
> STA
[1] "LHR PHL DFW PHX LAX BCN"
Let me put an extra value so it finds a match to TZRange:
STA=c(STA,"04G")
> STA
[1] "LHR PHL DFW PHX LAX BCN" "04G"
For TZRange I have kept the first 10 values you provided:
> TZRange
code
1 369
2 04G
3 06A
4 06C
5 06N
6 09J
7 0A9
8 0G6
9 0G7
10 0P2
Now you can specify the index of TZRange where matches with STA are found
APCode <- na.omit(fmatch(STA,TZRange[,1]))[1]
> APCode
[1] 2
Hope this helps
I got it working! I had to set both STA and TZRange as.data.frame then run the code APCode <- fmatch(STA[,1], TZRange[,1]) and it worked perfectly. I did first try the na.omit Antonis suggested, but it did not give me the list of indexes like I was looking for. Thanks for the assistance.
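For anyone hitting the same NA: the STA printed in the question is a single string, which is what as.character() produces when it is given a one-column table rather than the column itself. The sketch below reproduces that with small made-up data frames (the codes and their order are assumptions for illustration) and uses base match(), which takes the same first two arguments as fmatch.

```r
# Made-up lookup table and stations, for illustration only
TZRange <- data.frame(Code = c("DFW", "LHR", "PHL"), stringsAsFactors = FALSE)
Station <- data.frame(Station = c("LHR", "PHL", "DFW", "PHX"), stringsAsFactors = FALSE)

# Passing a one-column table collapses it into one deparsed string
collapsed <- as.character(Station[1])
length(collapsed)

# Extracting the column itself keeps one element per station
STA <- Station[[1]]
APCode <- match(STA, TZRange[[1]])   # NA where a station has no entry
APCode
```

With the column extracted as a vector, the lookup returns one index per station in a single call, so the loop is no longer needed.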
I have the following xlsx file df.xlsx which looks like this:
client_id dax dpd
1 2000-05-30 7
1 2000-12-31 6
2 2003-05-21 6
3 1999-12-30 5
3 2000-10-30 6
3 2001-12-30 5
4 1999-12-30 5
4 2002-05-30 6
It's about loan migration from one snapshot to another. The problem is that I don't have all the months in between (e.g. for client_id = 1, dax goes from 2000-05-30 to 2000-12-31). I have tried several approaches without success. For each client_id I need to populate all the months between the dax values and keep the same "dpd" as the first month (e.g. for client_id = 1, dpd = 7 for all months except the last one, "2000-12-31", where dpd = 6). If a client_id appears only once (like client_id = 2), it should remain the same.
(dpd means days past due aka rating bucket)
I have tried this code:
df2 <- data.frame(dax=seq(min(df$dax), max(df$dax), by="month"))
df3 <- merge(x=df2a, y=df, by="dax", all.x=T)
idx <- which(is.na(df3$values))
for (client_id in idx)
df3$values[client_id] <- df3$values[client_id-1]
df3
but the results were not quite what I need.
I appreciate any advice. Thank you very much!
If I understand your question correctly, you want to generate a sequence of dates given the start and end dates.
R code to do this would be (insert values from your dataframe):
seq(as.Date("2017-01-30"), as.Date("2017-12-30"), "month")
Edit after comment:
In this case you can split your data by clients first and then generate the sequences:
new_data <- data.frame()
customerslist <- split(YOURDATA, YOURDATA$id)
for(i in 1:length(customerslist)){
dates <- seq(min(as.Date(customerslist[[i]]$dax)), max(as.Date(customerslist[[i]]$dax)), "month")
id <- rep(customerslist[[i]]$id[1], length(dates))
dpd <- rep(customerslist[[i]]$dpd[1], length(dates))
add <- cbind(id, as.character(dates), dpd)
new_data <- rbind(new_data, add)
}
new_data$V2 <- as.Date(new_data$V2)
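If the last snapshot should keep its own dpd, as the question describes for client_id = 1, a variation is to carry each snapshot's dpd forward until the next snapshot. The sketch below is one possible reading of the requirement, not a definitive answer: it rebuilds the sample data, uses the column name client_id from the question, and labels months as "YYYY-MM" strings (an assumption made here to sidestep day-of-month handling). The helper name fill_months is hypothetical.

```r
df <- data.frame(
  client_id = c(1, 1, 2, 3, 3, 3, 4, 4),
  dax = as.Date(c("2000-05-30", "2000-12-31", "2003-05-21", "1999-12-30",
                  "2000-10-30", "2001-12-30", "1999-12-30", "2002-05-30")),
  dpd = c(7, 6, 6, 5, 6, 5, 5, 6)
)

# Hypothetical helper: one row per month between a client's first and last snapshot
fill_months <- function(d) {
  d  <- d[order(d$dax), ]
  lt <- as.POSIXlt(d$dax)
  m  <- lt$year * 12 + lt$mon            # month counter since January 1900
  all_m <- seq(min(m), max(m))
  src <- findInterval(all_m, m)          # latest snapshot at or before each month
  data.frame(
    client_id = d$client_id[1],
    month = sprintf("%04d-%02d", 1900 + all_m %/% 12, all_m %% 12 + 1),
    dpd = d$dpd[src]
  )
}

filled <- do.call(rbind, lapply(split(df, df$client_id), fill_months))
rownames(filled) <- NULL
filled[filled$client_id == 1, ]
```

A client with a single snapshot comes back as a single row, since its first and last months coincide.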
I have a data set containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time, i.e. (quotation at [TIME] / quotation at yesterday's close) - 1. I started this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
Time<-listTime[j]
for (i in 1:length(listIndexes)) {
Index<-listIndexes[i]
temp[[paste0(Index,"perf",Time)]]<-temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
# other stuff to do but with the same concept
}
}
but it takes quite a long time. Is there a way to get rid of the for loop(s) or to make the creation of those variables quicker? I have read about the apply functions and their relatives, but I do not see whether or how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 40001.2 4005 .... 2000 .... 2003
20140106 4005 4004 40003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
and likewise for the following day.
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
this_row <- df[row_i, ]
temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
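The same loop can also be written without an explicit for, using base R's Map over the column-name vectors. This is a sketch on a tiny made-up temp (the numbers and the two-index, two-time grid are assumptions for illustration), and it mirrors the loop above by dividing by the close column in the same row.

```r
# Tiny made-up stand-in for temp
temp <- data.frame(
  CAC900 = c(4000, 4004), CAC905 = c(4001, 4003), CACclose = c(4005, 4002),
  SP900  = c(2000, 2005), SP905  = c(2001, 2004), SPclose  = c(2003, 2002)
)

# All Index/Time combinations and the column names derived from them
cols <- expand.grid(Index = c("CAC", "SP"), Time = c(900, 905),
                    stringsAsFactors = FALSE)
cols$perf  <- paste0(cols$Index, "perf", cols$Time)
cols$value <- paste0(cols$Index, cols$Time)
cols$close <- paste0(cols$Index, "close")

# One performance vector per combination, then add them all in one assignment
perf <- unname(Map(function(v, cl) temp[[v]] / temp[[cl]] - 1,
                   cols$value, cols$close))
temp[cols$perf] <- perf

names(temp)
```

Map still iterates internally, so this is mainly tidier rather than necessarily faster; the columns are added in a single assignment instead of one at a time.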
An alternative solution would be to reshape your data into a form that makes this transformation much simpler, for instance a long, tidy format with columns for Date, Index, Time, Value and ClosingValue, and then operate directly on just the two relevant columns.
I have a dataset called cpue with 3.3 million rows. I have made a subset of this dataframe called dat.frame. (See below for the heads of cpue and dat.frame.) I have added two new fields to dat.frame: "ssh_vec" and "ssh_mag". Although the heads of cpue and dat.frame look the same, the rest of the rows are not actually in the same order.
head(cpue)
code event Lat Long stat_area Day Month Year id
1 BCO 447602 -43.45 182.73 49 17 3 1995 1
head(dat.frame)
code event Lat Long stat_area Day Month Year id cal.jdate ssh_vec ssh_mag
1 BCO 447602 -43.45 182.73 49 17 3 1995 1 2449857 56.83898 4.499350
Currently, I am running a loop to add the ssh_vec and ssh_mag variables to "cpue" using the unique identifier "id":
cpue$ssh<- NA
cpue$sshmag<- NA
for(i in 1:nrow(dat.frame))
{
ndx<- dat.frame$id[i]
cpue_full$ssh[ndx]<- dat.frame$ssh_vec[i]
cpue_full$sshmag[ndx]<- dat.frame$ssh_mag[i]
}
This has been running over the weekend and is only up to:
i
[1] 132778
... out of:
nrow(dat.frame)
[1] 2797789
Within the loop, there is nothing that looks too computationally demanding. Is there a better alternative?
Are you sure you need a for loop at all? I think this might be equivalent:
cpue_full$ssh[dat.frame$id]<- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag
I would recommend taking a look at data.table. Since I don't have your data, here is a simple example using dummy data.
library(data.table)
N = 10^6
dat <- data.table(
x = rnorm(N),
g = sample(LETTERS, N, replace = TRUE)
)
dat2 <- dat[,list(mx = mean(x)),g]
h = merge(dat, dat2, 'g')
Do you even need to loop? From the code fragment posted it would appear not.
cpue_full$ssh[dat.frame$id] <- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag
should work. A quick (and small) dummy example:
set.seed(666)
ssh <- rnorm(10^4)
datf <- data.frame(id = sample.int(10000L), ssh = NA)
system.time(datf$ssh[datf$id] <- ssh) # user 0, system 0, elapsed 0
# Reset dummy data
datf$ssh <- NA
system.time({
for (i in 1:nrow(datf) ) {
ndx <- datf$id[i]
datf$ssh[ndx] <- ssh[i]
}
} ) # user 2.26, system 0.02, elapsed 2.28
PS - I've not used the data.table package, so I don't follow Ramnath's answer. In general you should avoid loops if possible (see fortune(142) and Circle 3 of The R Inferno).
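One caveat with the vectorised assignments above: indexing by dat.frame$id assumes that each id is also the row number in the full data frame. If that is not guaranteed, the rows can be looked up with match() first. The sketch below uses small made-up data frames (cpue_toy and sub_toy are hypothetical names) in which the ids are deliberately not row numbers.

```r
set.seed(1)
# Made-up data: ids are shuffled and do not equal row positions
cpue_toy <- data.frame(id = sample(101:110), ssh = NA_real_)
sub_toy  <- data.frame(id = c(105, 101, 109), ssh_vec = c(0.5, 1.5, 2.5))

# Row of cpue_toy that holds each id from the subset
rows <- match(sub_toy$id, cpue_toy$id)

# Single vectorised assignment, no loop
cpue_toy$ssh[rows] <- sub_toy$ssh_vec
cpue_toy[!is.na(cpue_toy$ssh), ]
```

If every id in the subset exists exactly once in the full table, this gives the same result as the loop, just in one step.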