Function to identify changes done previously - r

BACKGROUND
I have a list of 16 data frames. One of them looks like this, and all the other data frames have a similar format. The DateTime column is of class Date, while the Value column is of a time series class.
> head(train_data[[1]])
DateTime Value
739 2009-07-31 49.9
740 2009-08-31 53.5
741 2009-09-30 54.4
742 2009-10-31 56.0
743 2009-11-30 54.4
744 2009-12-31 55.3
I am forecasting the Value column across all the data frames in this list. The following line of code prepares the data that is fed into the UCM model.
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000 , Value/100000 , Value ))
The transform function is used to reduce large values because UCM has some issues rounding off large values (I don't know why, though). I just understood that from user #KRC in this link
One data frame was affected because it had large values, which the condition above scaled down (divided by 100000). All the other data frames remained unaffected.
> head(train_data[[5]])
DateTime Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
I got to know this because I manually checked each one of the 15 data frames
PROBLEM
Is there any function which can call out the data frames which were affected by the condition I inserted?
The function must be able to identify the affected data frames and put them into a list.
If I can do this, I can reverse the transformation on those values and recover the actual values.
This way I can give the correct forecasts with minimal human intervention.
I hope I am clear in specifying the problem.
Thank You.

Simply check whether any of your values in a data frame is too high:
has_too_high_values = function (df)
    any(df$Value > 50000)
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
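To get the positions (or names) of the affected data frames and then undo the scaling, something along these lines might work. This is only a minimal sketch on toy data: the list below is a hypothetical stand-in for train_data, and the names scale_value, affected, was_scaled and restored are made up for illustration. Because the ifelse in the question works value by value, the sketch records which individual values exceeded the threshold instead of assuming that every value in an affected data frame was scaled.

```r
# Hypothetical stand-in for train_data: two small data frames
train_data <- list(
  small = data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
                     Value = c(49.9, 53.5)),
  large = data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
                     Value = c(139901, 139492))
)

has_too_high_values <- function(df) any(df$Value > 50000, na.rm = TRUE)

# Positions (and names, if the list is named) of the affected data frames
affected <- which(vapply(train_data, has_too_high_values, logical(1)))

# Remember which individual values are about to be scaled
was_scaled <- lapply(train_data, function(df) which(df$Value > 50000))

# Same scaling rule as in the question, written as an explicit function
scale_value <- function(df) {
  df$Value <- ifelse(df$Value > 50000, df$Value / 100000, df$Value)
  df
}
train_dataucm <- lapply(train_data, scale_value)

# Undo the scaling only where it was applied
restored <- Map(function(df, idx) {
  df$Value[idx] <- df$Value[idx] * 100000
  df
}, train_dataucm, was_scaled)
```

The same multiplication by 100000 would presumably be applied to the forecasts of the data frames listed in affected, rather than to the training data itself.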

Related

What is the appropriate frequency and start/end date in R for ts()?

I have the following dataset which I want to make a time series object of for auto.arima forecasting:
head(df)
total_score return
1539 121.77
1074 422.18
901 -229.79
843 96.30
1101 -55.25
961 -48.28
This data set consists of 13104 rows, with each row representing the sentiment score of tweets and the BTC return on an hourly basis, i.e. the first row is 2021-01-01 00:00, the second row is 2021-01-01 01:00, and so on up until 2022-06-30 23:00. I have looked up how many hours fit in this range and that is 13103. How can I set up my ts function such that I can use it for forecasting purposes with auto.arima in R?
Moreover, I understand that auto.arima assumes homoscedastic errors, whereas I need it to work for heteroscedastic errors. I also read that for this, I might use a GARCH model. However, if my auto.arima function results in an order of (2,0,0), does this mean that my GARCH model should be a (0,0,2)?
PS: I am still confused on why my data seems to be stationary, I was under the impression that crypto currencies are most likely NOT stationary, that is, the returns as well. But that is something for another time.
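For the ts() part of the question, one possible starting point is sketched below. It rests on assumptions that are not in the question: timestamps are treated as UTC (no daylight-saving gaps), the seasonal period is taken to be one day (frequency = 24, so one time unit is a day), and ret is random placeholder data standing in for df$return. The sketch also shows that the stated range contains 13104 hourly timestamps, which matches the row count; 13103 is the number of one-hour steps between the first and the last row. It does not address the GARCH question.

```r
# Hourly timestamps covering the range given in the question (UTC assumed)
idx <- seq(as.POSIXct("2021-01-01 00:00", tz = "UTC"),
           as.POSIXct("2022-06-30 23:00", tz = "UTC"),
           by = "hour")
n <- length(idx)   # 13104

# Placeholder series standing in for df$return
set.seed(1)
ret <- rnorm(n)

# ts() only needs a start and a frequency; with frequency = 24 the
# time axis counts days, and each observation is one hour of a day
ret_ts <- ts(ret, start = c(1, 1), frequency = 24)
```

ret_ts could then be passed to auto.arima from the forecast package; idx is kept separately because a ts object does not store calendar timestamps.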

Import data from a subset of subjects in R

I am working with a data set with a combined 300 million rows, split over 5 csv files. The data contains weight measurements of users over 5 years (one file per year). As calculations take ages in this massive data set, I would like to work with a subset of users to create the code. I've used the nrows argument to import only the first 50000 lines of each file. However, one user may have 400 weight measurements in the file for year 2014 but only 240 in year 2015. I therefore don't get the same set of users from each file when I import with the nrows argument. I am wondering whether there is a way to import the data of the first 1000 users in each file?
The data looks like this in all files:
user_ID date_local weight_kg
0002a3e897bd47a575a720b84aad6e01632d2069 2016-01-07 99.2
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-08 99.6
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-10 99.5
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-03-13 99.1
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-04-20 78.2
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-02 78.3
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-07 78.9
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-15 82.1
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-22 82.6
Thanks a lot in advance!
If you have grep on your system you can combine it with pipe and read.table to read only rows that match a pattern. Using your example data, for example, you could read only users 001 and 002 like this. You'll need to add the headers back later as they won't match the pattern.
mydata <- read.csv(pipe('grep "^00[12]" "mydata.csv"'),
                   colClasses = c("character", "Date", "numeric"),
                   header = FALSE)
I'm not sure what the pattern is for your user_ID: you give 001 as an example but state that you want the first 1000. If that is 0001 - 1000, a pattern for grep -E might be something like ^[01][0-9]{3} (with plain grep, the braces need escaping: \{3\}).
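If grep is not available, or the user IDs do not follow a pattern, a pure R alternative is to read each file in chunks of lines and keep only the rows of the wanted users. The following is a rough sketch under several assumptions: the files are comma-separated with a header line, user_ID is the first, unquoted field, and the function names first_user_ids and read_users are hypothetical helpers, not existing functions. The toy file at the end only demonstrates the idea.

```r
# Collect the first n_users distinct IDs, reading the file chunk by chunk
first_user_ids <- function(path, n_users, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)                       # skip the header line
  ids <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    ids <- unique(c(ids, sub(",.*$", "", lines)))  # text before first comma
    if (length(ids) >= n_users) break
  }
  head(ids, n_users)
}

# Keep only the lines of the wanted users, then parse them in one go
read_users <- function(path, keep_ids, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    kept <- c(kept, lines[sub(",.*$", "", lines) %in% keep_ids])
  }
  read.csv(text = c(header, kept),
           colClasses = c("character", "Date", "numeric"))
}

# Toy file standing in for one of the yearly csv files
tmp <- tempfile(fileext = ".csv")
writeLines(c("user_ID,date_local,weight_kg",
             "u1,2016-01-07,99.2", "u1,2016-02-08,99.6",
             "u2,2016-03-13,99.1", "u3,2016-08-15,82.1",
             "u2,2016-04-20,78.2"), tmp)

keep_ids <- first_user_ids(tmp, n_users = 2, chunk_size = 2L)
sub_df <- read_users(tmp, keep_ids, chunk_size = 2L)
```

The IDs would be collected once from a single file and then reused with read_users on every yearly file, so that the same set of users is imported each time.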

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14, the second window would include rows 8 to 21, and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank-sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted, and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.
Consider lapply over a sequence of window start rows, one every 7 rows, through the 365 days of a year (the last day is left out to avoid a single-day final window). Each call returns a one-row data frame holding the Wilcoxon test p-value and a week indicator. Afterwards, row-bind the list items into a single, final data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
    par.sub=rollingpar[i:(i+13), ]
    par.sub=melt(par.sub)
    par.sub=na.omit(par.sub)
    par.sub$variable=as.factor(par.sub$variable)
    data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
               p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
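Since the question mentions 139 rows rather than a full year, the window starts can also be derived from the number of rows, so that no window runs past the end of the data. The variant below is a sketch in base R only: it calls wilcox.test on the two columns directly instead of melting first, which should amount to the same unpaired rank-sum comparison. The rollingpar built here is simulated placeholder data, and window_size, step and starts are illustrative names.

```r
# Simulated stand-in for rollingpar (139 rows, a few NaN values)
set.seed(42)
rollingpar <- data.frame(par14 = rnorm(139, mean = 1300, sd = 100),
                         par15 = rnorm(139, mean = 1250, sd = 100))
rollingpar$par14[c(2, 4, 6)] <- NaN
rollingpar$par15[7] <- NaN

window_size <- 14   # two weeks
step <- 7           # slide by one week

# Start rows of all complete windows: 1, 8, 15, ...
starts <- seq(1, nrow(rollingpar) - window_size + 1, by = step)

wilcoxdf <- do.call(rbind, lapply(starts, function(i) {
  win <- rollingpar[i:(i + window_size - 1), ]
  x <- win$par14[!is.na(win$par14)]   # is.na() is also TRUE for NaN
  y <- win$par15[!is.na(win$par15)]
  data.frame(first_row = i,
             last_row = i + window_size - 1,
             p.value = wilcox.test(x, y)$p.value)
}))
```

With 139 rows this gives 18 complete windows; the last one covers rows 120 to 133, and the remaining rows 134 to 139 are never the start of a full two-week window.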

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive me for my poor formatting of separating the datasets. Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
You can download the two sets here:
Download Player2
Download Player1
Between the code above and below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having very annoying difficulties to add more players to this combined data set.
For example, maybe I would like to add 5, 10, 15 more players to this set. Examples of these players would be Kramnik, Anand, Gelfand ( Examples of famous chess players). As you'd expect, for 5 players, the dataframe would have 6 columns, 10 would have 11, 15 would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
    assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to be able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern='^Player\\d+'))
list2env(lapply(lst,`[`,-2), envir =.GlobalEnv)
lst <- mget(ls(pattern='^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
names(lst[[i]]) [names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to build a table merged on Year (using cbind, bind_cols, merge, etc.) from each of the Player"i" dataframes in my list, which are not necessarily equal in length, so that I get a combined/merged set like the one you saw below the merge(Player1, Player2) set?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list function, and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of the Player1, Player 2....Player i into the list, I just joined all of the sets contained in the list by Year.
For loop that generates all of unique datasets.
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
    assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Puts them into a list
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join by common value
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets...., all= TRUE), it drops certain observations for an unknown reason, will have to see why this happens.
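A likely cause of the dropped observations is that join_all from plyr defaults to type = "left", which keeps only the Year values present in the first data frame; passing type = "full" may be enough. As an alternative that stays in base R and avoids assign altogether, the sketch below splits the long table by player and folds merge over the resulting list with Reduce. The column layout of Timed_set_filtered (Name, Year, Rating) is an assumption, and the six rows are taken from the first ratings shown in the question.

```r
# Assumed long-format layout of Timed_set_filtered (toy subset)
Timed_set_filtered <- data.frame(
  Name = rep(c("Nakamura, Hikaru", "Carlsen, Magnus"), each = 3),
  Year = as.Date(c("2001-01-01", "2002-01-01", "2003-01-01",
                   "2002-01-01", "2003-01-01", "2004-01-01")),
  Rating = c(2364, 2430, 2520, 2127, 2279, 2484),
  stringsAsFactors = FALSE
)

# One data frame per player, kept in a named list instead of Player1..Playeri
by_player <- split(Timed_set_filtered[c("Year", "Rating")],
                   Timed_set_filtered$Name)

# Rename each Rating column after its player
by_player <- Map(function(df, nm) {
  names(df)[names(df) == "Rating"] <- nm
  df
}, by_player, names(by_player))

# Full outer join of all players on Year
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE),
                 by_player)
```

Here merged has one row per Year from 2001 to 2004 and one column per player, with NA where a player has no rating for that year; the same code should work unchanged for any number of players.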

Merging in R based on dates

I'm using getSymbols to import stock data from Yahoo to R.
When I store it in a data frame, it's in the following format.
IDEA.BO.Open IDEA.BO.High IDEA.BO.Low IDEA.BO.Close IDEA.BO.Volume
2007-03-09 92.40 94.25 84.00 85.55 63599400
2007-03-12 85.55 89.95 85.55 87.40 12490900
2007-03-13 88.50 91.25 86.20 89.85 16785000
2007-03-14 87.05 90.85 86.60 87.75 7763800
2007-03-15 90.00 94.00 88.80 91.45 14808200
2007-03-16 92.40 93.65 91.25 92.40 6365600
Now the date column has no name.
I want to import 2 stock data and merge closing prices (between any random set of rows) on the basis of dates. The problem is, the date column is not being recognized.
I want my final result to be like this.
IDEA.BO.Close BHARTIARTL.BO.Close
2007-03-12 123 333
2007-03-13 456 645
2007-03-14 789 999
I tried the following:
> c <- merge(Cl(IDEA.BO),Cl(BHARTIARTL.BO))
> c['2013-08/']
IDEA.BO.Close BHARTIARTL.BO.Close
2013-08-06 NA 323.40
2013-08-07 NA 326.80
2013-08-08 157.90 337.40
2013-08-09 157.90 337.40
The same data on excel looks like this:
8/6/2013 156.75 8/6/2013 323.4
8/7/2013 153.1 8/7/2013 326.8
8/8/2013 157.9 8/8/2013 337.4
8/9/2013 157.9 8/9/2013 337.4
I don't understand the reason behind the NA values in R and the way to obtain a merged data free of NA Values.
You need to do more reading about xts and zoo data structures. They are matrices with indices that are ordered. When you convert them to data.frames they become lists with a 'rownames' attribute, which gets displayed by print.data.frame with no header. The list elements are given names based on the naming of the matrix columns. (I do understand Joshua's visible annoyance at this question since he has posted many SO examples of how to use xts-objects.)
