Converting data in a table in R to numeric

I am still new to R and I was hoping someone could help me convert a whole table back to numeric format.
Originally my data was numeric, but when I used the t() function to flip the rows and columns, the values were converted to character.
My data [after using t()] is:
> T_DS
           2016-04-01 2016-05-01 2016-06-01
DS_Mobile       11940       7711       7690
DS_Desktop       8250       7458       5598
DS_Tablet        2680       1953       1739
and I need to convert the data without losing the column or row names.
I have tried data.matrix(T_DS), but it changed the values.
The code below came closest, but I lose the row names and the first column name:
T_DS2 <- as.matrix(apply(T_DS[,-1],2,as.numeric))
row.names(T_DS2) <- T_DS[,1]
      2016-05-01 2016-06-01
11940       7711       7690
 8250       7458       5598
 2680       1953       1739
If you have any suggestions, it would be immensely helpful!
Thank you
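One approach (a sketch, not from the original thread): t() always returns a matrix, so if the table is already a character matrix with its dimnames intact, changing its storage mode converts every cell in place without touching the row or column names. The matrix below is reconstructed from the question's printout:

```r
# Rebuild T_DS as a character matrix -- the state t() typically leaves
# a data.frame in when any column is non-numeric
T_DS <- matrix(c("11940", "7711", "7690",
                 "8250",  "7458", "5598",
                 "2680",  "1953", "1739"),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("DS_Mobile", "DS_Desktop", "DS_Tablet"),
                               c("2016-04-01", "2016-05-01", "2016-06-01")))

# Convert every cell to numeric in place; dimnames are preserved
mode(T_DS) <- "numeric"
```

After this, T_DS is a numeric matrix and T_DS["DS_Mobile", "2016-04-01"] returns 11940.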

Function to identify changes done previously

BACKGROUND
I have a list of 16 data frames. One of them looks like this; all the others have a similar format. The DateTime column is of class Date, while the Value column is a time series.
> head(train_data[[1]])
      DateTime Value
739 2009-07-31  49.9
740 2009-08-31  53.5
741 2009-09-30  54.4
742 2009-10-31  56.0
743 2009-11-30  54.4
744 2009-12-31  55.3
I am performing forecasting for the Value column across all the data frames in this list. The following line of code feeds data into the UCM model.
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000 , Value/100000 , Value ))
The transform function is used to scale down large values because UCM has some issues with them (I don't know why though); I learned that from user #KRC in this link.
One data frame was affected because it had large values, which got divided by 100000. All the other data frames remained unaffected.
> head(train_data[[5]])
      DateTime  Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
I only know this because I manually checked every data frame in the list.
PROBLEM
Is there any function that can identify the data frames affected by the condition I inserted?
The function should list the affected data frames and put them into a list.
If I can do this, then I can reverse the transformation and recover the actual values.
This way I can give correct forecasts with minimal human intervention.
I hope I have specified the problem clearly.
Thank you.
Simply check whether any of the values in a data frame are too high:
has_too_high_values <- function(df)
    any(df$Value > 50000)
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
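As a quick sanity check, here is the same approach run on a toy list (made-up data, not the asker's 16 frames):

```r
has_too_high_values <- function(df) any(df$Value > 50000)

# Two stand-in frames: only the second holds values above the threshold
train_data <- list(
  data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
             Value    = c(49.9, 53.5)),
  data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
             Value    = c(139901, 139492))
)

# Filter() keeps only the elements for which the predicate is TRUE
affected <- Filter(has_too_high_values, train_data)
length(affected)  # 1: just the large-valued frame
```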

Calculating annual return from monthly return, why does it generate Nan?

I have some irregular data below that gives me the monthly return correctly. Now I wish to transform it into an annual geometric return, but for some reason it returns NaN for many values.
I suspect it is because of how the index values in the xts object are ordered, but I do not know how to reorder them (I couldn't find an answer online either).
What is wrong? And how can it be resolved?
Code below:
Browse[2]> Data.xts[,1]
monthly.returns
2012-09-27 -0.02469261
2012-10-30 -0.05129329
2012-11-29 0.05129329
2012-12-30 0.11778304
2013-01-30 -0.14310084
2013-02-27 -0.08004271
2013-03-28 -0.02817088
2013-04-29 -0.02898754
2013-05-30 0.16251893
2013-06-27 0.00000000
2013-07-30 0.52324814
2013-08-29 0.86927677
2013-09-29 -0.01250016
2013-10-30 0.02484600
2013-11-28 -0.06986968
2013-12-30 -0.06453852
2014-01-30 0.17055672
2014-02-27 0.22195942
2014-03-30 0.02342027
2014-04-29 -0.11258822
2014-05-29 0.03061464
2014-06-29 -0.08381867
2014-07-30 -0.06782260
2014-08-28 -0.08541775
2014-09-29 -0.10394609
2014-10-30 -0.29833937
2014-11-27 0.05556985
2014-12-30 -0.22652765
Browse[2]> annualReturn(Data.xts[, 'monthly.returns'] , type = 'log')
yearly.returns
2012-12-30 NaN
2013-12-30 NaN
2014-12-30 1.255605
I don't know the inner workings of the annualReturn() function, whether it is a custom function you wrote or whether it came from some package, but one likely explanation for the NaN values you are seeing is that a logarithmic function is being applied to negative numbers.
Just using base R, we can see that:
> log(-2)
[1] NaN
Taking the log of a negative number yields NaN, and I believe this may be what you are seeing with your data.
The ?annualReturn Description says,
Given a set of prices, return periodic returns.
So it expects prices and you give it returns, which it treats as prices. And as Tim Biegeleisen shows in his answer, taking the log of a negative number results in NaN.
You can use apply.yearly to aggregate your returns directly, assuming your returns are arithmetic.
> apply.yearly(1+x, prod)-1
[,1]
2012-12-30 0.08731379
2013-12-30 1.16831187
2014-12-30 -0.46319006
If they're log returns, you should sum them.
> apply.yearly(x, sum)
[,1]
2012-12-30 0.09309043
2013-12-30 1.15267951
2014-12-30 -0.47633945

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive my poor formatting in separating the datasets; Carlsen and Nakamura are separate data frames.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
Between the code above and below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having very annoying difficulties to add more players to this combined data set.
For example, I might like to add 5, 10, or 15 more players to this combined set, such as Kramnik, Anand, and Gelfand (famous chess players). As you'd expect, with 5 players the data frame would have 6 columns, with 10 it would have 11, and with 15 it would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern = '^Player\\d+'))
list2env(lapply(lst, `[`, -2), envir = .GlobalEnv)
lst <- mget(ls(pattern = '^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  names(lst[[i]])[names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to merge by Year (via cbind, bind_cols, merge, etc.) each of the Player"i" data frames in my list, which are not necessarily equal in length, so that I get a combined/merged set like the merge(Player1, Player2) result shown above?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakmura.
Also, is there a way I can avoid using the list function, and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned Player1, Player2, ..., Playeri into the list, I just joined all of the sets contained in the list by Year.
For loop that generates all of unique datasets.
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Puts them into a list
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join by common value (join_all comes from the plyr package):
library(plyr)
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets...., all = TRUE), it drops some observations: join_all defaults to a left join, so years missing from the first data frame are lost. Passing type = 'full' keeps all rows.
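A base-R alternative that keeps every row is to fold merge over the list with all = TRUE, so no Year present in any player's frame is dropped. A sketch with two made-up, deliberately mismatched frames:

```r
# Toy player frames whose Years only partially overlap
Player1 <- data.frame(Year = c(2001, 2002, 2003), Nakamura = c(2364, 2430, 2520))
Player2 <- data.frame(Year = c(2002, 2003, 2004), Carlsen  = c(2127, 2279, 2484))
lst <- list(Player1, Player2)

# Fold a full outer merge across the whole list, however long it is
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE), lst)
merged  # 4 rows (2001-2004), with NA where a player has no rating
```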

Merging in R based on dates

I'm using getSymbols to import stock data from Yahoo to R.
When I store it in a data frame, it's in the following format.
IDEA.BO.Open IDEA.BO.High IDEA.BO.Low IDEA.BO.Close IDEA.BO.Volume
2007-03-09 92.40 94.25 84.00 85.55 63599400
2007-03-12 85.55 89.95 85.55 87.40 12490900
2007-03-13 88.50 91.25 86.20 89.85 16785000
2007-03-14 87.05 90.85 86.60 87.75 7763800
2007-03-15 90.00 94.00 88.80 91.45 14808200
2007-03-16 92.40 93.65 91.25 92.40 6365600
Now the date column has no name.
I want to import data for two stocks and merge their closing prices (over any given range of rows) on the basis of dates. The problem is, the date column is not being recognized.
I want my final result to be like this.
IDEA.BO.Close BHARTIARTL.BO.Close
2007-03-12 123 333
2007-03-13 456 645
2007-03-14 789 999
I tried the following:
> c <- merge(Cl(IDEA.BO),Cl(BHARTIARTL.BO))
> c['2013-08/']
IDEA.BO.Close BHARTIARTL.BO.Close
2013-08-06 NA 323.40
2013-08-07 NA 326.80
2013-08-08 157.90 337.40
2013-08-09 157.90 337.40
The same data on excel looks like this:
8/6/2013 156.75 8/6/2013 323.4
8/7/2013 153.1 8/7/2013 326.8
8/8/2013 157.9 8/8/2013 337.4
8/9/2013 157.9 8/9/2013 337.4
I don't understand the reason behind the NA values in R, or how to obtain merged data free of NA values.
You need to do more reading about xts and zoo data structures. They are matrices with ordered indices. When you convert them to data.frames, they become lists with a 'rownames' attribute, which print.data.frame displays with no header. The list elements are given names based on the naming of the matrix columns. (I do understand Joshua's visible annoyance at this question, since he has posted many SO examples of how to use xts objects.)
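If you do want a data.frame-style merge anyway, the dates have to be promoted from rownames to a real column first. A minimal sketch with made-up prices (the column name mirrors the question's, but the values are invented):

```r
# A data.frame printed like the question's: the dates live in rownames,
# not in any column, which is why merge() cannot see them
prices <- data.frame(IDEA.BO.Close = c(156.75, 153.10, 157.90),
                     row.names = c("2013-08-06", "2013-08-07", "2013-08-08"))

# Promote the hidden rownames to a proper Date column
prices$Date <- as.Date(rownames(prices))
merged_ready <- prices[, c("Date", "IDEA.BO.Close")]
```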

Checking multiple value ranges in R

I have a column of 17000 values that I would like to classify into 48 groups by their ranges (classifying SIC codes into Fama French industries).
df$SIC
[1] 5080 4911 7359 2834 3674 6324 2810 4512 4400 6331 3728 3350 2911 2085 7340 6311 6199 6321 2771 3844 2870 3823 2836 3825
The only way I can think of is to write a bunch of if-then statements and place them all in a for loop. However, this will take forever to run.
for(i in 1:nrow(df)){
  if(df$SIC[i] >= 0100 && df$SIC[i] <= 0299){ df$FF_IND[i] <- "AGRI" }
}
## and so on for all groups
Do you know of a less taxing way to perform this task?
Many thanks!
Something like:
cut(df$SIC,breaks=c(100,299,...),labels=c("AGRI",...))
A more thorough solution (which I don't have time for right now) would extract the table found via http://boards.fool.com/famafrench-industry-codes-26799316.aspx (downloading http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Siccodes49.zip and extracting the table) and find the breakpoints programmatically.
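A self-contained toy version of the cut() idea (only the AGRI range 0100-0299 comes from the question; the other two labels and breakpoints are placeholders, not the real Fama-French table):

```r
SIC <- c(150, 2834, 2911)

# cut() bins each code into the interval it falls in; with right = TRUE
# (the default) and include.lowest, the intervals here are
# [100,299], (299,2899], (2899,2999]
FF_IND <- cut(SIC,
              breaks = c(100, 299, 2899, 2999),
              labels = c("AGRI", "GROUP2", "GROUP3"),
              include.lowest = TRUE)
as.character(FF_IND)  # "AGRI" "GROUP2" "GROUP3"
```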
