I am still new to R and hoping someone can help me convert a whole table back to numeric format.
My data was originally numeric, but after I used the t() function to swap the rows and columns, the formatting changed.
My data (after using t()) looks like this:
> T_DS
2016-04-01 2016-05-01 2016-06-01
DS_Mobile 11940 7711 7690
DS_Desktop 8250 7458 5598
DS_Tablet 2680 1953 1739
and I need to convert the data back to numeric without losing the column or row names.
I have tried data.matrix(T_DS), but that changed the data.
The code below came closest, but I lose the row names and the first column name.
T_DS2 <- as.matrix(apply(T_DS[,-1],2,as.numeric))
row.names(T_DS2) <- T_DS[,1]
> 2016-05-01 2016-06-01
11940 7711 7690
8250 7458 5598
2680 1953 1739
If you guys have any suggestions, they would be immensely helpful!
Thank you
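For reference, one way to do the conversion in a single step, sketched on a toy version of the matrix above: t() on a data frame produces a character matrix, and assigning mode(x) <- "numeric" coerces it in place while keeping the dimnames.

```r
# Toy reconstruction of T_DS: t() on a data frame yields a character matrix.
T_DS <- matrix(c("11940", "7711", "7690",
                 "8250",  "7458", "5598",
                 "2680",  "1953", "1739"),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("DS_Mobile", "DS_Desktop", "DS_Tablet"),
                               c("2016-04-01", "2016-05-01", "2016-06-01")))

mode(T_DS) <- "numeric"   # coerce in place; row and column names survive

T_DS["DS_Mobile", "2016-04-01"]  # 11940, now numeric
```

This avoids the apply()/row.names() round trip entirely, since no columns are dropped along the way.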
BACKGROUND
I have a list of 16 data frames. One of them looks like this, and the others have a similar format. The DateTime column is of class Date, while the Value column is a time series.
> head(train_data[[1]])
DateTime Value
739 2009-07-31 49.9
740 2009-08-31 53.5
741 2009-09-30 54.4
742 2009-10-31 56.0
743 2009-11-30 54.4
744 2009-12-31 55.3
I am forecasting the Value column across all the data frames in this list. The following line of code feeds the data into the UCM model.
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000 , Value/100000 , Value ))
The transform call is used to scale down large values, because UCM has some issues with large values (I don't know why); I learned that from user #KRC in this link.
One data frame was affected because it had large values, which got scaled down. All the other data frames remained unaffected.
> head(train_data[[5]])
DateTime Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
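For reference, a toy demonstration of what that lapply()/transform line does, on two small hypothetical frames (one below and one above the 50000 threshold):

```r
# Two toy data frames standing in for train_data; only the second has
# values above the 50000 threshold used in the question.
train_data <- list(
  data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
             Value = c(49.9, 53.5)),
  data.frame(DateTime = as.Date(c("2009-07-31", "2009-08-31")),
             Value = c(139901, 139492))
)

# Same call as in the question: rescale only the values above 50000.
train_dataucm <- lapply(train_data, transform,
                        Value = ifelse(Value > 50000, Value / 100000, Value))

train_dataucm[[2]]$Value  # rescaled to 1.39901 1.39492
train_dataucm[[1]]$Value  # unchanged: 49.9 53.5
```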
I only know this because I manually checked each of the 15 data frames.
PROBLEM
Is there a function that can identify the data frames affected by the condition I inserted?
It should list the affected data frames and collect them into a list.
If I can do that, I can apply the inverse transformation to the values and recover the actual values.
That way I can produce correct forecasts with minimal human intervention.
I hope I have specified the problem clearly.
Thank you.
Simply check whether any of your values in a data frame is too high:
has_too_high_values <- function(df) {
  any(df$Value > 50000)
}
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
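A self-contained illustration with toy data (two hypothetical frames, only one of which crosses the threshold):

```r
# Two toy data frames; only "b" has values above the 50000 threshold.
train_data <- list(
  a = data.frame(DateTime = as.Date("2009-07-31"), Value = 49.9),
  b = data.frame(DateTime = as.Date("2009-07-31"), Value = 139901)
)

has_too_high_values <- function(df) {
  any(df$Value > 50000)
}

# Filter() keeps only the list elements for which the predicate is TRUE.
affected <- Filter(has_too_high_values, train_data)
names(affected)  # "b" -- the frames the transform would have rescaled
```

Because names are preserved, names(affected) tells you exactly which frames to invert afterwards.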
This takes a bit to explain, and the post itself may be a bit too long for anyone to answer, so please bear with me.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive my poor formatting in separating the datasets; Carlsen and Nakamura are separate data frames.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
You can download the two sets here:
Download Player2
Download Player1
Between the data above and what you see below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want is what you see below: a single data frame with the two combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having annoying difficulty adding more players to this combined set.
For example, I might want to add 5, 10, or 15 more players, such as Kramnik, Anand, or Gelfand (other famous chess players). As you'd expect, the data frame would have 6 columns for 5 players, 11 for 10, and 16 for 15, all ordered nicely by the Year variable.
Fortunately, each player always has fewer than 100 observations, and each individual player has his/her own data frame.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in seq_along(unique(Timed_set_filtered$Name))) {
  assign(paste0("Player", i),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern = '^Player\\d+'))
list2env(lapply(lst, `[`, -2), envir = .GlobalEnv)
lst <- mget(ls(pattern = '^Player\\d+'))
for (i in seq_along(lst)) {
  names(lst[[i]])[names(lst[[i]]) == 'Rating'] <- unique(Timed_set_filtered$Name)[i]
}
This is what my list looks like.
Is there a way to merge (cbind, bind_cols, etc.) the Player"i" data frames in my list by Year, even though they are not all the same length, so that I get a combined set like the merge(Player1, Player2) result you saw earlier?
Here is the diagram again, but it would have to cover many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list entirely and just straight up do
names(Player"i")[names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which renames the Rating column of every data frame whose name starts with "Player", and then
merge(Player1, Player2, Player3, ..., Player99, Player100, by = c("Year"), all = TRUE)
to merge all of the Player"i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After assigning Player1, Player2, ..., Player i into the list, I just joined all of the sets in the list by Year.
The for loop that generates all of the unique datasets:
for (i in seq_along(unique(Timed_set_filtered$Name))) {
  assign(paste0("Player", i),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Put them into a list:
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join, by the common Year column (join_all() comes from the plyr package):
library(plyr)
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets, ..., all = TRUE), it drops certain observations for a reason I have not yet determined; I will have to see why this happens.
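The dropped rows are most likely because plyr::join_all() defaults to type = "left", which keeps only the Years present in the first data frame; passing type = "full" behaves like merge(..., all = TRUE). A dependency-free alternative is to fold base merge() over the list with Reduce(), sketched here on hypothetical ratings with only partial Year overlap:

```r
# Toy stand-ins for the Player"i" frames; Years only partially overlap.
p1 <- data.frame(Year = c(2001, 2002, 2003), Nakamura = c(2364, 2430, 2520))
p2 <- data.frame(Year = c(2002, 2003, 2004), Carlsen  = c(2127, 2279, 2484))
p3 <- data.frame(Year = c(2003, 2004, 2005), Kramnik  = c(2700, 2710, 2720))

# Full outer join of every frame in the list, two at a time.
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE),
                 list(p1, p2, p3))

merged  # 5 rows (2001-2005), one column per player, NA where a rating is missing
```

This keeps every Year from every frame, which is the merge(..., all = TRUE) behaviour the question asked for.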
I have a column of 17000 values that I would like to classify into 48 groups by their ranges (classifying SIC codes into Fama French industries).
df$SIC
[1] 5080 4911 7359 2834 3674 6324 2810 4512 4400 6331 3728 3350 2911 2085 7340 6311 6199 6321 2771 3844 2870 3823 2836 3825
The only way I can think of is to write a bunch of if statements inside a for loop, but that will take forever to run.
for (i in seq_len(nrow(df))) {
  if (df$SIC[i] >= 100 && df$SIC[i] <= 299) { df$FF_IND[i] <- "AGRI" }
}
## and so on for all groups
Do you know of a less taxing way to perform this task?
Many thanks!
Something like:
cut(df$SIC,breaks=c(100,299,...),labels=c("AGRI",...))
A more thorough solution (which I don't have time for right now) would extract the table referenced at http://boards.fool.com/famafrench-industry-codes-26799316.aspx (downloading http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Siccodes49.zip and extracting the table) and find the breakpoints programmatically.
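A runnable toy version with just two labelled buckets: the 100-299 AGRI range comes from the question, the 2000-2046 FOOD range is one illustrative Fama-French bucket, and "OTHER" fills the gap between them. The real call would list all 48 breakpoints and labels.

```r
sic <- c(150, 250, 2010, 2830)

# cut() uses right-closed intervals (lo, hi] by default, so each break pair
# (100,299], (299,2000], (2000,2046] maps to one label.
ff <- cut(sic,
          breaks = c(100, 299, 2000, 2046),
          labels = c("AGRI", "OTHER", "FOOD"))

as.character(ff)  # "AGRI" "AGRI" "FOOD" NA -- 2830 falls outside every break
```

One cut() call classifies all 17000 values at once, so there is no loop at all.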
I have a data frame as seen below with over 1000 rows. I would like to subset the data into bins by 1m intervals (0-1m, 1-2m, etc.). Is there an easy way to do this without finding the minimum depth and using the subset command multiple times to place the data into the appropriate bins?
Temp..ºC. Depth..m. Light time date
1 17.31 -14.8 255 09:08 2012-06-19
2 16.83 -21.5 255 09:13 2012-06-19
3 17.15 -20.2 255 09:17 2012-06-19
4 17.31 -18.8 255 09:22 2012-06-19
5 17.78 -13.4 255 09:27 2012-06-19
6 17.78 -5.4 255 09:32 2012-06-19
Assuming that the name of your data frame is df, do the following:
split(df, findInterval(df$Depth..m., floor(min(df$Depth..m.)):0))
You will then get a list where each element is a data frame containing the rows that have Depth..m. within a particular 1 m interval.
Notice however that empty bins will be removed. If you want to keep them you can use cut instead of findInterval. The reason is that findInterval returns an integer vector, making it impossible for split to know what the set of valid bins is. It only knows the values it has seen and discards the rest. cut on the other hand returns a factor, which has all valid bins defined as levels.
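A runnable toy version of that split()/findInterval() call, using the depths from the question (the other columns are omitted for brevity):

```r
# Toy stand-in for the real data frame, with the depths shown above.
df <- data.frame(Depth..m. = c(-14.8, -21.5, -20.2, -18.8, -13.4, -5.4),
                 Light = rep(255, 6))

# Breakpoints at every whole metre from the deepest reading up to 0.
breaks <- floor(min(df$Depth..m.)):0

# findInterval() assigns each row a bin index; split() groups rows by it.
bins <- split(df, findInterval(df$Depth..m., breaks))

length(bins)  # 6 -- here each depth lands in its own 1 m bin
```

Each element of bins is itself a data frame, so you can lapply() over the result just like over the original data.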