How to reverse the order of a dataframe in R

I've looked endlessly for this and somehow nothing has solved this simple problem.
I have a dataframe called Prices with 4 columns: one is a list of historical dates, and the other 3 are lists of prices for products.
1 10/10/2016 53.14 50.366 51.87
2 07/10/2016 51.93 49.207 50.38
3 06/10/2016 52.51 49.655 50.98
4 05/10/2016 51.86 49.076 50.38
5 04/10/2016 50.87 48.186 49.3
6 03/10/2016 50.89 48.075 49.4
7 30/09/2016 50.19 47.384 48.82
8 29/09/2016 49.81 46.924 48.4
9 28/09/2016 49.24 46.062 47.65
10 27/09/2016 46.52 43.599 45.24
The list is 252 prices long. How can I store my output so that the latest date, and its corresponding prices, appear at the bottom of the list?

Another tidyverse solution, and I think the simplest one, is:
df %>% map_df(rev)
or, using just purrr::map_df, we can do map_df(df, rev).
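For example, on a small dummy frame (a quick sketch; the same call works on the Prices dataframe from the question):
library(purrr)
toy <- data.frame(date = c("10/10/2016", "07/10/2016", "06/10/2016"),
                  price = c(53.14, 51.93, 52.51))
map_df(toy, rev)
# rows come back reversed: 06/10/2016 first, 10/10/2016 last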

If you just want to reverse the order of the rows in a dataframe, you can do the following:
df <- df[seq(dim(df)[1], 1), ]

Just for completeness' sake: there is actually no need to call seq here. You can just use R's : operator:
### Create some sample data
n=252
sampledata<-data.frame(a=sample(letters,n,replace=TRUE),b=rnorm(n,1,0.7),
c=rnorm(n,1,0.6),d=runif(n))
### Compare some different ways to reorder the dataframe
myfun1<-function(df=sampledata){df<-df[seq(nrow(df),1),]}
myfun2<-function(df=sampledata){df<-df[seq(dim(df)[1],1),]}
myfun3<-function(df=sampledata){df<-df[dim(df)[1]:1,]}
myfun4<-function(df=sampledata){df<-df[nrow(df):1,]}
### Microbenchmark the functions
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 63.994 67.686 117.61797 71.3780 87.3765 5818.494 1000
myfun2() 63.173 67.686 99.29120 70.9680 87.7865 2299.258 1000
myfun3() 56.610 60.302 92.18913 62.7635 76.9155 3241.522 1000
myfun4() 56.610 60.302 99.52666 63.1740 77.5310 4440.582 1000
The fastest way in my trial here was to use df <- df[dim(df)[1]:1, ]. However, using nrow instead of dim is only slightly slower, making this a question of personal preference.
Using seq here definitely slows the process down.
UPDATE September 2018:
From a speed perspective there is little reason to use dplyr here. For maybe 90% of users the base R functionality should suffice. The other 10% need dplyr for querying a database or for translating the code into another language.
## hmhensen's function
dplyr_fun<-function(df=sampledata){df %>% arrange(rev(rownames(.)))}
microbenchmark::microbenchmark(myfun3(),myfun4(),dplyr_fun(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun3() 55.8 69.75 132.8178 103.85 139.95 8949.3 1000
myfun4() 55.9 68.40 115.6418 100.05 135.00 2409.1 1000
dplyr_fun() 1364.8 1541.15 2173.0717 1786.10 2757.80 8434.8 1000

Yet another tidyverse solution is:
df %>% arrange(desc(row_number()))

Another option is to order the dataframe by the column you want to sort it by:
> data[order(data$Date), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-09-27 00:00:00 46.5 43.6 45.2
2 2016-09-28 00:00:00 49.2 46.1 47.6
3 2016-09-29 00:00:00 49.8 46.9 48.4
4 2016-09-30 00:00:00 50.2 47.4 48.8
5 2016-10-03 00:00:00 50.9 48.1 49.4
6 2016-10-04 00:00:00 50.9 48.2 49.3
7 2016-10-05 00:00:00 51.9 49.1 50.4
8 2016-10-06 00:00:00 52.5 49.7 51.0
9 2016-10-07 00:00:00 51.9 49.2 50.4
10 2016-10-10 00:00:00 53.1 50.4 51.9
Then, if you are so inclined to flip the order, reverse it:
> data[rev(order(data$Date)), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-10-10 00:00:00 53.1 50.4 51.9
2 2016-10-07 00:00:00 51.9 49.2 50.4
3 2016-10-06 00:00:00 52.5 49.7 51.0
4 2016-10-05 00:00:00 51.9 49.1 50.4
5 2016-10-04 00:00:00 50.9 48.2 49.3
6 2016-10-03 00:00:00 50.9 48.1 49.4
7 2016-09-30 00:00:00 50.2 47.4 48.8
8 2016-09-29 00:00:00 49.8 46.9 48.4
9 2016-09-28 00:00:00 49.2 46.1 47.6
10 2016-09-27 00:00:00 46.5 43.6 45.2

If you want to do this in base R, use:
df <- df[rev(seq_len(nrow(df))), , drop = FALSE]
All other base R solutions posted here will have problems in the edge cases of zero-row data frames (seq(0, 1) returns c(0, 1), which is why we use seq_len) or single-column data frames (data.frame(a = 7:9)[3:1, ] returns the vector 9:7 instead of a data frame, which is why we use , drop = FALSE).
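To see both edge cases concretely (a small sketch with made-up frames):
empty <- data.frame(a = numeric(0))
seq(nrow(empty), 1) # c(0, 1): would index a phantom row
empty[rev(seq_len(nrow(empty))), , drop = FALSE] # still a 0-row data frame

one_col <- data.frame(a = 7:9)
one_col[3:1, ] # drops to the vector 9 8 7
one_col[3:1, , drop = FALSE] # stays a data frame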

If you want to stick with base R, you could also use lapply().
do.call(cbind, lapply(df, rev))
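One caveat: cbind() produces a matrix, so a dataframe with mixed column types (like a date column next to numeric prices) gets coerced to a common type. A small variation keeps the column classes (a sketch):
df_rev <- as.data.frame(lapply(df, rev)) # rev each column, rebuild a data frame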

Related

How do I convert a data frame dataset to time series?

I seem to have some trouble converting my data frame data into a time series. I have a typical data set consisting of date, export quantity, GDP, FDI etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature (°C)` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date is recognized as such, and I've used the as.ts function to convert the dataset. However, after using the as.ts function, the date column is all muddled up into a random number and not a date anymore. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature (°C) Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27
In short, nothing is wrong. And while this response should really be a comment, I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it. And since you imported from excel originally, those origins may not line up if you tried to cross check it (see below).
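You can check the underlying numbers yourself; a Date counts days and a POSIXct counts seconds since the origin (a quick sketch):
as.numeric(as.Date("2000-05-01"))
# [1] 11078  (days since 1970-01-01)
as.numeric(as.POSIXct("2000-05-01", tz = "GMT"))
# [1] 957139200  (seconds since 1970-01-01, matching your output above)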
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
"2000-06-01",
"2000-07-01",
"2000-08-01",
"2000-09-01",
"2000-10-01",
"2000-11-01")),
maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the "origin" is January 1, 1970, and we can see this in numerical dates from the time series object and convert them back into date formats:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that if you import data from Excel, Excel's origin is December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")), so if you tried that you get the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31"
The function worked as it's supposed to. Keeping the date format you're familiar with isn't practical for computation, so it converts the dates to a different value, usually something like the number of days (or minutes, or seconds) since a certain origin, usually Jan. 1, 1970. For example, here is a little set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see the dates, which are 30 days apart, are converted to a series of values that are 30 integers apart.
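If you want a time-series object whose index stays a readable date rather than an integer, a zoo series is a common alternative to as.ts() (a sketch, assuming the zoo package is installed, reusing del1 from above):
library(zoo)
z <- zoo(rnorm(4), order.by = del1) # dummy values indexed by the four dates
z # the index prints as dates, not integers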

Replace from NA to random values

I want to replace NAs with random values. This data frame has columns like "DayOfWeek", and I don't know how I can complete this data frame. I tried the missForest function, but I think that function only works on numeric columns. Do you have any idea how I can complete all of the columns?
travel <- read.csv("https://openmv.net/file/travel-times.csv")
library(missForest)
summary(travel)
set.seed(82)
travel1 <- prodNA(travel, noNA = 0.2)
travel2 <- missForest(travel1)
You can use the imputeTS package for inserting random values into your time series. The function na_random can be used for this. The function works on numeric columns (the other columns will be left untouched, which might be useful, since you probably do not need random text for the comments column).
You can call
library("imputeTS")
na_random(yourData)
and the function will look for the lowest and highest value of each column and insert random values between these bounds for you.
But you can also define your own bounds for the random values like this:
library("imputeTS")
na_random(yourData, lower_bound = 0, upper_bound = 25)
For your data this could look like this:
library("imputeTS")
# To read the input correctly and have the right data types
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
travel$FuelEconomy <- as.numeric(travel$FuelEconomy)
# To perform the missing data replacement
travel <- na_random(travel)
First, if you want to read "" strings as NAs, you need an additional argument na.strings = "" in read.csv. Then, do you mean replacing an NA observation of a variable with the other random observation of the same variable? If so, consider the following procedure:
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
set.seed(82)
res <- data.frame(lapply(travel, function(x) {
  is_na <- is.na(x)
  replace(x, is_na, sample(x[!is_na], sum(is_na), replace = TRUE))
}))
res looks like this
Date StartTime DayOfWeek GoingTo Distance MaxSpeed AvgSpeed AvgMovingSpeed FuelEconomy TotalTime MovingTime Take407All Comments
1 1/6/2012 16:37 Friday Home 51.29 127.4 78.3 84.8 8.5 39.3 36.3 No Medium amount of rain
2 1/6/2012 08:20 Friday GSK 51.63 130.3 81.8 88.9 8.5 37.9 34.9 No Put snow tires on
3 1/4/2012 16:17 Wednesday Home 51.27 127.4 82.0 85.8 8.5 37.5 35.9 No Heavy rain
4 1/4/2012 07:53 Wednesday GSK 49.17 132.3 74.2 82.9 8.31 39.8 35.6 No Accident blocked 407 exit
5 1/3/2012 18:57 Tuesday Home 51.15 136.2 83.4 88.1 9.08 36.8 34.8 No Rain, rain, rain
6 1/3/2012 07:57 Tuesday GSK 51.80 135.8 84.5 88.8 8.37 36.8 35.0 No Backed up at Bronte
7 1/2/2012 17:31 Monday Home 51.37 123.2 82.9 87.3 - 37.2 35.3 No Pumped tires up: check fuel economy improved?
8 1/2/2012 07:34 Monday GSK 49.01 128.3 77.5 85.9 - 37.9 34.3 No Pumped tires up: check fuel economy improved?
9 12/23/2011 08:01 Friday GSK 52.91 130.3 80.9 88.3 8.89 39.3 36.0 No Police slowdown on 403
10 12/22/2011 17:19 Thursday Home 51.17 122.3 70.6 78.1 8.89 43.5 39.3 No Start early to run a batch

BatchGetSymbols - reshape output

I'd like to use the advantages of BatchGetSymbols.
Any advice on how I can best manipulate the output to get the format below?
symbols_RP <- c('VDNR.L','VEUD.L','VDEM.L','IDTL.L','IEMB.L','GLRE.L','IGLN.L')
#Setting price download date range
from_date <- as.Date('2019-01-01')
to_date <- as.Date(Sys.Date())
get.symbol.adjclose <- function(ticker) {
l.out <- BatchGetSymbols(symbols_RP, first.date = from_date, last.date = to_date, do.cache=TRUE, freq.data = "daily", do.complete.data = TRUE, do.fill.missing.prices = TRUE, be.quiet = FALSE)
return(l.out$df.tickers)
}
prices <- get.symbol.adjclose(symbols_RP)
Output of BatchGetSymbols:
$df.tickers
price.open price.high price.low price.close volume price.adjusted ref.date ticker ret.adjusted.prices ret.closing.prices
1 60.6000 61.7950 60.4000 61.5475 4717 60.59111 2019-01-02 VDNR.L NA NA
2 60.7200 60.9000 60.5500 60.6650 22015 59.72233 2019-01-03 VDNR.L -1.433838e-02 -1.433852e-02
3 60.9050 60.9500 60.9050 61.8875 1010 60.92583 2019-01-04 VDNR.L 2.015164e-02 2.015165e-02
4 62.3450 62.7850 62.3400 62.7300 820 61.75524 2019-01-07 VDNR.L 1.361339e-02 1.361340e-02
Desired output below:
VTI PUTW VEA VWO TLT VNQI GLD EMB UST FTAL
2019-01-02 124.6962 25.18981 35.72355 36.92347 118.6449 48.25209 121.33 97.70655 55.18464 45.76
2019-01-03 121.8065 25.05184 35.43429 36.34457 119.9950 48.32627 122.43 98.12026 56.01122 45.54
2019-01-04 125.8384 25.39677 36.52383 37.49271 118.6061 49.38329 121.44 98.86311 55.10592 46.63
2019-01-07 127.1075 25.57416 36.63954 37.56989 118.2564 49.67072 121.86 99.28625 54.81071 46.54
2019-01-08 128.4157 25.61358 36.89987 37.78215 117.9456 50.06015 121.53 99.21103 54.54502 47.05
2019-01-09 129.0210 25.56431 37.35305 38.33209 117.7610 50.39395 122.31 99.38966 54.56470 47.29
As I know from other languages, I could use a for loop, but I know there are faster ways in R.
Could someone hint me at the R way?
Improved version:
get.symbol.adjclose <- function(ticker) {
l.out <- BatchGetSymbols(symbols_RP, first.date = from_date, last.date = to_date, do.cache=TRUE, freq.data = "daily", do.complete.data = TRUE, do.fill.missing.prices = TRUE, be.quiet = FALSE)
return(as.data.frame(l.out$df.tickers[c("ticker","ref.date","price.open","price.high","price.low","price.close","volume","price.adjusted")]))
}
Using dplyr and tidyr. I'm selecting price.adjusted, but you can use any of the prices you need.
library(dplyr)
library(tidyr)
prices %>%
  select(ref.date, ticker, price.adjusted) %>% # select columns before pivot_wider
  pivot_wider(names_from = ticker, values_from = price.adjusted)
# A tibble: 352 x 7
ref.date GLRE.L IDTL.L IGLN.L VDEM.L VDNR.L VEUD.L
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-01-02 NA NA 25.2 51.0 60.6 30.2
2 2019-01-03 32.2 4.50 25.3 50.3 59.7 30.1
3 2019-01-04 32.6 4.47 25.2 51.7 60.9 30.9
4 2019-01-07 32.8 4.47 25.3 51.8 61.8 31.0
5 2019-01-08 32.8 4.44 25.2 51.9 62.0 31.3
6 2019-01-09 33.3 4.43 25.3 53.0 62.7 31.7
7 2019-01-10 33.5 4.41 25.3 53.2 62.7 31.7
8 2019-01-11 33.8 4.40 25.3 53.1 62.8 31.6
9 2019-01-14 33.8 4.41 25.3 52.7 62.7 31.4
10 2019-01-15 34.0 4.41 25.3 53.1 63.1 31.4
# ... with 342 more rows
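If you also want the dates as row names, as in the desired output, you can convert the wide tibble afterwards (a sketch; tibbles deliberately don't carry row names):
wide <- prices %>%
  select(ref.date, ticker, price.adjusted) %>%
  pivot_wider(names_from = ticker, values_from = price.adjusted)
wide_df <- as.data.frame(wide[-1]) # drop the date column from the body
rownames(wide_df) <- as.character(wide$ref.date) # dates become row names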
Note from BatchGetSymbols :
IEMB.L OUT: not enough data (thresh.bad.data = 75%)
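That message means IEMB.L was dropped because too many of its observations were missing. BatchGetSymbols exposes this cutoff as the thresh.bad.data argument named in the message, so you can relax it if you would rather keep sparse tickers (a sketch; 0.5 is an arbitrary choice):
l.out <- BatchGetSymbols(symbols_RP, first.date = from_date, last.date = to_date,
                         thresh.bad.data = 0.5) # keep tickers with >= 50% valid days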

Use time values for x-axis labels

I have some climate data with temperature and humidity as well as a timestamp which is transformed to the time in %H:%M.
When using ggplot2 for visualization, the time gets sorted, which scrambles the order of measurements, since the first measurement was taken at 14:00 (2pm) and the last one at 10:27 (10:27am) the following day.
How do I prevent ggplot2 from sorting the x-values? (see plot)
MVE:
library(tidyverse)
df = read_csv('./climate_stats_incl_time.csv')
colnames(df)[1] <- c('sample')
head(df)
tail(df)
ggplot(data=df, mapping=aes(x=time)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
> head(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 0 1581253210. 21.9 47.6 14:00
2 1 1581253275. 21.7 47.8 14:01
3 2 1581253336. 21.7 47.8 14:02
4 3 1581253397. 21.8 47.8 14:03
5 4 1581253457. 21.7 47.8 14:04
6 5 1581253520. 21.8 47.8 14:05
> tail(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 1203 1581326567. 19.1 49.8 10:22
2 1204 1581326628. 19.1 49.7 10:23
3 1205 1581326688. 19.1 49.9 10:24
4 1206 1581326749. 19.1 49.9 10:25
5 1207 1581326812. 19.1 49.7 10:26
6 1208 1581326873. 19.1 49.8 10:27
Format your timestamps to a proper date-time (assuming the origin is 1970):
df$date_time <- as.POSIXct(df$timestamp, origin="1970-01-01", tz = "GMT")
Then use this new date_time variable instead of time for plotting
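With the question's plot code, that would look like this (same aesthetics as the original call):
ggplot(data = df, mapping = aes(x = date_time)) +
  geom_line(aes(y = temperature, color = 'red')) +
  geom_line(aes(y = humidity, color = 'blue'))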
Edit:
I accidentally submitted a wrong solution (I re-formatted the date-time to a date). Now the solution should work for your problem (i.e. it makes a date-time!)
A workaround
df %>%
  mutate(orig_seq = seq(1, nrow(df), 1)) %>%
  ggplot(mapping = aes(x = reorder(time, orig_seq))) +
  geom_line(aes(y = temperature, color = 'red', group = 1)) + # group = 1 so the line connects across the discrete x-axis
  geom_line(aes(y = humidity, color = 'blue', group = 1))

How to iterate over column values in a dataframe, take the mean, and create a new dataframe?

I have a large dataframe in R and I want to plot the change in temperature over time. I've tried this before but since there is so much data the graph is really noisy and impossible to read.
I experimented with other plot types to try and get around this but they didn't really work. So I decided instead I will plot the mean temperature for each hour.
I've uploaded the data from a csv file and there are about 56k rows, an hour is about 720 rows give or take.
> head(wormData)
Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44
The column I am interested in is Temp.1 so what I want to do is take the mean of every 720 values in the Temp.1 column, then put each of those mean values into a new dataframe so I can plot a cleaner graph.
I thought of just doing it by hand but that would be about 50 data points and I have many more csv files to do, so any help on how I could do this would be appreciated. I've tried subsetting the data or making vectors with the mean values as well as writing some loops, but I'm struggling to tell R that I want the mean of every 720 rows.
Thanks so much :)
A kind of basic solution on top of matrix:
set.seed(123)
x<-sample(1:10,(720*5),replace=TRUE) # generate dummy data
> str(x)
int [1:3600] 3 8 5 9 10 1 6 9 6 5 ...
# Use wormData$Temp.1 instead of x for your actual data
z <- matrix(x, ncol = 720, byrow = TRUE) # each row holds 720 consecutive values
rowMeans(z) # take the mean of each row, i.e. of each block of 720
Output:
[1] 5.654167 5.375000 5.358333 5.477778 5.618056
If your dataset is not a multiple of 720, you'll get a warning and the last mean will be wrong (the vector is recycled to fill the last row).
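If your length is not a multiple of 720, one safe alternative is to build a grouping index and average each group (a sketch; the last group simply averages whatever values are left over):
grp <- ceiling(seq_along(x) / 720) # 1 for the first 720 values, 2 for the next 720, ...
tapply(x, grp, mean) # mean per block; the last block may be shorter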
Here is a solution with dplyr, assuming your row count is a multiple of 720. We create a grouping variable and then compute the mean by group.
library(dplyr)
n <- 2 # replace with n <- 720 with your actual data
mutate(d,group = rep(1:(nrow(d)/n), each=n)) %>%
group_by(group) %>%
summarize(mean=mean(Temp.1))
data
d <- read.table(text = " Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44", stringsAsFactors = FALSE, header = TRUE)
Here is a more complete answer using dplyr. This uses the actual dates and times you have so that you aren't approximating 720 values per hour.
library(tidyverse)
worm_data <- data_frame(time = c("0:18:44","0:18:49","2:18:54",
"0:18:59","0:19:05","2:19:10"),
date = c("2016-07-01","2016-07-01","2016-07-01",
"2016-07-02", "2016-07-02", "2016-07-02"),
temp_1 = c(25,27,290,30,20,2))
worm_data_test <- worm_data %>%
mutate(
date = paste(date, time),
date = as.POSIXct(date, tz="GMT", format="%Y-%m-%d %H:%M:%S")
) %>%
group_by(
datetime = as.POSIXct(cut(date, breaks='hour')) # creates a new variable
) %>%
summarize(
temp_1 = mean(temp_1, na.rm=T)
) %>%
ungroup()
In this case, you are grouping by the hour, then summarizing over those hours. I chose strange values and modified the dates and times to show that it works.
For more on datetime, I suggest: https://www.stat.berkeley.edu/~s133/dates.html
