I want to replace the NA values with random values. The data frame has columns like "DayOfWeek", and I don't know how to complete it. I tried the missForest function, but I think it only works on integer columns. Do you have any idea how I can complete all of the columns?
travel <- read.csv("https://openmv.net/file/travel-times.csv")
library(missForest)
summary(travel)
set.seed(82)
travel1 <- prodNA(travel, noNA = 0.2)
travel2 <- missForest(travel1)
You can use the imputeTS package to insert random values into your time series. The na_random function does this for numeric columns; the other columns are left untouched, which may be useful, since you probably do not need random text in the comments column.
You can call
library("imputeTS")
na_random(yourData)
and the function will look for the lowest and highest value of each column and insert random values between these bounds for you.
But you can also define your own bounds for the random values like this:
library("imputeTS")
na_random(yourData, lower_bound = 0, upper_bound = 25)
For your data this could look like this:
library("imputeTS")
# To read the input correctly and have the right data types
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
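# as.character() first, so a factor column is not turned into its level codes;
# the "-" entries become NA (with a coercion warning)
travel$FuelEconomy <- as.numeric(as.character(travel$FuelEconomy))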
# To perform the missing data replacement
travel <- na_random(travel)
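For readers without imputeTS, the behaviour described above (random values between each numeric column's lowest and highest value) can be approximated in base R. This is only a rough sketch, not imputeTS's implementation: `fill_random` is a hypothetical helper name, and drawing uniformly with `runif` is an assumption.

```r
# Hypothetical helper: fill NAs in numeric columns with uniform random
# values between the column's observed minimum and maximum.
fill_random <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x) && anyNA(x) && !all(is.na(x))) {
      is_na <- is.na(x)
      x[is_na] <- runif(sum(is_na),
                        min = min(x, na.rm = TRUE),
                        max = max(x, na.rm = TRUE))
      df[[col]] <- x
    }
  }
  df  # non-numeric columns are returned unchanged
}

set.seed(82)
toy <- data.frame(Distance = c(51.3, NA, 49.2, NA),
                  Comments = c("rain", NA, "snow", NA),
                  stringsAsFactors = FALSE)
filled <- fill_random(toy)
filled
```

The text column keeps its NAs, which matches the "left untouched" behaviour described in the answer.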
First, if you want to read "" strings as NAs, you need the additional argument na.strings = "" in read.csv. Second, do you mean replacing an NA observation of a variable with another, randomly chosen observation of the same variable? If so, consider the following procedure:
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
set.seed(82)
res <- data.frame(lapply(travel, function(x) {
is_na <- is.na(x)
replace(x, is_na, sample(x[!is_na], sum(is_na), replace = TRUE))
}))
res looks like this:
Date StartTime DayOfWeek GoingTo Distance MaxSpeed AvgSpeed AvgMovingSpeed FuelEconomy TotalTime MovingTime Take407All Comments
1 1/6/2012 16:37 Friday Home 51.29 127.4 78.3 84.8 8.5 39.3 36.3 No Medium amount of rain
2 1/6/2012 08:20 Friday GSK 51.63 130.3 81.8 88.9 8.5 37.9 34.9 No Put snow tires on
3 1/4/2012 16:17 Wednesday Home 51.27 127.4 82.0 85.8 8.5 37.5 35.9 No Heavy rain
4 1/4/2012 07:53 Wednesday GSK 49.17 132.3 74.2 82.9 8.31 39.8 35.6 No Accident blocked 407 exit
5 1/3/2012 18:57 Tuesday Home 51.15 136.2 83.4 88.1 9.08 36.8 34.8 No Rain, rain, rain
6 1/3/2012 07:57 Tuesday GSK 51.80 135.8 84.5 88.8 8.37 36.8 35.0 No Backed up at Bronte
7 1/2/2012 17:31 Monday Home 51.37 123.2 82.9 87.3 - 37.2 35.3 No Pumped tires up: check fuel economy improved?
8 1/2/2012 07:34 Monday GSK 49.01 128.3 77.5 85.9 - 37.9 34.3 No Pumped tires up: check fuel economy improved?
9 12/23/2011 08:01 Friday GSK 52.91 130.3 80.9 88.3 8.89 39.3 36.0 No Police slowdown on 403
10 12/22/2011 17:19 Thursday Home 51.17 122.3 70.6 78.1 8.89 43.5 39.3 No Start early to run a batch
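One caveat with the resampling approach above: when a numeric column has exactly one observed value, say 5, `sample(5, ...)` draws from 1:5 instead of repeating 5. A possible variant that samples positions rather than values is sketched below; `impute_by_resampling` and the toy data are hypothetical names made up for illustration.

```r
# Hypothetical helper: replace NAs with randomly chosen observed values
# from the same column, sampling positions to avoid the sample(n) pitfall.
impute_by_resampling <- function(x) {
  is_na <- is.na(x)
  observed <- x[!is_na]
  if (!any(is_na) || length(observed) == 0) return(x)
  idx <- sample.int(length(observed), sum(is_na), replace = TRUE)
  x[is_na] <- observed[idx]
  x
}

set.seed(82)
toy <- data.frame(DayOfWeek = c("Friday", NA, "Monday", NA),
                  MaxSpeed  = c(NA, 127.4, NA, NA),
                  stringsAsFactors = FALSE)
res_toy <- data.frame(lapply(toy, impute_by_resampling),
                      stringsAsFactors = FALSE)
res_toy
```

Because it only reuses observed values, this works for text columns such as DayOfWeek as well as numeric ones.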
I seem to have some trouble converting my data frame data into a time series. I have a typical data set consisting of date, export quantity, GDP, FDI etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature ©` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date column is recognized as such, and I've used the as.ts function to convert the dataset. However, after using as.ts, the date column is turned into a seemingly random number and is not a date anymore. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature © Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27
In short nothing is wrong - and while this response should really be a comment, I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it. And since you originally imported from Excel, those origins may not line up if you try to cross-check them (see below).
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
"2000-06-01",
"2000-07-01",
"2000-08-01",
"2000-09-01",
"2000-10-01",
"2000-11-01")),
maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the "origin" is January 1, 1970, and we can see this in numerical dates from the time series object and convert them back into date formats:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that Excel's origin is December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")). Applying that origin to the R-generated numbers above gives the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31"
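If the goal is a time series that still knows its dates, one option is to leave the date column out of the data and describe the time axis through the start and frequency arguments of ts() instead. The sketch below assumes regularly spaced monthly observations with no gaps; the data mirror the example above and `monthly_ts` is just an illustrative name.

```r
df <- data.frame(
  date     = seq(as.Date("2000-05-01"), by = "month", length.out = 7),
  maize    = c(21, 54, 132, 213, 123, 94, 192) * 1000,
  rainfall = c(30, 14, 11, 6, 38, 61, 93)
)

# Take year and month of the first observation for the start argument
first <- as.POSIXlt(df$date[1])

# Drop the date column; the time axis lives in start/frequency instead
monthly_ts <- ts(df[, c("maize", "rainfall")],
                 start = c(first$year + 1900, first$mon + 1),
                 frequency = 12)
monthly_ts
```

Printing or plotting `monthly_ts` then labels the observations by year and month rather than by a raw number.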
The function worked as it's supposed to. Keeping the date format you're familiar with isn't practical for computation, so R converts the dates to a numeric value, typically the number of days (or minutes or seconds) since a fixed origin, usually Jan. 1, 1970. For example, here is a little set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see that the dates, which are 30 days apart, are converted to a series of values that differ by 30.
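The numbers in the question's as.ts() output are much larger than the day counts shown here, which is consistent with the column having been a date-time (POSIXct, counted in seconds) rather than a Date (counted in days). A quick check, assuming the values are seconds since 1970-01-01 in UTC:

```r
# First three values of the Date column from the question's as.ts() output
secs <- c(957139200, 959817600, 962409600)

# Interpreted as seconds since the origin
as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Dividing by the number of seconds per day gives day counts instead
days <- secs / 86400
dates <- as.Date(days, origin = "1970-01-01")
dates
```

The first value divided by 86400 is 11078, the same day count that as.ts() produced for 2000-05-01 in the reproducible example earlier in this thread.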
I've endlessly looked for this and somehow nothing has solved this simple problem.
I have a dataframe called Prices in which there are 4 columns, one of which is a list of historical dates - the other 3 are lists of prices for products.
1 10/10/2016 53.14 50.366 51.87
2 07/10/2016 51.93 49.207 50.38
3 06/10/2016 52.51 49.655 50.98
4 05/10/2016 51.86 49.076 50.38
5 04/10/2016 50.87 48.186 49.3
6 03/10/2016 50.89 48.075 49.4
7 30/09/2016 50.19 47.384 48.82
8 29/09/2016 49.81 46.924 48.4
9 28/09/2016 49.24 46.062 47.65
10 27/09/2016 46.52 43.599 45.24
The list is 252 prices long. How can I have my output stored with the latest date at the bottom of the list and the corresponding prices listed with the latest prices at the bottom of the list?
Another tidyverse solution and I think the simplest one is:
df %>% map_df(rev)
or using just purrr::map_df we can do map_df(df, rev).
If you just want to reverse the order of the rows in a dataframe, you can do the following:
df<- df[seq(dim(df)[1],1),]
Just for completeness' sake: there is actually no need to call seq here. You can simply use R's colon operator:
### Create some sample data
n=252
sampledata<-data.frame(a=sample(letters,n,replace=TRUE),b=rnorm(n,1,0.7),
c=rnorm(n,1,0.6),d=runif(n))
### Compare some different ways to reorder the dataframe
myfun1<-function(df=sampledata){df<-df[seq(nrow(df),1),]}
myfun2<-function(df=sampledata){df<-df[seq(dim(df)[1],1),]}
myfun3<-function(df=sampledata){df<-df[dim(df)[1]:1,]}
myfun4<-function(df=sampledata){df<-df[nrow(df):1,]}
### Microbenchmark the functions
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 63.994 67.686 117.61797 71.3780 87.3765 5818.494 1000
myfun2() 63.173 67.686 99.29120 70.9680 87.7865 2299.258 1000
myfun3() 56.610 60.302 92.18913 62.7635 76.9155 3241.522 1000
myfun4() 56.610 60.302 99.52666 63.1740 77.5310 4440.582 1000
The fastest way in my trial here was df<-df[dim(df)[1]:1,]. However, using nrow instead of dim is only slightly slower, which makes this a question of personal preference.
Using seq here definitely slows the process down.
UPDATE September 2018:
From a speed view there is little reason to use dplyr here. For maybe 90% of users the basic R functionality should suffice. The other 10% need to use dplyr for querying a database or need code translation into another language.
## hmhensen's function
dplyr_fun<-function(df=sampledata){df %>% arrange(rev(rownames(.)))}
microbenchmark::microbenchmark(myfun3(),myfun4(),dplyr_fun(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun3() 55.8 69.75 132.8178 103.85 139.95 8949.3 1000
myfun4() 55.9 68.40 115.6418 100.05 135.00 2409.1 1000
dplyr_fun() 1364.8 1541.15 2173.0717 1786.10 2757.80 8434.8 1000
Yet another tidyverse solution is:
df %>% arrange(desc(row_number()))
Another option is to order the data frame by the column you want to sort by:
> data[order(data$Date), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-09-27 00:00:00 46.5 43.6 45.2
2 2016-09-28 00:00:00 49.2 46.1 47.6
3 2016-09-29 00:00:00 49.8 46.9 48.4
4 2016-09-30 00:00:00 50.2 47.4 48.8
5 2016-10-03 00:00:00 50.9 48.1 49.4
6 2016-10-04 00:00:00 50.9 48.2 49.3
7 2016-10-05 00:00:00 51.9 49.1 50.4
8 2016-10-06 00:00:00 52.5 49.7 51.0
9 2016-10-07 00:00:00 51.9 49.2 50.4
10 2016-10-10 00:00:00 53.1 50.4 51.9
Then, if you want to flip the order, reverse it:
> data[rev(order(data$Date)), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-10-10 00:00:00 53.1 50.4 51.9
2 2016-10-07 00:00:00 51.9 49.2 50.4
3 2016-10-06 00:00:00 52.5 49.7 51.0
4 2016-10-05 00:00:00 51.9 49.1 50.4
5 2016-10-04 00:00:00 50.9 48.2 49.3
6 2016-10-03 00:00:00 50.9 48.1 49.4
7 2016-09-30 00:00:00 50.2 47.4 48.8
8 2016-09-29 00:00:00 49.8 46.9 48.4
9 2016-09-28 00:00:00 49.2 46.1 47.6
10 2016-09-27 00:00:00 46.5 43.6 45.2
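The order() approach above relies on Date already being a date type. If the column is still text in dd/mm/yyyy form, as in the question, sorting it as text would place "03/10/2016" before "27/09/2016". A minimal sketch that parses the dates first; the column names and the four sample rows are assumptions based on the output above:

```r
Prices <- data.frame(
  Date   = c("10/10/2016", "07/10/2016", "06/10/2016", "30/09/2016"),
  priceA = c(53.14, 51.93, 52.51, 50.19),
  stringsAsFactors = FALSE
)

# Parse day/month/year text into Date objects
Prices$Date <- as.Date(Prices$Date, format = "%d/%m/%Y")

# Oldest first, so the latest date ends up at the bottom
sorted <- Prices[order(Prices$Date), ]
sorted
```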
If you wanted to do this in base R use:
df <- df[rev(seq_len(nrow(df))), , drop = FALSE]
All other base R solutions posted here will have problems in the edge cases of zero row data frames (seq(0,1) == c(0, 1), that's why we use seq_len) or single column data frames (data.frame(a=7:9)[3:1,] == 9:7, that's why we use , drop = FALSE).
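The two edge cases mentioned above can be checked directly. In this small sketch, `reverse_rows` is a hypothetical wrapper name around the expression from this answer:

```r
# Hypothetical wrapper around the seq_len / drop = FALSE idiom
reverse_rows <- function(df) df[rev(seq_len(nrow(df))), , drop = FALSE]

empty  <- data.frame(a = integer(0))
single <- data.frame(a = 7:9)

# Zero-row input: seq_len(0) is empty, so the result stays empty
nrow(reverse_rows(empty))

# With nrow(df):1 the index becomes 0:1, which asks for a non-existent row
nrow(empty[nrow(empty):1, , drop = FALSE])

# Single-column input: drop = FALSE keeps the data frame
class(reverse_rows(single))

# Without drop = FALSE the result collapses to a plain vector
class(single[nrow(single):1, ])
```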
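If you want to stick with base R, you could also use lapply(). Note that cbind() returns a matrix here, so columns of different types are coerced to a common type: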
do.call(cbind, lapply(df, rev))
I have a large dataframe in R and I want to plot the change in temperature over time. I've tried this before but since there is so much data the graph is really noisy and impossible to read.
I experimented with other plot types to try and get around this but they didn't really work. So I decided instead I will plot the mean temperature for each hour.
I've uploaded the data from a csv file and there are about 56k rows, an hour is about 720 rows give or take.
> head(wormData)
Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44
The column I am interested in is Temp.1 so what I want to do is take the mean of every 720 values in the Temp.1 column, then put each of those mean values into a new dataframe so I can plot a cleaner graph.
I thought of just doing it by hand but that would be about 50 data points and I have many more csv files to do, so any help on how I could do this would be appreciated. I've tried subsetting the data or making vectors with the mean values as well as writing some loops, but I'm struggling to tell R that I want the mean of every 720 rows.
Thanks so much :)
A basic solution using a matrix:
set.seed(123)
x<-sample(1:10,(720*5),replace=TRUE) # generate dummy data
> str(x)
int [1:3600] 3 8 5 9 10 1 6 9 6 5 ...
# Use wormData$Temp.1 instead of x for your actual data
z<-matrix(x,nrow=720) # filled column-wise, so each column holds 720 consecutive values
colMeans(z) # mean of each column, i.e. of each block of 720 values
Output: one mean per block of 720 values (five for the dummy data above).
If the length of your dataset is not a multiple of 720, you'll get a warning and the last mean will be wrong, because the vector is recycled to fill the matrix.
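When the number of rows is not a multiple of the block size, a grouping index avoids the recycling problem: the last block is simply shorter. A minimal base R sketch, where `block_means` is a hypothetical helper name:

```r
# Hypothetical helper: mean of every n consecutive values of x
block_means <- function(x, n) {
  block <- ceiling(seq_along(x) / n)  # 1,1,...,2,2,...: one id per block of n
  as.vector(tapply(x, block, mean))
}

# Ten values in blocks of four: the last block has only two values
block_means(1:10, 4)
```

With the real data this would be called as `block_means(wormData$Temp.1, 720)`.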
Here is a solution with dplyr, assuming your row number is a multiple of 720. We create a grouping variable and then compute the mean by group.
library(dplyr)
n <- 2 # replace with n <- 720 with your actual data
mutate(d,group = rep(1:(nrow(d)/n), each=n)) %>%
group_by(group) %>%
summarize(mean=mean(Temp.1))
data
d <- read.table(text = " Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44",stringsAsFactors=FALSE,header=TRUE)
Here is a more complete answer using dplyr. This uses the actual dates and times you have so that you aren't approximating 720 values per hour.
library(tidyverse)
worm_data <- tibble(time = c("0:18:44","0:18:49","2:18:54", # tibble() replaces the deprecated data_frame()
"0:18:59","0:19:05","2:19:10"),
date = c("2016-07-01","2016-07-01","2016-07-01",
"2016-07-02", "2016-07-02", "2016-07-02"),
temp_1 = c(25,27,290,30,20,2))
worm_data_test <- worm_data %>%
mutate(
date = paste(date, time),
date = as.POSIXct(date, tz="GMT", format="%Y-%m-%d %H:%M:%S")
) %>%
group_by(
datetime = as.POSIXct(cut(date, breaks='hour')) # creates a new variable
) %>%
summarize(
temp_1 = mean(temp_1, na.rm=TRUE)
) %>%
ungroup()
In this case, you are grouping by the hour, then summarizing over those hours. I chose strange values and modified the dates and times to show that it works.
For more on datetime, I suggest: https://www.stat.berkeley.edu/~s133/dates.html
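The same hourly grouping can also be sketched in base R, for anyone who prefers not to load the tidyverse. The toy values below are made up for illustration, and the date format is assumed to be year-month-day as in the answer above:

```r
worm <- data.frame(
  time   = c("0:18:44", "0:18:49", "2:18:54"),
  date   = c("2016-07-01", "2016-07-01", "2016-07-01"),
  temp_1 = c(25, 27, 29),
  stringsAsFactors = FALSE
)

# Combine date and time into a single date-time value
worm$datetime <- as.POSIXct(paste(worm$date, worm$time),
                            tz = "GMT", format = "%Y-%m-%d %H:%M:%S")

# Truncate each timestamp to its hour and average within each hour
worm$hour <- format(worm$datetime, "%Y-%m-%d %H:00")
hourly <- aggregate(temp_1 ~ hour, data = worm, FUN = mean)
hourly
```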
This is my first question on this forum.
I would like to re-model the structure of my dataset.
I would like to split the column "Teams" into two columns. One with the hometeam and another with the awayteam.
I would also like to split the result into two columns, Homegoals and Awaygoals. The new columns should not have a zero in front of the "real" goals scored.
BEFORE
Date Time Teams Results Homewin Draw Awaywin
18 May 19:45 AC Milan - Sassuolo 02:01 1.26 6.22 10.47
18 May 19:45 Chievo - Inter 02:01 3.73 3.42 2.05
18 May 19:45 Fiorentina - Torino 02:02 2.84 3.58 2.39
AFTER
Date Time Hometeam Awayteam Homegoals Awaygoals Homewin Draw Awaywin
18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39
Can R fix this problem for me? Which packages do I need?
I want to be able to do this for many excel spreadsheets with different leagues and divisions but all with the same structure.
Can someone help me and my data.frame?
tidyr solution:
separate(your.data.frame, Teams, c('Home', 'Away'), sep = " - ")
Base R solution (following this answer):
df <- data.frame(do.call(rbind, strsplit(as.character(your.df$Teams), " - ")))
names(df) <- c("Home", "Away")
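The two snippets above split only the Teams column. The Results column can be handled the same way in base R, with as.integer() removing the leading zero. This sketch assumes Results is stored as text such as "02:01"; the column names follow the AFTER table in the question.

```r
matches <- data.frame(
  Teams   = c("AC Milan - Sassuolo", "Chievo - Inter", "Fiorentina - Torino"),
  Results = c("02:01", "02:01", "02:02"),
  stringsAsFactors = FALSE
)

# Split each column on its separator and stack the pieces into matrices
teams <- do.call(rbind, strsplit(matches$Teams, " - ", fixed = TRUE))
goals <- do.call(rbind, strsplit(matches$Results, ":", fixed = TRUE))

out <- data.frame(
  Hometeam  = teams[, 1],
  Awayteam  = teams[, 2],
  Homegoals = as.integer(goals[, 1]),  # "02" becomes 2
  Awaygoals = as.integer(goals[, 2]),
  stringsAsFactors = FALSE
)
out
```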
Here's an approach that uses cSplit from the splitstackshape package, which uses and returns a data.table. Presuming your original data frame is named df,
library(splitstackshape)
setnames(
cSplit(df, 3:4, c(" - ", ":"))[, c(1:2, 6:9, 3:5), with = FALSE],
3:6,
paste0(c("Home", "Away"), rep(c("Team", "Goals"), each = 2))
)[]
# Date Time HomeTeam AwayTeam HomeGoals AwayGoals Homewin Draw Awaywin
# 1: 18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
# 2: 18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
# 3: 18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39
I am stuck on why this is happening and have searched everywhere for the answer. When I try to plot a time series object in R, the resulting plot comes out in reverse.
I have the following code:
library(sqldf)
stock_prices <- read.csv('~/stockPrediction/input/REN.csv')
colnames(stock_prices) <- tolower(colnames(stock_prices))
colnames(stock_prices)[7] <- 'adjusted_close'
stock_prices <- sqldf('SELECT date, adjusted_close FROM stock_prices')
head(stock_prices)
date adjusted_close
1 2014-10-20 3.65
2 2014-10-17 3.75
3 2014-10-16 4.38
4 2014-10-15 3.86
5 2014-10-14 3.73
6 2014-10-13 4.09
tail(stock_prices)
date adjusted_close
1767 2007-10-15 8.99
1768 2007-10-12 9.01
1769 2007-10-11 9.02
1770 2007-10-10 9.06
1771 2007-10-09 9.06
1772 2007-10-08 9.08
But when I try the following code:
stock_prices_ts <- ts(stock_prices$adjusted_close, start=c(2007, 1), end=c(2014, 10), frequency=12)
plot(stock_prices_ts, col='blue', lwd=2, type='l')
The resulting image is:
And even if I reverse the time series object with this code:
plot(rev(stock_prices_ts), col='blue', lwd=2, type='l')
I get this
which has arbitrary numbers.
Any idea why this is happening? Any help is much appreciated.
This happens because your object loses its time series structure once you apply the rev function.
For example :
set.seed(1)
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
start = c(1954, 7), frequency = 12)
gnp ## gnp has a real time series structure
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1954 0.37 1.55 1.71 4.31 5.64 5.82
1955 7.31 9.05 10.63 11.32 13.83 15.22 15.60 14.39 16.51 17.47 18.45 20.39
1956 22.21 23.80 25.72 27.50 28.57 27.58 29.20 30.14 30.98 30.51 31.03 32.45
1957
rev(gnp) ## the reversal is just a vector
[1] 110.91 110.38 110.60 110.17 110.45 108.89 106.30 104.60 102.44 ....
In general, it is a little painful to manipulate the ts class. One idea is to use an xts object, which generally preserves its structure when you apply common operations to it.
Even though the generic rev method is not implemented for xts objects in this case, it is easy to coerce the resulting zoo time series to an xts one using as.xts.
par(mfrow=c(2,2))
plot(gnp,col='red',main='gnp')
plot(rev(gnp),type='l',col='red',main='rev(gnp)')
library(xts)
xts_gnp <- as.xts(gnp)
plot(xts_gnp)
## note here that I apply as.xts again after rev operation
## otherwise i lose xts structure
rev_xts_gnp = as.xts(rev(as.xts(gnp)))
plot(rev_xts_gnp)
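A different route, not taken in the answer above, is to skip the reversal of a ts object entirely: the data frame in the question is sorted newest-first, while ts() treats the first value as the earliest observation, which would likely explain the mirrored plot. Sorting the rows by date and plotting against the real dates avoids both problems. A minimal base R sketch using the first few rows shown in the question:

```r
stock_prices <- data.frame(
  date           = c("2014-10-20", "2014-10-17", "2014-10-16", "2014-10-15"),
  adjusted_close = c(3.65, 3.75, 4.38, 3.86),
  stringsAsFactors = FALSE
)

# Parse the dates and sort oldest-first
stock_prices$date <- as.Date(stock_prices$date)
stock_prices <- stock_prices[order(stock_prices$date), ]

# Plot against the actual dates so the x axis is labelled correctly
plot(stock_prices$date, stock_prices$adjusted_close,
     type = "l", col = "blue", lwd = 2,
     xlab = "Date", ylab = "Adjusted close")
```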