Rating of locations based on a parameter - r

I am looking for a method that uses the traffic count to give each location a rating between 0 and 10. I could simply assign ranks 1 to 30, with rank 1 going to the location with the most traffic, but I need something that scales the values between 0 and 10. Any suggestions?
> dput(Data)
structure(list(Traffic.Views = c(3175760L, 2949940L, 2685756L,
2535156L, 2437236L, 2210328L, 2085276L, 1974840L, 1961424L, 1923308L,
1844408L, 1781592L, 1761252L, 1675820L, 1582748L, 1475928L, 1399336L,
1311940L, 1309980L, 1305544L, 1160140L, 1144348L, 1137584L, 1106904L,
946304L, 931992L, 891176L, 815812L, 789788L, 662652L), Names = c("df01",
"df02", "df03", "df04", "df05", "df06", "df07", "df08", "df09",
"df10", "df11", "df12", "df13", "df14", "df15", "df16", "df17",
"df18", "df19", "df20", "df21", "df22", "df23", "df24", "df25",
"df26", "df27", "df28", "df29", "df30")), row.names = c(NA, -30L
), class = "data.frame")

You can use min-max scaling, which maps the smallest count to 0 and the largest to 10:
rating_0_10 <- function(x) (x-min(x))/(max(x)-min(x)) * 10
Data$scaled <- rating_0_10(Data$Traffic.Views)
Data
# Traffic.Views Names scaled
#1 3175760 df01 10.0000000
#2 2949940 df02 9.1014314
#3 2685756 df03 8.0502072
#4 2535156 df04 7.4509492
#5 2437236 df05 7.0613121
#6 2210328 df06 6.1584142
#7 2085276 df07 5.6608152
#8 1974840 df08 5.2213753
#9 1961424 df09 5.1679912
#10 1923308 df10 5.0163224
#11 1844408 df11 4.7023685
#12 1781592 df12 4.4524151
#13 1761252 df13 4.3714795
#14 1675820 df14 4.0315339
#15 1582748 df15 3.6611877
#16 1475928 df16 3.2361363
#17 1399336 df17 2.9313663
#18 1311940 df18 2.5836056
#19 1309980 df19 2.5758065
#20 1305544 df20 2.5581551
#21 1160140 df21 1.9795727
#22 1144348 df22 1.9167342
#23 1137584 df23 1.8898193
#24 1106904 df24 1.7677394
#25 946304 df25 1.1286901
#26 931992 df26 1.0717406
#27 891176 df27 0.9093282
#28 815812 df28 0.6094446
#29 789788 df29 0.5058915
#30 662652 df30 0.0000000
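If you would rather base the rating on rank order than on the raw counts (the question mentions ranks 1 to 30), the same min-max idea can be applied to the ranks instead. This is only a sketch; `rating_by_rank` is a made-up helper name:

```r
# Rank-based 0-10 rating: the busiest location gets 10, the quietest gets 0.
# Unlike min-max scaling of the raw counts, the spacing between ratings is uniform.
rating_by_rank <- function(x) (rank(x) - 1) / (length(x) - 1) * 10

rating_by_rank(c(100, 400, 200, 300))
# 0.000000 10.000000  3.333333  6.666667
```

Ties share a rating because `rank()` averages tied values by default.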


Using R, how can I correctly subtract a second from a POSIXct variable using dplyr's lead and lag functions?

I have 12 dates and times in a table in POSIXct format:
Time
1 2017-03-11 01:10:09
2 2017-03-11 03:07:58
3 2017-03-12 19:16:47
4 2017-03-13 09:52:04
5 2017-03-17 20:36:35
6 2017-03-18 03:10:54
7 2017-03-18 07:29:31
8 2017-03-18 10:13:37
9 2017-03-20 10:19:31
10 2017-03-20 12:11:39
11 2017-03-20 12:11:39
12 2017-03-20 14:16:12
If an entry matches the following entry, I want to remove one second from the time. For example, row 10 should appear as "2017-03-20 12:11:38".
I'm trying to do something like the following, but it doesn't work:
df %>% mutate(Time = ifelse(Time == lead(Time), Time-1, Time))
Instead of ifelse we can use case_when, which also lets us express multiple conditions without nesting ifelse or if_else calls:
library(dplyr)
library(lubridate)
df %>%
  mutate(Time = as.POSIXct(Time),
         Time = case_when(Time == lead(Time) ~ Time - seconds(1),
                          TRUE ~ Time))
Output:
# Time
#1 2017-03-11 01:10:09
#2 2017-03-11 03:07:58
#3 2017-03-12 19:16:47
#4 2017-03-13 09:52:04
#5 2017-03-17 20:36:35
#6 2017-03-18 03:10:54
#7 2017-03-18 07:29:31
#8 2017-03-18 10:13:37
#9 2017-03-20 10:19:31
#10 2017-03-20 12:11:38
#11 2017-03-20 12:11:39
#12 2017-03-20 14:16:12
Or using base R, we create a logical index and assign
i1 <- c(df$Time[-1] == df$Time[-nrow(df)], FALSE)
df$Time <- as.POSIXct(df$Time)
df$Time[i1] <- df$Time[i1] - 1
data
df <- structure(list(Time = c("2017-03-11 01:10:09", "2017-03-11 03:07:58",
"2017-03-12 19:16:47", "2017-03-13 09:52:04", "2017-03-17 20:36:35",
"2017-03-18 03:10:54", "2017-03-18 07:29:31", "2017-03-18 10:13:37",
"2017-03-20 10:19:31", "2017-03-20 12:11:39", "2017-03-20 12:11:39",
"2017-03-20 14:16:12")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Your approach is correct; however, base::ifelse strips the attributes, turning the POSIXct column into a numeric one. To avoid that, since you are already using dplyr, you can use dplyr::if_else, which preserves the class.
Also note that lead generates NA for the last value of Time, so the comparison Time == lead(Time) is NA there. We can set the missing argument of if_else to Time to handle that.
library(dplyr)
df %>% mutate(Time = if_else(Time == lead(Time), Time-1, Time, missing = Time))
# Time
#1 2017-03-11 01:10:09
#2 2017-03-11 03:07:58
#3 2017-03-12 19:16:47
#4 2017-03-13 09:52:04
#5 2017-03-17 20:36:35
#6 2017-03-18 03:10:54
#7 2017-03-18 07:29:31
#8 2017-03-18 10:13:37
#9 2017-03-20 10:19:31
#10 2017-03-20 12:11:38
#11 2017-03-20 12:11:39
#12 2017-03-20 14:16:12
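The class-stripping behaviour described above is easy to see in isolation, without the question's data (a minimal base-R check; as noted, dplyr::if_else would keep the POSIXct class here):

```r
x <- as.POSIXct("2017-03-20 12:11:39", tz = "UTC")

# base::ifelse builds its result from the test vector, so the POSIXct
# attributes are dropped and only the underlying number comes back
ifelse(TRUE, x, x)
# 1490011899   (seconds since the epoch, no longer a date-time)

class(ifelse(TRUE, x, x))
# "numeric"
```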

Calculate the difference in time in days for character dates, with blank spaces in one of the columns

I have the following data. Both columns are dates, and I have to take the difference in days. However, most of the values in one of the date columns are blank, so I have to return NA for those rows.
a b
02-07-2012
18-08-2012
13-08-2012
16-04-2012
26-04-2012
03-05-2012 12-05-2012
09-06-2012
30-05-2012
22-06-2012
05-07-2012
30-06-2012
09-05-2012
22-06-2012
02-07-2012
17-07-2012
17-08-2012
16-07-2012
01-08-2012
05-08-2012
17-08-2012
30-04-2012
05-07-2012
07-04-2012
27-04-2012
21-06-2012
03-07-2012
21-07-2012
24-04-2012
05-06-2012
03-07-2012
02-04-2012 01-06-2012
06-04-2012
15-04-2012
16-06-2012
01-08-2012
13-05-2012
09-07-2012
09-07-2012
18-04-2012
09-08-2012
10-04-2012
12-05-2012
04-04-2012
04-06-2012 04-06-2012
15-06-2012
02-07-2012
05-07-2012
21-08-2012
19-07-2012
06-08-2012
15-06-2012
06-04-2012
04-06-2012
23-07-2012
06-04-2012
12-04-2012 11-06-2012
24-05-2012
03-08-2012
04-05-2012 09-05-2012
07-05-2012
07-06-2012
06-07-2012
13-07-2012
26-07-2012
26-04-2012
22-06-2012
26-07-2012
12-04-2012
07-08-2012
27-06-2012
03-04-2012 02-06-2012
13-04-2012
28-07-2012
07-05-2012
29-06-2012
03-04-2012 02-06-2012
04-04-2012
04-04-2012 24-05-2012
04-04-2012
05-04-2012
07-04-2012
10-04-2012
11-04-2012
13-04-2012
13-04-2012
13-04-2012
13-04-2012
14-04-2012
14-04-2012
14-04-2012
18-04-2012
19-04-2012
21-04-2012
25-04-2012
25-04-2012
26-04-2012
26-04-2012
26-04-2012
27-04-2012
30-04-2012
04-06-2012
04-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012 16-07-2012
06-06-2012 29-06-2012
I tried the following but couldn't succeed:
date_strings[date_strings==""]<-NA # Replaced blank spaces with NA & removed them
head(date_strings)
newdata<-na.omit(date_strings)
str(newdata)
newdata$a<-as.Date(newdata$a,"%m%d%y")
newdata$b<-as.Date(newdata$b,"%m%d%y")
diff_in_days = difftime(newdata$a, newdata$b, units = "days") # days
Convert the columns to Date class, which turns the blanks into NA automatically (note that dates like "02-07-2012" need format '%d-%m-%Y', not '%m%d%y'), then subtract the days using difftime.
date_strings[] <- lapply(date_strings, as.Date, format = '%d-%m-%Y')
date_strings$diff_in_days = difftime(date_strings$b, date_strings$a,
                                     units = "days")
date_strings
# a b diff_in_days
#1 2012-07-02 <NA> NA
#2 2012-08-18 <NA> NA
#3 2012-08-13 <NA> NA
#4 2012-04-16 <NA> NA
#5 2012-04-26 <NA> NA
#6 2012-05-03 2012-05-12 9
Or directly subtract
date_strings$diff_in_days = date_strings$b - date_strings$a
data
date_strings <- structure(list(a = c("02-07-2012", "18-08-2012", "13-08-2012",
"16-04-2012", "26-04-2012", "03-05-2012"), b = c("", "", "",
"", "", "12-05-2012")), class = "data.frame", row.names = c(NA, -6L))
With tidyverse, we can do
library(dplyr)
library(lubridate)
date_strings %>%
  mutate(across(everything(), dmy)) %>%
  mutate(diff_in_days = b - a)
# a b diff_in_days
#1 2012-07-02 <NA> NA days
#2 2012-08-18 <NA> NA days
#3 2012-08-13 <NA> NA days
#4 2012-04-16 <NA> NA days
#5 2012-04-26 <NA> NA days
#6 2012-05-03 2012-05-12 9 days

Heatwave calculation based on maximum temperature in R

A heatwave occurs when the maximum temperature at a meteorological station is 3 °C or more above the normal temperature for 3 or more consecutive days. I have calculated the daily average (the daily normal) from multiple years of daily maximum temperature data like this:
df <- data.frame("date"= seq(from = as.Date("1970-1-1"), to = as.Date("2000-12-31"), by = "day"),
"MaxT" = runif(length(seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days")), 20, 40))
df$day <- format(df$date, format='%m-%d')
daily_mean <- aggregate(MaxT ~ day, data=df, FUN=mean)
Now this has to be matched against each year's daily maximum temperatures to identify the dates when the maximum temperature is 3 °C or more above the daily normal for 3 or more consecutive days. Those events will be considered heatwaves. How can I implement this in R?
Here is a way using dplyr and zoo's rollapplyr.
library(dplyr)
df_out <- df %>%
  left_join(daily_mean %>% rename(mean_temp = MaxT), by = 'day') %>%
  mutate(is_heatwave = zoo::rollapplyr(MaxT > (mean_temp + 3),
                                       3, all, fill = NA))
Some instances of heatwave:
df_out[31:50, ]
# date MaxT day mean_temp is_heatwave
#31 1970-01-31 26.31675 01-31 28.31451 FALSE
#32 1970-02-01 22.05946 02-01 29.83059 FALSE
#33 1970-02-02 34.22469 02-02 29.84562 FALSE
#34 1970-02-03 33.03264 02-03 29.87919 FALSE
#35 1970-02-04 36.62357 02-04 31.50603 TRUE
#36 1970-02-05 29.82134 02-05 30.22581 FALSE
#37 1970-02-06 28.13625 02-06 29.64073 FALSE
#38 1970-02-07 29.95754 02-07 29.54277 FALSE
#39 1970-02-08 21.40026 02-08 30.96619 FALSE
#40 1970-02-09 33.10983 02-09 28.16146 FALSE
#41 1970-02-10 30.87346 02-10 29.37693 FALSE
#42 1970-02-11 31.08721 02-11 28.89930 FALSE
#43 1970-02-12 27.34925 02-12 29.27882 FALSE
#44 1970-02-13 31.88582 02-13 29.35825 FALSE
#45 1970-02-14 30.05155 02-14 28.24995 FALSE
#46 1970-02-15 35.07049 02-15 29.02716 FALSE
#47 1970-02-16 39.49029 02-16 32.75644 FALSE
#48 1970-02-17 37.41917 02-17 31.44022 TRUE
#49 1970-02-18 36.03564 02-18 29.56212 TRUE
#50 1970-02-19 36.48052 02-19 30.18766 TRUE
TRUE values mark where a heatwave was present. As we can see, rows 33, 34 and 35 form 3 consecutive days on which MaxT exceeded mean_temp by more than 3 degrees. Similarly, we can verify the other days.
To get yearly heatwave occurrences, we can do:
df_year <- df_out %>%
  group_by(year = format(date, "%Y")) %>%
  summarise(total_heat = with(rle(is_heatwave),
                              sum(values, na.rm = TRUE)))
and sum(df_year$total_heat) would give the overall count.
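The rle() step counts each unbroken run of TRUE as a single heatwave event rather than counting heatwave days. A small hand-made example:

```r
# Two separate events: days 2-4 and days 6-7
hw <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

r <- rle(hw)
r$values    # FALSE  TRUE FALSE  TRUE FALSE
r$lengths   # 1 3 1 2 1

# Summing the logical run values counts the TRUE runs: 2 events, not 5 days
sum(r$values)
# 2
```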

Merging data frames with different sample frequency

I have two data frames: one with hourly data (dfa) and one with a single measurement per day (dfb), taken at the same hour each day (15:29:05). See below for an example with 2 days.
I want to merge these data frames so that I keep all the hourly data, with the daily values aligned to the correct hour and NAs filling the hours of the day that have no daily measurement.
Simply applying merge cuts the result down to the daily rows, so I lose all the hourly information:
dfc <- merge(dfa, dfb, by = "datetime")
Any help would be appreciated.
e.g. for two days:
#hourly
dfa <- structure(list(datetime = structure(c(1466231345, 1466234945,
1466238545, 1466242145, 1466245745, 1466249345, 1466252945, 1466256545,
1466260145, 1466263745, 1466267345, 1466270945, 1466274545, 1466278145,
1466281745, 1466285345, 1466288945, 1466292545, 1466296145, 1466299745,
1466303345, 1466306945, 1466310545, 1466314145, 1466317745, 1466321345,
1466324945, 1466328545, 1466332145, 1466335745, 1466339345, 1466342945,
1466346545, 1466350145, 1466353745, 1466357345, 1466360945, 1466364545,
1466368145, 1466371745, 1466375345, 1466378945, 1466382545, 1466386145,
1466389745, 1466393345, 1466396945, 1466400545), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), DFQ1 = c(0.408025, 0.4355833335,
0.68485, 0.650875, 0.5307833335, 0.509775, 0.5273135595, 0.5763083335,
0.4954, 0.444308333, 0.4048083335, 0.419475, 0.35105, 0.2740416665,
0.3038666665, 0.351774317, 0.306025, 0.3183916665, 0.249175,
0.268133333, 0.3285083335, 0.2807666665, 0.351633333, 0.374516667,
0.3763, 0.3806583335, 0.366675, 0.411133333, 0.433291667, 0.408225,
0.3812, 0.380358333, 0.3557166665, 0.3701, 0.400788842, 0.396833333,
0.362991667, 0.3790083335, 0.3631666665, 0.367041667, 0.3899583335,
0.360658333, 0.359675, 0.356358333, 0.3864083335, 0.3965083335,
0.3901166665, 0.403976695)), class = "data.frame", row.names = c(NA,
-48L))
#daily
dfb <- structure(list(datetime = structure(c(1466263745, 1466350145),
    class = c("POSIXct", "POSIXt"), tzone = "UTC"), Tchl = c(0.1265,
    0.1503), TCSE = structure(c(12L,
9L), .Label = c("", "#DIV/0!", "0.000", "0.001", "0.002", "0.003",
"0.004", "0.005", "0.007", "0.008", "0.009", "0.010", "0.011",
"0.012", "0.013", "0.015", "0.021", "0.026", "0.027", "CB2016",
"Std error"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
You can use merge with all.x = TRUE (a left join), which keeps every row of dfa:
dfc <- merge(dfa, dfb, by = "datetime", all.x = TRUE)
# datetime DFQ1 Tchl TCSE
# 1 2016-06-18 06:29:05 0.4080250 NA <NA>
# 2 2016-06-18 07:29:05 0.4355833 NA <NA>
# 3 2016-06-18 08:29:05 0.6848500 NA <NA>
# 4 2016-06-18 09:29:05 0.6508750 NA <NA>
# 5 2016-06-18 10:29:05 0.5307833 NA <NA>
# 6 2016-06-18 11:29:05 0.5097750 NA <NA>
# 7 2016-06-18 12:29:05 0.5273136 NA <NA>
# 8 2016-06-18 13:29:05 0.5763083 NA <NA>
# 9 2016-06-18 14:29:05 0.4954000 NA <NA>
# 10 2016-06-18 15:29:05 0.4443083 0.1265 0.010
# ...
Or a tidyverse solution:
library(tidyverse)
dfc <- left_join(dfa, dfb, by="datetime")
#> head(dfc,10)
# datetime DFQ1 Tchl TCSE
#1 2016-06-18 06:29:05 0.4080250 NA <NA>
#2 2016-06-18 07:29:05 0.4355833 NA <NA>
#3 2016-06-18 08:29:05 0.6848500 NA <NA>
#4 2016-06-18 09:29:05 0.6508750 NA <NA>
#5 2016-06-18 10:29:05 0.5307833 NA <NA>
#6 2016-06-18 11:29:05 0.5097750 NA <NA>
#7 2016-06-18 12:29:05 0.5273136 NA <NA>
#8 2016-06-18 13:29:05 0.5763083 NA <NA>
#9 2016-06-18 14:29:05 0.4954000 NA <NA>
#10 2016-06-18 15:29:05 0.4443083 0.1265 0.010

Replace Loop with vectorised operation

I am using this code to create candlesticks in plotly. However, it contains a loop which is very inefficient (38 seconds to loop through 10K observations). It also uses the rbind function, which means the date has to be converted to numeric and back again, which isn't straightforward given that it's a date with time.
The loop I'm trying to replace with a more efficient function is:
for (i in 1:nrow(prices)) {
  x <- prices[i, ]
  # For high / low
  mat <- rbind(c(x[1], x[3]),
               c(x[1], x[4]),
               c(NA, NA))
  plot.base <- rbind(plot.base, mat)
}
For each input row, the output contains one row pairing the date (column 1) with the high (column 3), a second row pairing the date with the low (column 4), and a third row of two NAs. The NAs are important later on for the plotting.
What is the most efficient way to achieve this?
Minimal reproducible example:
library(quantmod)
prices <- getSymbols("MSFT", auto.assign = F)
# Convert to dataframe
prices <- data.frame(time = index(prices),
                     open = as.numeric(prices[, 1]),
                     high = as.numeric(prices[, 2]),
                     low = as.numeric(prices[, 3]),
                     close = as.numeric(prices[, 4]),
                     volume = as.numeric(prices[, 5]))
# Create line segments for high and low prices
plot.base <- data.frame()
for (i in 1:nrow(prices)) {
  x <- prices[i, ]
  # For high / low
  mat <- rbind(c(x[1], x[3]),
               c(x[1], x[4]),
               c(NA, NA))
  plot.base <- rbind(plot.base, mat)
}
Edit:
dput(head(prices))
structure(list(time = structure(c(13516, 13517, 13518, 13521,
13522, 13523), class = "Date"), open = c(29.91, 29.700001, 29.629999,
29.65, 30, 29.799999), high = c(30.25, 29.969999, 29.75, 30.1,
30.18, 29.889999), low = c(29.4, 29.440001, 29.450001, 29.530001,
29.73, 29.43), close = c(29.860001, 29.809999, 29.639999, 29.93,
29.959999, 29.66), volume = c(76935100, 45774500, 44607200, 50220200,
44636600, 55017400)), .Names = c("time", "open", "high", "low",
"close", "volume"), row.names = c(NA, 6L), class = "data.frame")
I would be wary of a tutorial that grows an object in a loop. That's one of the slowest operations you can do in programming. (It's like buying a shelf that has exactly the room needed for your books and then replacing the shelf every time you buy a new book.)
Use subsetting like this:
res <- data.frame(date = rep(prices[, 1], each = 3),
                  y = c(t(prices[, 3:4])[c(1:2, NA), ])) # transpose, index with NA to insert an NA row, flatten
res[c(FALSE, FALSE, TRUE), 1] <- NA
# date y
#1 2007-01-03 30.25
#2 2007-01-03 29.40
#3 <NA> NA
#4 2007-01-04 29.97
#5 2007-01-04 29.44
#6 <NA> NA
#7 2007-01-05 29.75
#8 2007-01-05 29.45
#9 <NA> NA
#10 2007-01-08 30.10
#11 2007-01-08 29.53
#12 <NA> NA
#13 2007-01-09 30.18
#14 2007-01-09 29.73
#15 <NA> NA
#16 2007-01-10 29.89
#17 2007-01-10 29.43
#18 <NA> NA
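To make the "growing shelf" point concrete, here is a sketch comparing the row-by-row rbind pattern with a vectorised build on made-up OHLC data. `grow` and `vectorised` are illustrative names, and no timings are claimed since they vary by machine; the point is that the loop copies the whole accumulated result on every iteration:

```r
n <- 200
prices <- data.frame(time = as.Date("2007-01-03") + seq_len(n),
                     high = runif(n, 29, 31),
                     low  = runif(n, 28, 30))

# Quadratic: every rbind copies the entire accumulated data frame
grow <- function(p) {
  out <- data.frame()
  for (i in seq_len(nrow(p))) {
    out <- rbind(out, data.frame(date = p$time[c(i, i, NA)],
                                 y    = c(p$high[i], p$low[i], NA)))
  }
  out
}

# Linear: build both columns in full, then blank every third date
vectorised <- function(p) {
  res <- data.frame(date = rep(p$time, each = 3),
                    y = as.vector(rbind(p$high, p$low, NA)))
  res$date[seq(3, nrow(res), by = 3)] <- NA
  res
}

identical(grow(prices)$y, vectorised(prices)$y)
# TRUE
```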
