Finding monthly average of columns using group_by function in R - r

I have a dataset with daily values, and I want to find the monthly average of each column. The following code used to work for me, but it no longer does and I don't understand why. It now gives me data1 as 1 obs. of 1 variable, which is NA.
data %>% group_by(month=floor_date(Timestamp, "month")) %>%
summarize(USDTRY=mean(USDTRY)) -> data1
The following is how my data looks:
dput(head(data))
structure(list(Timestamp = structure(c(1629417600, 1629331200,
1629244800, 1629158400, 1629072000, 1628812800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), USDTRY = c(8.4852, 8.4939, 8.4485, 8.4284, 8.453,
8.5171), EURTRY = c(9.9325, 9.9311, 9.8916, 9.8746, 9.9618, 10.0539
), EURUSD = c(1.1696, 1.1674, 1.171, 1.1708, 1.1777, 1.1791),
BIST100 = c(1444.63, 1439.86, 1449.59, 1461.69, 1455.25,
1447.64), TR2YT = c(18.01, 18.01, 18.01, 18.01, 18.01, 18.15
), TR10YT = c(16.88, 16.87, 16.79, 16.8, 16.69, 16.77), TR_EURBON_2 = c(3.648673,
3.63085, 3.611969, 3.572728, 3.567871, 3.559959), TR_EURBON_10 = c(6.302608,
6.307343, 6.276473, 6.240502, 6.255035, 6.301358), BRENT = c(65.18,
66.45, 68.23, 69.03, 69.51, 70.59), WTI = c(62.32, 63.69,
65.46, 66.59, 67.29, 68.44), Altın = c(1780.8668, 1780.179,
1787.59, 1785.9556, 1787.2383, 1779.1515), Gümüş = c(23.01,
23.23, 23.4805, 23.6351, 23.8235, 23.74)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Any idea how can I solve it?
Thanks.
(Note also that my Timestamp column shows values such as 2021-08-01, 2021-08-18... when I view(data), but they appear as 1629417600, 1629331200 in the dput output.)
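No answer is recorded for this question. One possible cause, offered only as a guess: if another package such as plyr was loaded after dplyr, its summarize() masks dplyr's, ignores the grouping and returns a single row, and any NA in USDTRY then makes that single mean NA. A minimal sketch under that assumption, using a small made-up sample rather than the real data, calls dplyr::summarise explicitly and drops missing values:

```r
library(dplyr)
library(lubridate)

# Made-up sample in the same shape as the question's data (an assumption)
data <- tibble(
  Timestamp = as.POSIXct(c("2021-07-30", "2021-08-19", "2021-08-20"), tz = "UTC"),
  USDTRY = c(8.40, 8.49, NA)
)

data1 <- data %>%
  group_by(month = floor_date(Timestamp, "month")) %>%
  # dplyr:: prefix so a masking summarise() from another package is not used
  dplyr::summarise(USDTRY = mean(USDTRY, na.rm = TRUE))
```

If this returns one row per month, the original pipeline was most likely hitting a masked summarize(); running conflicts() in the session would show which functions are masked.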

Related

How to plot layers of tuples on same plot in R?

I am trying to plot the time and NDVI for each region on the same plot. I think to do this I have to convert the date column from characters to time and then plot each layer. However I cannot figure out how to do this. Any thoughts?
list(structure(list(observation = 1L, HRpcode = NA_character_,
timeseries = NA_character_), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3793765496776215", "0.21686891782421552", "0.3785652933528299",
"0.41027240624704164", "0.4035578030242673", "0.341299793064468"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.4071076986818826", "0.09090719657570319", "0.35214166081795284",
"0.4444311032927228", "0.5220702877666005", "0.5732370503295022"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3412131556625801", "0.18815996897460135", "0.5218904976415136",
"0.6970128777711452", "0.7229657162729096", "0.535967435470161"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
First we need to clean your data. The first element in this list is empty
df = df[-1]
Now we need to make a data.frame
df = do.call(rbind, df)
I am going to add a region variable, rename NDVI to remove the leading space,
convert ndvi to a numeric vector, and convert time to a Date object
library(dplyr)
df = df %>%
mutate(region = factor(rep(1:3, rep(6, 3)))) %>%
rename(ndvi = ' NDVI') %>%
mutate(ndvi = as.numeric(ndvi)) %>%
mutate(time = as.Date(time))
Now we can use ggplot2 to plot the data by region
library(ggplot2)
g = df %>%
ggplot(aes(x = time, y = ndvi, col = region)) +
geom_line()
g
This gives a line plot of ndvi against time, with one coloured line per region.
Here's an approach with lubridate to handle dates and dplyr to make the binding of the data.frames easier to understand.
Note that the group names are taken from the names of the list, and since those don't exist in the data you provided, we have to set them in advance.
library(lubridate)
library(ggplot2)
library(dplyr)
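data <- data[-1]  # drop the empty first element, assuming data is the list from the question
names(data) <- 1:3
data <- bind_rows(data, .id = "group")
data$time <- ymd(data$time)
data <- rename(data, NDVI = " NDVI")  # dplyr::rename; setnames() would need data.table
data$NDVI <- as.numeric(data$NDVI)
ggplot(data, aes(x = time, y = NDVI, color = group)) + geom_line()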

Find value of a row by comparing two columns and a value with a range of a different dataset

I have 2 different datasets. One with an object that comes from a StationX and goes to StationY and arrives at a specific date and time as the following.
df1<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"), To = c("Station15", "Station2", "Station2", "Station7"),
Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L),class = c("tbl_df","tbl", "data.frame"))
Dataset2 contains, e.g., trucks that wait for the specific object at StationY between the date and time "Arrival" and "Departure1", and leave at "Departure1" for a specific region "TOID".
As in the following:
df2<-structure(list(TOID = c(2, 4, 7, 20), Station = c("Station15",
"Station2", "Station2","Station7"), Arrival = structure(c(971169600, 971172000, 971177700, 971179500), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Departure1 = structure(c(971170200, 971173200, 971178600, 971179800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I want to look up the TOID in Dataset2 and add it to Dataset1 if "To" (Dataset1) = "Station" (Dataset2) and "Arrival" (Dataset2) <= "Arrival" (Dataset1) <= "Departure1" (Dataset2), giving the following outcome:
df1outcome<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"
), To = c("Station15", "Station2", "Station2", "Station7"), `TO_ID` = c(2, 4, 7, 20), Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I need a solution that looks in Dataset2 for the ID that matches the conditions, regardless of row order.
It would be awesome if you could help me code this in R.
Best,
J
Perhaps you could use tidyverse, use a left_join based on the station, and then filter based on dates:
library(tidyverse)
df1 %>%
left_join(df2, by = c("To" = "Station"), suffix = c("1","2")) %>%
filter(Arrival1 >= Arrival2 & Arrival1 <= Departure1) %>%
select(-c(Arrival2, Departure1))
# A tibble: 4 x 4
From To Arrival1 TOID
<chr> <chr> <dttm> <dbl>
1 Station1 Station15 2000-10-10 09:22:00 2
2 Station5 Station2 2000-10-10 10:12:00 4
3 Station6 Station2 2000-10-10 11:42:00 7
4 Station10 Station7 2000-10-10 12:07:00 20
I'm pretty new to R, so this code is probably longer than it should be. But does this work?
library(dplyr)
# rename variables so it's easier to merge the objects and compare them
df1 <- df1 %>% rename(Arrival_Package = Arrival)
df2 <- df2 %>% rename(Arrival_Truck = Arrival)
#merge objects
df1outcome <- merge(df1, df2, by.x = "To", by.y = "Station")
#subset from object and select relevant columns
df1outcome <- subset(df1outcome, Arrival_Package <= Departure1)
df1outcome <- subset(df1outcome, Arrival_Truck <= Arrival_Package)
df1outcome <- df1outcome %>% select(From, To, TOID, Arrival_Package)

time average for specific time range in r

I am trying to extract average values of all variables between 0 to 40 minutes every hour.
dput(head(df))
structure(list(DateTime = structure(c(1563467460, 1563468060,
1563468660, 1563469260, 1563469860, 1563470460), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), date = structure(c(1563467460, 1563468060,
1563468660, 1563469260, 1563469860, 1563470460), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), Date = structure(c(18095, 18095, 18095,
18095, 18095, 18095), class = "Date"), TimeCtr = structure(c(1563467460,
1563468060, 1563468660, 1563469260, 1563469860, 1563470460), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), MassConc = c(0.397627, 0.539531, 0.571902,
0.608715, 0.670382, 0.835773), VolConc = c(175.038, 160.534,
174.386, 183.004, 191.074, 174.468), NumbConc = c(234.456, 326.186,
335.653, 348.996, 376.018, 488.279), MassD = c(101.426, 102.462,
101.645, 102.145, 101.255, 101.433)), .Names = c("DateTime",
"date", "Date", "TimeCtr", "MassConc", "VolConc", "NumbConc",
"MassD"), row.names = c(NA, 6L), class = "data.frame")
What I've tried so far..
hourly_mean<-mydata %>%
filter(between(as.numeric(format(DateTime, "%M")), 0, 40)) %>%
group_by(DateTime=format(DateTime, "%Y-%m-%d %H")) %>%
summarise(variable1_mean=mean(variable1))
But it gives me a single average value for the whole period. Any help is very much welcomed.
We can parse DateTime, use ceiling_date with the "hour" unit to round it up to the next hour, extract the minutes from DateTime, filter to keep rows whose minute is at most 40, group_by hour, and take the mean of the values.
library(lubridate)
library(dplyr)
df %>%
dplyr::mutate(DateTime = ymd_hm(DateTime),
hour = ceiling_date(DateTime, "hour"),
minutes = minute(DateTime)) %>%
filter(minutes <= 40) %>%
group_by(hour) %>%
summarise_at(vars(ends_with("Conc")), mean)
data
df <- structure(list(DateTime = structure(1:7, .Label = c("2019-08-0810:07",
"2019-08-0810:17", "2019-08-0810:27", "2019-08-0810:37", "2019-08-0810:47",
"2019-08-0810:57", "2019-08-0811:07"), class = "factor"), MassConc = c(0.556398,
1.06868, 0.777654, 0.87289, 0.789704, 0.51948, 0.416676), NumbConc = c(588.069,
984.018, 964.634, 997.678, 1013.52, 924.271, 916.357), VolConc = c(582.887,
979.685, 963.3, 994.178, 1009.52, 922.104, 916.856), Conc = c(281.665,
486.176, 420.058, 422.101, 429.841, 346.539, 330.282)), class =
"data.frame", row.names = c(NA, -7L))
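The same idea can be sketched in base R, here grouping by the clock hour itself rather than rounding up. This is only an illustration on a small made-up sample (the values and the choice of MassConc are assumptions, not the asker's data): keep rows whose minute is between 0 and 40, label each row with its hour, then average per hour.

```r
# Made-up sample in the same shape as the question's data (an assumption)
df <- data.frame(
  DateTime = as.POSIXct(c("2019-08-08 10:07", "2019-08-08 10:37",
                          "2019-08-08 10:47", "2019-08-08 11:07"), tz = "GMT"),
  MassConc = c(1, 3, 100, 5)
)

# Minute within the hour, as an integer
minute_of_hour <- as.integer(format(df$DateTime, "%M"))

# Keep only observations taken in minutes 0 to 40 of each hour
kept <- df[minute_of_hour <= 40, ]

# Label each remaining row with its date and hour, then average per label
kept$hour <- format(kept$DateTime, "%Y-%m-%d %H")
hourly_mean <- aggregate(MassConc ~ hour, data = kept, FUN = mean)
```

The 10:47 reading is dropped by the filter, so each hour's mean uses only its first 41 minutes.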

R: add a new column to dataframes from a function

I have many tibbles similar to this:
dftest_tw <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33)), .Names = c("text", "Tweet.id",
"created.date", "created.week"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
For testing, we add another one:
dftest2_tw <- dftest_tw
I have this list of my df:
myUserList <- ls(,pattern = "_tw")
What I am looking to do is:
1- add a new column named Twitter.name
2- fill the column with the df name, all this in a function. The following code works for each df taken one by one:
dftest_tw %>% rowwise() %>% mutate(Twitter.name = myUserList[1])
The desired result is this:
MyRes <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33), retweet = c(0, 0, 0), custom = c(0,
0, 0), Twitter.name = c("dftest_tw", "dftest_tw", "dftest_tw"
)), .Names = c("text", "Tweet.id", "created.date", "created.week",
"retweet", "custom", "Twitter.name"), class = c("rowwise_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
When it comes to writing a function that can then be applied to all my df (more than 100), I can't manage it. Any help would be appreciated.
We can use tidyverse options. Get the values of multiple string objects with mget, then with map2 from purrr, create the new column 'Twitter.name' in each dataset of the list from the corresponding string element of 'myUserList'
library(tidyverse)
lst <- mget(myUserList) %>%
map2(myUserList, ~mutate(.data = .x, Twitter.name = .y))
If we need to modify the objects in the global environment, use list2env
list2env(lst, envir = .GlobalEnv)
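For comparison, a base R sketch of the same pattern, assuming the data frames are already collected in a named list (the names a_tw and b_tw and their contents are made up for illustration): Map walks over the list and its names in parallel and writes each name into a new column.

```r
# Hypothetical stand-ins for the many *_tw tibbles
dfs <- list(
  a_tw = data.frame(text = c("x", "y")),
  b_tw = data.frame(text = "z")
)

# Add the list name as a Twitter.name column to every data frame
add_name <- function(d, nm) {
  d$Twitter.name <- nm  # the single name is recycled to every row
  d
}
lst <- Map(add_name, dfs, names(dfs))
```

With objects living in the global environment, dfs could be built with mget(myUserList) as in the answer above.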

Applying a function to a few rows then the next few rows

I am trying to find the max of rows 2:5, then 3:6, then 4:7 and so on for nrows(df). I am however having a problem thinking of how to do this because I have never used a for loop in the past successfully. Any help is greatly appreciated.
structure(c(76.89, 77.08, 77.05, 77.28, 77.28, 77.61, 77.03,
77.61, 77.28, 77.3, 77.37, 77.61, 76.7, 77, 76.98, 77.09, 77.21,
77.5, 76.74, 77.49, 76.98, 77.2, 77.29, 77.58, NA, 76.91, 77.27,
77.13, 77.24, 77.45, NA, 0.910154726303475, 0.0129416332341208,
0.220407104887854, 0.168306576903153, 0.20658489347966, NA, 0.117019893381879,
-0.3753073637893, -0.0518604952677195, -0.0388399792853642, 0.0645577792123914
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", class = c("xts",
"zoo"), index = structure(c(631324800, 631411200, 631497600,
631756800, 631843200, 631929600), tzone = "UTC", tclass = "Date"), .Dim = 6:7, .Dimnames = list(
NULL, c("open", "high", "low", "close", "avgco", "percenthigh",
"percentlow")))
Specifically I want to apply the max function over the AD1$high column for rows 2 through 5 then rows 3 through 6 and so on and have this in a new column.
Thank You
You could do it by making three copies of your column (i.e "high") and offsetting them so one starts ahead one value and one starts behind one value. Then just take the max as you iterate across them:
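y <- yourdata
t <- as.numeric(y[, "high"])      # plain numeric vector of the "high" column
tback <- t[2:length(t)]           # copy shifted one value ahead
tforward <- append(NA, t)         # copy shifted one value behind
using a loop
maxvals <- numeric(length(t))     # initialise the result before filling it
for (i in seq_along(t)) {
maxvals[i] <- max(c(t[i], tback[i], tforward[i]), na.rm = TRUE)
}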
output
> maxvals
[1] 77.61 77.61 77.61 77.37 77.61 77.61
Or you could do it more efficiently without a loop by initializing maxvals to the proper length and filling its values.
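A loop-free version of the same three-value comparison can be sketched with pmax, which takes the element-wise maximum across vectors. The high values below are copied from the question's data; the variable names are chosen for illustration.

```r
# "high" column from the question's data
high <- c(77.03, 77.61, 77.28, 77.30, 77.37, 77.61)
n <- length(high)

ahead  <- c(high[-1], NA)   # each position sees the next value
behind <- c(NA, high[-n])   # each position sees the previous value

# Element-wise maximum of current, next and previous value
maxvals <- pmax(high, ahead, behind, na.rm = TRUE)
```

This reproduces the loop's result, 77.61 77.61 77.61 77.37 77.61 77.61, in a single vectorised call.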
Using the zoo function "rollapply" solved my problem.
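The exact call the asker used is not shown. As a sketch only, a rolling maximum over four rows at a time (rows 1:4, 2:5, 3:6, ...) could look like this, with the window aligned to its first row and positions without a full window filled with NA:

```r
library(zoo)

# "high" column from the question's data
high <- c(77.03, 77.61, 77.28, 77.30, 77.37, 77.61)

# Maximum over each window of 4 consecutive values;
# align = "left" stores the result at the window's first row
roll_max <- rollapply(high, width = 4, FUN = max, align = "left", fill = NA)
```

Using align = "right" instead would store each result at the window's last row, which may suit a new column that should only look backwards.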
