I have a dataframe like this one:
library(lubridate)
set.seed(23)
date_list = seq(ymd('2000-01-15'),ymd('2010-09-18'),by='day')
testframe = data.frame(Date = date_list)
testframe$Day = substr(testframe$Date, start = 6, stop = 10)
testframe$ABC = rnorm(3900)
testframe$DEF = rnorm(3900)
testframe$GHI = seq(from = 10, to = 25, length.out = 3900)
testframe$JKL = seq(from = 5, to = 45, length.out = 3900)
I want an automatic rolling subset of this data frame, which should look like this:
testframe_ABC = testframe[,c("Date","Day","ABC")]
testframe_DEF = testframe[,c("Date","Day","DEF")]
testframe_GHI = testframe[,c("Date","Day","GHI")]
testframe_JKL = testframe[,c("Date","Day","JKL")]
The columns Date and Day should always stay; the other columns should be added one at a time. The name of the varying column should be appended to the data frame name to give each new data frame its name. The data frames could also be stored in a list, if possible.
Any ideas how to do that?
I assume you want four data frames, one each for ABC, DEF, etc. It would be better to put them in a list:
L <- Map(function(nm) testframe[c("Date", "Day", nm)], names(testframe)[-(1:2)])
in which case L$ABC or L[[1]] will refer to the ABC data frame. If you would rather leave them dangling in the global environment, this will copy the list components to it:
list2env(L, .GlobalEnv)
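As a minimal sketch of the same idea, the list components can be renamed to the testframe_ABC style from the question and copied into a separate environment instead of the global one. The small testframe and the names vars and env below are made up for illustration.

```r
# Small stand-in for the question's data (values are illustrative only)
testframe <- data.frame(
  Date = as.Date(c("2000-01-15", "2000-01-16", "2000-01-17")),
  Day  = c("01-15", "01-16", "01-17"),
  ABC  = 1:3,
  DEF  = 4:6
)

# Every column except the two fixed ones
vars <- setdiff(names(testframe), c("Date", "Day"))

# One data frame per varying column; Map names the list after 'vars'
L <- Map(function(nm) testframe[c("Date", "Day", nm)], vars)

# Rename to the testframe_<column> pattern asked for in the question
names(L) <- paste0("testframe_", names(L))

# Copy into a dedicated environment rather than .GlobalEnv
env <- new.env()
list2env(L, envir = env)
ls(env)
```

Keeping the data frames in L (or in env) makes it easy to loop over them later with lapply.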
I would not use the term rolling in this context. Typically that term refers to a sliding window such as this:
library(zoo)
rollmeanr(1:10, 3) # 2 is mean of 1:3, 3 is mean of 2:4, etc.
## [1] 2 3 4 5 6 7 8 9
You can use split.default to split on every column and then cbind the first 2 columns to every element, i.e.
lapply(split.default(testframe[-c(1, 2)], seq_along(testframe)[-c(1, 2)]), function(i)
cbind.data.frame(testframe[c(1, 2)], i))
which gives a list,
$`3`
Date Day ABC
1 2000-01-15 01-15 0.1932123
2 2000-01-16 01-16 -0.4346821
3 2000-01-17 01-17 0.9132671
$`4`
Date Day DEF
1 2000-01-15 01-15 1.7933881
2 2000-01-16 01-16 0.9966051
3 2000-01-17 01-17 1.1074905
$`5`
Date Day GHI
1 2000-01-15 01-15 10.0
2 2000-01-16 01-16 17.5
3 2000-01-17 01-17 25.0
$`6`
Date Day JKL
1 2000-01-15 01-15 5
2 2000-01-16 01-16 25
3 2000-01-17 01-17 45
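The list above is named by column position (`3`, `4`, ...). If names based on the column names are preferred, one possible variation is to split on names() instead of seq_along(). This is only a sketch on a reduced data frame, and vary is a helper name introduced here.

```r
# Reduced version of the example data (rounded values, two varying columns)
testframe <- data.frame(
  Date = as.Date(c("2000-01-15", "2000-01-16", "2000-01-17")),
  Day  = c("01-15", "01-16", "01-17"),
  ABC  = c(0.19, -0.43, 0.91),
  DEF  = c(1.79, 1.00, 1.11)
)

# Columns that vary, i.e. everything except Date and Day
vary <- testframe[-(1:2)]

# Split column-wise by column name, then prepend the two fixed columns
L <- lapply(split.default(vary, names(vary)),
            function(col) cbind(testframe[1:2], col))

names(L)
```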
DATA USED
dput(testframe)
structure(list(Date = structure(c(10971, 10972, 10973), class = "Date"),
Day = c("01-15", "01-16", "01-17"), ABC = c(0.193212333898146,
-0.434682108206693, 0.913267096589322), DEF = c(1.79338809206353,
0.996605106833546, 1.10749048744809), GHI = c(10, 17.5, 25
), JKL = c(5, 25, 45)), row.names = c(NA, -3L), class = "data.frame")
I have a data frame with 142 columns, one of which is called "Date" (of class POSIXct). I'd like to create a new column from it that groups consecutive dates together. Dates separated by more than 2 days should fall into separate groups.
I'd also like to name the level of the group with the name of month the consecutive dates start in (For example: Jan. 3rd, 2018 -> Jan. 12th 2018 = group level called "January sampling event"; Feb 27th, 2018 -> March 1st, 2018 = group level called "February sampling event"; etc...).
I've seen very similar questions like Group consecutive dates in R and R: group dates that are next to each other, but just can't get it to work for my data.
EDIT:
My data example (the last row shows that dates separated by over a year are grouped together, for some reason)
> dput(df)
structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
My attempt:
df$group <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Remove time from date time
It's hard to tell exactly what the problem is without seeing your data (or similar example data), but my guess is that the date-time format (the 00:00:00 part) is interfering with as.Date.
One solution would be to extract just the date part and then try again with just the date part:
# here are your date times
date_time <- "2018-01-03 00:00:00"
# this looks for 4 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9,followed by a dash, followed by 2 digits between 0 and 9
date_pattern <- " ?([0-9]{4}-[0-9]{2}-[0-9]{2}) ?"
#need this library
library(stringr)
library(magrittr) #for pipes
#this pulls out text matching the pattern we specified in date pattern
date_new <- str_extract(date_time, date_pattern) %>%
str_squish() # this removes white space
# this is the new date without the time
date_new
# then we convert to as date
date_new <- as.Date(date_new)
See if converting your date column to just dates and then rerunning your grouping works.
If you have dates in different formats and need to adapt the regular expression, here's something about regular expressions: https://stackoverflow.com/a/49286794/16502170
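If the timestamps always start with a YYYY-MM-DD date, a base R alternative to the regular expression is possible. The sketch below assumes that layout and assumes UTC as the time zone; both are assumptions about your data, and the second timestamp is made up for illustration.

```r
# Example timestamps stored as text
date_time <- c("2018-01-03 00:00:00", "2018-02-27 13:45:10")

# Option 1: keep only the first 10 characters (the date part) and convert
date_from_text <- as.Date(substr(date_time, 1, 10))

# Option 2: if the column is already POSIXct, convert directly and
# state the time zone explicitly so the day does not shift
stamp <- as.POSIXct(date_time, tz = "UTC")
date_from_stamp <- as.Date(stamp, tz = "UTC")

date_from_text
date_from_stamp
```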
Group dates
Let's start with an example data frame that contains a date column
# here's a bunch of example dates:
library(lubridate)
dates2 <- seq.Date(as.Date("2018-03-01"),by="days",length.out = 60)
#here's the dataframe
exampl_df <- data.frame(animals = rep(c("cats","dogs","rabbits"),20), dates=dates2,
numbers= rep(1:3,20))
Here's what it looks like:
head(exampl_df)
animals dates numbers
1 cats 2018-03-01 1
2 dogs 2018-03-02 2
3 rabbits 2018-03-03 3
4 cats 2018-03-04 1
5 dogs 2018-03-05 2
6 rabbits 2018-03-06 3
Then let's make a sequence of every day between the minimum and maximum date in the sequence. This step is important because there may be missing dates in our data that we still want counting towards the separation between days.
# this is a day by day sequence from the earliest day in your data to the latest day
date_sequence <- seq.Date(from = min(dates2),max(dates2),by="day")
Then let's make a sequence of numbers each repeated seven times. If you wanted to group every three days, you could change each to 3. Then the length.out= length(date_sequence) tells R to make this vector have as many entries as the min to max date sequence has:
# and then if you want a new group every seven days you can make this number sequence
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
Then let's attach the groups to the date_sequence to make a grouping index
date_grouping_index <- data.frame(a=date_sequence,b=groups)
Then you can do a join to attach the groups to the original dataframe:
library(dplyr)
example_df2 <- exampl_df %>%
inner_join(date_grouping_index, by=c("dates"="a"))
This is what we get:
head(example_df2,n=10)
animals dates numbers b
1 cats 2018-03-01 1 1
2 dogs 2018-03-02 2 1
3 rabbits 2018-03-03 3 1
4 cats 2018-03-04 1 1
5 dogs 2018-03-05 2 1
6 rabbits 2018-03-06 3 1
7 cats 2018-03-07 1 1
8 dogs 2018-03-08 2 2
9 rabbits 2018-03-09 3 2
10 cats 2018-03-10 1 2
Then you should be able to group_by() or aggregate() your data using column b.
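As a rough sketch of that last step, the same 7-day group index can also be computed directly from the number of days elapsed since the first date, without the join, and then passed to aggregate(). The name days_elapsed is introduced here for illustration.

```r
# Example data mirroring the answer: 60 consecutive days
dates <- seq(as.Date("2018-03-01"), by = "day", length.out = 60)
exampl_df <- data.frame(dates = dates, numbers = rep(1:3, 20))

# Days elapsed since the first date, cut into blocks of 7
days_elapsed <- as.integer(exampl_df$dates - min(exampl_df$dates))
exampl_df$b <- days_elapsed %/% 7 + 1

# Sum 'numbers' within each 7-day group
totals <- aggregate(numbers ~ b, data = exampl_df, FUN = sum)
head(totals)
```

Because the index is based on elapsed days, missing dates still count towards the 7-day blocks, as with the full date sequence.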
Using the data provided in the question
#original data
df <- structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
#plus extra step
df$group2 <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Method described above
date_sequence <- seq.Date(from = min(df$Date),max(df$Date),by="day")
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
date_grouping_index <- data.frame(a=date_sequence,groups=groups)
example_df2<- df %>%
inner_join(date_grouping_index, by=c("Date"="a"))
Looks like it worked?
example_df2
# A tibble: 9 x 4
Date group group2 groups
<date> <dbl> <dbl> <int>
1 2018-01-03 1 1 1
2 2018-01-04 1 1 1
3 2018-01-05 1 1 1
4 2018-01-06 1 1 1
5 2020-01-18 2 2 107
6 2020-01-19 2 2 107
7 2020-01-20 2 2 107
8 2020-01-21 2 2 107
9 2019-02-05 2 2 57
Here's something you could do to make group names with the month and year in them:
example_df2$group_name <- paste0("sampling number ",
example_df2$groups,
" (",
month.name[month(example_df2$Date)],
"-",
year(example_df2$Date),
")")
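The fixed 7-day blocks above are not quite what the question describes (a new group whenever dates are more than 2 days apart). A possible sketch of that gap-based grouping follows. It assumes the dates must be sorted first, which would also explain why the unsorted last row ended up in group 2 in the question's attempt. The names gap, first_day and event are made up for illustration.

```r
# Dates from the question (note the last one is out of order)
df <- data.frame(
  Date = as.Date(c(17534, 17535, 17536, 17537,
                   18279, 18280, 18281, 18282, 17932),
                 origin = "1970-01-01")
)

# Sort so that diff() compares neighbouring dates
df <- df[order(df$Date), , drop = FALSE]

# Gap in days to the previous row (0 for the first row)
gap <- c(0, as.numeric(diff(df$Date)))

# Start a new group whenever the gap exceeds 2 days
df$group <- cumsum(gap > 2) + 1

# First date of each group, used to pick the month name
first_day <- as.Date(ave(as.numeric(df$Date), df$group, FUN = min),
                     origin = "1970-01-01")
df$event <- paste(month.name[as.POSIXlt(first_day)$mon + 1],
                  "sampling event")
df
```

Groups that start in the same month of different years get the same label here; appending the year to the label would keep them distinct.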
I want to paste together results within the same case_when statement (i.e., if multiple statements are true for a given row). I know that I could do something like below to create additional columns and then unite them together. But what is the best way to make the code more efficient (and less verbose) without having to explicitly create var.m and var.o?
Data
df <- structure(list(
ind = 1:10,
var = c(-21, -60, 7, 8,
9, 10, NA, 14, 101, 160)
),
class = "data.frame",
row.names = c(NA, -10L))
Code
library(tidyverse)
df %>%
mutate(
var.m = ifelse(row.names(df) %in% (which(abs(diff(var, lag = 1)) > 10)), "derivative", NA),
var.o = ifelse((var + 50) > 90, "add", NA),
results = case_when(is.na(var) ~ "Missing Data",
var > 100 ~ "High",
var < -20 ~ "Low")) %>%
unite(
"message",
c(var.m, results, var.o),
sep = "_",
remove = TRUE,
na.rm = TRUE
)
Output/Expected Output
ind var message
1 1 -21 derivative_Low
2 2 -60 derivative_Low
3 3 7
4 4 8
5 5 9
6 6 10
7 7 NA Missing Data
8 8 14 derivative
9 9 101 derivative_High_add
10 10 160 High_add
So, in other words, is it possible to forgo creating var.m and var.o and do everything within case_when? Or is there an alternative to creating multiple variables (i.e., var.m and var.o) before concatenating the messages together?
df %>%
mutate(
message = paste(
coalesce(if_else(c(abs(diff(var, lag = 1)) > 10, FALSE), "derivative", ""), ""),
coalesce(if_else((var+50) > 90, "add", ""), ""),
coalesce(case_when(
is.na(var) ~ "Missing Data",
var > 100 ~ "High",
var < -20 ~ "Low"), ""),
sep = "_"),
message = gsub("^_|_$", "",
gsub("__", "_", message))
)
# ind var message
# 1 1 -21 derivative_Low
# 2 2 -60 derivative_Low
# 3 3 7
# 4 4 8
# 5 5 9
# 6 6 10
# 7 7 NA Missing Data
# 8 8 14 derivative
# 9 9 101 derivative_add_High
# 10 10 160 add_High
The coalesce calls are there because any of the pieces can be NA (for example, when var is NA). I could add !is.na(var) to some of the conditions, but diff complicates this, since we'd also need !lead(is.na(var),default=F) or similar; coalesce just seemed simpler.
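For comparison, here is a base R sketch of the same idea that keeps NA for "no message" and drops the NAs when pasting, which avoids the gsub clean-up. It is only one possible arrangement; the matrix name parts is introduced here, and the column order matches the question's expected output (derivative, then results, then add).

```r
df <- data.frame(
  ind = 1:10,
  var = c(-21, -60, 7, 8, 9, 10, NA, 14, 101, 160)
)

# One column per message type; NA means "nothing to report"
parts <- cbind(
  m = ifelse(c(abs(diff(df$var)) > 10, FALSE), "derivative", NA),
  r = ifelse(is.na(df$var), "Missing Data",
      ifelse(df$var > 100, "High",
      ifelse(df$var < -20, "Low", NA))),
  o = ifelse(df$var + 50 > 90, "add", NA)
)

# Paste the non-NA pieces of each row together
df$message <- apply(parts, 1, function(p) paste(p[!is.na(p)], collapse = "_"))
df
```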
I got some data like this
df1 <- structure(list(id = c(1, 1, 1), time1 = c(10, 20, 30), time2 = c(15, 25, 35)), row.names = c(NA, 3L), class = "data.frame")
and I want to create a single column from the two columns in the above data
structure(list(id = c(1, 1, 1, 1, 1, 1), time = c(10, 15, 20, 25, 30, 35)), row.names = c(NA, 6L), class = "data.frame")
I don't think it's the same as converting into long format, because I don't want the two columns that gather() produces, one with the names of the original columns and one with the values.
We can use pivot_longer, which is more general because it can also reshape based on other patterns and across multiple columns. Note that pivot_longer supersedes tidyr's gather (and, before that, reshape2's melt) and is more flexible.
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = time1:time2, values_to = 'time') %>%
select(-name)
-output
# A tibble: 6 x 2
# id time
# <dbl> <dbl>
#1 1 10
#2 1 15
#3 1 20
#4 1 25
#5 1 30
#6 1 35
Or using base R with stack
transform(stack(df1[-1])[1], id = rep(df1$id, 2))[2:1]
Or can use data.frame with unlist
data.frame(id = df1$id, value = unlist(df1[-1], use.names = FALSE))
Alternative to tidyr, though that's a good way to do it:
reshape2::melt(df1, "id")[,-2]
# id value
# 1 1 10
# 2 1 20
# 3 1 30
# 4 1 15
# 5 1 25
# 6 1 35
(Normally it includes the pivoted column names as a column itself, so the [,-2] removes that since your expected output didn't have it. You can do just melt(.) if you want/need to keep it.)
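If the interleaved row order from the question's expected output (10, 15, 20, ...) matters, a base R sketch that produces it is shown below. It relies on transposing the time columns so that c() reads them row by row; out is a name introduced here.

```r
df1 <- data.frame(id = c(1, 1, 1),
                  time1 = c(10, 20, 30),
                  time2 = c(15, 25, 35))

# t() puts time1/time2 in rows, so c() walks through each original row in turn
out <- data.frame(
  id   = rep(df1$id, each = ncol(df1) - 1),
  time = c(t(df1[-1]))
)
out
```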
I have a dataframe with values defined per bucket. (See df1 below)
Now I have another dataframe with values within those buckets for which I want to look up a value from the bucketed dataframe (See df2 below)
Now I would like to have the result df3 below.
df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
> df2
KEY
1 2
2 5
3 9
> df3
KEY VALUE
1 2 3
2 5 56
3 9 8
EDIT :
Extended the example.
> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 3
4 14 18 5
> df2
KEY
1 2
2 5
3 9
4 18
5 3
> df3
KEY VALUE
1 2 3
2 5 56
3 9 3
4 18 5
5 3 3
This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
group_by(VALUE, id=row_number()) %>% # for each value and row id
nest() %>% # nest rest of columns
mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>% # create a sequence of keys
unnest(KEY) %>% # unnest those keys
right_join(df2, by="KEY") %>% # join the other dataset
select(KEY, VALUE)
# # A tibble: 5 x 2
# KEY VALUE
# <dbl> <dbl>
# 1 2.00 3.00
# 2 5.00 56.0
# 3 9.00 3.00
# 4 18.0 5.00
# 5 3.00 3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
group_by(id=row_number()) %>%
nest() %>%
mutate(K = map(data, ~data.frame(VALUE = .$VALUE,
KEY = seq(.$MIN, .$MAX)))) %>%
unnest(K) %>%
right_join(df2, by="KEY") %>%
select(KEY, VALUE)
A very good and well-thought-out solution from @AntioniosK.
Here's a base R solution implemented as a general lookup function. Its arguments are a key dataframe and a bucket dataframe, defined as listed in the question. The lookup values need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
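lookup = function(keydf, bucketdf){
  # remember the original key order
  keydf$rowid = seq_len(nrow(keydf))
  # no shared columns, so this merge is a Cartesian join
  res = merge(bucketdf, keydf)
  res = res[res$KEY >= res$MIN & res$KEY <= res$MAX,]
  # bring back keys that matched no bucket (VALUE becomes NA)
  res = merge(res, keydf, all.y = TRUE)
  res[order(res$rowid), c("rowid", "KEY", "VALUE")]
}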
The first merge uses a Cartesian join of all rows in the key to all rows in the bucket list. Such joins can be inefficient if the number of rows in the real tables is large, since joining x rows in the key to y rows in the bucket produces x * y rows. I doubt this would be a problem in this case unless x or y run into thousands of rows.
The second merge is done to recover any key values which are not matched to rows in the bucket list.
Using the example data as listed in @AntioniosK's post:
> lookup(df2, df1)
rowid KEY VALUE
2 1 2 3
4 2 5 56
5 3 9 3
1 4 18 5
3 5 3 3
Using key and bucket exemplars that test edge cases (where the key = the min or the max), where a key value is not in the bucket list (the value 50 in df2A), and where there is a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1,4,8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
df4
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
4 20 25 10
5 30 40 12
6 22 24 23
> df2A
KEY
1 3
2 6
3 22
4 30
5 50
> lookup(df2A, df4)
rowid KEY VALUE
1 1 3 3
2 2 6 56
3 3 22 10
4 3 22 23
5 4 30 12
6 5 50 NA
As shown above, the lookup in this case returns two values for the non-unique ranges matching the key value 22, and NA for values in the key but not in the bucket list.
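If the buckets are known not to overlap, the Cartesian join can be avoided. The sketch below uses findInterval() for that case. It assumes non-overlapping ranges (so it would not return both matches for the key 22 above), and it also works for non-integer keys. The names b, idx and hit are introduced here.

```r
df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18),
                  VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))

# findInterval needs the lower bounds in increasing order
b <- df1[order(df1$MIN), ]

# Index of the last bucket whose MIN is <= KEY (0 if KEY is below all buckets)
idx <- findInterval(df2$KEY, b$MIN)
idx[idx == 0] <- NA

# A key only matches if it also lies at or below that bucket's MAX
hit <- !is.na(idx) & df2$KEY <= b$MAX[idx]
df2$VALUE <- ifelse(hit, b$VALUE[idx], NA)
df2
```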
I have a data frame of dates which I want to merge with a list whose elements have different numbers of rows (3 or 4). I want to merge each date of the data frame with all elements (dates) of the corresponding list entry. For example, if the first two dates in df1 are merged with the first two elements of the list (bc), I would get the following:
date date1 id
1992-09-26 1992-09-05 1
1992-09-26 1992-09-12 1
1992-09-26 1992-09-19 1
1992-09-27 1992-09-06 2
1992-09-27 1992-09-13 2
1992-09-27 1992-09-20 2
How can I do that? I have searched this forum and did not find a similar problem or a comparable solution.
df1 <- structure(c(8304, 8305, 8306, 8307, 8308, 8309, 8310, 8311, 8312,
8313, 8314), class = "Date")
mylist <- list(structure(c(8283, 8290, 8297), class = "Date"), structure(c(8284,
8291, 8298), class = "Date"), structure(c(8285, 8292, 8299), class = "Date"),
structure(c(8279, 8286, 8293, 8300), class = "Date"), structure(c(8280,
8287, 8294, 8301), class = "Date"), structure(c(8316, 8323,
8330, 8337), class = "Date"), structure(c(8317, 8324, 8331,
8338), class = "Date"), structure(c(8318, 8325, 8332, 8339
), class = "Date"), structure(c(8319, 8326, 8333), class = "Date"),
structure(c(8320, 8327, 8334), class = "Date"), structure(c(8321,
8328, 8335), class = "Date"))
Highly inelegant, but it works:
df <- as.data.frame(do.call("rbind", mapply(function(x,y) cbind(date=x, date1=y, id=which(df1==x)), df1, mylist)))
class(df[,1]) <- "Date"
class(df[,2]) <- "Date"
head(df)
date date1 id
1 1992-09-26 1992-09-05 1
2 1992-09-26 1992-09-12 1
3 1992-09-26 1992-09-19 1
4 1992-09-27 1992-09-06 2
5 1992-09-27 1992-09-13 2
6 1992-09-27 1992-09-20 2
It sounds like you're just looking for rep and unlist, as in the following:
A <- sapply(mylist, length)
out <- data.frame(date = rep(df1, A),
date1 = unlist(mylist),
id = rep(seq_along(A), A))
head(out)
# date date1 id
# 1 1992-09-26 8283 1
# 2 1992-09-26 8290 1
# 3 1992-09-26 8297 1
# 4 1992-09-27 8284 2
# 5 1992-09-27 8291 2
# 6 1992-09-27 8298 2
tail(out)
# date date1 id
# 33 1992-10-05 8320 10
# 34 1992-10-05 8327 10
# 35 1992-10-05 8334 10
# 36 1992-10-06 8321 11
# 37 1992-10-06 8328 11
# 38 1992-10-06 8335 11
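Note that unlist() drops the Date class, which is why date1 prints as numbers above. One way to keep the dates as dates, sketched here on the first two elements of the data only, is to combine the list with do.call(c, ...). The name n is introduced here.

```r
# First two dates and list elements from the question
df1 <- as.Date(c(8304, 8305), origin = "1970-01-01")
mylist <- list(
  as.Date(c(8283, 8290, 8297), origin = "1970-01-01"),
  as.Date(c(8284, 8291, 8298), origin = "1970-01-01")
)

# Number of dates in each list element
n <- lengths(mylist)

out <- data.frame(
  date  = rep(df1, n),
  date1 = do.call(c, mylist),   # c() on Date objects keeps the Date class
  id    = rep(seq_along(n), n)
)
out
```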
You could also use data.table
library(data.table)
DT <- data.table(date = df1, date1 = mylist, key = "date")
DT[, id := 1:nrow(DT)]
DT[, as.IDate(unlist(date1), origin = "1970-01-01"), by = list(date, id)]