I have this data:
data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
           end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
           a = c(9, 1, 8),
           b = c(6, 5, 7))
and I want to transform it like this:
data.frame(date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16', '2020-03-06', '2020-03-13', '2002-03-20')),
           a = c(9, 1, 8, 9, 1, 8),
           b = c(6, 5, 7, 6, 5, 7))
How can I do it? Thanks!
You can use tidyr's gather() function for this.
- First, assign the data frame to an object.
- Then gather the start and end dates together with their a and b values (exclude a and b from gather() with a minus "-" sign), and name the value column "date". The output of the gather step looks like this:
df %>%
  gather(key, value = "date", -a, -b)
a b key date
1 9 6 start_date 2020-03-02
2 1 5 start_date 2020-03-09
3 8 7 start_date 2020-03-16
4 9 6 end_date 2020-03-06
5 1 5 end_date 2020-03-13
6 8 7 end_date 2002-03-20
- Finally, to get rid of the "key" column (start_date / end_date), select() only the columns you want.
See the full code below:
library(dplyr)
library(tidyr)

df <- data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
                 end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
                 a = c(9, 1, 8),
                 b = c(6, 5, 7)) # assign the data frame to an object

df1 <- df %>%
  gather(key, value = "date", -a, -b) %>% # gather the dates
  select(date, a, b) # keep only what is needed
This code produces the following output:
date a b
1 2020-03-02 9 6
2 2020-03-09 1 5
3 2020-03-16 8 7
4 2020-03-06 9 6
5 2020-03-13 1 5
6 2002-03-20 8 7
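Note that gather() has been superseded in tidyr; the same reshape can be written with pivot_longer(). A minimal sketch, assuming a current tidyr version (pivot_longer() interleaves rows by original row rather than stacking column by column, so add arrange() if the stacked order matters):

```r
library(dplyr)
library(tidyr)

df <- data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
                 end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
                 a = c(9, 1, 8),
                 b = c(6, 5, 7))

df %>%
  pivot_longer(cols = c(start_date, end_date), values_to = "date") %>%
  select(date, a, b)
```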
I'm trying to use Slider to compute moving averages over some time series data. The data has day resolution (one observation per day). For each observation I want to compute the average daily value over the last 7 days.
The problem is that my code ignores the missing observations, which have an implied value of zero. So if my period is 7 days and some 7-day window contains only 2 observations, it sums them and divides by 2, whereas I want to sum and divide by 7 to get the average per day.
In the code below you'll see that the second row (2023-02-03) is computing the average by dividing by 2 (the number of observations), rather than by dividing by 4 (the number of days in the period 2023-01-31 to 2023-02-03).
Is there a good way to achieve the desired result, or do I just need to replace the mean calculation with sum() / 7?
I had originally backfilled the missing observations which worked, but the data is relatively large and quite sparse and doing so massively increased the runtime (from ~8 seconds to ~100).
library(tidyverse)
library(slider)
data <- data.frame(
  date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
  val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
)
print(as_tibble(data))
summary <- function(data) {
  summarise(data,
    moving_total = sum(val),
    moving_avg = mean(val, na.rm = FALSE),
    num_observations = n()
  )
}

res <- data %>%
  arrange(date) %>%
  mutate(
    weekly = slide_period_dfr(
      .x = pick(everything()),
      .i = date,
      .period = "day",
      .f = summary,
      .before = 6,
      .complete = FALSE
    )
  )

print(as_tibble(res))
# A tibble: 11 x 2
date val
<date> <dbl>
1 2023-02-13 0
2 2023-02-12 0
3 2023-02-11 2
4 2023-02-10 1
5 2023-02-09 0
6 2023-02-07 10
7 2023-02-06 0
8 2023-02-05 1
9 2023-02-04 1
10 2023-02-03 6
11 2023-01-31 1
# A tibble: 11 x 3
date val weekly$moving_total $moving_avg $num_observations
<date> <dbl> <dbl> <dbl> <int>
1 2023-01-31 1 1 1 1
2 2023-02-03 6 7 3.5 2
3 2023-02-04 1 8 2.67 3
4 2023-02-05 1 9 2.25 4
5 2023-02-06 0 9 1.8 5
6 2023-02-07 10 18 3.6 5
7 2023-02-09 0 18 3 6
8 2023-02-10 1 13 2.17 6
9 2023-02-11 2 14 2.33 6
10 2023-02-12 0 13 2.17 6
11 2023-02-13 0 13 2.17 6
Just a note on the implementation: in the real world the moving averages are computed over groups, hence the use of pick(everything()) above. I don't think it's necessary for the toy example, but I leave it in in case it influences the answer.
Thanks
A straightforward solution is to supply zero values for the absent dates.
library(tidyverse)
library(slider)
(data_ <- tibble(
  date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
  val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
))

summary <- function(.data) {
  summarise(.data,
    moving_total = sum(val),
    moving_avg = mean(val, na.rm = FALSE),
    num_observations = n()
  )
}

(full_dates <- tibble(
  date = seq(min(data_$date), max(data_$date), by = "1 day")
))

(fdj <- left_join(full_dates, data_, by = "date") |>
  mutate(val = if_else(is.na(val), 0, val)))

(res <- mutate(fdj,
  weekly = slide_period_dfr(
    .x = pick(everything()),
    .i = date,
    .period = "day",
    .f = summary,
    .before = 6,
    .complete = FALSE
  )
))
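Alternatively, as the question itself suggests, completing the missing dates can be avoided entirely by dividing by the window length rather than by the observation count. A minimal sketch of the summarise step, assuming a fixed 7-day window (note the first few windows, which span fewer than 7 days of history, are also divided by 7):

```r
summary <- function(data) {
  summarise(data,
    moving_total = sum(val),
    moving_avg = sum(val) / 7, # divide by the window length, not n()
    num_observations = n()
  )
}
```

Since absent days contribute zero to the sum, this gives the same per-day average as backfilling, without inflating the data.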
I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just from scrubbing through the data, it seems the two columns are "paired" together, but is there any way of checking this explicitly?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter df so that the pairing no longer holds:

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. Since you have more than 10,000 rows, what remains is to check whether any group has n_unique > 1, for instance with filter(n_unique > 1).
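The whole check can be collapsed into one pipeline using n_distinct(); any rows returned are values of A with more than one associated B:

```r
library(dplyr)

df %>%
  group_by(A) %>%
  summarize(n_unique = n_distinct(B)) %>%
  filter(n_unique > 1)
# zero rows returned means A and B are perfectly paired
```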
If you run this, you will see how many unique values of B there are for each value of A:

tapply(dat$B, dat$A, function(x) length(unique(x)))

So if the maximum of this vector is 1, then no value of A has more than one corresponding value of B.
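Building on that, the check reduces to a single logical:

```r
counts <- tapply(dat$B, dat$A, function(x) length(unique(x)))
all(counts == 1) # TRUE if each value of A maps to exactly one value of B
```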
This is for R
date <- seq(as.Date("2020/03/11"), as.Date("2020/03/15"), "day") # 5 dates, to match the 5 values below
x_pos_a <- c(1, 5, 4, 9, 0)
x_pos_b <- c(2, 6, 9, 5, 4)
like so [...]
I have a time-series data frame with 69 time points; its rows are dates.
Four variables (pos, anx, ang, sad) have been measured from three populations (A, B, C). Three samples were drawn from each population (x, y, z). Currently, each combination of the variable, population, and sample forms a column in the dataframe. For example, "x_pos_A", "x_pos_B", "x_pos_C", x_anx_A"..."z_sad_b", "z_sad_c".
I want to reshape it in the following shape
"Date" "variables" "population" "sample" "value"
I have spent the last 3 hours searching the forum for answers but have been unsuccessful.
Any help much appreciated!
Thanks
You can use pivot_longer from tidyr :
tidyr::pivot_longer(df,
                    cols = -date,
                    names_to = c('sample', 'variable', 'population'),
                    names_sep = '_')
# date sample variable population value
# <date> <chr> <chr> <chr> <dbl>
# 1 2020-03-11 x pos a 1
# 2 2020-03-11 x pos b 2
# 3 2020-03-12 x pos a 5
# 4 2020-03-12 x pos b 6
# 5 2020-03-13 x pos a 4
# 6 2020-03-13 x pos b 9
# 7 2020-03-14 x pos a 9
# 8 2020-03-14 x pos b 5
# 9 2020-03-15 x pos a 0
#10 2020-03-15 x pos b 4
data
date <- seq(as.Date("2020/03/11"), as.Date("2020/03/15"), "day")
x_pos_a <- c(1, 5, 4, 9, 0)
x_pos_b <- c(2, 6, 9, 5, 4)
df <- data.frame(date, x_pos_a, x_pos_b)
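As a variation, names_pattern can replace names_sep when the split needs a more explicit match; a sketch assuming each name component contains no underscore of its own:

```r
tidyr::pivot_longer(df,
                    cols = -date,
                    names_to = c('sample', 'variable', 'population'),
                    names_pattern = '([^_]+)_([^_]+)_([^_]+)')
```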
With the dataset df:
df
confint row Index
0.3407,0.4104 1 1
0.2849,0.4413 2 2
0.2137,0.2674 3 3
0.1910,0.4575 4 1
0.4039,0.4905 5 2
0.403,0.4822 6 3
0.0301,0.0646 7 1
0.0377,0.0747 8 2
0.0835,0.0918 9 3
0.0437,0.0829 10 1
0.0417,0.0711 11 2
0.0718,0.0798 12 3
0.0112,0.0417 13 1
0.019,0.0237 14 2
0.0213,0.0293 15 3
0.0121,0.0393 16 1
0.0126,0.0246 17 2
0.0318,0.0428 18 3
0.0298,0.0631 19 1
0.018,0.0202 20 2
0.1031,0.1207 21 3
This should be a rather easy conversion from long to wide form, producing a 7 (row) x 3 (column) data frame: the result should have 3 columns named by Index and 7 rows (21/3 = 7). The code is as follows:
df <- spread(df,Index, confint, convert = FALSE)
However, using spread() I received the following error:
Error: Duplicate identifiers for rows (1, 4, 7, 10, 13, 16, 19), (2, 5, 8, 11, 14, 17, 20), (3, 6, 9, 12, 15, 18, 21)
Any help will be greatly appreciated!
We need to create a sequence column and then spread
library(tidyverse)
df %>%
  group_by(Index) %>%
  mutate(ind = row_number()) %>%
  spread(Index, confint, convert = FALSE)
NOTE: This would be an issue in the original dataset and not in the example data shown in the post.
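The same idea carries over to pivot_wider(), which replaces spread() in current tidyr; a sketch assuming the row column is dropped first so that rows can collapse onto the helper id:

```r
library(dplyr)
library(tidyr)

df %>%
  select(-row) %>%                 # drop the per-row id so rows can collapse
  group_by(Index) %>%
  mutate(ind = row_number()) %>%   # sequence column, as above
  ungroup() %>%
  pivot_wider(names_from = Index, values_from = confint)
```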
I have a dataframe with values defined per bucket. (See df1 below)
Now I have another dataframe with values within those buckets for which I want to look up a value from the bucketed dataframe (See df2 below)
Now I would like to have the result df3 below.
df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
> df2
KEY
1 2
2 5
3 9
> df3
KEY VALUE
1 2 3
2 5 56
3 9 8
EDIT :
Extended the example.
> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 3
4 14 18 5
> df2
KEY
1 2
2 5
3 9
4 18
5 3
> df3
KEY VALUE
1 2 3
2 5 56
3 9 3
4 18 5
5 3 3
This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  group_by(VALUE, id = row_number()) %>%              # for each value and row id
  nest() %>%                                          # nest the remaining columns
  mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>%     # create a sequence of keys
  unnest(KEY) %>%                                     # unnest those keys
  right_join(df2, by = "KEY") %>%                     # join the other dataset
  select(KEY, VALUE)
# # A tibble: 5 x 2
# KEY VALUE
# <dbl> <dbl>
# 1 2.00 3.00
# 2 5.00 56.0
# 3 9.00 3.00
# 4 18.0 5.00
# 5 3.00 3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(K = map(data, ~data.frame(VALUE = .$VALUE,
                                   KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by = "KEY") %>%
  select(KEY, VALUE)
A very good and well-thought-out solution from @AntioniosK.
Here's a base R solution implemented as a general lookup function that takes a key data frame and a bucket data frame as arguments, defined as in the question. The lookup values need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
lookup <- function(keydf, bucketdf) {
  keydf$rowid <- 1:nrow(keydf)
  m <- merge(bucketdf, keydf)                   # Cartesian join of keys to buckets
  m <- m[m$KEY >= m$MIN & m$KEY <= m$MAX, ]     # keep keys inside their bucket range
  m <- merge(m, keydf, all.y = TRUE)            # recover keys with no matching bucket
  m[order(m$rowid), c("rowid", "KEY", "VALUE")]
}
The first merge uses a Cartesian join of all rows in the key to all rows in the bucket list. Such joins can be inefficient if the real tables are large, since joining x rows in the key to y rows in the bucket list produces x * y rows; I doubt this would be a problem here unless x or y run into the thousands.
The second merge is done to recover any key values which are not matched to rows in the bucket list.
Using the example data as listed in @AntioniosK's post:
> lookup(df2, df1)
rowid KEY VALUE
2 1 2 3
4 2 5 56
5 3 9 3
1 4 18 5
3 5 3 3
Using key and bucket exemplars that test edge cases (where the key = the min or the max), where a key value is not in the bucket list (the value 50 in df2A), and where there is a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1,4,8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
df4
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
4 20 25 10
5 30 40 12
6 22 24 23
> df2A
KEY
1 3
2 6
3 22
4 30
5 50
> lookup(df2A, df4)
rowid KEY VALUE
1 1 3 3
2 2 6 56
3 3 22 10
4 3 22 23
5 4 30 12
6 5 50 NA
As shown above, the lookup in this case returns two values for the non-unique ranges matching the key value 22, and NA for values in the key but not in the bucket list.
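In current dplyr (>= 1.1.0) the same range lookup can be expressed directly as a non-equi join with join_by(), which also handles non-integer keys, overlapping ranges (duplicate matches), and unmatched keys (NA), without expanding the buckets into sequences; a sketch using the example data:

```r
library(dplyr) # >= 1.1.0 for join_by()

df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))

left_join(df2, df1, by = join_by(between(KEY, MIN, MAX))) %>%
  select(KEY, VALUE)
```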