I have a large data.frame that I am trying to spread. A toy example looks like this.
data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))
head(data)
date ticker value
1 2019 SPY 1
2 2020 SPY 2
3 2019 MSFT 3
4 2020 MSFT 4
I would like to spread it so the data.frame looks like this.
spread(data, key = ticker, value = value)
date MSFT SPY
1 2019 3 1
2 2020 4 2
However, when I do this on my actual data.frame, I get an error.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882
Below are the head and tail of my data.frame
head(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2008-02-01 SPY NA
2 2008-02-04 SPY NA
3 2008-02-05 SPY NA
4 2008-02-06 SPY NA
5 2008-02-07 SPY NA
6 2008-02-08 SPY -0.0478
tail(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2020-02-12 MDYV 0.00293
2 2020-02-13 MDYV 0.00917
3 2020-02-14 MDYV 0.0179
4 2020-02-18 MDYV 0.0107
5 2020-02-19 MDYV 0.00422
6 2020-02-20 MDYV 0.00347
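A quick way to see which date-ticker combinations are duplicated in your actual data is to count them; a minimal sketch, assuming your data frame is called df as in the head()/tail() output above:
library(dplyr)
# Count rows per (ref.date, ticker) pair and keep the pairs that occur
# more than once -- these are the rows spread() complains about
df %>%
count(ref.date, ticker) %>%
filter(n > 1)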
You can use the dplyr and tidyr packages. To get rid of that error, you first have to sum the values for each group.
data %>%
group_by(date, ticker) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = ticker, values_from = value)
# date MSFT SPY
# <fct> <dbl> <dbl>
# 1 2019 3 1
# 2 2020 4 2
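Alternatively, newer versions of tidyr let pivot_wider() do the aggregation itself via its values_fn argument (the warning message in the reprex further below points at the same option); a minimal sketch, assuming summing is the aggregation you want:
library(tidyr)
# values_fn is applied to each set of duplicated date-ticker values,
# so no prior group_by()/summarise() step is needed
data %>%
pivot_wider(names_from = ticker, values_from = value, values_fn = sum)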
As said in the comments, you have multiple values for the same date-ticker combination, and you need to decide what to do with them.
Here it is with a reprex:
library(tidyr)
library(dplyr)
# your data is more like:
data = data.frame(
date = c(2019, rep(c("2019", "2020"), 2)),
ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
value = c(8, 1, 2, 3, 4))
# With two values for same date-ticker combination
data
#> date ticker value
#> 1 2019 SPY 8
#> 2 2019 SPY 1
#> 3 2020 SPY 2
#> 4 2019 MSFT 3
#> 5 2020 MSFT 4
# Results in error
data %>%
spread(ticker, value)
#> Error: Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 1, 2
# The new pivot_wider() creates list-columns for duplicates
data %>%
pivot_wider(names_from = ticker, values_from = value)
#> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(value = list)` to suppress this warning.
#> * Use `values_fn = list(value = length)` to identify where the duplicates arise
#> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
#> # A tibble: 2 x 3
#> date SPY MSFT
#> <fct> <list> <list>
#> 1 2019 <dbl [2]> <dbl [1]>
#> 2 2020 <dbl [1]> <dbl [1]>
# Otherwise, decide for yourself how to summarise duplicates, with mean() for instance
data %>%
group_by(date, ticker) %>%
summarise(value = mean(value, na.rm = TRUE)) %>%
spread(ticker, value)
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> date MSFT SPY
#> <fct> <dbl> <dbl>
#> 1 2019 3 4.5
#> 2 2020 4 2
Created on 2020-02-22 by the reprex package (v0.3.0)
I'm working with a dataset where I have the date that a given value (weight) was collected, and then the weight for that date. Some participants have multiple weights in the dataset because they have come back more than once; others have only one weight value. Is there an easy way to ask R to provide a new data frame with one value per person, based on the earliest date? (By default, those with only one value would be included.)
I'm wondering if it would be advantageous to group by subject ID and take their mean weight value (as I don't anticipate it will fluctuate drastically). But to be consistent, grouping based on the earliest/first weight recorded would be ideal.
I'm thinking a function in the 'lubridate' package might be useful, but I'm not 100% sure.
Sort by date, group by id, then take the first row per group:
library(dplyr)
weights %>%
arrange(date) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
#> # A tibble: 3 × 3
#> id date weight
#> <int> <date> <dbl>
#> 1 1 2021-03-15 182.
#> 2 2 2021-05-12 133.
#> 3 3 2021-08-09 151.
Example data:
set.seed(13)
weights <- tibble::tibble(
id = rep(1:3, each = 3),
date = lubridate::ymd("2021-01-01") + sample(0:364, 9),
weight = rnorm(9, 160, 20)
)
weights
#> # A tibble: 9 × 3
#> id date weight
#> <int> <date> <dbl>
#> 1 1 2021-09-16 165.
#> 2 1 2021-12-23 153.
#> 3 1 2021-03-15 182.
#> 4 2 2021-07-24 138.
#> 5 2 2021-09-19 169.
#> 6 2 2021-05-12 133.
#> 7 3 2021-11-16 123.
#> 8 3 2021-08-09 151.
#> 9 3 2021-09-05 156.
Created on 2022-11-11 with reprex v2.0.2
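An alternative that skips the separate arrange() step is slice_min() (assuming dplyr 1.0 or later); with_ties = FALSE keeps exactly one row per id even if two weights share the earliest date:
library(dplyr)
# Keep the row with the smallest date within each id
weights %>%
group_by(id) %>%
slice_min(date, n = 1, with_ties = FALSE) %>%
ungroup()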
I have a table like this:
ID    Date         Status
101   2020-09-14   1
102   2020-09-14   1
103   2020-09-14   1
104   2020-09-14   2
105   2020-09-14   2
106   2020-09-14   2
But I want a table like this:
Status   ID            Date
1        101,102,103   2020-09-14, 2020-09-14, 2020-09-14
2        104,105,106   2020-09-14, 2020-09-14, 2020-09-14
Code that I'm currently using:
Note: Date is in yyyy-mm-dd format before running the code.
g1 <- df1 %>%
mutate(Date = as.Date(Date, format = '%Y%m%d')) %>%
group_by(status) %>%
summarise_at(c("ID", "Date"), list)
This seems to work, except that the date in the new table is not in yyyy-mm-dd format. For example, 2021-06-10 is converted to 18788.
A possible solution:
library(tidyverse)
df %>%
group_by(Status) %>%
summarise(ID = str_c(ID, collapse = ","), Date = str_c(Date, collapse = ","))
#> # A tibble: 2 × 3
#> Status ID Date
#> <int> <chr> <chr>
#> 1 1 101,102,103 2020-09-14,2020-09-14,2020-09-14
#> 2 2 104,105,106 2020-09-14,2020-09-14,2020-09-14
A more succinct alternative:
library(tidyverse)
df %>%
group_by(Status) %>%
summarise(across(c(ID, Date), str_c, collapse = ","))
#> # A tibble: 2 × 3
#> Status ID Date
#> <int> <chr> <chr>
#> 1 1 101,102,103 2020-09-14,2020-09-14,2020-09-14
#> 2 2 104,105,106 2020-09-14,2020-09-14,2020-09-14
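As for the 18788 in the original attempt: that is a Date shown as its underlying numeric value, the number of days since 1970-01-01 (2021-06-10 is day 18788), which appears whenever the Date values get coerced to numeric along the way. Converting the dates to character before collapsing avoids this; a minimal sketch along the lines of the answers above (the explicit as.character() is just to make the conversion obvious, since str_c() already does it for Date input):
library(dplyr)
library(stringr)
# Turn the Date values into "yyyy-mm-dd" strings before pasting them together
df %>%
group_by(Status) %>%
summarise(across(c(ID, Date), ~ str_c(as.character(.x), collapse = ",")))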
I have a data frame that represents policies with start and end dates. I'm trying to tally the count of policies that are active each month.
library(tidyverse)
ayear <- 2021
amonth <- 10
months <- 12
df <- tibble(
pol = c(1, 2, 3, 4)
, bdate = c('2021-02-23', '2019-12-03', '2020-08-11', '2020-12-14')
, edate = c('2022-02-23', '2020-12-03', '2021-08-11', '2021-06-14')
)
These four policies have a begin date (bdate) and an end date (edate). Beginning in October (amonth) 2021 (ayear) and going back 12 months (months), I'm trying to generate a count of how many of the 4 policies were active at some point in each month. The data frame I'm trying to generate would have three columns (month, year, and active_pol_count) and 12 rows, one per month.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- tibble(
pol = c(1, 2, 3, 4),
bdate = c("2021-02-23", "2019-12-03", "2020-08-11", "2020-12-14"),
edate = c("2022-02-23", "2020-12-03", "2021-08-11", "2021-06-14")
)
# transform start and end dates into an interval
df <- mutate(df, interval = interval(bdate, edate))
# for the first date of each month from 2020-10 through 2021-09
seq(as.Date("2020-10-01"), as.Date("2021-09-01"), by = "months") %>%
tibble(date = .) %>%
mutate(
year = year(date),
month = month(date),
active_pol_count = date %>% map_dbl(~ .x %within% df$interval %>% sum()),
)
#> # A tibble: 12 x 4
#> date year month active_pol_count
#> <date> <dbl> <dbl> <dbl>
#> 1 2020-10-01 2020 10 2
#> 2 2020-11-01 2020 11 2
#> 3 2020-12-01 2020 12 2
#> 4 2021-01-01 2021 1 2
#> 5 2021-02-01 2021 2 2
#> 6 2021-03-01 2021 3 3
#> 7 2021-04-01 2021 4 3
#> 8 2021-05-01 2021 5 3
#> 9 2021-06-01 2021 6 3
#> 10 2021-07-01 2021 7 2
#> 11 2021-08-01 2021 8 2
#> 12 2021-09-01 2021 9 1
Created on 2021-12-13 by the reprex package (v2.0.1)
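Note that this counts a policy as active in a month if it is active on the first day of that month. If "active at some point in the month" should also catch policies that begin after the 1st and end before the next month starts, you can instead test whether the policy interval overlaps the month at all. A minimal sketch using lubridate's int_overlaps(), reusing df$interval from above (treating a month as its first day through its last day is my assumption):
library(tidyverse)
library(lubridate)
seq(as.Date("2020-10-01"), as.Date("2021-09-01"), by = "months") %>%
tibble(month_start = .) %>%
mutate(
year = year(month_start),
month = month(month_start),
# count policies whose interval overlaps any part of the month
active_pol_count = map_dbl(month_start, function(d) {
month_interval <- interval(d, d + months(1) - days(1))
sum(int_overlaps(month_interval, df$interval))
})
)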
I am working in R, but I don't really know how to extract a series of fields from a number. For example, from the number 20102168056 I want to subdivide it like this:
2010 -> year
2 -> semester
168 -> university career
056 -> unique number
I tried to do it with an if, but I kept getting more errors. I am new to this and would like to know if you can help me (by the way, it needs to work for any number, such as 20211888070, which is why the if approach I tried did not work out).
You can use tidyr::separate.
library(tidyverse)
df <- tibble(original = c(20102168056, 20141152013, 20182008006))
df %>%
separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8))
# A tibble: 3 × 4
year semester university_career unique_number
<chr> <chr> <chr> <chr>
1 2010 2 168 056
2 2014 1 152 013
3 2018 2 008 006
You may want to convert some of the columns to an integer:
df %>%
separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8)) %>%
mutate(across(year:unique_number, as.integer))
# A tibble: 3 × 4
year semester university_career unique_number
<int> <int> <int> <int>
1 2010 2 168 56
2 2014 1 152 13
3 2018 2 8 6
We can use stringr::str_match().
library(tidyverse)
data <- c(20102168056, 20102168356)
str_match(data, '^(\\d{4})(\\d{1})(\\d{3})(\\d{3})') %>%
as.data.frame() %>%
set_names(c('value', 'year', 'semester', 'university_career', 'unique_number'))
#> value year semester university_career unique_number
#> 1 20102168056 2010 2 168 056
#> 2 20102168356 2010 2 168 356
Created on 2021-12-08 by the reprex package (v2.0.1)
You can use the substr() function if you first convert the number to a character string with as.character().
test <- '20102168056'
data <- list()
data$year <- substr(test, 1, 4)
data$semester <- substr(test, 5, 5)
data$uni_career <- substr(test, 6, 8)
data$unique_num <- substr(test, 9, 11)
print(data)
#> $year
#> [1] "2010"
#>
#> $semester
#> [1] "2"
#>
#> $uni_career
#> [1] "168"
#>
#> $unique_num
#> [1] "056"
Created on 2021-12-08 by the reprex package (v2.0.1)
I am trying to replicate the tidyr::complete() function in sparklyr. I have a data frame with some missing rows that I need to fill in. In dplyr/tidyr I can do:
data <- tibble(
"id" = c(1,1,2,2),
"dates" = c("2020-01-01", "2020-01-03", "2020-01-01", "2020-01-03"),
"values" = c(3,4,7,8))
# A tibble: 4 x 3
id dates values
<dbl> <chr> <dbl>
1 1 2020-01-01 3
2 1 2020-01-03 4
3 2 2020-01-01 7
4 2 2020-01-03 8
data %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
# A tibble: 6 x 3
# Groups: id [2]
id dates values
<dbl> <date> <dbl>
1 1 2020-01-01 3
2 1 2020-01-02 NA
3 1 2020-01-03 4
4 2 2020-01-01 7
5 2 2020-01-02 NA
6 2 2020-01-03 8
However, complete() has no method for Spark tables.
data_spark %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
Error in UseMethod("complete_") :
no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
Is there a way to set a UDF or to achieve a similar result?
Thank you
Under the hood, tidyr::complete() just performs a full join followed by an optional NA fill. You can replicate its effect by using sdf_copy_to() to create a new sdf that is just a single column of dates (seq.Date() between your start and end date), and then performing a full_join() between that and your dataset.
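A minimal sketch of that idea, assuming the sc connection and the data_spark table from the question, and hard-coding the 2020-01-01 to 2020-01-03 range from the question's example (the join_by helper column is just a way to express the cross join of dates with ids):
library(sparklyr)
library(dplyr)
# Build the full date sequence locally and copy it into Spark
all_dates <- data.frame(
dates = as.character(seq.Date(as.Date("2020-01-01"), as.Date("2020-01-03"), by = "day"))
)
all_dates_sdf <- sdf_copy_to(sc, all_dates, name = "all_dates", overwrite = TRUE)
# Cross the dates with every id, then full-join the original data back on;
# combinations that were missing come back with NA in `values`
data_spark %>%
distinct(id) %>%
mutate(join_by = TRUE) %>%
full_join(all_dates_sdf %>% mutate(join_by = TRUE), by = "join_by") %>%
select(id, dates) %>%
full_join(data_spark, by = c("id", "dates")) %>%
arrange(id, dates)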
Here's a method that does all of the work in Spark.
library(sparklyr)
sc <- spark_connect(master = "local")
data <- tibble(
id = c(1, 1, 2, 2),
dates = c("2020-01-02", "2020-01-04", "2020-01-01", "2020-01-03"),
values = c(1, 2, 3, 4)
)
data_spark <- copy_to(sc, data)
We need to generate all combinations of dates and id. To do this, we need to know the total number of days and the first date.
days_info <-
data_spark %>%
summarise(
first_date = min(dates),
total_days = datediff(max(dates), min(dates))
) %>%
collect()
days_info
#> # A tibble: 1 x 2
#> first_date total_days
#> <chr> <int>
#> 1 2020-01-01 3
sdf_seq() can be used to generate a sequence of integers in Spark; combined with the distinct ids, this gives all combinations of dates and id.
dates_id_combinations <-
sdf_seq(
sc,
from = 0,
to = days_info$total_days,
repartition = 1
) %>%
transmute(
dates = date_add(local(days_info$first_date), id),
join_by = TRUE
) %>%
full_join(data_spark %>% distinct(id) %>% mutate(join_by = TRUE)) %>%
select(dates, id)
dates_id_combinations
#> # Source: spark<?> [?? x 2]
#> dates id
#> <date> <dbl>
#> 1 2020-01-01 1
#> 2 2020-01-01 2
#> 3 2020-01-02 1
#> 4 2020-01-02 2
#> 5 2020-01-03 1
#> 6 2020-01-03 2
#> 7 2020-01-04 1
#> 8 2020-01-04 2
full_join() the original data frame and the combination data frame, then filter based on the min/max date for each group.
data_spark %>%
group_by(id) %>%
mutate(first_date = min(dates), last_date = max(dates)) %>%
full_join(dates_id_combinations) %>%
filter(dates >= min(first_date), dates <= max(last_date)) %>%
arrange(id, dates) %>%
select(id, dates)
#> # Source: spark<?> [?? x 2]
#> # Groups: id
#> # Ordered by: id, dates
#> id dates
#> <dbl> <chr>
#> 1 1 2020-01-02
#> 2 1 2020-01-03
#> 3 1 2020-01-04
#> 4 2 2020-01-01
#> 5 2 2020-01-02
#> 6 2 2020-01-03