Create an index column based on a grouping column in R

I want to create a column sequenced like a rank, but based on group membership rather than numerical values, as in the example below.
a <- rep(letters[1:3], each = 3)
b <- round(rnorm(9, 5, 1), 2)
tempdf <- data.frame(a, b)  # note: data.frame(cbind(a, b)) would coerce b to character
tempdf
#> a b
#> 1 a 5.58
#> 2 a 3.68
#> 3 a 5.12
#> 4 b 3.28
#> 5 b 4.97
#> 6 b 6.57
#> 7 c 5.92
#> 8 c 5.25
#> 9 c 3.02
tempdf["c"] <- rep(1:3, each = 3)
tempdf
#> a b c
#> 1 a 5.58 1
#> 2 a 3.68 1
#> 3 a 5.12 1
#> 4 b 3.28 2
#> 5 b 4.97 2
#> 6 b 6.57 2
#> 7 c 5.92 3
#> 8 c 5.25 3
#> 9 c 3.02 3
Created on 2021-04-09 by the reprex package (v1.0.0)
My data actually looks more like this. I want to create an index of week number over multiple years. Please suggest better ways to do it.
library(dplyr)
library(lubridate)
a <- seq.Date(as.Date("2021-01-01"), as.Date("2021-02-28"), by = "1 day")
b <- round(rnorm(59, 5, 1), 2)
tempdf <- cbind.data.frame(a, b)
tempdf <- tempdf %>%
  mutate(weeks = week(a),
         month = month(a),
         year = year(a)) %>%
  # mutate(ymw = 10000*year + 100*month + weeks) %>%
  mutate(ymw = paste0(year, month, weeks))
tempdf
#> a b weeks month year ymw
#> 1 2021-01-01 6.78 1 1 2021 202111
#> 2 2021-01-02 4.17 1 1 2021 202111
#> 3 2021-01-03 5.65 1 1 2021 202111
#> 4 2021-01-04 5.20 1 1 2021 202111
#> 5 2021-01-05 4.55 1 1 2021 202111
#> 6 2021-01-06 5.07 1 1 2021 202111
#> 7 2021-01-07 6.29 1 1 2021 202111
#> 8 2021-01-08 6.01 2 1 2021 202112
#> 9 2021-01-09 4.45 2 1 2021 202112
#> 10 2021-01-10 5.35 2 1 2021 202112
#> 11 2021-01-11 5.10 2 1 2021 202112
#> 12 2021-01-12 4.34 2 1 2021 202112
#> 13 2021-01-13 4.47 2 1 2021 202112
#> 14 2021-01-14 6.03 2 1 2021 202112
#> 15 2021-01-15 6.55 3 1 2021 202113
#> 16 2021-01-16 5.60 3 1 2021 202113
#> 17 2021-01-17 5.54 3 1 2021 202113
Created on 2021-04-09 by the reprex package (v1.0.0)

I can think of two options, depending on what you ultimately intend to do with the column.
tempdf %>%
  mutate(
    weekIndex_1 = year + weeks/100,
    weekIndex_2 = floor(as.numeric(a)/7)
  )
#> a b weeks month year ymw weekIndex_1 weekIndex_2
#> 1 2021-01-01 7.30 1 1 2021 202111 2021.01 2661
#> 2 2021-01-02 4.53 1 1 2021 202111 2021.01 2661
#> 3 2021-01-03 5.21 1 1 2021 202111 2021.01 2661
#> 4 2021-01-04 6.74 1 1 2021 202111 2021.01 2661
#> 5 2021-01-05 4.53 1 1 2021 202111 2021.01 2661
#> 6 2021-01-06 5.56 1 1 2021 202111 2021.01 2661
#> 7 2021-01-07 5.09 1 1 2021 202111 2021.01 2662
#> 8 2021-01-08 4.82 2 1 2021 202112 2021.02 2662
#> 9 2021-01-09 5.65 2 1 2021 202112 2021.02 2662
#> 10 2021-01-10 4.46 2 1 2021 202112 2021.02 2662
Both will allow you to sort on the index. The difference is that weekIndex_1 tracks the year and resets the week number when the year changes, in a sense using semantic versioning for the date. This is very similar to what you did with the ymw column, though. With weekIndex_2 you are essentially counting weeks since the origin (1970-01-01), which accounts for the fact that years aren't exactly 52 weeks long. You get the sequential order but lose a bit of the year context; since you already have both pieces in other columns (weeks and year), perhaps that isn't important.
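If what you ultimately want is a single sequential, sortable week counter, a third option is to dense-rank a year-week key. This is a minimal sketch, not part of the original answer; the isoyear()/isoweek() pair and the weekIndex_3 name are my assumptions about how you want year boundaries handled:
library(dplyr)
library(lubridate)
tempdf %>%
  mutate(
    # dense_rank() over a sortable year-week key yields consecutive
    # integers (1, 2, 3, ...) that keep counting across year boundaries
    weekIndex_3 = dense_rank(isoyear(a) * 100 + isoweek(a))
  )
Unlike the concatenated ymw string, this index is numeric and strictly consecutive, so weekIndex_3 + 1 always means "the next week present in the data".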


pivot_wider results in list columns instead of expected results

I'm just going to chalk this up to my ignorance, but sometimes the pivot_* functions drive me crazy.
I have a tibble:
# A tibble: 12 x 3
year term estimate
<dbl> <chr> <dbl>
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0
that I would like to flip to:
# A tibble: 6 x 3
year intercept daysuntilelection
<dbl> <dbl> <dbl>
1 2018 -29.8 8.27
2 2019 -50.6 7.40
3 2020 -31.6 6.55
4 2021 -19.0 4.60
5 2022 -10.7 6.41
6 2023 120 0
Normally pivot_wider should be able to do this as x %>% pivot_wider(!year, names_from = "term", values_from = "estimate"), but instead it returns a two-column tibble of list columns and a warning.
# A tibble: 1 x 2
intercept daysuntilelection
<list> <list>
1 <dbl [6]> <dbl [6]>
Warning message:
Values from `estimate` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(term) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
Where do I go wrong here? Help!
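The most likely culprit is the !year argument: the first argument after the data in pivot_wider() is id_cols, so !year selects term and estimate as identifiers; both are then consumed by names_from/values_from, leaving no id columns at all, which collapses every row into one and produces the list-cols. A minimal sketch of what was presumably intended:
library(tidyr)
x %>%
  pivot_wider(names_from = term, values_from = estimate)
# equivalently: pivot_wider(x, id_cols = year, names_from = term, values_from = estimate)
With year left as the only remaining column, it becomes the id column by default and the desired 6 x 3 result falls out directly.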
Next to the solutions offered in the comments, data.table's dcast is a very fast way to pivot your data. If the pivot_ functions drive you crazy, maybe this is a nice alternative for you:
x <- read.table(text = "
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0")
names(x) <- c("id", "year", "term", "estimate")
library(data.table)
dcast(as.data.table(x), year ~ term)
#> Using 'estimate' as value column. Use 'value.var' to override
#> year daysuntilelection intercept
#> 1: 2018 8.27 -29.8
#> 2: 2019 7.40 -50.6
#> 3: 2020 6.55 -31.6
#> 4: 2021 4.60 -19.0
#> 5: 2022 6.41 -10.7
#> 6: 2023 0.00 120.0
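As the printed message itself suggests, you can pass value.var explicitly to silence it:
dcast(as.data.table(x), year ~ term, value.var = "estimate")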
DATA
df <- read.table(text = "
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0")
CODE
library(tidyverse)
df %>%
  pivot_wider(names_from = V3, values_from = V4, values_fill = 0) %>%
  group_by(V2) %>%
  summarise_all(sum, na.rm = TRUE)
OUTPUT
V2 V1 intercept daysuntilelection
<int> <int> <dbl> <dbl>
1 2018 3 -29.8 8.27
2 2019 7 -50.6 7.4
3 2020 11 -31.6 6.55
4 2021 15 -19 4.6
5 2022 19 -10.7 6.41
6 2023 23 120 0
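As an aside, summarise_all() is superseded in current dplyr. A sketch of the equivalent using across(), keeping the same V-prefixed default names from read.table:
df %>%
  pivot_wider(names_from = V3, values_from = V4, values_fill = 0) %>%
  group_by(V2) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))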

How to apply round() to odd or even rows only in R

Assume my original data frame is:
a b d e
1 1 1 2 1
2 20 30 40 30
3 1 2 6 2
4 40 50 40 50
5 5 5 3 5
6 60 60 60 60
I want to add a percentage row below each row.
a b d e
1 1.00 1.00 2.00 1.00
2 0.79 0.66 1.57 0.66
3 20.00 30.00 40.00 30.00
4 13.51 20.27 27.03 20.27
5 1.00 2.00 6.00 2.00
6 0.66 1.57 3.97 1.57
7 40.00 50.00 40.00 50.00
8 27.03 33.78 27.03 33.78
9 5.00 5.00 3.00 5.00
10 3.94 3.31 2.36 3.31
11 60.00 60.00 60.00 60.00
12 40.54 40.54 40.54 40.54
But as you see, my odd rows get .00, which I do not want.
library(dplyr)
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))
df <- df %>% slice(rep(1:n(), each = 2))
even <- seq_len(nrow(df)) %% 2 == 0
df[even, ] <- round(100 * df[even, ] / colSums(df[even, ]), 2)
How can I keep my odd rows without decimals?
The problem is that columns in data frames can only hold one type of data. If some of the columns in your data frame have decimals, then the whole column must be of type double. The only way to change how your data frame appears is via its print method.
Fortunately, you can easily turn your data frame into a tibble. This is a type of data frame, but prints in such a way that the integers don't have decimal points afterwards.
df
#> a b d e
#> 1 1.00 1.00 2.00 1.00
#> 2 0.79 0.66 1.57 0.66
#> 3 20.00 30.00 40.00 30.00
#> 4 13.51 20.27 27.03 20.27
#> 5 1.00 2.00 6.00 2.00
#> 6 0.66 1.57 3.97 1.57
#> 7 40.00 50.00 40.00 50.00
#> 8 27.03 33.78 27.03 33.78
#> 9 5.00 5.00 3.00 5.00
#> 10 3.94 3.31 2.36 3.31
#> 11 60.00 60.00 60.00 60.00
#> 12 40.54 40.54 40.54 40.54
dplyr::as_tibble(df)
#> # A tibble: 12 x 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1
#> 2 0.79 0.66 1.57 0.66
#> 3 20 30 40 30
#> 4 13.5 20.3 27.0 20.3
#> 5 1 2 6 2
#> 6 0.66 1.57 3.97 1.57
#> 7 40 50 40 50
#> 8 27.0 33.8 27.0 33.8
#> 9 5 5 3 5
#> 10 3.94 3.31 2.36 3.31
#> 11 60 60 60 60
#> 12 40.5 40.5 40.5 40.5
Created on 2022-04-26 by the reprex package (v2.0.1)
Allan Cameron is right that a tibble prints better and does what you want. To offer another solution, though, if you're trying to print something that you might send to a text file (rather than just look at on the screen), you could format the values as character strings as follows:
library(dplyr)
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))
df %>%
  mutate(obs = row_number(),
         across(-obs, ~ .x/sum(.x)),
         type = "pct") %>%
  bind_rows(df %>% mutate(obs = row_number(),
                          type = "raw")) %>%
  mutate(type = factor(type, levels = c("raw", "pct"))) %>%
  arrange(obs, type) %>%
  mutate(across(a:e, ~ case_when(
    type == "raw" ~ sprintf("%.0f", .x),
    TRUE ~ sprintf("%.2f%%", .x*100)))) %>%
  select(-c(obs, type))
#> a b d e
#> 1 1 1 2 1
#> 2 0.79% 0.68% 1.32% 0.68%
#> 3 20 30 40 30
#> 4 15.75% 20.27% 26.49% 20.27%
#> 5 1 2 6 2
#> 6 0.79% 1.35% 3.97% 1.35%
#> 7 40 50 40 50
#> 8 31.50% 33.78% 26.49% 33.78%
#> 9 5 5 3 5
#> 10 3.94% 3.38% 1.99% 3.38%
#> 11 60 60 60 60
#> 12 47.24% 40.54% 39.74% 40.54%
Created on 2022-04-26 by the reprex package (v2.0.1)
Also note that I think the percentages you calculated are wrong. Using your data, I get:
sum(df$a[c(2, 4, 6, 8, 10, 12)])
#> [1] 86.47
Using mine, which differ from yours, I get 100 (if we turn them back into numbers from strings).
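For reference, here is a sketch (not part of either answer above) of column-wise percentages that are guaranteed to sum to 100, using base R's prop.table():
# each column of pct sums to 100 (up to rounding)
pct <- round(100 * prop.table(as.matrix(df), margin = 2), 2)
colSums(pct)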

Filter in dplyr interval of dates

I have the following simulated dataset in R:
library(tidyverse)
A <- seq(from = as.Date("2021/1/1"), to = as.Date("2022/1/1"), length.out = 252)
length(A)
x <- rnorm(252)
d <- tibble(A, x); d
that looks like:
# A tibble: 252 × 2
A x
<date> <dbl>
1 2021-01-01 0.445
2 2021-01-02 -0.793
3 2021-01-03 -0.367
4 2021-01-05 1.64
5 2021-01-06 -1.15
6 2021-01-08 0.276
7 2021-01-09 1.09
8 2021-01-11 0.443
9 2021-01-12 -0.378
10 2021-01-14 0.203
# … with 242 more rows
It is one year of 252 trading days. Let's say I have a date of interest:
start <- as.Date("2021-05-23"); start
I want to filter the data set so that the result is a new data set starting from this date and containing the next 20 index dates (trading days), NOT simple calendar days, and then to find how many rows the new data set contains.
For example, from the starting date onward I have:
d1 <- d %>%
  dplyr::filter(A > start) %>%
  dplyr::summarise(n())
d1
# A tibble: 1 × 1
`n()`
<int>
1 98
but I want only the next 20 trading days from the starting date onward. How can I do that? Any help?
Perhaps a brute-force attempt:
d %>%
  filter(between(A, start, max(head(sort(A[A > start]), 20))))
# # A tibble: 20 x 2
# A x
# <date> <dbl>
# 1 2021-05-23 -0.185
# 2 2021-05-24 0.102
# 3 2021-05-26 0.429
# 4 2021-05-27 -1.21
# 5 2021-05-29 0.260
# 6 2021-05-30 0.479
# 7 2021-06-01 -0.623
# 8 2021-06-02 0.982
# 9 2021-06-04 -0.0533
# 10 2021-06-05 1.08
# 11 2021-06-07 -1.96
# 12 2021-06-08 -0.613
# 13 2021-06-09 -0.267
# 14 2021-06-11 -0.284
# 15 2021-06-12 0.0851
# 16 2021-06-14 0.355
# 17 2021-06-15 -0.635
# 18 2021-06-17 -0.606
# 19 2021-06-18 -0.485
# 20 2021-06-20 0.255
If you have duplicate dates, you may prefer to use head(sort(unique(A[A > start])), 20), depending on what "20 index dates" means.
And to find the number of indices, you can summarise or count as needed.
You could first sort by date, filter for days greater than the given date, and then take the top 20 records:
d1 <- d %>%
  arrange(A) %>%
  filter(A > start) %>%
  head(20)
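And if the final goal is just the count, slice_head() is the pipe-friendly sibling of head(); a minimal sketch:
d %>%
  arrange(A) %>%
  filter(A > start) %>%
  slice_head(n = 20) %>%
  count()   # or nrow() on the filtered result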

Combining a loop with stacking dataframes created by a function

I'm doing some analysis with the BaseballR package and want to be able to combine dataframes by using a loop.
For example, the following code using the standings_on_date_bref function gives me a table of division standings for the specified day (plus manually adding a column for the date of those standings):
library("baseballr")
library("dplyr")
standings_on_date_bref(date = "04-28-2021", division = "NL West") %>%
mutate(date = "04-28-2021")
Tm    W-L%    date
SFG   0.640   04-28-2021
LAD   0.640   04-28-2021
SDP   0.538   04-28-2021
ARI   0.500   04-28-2021
COL   0.375   04-28-2021
However, I'm interested in getting the standings for a whole range of days (which would end up being a data frame with rows = 5 teams * x number of days). For example, for 04-28-2021 to 04-29-2021, I'm hoping it would look something like this:
Tm    W-L%    date
SFG   0.640   04-28-2021
LAD   0.640   04-28-2021
SDP   0.538   04-28-2021
ARI   0.500   04-28-2021
COL   0.375   04-28-2021
SFG   0.640   04-29-2021
LAD   0.615   04-29-2021
SDP   0.538   04-29-2021
ARI   0.520   04-29-2021
COL   0.360   04-29-2021
I have tried to do so by implementing some sort of loop. This is what I've come up with so far, but in the end it just gives me the standings for the end date.
start <- as.Date("04-01-21", format = "%m-%d-%y")
end <- as.Date("04-03-21", format = "%m-%d-%y")
theDate <- start
while (theDate <= end) {
  all_standings <- standings_on_date_bref(date = theDate, division = "NL West") %>%
    mutate(date = theDate)
  theDate <- theDate + 1
}
Your while loop overwrites all_standings on every iteration instead of appending to it, which is why only the end date survives. You can try purrr, which handles the row-binding quite nicely with its map_dfr function:
library(baseballr)
library(dplyr)
library(purrr)
date_seq <- seq(as.Date("04-01-21", format = "%m-%d-%y"),
                as.Date("04-03-21", format = "%m-%d-%y"), by = "1 day")
map_dfr(.x = date_seq,
        .f = function(x) {
          standings_on_date_bref(date = x, division = "NL West") %>%
            mutate(date = x)
        })
#> # A tibble: 15 x 9
#> Tm W L `W-L%` GB RS RA `pythW-L%` date
#> <chr> <int> <int> <dbl> <chr> <int> <int> <dbl> <date>
#> 1 SDP 1 0 1 -- 8 7 0.561 2021-04-01
#> 2 COL 1 0 1 -- 8 5 0.703 2021-04-01
#> 3 ARI 0 1 0 1.0 7 8 0.439 2021-04-01
#> 4 SFG 0 1 0 1.0 7 8 0.439 2021-04-01
#> 5 LAD 0 1 0 1.0 5 8 0.297 2021-04-01
#> 6 SDP 2 0 1 -- 12 9 0.629 2021-04-02
#> 7 COL 1 1 0.5 1.0 14 16 0.439 2021-04-02
#> 8 SFG 1 1 0.5 1.0 13 11 0.576 2021-04-02
#> 9 LAD 1 1 0.5 1.0 16 14 0.561 2021-04-02
#> 10 ARI 0 2 0 2.0 9 12 0.371 2021-04-02
#> 11 SDP 3 0 1 -- 19 9 0.797 2021-04-03
#> 12 LAD 2 1 0.667 1.0 22 19 0.567 2021-04-03
#> 13 COL 1 2 0.333 2.0 19 22 0.433 2021-04-03
#> 14 SFG 1 2 0.333 2.0 13 15 0.435 2021-04-03
#> 15 ARI 0 3 0 3.0 9 19 0.203 2021-04-03
Created on 2022-01-02 by the reprex package (v2.0.1)
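If you would rather keep the while loop, the minimal fix is to accumulate each day's result instead of overwriting it. A sketch (the pieces list is my own scaffolding, not part of the original code):
library(baseballr)
library(dplyr)
start <- as.Date("04-01-21", format = "%m-%d-%y")
end <- as.Date("04-03-21", format = "%m-%d-%y")
pieces <- list()
theDate <- start
while (theDate <= end) {
  # store this day's standings under its date instead of replacing the result
  pieces[[as.character(theDate)]] <-
    standings_on_date_bref(date = theDate, division = "NL West") %>%
    mutate(date = theDate)
  theDate <- theDate + 1
}
all_standings <- bind_rows(pieces)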

How to find both rows associated with a string in an R dataframe and subtract their mutual column values

In R, I have a dataframe that looks like this:
sample value gene tag isPTV
1 1120 3.4 arx1 1120|arx1 0
2 2123 2.3 mnf2 2123|mnf2 0
3 1129 1.9 trf4 1129|trf4 0
4 2198 0.2 brc1 2198|brc1 0
5 1120 2.1 arx1 1120|arx1 1
6 2123 0.4 mnf2 2123|mnf2 1
7 1129 1.2 trf4 1129|trf4 1
8 2198 0.9 brc1 2198|brc1 1
Such that 0 means false and 1 means true. What I'm ultimately trying to do is create a data frame that, for each tag, finds the absolute difference between the value numbers.
For instance, 1129|trf4 occurs in two separate rows. There's a value for when it isPTV and one for when it is not, so the absolute difference would be 1.9 - 1.2 = 0.7.
I started out by trying to write a function to do this for a given tag, such that it would return both rows containing the tag:
getExprValue <- function(dataframe, tag){
  return(dataframe[tag, ])
}
But this is not working, and I'm not very familiar with how you index dataframes in R.
What is the right way to do this?
UPDATE:
Solution 1 Attempt:
m_diff <- m %>% group_by(tag) %>% mutate(absDiff = abs(diff(value)))
Response:
Error in mutate_impl(.data, dots) : Column `absDiff` must be length 1 (the group size), not 0
Solution 2 Attempt:
with(df1, abs(ave(value, tag, FUN = diff)))
Response:
Error in x[i] <- value[[j]] : replacement has length zero
Edit: I just noticed that @akrun had a much simpler solution.
Create data with a structure similar to yours:
library(tidyverse)
dat <- tibble(
  sample = rep(sample(1000:3000, 10), 2),
  value = rnorm(20, 5, 1),
  gene = rep(letters[1:10], 2),
  tag = paste(sample, gene, sep = "|"),
  isPTV = rep(0:1, each = 10)
)
dat
dat
#> # A tibble: 20 x 5
#> sample value gene tag isPTV
#> <int> <dbl> <chr> <chr> <int>
#> 1 2149 5.90 a 2149|a 0
#> 2 1027 5.46 b 1027|b 0
#> 3 1103 5.65 c 1103|c 0
#> 4 1884 4.86 d 1884|d 0
#> 5 2773 5.58 e 2773|e 0
#> 6 2948 6.98 f 2948|f 0
#> 7 2478 5.17 g 2478|g 0
#> 8 2724 6.71 h 2724|h 0
#> 9 1927 5.06 i 1927|i 0
#> 10 1081 4.39 j 1081|j 0
#> 11 2149 4.60 a 2149|a 1
#> 12 1027 2.97 b 1027|b 1
#> 13 1103 6.17 c 1103|c 1
#> 14 1884 5.83 d 1884|d 1
#> 15 2773 4.23 e 2773|e 1
#> 16 2948 6.48 f 2948|f 1
#> 17 2478 5.06 g 2478|g 1
#> 18 2724 5.32 h 2724|h 1
#> 19 1927 7.32 i 1927|i 1
#> 20 1081 4.73 j 1081|j 1
#akrun solution (much better than mine):
dat %>%
  group_by(tag) %>%
  mutate(absDiff = abs(diff(value)))
#> # A tibble: 20 x 6
#> # Groups: tag [10]
#> sample value gene tag isPTV absDiff
#> <int> <dbl> <chr> <chr> <int> <dbl>
#> 1 2149 5.90 a 2149|a 0 1.30
#> 2 1027 5.46 b 1027|b 0 2.49
#> 3 1103 5.65 c 1103|c 0 0.520
#> 4 1884 4.86 d 1884|d 0 0.974
#> 5 2773 5.58 e 2773|e 0 1.34
#> 6 2948 6.98 f 2948|f 0 0.502
#> 7 2478 5.17 g 2478|g 0 0.114
#> 8 2724 6.71 h 2724|h 0 1.39
#> 9 1927 5.06 i 1927|i 0 2.26
#> 10 1081 4.39 j 1081|j 0 0.337
#> 11 2149 4.60 a 2149|a 1 1.30
#> 12 1027 2.97 b 1027|b 1 2.49
#> 13 1103 6.17 c 1103|c 1 0.520
#> 14 1884 5.83 d 1884|d 1 0.974
#> 15 2773 4.23 e 2773|e 1 1.34
#> 16 2948 6.48 f 2948|f 1 0.502
#> 17 2478 5.06 g 2478|g 1 0.114
#> 18 2724 5.32 h 2724|h 1 1.39
#> 19 1927 7.32 i 1927|i 1 2.26
#> 20 1081 4.73 j 1081|j 1 0.337
My initial suggestion (unnecessarily complicated):
nested <- dat %>%
  group_by(tag) %>%
  nest()
nested %>%
  mutate(difference = map(data, ~ abs(diff(.$value)))) %>%
  select(-data) %>%
  unnest()
#> # A tibble: 10 x 2
#> tag difference
#> <chr> <dbl>
#> 1 2149|a 1.30
#> 2 1027|b 2.49
#> 3 1103|c 0.520
#> 4 1884|d 0.974
#> 5 2773|e 1.34
#> 6 2948|f 0.502
#> 7 2478|g 0.114
#> 8 2724|h 1.39
#> 9 1927|i 2.26
#> 10 1081|j 0.337
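A closing note on the errors in the UPDATE: mutate(absDiff = abs(diff(value))) fails whenever some tag occurs only once, because diff() on a length-1 vector returns a length-0 result. A sketch that guards against single-occurrence tags and returns one row per tag:
dat %>%
  group_by(tag) %>%
  filter(n() == 2) %>%   # keep only tags that appear exactly twice
  summarise(absDiff = abs(diff(value)), .groups = "drop")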
