This is my dataset:
df = structure(list(from = c(0, 0, 0, 0, 38, 43, 49, 54), to = c(43,
54, 56, 62, 62, 62, 62, 62), count = c(342, 181, 194, 386, 200,
480, 214, 176), group = c("keiner", "keiner", "keiner", "keiner",
"paid", "paid", "owned", "earned")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -8L))
My problem is that the columns from and to need to be ranked jointly (one ranking across both columns), because the visualisation library requires this and also needs the index to start at 0.
That's why I built two vectors: one (ranking) with the rank of each unique value across the two columns, the other (uniquevalues) with the original unique values of the dataset.
ranking <- dplyr::dense_rank(unique(c(df$from, df$to))) - 1 ### Start Index at 0, "recode" variables
uniquevalues <- unique(c(df$from, df$to))
Now I have to recode the original dataset. The columns to and from have to receive the values from ranking, according to the corresponding value of uniquevalues.
The only option I came up with was to create a data frame from the two vectors and loop over each row, but I would really like a vectorized solution for this. Can anyone help me?
This:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 43 342 keiner
2 0 54 181 keiner
3 0 56 194 keiner
4 0 62 386 keiner
5 38 62 200 paid
6 43 62 480 paid
7 49 62 214 owned
8 54 62 176 earned
should become this:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 2 342 keiner
2 0 4 181 keiner
3 0 5 194 keiner
4 0 6 386 keiner
5 1 6 200 paid
6 2 6 480 paid
7 3 6 214 owned
8 4 6 176 earned
We could unlist the values and match them with uniquevalues
df[1:2] <- match(unlist(df[1:2]), uniquevalues) - 1
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned
Or using column names instead of index.
df[c("from", "to")] <- match(unlist(df[c("from", "to")]), uniquevalues) - 1
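Note that matching against uniquevalues only yields ranks here because the unique values happen to appear in ascending order in this dataset. A minimal sketch that does not rely on that ordering, using base R only (the helper name zero_rank is hypothetical, not part of any package):

```r
# Hypothetical helper: 0-based dense rank of x against a shared pool of values
zero_rank <- function(x, values) match(x, sort(unique(values))) - 1L

df <- data.frame(from = c(0, 0, 0, 0, 38, 43, 49, 54),
                 to   = c(43, 54, 56, 62, 62, 62, 62, 62))

# Pool both columns so that 'from' and 'to' share one ranking
all_values <- unlist(df[c("from", "to")])
df[c("from", "to")] <- lapply(df[c("from", "to")], zero_rank, values = all_values)
df
```

Sorting the pooled unique values first means the result stays a proper ranking even if the rows are shuffled.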
Another solution converting to factor and back.
f <- unique(unlist(df[1:2]))
df[1:2] <- lapply(df[1:2], function(x) {
as.integer(as.character(factor(x, levels=f, labels=1:length(f) - 1)))
})
df
# # A tibble: 8 x 4
# from to count group
# <int> <int> <dbl> <chr>
# 1 0 2 342 keiner
# 2 0 4 181 keiner
# 3 0 5 194 keiner
# 4 0 6 386 keiner
# 5 1 6 200 paid
# 6 2 6 480 paid
# 7 3 6 214 owned
# 8 4 6 176 earned
I would use the mapvalues() function from plyr, like this:
library(plyr)
df[ , 1:2] <- mapvalues(unlist(df[ , 1:2]),
from= uniquevalues,
to= ranking)
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned
I have a question related to the function tmerge() in the R package survival.
Trying to set up a data set with time-dependent covariates, but the value(s) of the initial time period is set to NA (see reprex below).
I have one data frame with baseline variables, time-, and event data, and a second data frame with variables measured 3 months after baseline.
Have used the same approach as in the PBC-data example in the vignette by Terry Therneau and Co. (or tried at least! https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf). On p. 11 it says:
"The tdc and cumtdc arguments can have 1, 2 or three arguments. The first is always the
time point, the second, if present, is the value to be inserted, and an optional third argument is the initial value. If the tdc call has a single argument the result is always a 0/1 variable, 0 before the time point and 1 after. For the 2 or three argument form, the starting value before the first definition of the new variable (before the first time point) will be the initial value. The default for the initial value is NA, the value of the tdcstart option." Not sure I understand the last bit highlighted in bold.
I do not get the same problem when I replicate the PBC example. I tried to specify init in the second tmerge call and/or the tdcstart option without any success (both generate an error). There are no missing values in the covariates or the outcome (time, event).
Reaching out here, since I cannot find out what I am doing wrong.
Thanks a lot in advance!
PS. This is my first post, so I apologize if I have missed something. I hope it makes sense.
library(tidyverse)
library(survival)
set.seed(123)
# Generate data
df_base <- tibble(
ID = as.numeric(1:100),
time = as.integer(runif(100, min = 100, max = 730)),
status = as.factor(sample(x = c("0", "1"), prob = c(0.7, 0.3), size = 100, replace = T)),
vas = as.integer(rnorm(n = 100, mean = 53, sd = 10)))
df_fu <- tibble(
ID = as.numeric(1:100),
fu_3mo = 91,
vas = as.integer(rnorm(n = 100, mean = 44, sd = 15)))
# Baseline data
head(df_base)
#> # A tibble: 6 x 4
#> ID time status vas
#> <dbl> <int> <fct> <int>
#> 1 1 281 0 45
#> 2 2 596 0 55
#> 3 3 357 0 50
#> 4 4 656 1 49
#> 5 5 692 0 43
#> 6 6 128 1 52
# Follow-up data
head(df_fu)
#> # A tibble: 6 x 3
#> ID fu_3mo vas
#> <dbl> <dbl> <int>
#> 1 1 91 76
#> 2 2 91 63
#> 3 3 91 40
#> 4 4 91 52
#> 5 5 91 37
#> 6 6 91 36
# Generate time-dependent covariates
df_tdc <- tmerge(df_base, df_base, id = ID, surgery = event(time, status))
head(df_tdc)
#> ID time status vas tstart tstop surgery
#> 1 1 281 0 45 0 281 0
#> 2 2 596 0 55 0 596 0
#> 3 3 357 0 50 0 357 0
#> 4 4 656 1 49 0 656 1
#> 5 5 692 0 43 0 692 0
#> 6 6 128 1 52 0 128 1
df_tdc <- tmerge(df_tdc, df_fu, id = ID, vas = tdc(fu_3mo, vas))
#> Warning in tmerge(df_tdc, df_fu, id = ID, vas = tdc(fu_3mo, vas)): replacement
#> of variable 'vas'
head(df_tdc)
#> ID time status vas tstart tstop surgery
#> 1 1 281 0 NA 0 91 0
#> 2 1 281 0 76 91 281 0
#> 3 2 596 0 NA 0 91 0
#> 4 2 596 0 63 91 596 0
#> 5 3 357 0 NA 0 91 0
#> 6 3 357 0 40 91 357 0
Created on 2021-11-26 by the reprex package (v0.3.0)
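One possible explanation, offered as a sketch rather than a definitive answer: in the PBC example the longitudinal data (pbcseq) already contain a day-0 row per subject, so the first interval has a value, whereas df_fu here only has the 3-month measurement. Under that assumption, stacking the baseline vas as a day-0 measurement before calling tdc() should fill the first interval. The names vas_long and day below are made up for illustration, and the data are just the first three subjects of the reprex:

```r
library(survival)

# Small illustrative data (values copied from the first rows of the reprex)
df_base <- data.frame(ID = 1:3, time = c(281, 596, 357),
                      status = c(0, 0, 0), vas = c(45, 55, 50))
df_fu   <- data.frame(ID = 1:3, fu_3mo = 91, vas = c(76, 63, 40))

# Stack baseline (day 0) and follow-up measurements in long format
vas_long <- rbind(
  data.frame(ID = df_base$ID, day = 0,            vas = df_base$vas),
  data.frame(ID = df_fu$ID,   day = df_fu$fu_3mo, vas = df_fu$vas)
)

# Leave vas out of data1 so that tdc() creates it instead of replacing it
df_tdc <- tmerge(df_base[, c("ID", "time", "status")], df_base,
                 id = ID, surgery = event(time, status))
df_tdc <- tmerge(df_tdc, vas_long, id = ID, vas = tdc(day, vas))
df_tdc
```

Dropping vas from the first data set also avoids the "replacement of variable" warning seen above.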
I am trying to use the lagged value of a column, where that value itself has to be calculated from the previous row (unless it's the first entry).
I was trying something similar to:
test<-data.frame(account_id=c(123,123,123,123,444,444,444,444),entry=c(1,2,3,4,1,2,3,4),beginning_balance=c(100,0,0,0,200,0,0,0),
deposit=c(10,20,5,8,10,12,20,4),running_balance=c(0,0,0,0,0,0,0,0))
test2<-test %>%
group_by(account_id) %>%
mutate(running_balance = if_else(entry==1, beginning_balance+deposit,
lag(running_balance)+deposit))
print(test2)
The running balance should be 110, 130, 135, 143, 210, 222, 242, 246.
For each account_id, you can add the first beginning_balance to the cumulative sum of deposit.
library(dplyr)
test %>%
group_by(account_id) %>%
mutate(running_balance = first(beginning_balance) + cumsum(deposit))
# account_id entry beginning_balance deposit running_balance
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 100 10 110
#2 123 2 0 20 130
#3 123 3 0 5 135
#4 123 4 0 8 143
#5 444 1 200 10 210
#6 444 2 0 12 222
#7 444 3 0 20 242
#8 444 4 0 4 246
The same thing using data.table:
library(data.table)
setDT(test)[, running_balance := first(beginning_balance) + cumsum(deposit), account_id]
Using a for loop over each unique account_id and computing the cumulative sum within each id:
for ( i in unique (test$account_id)) {
test$running_balance [test$account_id == i] <- cumsum(test$beginning_balance[test$account_id == i]+test$deposit[test$account_id == i])
}
print (test)
account_id entry beginning_balance deposit running_balance
1 123 1 100 10 110
2 123 2 0 20 130
3 123 3 0 5 135
4 123 4 0 8 143
5 444 1 200 10 210
6 444 2 0 12 222
7 444 3 0 20 242
8 444 4 0 4 246
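The original attempt fails because mutate() evaluates lag(running_balance) on the column as it exists before the call (all zeros), not on values computed earlier in the same call. For comparison, a package-free sketch of the cumulative-sum idea using ave() (the intermediate names opening and deposits are made up for illustration):

```r
test <- data.frame(
  account_id        = c(123, 123, 123, 123, 444, 444, 444, 444),
  entry             = c(1, 2, 3, 4, 1, 2, 3, 4),
  beginning_balance = c(100, 0, 0, 0, 200, 0, 0, 0),
  deposit           = c(10, 20, 5, 8, 10, 12, 20, 4)
)

# First beginning_balance of each account, repeated across that account's rows
opening  <- ave(test$beginning_balance, test$account_id, FUN = function(x) x[1])
# Cumulative deposits within each account
deposits <- ave(test$deposit, test$account_id, FUN = cumsum)

test$running_balance <- opening + deposits
test
```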
Hi and Thanks in advance for any assistance the group can give.
I have a dataset which gives the performance ratings for 7 race horses
over their last 3 races. The performance ratings are DaH1, DaH2 and DaH3 where
DaH1 is the performance rating for the last race etc.
I also have data for the race distances over which the races were run, where the distances are
Dist1, Dist2 and Dist3 and they correspond to the performance ratings. ie. Horse 2 has a
performance rating of 124 for DaH1, with a race distance, Dist1, of 12.
The dataset is:
horse_data <- tibble(
DaH1=c(0, 124, 121, 123, 0, NA, 110),
DaH2=c(124, 117, 125, 120, 125, 0, NA),
DaH3=c(121, 119, 123, 119, NA, 0, 123),
Dist1 =c(10,12,10.3,11,11.5,14,10),
Dist2 =c(10,10.1,12,8,9.5,10.25,8.75),
Dist3 =c(11.5,12.5,9.8,10,10,15,10),
horse =c(1,2,3,4,5,6,7),
)
I am trying to use pivot_longer to convert the data to a better dataset for performing
calculations depending upon race distances.
So far I used this code:
tidyData <- horse_data %>%
pivot_longer(
values_to="Rating",
cols=c(DaH1, DaH2, DaH3),
names_prefix="DaH",
names_to="RaceIdx"
)
To achieve:
> tidyData
# A tibble: 21 x 6
Dist1 Dist2 Dist3 horse RaceIdx Rating
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 10 10 11.5 1 1 0
2 10 10 11.5 1 2 124
3 10 10 11.5 1 3 121
4 12 10.1 12.5 2 1 124
5 12 10.1 12.5 2 2 117
6 12 10.1 12.5 2 3 119
7 10.3 12 9.8 3 1 121
8 10.3 12 9.8 3 2 125
9 10.3 12 9.8 3 3 123
10 11 8 10 4 1 123
# ... with 11 more rows
Where RaceIdx is the race number.
This has achieved the desired result for the 'Rating' column, but I need to be able to convert
Dist1, Dist2 and Dist3 into a separate column 'Distance' that matches up each horse's
DaH rating with the corresponding Dist.
To illustrate, I am trying to end up with a dataset as follows:
Distance horse RaceIdx Rating
<dbl> <dbl> <chr> <dbl>
1 10 1 1 0
2 10 1 2 124
3 11.5 1 3 121
4 12 2 1 124
5 10.1 2 2 117
6 12.5 2 3 119
7 10.3 3 1 121
8 12 3 2 125
9 9.8 3 3 123
10 11 4 1 123
# ... with 11 more rows
I need to filter the Ratings by Distance.
Then I hope to be able to produce the average rating for each horse over the races where the
race Distance is between 10 and 11.
Many Thanks in advance.
We can specify the names_sep with a regex lookaround
library(dplyr)
library(tidyr)
horse_data %>%
pivot_longer(cols = -c(horse), names_to = c('.value', 'RaceIdx'),
names_sep="(?<=[A-Za-z])(?=[0-9])") %>%
rename(Distance = Dist, Rating = DaH)
# A tibble: 21 x 4
# horse RaceIdx Rating Distance
# <dbl> <chr> <dbl> <dbl>
# 1 1 1 0 10
# 2 1 2 124 10
# 3 1 3 121 11.5
# 4 2 1 124 12
# 5 2 2 117 10.1
# 6 2 3 119 12.5
# 7 3 1 121 10.3
# 8 3 2 125 12
# 9 3 3 123 9.8
#10 4 1 123 11
# … with 11 more rows
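The question also asks for the average rating per horse where the distance is between 10 and 11. A minimal sketch of that last step, assuming the bounds are inclusive and that a rating of 0 is a real rating rather than a missing value (neither is stated in the question). It uses names_pattern instead of the lookaround names_sep above, which is simply an alternative way to split the column names:

```r
library(dplyr)
library(tidyr)

horse_data <- tibble(
  DaH1  = c(0, 124, 121, 123, 0, NA, 110),
  DaH2  = c(124, 117, 125, 120, 125, 0, NA),
  DaH3  = c(121, 119, 123, 119, NA, 0, 123),
  Dist1 = c(10, 12, 10.3, 11, 11.5, 14, 10),
  Dist2 = c(10, 10.1, 12, 8, 9.5, 10.25, 8.75),
  Dist3 = c(11.5, 12.5, 9.8, 10, 10, 15, 10),
  horse = c(1, 2, 3, 4, 5, 6, 7)
)

avg_rating <- horse_data %>%
  # Split e.g. "DaH1" into the value column "DaH" and the race index "1"
  pivot_longer(cols = -horse, names_to = c(".value", "RaceIdx"),
               names_pattern = "([A-Za-z]+)([0-9]+)") %>%
  rename(Distance = Dist, Rating = DaH) %>%
  # Keep races run over 10 to 11 (inclusive), then average per horse
  filter(Distance >= 10, Distance <= 11) %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(Rating, na.rm = TRUE))

avg_rating
```

If 0 actually means "no rating", adding filter(Rating > 0) before summarise() would exclude those rows.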
Having a rough time approaching this problem with a large dataset. Essentially there are multiple rows for the same item. However, only one of the items contains the required value. I need to copy that value to all matching items.
Eg. below, I need item 100 to have a cost of 1203 for every row.
df = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
"cost" = c(1203, 0, 0, 66, 0, 1200, 0, 0))
> df
item cost
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 1200
7 102 0
8 102 0
Like so:
df_wanted = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
"cost" = c(1203, 1203, 1203, 66, 66, 1200, 1200, 1200))
> df_wanted
item cost
1 100 1203
2 100 1203
3 100 1203
4 105 66
5 105 66
6 102 1200
7 102 1200
8 102 1200
Below is my attempt, which I think is an inefficient method:
for (row in 1:length(df$cost)){
if (df$cost[row] == 0){
df$cost[row] = df$cost[row-1]
}
}
Here is one option. After grouping by 'item', subset 'cost' to the values that are not 0 and select the first element.
library(dplyr)
df %>%
group_by(item) %>%
mutate(cost = first(cost[cost != 0]))
# A tibble: 8 x 2
# Groups: item [3]
# item cost
# <dbl> <dbl>
#1 100 1203
#2 100 1203
#3 100 1203
#4 105 66
#5 105 66
#6 102 1200
#7 102 1200
#8 102 1200
Looks like you want to group by item and then replace 0 in cost with the non-zero value. In each group, which(cost != 0) gives the positions of the non-zero values; with a single non-zero value per group, as here, the result has length one and is recycled across the whole group.
library(dplyr)
df %>%
group_by(item) %>%
mutate(cost = cost[cummax(which(cost != 0))]) %>%
ungroup()
## A tibble: 8 x 2
# item cost
# <dbl> <dbl>
#1 100 1203
#2 100 1203
#3 100 1203
#4 105 66
#5 105 66
#6 102 1200
#7 102 1200
#8 102 1200
Base R equivalent is
transform(df, cost = ave(cost, item, FUN = function(x) x[cummax(which(x != 0))]))
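If the non-zero value is not guaranteed to be the first row of its group, a package-free sketch that takes the first non-zero value per item could look like this. It assumes every item has at least one non-zero cost; a group without one would be filled with NA:

```r
# Data where the non-zero cost for item 102 is not in the first row
df <- data.frame(item = c(100, 100, 100, 105, 105, 102, 102, 102),
                 cost = c(1203, 0, 0, 66, 0, 0, 1200, 0))

# Within each item, repeat the first non-zero cost across all rows
df$cost <- ave(df$cost, df$item,
               FUN = function(x) rep(x[x != 0][1], length(x)))
df
```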
What I ended up going with after revisiting this problem was a left_join(), which makes more sense to me intuitively, though it may not be the best solution.
The original DF below.
df = tibble("item" = as.factor(c(100, 100, 100, 105, 105, 102, 102, 102)),
"cost" = c(1203, 0, 0, 66, 0, 0, 1200, 0))
> df
# A tibble: 8 x 2
item cost
<fct> <dbl>
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 0
7 102 1200
8 102 0
Create an 'index' of item-value pairs
df_index <- df %>%
group_by(item) %>%
arrange(-cost) %>%
slice(1)
> df_index
# A tibble: 3 x 2
# Groups: item [3]
item cost
<fct> <dbl>
1 100 1203
2 102 1200
3 105 66
Finally, join the dataframes by item to fill in the empty row values.
df_joined <- df %>%
left_join(df_index, by="item")
> df_joined
# A tibble: 8 x 3
item cost.x cost.y
<fct> <dbl> <dbl>
1 100 1203 1203
2 100 0 1203
3 100 0 1203
4 105 66 66
5 105 0 66
6 102 0 1200
7 102 1200 1200
8 102 0 1200
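The join above leaves two cost columns (cost.x and cost.y). A minimal sketch of one way to end up with a single filled cost column, assuming each item has exactly one non-zero cost: drop the original cost before joining. The name df_filled is made up for illustration:

```r
library(dplyr)

df <- tibble(item = as.factor(c(100, 100, 100, 105, 105, 102, 102, 102)),
             cost = c(1203, 0, 0, 66, 0, 0, 1200, 0))

# Lookup table: one row per item holding its non-zero cost
df_index <- df %>%
  filter(cost != 0) %>%
  distinct(item, .keep_all = TRUE)

# Drop the sparse cost column first so the join adds a single, filled one
df_filled <- df %>%
  select(-cost) %>%
  left_join(df_index, by = "item")

df_filled
```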
Apologies if this has been asked before. I couldn't find any satisfactory answers, although it sounds like it should be a rather straightforward operation.
I have my data
transition_frame name state_number lifetime
<int> <chr> <dbl> <dbl>
1 38 //Traces_exp1_tif_pair10 1 NA
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
It was easy enough to calculate the rowwise differences between transition frames, but since there's no "transition" between state 0 and 1, it breaks the flow.
How can I make only the first row be transition_frame - 1 (hint, it's 37), without touching any other data?
Imagine,
group_by(name) %>%
filter(state_number == 1) %>%
mutate(lifetime = transition_frame - 1) %>%
unfilter() # To retrieve dropped data
Which would result in a whole set, with the first row computed, and NOT only the first row.
transition_frame name state_number lifetime
<int> <chr> <dbl> <dbl>
1 38 //Traces_exp1_tif_pair10 1 37
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
Does the following work for you?
library(dplyr)
df <- data.frame(transition_frame = c(38, 44, 352, 362, 379, 388),
name = rep("//Traces_exp1_tif_pair10", 6),
state_number = seq(1, 6))
df %>% mutate(lifetime = diff(c(1, transition_frame)))
transition_frame name state_number lifetime
1 38 //Traces_exp1_tif_pair10 1 37
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
Replace 1 in diff() with other values if you want the transition frame in state 0 to take on different values.
An approach similar to the code below might help:
df <- data.frame(transition_frame=c(38,44,352),
name=c('//Traces_exp1_tif_pair10','//Traces_exp1_tif_pair10','//Traces_exp1_tif_pair10'),
state_number=c(1,2,3),
lifetime=c(NA,6,308))
df[df$state_number==1 & is.na(df$lifetime),"lifetime"] <-
df[df$state_number==1 & is.na(df$lifetime),"transition_frame"] - 1
df
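Since the question groups by name, here is a package-free sketch of the diff() idea applied per trace, so that the first row of every trace is measured from frame 1. The second trace name and its frame numbers are made up for illustration:

```r
df <- data.frame(
  transition_frame = c(38, 44, 352, 20, 50),
  name = c(rep("//Traces_exp1_tif_pair10", 3),
           rep("//Traces_exp1_tif_pair11", 2))  # hypothetical second trace
)

# Within each trace, prepend frame 1 so the first difference is transition_frame - 1
df$lifetime <- ave(df$transition_frame, df$name,
                   FUN = function(x) diff(c(1, x)))
df
```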