I'm trying to find the predicted values of car accidents according to age and sex and finally adjusted to population.
My data is (df):
df <- dplyr::tibble(
city = c("a", "a", "b", "b", "c", "c"),
sex = c(1,0,1,0,1,0),
age = c(1,2,1,2,1,2),
population = c(100, 123, 189, 234, 221, 435),
accidents = c(87, 98, 79, 43,45,65)
)
My code:
library(tidyverse)
library(ggeffects)
poisson<-glm(accidents~sex+age,family="poisson",data=df)
df<-df%>%
mutate(acc_pred=predict(poisson))
Output:
city sex age population accidents acc_pred
a 1 1 100 87 4.36
a 0 2 123 98 4.43
b 1 1 189 79 4.21
b 0 2 234 43 4.25
c 1 1 221 45 4.26
c 0 2 435 65 3.93
What am I doing wrong?
A Poisson glm uses a log link function, and predict.glm defaults to type = "link", so it returns predictions on the scale of the linear predictor (the log of the expected count) without applying the inverse link function. You either need to use type = "response" inside predict, which applies the inverse link and gives predictions in the same units as your input data, or, equivalently, since the inverse of the log link is exp, you can exponentiate the results of predict yourself.
So you can do either:
df %>%
mutate(acc_pred=predict(poisson, type = 'response'))
#> city sex age population accidents acc_pred
#> 1 a 1 1 100 87 70.33333
#> 2 a 0 2 123 98 68.66667
#> 3 b 1 1 189 79 70.33333
#> 4 b 0 2 234 43 68.66667
#> 5 c 1 1 221 45 70.33333
#> 6 c 0 2 435 65 68.66667
Or
df %>%
mutate(acc_pred = exp(predict(poisson)))
#> city sex age population accidents acc_pred
#> 1 a 1 1 100 87 70.33333
#> 2 a 0 2 123 98 68.66667
#> 3 b 1 1 189 79 70.33333
#> 4 b 0 2 234 43 68.66667
#> 5 c 1 1 221 45 70.33333
#> 6 c 0 2 435 65 68.66667
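Neither version above uses population yet. A minimal sketch of one common way to adjust for it, assuming accidents scale proportionally with population, is to add log(population) as an offset. The names fit_rate and rate_pred are made up for illustration. In this toy data sex and age are perfectly collinear (sex 1 always has age 1), so the sketch keeps only sex.

```r
df <- data.frame(
  city       = c("a", "a", "b", "b", "c", "c"),
  sex        = c(1, 0, 1, 0, 1, 0),
  age        = c(1, 2, 1, 2, 1, 2),
  population = c(100, 123, 189, 234, 221, 435),
  accidents  = c(87, 98, 79, 43, 45, 65)
)

# Assumption: expected accidents are proportional to population,
# so log(population) enters the model as an offset.
fit_rate <- glm(accidents ~ sex + offset(log(population)),
                family = poisson, data = df)

# Expected counts for each row, given that row's own population
df$acc_pred <- predict(fit_rate, type = "response")

# Predicted accidents per inhabitant
df$rate_pred <- df$acc_pred / df$population
df
```

With the offset in place, acc_pred differs between cities with the same sex because their populations differ, while rate_pred is constant within each sex.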
I have a question related to the function tmerge() in the R package survival.
I am trying to set up a data set with time-dependent covariates, but the value of the covariate in the initial time period is set to NA (see reprex below).
I have one data frame with baseline variables, time-, and event data, and a second data frame with variables measured 3 months after baseline.
I have used the same approach as in the PBC-data example in the vignette by Terry Therneau and Co. (or tried to, at least! https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf). On p. 11 it says:
"The tdc and cumtdc arguments can have 1, 2 or three arguments. The first is always the
time point, the second, if present, is the value to be inserted, and an optional third argument is the initial value. If the tdc call has a single argument the result is always a 0/1 variable, 0 before the time point and 1 after. For the 2 or three argument form, the starting value before the first definition of the new variable (before the first time point) will be the initial value. The default for the initial value is NA, the value of the tdcstart option." Not sure I understand the last bit highlighted in bold.
I do not get the same problem when I replicate the PBC example. I tried to specify init in the second tmerge call and/or the tdcstart option without any success (both generate an error). There are no missing values in the covariates or the outcome (time, event).
I am reaching out here since I cannot find out what I am doing wrong.
Thanks a lot in advance!
PS. This is my first post, so apologies if I have missed something. I hope it makes sense.
library(tidyverse)
library(survival)
set.seed(123)
# Generate data
df_base <- tibble(
ID = as.numeric(1:100),
time = as.integer(runif(100, min = 100, max = 730)),
status = as.factor(sample(x = c("0", "1"), prob = c(0.7, 0.3), size = 100, replace = T)),
vas = as.integer(rnorm(n = 100, mean = 53, sd = 10)))
df_fu <- tibble(
ID = as.numeric(1:100),
fu_3mo = 91,
vas = as.integer(rnorm(n = 100, mean = 44, sd = 15)))
# Baseline data
head(df_base)
#> # A tibble: 6 x 4
#> ID time status vas
#> <dbl> <int> <fct> <int>
#> 1 1 281 0 45
#> 2 2 596 0 55
#> 3 3 357 0 50
#> 4 4 656 1 49
#> 5 5 692 0 43
#> 6 6 128 1 52
# Follow-up data
head(df_fu)
#> # A tibble: 6 x 3
#> ID fu_3mo vas
#> <dbl> <dbl> <int>
#> 1 1 91 76
#> 2 2 91 63
#> 3 3 91 40
#> 4 4 91 52
#> 5 5 91 37
#> 6 6 91 36
# Generate time-dependent covariates
df_tdc <- tmerge(df_base, df_base, id = ID, surgery = event(time, status))
head(df_tdc)
#> ID time status vas tstart tstop surgery
#> 1 1 281 0 45 0 281 0
#> 2 2 596 0 55 0 596 0
#> 3 3 357 0 50 0 357 0
#> 4 4 656 1 49 0 656 1
#> 5 5 692 0 43 0 692 0
#> 6 6 128 1 52 0 128 1
df_tdc <- tmerge(df_tdc, df_fu, id = ID, vas = tdc(fu_3mo, vas))
#> Warning in tmerge(df_tdc, df_fu, id = ID, vas = tdc(fu_3mo, vas)): replacement
#> of variable 'vas'
head(df_tdc)
#> ID time status vas tstart tstop surgery
#> 1 1 281 0 NA 0 91 0
#> 2 1 281 0 76 91 281 0
#> 3 2 596 0 NA 0 91 0
#> 4 2 596 0 63 91 596 0
#> 5 3 357 0 NA 0 91 0
#> 6 3 357 0 40 91 357 0
Created on 2021-11-26 by the reprex package (v0.3.0)
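One possible way around the NA in the first interval, sketched under the assumption that the baseline vas should apply from day 0 until the 3-month measurement: stack the baseline values as a "day 0" measurement on top of the follow-up values, then pass the stacked data to tdc(). The data below is a small hypothetical subset in the same shape as df_base and df_fu; the names base, fu, vas_long and vas_tdc are made up for illustration, and status is kept numeric here.

```r
library(survival)

# Small hypothetical data in the same shape as df_base / df_fu
base <- data.frame(ID     = 1:5,
                   time   = c(281, 596, 357, 656, 692),
                   status = c(0, 0, 0, 1, 0),
                   vas    = c(45, 55, 50, 49, 43))
fu <- data.frame(ID = 1:5, day = 91, vas = c(76, 63, 40, 52, 37))

# Stack the baseline measurement (day 0) on top of the follow-up measurement
vas_long <- rbind(data.frame(ID = base$ID, day = 0, vas = base$vas), fu)

tdc_data <- tmerge(base, base, id = ID, surgery = event(time, status))

# A new variable name (vas_tdc) avoids the "replacement of variable" warning
tdc_data <- tmerge(tdc_data, vas_long, id = ID, vas_tdc = tdc(day, vas))
head(tdc_data)
```

Because every subject now has a measurement at day 0, the first interval takes the baseline value, so the initial value never has to fall back to NA.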
Hi and Thanks in advance for any assistance the group can give.
I have a dataset which gives the performance ratings for 7 race horses
over their last 3 races. The performance ratings are DaH1, DaH2 and DaH3 where
DaH1 is the performance rating for the last race etc.
I also have data for the distances over which the races were run, where the distances are
Dist1, Dist2 and Dist3 and they correspond to the performance ratings, i.e. Horse 2 has a
performance rating of 124 for DaH1, with a race distance, Dist1, of 12.
The dataset is:
horse_data <- tibble(
DaH1=c(0, 124, 121, 123, 0, NA, 110),
DaH2=c(124, 117, 125, 120, 125, 0, NA),
DaH3=c(121, 119, 123, 119, NA, 0, 123),
Dist1 =c(10,12,10.3,11,11.5,14,10),
Dist2 =c(10,10.1,12,8,9.5,10.25,8.75),
Dist3 =c(11.5,12.5,9.8,10,10,15,10),
horse =c(1,2,3,4,5,6,7),
)
I am trying to use pivot_longer to convert the data to a better dataset for performing
calculations depending upon race distances.
So far I used this code:
tidyData <- horse_data %>%
pivot_longer(
values_to="Rating",
cols=c(DaH1, DaH2, DaH3),
names_prefix="DaH",
names_to="RaceIdx"
)
To achieve:
> tidyData
# A tibble: 21 x 6
Dist1 Dist2 Dist3 horse RaceIdx Rating
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 10 10 11.5 1 1 0
2 10 10 11.5 1 2 124
3 10 10 11.5 1 3 121
4 12 10.1 12.5 2 1 124
5 12 10.1 12.5 2 2 117
6 12 10.1 12.5 2 3 119
7 10.3 12 9.8 3 1 121
8 10.3 12 9.8 3 2 125
9 10.3 12 9.8 3 3 123
10 11 8 10 4 1 123
# ... with 11 more rows
Where RaceIdx is the race number.
This has achieved the desired result for the 'Rating' column, but I also need to convert
Dist1, Dist2 and Dist3 into a single column 'Distance' that matches each horse's
DaH rating with the corresponding Dist.
To illustrate, I am trying to end up with a dataset as follows:
Distance horse RaceIdx Rating
<dbl> <dbl> <chr> <dbl>
1 10 1 1 0
2 10 1 2 124
3 11.5 1 3 121
4 12 2 1 124
5 10.1 2 2 117
6 12.5 2 3 119
7 10.3 3 1 121
8 12 3 2 125
9 9.8 3 3 123
10 11 4 1 123
# ... with 11 more rows
I need to filter the Ratings by Distance.
Then I hope to be able to produce an average rating for each horse over the races where the
race Distance is between 10 and 11.
Many Thanks in advance.
We can split each column name between its letters and its digits by passing a regex lookaround to names_sep:
library(dplyr)
library(tidyr)
horse_data %>%
pivot_longer(cols = -c(horse), names_to = c('.value', 'RaceIdx'),
names_sep="(?<=[A-Za-z])(?=[0-9])") %>%
rename(Distance = Dist, Rating = DaH)
# A tibble: 21 x 4
# horse RaceIdx Rating Distance
# <dbl> <chr> <dbl> <dbl>
# 1 1 1 0 10
# 2 1 2 124 10
# 3 1 3 121 11.5
# 4 2 1 124 12
# 5 2 2 117 10.1
# 6 2 3 119 12.5
# 7 3 1 121 10.3
# 8 3 2 125 12
# 9 3 3 123 9.8
#10 4 1 123 11
# … with 11 more rows
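For the follow-up step in the question (average rating per horse over races with a distance between 10 and 11), a minimal sketch could look like the following. It uses names_pattern instead of the lookaround, purely as an alternative way to split the names. The names tidy and avg_rating are made up for illustration, and it assumes that NA ratings should be dropped while ratings of 0 are kept as they are.

```r
library(dplyr)
library(tidyr)

horse_data <- tibble(
  DaH1  = c(0, 124, 121, 123, 0, NA, 110),
  DaH2  = c(124, 117, 125, 120, 125, 0, NA),
  DaH3  = c(121, 119, 123, 119, NA, 0, 123),
  Dist1 = c(10, 12, 10.3, 11, 11.5, 14, 10),
  Dist2 = c(10, 10.1, 12, 8, 9.5, 10.25, 8.75),
  Dist3 = c(11.5, 12.5, 9.8, 10, 10, 15, 10),
  horse = c(1, 2, 3, 4, 5, 6, 7)
)

# Long format: one row per horse and race, with Rating and Distance side by side
tidy <- horse_data %>%
  pivot_longer(cols = -horse,
               names_to = c(".value", "RaceIdx"),
               names_pattern = "([A-Za-z]+)([0-9]+)") %>%
  rename(Rating = DaH, Distance = Dist)

# Keep races run over 10 to 11 (inclusive) and average the ratings per horse
avg_rating <- tidy %>%
  filter(between(Distance, 10, 11)) %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(Rating, na.rm = TRUE))
avg_rating
```

If a rating of 0 actually means "no rating", it would need to be converted to NA before averaging; that is a data question the code cannot answer.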
I'm trying to split columns into new rows keeping the data of the first two columns.
d1 <- data.frame(a=c(100,0,78),b=c(0,137,117),c.1=c(111,17,91), d.1=c(99,66,22), c.2=c(11,33,44), d.2=c(000,001,002))
d1
a b c.1 d.1 c.2 d.2
1 100 0 111 99 11 0
2 0 137 17 66 33 1
3 78 117 91 22 44 2
Expected results would be:
a b c d
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I have made multiple tries with dplyr, but it seems that is not the right approach.
If you want to stay in dplyr/tidyverse, you want tidyr::pivot_longer with a special reference to .value -- see the pivot vignette for more:
library(tidyverse)
d1 <- data.frame(
a = c(100, 0, 78),
b = c(0, 137, 117),
c.1 = c(111, 17, 91),
d.1 = c(99, 66, 22),
c.2 = c(11, 33, 44),
d.2 = c(000, 001, 002)
)
d1 %>%
pivot_longer(
cols = contains("."),
names_to = c(".value", "group"),
names_sep = "\\."
)
#> # A tibble: 6 x 5
#> a b group c d
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 100 0 1 111 99
#> 2 100 0 2 11 0
#> 3 0 137 1 17 66
#> 4 0 137 2 33 1
#> 5 78 117 1 91 22
#> 6 78 117 2 44 2
Created on 2020-05-11 by the reprex package (v0.3.0)
This could solve your issue:
#Try this
a1 <- d1[,c(1:4)]
a2 <- d1[,c(1,2,5,6)]
names(a1) <- names(a2) <- c('a','b','c','d')
DF <- rbind(a1,a2)
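Note that rbind stacks a2 below a1, so the rows come out grouped by suffix (all .1 rows, then all .2 rows) rather than interleaved as in the expected result. A small sketch of one way to restore the interleaved order, using a helper vector src_row (a name made up here) that records which original row each stacked row came from:

```r
d1 <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117),
                 c.1 = c(111, 17, 91), d.1 = c(99, 66, 22),
                 c.2 = c(11, 33, 44), d.2 = c(0, 1, 2))

a1 <- d1[, 1:4]
a2 <- d1[, c(1, 2, 5, 6)]
names(a1) <- names(a2) <- c("a", "b", "c", "d")
DF <- rbind(a1, a2)

# Original row index of each stacked row: 1, 2, 3, 1, 2, 3
src_row <- rep(seq_len(nrow(d1)), times = 2)

# order() keeps ties in their existing order, so .1 stays ahead of .2
DF <- DF[order(src_row), ]
rownames(DF) <- NULL
DF
```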
The posted answers are good, here's my attempt:
library(dplyr)
library(tidyr)
df <- data.frame(a=c(100,0,78),b=c(0,137,117),
c.1=c(111,17,91), d.1=c(99,66,22),
c.2=c(11,33,44), d.2=c(000,001,002))
# Make 2 pivot_longer operations, one for the c columns and one for the d columns
df_c <- df %>% select(-d.1, -d.2) %>%
pivot_longer(cols = c("c.1", "c.2"), values_to = "c") %>% select(-name)
df_d <- df %>% select(-c.1, -c.2) %>%
pivot_longer(cols=c("d.1","d.2"), values_to = "d") %>% select(-name)
# bind them without the "key" columns (this relies on both results having the same row order)
bind_cols(df_c, select(df_d, -a, -b))
Which produces
# A tibble: 6 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I have the following code for a Netflix experiment that reduces the price of Netflix to see if people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
tibble(unique_person_id = sample(x = c(1:100),
size = sample_size,
replace = TRUE),
tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
size = sample_size,
replace = TRUE),
)
I then want to write some code that randomly assigns people to one of two groups: treatment and control. However, the dataset is at row level, with 10,000 observations. I want to change it to person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, but each person appears many times, once per tv_show viewing. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE))
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the following, which groups your observations by person id and draws a single "treated"/"control" label per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results, all of the ids have only 1 group of treated / control:
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
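The same equal split can also be sketched in base R by shuffling a balanced vector of labels, which may be easier to read than the odd/even trick. Here person_ids stands in for the 100 distinct unique_person_id values (an assumption for illustration), and the names labels and assignment are made up.

```r
set.seed(853)

# Assumption: the 100 distinct person ids from viewing_data
person_ids <- 1:100

# Exactly half "treated" and half "control", then shuffled across persons
labels <- rep(c("treated", "control"), each = length(person_ids) / 2)
assignment <- data.frame(unique_person_id = person_ids,
                         group = sample(labels))

# Check: 50 persons in each group
table(assignment$group)
```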
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
This is my dataset:
df = structure(list(from = c(0, 0, 0, 0, 38, 43, 49, 54), to = c(43,
54, 56, 62, 62, 62, 62, 62), count = c(342, 181, 194, 386, 200,
480, 214, 176), group = c("keiner", "keiner", "keiner", "keiner",
"paid", "paid", "owned", "earned")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -8L))
My problem is that the columns from and to need to be ranked (the ranking has to be done jointly over the two columns from and to), since the visualisation library requires that, and the ranks need to start at an index of 0.
That's why I built two vectors: one (ranking) with the rank of each unique value of the two columns, the other (uniquevalues) with the original unique values of the dataset.
ranking <- dplyr::dense_rank(unique(c(df$from, df$to))) - 1 ### Start Index at 0, "recode" variables
uniquevalues <- unique(c(df$from, df$to))
Now I have to recode the original dataset. The columns to and from have to receive the values from ranking, according to the corresponding value of uniquevalues.
The only option I came up with was to create a dataframe of the two vectors and loop over each row, but I would really like to have a vectorized solution for this. Can anyone help me?
This:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 43 342 keiner
2 0 54 181 keiner
3 0 56 194 keiner
4 0 62 386 keiner
5 38 62 200 paid
6 43 62 480 paid
7 49 62 214 owned
8 54 62 176 earned
should become this:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 2 342 keiner
2 0 4 181 keiner
3 0 5 194 keiner
4 0 6 386 keiner
5 1 6 200 paid
6 2 6 480 paid
7 3 6 214 owned
8 4 6 176 earned
We could unlist the values and match them with uniquevalues
df[1:2] <- match(unlist(df[1:2]), uniquevalues) - 1
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned
Or using column names instead of index.
df[c("from", "to")] <- match(unlist(df[c("from", "to")]), uniquevalues) - 1
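The match-based approach returns positions in uniquevalues, which equal the ranks only because uniquevalues happens to be in ascending order for this data. A minimal sketch that does not rely on that, assuming the goal is a 0-based dense rank over both columns together: sort the unique values first and recode each column separately. The name lookup is made up for illustration.

```r
df <- data.frame(
  from  = c(0, 0, 0, 0, 38, 43, 49, 54),
  to    = c(43, 54, 56, 62, 62, 62, 62, 62),
  count = c(342, 181, 194, 386, 200, 480, 214, 176),
  group = c("keiner", "keiner", "keiner", "keiner",
            "paid", "paid", "owned", "earned")
)

# Sorted unique values across both columns; position - 1 is the 0-based rank
lookup <- sort(unique(c(df$from, df$to)))

# Recode each column on its own, so the replacement is a list of columns
df[c("from", "to")] <- lapply(df[c("from", "to")],
                              function(x) match(x, lookup) - 1)
df
```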
Another solution converting to factor and back.
f <- unique(unlist(df[1:2]))
df[1:2] <- lapply(df[1:2], function(x) {
as.integer(as.character(factor(x, levels=f, labels=1:length(f) - 1)))
})
df
# # A tibble: 8 x 4
# from to count group
# <int> <int> <dbl> <chr>
# 1 0 2 342 keiner
# 2 0 4 181 keiner
# 3 0 5 194 keiner
# 4 0 6 386 keiner
# 5 1 6 200 paid
# 6 2 6 480 paid
# 7 3 6 214 owned
# 8 4 6 176 earned
I would use the mapvalues function from plyr, like this:
library(plyr)
df[ , 1:2] <- mapvalues(unlist(df[ , 1:2]),
from= uniquevalues,
to= ranking)
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned