Add a series as a column in R

I need to expand a dataset by merging in a time series that runs from 1 hr to 600 hr, going from this:
ID Age Weight
1 25 150
2 30 140
3 28 170
to the following format (the dots represent a continuous count from 4 to 599, with the same ID, Age and Weight values):
ID Age Weight Time
1 25 150 1
1 25 150 2
1 25 150 3
1 25 150 4
.. .. ... ..
1 25 150 599
1 25 150 600
2 30 140 1
2 30 140 2
2 30 140 3
2 30 140 4
.. .. ... ..
2 30 140 599
2 30 140 600
3 28 170 1
3 28 170 2
3 28 170 3
3 28 170 4
.. .. ... ..
3 28 170 599
3 28 170 600

We can use complete() from tidyr:
library(dplyr)
df %>%
  mutate(Time = 1) %>%
  group_by(ID, Age, Weight) %>%
  tidyr::complete(Time = 1:600)
# ID Age Weight Time
# <int> <int> <int> <dbl>
# 1 1 25 150 1
# 2 1 25 150 2
# 3 1 25 150 3
# 4 1 25 150 4
# 5 1 25 150 5
# 6 1 25 150 6
# 7 1 25 150 7
# 8 1 25 150 8
# 9 1 25 150 9
#10 1 25 150 10
# … with 1,790 more rows
Or in base R, using split() and transform():
# data.frame recycling inside transform() repeats each one-row group 600 times
do.call(rbind, lapply(split(df, df$ID), function(x) transform(x, Time = 1:600)))
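Another base R option (a sketch, not from the original answers): merge() with no shared column names produces a Cartesian product, which is exactly this expansion:
# with no common columns, merge() pairs every row of df with every Time value
out <- merge(df, data.frame(Time = 1:600))
out <- out[order(out$ID, out$Time), ]  # restore the ID-then-Time ordering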

Related

Calculating the difference between two columns based on similar names in R

I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each letter represents a score, and I would like to calculate the difference between the follow-up and baseline for each score in a new column:
subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup
    1              100              150                5                2               80               70
    2              120              142               10                9               79               42
    3              111              146               60               49               89               46
    4              152              148                4                4               69               48
    5              110              123               20               18               60               23
    6              112              120                5                3               12               20
    7              111              145                6                4               11               45
I'd like to calculate the difference between followup and baseline for each score in a new column like this:
df$a_score_difference = df$a_score.followup - df$a_score.baseline
Any ideas on how to do this efficiently? I really appreciate your help.
code to generate sample data:
subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)
df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)
base R
scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13 8 34
# $b_score.followup
# [1] -3 -1 -11 0 -2 -2 -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37 8 34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1 1 100 150 5 2 80 70 50
# 2 2 120 142 10 9 79 42 22
# 3 3 111 146 60 49 89 46 35
# 4 4 152 148 4 4 69 48 -4
# 5 5 110 123 20 18 60 23 13
# 6 6 112 120 5 3 12 20 8
# 7 7 111 145 6 4 11 45 34
# b_score.difference c_score.difference
# 1 -3 -10
# 2 -1 -37
# 3 -11 -43
# 4 0 -21
# 5 -2 -37
# 6 -2 8
# 7 -2 34
Because this runs unsupervised, there is the possibility that not every followup column has a comparable baseline column, which could silently misalign the subtraction. You might include a test to validate their presence and order:
all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
dplyr
You might consider pivoting the data into a longer format. This can be done in base R as well, but it looks a lot simpler when done here:
library(dplyr)
# library(tidyr) # pivot_*
df %>%
  tidyr::pivot_longer(
    -subid,
    names_pattern = "(.*)_score.(.*)",
    names_to = c("ltr", ".value")
  ) %>%
  mutate(difference = followup - baseline)
# # A tibble: 21 x 5
# subid ltr baseline followup difference
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 a 100 150 50
# 2 1 b 5 2 -3
# 3 1 c 80 70 -10
# 4 2 a 120 142 22
# 5 2 b 10 9 -1
# 6 2 c 79 42 -37
# 7 3 a 111 146 35
# 8 3 b 60 49 -11
# 9 3 c 89 46 -43
# 10 4 a 152 148 -4
# # ... with 11 more rows
Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then
df %>%
  tidyr::pivot_longer(
    -subid, names_pattern = "(.*)_score.(.*)",
    names_to = c("ltr", ".value")
  ) %>%
  mutate(difference = followup - baseline) %>%
  tidyr::pivot_wider(
    names_from = "ltr",
    values_from = c("baseline", "followup", "difference"),
    names_glue = "{ltr}_score.{.value}"
  )
# # A tibble: 7 x 10
# subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 100 5 80 150 2 70 50 -3 -10
# 2 2 120 10 79 142 9 42 22 -1 -37
# 3 3 111 60 89 146 49 46 35 -11 -43
# 4 4 152 4 69 148 4 48 -4 0 -21
# 5 5 110 20 60 123 18 23 13 -2 -37
# 6 6 112 5 12 120 3 20 8 -2 8
# 7 7 111 6 11 145 4 45 34 -2 34
dplyr #2
This keeps the data wide (no pivoting), which will be more efficient than the pivot-mutate-pivot approach above if you have no intention of working with the data in a longer format.
df %>%
  mutate(across(
    ends_with("score.followup"),
    ~ . - cur_data()[[sub("followup", "baseline", cur_column())]],
    .names = "{sub('followup', 'difference', col)}"
  ))
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1 1 100 150 5 2 80 70 50 -3 -10
# 2 2 120 142 10 9 79 42 22 -1 -37
# 3 3 111 146 60 49 89 46 35 -11 -43
# 4 4 152 148 4 4 69 48 -4 0 -21
# 5 5 110 123 20 18 60 23 13 -2 -37
# 6 6 112 120 5 3 12 20 8 -2 8
# 7 7 111 145 6 4 11 45 34 -2 34
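Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). On newer dplyr versions the same keep-it-wide idea can be written with get(), which looks the matching baseline column up in the data mask — a sketch, assuming the same column-naming scheme:
df %>%
  mutate(across(
    ends_with("score.followup"),
    # subtract the matching baseline column, found by name
    ~ . - get(sub("followup", "baseline", cur_column())),
    .names = "{sub('followup', 'difference', col)}"
  ))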

Reshape R dataframe with values in new columns from the original dataframe

I have the following dataframe "Diet":
ID Food.group grams day weight
A 12 200 1 60
A 13 300 1 60
A 14 100 1 60
A 15 50 1 60
A 16 200 1 60
A 17 250 1 60
B 13 300 2 73
B 14 140 2 73
B 15 345 2 73
B 17 350 2 73
C 12 120 6 66
C 13 100 6 66
C 16 200 6 66
I need to create a new dataframe with each food group as a new column and the values in grams as their values, all organized by ID. The other columns have unique values for each ID and can become one line. Something like this:
ID 12 13 14 15 16 17 day weight
A 200 300 100 50 200 250 1 60
B N/A 300 140 345 N/A 350 2 73
C 120 100 N/A N/A 200 N/A 6 66
I tried using
Diet2 <- reshape(Diet, idvar = "ID", timevar = "Food.group", direction = "wide")
But I get this:
ID 12.grams 12.day 12.weight 13.grams 13.day 13.weight
A 200 1 60 300 1 60
B N/A N/A N/A 300 2 73
C 120 6 66 100 6 66
and so on. How can I get the correct dataframe shape?
You can use the more recent pivot_wider() from {tidyr}.
library(dplyr)
library(tidyr)
Diet %>%
  pivot_wider(id_cols = c("ID", "day", "weight"),
              names_from = "Food.group",
              values_from = "grams")
This will give you:
# A tibble: 3 x 9
ID day weight `12` `13` `14` `15` `16` `17`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 60 200 300 100 50 200 250
2 B 2 73 NA 300 140 345 NA 350
3 C 6 66 120 100 NA NA 200 NA
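For completeness, the base R reshape() attempt from the question can also be repaired — a sketch, not part of the original answer: list every per-ID constant in idvar so day and weight are not duplicated per food group, and name the value column with v.names.
Diet2 <- reshape(Diet,
                 idvar     = c("ID", "day", "weight"),
                 timevar   = "Food.group",
                 v.names   = "grams",
                 direction = "wide")
The value columns come out as grams.12, grams.13, and so on; rename them if you want the bare food-group numbers.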

How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(
    unique_person_id = sample(x = c(1:100), size = sample_size, replace = TRUE),
    tv_show = sample(
      x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
      size = sample_size, replace = TRUE
    )
  )
I then want to write some code that would randomly assign people into one of two groups - treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R, so that I can assign each person to be either treated or not. A person should not be both treated and not treated. However, the same person appears many times, once per tv_show row. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the below; it groups your observations by person id and assigns a single "treated"/"control" value per group:
library(dplyr)
viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results: each of the ids should have only 1 distinct value of treated / control:
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
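A dplyr equivalent of that check (a sketch, not part of the original answer) — an empty result means no person received more than one label:
newdata %>%
  group_by(unique_person_id) %>%
  summarise(n_groups = n_distinct(group)) %>%
  filter(n_groups > 1)  # expect zero rows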
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100),                   # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows

Using R to determine if errors are normally distributed

Say I have a dataset called wage that looks like this:
wage
# A tibble: 935 x 17
wage hours iq kww educ exper tenure age married black south urban sibs brthord meduc
<int> <int> <int> <int> <int> <int> <int> <int> <fctr> <fctr> <fctr> <fctr> <int> <int> <int>
1 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8
2 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14
3 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14
4 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12
5 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6
6 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8
7 600 40 91 24 10 13 0 30 0 0 0 1 1 2 8
8 1081 40 114 50 18 8 14 38 1 0 0 1 2 3 8
9 1154 45 111 37 15 13 1 36 1 0 0 0 2 3 14
10 1000 40 95 44 12 16 16 36 1 0 0 1 1 1 12
# ... with 925 more rows, and 2 more variables: feduc <int>, lwage <dbl>
Say I then look at a simple linear regression between wage and IQ:
m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients
which gives me:
## (Intercept) iq
## 116.991565 8.303064
I want to check that the errors are:
εᵢ ~ N(0, σ²)
How do I check this using R?
There are a number of ways you can try.
One way would be shapiro.test() to test for normality. A p-value greater than your alpha level (typically up to 10%) would mean that the null hypothesis (i.e. that the errors are normally distributed) cannot be rejected. However, the test is sensitive to sample size, so you might want to reinforce your results by looking at the Q-Q plot.
You can see that by plotting m_wage_iq (plot(m_wage_iq)) and looking at the second graph. If your points lie approximately on the reference line, that would suggest that the errors follow a normal distribution.
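Putting both checks together — a minimal sketch, assuming the fitted model m_wage_iq from above:
res <- residuals(m_wage_iq)
shapiro.test(res)           # H0: the residuals are normally distributed
qqnorm(res); qqline(res)    # points near the reference line suggest normality
plot(m_wage_iq, which = 2)  # the same Q-Q plot via the built-in lm diagnostics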

Loop linear regression model

I have data like this, where Amount is the dependent variable and len, Age, quantity and Pos are explanatory variables. I am trying to run a stepwise regression of Amount on Age, quantity and Pos.
ID Sym Month Amount len Age quantity Pos
11 10 1 500 5 17 0 12
22 10 1 300 6 11 0 11
33 10 1 200 2 10 0 10
44 10 1 100 2 11 0 11
55 10 1 150 4 15 0 12
66 10 1 250 4 16 0 14
11 20 1 500 5 17 0 12
22 20 1 300 6 11 0 11
33 20 1 200 2 10 0 10
44 20 1 100 2 11 0 11
55 20 1 150 4 15 0 12
66 20 1 250 4 16 0 14
77 20 1 700 4 17 0 11
88 20 1 100 2 16 0 12
11 30 1 500 5 17 0 12
22 30 1 300 6 11 0 11
33 30 1 200 2 10 0 10
44 30 1 100 2 11 0 11
55 30 1 150 4 15 0 12
66 30 1 250 4 16 0 14
11 40 1 500 5 17 2000 12
22 40 1 300 6 11 1000 11
33 40 1 200 2 10 1000 10
44 40 1 100 2 11 1000 11
55 40 1 150 4 15 1000 12
66 40 1 250 4 16 1000 14
The output I want after running all the regressions is a dataframe that looks like this (it should help me detect outliers):
Id Month Sym Amount len Age Quantity Pos R^2 CookDistance Residuals UpperLimit LowerLimit
11 1 10 500 5 17 null 12 0.7 1.5 -350 -500 1000
22 1 10 300 6 11 null 11 0.8 1.7 -400 -500 1000
That is the code I am trying to run for Sym = 10, Sym = 20, Sym = 30, Sym = 40.
I have something like 400 Sym values to run a regression analysis on.
# sym_split is assumed to be a list of per-Sym data frames, e.g. split(df, df$Sym)
fit[[i]] <- step(lm(Amount ~ len + Age + quantity, data = sym_split[[i]],
                    na.action = na.omit), direction = "backward")
R_Sq[i] <- summary(fit[[i]])$r.squared
Res[[i]] <- resid(fit[[i]])
D[[i]] <- cooks.distance(fit[[i]])
Q[[i]] <- quantile(resid(fit[[i]]), c(.25, .50, .75, .99))
L[i] <- Q[[i]][1] - 2.2 * (Q[[i]][3] - Q[[i]][1])
U[i] <- Q[[i]][3] + 2.2 * (Q[[i]][3] - Q[[i]][1])
"i" means the results for the regression of sym = i (10,20..).
Is there any way to do this in a loop for every Sym value?
Any help will be highly appreciated.
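One way to run this for every Sym value — a sketch, assuming the data frame is called df with the columns shown above, and that each Sym group has enough variation for step() to run — is split() plus lapply():
groups <- split(df, df$Sym)

# one stepwise model per Sym group
fits <- lapply(groups, function(d)
  step(lm(Amount ~ len + Age + quantity, data = d),
       direction = "backward", trace = 0))

# collect per-row diagnostics back into one data frame
# (assumes no rows are dropped for NAs; otherwise align by row names)
results <- do.call(rbind, Map(function(d, fit) {
  q   <- quantile(resid(fit), c(.25, .75))
  iqr <- q[2] - q[1]
  data.frame(d,
             R2           = summary(fit)$r.squared,
             CookDistance = cooks.distance(fit),
             Residuals    = resid(fit),
             LowerLimit   = q[1] - 2.2 * iqr,
             UpperLimit   = q[2] + 2.2 * iqr)
}, groups, fits))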
