I have a data.frame containing repeated measurements for some individuals. It looks like this:
# A tibble: 853 x 5
ID Test N_ind Pheno Week
<chr> <chr> <int> <dbl> <int>
1 02A01 Int 96 0 12
2 02A01 Int 96 0 24
3 02A01 Int 94 0 36
4 02A01 Int 90 0 48
7 02A01 Int 78 1 84
9 02A03 Int 96 0 12
10 02A03 Int 96 0 24
11 02A03 Int 94 0 36
19 02C03 Int 96 1 12
20 0202C03 Int 96 0 24
21 0202C03 Int 94 0 36
22 0202C03 Int 90 0 48
23 02E02 Int 96 0 12
24 02E02 Int 96 0 24
25 02E02 Int 94 1 36
26 02E02 Int 90 1 48
I want to subset the data.frame: after grouping by ID, for individuals with a 1 anywhere in the Pheno column I want the row with the lowest Week value among their Pheno == 1 rows, while for individuals with only 0s in Pheno I want the row with the highest Week value.
The expected result should look like this:
ID Test N_ind Pheno Week
<chr> <chr> <int> <dbl> <int>
7 02A01 Int 78 1 84
11 02A03 Int 94 0 36
19 02C03 Int 96 1 12
22 0202C03 Int 90 0 48
25 02E02 Int 94 1 36
I have managed to do that for the 1 values but I am stuck on how to do it for the 0 values.
Here is my code:
df_sub <- data %>%
  group_by(ID) %>%
  arrange(Week, .by_group = TRUE) %>%
  ungroup()
Any help would be much appreciated
You can try -
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(Week == if (all(Pheno == 0)) max(Week) else min(Week[Pheno == 1])) %>%
  ungroup()
# ID Test N_ind Pheno Week
# <chr> <chr> <int> <int> <int>
#1 02A01 Int 78 1 84
#2 02A03 Int 94 0 36
#3 02C03 Int 96 1 12
#4 0202C03 Int 90 0 48
#5 02E02 Int 94 1 36
If all the Pheno values in a group are 0, keep the row with the maximum Week; otherwise keep the row whose Week is the minimum among the Pheno == 1 rows.
Using dplyr and if/else in filter. Note that the condition given to a plain if must be a single logical value, so it has to summarise the whole group; a vectorized test like Pheno == 0 would only look at the group's first row (and errors since R 4.2):

df %>%
  group_by(ID) %>%
  filter(if (any(Pheno == 1)) Week == min(Week[Pheno == 1])
         else Week == max(Week))
  ID      Test  N_ind Pheno  Week
  <chr>   <chr> <int> <dbl> <int>
1 02A01   Int      78     1    84
2 02A03   Int      94     0    36
3 02C03   Int      96     1    12
4 0202C03 Int      90     0    48
5 02E02   Int      94     1    36
We can first group_by ID, then filter to keep only the rows carrying the maximum value of Pheno within each group.
After that, we can filter with the two conditions you want, separated by an OR | statement (Pheno == 1 takes the minimum Week, Pheno == 0 the maximum):

library(dplyr)
df %>%
  group_by(ID) %>%
  filter(Pheno == max(Pheno)) %>%
  filter((Pheno == 0 & Week == max(Week)) | (Pheno == 1 & Week == min(Week)))
# A tibble: 5 x 5
# Groups:   ID [5]
  ID      Test  N_ind Pheno  Week
  <chr>   <chr> <int> <int> <int>
1 02A01   Int      78     1    84
2 02A03   Int      94     0    36
3 02C03   Int      96     1    12
4 0202C03 Int      90     0    48
5 02E02   Int      94     1    36
This is what I came up with:
data %>%
  group_by(ID, Pheno) %>%
  filter(case_when(
    Pheno == 1 ~ Week == min(Week),
    Pheno == 0 ~ Week == max(Week)
  )) %>%
  ungroup(Pheno) %>%
  filter(case_when(
    any(Pheno == 1) ~ Pheno == 1,
    all(Pheno == 0) ~ Pheno == 0
  )) %>%
  ungroup()
Which results in:
# A tibble: 5 x 5
ID Test N_ind Pheno Week
<chr> <chr> <int> <int> <int>
1 02A01 Int 78 1 84
2 02A03 Int 94 0 36
3 02C03 Int 96 1 12
4 0202C03 Int 90 0 48
5 02E02 Int 94 1 36
I am interested in web scraping Pro Football Reference. I need to set up a function that enables me to scrape multiple pages. So far, I have code that seems to be functional. However, I continuously get an error...
library(XML)  # provides readHTMLTable()

scrapeData = function(urlprefix, urlend, startyr, endyr) {
  master = data.frame()
  for (i in startyr:endyr) {
    cat('Loading Year', i, '\n')
    URL = paste(urlprefix, as.character(i), urlend, sep = "")
    table = readHTMLTable(URL, stringsAsFactors = F)[[1]]
    table$Year = i
    master = rbind(table, master)
  }
  return(master)
}

drafts = scrapeData('http://www.pro-football-reference.com/years/', '/draft.htm', 2010, 2010)
When running it, the return is --
Error: failed to load external entity "http://www.pro-football-reference.com/years/2010/draft.htm"
Any advice would be helpful. Thank you.
The "failed to load external entity" error most likely comes from the XML package: readHTMLTable() cannot fetch https pages, and pro-football-reference.com redirects http to https. Reading the page with rvest avoids the problem:

library(tidyverse)
library(rvest)

get_football <- function(year) {
  str_c("https://www.pro-football-reference.com/years/",
        year,
        "/draft.htm") %>%
    read_html() %>%
    html_table() %>%
    pluck(1) %>%
    janitor::row_to_names(1) %>%
    janitor::clean_names() %>%
    mutate(year = year)
}

map_dfr(2010:2015, get_football)
# A tibble: 1,564 × 30
rnd pick tm player pos age to ap1 pb st w_av dr_av g cmp att
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 STL Sam B… QB 22 2018 0 0 5 44 25 83 1855 2967
2 1 2 DET Ndamu… DT 23 2022 3 5 12 100 59 196 0 0
3 1 3 TAM Geral… DT 22 2021 1 6 10 69 65 140 0 0
4 1 4 WAS Trent… T 22 2022 1 9 11 78 58 160 0 0
5 1 5 KAN Eric … DB 21 2018 3 5 5 50 50 89 0 0
6 1 6 SEA Russe… T 21 2020 0 2 9 56 31 131 0 0
7 1 7 CLE Joe H… DB 21 2021 0 3 10 62 39 158 0 0
8 1 8 OAK Rolan… LB 21 2015 0 0 5 25 15 65 0 0
9 1 9 BUF C.J. … RB 23 2017 0 1 3 34 32 90 0 0
10 1 10 JAX Tyson… DT 23 2022 0 0 7 44 33 188 0 0
# … with 1,554 more rows, and 15 more variables
I have a big dataset with over 1000 subjects; a small piece of it looks like this:
mydata <- read.table(header=TRUE, text="
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")
I would like to get the row number of the observation 28 or more days prior to the last observation. E.g. for Id = 01, the last observation is at day 85; 85 minus 28 is 57, which is row number 2. For Id = 02, the last observation is 106; 106 minus 28 is 78, and because day 78 does not exist we use the row number of day 70, which is 1, i.e. the first observation for Id = 02. (I will be getting the row number for each observation separately.)
This should work:
mydata %>%
  group_by(Id) %>%
  mutate(max = max(DAYS) - 28,                    # cutoff: last observation minus 28 days
         row_number = last(which(DAYS <= max)))
# A tibble: 10 x 6
# Groups: Id [2]
Id DAYS QS Event max row_number
<int> <int> <int> <int> <dbl> <int>
1 1 50 1 1 57 2
2 1 57 4 1 57 2
3 1 70 1 1 57 2
4 1 78 2 1 57 2
5 1 85 3 1 57 2
6 2 70 2 1 78 1
7 2 92 4 1 78 1
8 2 98 5 1 78 1
9 2 105 6 1 78 1
10 2 106 7 0 78 1
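If you need the matching row itself rather than its index, the same expression can feed slice() (a sketch on the data above):

mydata %>%
  group_by(Id) %>%
  slice(last(which(DAYS <= max(DAYS) - 28)))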
My goal is to find the half-life (from the terminal phase, if anyone is familiar with pharmacokinetics).
I have some data containing the following:
1500 rows, with ID being the main key; there are 15 rows per ID. The other columns are TIME and CONCENTRATION. What I want to do is, for each ID, remove the first TIME row (which equals 000, numeric), then run lm() on the remaining 14 rows per ID, use abs() to extract the absolute value of the slope, and save this to a new column named THALF. (If anyone is familiar with pharmacokinetics, maybe there is a better way to do this?)
But I have not been able to do this with my limited knowledge of R.
Here is what I've come up with so far:
data_new <- data %>% dplyr::group_by(data $ID) %>% dplyr::filter(data $TIME != 10) %>% dplyr::mutate(THAFL = abs(lm$coefficients[2](data $CONC ~ data $TIME)))
From what I've understood from other Stack Overflow answers, indexing coefficients[2] on a fitted lm object will extract the slope.
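For reference, a minimal sketch of that extraction on a built-in dataset (the indexing has to happen on a fitted model object, not on the lm function itself):

fit <- lm(mpg ~ wt, data = mtcars)  # any fitted model
abs(coef(fit)[2])                   # absolute value of the slope (second coefficient)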
However, I have not been able to make this work. I get this error when trying to run the code:
Error: Problem with `mutate()` input `..1`.
x Input `..1` can't be recycled to size 15.
i Input `..1` is `data$ID`.
i Input `..1` must be size 15 or 1, not 1500.
i The error occurred in group 1: data$ID = "pat1".
Any suggestions on how to solve this? If you need more info, please let me know.
(Also, if anyone is familiar with pharmacokinetics: when they ask for the half-life from the terminal phase, do I run lm() from the concentration maximum? I have a column with the highest observed concentration and the time at which it occurred.)
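On the terminal-phase aside: the usual pharmacokinetic convention is to regress the natural log of concentration against time over the points after the peak and take t1/2 = ln(2) / |slope|; a plain lm() on raw concentrations gives a slope in concentration units, not a half-life. A minimal sketch, assuming the ID/TIME/CONC columns shown in the answers below, not a validated PK workflow:

library(dplyr)

data %>%
  group_by(ID) %>%
  filter(TIME > TIME[which.max(CONC)], CONC > 0) %>%      # terminal phase only; drop zeros before log()
  summarise(lambda_z = abs(coef(lm(log(CONC) ~ TIME))[2]),
            THALF    = log(2) / lambda_z)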
If, after the model fitting, you still need the observations with TIME == 10, you can try summarising after grouping by ID and then using a right join:
data %>%
  filter(TIME != 10) %>%
  group_by(ID) %>%
  summarise(THAFL = abs(lm(CONC ~ TIME)$coefficients[2])) %>%
  right_join(data, by = "ID")
# A tibble: 30 x 16
ID THAFL Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX
<chr> <dbl> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 pat1 0.00975 F 50 135 47 0 2 1 2 0 10 0 Under pat1 60
2 pat1 0.00975 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60
3 pat1 0.00975 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60
4 pat1 0.00975 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60
5 pat1 0.00975 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60
6 pat1 0.00975 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60
7 pat1 0.00975 F 50 135 47 0 2 1 2 0 120 9.00 Under pat1 60
8 pat1 0.00975 F 50 135 47 0 2 1 2 0 150 6.22 Under pat1 60
9 pat1 0.00975 F 50 135 47 0 2 1 2 0 180 4.18 Under pat1 60
10 pat1 0.00975 F 50 135 47 0 2 1 2 0 240 1.82 Under pat1 60
# ... with 20 more rows
If, after the model fitting, you don't want the rows with TIME == 10 to appear in your dataset, you can use mutate:
data %>%
  filter(TIME != 10) %>%
  group_by(ID) %>%
  mutate(THAFL = abs(lm(CONC ~ TIME)$coefficients[2]))
# A tibble: 28 x 16
# Groups: ID [2]
ID Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX THAFL
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 pat1 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60 0.00975
2 pat2 M 75 175 29 0 2 0 0 0 20 6.78 Under pat2 60 0.00835
3 pat1 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60 0.00975
4 pat2 M 75 175 29 0 2 0 0 0 30 11.6 Above pat2 60 0.00835
5 pat1 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60 0.00975
6 pat2 M 75 175 29 0 2 0 0 0 45 13.5 Under pat2 60 0.00835
7 pat1 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60 0.00975
8 pat2 M 75 175 29 0 2 0 0 0 60 13.1 Above pat2 60 0.00835
9 pat1 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60 0.00975
10 pat2 M 75 175 29 0 2 0 0 0 90 9.77 Under pat2 60 0.00835
# ... with 18 more rows
You can use broom:

library(broom)
library(tidyr)   # for unnest()
library(dplyr)

# Code
data %>%
  group_by(ID) %>%
  filter(TIME != 10) %>%
  do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
  unnest(fit) %>%
  filter(term == 'TIME') %>%
  mutate(estimate = abs(estimate))
Output:
# A tibble: 2 x 6
ID term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 pat1 TIME 0.00975 0.00334 -2.92 0.0128
2 pat2 TIME 0.00835 0.00313 -2.67 0.0204
If joining with original data is needed, try:
# Code 2
data <- data %>%
  left_join(data %>%
              group_by(ID) %>%
              filter(TIME != 10) %>%
              do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
              unnest(fit) %>%
              filter(term == 'TIME') %>%
              mutate(estimate = abs(estimate)) %>%
              select(c(ID, estimate)))
Similar to #RicS.
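As a side note (not from the original answers): do() is superseded in recent dplyr, so on dplyr >= 1.1.0 the same pipeline can be written with reframe():

library(dplyr)
library(broom)

data %>%
  filter(TIME != 10) %>%
  group_by(ID) %>%
  reframe(tidy(lm(CONC ~ TIME))) %>%   # one tidy() data frame per ID, row-bound
  filter(term == "TIME") %>%
  mutate(estimate = abs(estimate))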
I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it records what they watched and how long they watched it for.
library(tidyverse)

sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE))
I then want to write some code that randomly assigns people to one of two groups: treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R, so that I can assign each person to be either treated or not. A person should not be both treated and not treated, yet the same person shows up in many tv_show rows. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
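For example, a weighted split instead of a fair coin (the 30/70 weights here are purely for illustration):

treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE,
                          prob = c(0.3, 0.7)))  # hypothetical assignment probabilities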
You can do the below: it groups your observations by person id and draws a single "treated"/"control" label per group:
library(dplyr)
viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results: every id has exactly 1 distinct group (treated or control):
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100),                      # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1))    # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))
Persons %>% count(group) # Check
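As an aside on those extra features, block randomization is one of them (a hedged sketch; the blocking variable here is arbitrary):

library(randomizr)
blocks <- rep(c("A", "B"), each = 50)  # hypothetical blocks, e.g. a pre-treatment covariate
table(block_ra(blocks = blocks))       # balanced 0/1 assignment within each block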
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
I am given a data set called ChickWeight. It has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0.
I first cleaned the data set and kept only the chicks that were recorded for all 12 weigh-ins:
library(datasets)
library(dplyr)

Frequency <- plyr::count(ChickWeight$Chick)  # plyr::count() accepts a vector; dplyr::count() expects a data frame
colnames(Frequency)[colnames(Frequency) == "x"] <- "Chick"
a <- inner_join(ChickWeight, Frequency, by = 'Chick')
complete <- a[(a$freq == 12), ]
head(complete, 3)
The ChickWeight data set ships with base R in the datasets package.
You can try:
library(dplyr)
ChickWeight %>%
  group_by(Chick) %>%
  filter(any(Time == 21)) %>%
  mutate(wdiff = weight - first(weight))
# A tibble: 540 x 5
# Groups: Chick [45]
weight Time Chick Diet wdiff
<dbl> <dbl> <ord> <fct> <dbl>
1 42 0 1 1 0
2 51 2 1 1 9
3 59 4 1 1 17
4 64 6 1 1 22
5 76 8 1 1 34
6 93 10 1 1 51
7 106 12 1 1 64
8 125 14 1 1 83
9 149 16 1 1 107
10 171 18 1 1 129
# ... with 530 more rows
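One caveat: first(weight) relies on the rows being ordered by Time within each Chick, which holds for the built-in ChickWeight. If that ordering were not guaranteed, indexing on Time == 0 would be a safer sketch:

ChickWeight %>%
  group_by(Chick) %>%
  filter(any(Time == 21)) %>%
  mutate(wdiff = weight - weight[Time == 0])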