How to change the values within the group? - r

I created a column of nesting success with a value of 1 if the nest's fate was "rearing" or "fledged", and 0 if the nest's fate was "nest failed". In some cases, the nest's fate was "rearing" on one visit and "nest failed" on a later visit, so the success of a single nest came out as both 1 and 0 (see, for example, nest "D063").
How can I remove the 1s (or set them to NA) and keep only the 0s when the same nest has both 1 and 0 in success?
In other words, I'd like only one success outcome per nest (a single 1 or 0), not several, and I want to keep all the rows.
My data looks like this:
structure(list(date = structure(c(4L, 2L, 1L, 5L, 3L, 1L, 5L,
2L, 1L, 5L, 3L, 1L, 5L, 2L, 1L), .Label = c("14/06/2018", "17/05/2018",
"21/05/2018", "5/05/2018", "6/05/2018"), class = "factor"), nest.code = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("D046",
"D047", "D062", "D063", "W18003"), class = "factor"), year = c(2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L), species = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("AA",
"BB"), class = "factor"), visit = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), eggs = c(1L, 0L, 0L, 1L, 0L,
0L, 2L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), chicks = c(0L, NA, NA,
0L, 1L, 0L, 0L, 2L, 0L, 0L, 1L, 0L, 0L, NA, 1L), fate = structure(c(2L,
4L, 5L, 2L, 4L, 3L, 2L, 4L, 3L, 2L, 4L, 3L, 2L, 5L, 1L), .Label = c("fledged",
"incubating", "nest failed", "rearing", "unknown"), class = "factor"),
success = c(NA, 1L, NA, NA, 1L, 0L, NA, 1L, 0L, NA, 1L, 0L,
NA, NA, 1L)), class = "data.frame", row.names = c(NA, -15L
))
This is the code I tried:
datanew <- data %>%
group_by(year, species, nest.code)%>%
mutate(Real_success = ifelse(success ==1 & 0, 0, success))

I'm not sure how you want the result to look in the end. Do you want all rows preserved? Do you want them ordered in some way? Anyway, this is what I came up with:
UPDATE: Sorry, I missed "fledged" in the first answer.
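library(dplyr)
dat %>%
  group_by(year, species, nest.code) %>%
  # sort so that a 0 (failure) comes before a 1 within each nest
  arrange(year, species, nest.code, success) %>%
  # keep the first value per nest and set the rest to NA
  mutate(success = ifelse(row_number() > 1, NA, success))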
# A tibble: 15 x 9
# Groups: year, species, nest.code [5]
date nest.code year species visit eggs chicks outcome success
<fct> <fct> <int> <fct> <int> <int> <int> <fct> <int>
1 17/05/2018 D046 2018 AA 2 0 NA rearing 1
2 5/05/2018 D046 2018 AA 1 1 0 incubating NA
3 14/06/2018 D046 2018 AA 3 0 NA unknown NA
4 14/06/2018 D047 2018 AA 3 0 0 nest failed 0
5 21/05/2018 D047 2018 AA 2 0 1 rearing NA
6 6/05/2018 D047 2018 AA 1 1 0 incubating NA
7 14/06/2018 D062 2018 AA 3 0 0 nest failed 0
8 17/05/2018 D062 2018 AA 2 0 2 rearing NA
9 6/05/2018 D062 2018 AA 1 2 0 incubating NA
10 14/06/2018 D063 2018 AA 3 0 0 nest failed 0
11 21/05/2018 D063 2018 AA 2 0 1 rearing NA
12 6/05/2018 D063 2018 AA 1 1 0 incubating NA
13 14/06/2018 W18003 2018 BB 3 0 1 fledged 1
14 6/05/2018 W18003 2018 BB 1 1 0 incubating NA
15 17/05/2018 W18003 2018 BB 2 0 NA unknown NA
There is definitely an easier way to do this; I'm no dplyr expert myself.
If it works, I'm happy.

Here's an approach that puts a zero in all rows for nests with at least one fail, a 1 if there is at least one success and no fail, and NA otherwise:
library(dplyr)
mydata %>%
  group_by(year, species, nest.code) %>%
  mutate(real_success = case_when(
    sum(1 - success, na.rm = TRUE) > 0 ~ 0,  # there was at least one fail
    sum(success, na.rm = TRUE) > 0 ~ 1,      # at least one success and no fail
    TRUE ~ NA_real_)) %>%                    # no recorded outcome
  ungroup()
# A tibble: 15 x 10
date nest.code year species visit eggs chicks fate success real_success
<fct> <fct> <int> <fct> <int> <int> <int> <fct> <int> <dbl>
1 5/05/2018 D046 2018 AA 1 1 0 incubating NA 1
2 17/05/2018 D046 2018 AA 2 0 NA rearing 1 1
3 14/06/2018 D046 2018 AA 3 0 NA unknown NA 1
4 6/05/2018 D047 2018 AA 1 1 0 incubating NA 0
5 21/05/2018 D047 2018 AA 2 0 1 rearing 1 0
6 14/06/2018 D047 2018 AA 3 0 0 nest fail… 0 0
7 6/05/2018 D062 2018 AA 1 2 0 incubating NA 0
8 17/05/2018 D062 2018 AA 2 0 2 rearing 1 0
9 14/06/2018 D062 2018 AA 3 0 0 nest fail… 0 0
10 6/05/2018 D063 2018 AA 1 1 0 incubating NA 0
11 21/05/2018 D063 2018 AA 2 0 1 rearing 1 0
12 14/06/2018 D063 2018 AA 3 0 0 nest fail… 0 0
13 6/05/2018 W18003 2018 BB 1 1 0 incubating NA 1
14 17/05/2018 W18003 2018 BB 2 0 NA unknown NA 1
15 14/06/2018 W18003 2018 BB 3 0 1 fledged 1 1
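The same rule (any 0 gives 0, otherwise any 1 gives 1, otherwise NA) can also be sketched in base R with ave(), which keeps every row. This is only an illustration: the toy data frame nests and the helper nest_outcome() are hypothetical names, not part of the question's data.

```r
# Hypothetical toy data: three nests with visit-level success codes
nests <- data.frame(
  nest.code = c("A", "A", "A", "B", "B", "C", "C"),
  success   = c(NA, 1, 0, NA, 1, NA, NA)
)

# Hypothetical helper: collapse the visit-level codes of one nest
# into a single outcome (any 0 -> 0, else any 1 -> 1, else NA)
nest_outcome <- function(s) {
  if (any(s == 0, na.rm = TRUE)) {
    0
  } else if (any(s == 1, na.rm = TRUE)) {
    1
  } else {
    NA_real_
  }
}

# ave() applies the function per nest and returns a vector as long as the input,
# so all rows are kept
nests$real_success <- ave(
  nests$success, nests$nest.code,
  FUN = function(s) rep(nest_outcome(s), length(s))
)
print(nests)
```

Nest A (one success followed by a failure) ends up as 0 on every row, nest B as 1, and nest C, which has no recorded outcome, stays NA.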

Related

time series plot for missing data

I have some sequence event data for which I want to plot the trend of missingness on value across time. Example below:
id time value
1 aa122 1 1
2 aa2142 1 1
3 aa4341 1 1
4 bb132 1 2
5 bb2181 2 1
6 bb3242 2 3
7 bb3321 2 NA
8 cc122 2 1
9 cc2151 2 2
10 cc3241 3 1
11 dd161 3 3
12 dd2152 3 NA
13 dd3282 3 NA
14 ee162 3 1
15 ee2201 4 2
16 ee3331 4 NA
17 ff1102 4 NA
18 ff2141 4 NA
19 ff3232 5 1
20 gg142 5 3
21 gg2192 5 NA
22 gg3311 5 NA
23 gg4362 5 NA
24 ii111 5 NA
The NAs are supposed to increase over time (the behaviors are fading). How do I plot the number of NAs across time?
I think this is what you're looking for: you want to see how many NAs appear over time. Assuming that is correct, treat each time as a group and count the number of NAs in each group.
data:
df <- structure(list(id = structure(1:24, .Label = c("aa122", "aa2142",
"aa4341", "bb132", "bb2181", "bb3242", "bb3321", "cc122", "cc2151",
"cc3241", "dd161", "dd2152", "dd3282", "ee162", "ee2201", "ee3331",
"ff1102", "ff2141", "ff3232", "gg142", "gg2192", "gg3311", "gg4362",
"ii111"), class = "factor"), time = c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L), value = c(1L, 1L, 1L, 2L, 1L, 3L, NA, 1L, 2L, 1L, 3L,
NA, NA, 1L, 2L, NA, NA, NA, 1L, 3L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-24L))
library(tidyverse)
library(ggplot2)
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value)))
# A tibble: 5 × 2
time sumNA
<int> <int>
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
You can then plot this using ggplot2
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value))) %>%
ggplot(aes(x=time)) +
geom_line(aes(y=sumNA))
As you can see, the number of NAs increases as time increases.
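Because the time points do not all have the same number of rows, the share of missing values per time point may be easier to compare than the raw count. A minimal base R sketch, assuming the same time and value columns as in the example data (the name prop_na is just illustrative):

```r
# Example data from the question, reduced to the two columns used here
df <- data.frame(
  time  = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  value = c(1, 1, 1, 2, 1, 3, NA, 1, 2, 1, 3, NA, NA, 1,
            2, NA, NA, NA, 1, 3, NA, NA, NA, NA)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of NAs at each time point
prop_na <- tapply(is.na(df$value), df$time, mean)
print(prop_na)

# Line plot of the proportion of missing values over time
plot(as.integer(names(prop_na)), prop_na, type = "b",
     xlab = "time", ylab = "proportion NA")
```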

How can I combine rows based on a specific parameter in R

I have a dataframe which looks like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 1 0 0 65 M 0 12-2009
2 1 0 1 65 M 0 21-2009
3 1 0 1 65 M 0 23-2009
4 2 1 0 67 M 0 19-2010
5 2 1 0 67 M 0 21-2010
6 2 1 1 67 M 1 01-2011
7 2 1 1 67 M 1 02-2011
8 3 2 1 77 F 0 09-2015
9 3 2 1 77 F 1 10-2015
10 3 2 1 77 F 1 10-2015
I would like to know whether it is possible to combine my rows to get a dataset like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 0 1 65 M 0 12-2009
2 1 1 67 M 1 19-2010
3 2 1 77 F 1 09-2015
I have tried using the unique function, but it doesn't give me the desired output and still repeats the ID over multiple rows.
This is an example of the code I've tried:
Data2<-unique(Data)
I do not just want the first row because I want to include each column status. For example, just getting the first row would not include the COPD status which occurs in the later rows for each ID.
Alternative Solution:
library(dplyr)
d %>%
  group_by(ID, Age, Sex, Smoker) %>%
  summarise(Asthma = !is.na(match(1, Asthma)),
            COPD = !is.na(match(1, COPD)),
            Event_Date = first(Event_Date)) %>%
  ungroup() %>%
  mutate_if(is.logical, as.numeric)
# A tibble: 3 x 7
ID Age Sex Smoker Asthma COPD Event_Date
<int> <int> <fct> <int> <dbl> <dbl> <fct>
1 1 65 M 0 1 0 12-2009
2 2 67 M 1 1 1 19-2010
3 3 77 F 2 1 1 09-2015
If you want to get the (first) row for each ID you can try something like this:
d <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
Smoker = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Asthma = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
Age = c(65L, 65L, 65L, 67L, 67L, 67L, 67L, 77L, 77L, 77L),
Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
.Label = c("F", "M"), class = "factor"),
COPD = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
Event_Date = structure(c(5L, 7L, 9L, 6L, 8L, 1L, 2L, 3L, 4L, 4L),
.Label = c("01-2011", "02-2011", "09-2015",
"10-2015", "12-2009", "19-2010",
"21-2009", "21-2010", "23-2009"),
class = "factor")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
d[!duplicated(d$ID), ]
# ID Smoker Asthma Age Sex COPD Event_Date
# 1 1 0 0 65 M 0 12-2009
# 4 2 1 0 67 M 0 19-2010
# 8 3 2 1 77 F 0 09-2015
Use max when you need a value that appears further down, and dplyr::first for the others. Here is an example:
library(dplyr)
df %>% group_by(ID) %>% summarise(Smoker=first(Smoker), Asthma=max(Asthma, na.rm = TRUE))
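The two ideas above (first row for the stable columns, max for flags that switch on later) can also be combined without dplyr. This is a rough base R sketch on a cut-down copy of the example data; the names flags, first_rows and combined are only illustrative.

```r
# Cut-down copy of the example data (only the columns used here)
d <- data.frame(
  ID         = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Smoker     = c(0, 0, 0, 1, 1, 1, 1, 2, 2, 2),
  Asthma     = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1),
  COPD       = c(0, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Event_Date = c("12-2009", "21-2009", "23-2009", "19-2010", "21-2010",
                 "01-2011", "02-2011", "09-2015", "10-2015", "10-2015")
)

# A flag is 1 for an ID if it was ever 1 in any of that ID's rows
flags <- aggregate(cbind(Asthma, COPD) ~ ID, data = d, FUN = max)

# Columns that should simply come from the first row of each ID
first_rows <- d[!duplicated(d$ID), c("ID", "Smoker", "Event_Date")]

# One row per ID with both kinds of columns
combined <- merge(first_rows, flags, by = "ID")
print(combined)
```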

Create n data sets from one data set without repetition using stratified sampling

I have a data set train with, say, 500 rows. I would like to get a data frame with n columns, each containing 500/n values (row numbers that do not repeat in the other columns), based on stratified sampling of a column in train, say train$y.
I have tried the following, but it returns duplicate values:
library(caret)
n <- 10 # I want to divide my data set in to 10 parts
data_partition <- createDataPartition(y = train$y, times = 10,
p = 1/n, list = F)
To summarize with an example:
If I have a data set train with 100 rows and a column train$y (values 0 or 1), I would like to get 10 data sets of 10 rows each from train. They should be stratified on train$y, and no row should appear in more than one of them.
Example input:
ID x y
1 1 0
2 2 0
3 3 1
4 1 1
5 2 1
6 4 1
7 4 0
8 4 1
9 3 1
10 1 1
11 2 1
12 3 0
13 4 1
14 5 1
15 6 1
16 10 1
17 9 1
18 3 0
19 7 0
20 8 1
Expected output (the first four columns, with the contents of each set shown alongside):
ID   x  y  sample   set 1        set 2        set 3
 1   1  0  set 2    ID   x  y    ID   x  y    ID   x  y
 2   2  0  set 3     8   4  1    11   2  1    17   9  1
 3   3  1  set 3     9   3  1    12   3  0     5   2  1
 4   1  1  set 3    10   1  1    13   4  1     6   4  1
 5   2  1  set 3    18   3  0     1   1  0     7   4  0
 6   4  1  set 3    19   7  0    14   5  1     2   2  0
 7   4  0  set 3    20   8  1    15   6  1     3   3  1
 8   4  1  set 1                 16  10  1     4   1  1
 9   3  1  set 1
10   1  1  set 1
11   2  1  set 2
12   3  0  set 2
13   4  1  set 2
14   5  1  set 2
15   6  1  set 2
16  10  1  set 2
17   9  1  set 3
18   3  0  set 1
19   7  0  set 1
20   8  1  set 1
In the example above, the input is ID, x and y. I would like to get the column sample, which lets me split the data into the 3 tables on the right whenever I want.
Note that y has 14 ones and 6 zeros, a ratio of 70:30, and the output sets have roughly the same ratio.
Sample dataset in a copy/run friendly format:
data <- structure(list(ID = 1:20, x = c(1L, 2L, 3L, 1L, 2L, 4L, 4L, 4L,
3L, 1L, 2L, 3L, 4L, 5L, 6L, 10L, 9L, 3L, 7L, 8L), y = c(0L, 0L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L,
0L, 1L)), .Names = c("ID", "x", "y"), class = "data.frame", row.names = c(NA,
-20L))
It can be done using the caret package. Try the code below:
# Creating dataset
data <- structure(list(ID = 1:20, x = c(1L, 2L, 3L, 1L, 2L, 4L, 4L, 4L,
3L, 1L, 2L, 3L, 4L, 5L, 6L, 10L, 9L, 3L, 7L, 8L), y = c(0L, 0L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L,
0L, 1L)), .Names = c("ID", "x", "y"), class = "data.frame", row.names = c(NA, -20L))
# Solution
library(caret)
k <- createFolds(data$y, k = 3, list = FALSE)
addmargins(table(k,data$y))
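For readers who want to see the idea behind non-overlapping stratified folds without caret, here is a rough base R sketch. It is not how createFolds is implemented; it simply shuffles the rows within each class of y and deals them out to the folds in turn, so every row lands in exactly one fold. The variable names and the seed are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# y column from the example data (14 ones, 6 zeros)
y <- c(0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1)
k <- 3  # number of non-overlapping sets

fold <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  idx <- idx[sample.int(length(idx))]             # shuffle rows within this class
  fold[idx] <- rep_len(seq_len(k), length(idx))   # deal them out round-robin
}

# Each row has exactly one fold label; classes are spread evenly across folds
print(addmargins(table(fold, y)))
```

With this data each fold receives 2 of the 6 zeros and 4 or 5 of the 14 ones, so the 70:30 ratio is roughly preserved in every set.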

Calculate the rate within the same id using R

I have a question about calculating the rate within the same id number.
Here is the sample dataset d:
id answer
1 1
1 0
1 0
1 1
1 1
1 1
1 0
2 0
2 0
2 0
3 1
3 0
The ideal output is
id rate freq
1 4/7 (=0.5714) 7
2 0 3
3 1/2 (=0.5) 2
Thanks.
Just for fun, you can use aggregate
> aggregate(answer~id, function(x) c(rate=mean(x), freq=length(x)), data=df1)
id answer.rate answer.freq
1 1 0.5714286 7.0000000
2 2 0.0000000 3.0000000
3 3 0.5000000 2.0000000
Try
library(data.table)
setDT(df1)[,list(rate= mean(answer), freq=.N) ,id]
# id rate freq
#1: 1 0.5714286 7
#2: 2 0.0000000 3
#3: 3 0.5000000 2
Or
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(rate=mean(answer), freq=n())
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), answer = c(1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L)), .Names = c("id", "answer"), class = "data.frame",
row.names = c(NA, -12L))
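The ideal output in the question also shows the rate as a fraction such as 4/7. If that label is wanted, a small base R sketch along these lines could build it; the column layout and the names hits, freq and res are assumptions, not part of the answers above.

```r
df1 <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  answer = c(1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L)
)

ids  <- sort(unique(df1$id))                             # tapply() returns groups in this order
hits <- as.integer(tapply(df1$answer, df1$id, sum))      # number of 1s per id
freq <- as.integer(tapply(df1$answer, df1$id, length))   # rows per id

res <- data.frame(
  id   = ids,
  rate = sprintf("%d/%d (=%.4f)", hits, freq, hits / freq),
  freq = freq
)
print(res)
```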

How to change the date in the data frame

I have a large data frame like this:
ID 1 2 3 4 5 type c new_ee first_ee_t
A 20051110 20051111 20051114 20051208 20060105 DATE 1 none none
A NA 1 3 24 2 diff_date 2 1 20051110
B 20050422 20050613 20050711 20071023 NA DATE 1 none none
B NA 52 28 834 999 diff_date 2 1 20050422
C 20021206 20040224 20040423 20040507 20040528 DATE 1 none none
C NA 445 59 14 21 diff_date 2 1 20021206
D 20030708 20050228 20050228 20050815 20050915 DATE 1 none none
D NA 601 0 168 31 diff_date 2 1 20030708
E 20000123 20040306 20060919 20060919 20060920 DATE 1 none none
E NA 1504 927 0 1 diff_date 2 1 20000123
F 20070413 NA NA NA NA DATE 1 none none
F NA 999 999 999 999 diff_date 2 0 0
G 20020318 20020411 NA NA NA DATE 1 none none
G NA 24 999 999 999 diff_date 2 0 0
I need to update the first_ee_t variable. For each ID, if the gap between the first and second dates is more than 365 days, first_ee_t should become the second date;
if both the first-to-second and the second-to-third gaps are more than 365 days, it should become the third date,
such as:
ID 1 2 3 4 5 type c new_ee first_ee_t
A 20051110 20051111 20051114 20051208 20060105 DATE 1 none none
A NA 1 3 24 2 diff_date 2 1 20051110
B 20050422 20050613 20050711 20071023 NA DATE 1 none none
B NA 52 28 834 999 diff_date 2 1 20050422
C 20021206 20040224 20040423 20040507 20040528 DATE 1 none none
C NA 445 59 14 21 diff_date 2 1 20040224
D 20030708 20050228 20050228 20050815 20050915 DATE 1 none none
D NA 601 0 168 31 diff_date 2 1 20050228
E 20000123 20040306 20060919 20060919 20060920 DATE 1 none none
E NA 1504 927 0 1 diff_date 2 1 20060919
F 20070413 NA NA NA NA DATE 1 none none
F NA 999 999 999 999 diff_date 2 0 0
G 20020318 20020411 NA NA NA DATE 1 none none
G NA 24 999 999 999 diff_date 2 0 0
Assuming your expected output above has a few errors, I think this is what you are after
#first, here's the data in a copy/paste-able form
dd <-
structure(list(ID = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L), .Label = c("A", "B", "C", "D", "E",
"F", "G"), class = "factor"), X1 = c(20051110L, NA, 20050422L,
NA, 20021206L, NA, 20030708L, NA, 20000123L, NA, 20070413L, NA,
20020318L, NA), X2 = c(20051111L, 1L, 20050613L, 52L, 20040224L,
445L, 20050228L, 601L, 20040306L, 1504L, NA, 999L, 20020411L,
24L), X3 = c(20051114L, 3L, 20050711L, 28L, 20040423L, 59L, 20050228L,
0L, 20060919L, 927L, NA, 999L, NA, 999L), X4 = c(20051208L, 24L,
20071023L, 834L, 20040507L, 14L, 20050815L, 168L, 20060919L,
0L, NA, 999L, NA, 999L), X5 = c(20060105L, 2L, NA, 999L, 20040528L,
21L, 20050915L, 31L, 20060920L, 1L, NA, 999L, NA, 999L), type = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("DATE",
"diff_date"), class = "factor"), c = c(1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), new_ee = structure(c(3L, 2L,
3L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 3L, 1L, 3L, 1L), .Label = c("0",
"1", "none"), class = "factor"), first_ee_t = c("none", "20051110",
"none", "20050422", "none", "20021206", "none", "20030708", "none",
"20000123", "none", "0", "none", "0")), .Names = c("ID", "X1",
"X2", "X3", "X4", "X5", "type", "c", "new_ee", "first_ee_t"), row.names = c(NA,
-14L), class = "data.frame")
And now, here's the code that will do the transformation:
result <- unsplit(lapply(split(dd, dd$ID), function(x) {
  if (all(is.na(x[1, 4:6]))) {
    x[2, "first_ee_t"] <- 0
  } else {
    first <- min(which(x[2, 2:6] < 365))
    if (is.finite(first)) {
      x[2, "first_ee_t"] <- x[1, first]
    }
  }
  x
}), dd$ID)
This is assuming that each ID has exactly two rows and that the second row always contains the datediffs and the first always contains the dates themselves.
This does produce a warning in the case of ID F who seems to have no values that satisfy the requirements so that row is left untouched.
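The warning comes from min() being called on an empty result: when no gap is below 365, which() returns integer(0) and min() returns Inf with a warning. A minimal sketch of a warning-free lookup is shown below; first_short_gap() is a hypothetical helper, not part of the answer's code.

```r
# Hypothetical helper: position of the first gap shorter than `limit` days,
# or NA when no gap qualifies (instead of Inf plus a warning)
first_short_gap <- function(diffs, limit = 365) {
  hits <- which(diffs < limit)  # which() silently drops NA comparisons
  if (length(hits) == 0) NA_integer_ else hits[1]
}

first_short_gap(c(NA, 445, 59, 14, 21))     # third position: the 59-day gap
first_short_gap(c(NA, 999, 999, 999, 999))  # no qualifying gap, so NA
```

Inside the lapply() above, the is.finite(first) check would then become !is.na(first).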
