apply diff() only on consecutive days - r

I have the following data and I would like to apply the function diff() only on consecutive days: diff(data$ch, differences = 1, lag = 1) returns the differences between all consecutive values of ch (23-12, 4-23, 78-4, 120-78, 94-120, ...). I would like the diff() function to return NA when the dates are not consecutive. The output I am trying to obtain from the data below is:
11, -19, 74, NA, -26, NA, -34, 39, NA
Is there anyone who knows how I can do that?
Date ch
2013-01-01 12
2013-01-02 23
2013-01-03 4
2013-01-04 78
2013-01-10 120
2013-01-11 94
2013-02-26 36
2013-02-27 2
2013-02-28 41
2003-03-05 22

You can do these in base R without installing any external packages.
Assuming that the 'Date' column is of Date class, we take the diff of 'Date' and check whether the difference between adjacent elements is greater than 1. Taking the cumulative sum (cumsum) of that logical vector gives a grouping index ('indx') that increases by one at every gap.
indx <- cumsum(c(TRUE,abs(diff(df1$Date))>1))
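In the second step, we can use ave with 'indx' as the grouping vector and take the diff of 'ch' within each group. Within a group, the output of diff is one element shorter than its input, so we append NA to keep the lengths equal. The result is aligned with the rows of 'df1', so it ends with one extra NA compared with the output of a plain diff.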
ave(df1$ch, indx, FUN=function(x) c(diff(x),NA))
#[1] 11 -19 74 NA -26 NA -34 39 NA NA
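To see how the grouping index behaves, here is a small sketch that rebuilds the same values from date strings (keeping the 2003 date of the last row exactly as posted) and prints the intermediate vector. The names `gap` and `grp_diff` are only illustrative.

```r
# Same values as the question, built from strings rather than dput output
df1 <- data.frame(
  Date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-10", "2013-01-11", "2013-02-26", "2013-02-27",
                   "2013-02-28", "2003-03-05")),
  ch = c(12L, 23L, 4L, 78L, 120L, 94L, 36L, 2L, 41L, 22L)
)

# TRUE wherever the step to the next row is not a single day
gap <- abs(diff(df1$Date)) > 1

# A new group starts at the first row and after every gap
indx <- cumsum(c(TRUE, gap))
indx
#[1] 1 1 1 1 2 2 3 3 3 4

# diff within each group, padded with NA so it lines up with the rows
grp_diff <- ave(df1$ch, indx, FUN = function(x) c(diff(x), NA))

# Dropping the last element gives the nine values asked for in the question
head(grp_diff, -1)
#[1]  11 -19  74  NA -26  NA -34  39  NA
```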
data
df1 <- structure(list(Date = structure(c(15706, 15707, 15708, 15709,
15715, 15716, 15762, 15763, 15764, 12116), class = "Date"), ch = c(12L,
23L, 4L, 78L, 120L, 94L, 36L, 2L, 41L, 22L)), .Names = c("Date",
"ch"), row.names = c(NA, -10L), class = "data.frame")

The following just "...returns NA when the dates are not consecutive", unless there are tricky cases that it won't account for:
replace(diff(df1$ch), abs(diff(df1$Date)) > 1, NA)
#[1] 11 -19 74 NA -26 NA -34 39 NA
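One tricky case, as an assumption about what might occur in real data, is a repeated date: the gap is then 0 days, which the `> 1` test does not flag. The sketch below uses made-up values to compare that test with a stricter `!= 1` test.

```r
# Hypothetical data with a duplicated date (2013-01-02 appears twice)
dates <- as.Date(c("2013-01-01", "2013-01-02", "2013-01-02", "2013-01-03"))
ch <- c(12, 23, 4, 78)

# Gap to the previous row, in days
gap <- as.numeric(diff(dates))

# Only gaps longer than one day become NA; the 0-day gap is kept
loose <- replace(diff(ch), abs(gap) > 1, NA)
loose
#[1]  11 -19  74

# Anything other than exactly one day becomes NA
strict <- replace(diff(ch), gap != 1, NA)
strict
#[1] 11 NA 74
```

Which of the two is right depends on whether same-day rows should count as consecutive.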

Try this with the lubridate and dplyr packages.
If you don't have them, install them once with install.packages("dplyr"); install.packages("lubridate")
Code
library(lubridate)
library(dplyr)
data$Date <- ymd(data$Date)
data2 <- data %>% mutate(diff=ifelse(Date==lag(Date)+days(1), ch-lag(ch), NA))
Data
data <-
data.frame(Date=c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-10",
"2013-01-11", "2013-01-26", "2013-01-27", "2013-01-28", "2013-03-05"),
ch=c(12, 23, 4, 78, 120, 94, 36, 2, 41, 22))


How to join and bind similar dataframes in R, iterative solutions welcome

In R Markdown through RStudio (R v. 4.0.3), I'm looking for a better way to combine similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter and then into bind_rows() wasn't working; it fails with this error message:
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
I have 23 dataframes (let's call these "semester data") that I'm looking to combine into a single dataframe (intended to be a single dataset of individuals' outcomes from semester to semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id    ACT_math  course_code  section_code  term_code  grade  term_GPA
0001  23        101          001           FA12       3.45   3.8
0002  28        201          003           FA12       3.2    3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
Some data (ACT score, gender, state residency, etc.) is stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programmatic_database using the first semester of data, SP11. To cut down on the duplication of stable data for an individual, I selected only the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id    ACT_math  course_code  section_code  section_code.db  section_code.db.db  term_code  grade.sp11  grade.fa12  grade.sp13  grade.sp15  term_GPA.sp11  term_GPA.fa12  term_GPA.sp15
0001  23        102          001           001              001                 FA12       3.45        3.8         3.0         -           3.8            3.7            -
0002  28        201          003           003              003                 FA12       3.2         3.4         3.0         -           3.8            3.7            -
1020  28        201          003           003              003                 FA12       3.2         3.4         -           -           3.8            3.7            -
6783  30        101          -             -                -                   SP15       -           -           -           3.8         -              -              4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
Is there an easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's better-looking code than doing each of the 23 joins/binds by hand.
Thank you for assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666)), row.names = c(NA, 4L
), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153)), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from spring 2012 database, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153
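On the automation question, one possible sketch is to keep the semester tables in a named list and stack them in a single call instead of 23 separate steps. It uses base R only; the miniature tables and the names `semesters` and `standardise` are hypothetical stand-ins for the real 45-47 column data.

```r
# Hypothetical miniature semester tables with the same type mismatch as above
fa11 <- data.frame(id = c("1001", "1013"), course_code = c("101", "101"),
                   term_code = c("FA11", "FA11"), grade = c(4, 2.5),
                   stringsAsFactors = FALSE)
sp12 <- data.frame(id = c("1013", "1355"), course_code = c(102L, 101L),
                   term_code = c(NA_integer_, NA_integer_), grade = c(2.5, 2),
                   stringsAsFactors = FALSE)

# One named list; the names double as term labels
semesters <- list(FA11 = fa11, SP12 = sp12)

# Make column types consistent and fill term_code from the list name
standardise <- function(df, term) {
  df$course_code <- as.character(df$course_code)
  df$term_code <- term
  df
}

# Apply to every table, then stack the results into one long table
long <- do.call(rbind, Map(standardise, semesters, names(semesters)))
rownames(long) <- NULL
long$term_code
#[1] "FA11" "FA11" "SP12" "SP12"
```

rbind needs identical columns in every table. If the real semester tables differ in their columns, dplyr::bind_rows() on the same list fills the missing ones with NA, and its .id argument can record the list name.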

How to remove a list of observations from a dataframe with dplyr in R?

This is my dataframe x
ID Name Initials AGE
123 Mike NA 18
124 John NA 20
125 Lily NA 21
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
129 Oscar NA 32
I also have a numeric vector y (num [1:3]) of IDs that I want to remove from data frame x:
>print(y)
[1] 124 125 129
My goal is to remove all the IDs in y from data frame x.
This is my desired output
ID Name Initials AGE
123 Mike NA 18
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
I'm using the dplyr package and trying this, but it's not working:
FinalData <- x %>%
select(everything()) %>%
filter(ID != c(y))
Can anyone tell me what needs to be corrected?
The != operator compares element by element and recycles 'y', so it does not test membership. When the length of 'y' is greater than 1, we can use %in% and negate it with !. The select step is not needed, as everything() selects all the columns anyway.
library(dplyr)
x %>%
filter(!ID %in% y)
# ID Name Initials AGE
#1 123 Mike NA 18
#2 126 Jasper NA 24
#3 127 Toby NA 27
#4 128 Will NA 19
Or another option is anti_join
x %>%
anti_join(tibble(ID = y))
In base R, subset can be used
subset(x, !ID %in% y)
data
y <- c(124, 125, 129)
x <- structure(list(ID = 123:129, Name = c("Mike", "John", "Lily",
"Jasper", "Toby", "Will", "Oscar"), Initials = c(NA, NA, NA,
NA, NA, NA, NA), AGE = c(18L, 20L, 21L, 24L, 27L, 19L, 32L)),
class = "data.frame", row.names = c(NA,
-7L))
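To illustrate why the original filter(ID != c(y)) removes nothing, here is a base R sketch of the two comparisons on the same IDs. The names `recycled` and `membership` are only illustrative.

```r
ID <- 123:129
y <- c(124, 125, 129)

# != recycles y along ID: 123 vs 124, 124 vs 125, 125 vs 129, 126 vs 124, ...
# R also warns that the lengths are not multiples of each other
recycled <- suppressWarnings(ID != y)
all(recycled)
#[1] TRUE

# %in% asks, for each ID, whether it appears anywhere in y
membership <- !(ID %in% y)
ID[membership]
#[1] 123 126 127 128
```

Every position happens to differ under recycling here, so every row is kept, which matches the behaviour described in the question.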

R programming_ Subsetting rows on logic conditions

Sample data:
sampleData
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
.........
I want to extract the records that meet the condition $ozone > 31.
Here is the code:
data <- sampleData[sampleData$ozone > 31]
I get the error below:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
How should I correct it? Thanks!
R is case sensitive, so your ozone has to match the name in your data.frame. Also to subset a data.frame, you need two indices (row and column) separated by a comma. If there is nothing after the comma, it means that you are selecting all the columns:
sampleData[sampleData$Ozone > 31,]
Other methods to subset a data.frame:
subset(sampleData, Ozone > 31)
or with dplyr:
library(dplyr)
sampleData %>%
filter(Ozone > 31)
Result:
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
Data:
sampleData = structure(list(Ozone = c(41L, 36L, 12L), Solar.R = c(190L, 118L,
149L), Wind = c(7.4, 8, 12.6), Temp = c(67L, 72L, 74L), Month = c(5L,
5L, 5L), Day = 1:3, sampleData.Ozone = c(41L, 36L, 12L)), .Names = c("Ozone",
"Solar.R", "Wind", "Temp", "Month", "Day", "sampleData.Ozone"
), class = "data.frame", row.names = c("1", "2", "3"))
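The sketch below walks through both points on a small made-up data frame. It also includes a missing Ozone value, which is an assumption (the three sample rows shown have none), to show how a plain logical index and which() differ in that case.

```r
# Hypothetical data; the NA is added for illustration
sampleData <- data.frame(Ozone = c(41L, NA, 12L, 36L),
                         Temp = c(67L, 72L, 74L, 62L))

# Wrong case: there is no column called 'ozone', so $ returns NULL
is.null(sampleData$ozone)
#[1] TRUE

# Correct name plus the comma selects rows; the NA condition yields an all-NA row
with_na <- sampleData[sampleData$Ozone > 31, ]
nrow(with_na)
#[1] 3

# which() keeps only the positions where the condition is TRUE
no_na <- sampleData[which(sampleData$Ozone > 31), ]
no_na$Ozone
#[1] 41 36
```

subset() and dplyr's filter() also drop rows where the condition is NA.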

Subset multiple observations by ID then select first observation by time

I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID V1 V2 V3 time
1 35 100 5.2 2015-07-03 07:49
2 25 111 6.2 2015-04-01 11:52
3 41 120 NA 2015-04-01 14:17
1 25 NA NA 2015-07-03 07:51
2 NA 122 6.2 2015-04-01 11:50
3 40 110 4.1 2015-04-01 14:25
I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).
The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time
tapply(Data, ID, min(time)
but get an error saying
Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID V1 V2 V3
1 35 100 5.2
2 25 122 6.2
3 41 120 4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
cbind(2,25,111,6.2,"2015-04-01 11:52"),
cbind(3,41,120,4.2,"2015-04-01 14:17"),
cbind(1,25,NA,NA,"2015-07-03 07:51"),
cbind(2,NA,122,6.2,"2015-04-01 11:50"),
cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
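If the goal is the other reading of the question, the earliest non-missing value of each variable per ID, a base R sketch could look like the following. It uses the values from the table in the question (where V3 of ID 3 at 14:17 is NA), assumes UTC for the timestamps, and the helper name `first_non_na` is hypothetical.

```r
Data <- data.frame(
  ID = c(1L, 2L, 3L, 1L, 2L, 3L),
  V1 = c(35L, 25L, 41L, 25L, NA, 40L),
  V2 = c(100L, 111L, 120L, NA, 122L, 110L),
  V3 = c(5.2, 6.2, NA, NA, 6.2, 4.1),
  time = as.POSIXct(c("2015-07-03 07:49", "2015-04-01 11:52", "2015-04-01 14:17",
                      "2015-07-03 07:51", "2015-04-01 11:50", "2015-04-01 14:25"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# First value that is not NA (NA if the whole vector is missing)
first_non_na <- function(x) x[!is.na(x)][1]

# Sort by ID and time so "first" means "earliest"
ordered <- Data[order(Data$ID, Data$time), ]

# Apply the helper to each variable separately within each ID
first_obs <- aggregate(ordered[c("V1", "V2", "V3")],
                       by = list(ID = ordered$ID),
                       FUN = first_non_na)
first_obs
#  ID V1  V2  V3
#1  1 35 100 5.2
#2  2 25 122 6.2
#3  3 41 120 4.1
```

Because each column is handled independently, the values in one output row can come from different timestamps, which is what the question's expected output shows for ID 2 and ID 3.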

R dplyr: min max function not working in mutate

I have an issue with dplyr I cannot resolve. Also I do not have a full workable example, since the problem only occurs with the full set of data (that I cannot share with you).
I do the following:
t %>% group_by(id, add=TRUE) %>%
summarise(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
Which results in
id minbplevel maxbplevel
(dbl) (dbl) (dbl)
1 B 33.0 73.0
2 A 39.4 80.4
But when I do
t %>% group_by(id, add=TRUE) %>%
mutate(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
It results in:
id Level refparmax refparmin ref meanbptest minbplevel maxbplevel
(dbl) (chr) (int) (int) (dbl) (dbl) (dbl) (dbl)
1 B 0SD 69 68 49.0 52.00000 33 73
2 B min1SD 69 68 41.0 52.00000 33 73
3 B min2SD 69 68 33.0 52.00000 33 73
4 B plus1SD 69 68 59.0 52.00000 33 73
5 B plus2SD 69 68 73.0 52.00000 33 73
6 A 0SD 100 95 56.4 35.33333 NA NA
7 A min1SD 100 95 47.4 35.33333 NA NA
8 A min2SD 100 95 39.4 35.33333 NA NA
9 A plus1SD 100 95 67.4 35.33333 NA NA
10 A plus2SD 100 95 80.4 35.33333 NA NA
Why the NA's in case A are produced, I have no clue. It seems that each time I try it on a subset of the data, the second case with data is the problem, but that is just a hunch.
It is only one case of the 18850 that gives this issue, but there is nothing identifiable that makes the problem case different than the rest.
Please advise on what I can try to solve this.
I can think of workarounds, creating the summarized data and then merging the result with the original data. But I thought that dplyr would allow me to do this in one step.
I tried removing or adding the add = TRUE option. That does not make any difference.
Maybe I am using this in the wrong way.
Based on comment I tried:
subset(with(t,aggregate(ref~id, t, FUN= min, na.rm=TRUE, na.action= na.pass)),id %in% c(caseA,caseB))
Which results in
id ref
4 B 33.0
5 A 39.4
I have to mask some parts of the data.
dput(head(subset(t,id %in% c(caseA,caseB)) , 12))
gives:
Again I replaced the actual id's with variables caseB and caseA. Also this is not the full dataset in which the problem occurs.
structure(list(id = c(caseB, caseB, caseB, caseB, caseB,
caseA, caseA, caseA, caseA, caseA), Level = c("0SD", "min1SD",
"min2SD", "plus1SD", "plus2SD", "0SD", "min1SD", "min2SD", "plus1SD",
"plus2SD"), refparmax = c(69L, 69L, 69L, 69L, 69L, 100L, 100L,
100L, 100L, 100L), refparmin = c(68L, 68L, 68L, 68L, 68L, 95L,
95L, 95L, 95L, 95L), ref = c(49, 41, 33, 59, 73, 56.4, 47.4,
39.4, 67.4, 80.4), meanbptest = c(52, 52, 52, 52, 52, 35.3333333333333,
35.3333333333333, 35.3333333333333, 35.3333333333333, 35.3333333333333
)), .Names = c("id", "Level", "refparmax", "refparmin", "ref",
"meanbptest"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), vars = list(id), drop = TRUE, indices = list(
0:4, 5:9), group_sizes = c(5L, 5L), biggest_group_size = 5L, labels = structure(list(
id = c(caseB, caseA)), class = "data.frame", row.names = c(NA,
-2L), vars = list(id), drop = TRUE, .Names = "id"))
If I replace all NAs in the ref column with zeros, the mutate step works fine. As aosmith suggested, it probably has something to do with the mutate and NA issue that is fixed in the development version of dplyr.
I cannot test this suggestion due to workstation restrictions though. So I will work around the issue, with the NA replacement step, and process the zero values after the summary steps.
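Another possible workaround, sketched here in base R, is to compute the per-group minimum and maximum with ave(), which returns one value per row and so needs no separate merge. This sidesteps the dplyr behaviour rather than explaining it; the data frame `bp` and its values are made up for illustration.

```r
# Hypothetical data with one missing ref value in group B
bp <- data.frame(id = c("B", "B", "B", "A", "A", "A"),
                 ref = c(49, 41, NA, 56.4, 39.4, 80.4))

# ave() applies FUN within each id and repeats the result on every row of the group
bp$minbplevel <- ave(bp$ref, bp$id, FUN = function(x) min(x, na.rm = TRUE))
bp$maxbplevel <- ave(bp$ref, bp$id, FUN = function(x) max(x, na.rm = TRUE))

bp$minbplevel
#[1] 41.0 41.0 41.0 39.4 39.4 39.4
bp$maxbplevel
#[1] 49.0 49.0 49.0 80.4 80.4 80.4
```

Unlike replacing NA with zero, this leaves the ref column untouched, so no zero values have to be cleaned up afterwards.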
