Flag column creation in dplyr

This is a really frustrating silly example. Let's say I have the following data below with an ID column...
dat <- data.frame(ID = rep(c("a", "b", "c"), c(1, 3, 2)))
dat
ID
1 a
2 b
3 b
4 b
5 c
6 c
I am trying to create a new column, using the dplyr package, that is 0 for the first row of each ID and 1 for any subsequent rows. So the resulting data should look like this:
ID Flag
1 a 0
2 b 0
3 b 1
4 b 1
5 c 0
6 c 1
I have tried the following code but just get a column of zeros:
dat %>%
  group_by(ID) %>%
  mutate(
    Readmission = ifelse(n() == 1, 0, c(0, rep(1, n() - 1)))
  ) %>%
  data.frame()
Any help appreciated! Surely this is a quick fix and I just didn't sleep enough last night. This is actually a pretty simple task using lapply, but that takes too long to run and I'm impatient.

n() is the number of rows in the group, so ifelse() is testing a single condition and returns a single value (here the first element of your vector, which is 0); mutate() then recycles that value across the whole group, which is why you get a column of zeros. What you need is the actual row number. Here is the solution:
dat %>%
  group_by(ID) %>%
  mutate(
    Readmission = ifelse(row_number() == 1, 0, 1)
  ) %>%
  data.frame()
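As a side note, since the flag is just "not the first row within the ID", the ifelse() can be dropped entirely by coercing the logical test to integer; a minimal equivalent sketch:
dat %>%
  group_by(ID) %>%
  mutate(Readmission = as.integer(row_number() > 1)) %>%
  data.frame()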

Related

R dplyr filter data based on values in other rows

I am trying to filter a data frame using dplyr and I can't really think of a way to achieve what I want. I have a data frame of the following form:
A B C
-----------
1 2 5
1 4 6
2 2 7
2 4 6
Each value in column A appears exactly twice. Column B has exactly two distinct values, each appearing exactly once for each value of A. Column C can hold any positive values. I want to keep all rows where, for a given value of A, the row with the larger B value has a smaller C value than the row with the smaller B value. In the example above, this would result in:
A B C
-----------
2 2 7
2 4 6
Is there a way to achieve this using dplyr?
1) Sort by A and B to ensure that the larger B always comes second within each A, then group by A and filter on diff(C) < 0.
library(dplyr)

DF %>%
  arrange(A, B) %>%
  group_by(A) %>%
  filter(diff(C) < 0) %>%
  ungroup
## # A tibble: 2 × 3
## A B C
## <int> <int> <int>
## 1 2 2 7
## 2 2 4 6
2) Another possibility is to ensure that the maximum of B is on the same row as the minimum of C. This would also work with non-numeric data.
DF %>%
  group_by(A) %>%
  filter(which.max(B) == which.min(C)) %>%
  ungroup
3) If the slope of B with respect to C is negative then keep the group.
DF %>%
  group_by(A) %>%
  filter(coef(lm(B ~ C))[[2]] < 0) %>%
  ungroup
or we can calculate a slope ourselves (with only two points per group, diff(C) / diff(B) is the slope of C against B, which has the same sign):
DF %>%
  group_by(A) %>%
  filter(diff(C) / diff(B) < 0) %>%
  ungroup
Note
Lines <- "A B C
1 2 5
1 4 6
2 2 7
2 4 6"
DF <- read.table(text = Lines, header = TRUE)
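As a quick sanity check (a sketch, not part of the original answer), approaches 1 and 2 can be run against the DF defined above and compared:
library(dplyr)

r1 <- DF %>% arrange(A, B) %>% group_by(A) %>% filter(diff(C) < 0) %>% ungroup()
r2 <- DF %>% group_by(A) %>% filter(which.max(B) == which.min(C)) %>% ungroup()
identical(as.data.frame(r1), as.data.frame(r2)) # TRUE: both keep only the A == 2 rows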

Operations on single row in dplyr [duplicate]

This question already has answers here:
dplyr mutate/replace several columns on a subset of rows
(12 answers)
Closed 3 years ago.
Is it possible to perform dplyr operations with pipes for single rows of a dataframe? For example, say I have the following dataframe (call it df) and want to do some manipulations to its columns:
df <- df %>%
  mutate(col1 = col1 + col2)
This code sets one column equal to the sum of that column and another. What if I want to do this, but only for a single row?
df[1,] <- df[1,] %>%
  mutate(col1 = col1 + col2)
I realize this is an easy operation in base R, but I am super curious and would love to use dplyr operations and piping to make this happen. Is this possible or does it go against dplyr grammar?
Here's an example. Say I have a dataframe:
df <- data.frame(a = rep(1, 100), b = rep(1, 100))
The first example I showed:
df <- df %>%
mutate(a = a + b)
Would result in column a being 2 for all rows.
The second example would only result in the first row of column a being 2.
mutate() is for creating whole columns; for a single cell, base R is simplest: df[1,1] <- df[1,1] + df[1,2].
If you want to stay within dplyr, you can combine mutate() and case_when() for conditional manipulation. An example:
df %>%
  mutate(a = case_when(row_number(a) == 1 ~ a + b,
                       TRUE ~ a))
results in
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 2 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
# … with 90 more rows
Data
library(tidyverse)
df <- tibble(a = rep(1, 100), b = rep(1,100))
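As an aside, for touching just one row you don't strictly need case_when(); a sketch using base R's replace() inside mutate() (not from the original answers) does the same for the first row:
df %>%
  mutate(a = replace(a, 1, a[1] + b[1]))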

R create sequence in dplyr by id beginning with zero

I'm looking for the cleanest way to create a sequence beginning with zero by id in a dataframe.
df <- data.frame(id = rep(1:10, each = 10))
If I wanted to start sequence at 1 the following would do:
library(dplyr)

df <- df %>%
  group_by(id) %>%
  mutate(start = 1:n()) %>%
  ungroup()
but starting at 0 doesn't work, because it creates one extra value per id (0-10 is eleven values against ten rows), so I need to add an extra row per id. Is there a way to do this all in one step, perhaps using dplyr? There are obviously workarounds, such as creating another dataset and appending it to the original:
df1 <- data.frame(id = 1:10, start = 0)
new <- rbind(df, df1)
That just seems a bit awkward and not that tidy. I know you can use rbind in dplyr, but I'm not sure how to do everything in one step, especially if there were other non-time-varying variables I just wanted to copy over into the new row. Interested to see suggestions, thanks.
You could use complete() from the tidyverse:
library(tidyverse)

df %>%
  group_by(id) %>%
  mutate(start = 1:n()) %>%
  complete(start = 0:10) %>%
  ungroup()
Which yields
# A tibble: 110 x 2
id start
<int> <int>
1 1 0
2 1 1
3 1 2
4 1 3
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
10 1 9
# ... with 100 more rows
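If you are on a recent dplyr (reframe() was added in dplyr 1.1.0), the extra row can also be generated directly, without the mutate()/complete() pair; a sketch:
library(dplyr)

df %>%
  group_by(id) %>%
  reframe(start = 0:n())
Note that, like complete(), this still leaves the question of filling any other non-time-varying columns, which complete()'s fill argument or a later tidyr::fill() can handle.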

Filter and return all rows of a group where specific row fulfills one condition

I am looking to filter and retrieve all rows from all groups where a specific row meets a condition; in my example, where value is greater than 3 on the highest day of a group. This is obviously simplified, but it breaks the problem down to the essentials.
# Dummy data
id <- rep(letters[1:3], each = 3)
day <- rep(1:3, 3)
value <- c(2, 3, 4, 2, 3, 3, 1, 2, 4)
my_data <- data.frame(id, day, value, stringsAsFactors = FALSE)
My approach works, but it seems somewhat clumsy:
library(dplyr)

foo <- my_data %>%
  group_by(id) %>%
  slice(which.max(day)) %>% # gets the row with the highest day
  filter(value > 3)         # keeps the rows with value > 3

# semi_join with the original data frame gives the required result:
semi_join(my_data, foo, by = 'id')
id day value
1 a 1 2
2 a 2 3
3 a 3 4
4 c 1 1
5 c 2 2
6 c 3 4
Is there a more succinct way to do this?
my_data %>%
  group_by(id) %>%
  filter(value[which.max(day)] > 3)
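Within each group, value[which.max(day)] evaluates to the single value recorded on the highest day, so the filter condition has length one and keeps or drops the whole group at once. A minimal standalone illustration of the indexing idiom, using group a from the sample data:
day <- c(1, 2, 3)
value <- c(2, 3, 4)
value[which.max(day)] # the value recorded on the latest day: 4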

Grouping and filtering consecutively over a dataframe

I am working with a large dataframe in R, and my solution to the following task looks too long. I will use DF as an example of the dataframe I am using:
library(dplyr)

DF <- data.frame(ID = 1:10,
                 Cause1 = c(rep("Yes 1", 8), rep("No 1", 2)),
                 Cause2 = c(rep("Yes 2", 6), rep("No 2", 4)),
                 Cause3 = c(rep("Yes S", 5), rep("No S", 5)),
                 Cause4 = c(rep("Yes P", 3), rep("No P", 7)),
                 Cause5 = c(rep("Yes", 2), rep("No", 8)),
                 stringsAsFactors = FALSE)
DF has the following structure:
ID Cause1 Cause2 Cause3 Cause4 Cause5
1 1 Yes 1 Yes 2 Yes S Yes P Yes
2 2 Yes 1 Yes 2 Yes S Yes P Yes
3 3 Yes 1 Yes 2 Yes S Yes P No
4 4 Yes 1 Yes 2 Yes S No P No
5 5 Yes 1 Yes 2 Yes S No P No
6 6 Yes 1 Yes 2 No S No P No
7 7 Yes 1 No 2 No S No P No
8 8 Yes 1 No 2 No S No P No
9 9 No 1 No 2 No S No P No
10 10 No 1 No 2 No S No P No
DF is composed of six variables: an ID variable and five variables that can be Yes or No. For each of the variables with the prefix Cause, I first need to compute a summary of that variable, and then filter to the rows where that variable was achieved (i.e. equals Yes). For example, I do the first stage of this process with the following code:
# Filtering stage
# N1
DF %>% group_by(Cause1) %>% summarise(N = n()) -> d1
DF %>% filter(Cause1 == "Yes 1") -> DF2
Here, using dplyr, I group DF by Cause1 and use summarise() to count its values with n(); the result is saved in d1. Then I filter DF to the rows where Cause1 equals Yes 1, which is saved in a new data.frame called DF2. Once I have DF2, I repeat a similar routine for Cause2, Cause3, Cause4 and Cause5, using this code:
# N2
DF2 %>% group_by(Cause2) %>% summarise(N = n()) -> d2
DF2 %>% filter(Cause2 == "Yes 2") -> DF3
# N3
DF3 %>% group_by(Cause3) %>% summarise(N = n()) -> d3
DF3 %>% filter(Cause3 == "Yes S") -> DF4
# N4
DF4 %>% group_by(Cause4) %>% summarise(N = n()) -> d4
DF4 %>% filter(Cause4 == "Yes P") -> DF5
# N5
DF5 %>% group_by(Cause5) %>% summarise(N = n()) -> d5
DF5 %>% filter(Cause5 == "Yes") -> DF6
The final result is DF6, but as a control I have to combine the dataframes d1, d2, d3, d4 and d5 and filter out all the No values. I used the following code for that purpose: it sets common names for all the d dataframes, rbinds them, and filters on the No pattern.
# Connect
names(d1) <- names(d2) <- names(d3) <- names(d4) <- names(d5) <- c("Cause", "N")
# Rbind
d <- rbind(d1, d2, d3, d4, d5)
d_reduced <- d[grepl("No", d$Cause), ]
I obtain this:
Cause N
1 No 1 2
2 No 2 2
3 No S 1
4 No P 2
5 No 1
The final step is to compute the sum of N in d_reduced; the number of rows in DF minus that value must equal the number of rows of DF6:
(nrow(DF) - sum(d_reduced$N)) == nrow(DF6)
That in this case is TRUE.
I would like to reduce this overly long code, because in my analysis the number of Cause variables can increase, making the code even longer. Maybe an apply strategy or reshaping the data could be better. Any help in reducing the amount of code would be marvelous. Thanks in advance.
How about something like this?
First we count how many "No" cases there are in each column that starts with "Cause":
num_no <- DF %>%
  summarise(across(starts_with("Cause"), ~ sum(substr(.x, 1, 1) == "N")))
> num_no
Cause1 Cause2 Cause3 Cause4 Cause5
1 2 4 5 7 8
You are interested in the incremental difference between each subsequent column, so let's subtract a lagged version of num_no from itself. Since num_no is a one-row data frame, first flatten it to a named vector:
num_no_vec <- unlist(num_no)
d_reduced <- num_no_vec - lag(num_no_vec, default = 0)
> d_reduced
Cause1 Cause2 Cause3 Cause4 Cause5
     2      2      1      2      1
This gives the values you wanted, but without the labels; let's fix that by extracting, for each column, the unique string that begins with N:
labs <- lapply(DF, function(X) unique(X[grep("N", X)])) %>% unlist()
names(d_reduced) <- labs
> d_reduced
  No 1   No 2   No S   No P     No
     2      2      1      2      1
Then, for your final step, we sum the entries of d_reduced, subtract that sum from the number of rows of DF, and check that this equals the number of rows that are Yes all the way across:
> (nrow(DF) - sum(d_reduced)) == sum(DF[, ncol(DF)] == "Yes")
[1] TRUE
Warning: this only works because whenever someone has Yes in the final column, all preceding columns are Yes as well (as in your example). If that assumption changes, this answer will not work.
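If you want to verify that nesting assumption before relying on the subtraction trick, here is a small sketch (not part of the original answer) that checks that, within every row, no Yes ever follows a No:
# logical matrix: one row per ID, one column per Cause
is_yes <- sapply(DF[startsWith(names(DF), "Cause")], startsWith, prefix = "Yes")
# within each row, the TRUEs must all come before the FALSEs
all(apply(is_yes, 1, function(r) !is.unsorted(rev(r)))) # TRUE for this DF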
You could reshape to long format, count the Yes's and No's, and then take the differences between the Yes counts. data.table::melt accepts regular-expression patterns for selecting measure variables, which is useful for capturing all the Cause columns. Does this work?
library(data.table)
library(dplyr)

d <-
  melt(as.data.table(DF),                # launch melt.data.table
       id.vars = "ID",
       measure.vars = patterns("Cause"), # grep the Cause columns
       variable.name = "Cause") %>%
  group_by(Cause) %>%                    # tabulate Yes's and No's
  summarise(Yes = sum(grepl("Yes", value)),
            No = sum(grepl("No", value))) %>%
  mutate(N = lag(Yes) - Yes) %>%         # N = difference between successive Yes counts
  rowwise() %>%                          # replace the NA in the first row with the No value
  mutate(N = replace(N, is.na(N), No)) %>%
  ungroup()
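Assuming you would rather stay entirely within the tidyverse, the same reshape-and-count logic can be written with tidyr::pivot_longer (available since tidyr 1.0.0) instead of data.table::melt; a sketch:
library(dplyr)
library(tidyr)

DF %>%
  pivot_longer(starts_with("Cause"),
               names_to = "Cause", values_to = "value") %>%
  group_by(Cause) %>%
  summarise(Yes = sum(grepl("Yes", value)),
            No = sum(grepl("No", value))) %>%
  mutate(N = coalesce(lag(Yes) - Yes, No)) # first row falls back to its own No count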
