R: create sequence in dplyr by id beginning with zero

I'm looking for the cleanest way to create a sequence beginning with zero by id in a dataframe.
df <- data.frame(id = rep(1:10, each = 10))
If I wanted to start sequence at 1 the following would do:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(start = 1:n()) %>%
  ungroup()
but starting at 0 doesn't work, because that requires one extra value per id (0-10 rather than 1-10), so I would need to add an extra row per group. Is there a way to do this all in one step, perhaps using dplyr? There are obviously workarounds, such as creating another dataset and appending it to the original:
df1 <- data.frame(id = 1:10, start = 0)
new <- rbind(df, df1)
That just seems awkward and not very tidy. I know dplyr has bind_rows(), but I'm not sure how to incorporate everything into one step, especially if there were other non-time-varying variables you just wanted to copy over into the new row. Interested to see suggestions, thanks.

You could use complete() from tidyr (attached with the tidyverse):
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(start = 1:n()) %>%
  complete(start = 0:10) %>%
  ungroup()
Which yields
# A tibble: 110 x 2
      id start
   <int> <int>
 1     1     0
 2     1     1
 3     1     2
 4     1     3
 5     1     4
 6     1     5
 7     1     6
 8     1     7
 9     1     8
10     1     9
# ... with 100 more rows
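Alternatively, if you'd rather not hard-code the 0:10 range, here is a one-step sketch assuming dplyr's group_modify() (≥ 0.8.1) and tibble's add_row() are available (neither is mentioned in the answer above); it prepends a start = 0 row to each group:

```r
library(dplyr)
library(tibble)

df <- data.frame(id = rep(1:10, each = 10))

# Number the rows 1..n within each id, then prepend a start = 0 row per group
res <- df %>%
  group_by(id) %>%
  mutate(start = row_number()) %>%
  group_modify(~ add_row(.x, start = 0, .before = 1)) %>%
  ungroup()
```

As with complete(), any other columns get NA in the added rows, so non-time-varying variables would still need something like tidyr::fill(..., .direction = "up") afterwards.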

Subset rows preceding a specific row value in grouped data using R

Consider the following dataframe
df <- data.frame(group  = c(1,1,1,2,2,2,3,3,3),
                 status = c(NA,1,1,NA,NA,1,NA,1,NA),
                 health = c(0,1,1,1,0,1,1,0,0))
For each group (i.e. first column), I'm looking for a way to subset the rows preceding the cells where 1 is first seen in the second column (labelled status). The expected output is
group status health
1 1 NA 0
2 2 NA 0
3 3 NA 1
I've tried resolving this with the filter() and slice() functions, but have not succeeded in subsetting the preceding rows. Any help is greatly appreciated.
One solution using the tidyverse:
df %>%
  group_by(group) %>%
  mutate(gr = which(status == 1)[1] - 1) %>%
  slice(unique(gr)) %>%
  select(-gr)
# A tibble: 3 x 3
# Groups: group [3]
group status health
<dbl> <dbl> <dbl>
1 1 NA 0
2 2 NA 0
3 3 NA 1
or
df %>%
  group_by(group) %>%
  filter(row_number() == which(status == 1)[1] - 1)
or
df %>%
  group_by(group) %>%
  slice(which(lead(status == 1))[1])
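For comparison, the same idea can be sketched in base R (this is not from the answer above): split by group, find the first status == 1, and keep the row just before it:

```r
df <- data.frame(group  = c(1,1,1,2,2,2,3,3,3),
                 status = c(NA,1,1,NA,NA,1,NA,1,NA),
                 health = c(0,1,1,1,0,1,1,0,0))

res <- do.call(rbind, lapply(split(df, df$group), function(d) {
  i <- which(d$status == 1)[1]                    # first row with status == 1
  if (is.na(i) || i == 1) d[0, ] else d[i - 1, ]  # guard: no 1, or 1 in the first row
}))
```

The guard drops groups where status == 1 never occurs, or occurs in the first row, so no preceding row exists.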

dplyr or tidy way of counting the number of each unique value in a vector?

There are numerous ways of counting the values in a vector, including the familiar (but fraught) table()
Is there a safe/reliable method that uses dplyr / tidyverse?
Note plyr::count() seems to work nicely, but is obviously from plyr rather than dplyr
c(1,3,3,3,4,4) %>% plyr::count()
x freq
1 1 1
2 3 3
3 4 2
dplyr functions are better suited to data frames/tibbles than to bare vectors. You can use dplyr::count() after converting the vector to a tibble:
c(1,3,3,3,4,4) %>% tibble::as_tibble() %>% count(value)
# value n
# <dbl> <int>
#1 1 1
#2 3 3
#3 4 2
We can also convert to data.frame
library(dplyr)
c(1,3,3,3,4,4) %>%
  data.frame(value = .) %>%
  count(value)
Or just use table
c(1,3,3,3,4,4) %>%
  table %>%
  as.data.frame
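Another option, assuming the vctrs package is installed (it underpins the tidyverse but is not mentioned in the answers above), counts a vector directly with no conversion step:

```r
library(vctrs)

res <- vec_count(c(1, 3, 3, 3, 4, 4), sort = "count")
res
#   key count
# 1   3     3
# 2   4     2
# 3   1     1
```

vec_count() returns an ordinary data frame; sort = "count" orders by descending frequency, and sort = "location" preserves order of first appearance.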

Flag column creation in dplyr

This is a really frustrating silly example. Let's say I have the following data below with an ID column...
dat <- data.frame(ID = rep(c("a","b","c"), c(1,3,2)))
dat
ID
1 a
2 b
3 b
4 b
5 c
6 c
I am trying to create a new column using the dplyr package which creates a 0 for the first row by ID and then any subsequent rows will have a 1. So the resulting data should look like this:
ID Flag
1 a 0
2 b 0
3 b 1
4 b 1
5 c 0
6 c 1
I have tried the following code but just get a column of zeros:
dat %>%
  group_by(ID) %>%
  mutate(
    Readmission = ifelse(n() == 1, 0, c(0, rep(1, n() - 1)))
  ) %>%
  data.frame()
Any help appreciated! Surely this is a quick fix and I didn't sleep enough last night. This is actually a pretty simple task using lapply, but that takes too long to run and I'm impatient.
n() is the number of rows in the group, but you need the actual row number. (Your version returns all zeros because the test n() == 1 has length 1, so ifelse() returns only the first element of c(0, rep(1, n() - 1)), which is 0, and mutate() recycles it across the group.) Here is the solution:
dat %>%
  group_by(ID) %>%
  mutate(
    Readmission = ifelse(row_number() == 1, 0, 1)
  ) %>%
  data.frame()
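A base-R alternative (not in the answer above) needs no grouping at all: duplicated() is FALSE at the first occurrence of each ID and TRUE for every repeat, which is exactly the flag:

```r
dat <- data.frame(ID = rep(c("a","b","c"), c(1,3,2)))

# 0 for the first row of each ID, 1 for every subsequent occurrence
dat$Flag <- as.integer(duplicated(dat$ID))
dat
#   ID Flag
# 1  a    0
# 2  b    0
# 3  b    1
# 4  b    1
# 5  c    0
# 6  c    1
```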

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars <- c("Ozone", "Temp", "Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither seems to work. I need the row sums within each group, and then to compute the mean of those sums within the group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
    1  1  1  0   2
    1  1  1  1   3
    1  1  0  0   1
Then the result would be:
Month Average
    1     6/3
You can try:
library(tidyverse)
airquality %>%
  select(Month, target_vars) %>%
  gather(key, value, -Month) %>%
  group_by(Month) %>%
  summarise(n = length(unique(key)),
            Sum = sum(value, na.rm = TRUE)) %>%
  mutate(Average = Sum / n)
# A tibble: 5 x 4
  Month     n   Sum  Average
  <int> <int> <int>    <dbl>
1     5     3  7541 2513.667
2     6     3  8343 2781.000
3     7     3 10849 3616.333
4     8     3  8974 2991.333
5     9     3  8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, in regular (base) R. The sapply() call keeps the months separated by name; sum() applied to each sub-dataframe collapses all of its columns into a single total:
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE) /
  length(target_vars)
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr)   # forgot this in original post
airquality %>%
  group_by(Month) %>%
  nest(Ozone, Temp, Solar.R, .key = newcol) %>%   # tidyr < 1.0 syntax; tidyr >= 1.0: nest(newcol = c(Ozone, Temp, Solar.R))
  mutate(newcol = map_dbl(newcol, ~ mean(rowSums(.x, na.rm = TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Solar.R) %>%
  mutate(newcol = rowSums(., na.rm = TRUE)) %>%
  summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
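With more recent dplyr (≥ 1.0, for across(); this is a sketch under that assumption, not one of the original answers), the mean of the per-row sums within each month can be computed directly, and it reproduces the nest()/map_dbl() numbers:

```r
library(dplyr)

target_vars <- c("Ozone", "Temp", "Solar.R")

# Row sums first (ungrouped, so rowSums sees whole rows), then group and average
res <- airquality %>%
  mutate(row_total = rowSums(across(all_of(target_vars)), na.rm = TRUE)) %>%
  group_by(Month) %>%
  summarise(Average = mean(row_total))
```

Month 5 comes out as 243.2581, matching the validation above.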

Selecting unique rows in R

There is a data.frame with duplicate values for the variable "Time"
> data.old
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 7 1
4 100001330144 33 1
5 100001331413 39 0
6 100001331413 43 0
7 100001334038 1 1
8 100001357594 50 0
I want to remove the duplicates, keeping one row per Time value, and sum the values of the "Count" variable over the duplicated rows, i.e.
> data.new
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 40 1
4 100001331413 82 0
5 100001334038 1 1
6 100001357594 50 0
So far I have only managed to find the unique values, with the command
> data.old$Time[!duplicated(data.old$Time)]
[1] 100000630955 100000637570 100001330144 100001331413 100001334038 100001357594
I can do this in a loop, but maybe there is a more elegant solution
Here's one approach using dplyr. Is this what you want to do?
library(tidyverse)
data.old %>%
  group_by(Time) %>%
  summarise(Count = sum(Count))
Edit: Keeping other variables
OP has indicated a desire to keep the values of other variables in the dataframe, which summarise deletes. Assuming that all values of those other variables are the same for all the rows being summarised, you could use the Mode function from this SO question.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Then change my answer to the following, with one call to Mode for each variable you want kept. This works with both numeric and character data.
library(tidyverse)
data.old %>%
group_by(Time) %>%
summarise(Count = sum(Count), Direction = Mode(Direction))
Here is one using the aggregate() function:
data.new <- aggregate(Count ~ Time, data = data.old, sum, na.rm = TRUE)
library(dplyr)
data.old %>%
  group_by(Time) %>%
  summarise(Count = sum(Count),
            Direction = unique(Direction))
This assumes, of course, that Direction takes a single unique value within each Time group.
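If you have dplyr ≥ 1.1 (an assumption; the per-operation .by argument is newer than these answers), the group_by()/summarise() pair collapses to one call, with first() carrying the other columns along:

```r
library(dplyr)

data.old <- data.frame(
  Time      = c(100000630955, 100000637570, 100001330144, 100001330144,
                100001331413, 100001331413, 100001334038, 100001357594),
  Count     = c(95, 5, 7, 33, 39, 43, 1, 50),
  Direction = c(1, 0, 1, 1, 0, 0, 1, 0)
)

# .by groups for this summarise only; the result comes back ungrouped
res <- data.old %>%
  summarise(Count = sum(Count), Direction = first(Direction), .by = Time)
```

Like the unique(Direction) version, this assumes Direction is constant within each Time.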
