library(tidyverse)
df <- tibble(a = c(1, 2, 3, 0, 5, 0, 7, 0, 0, 0)) %>% print()
df[1:max(which(df$a>0)),]
This little code chunk above determines that the 7th row of df is the last row to contain a positive value, and filters every row after this 7th row out of the data frame.
I go from this
# A tibble: 10 x 1
a
<dbl>
1 1.
2 2.
3 3.
4 0.
5 5.
6 0.
7 7.
8 0.
9 0.
10 0.
to this
# A tibble: 7 x 1
a
<dbl>
1 1.
2 2.
3 3.
4 0.
5 5.
6 0.
7 7.
How can I do this df[1:max(which(df$a>0)),] operation using dplyr/tidyverse idioms? I need to learn base R, and will, but right now I have to do this in the tidyverse.
We can use slice
library(tidyverse)
df %>%
slice(1:max(which(a > 0)))
# a
# <dbl>
#1 1
#2 2
#3 3
#4 0
#5 5
#6 0
#7 7
Or filter, where we keep the rows whose row number is less than or equal to the largest index where a > 0.
df %>%
filter(row_number() <= max(which(a > 0)))
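One edge case worth sketching: if the column has no positive values at all, which() returns an empty vector and max() returns -Inf with a warning, which can lead to a warning or an error in the versions above. The variant below is only an illustration (the names trimmed and none_positive are made up here); it keeps a row while at least one positive value remains at or after it, and simply returns zero rows when there is none.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, 0, 5, 0, 7, 0, 0, 0))

# rev(cumsum(rev(a > 0))) counts the positive values at or after each row,
# so the filter keeps everything up to and including the last positive value.
trimmed <- df %>%
  filter(rev(cumsum(rev(a > 0))) > 0)

# Edge case: no positive values, so no rows are kept (and no warning is raised).
none_positive <- tibble(a = c(0, 0, 0)) %>%
  filter(rev(cumsum(rev(a > 0))) > 0)

nrow(trimmed)        # 7
nrow(none_positive)  # 0
```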
I have a dataset with a number of cases. Every case has two observations. The first observation for case number 1 has value 3 and the second observation has value 7. The two observations for case number 2 have missing values. I need to write code to fill the empty cells with the same values from case number 1, so that the first row for case 2 will have the same value as case 1 for obs = 1 and the second row will have the same value for obs = 2. Of course, this is a very short version of a much bigger dataset, so I need something flexible enough to accommodate a couple of hundred cases, where the values to use as fillers change for every subject.
Here is a toy data set:
# toy dataset
df <- data.frame(
case = c(1, 1, 2, 2),
obs = c(1, 2, NA, NA),
value = c(3, 7, NA, NA)
)
# case obs value
# 1 1 1 3
# 2 1 2 7
# 3 2 NA NA
# 4 2 NA NA
#Desired output:
case obs value
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
We may use fill with grouping on the row sequence (rowid) of case
library(dplyr)
library(data.table)
library(tidyr)
df %>%
group_by(grp = rowid(case)) %>%
fill(obs, value) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 × 3
case obs value
<dbl> <dbl> <dbl>
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
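If pulling in data.table just for rowid() is not wanted, the same grouping key can probably be built with dplyr alone. This is a sketch under that assumption: grp is the position of each row within its case, which is what rowid(case) computes, and the name out is only for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  case  = c(1, 1, 2, 2),
  obs   = c(1, 2, NA, NA),
  value = c(3, 7, NA, NA)
)

out <- df %>%
  group_by(case) %>%
  mutate(grp = row_number()) %>%   # 1, 2 within each case, like rowid(case)
  group_by(grp) %>%                # rows in the same position share a group
  fill(obs, value) %>%             # carry the last non-NA value downwards
  ungroup() %>%
  select(-grp)

out
```

Note that fill() only looks upwards by default, so this relies on the complete case appearing before the cases that need filling.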
I have a follow-up on this question: Sum values from rows with conditions in R
Here is my data:
ID <- c("A", "B", "C", "D", "E", "F")
Q1 <- c(0, 1, 7, 9, NA, 3)
Q2 <- c(0, 3, 2, 2, NA, 3)
Q3 <- c(0, 0, 7, 9, NA, 3)
dta <- data.frame(ID, Q1, Q2, Q3)
I need to sum every value below 7, but in lines with values over 7, I need to sum all the numbers below 7 and ignore the ones over it. Rows with all NAs should be preserved. Result should look like this:
ProxySum
0
4
2
2
NA
9
I have tried this code based on the response from the last post:
dta2 <- dta %>%
rowwise() %>%
mutate(ProxySum = ifelse(all(c_across(Q1:Q3) < 7), Reduce(`+`, c_across(Q1:Q3)), (ifelse(any(c_across(Q1:Q3) > 7), sum(.[. < 7]), NA))))
But in the rows with numbers over 7 I end up with a sum of all the rows and columns. What am I missing?
One way to do it in base:
rowSums(dta[, 2:4] * (dta[, 2:4] < 7))
# [1] 0 4 2 2 NA 9
Adding an explanation, following tjebo's comment:
With dta[, 2:4] < 7 you produce a logical matrix, where TRUE marks values less than 7 and FALSE marks values of 7 or more. This can be done in one line because the comparison is vectorized.
Then you multiply the data frame of original values by this logical matrix. Under the hood, R converts logical values to numeric, so every FALSE becomes 0 and every TRUE becomes 1. Each original value is therefore multiplied by 1 if it is less than 7 and by 0 otherwise.
Since NA < 7 produces NA, and multiplying by NA produces NA as well, the original NAs are preserved.
The last step is to call rowSums() on the resulting data frame, which sums the values in each row. Values of 7 or more have been turned into 0, so they drop out of the sum.
If you want a sum for every row where at least one value is not NA, pass na.rm = TRUE to rowSums(). Note that rows containing only NAs will then return 0.
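The steps above can be spelled out with intermediate objects. The names vals, keep, masked, proxy_sum and proxy_sum_narm are introduced here purely for illustration.

```r
dta <- data.frame(
  ID = c("A", "B", "C", "D", "E", "F"),
  Q1 = c(0, 1, 7, 9, NA, 3),
  Q2 = c(0, 3, 2, 2, NA, 3),
  Q3 = c(0, 0, 7, 9, NA, 3)
)

vals   <- dta[, 2:4]
keep   <- vals < 7       # logical matrix; NA wherever the value is NA
masked <- vals * keep    # TRUE/FALSE act as 1/0, so values >= 7 become 0

proxy_sum      <- rowSums(masked)                # all-NA row stays NA
proxy_sum_narm <- rowSums(masked, na.rm = TRUE)  # all-NA row becomes 0

proxy_sum       # 0 4 2 2 NA 9
proxy_sum_narm  # 0 4 2 2 0 9
```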
Another option making use of rowSums and dplyr::across:
ID <- LETTERS[1:6]
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
library(dplyr)
dta %>%
mutate(ProxySum = rowSums(across(Q1:Q3, function(.x) { .x[.x >= 7] <- 0; .x })))
#> ID Q1 Q2 Q3 ProxySum
#> 1 A 0 0 0 0
#> 2 B 1 3 0 4
#> 3 C 7 2 7 2
#> 4 D 9 2 9 2
#> 5 E NA NA NA NA
#> 6 F 3 3 3 9
How about a slightly different approach - first pivot longer, then sum by condition by group, then pivot back.
In this current version, rows that contain only "some" NAs will return a value other than NA. (NA will be considered as 0). If you want to return NA for those rows, change all to any.
library(tidyverse)
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
dta %>%
pivot_longer(-ID) %>%
group_by(ID) %>%
mutate(ProxySum = ifelse(all(is.na(value)), NA, sum(value[which(value<7)]))) %>%
pivot_wider()
#> # A tibble: 6 × 5
#> # Groups: ID [6]
#> ID ProxySum Q1 Q2 Q3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0 0 0 0
#> 2 B 4 1 3 0
#> 3 C 2 7 2 7
#> 4 D 2 9 2 9
#> 5 E NA NA NA NA
#> 6 F 9 3 3 3
Created on 2021-12-14 by the reprex package (v2.0.1)
Update: See tjebo's comment about this being identical to stefan's solution.
Here is a non-identical solution using hablar:
library(dplyr)
library(hablar)
dta %>%
rowwise() %>%
mutate(sum = sum_(across(Q1:Q3, ~case_when(.<7 ~sum_(.)))))
First answer: Possibly identical to stefan's answer:
Here is another dplyr solution:
library(dplyr)
dta %>%
mutate(across(where(is.numeric), ~ifelse(.>=7,0,.)),
sum = rowSums(across(where(is.numeric))))
ID Q1 Q2 Q3 sum
1 A 0 0 0 0
2 B 1 3 0 4
3 C 0 2 0 2
4 D 0 2 0 2
5 E NA NA NA NA
6 F 3 3 3 9
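For completeness, here is a sketch that stays close to the rowwise() attempt in the question. As far as I can tell, the `.` in sum(.[. < 7]) refers to the whole data frame on the left of the pipe rather than to the current row, which would explain totals that span all rows and columns. The helper sum_below() is hypothetical, not part of dplyr.

```r
library(dplyr)

dta <- data.frame(
  ID = c("A", "B", "C", "D", "E", "F"),
  Q1 = c(0, 1, 7, 9, NA, 3),
  Q2 = c(0, 3, 2, 2, NA, 3),
  Q3 = c(0, 0, 7, 9, NA, 3)
)

# Hypothetical helper: sum the values below `limit`;
# return NA when every value in the row is NA.
sum_below <- function(v, limit = 7) {
  if (all(is.na(v))) return(NA_real_)
  sum(v[v < limit], na.rm = TRUE)
}

dta2 <- dta %>%
  rowwise() %>%
  mutate(ProxySum = sum_below(c_across(Q1:Q3))) %>%  # c_across() is the current row
  ungroup()

dta2$ProxySum  # 0 4 2 2 NA 9
```

This is slower than the vectorised rowSums() answers on large data, but it keeps the per-row logic explicit.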
None of these functions is particularly hard to use, but I'm wondering how to combine them.
df <- tibble::tibble(index = 1:8,
amps = c(7, 6, 7, 0, 7, 6, 0, 6))
As long as there is a positive value for amps, I'd like to sum them up. If amps = 0, then that's a break in the sequence and I'd like to return the 0, then start over. I'd also like to return the corresponding index value. The result would look like this:
index amps
<dbl> <dbl>
1 1 20
2 4 0
3 5 13
4 7 0
5 8 6
I can do this in VBA but I'd like to beef up my R skills in functional programming. I would prefer to use functions rather than loops just because they're cleaner. Any help is appreciated.
Another base R solution using rle + tapply
u <- with(rle(df$amps == 0), rep(seq_along(lengths), lengths))
dfout <- data.frame(
index = which(!duplicated(u)),
amps = tapply(df$amps, u, sum)
)
which gives
> dfout
index amps
1 1 20
2 4 0
3 5 13
4 7 0
5 8 6
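To see what the rle() line is doing, it may help to look at the intermediate pieces. This is just a walk-through on the question's data; runs is a name introduced here.

```r
df <- data.frame(index = 1:8, amps = c(7, 6, 7, 0, 7, 6, 0, 6))

# Run-length encode the "is zero" flag: consecutive equal flags form one run.
runs <- rle(df$amps == 0)
runs$values   # FALSE TRUE FALSE TRUE FALSE
runs$lengths  # 3 1 2 1 1

# Give each run an id and repeat it once per row in that run.
u <- rep(seq_along(runs$lengths), runs$lengths)
u             # 1 1 1 2 3 3 4 5
```

Each distinct value of u is one block of rows: either a stretch of positive amps or a single zero.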
One dplyr option could be:
df %>%
group_by(grp = with(rle(amps == 0), rep(seq_along(lengths), lengths))) %>%
summarise(index = first(index),
amps = sum(amps))
grp index amps
<int> <int> <dbl>
1 1 1 20
2 2 4 0
3 3 5 13
4 4 7 0
5 5 8 6
We can create a new group where amps = 0 or where previous value of amps is 0, get the first value of index and sum of amps for each group.
library(dplyr)
df %>%
group_by(gr = cumsum(amps == 0 | lag(amps, default = first(amps)) == 0)) %>%
summarise(index = first(index), amps = sum(amps)) %>%
select(-gr)
# A tibble: 5 x 2
# index amps
# <int> <dbl>
#1 1 20
#2 4 0
#3 5 13
#4 7 0
#5 8 6
Using the same logic in data.table :
library(data.table)
setDT(df)[, .(index = first(index), amps = sum(amps)),
cumsum(amps == 0 | shift(amps, fill = first(amps)) == 0)]
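The cumsum() grouping used in the dplyr and data.table versions can be traced by hand. The base R restatement below is a sketch for illustration only (is_zero, prev_zero and gr are names made up here); a new group starts whenever the current value is 0 or the previous one was.

```r
amps <- c(7, 6, 7, 0, 7, 6, 0, 6)

is_zero   <- amps == 0
prev_zero <- c(FALSE, head(is_zero, -1))  # was the previous value 0?

# Every TRUE bumps the counter, so each zero and each restart gets its own id.
gr <- cumsum(is_zero | prev_zero)
gr                     # 0 0 0 1 2 2 3 4

tapply(amps, gr, sum)  # 20 0 13 0 6
```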
In base R we could use aggregate based on the rle.
ll <- rle(df$amps != 0)$lengths
rr <- aggregate(amps ~ cbind(index=rep(index[!!c(amps[1]>0, diff(amps!=0))], ll)), df, sum)
rr
# index amps
# 1 1 20
# 2 4 0
# 3 5 13
# 4 7 0
# 5 8 6
I have a data frame in which each ID belongs to a unique group. I wish to create a summary table which tells me the number of observations for each id and which group it belongs to.
dat=data.frame(id=c(1,1,1,2,2,2,2,3,4,4,4,4,4),group=c(1,1,1,0,0,0,0,1,0,0,0,0,0))
count=dat%>% group_by(id)%>% tally()
## A tibble: 4 x 2
id n
<dbl> <int>
1 1 3
2 2 4
3 3 1
4 4 5
With the code above I can count the number of observations, but I have no idea how to create a third column for group. The desired result is:
# A tibble: 4 x 3
id n group
<dbl> <int> <dbl>
1 1 3 1
2 2 4 0
3 3 1 1
4 4 5 0
When I do
dat %>% group_by(id) %>% summarise(n=count(id), group = unique(group))
I got an error: Error in quickdf(.data[names(cols)]) : length(rows) == 1 is not TRUE
However, when I do
dat %>% group_by(id) %>% summarise( group = unique(group))
That worked. I was confused about why summarise() cannot take multiple arguments.
Update: the error was caused by another package, plyr. summarise() works fine once I detach plyr.
We can use count
library(dplyr)
dat %>%
count(id, group)
# A tibble: 4 x 3
# id group n
# <dbl> <dbl> <int>
#1 1 1 3
#2 2 0 4
#3 3 1 1
#4 4 0 5
akrun's answer is more elegant, but as an alternative you can simply add the group variable to your group_by() call:
library(dplyr)
dat <- tibble(id = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
group = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
dat %>%
group_by(id, group) %>%
tally()
# A tibble: 4 x 3
# Groups: id [4]
id group n
<dbl> <dbl> <int>
1 1 1 3
2 2 0 4
3 3 1 1
4 4 0 5
Notice that if your id and group do not correspond one-to-one as in your example (id = 1 -> group = 1, id = 2 -> group = 0, and so on), this will generate a row for each combination (which is obviously very useful). For example,
dat2 <- tibble(id = c(1, 1, 1, 2, 2), group = c(1, 0, 0, 1, 0))
dat2 %>%
group_by(id, group) %>%
tally()
# A tibble: 4 x 3
# Groups: id [2]
id group n
<dbl> <dbl> <int>
1 1 0 2
2 1 1 1
3 2 0 1
4 2 1 1
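Going back to the summarise() attempt in the question, a version along these lines should also work. It is a sketch that assumes each id maps to exactly one group, as stated in the question. Inside summarise(), n() is the usual way to count rows, while count() is a verb that takes a data frame. The explicit dplyr:: prefixes are there only to sidestep masking when plyr is attached.

```r
library(dplyr)

dat <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
  group = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

out <- dat %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(
    n     = dplyr::n(),            # rows per id
    group = dplyr::first(group)    # assumes one group value per id
  )

out
```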
Newbie question
I have 2 columns in a data frame that looks like
Name Size
A 1
A 1
A 1
A 2
A 2
B 3
B 5
C 7
C 17
C 17
I need a third column that will run continuously as a sequence until either Name or Size changes value
Name Size NewCol
A 1 1
A 1 2
A 1 3
A 2 1
A 2 2
B 3 1
B 5 1
C 7 1
C 17 1
C 17 2
Basically a dummy field to reference each record separately even if Name and Size are the same.
So the index changes from k to k+1 when the next row has the same values for both Name and Size, and otherwise resets.
Therefore, if my data set has 200 rows with A and 1, they should be indexed 1..200. When it moves on to A and 2, the index should reset.
We can try with data.table
library(data.table)
setDT(df1)[, NewCol := match(Size, unique(Size)), by = .(Name)]
df1
# Name Size NewCol
#1: A 1 1
#2: A 1 1
#3: A 2 2
#4: B 3 1
#5: C 7 1
#6: C 17 2
If there is a typo somewhere in the expected output, maybe this is what is needed
setDT(df1)[, NewCol := seq_len(.N), .(Name, Size)]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(NewCol = match(Size, unique(Size)))
Or
df1 %>%
group_by(Name, Size) %>%
mutate(NewCol = row_number())
Or we can use the same approach with ave from base R
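A minimal sketch of the ave() version, assuming df1 holds the ten rows from the question: ave() splits the first argument by the grouping variables and applies FUN within each group, so seq_along() restarts at 1 for every Name/Size combination.

```r
df1 <- data.frame(
  Name = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C"),
  Size = c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17)
)

# Number the rows 1, 2, ... within each (Name, Size) group.
df1$NewCol <- with(df1, ave(seq_along(Size), Name, Size, FUN = seq_along))

df1$NewCol  # 1 2 3 1 2 1 1 1 1 2
```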
I guess this might not be the most efficient solution, but it is at least a good start:
# Reproducing the example
df <- data.frame(Name=LETTERS[c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3)], Size=c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17))
# Create new column with unique id
df$NewCol <- paste0(df$Name, df$Size)
# Modify column to write count instead
df$NewCol <- unlist(sapply(unique(df$NewCol), function(id) 1:table(df$NewCol)[id]))
df
Name Size NewCol
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 1
5 A 2 2
6 B 3 1
7 B 5 1
8 C 7 1
9 C 17 1
10 C 17 2