How do I remove offsetting rows in a tibble? - r

I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
This generates the following table, which is ordered and grouped by ID and SEQ.
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.

What about:
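vec <- cbind(
  c(head(df$VALUE, -1) + df$VALUE[-1], 9999),   # VALUE + next row's VALUE (9999 pads the last row)
  df$VALUE + c(9999, head(df$VALUE, -1))        # VALUE + previous row's VALUE (9999 pads the first row)
)
vec <- apply(vec, 1, prod)   # the product is 0 if either sum is 0
vec <- vec != 0
df[vec, ]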
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
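The idea is to add your VALUE field to a copy of itself shifted by one row, once towards the next row and once towards the previous row. When either sum is 0, the row is part of an offsetting pair and is removed. Note that the output above shows 50, 60 and 70 for rows D, G and J, which does not match df as defined in the question. With VALUE 10, 20 and 30 in those rows, each of them also sums to zero with the row before it, so they are dropped as well and only row A remains.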

Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
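library(dplyr)
df %>%
  group_by(ID, SEQ) %>%
  mutate(diff = VALUE + lag(VALUE),
         diff2 = VALUE + lead(VALUE)) %>%
  # NA means there is no neighbour in the group; replace it with a non-zero value
  # (across() replaces the deprecated mutate_at()/funs() combination)
  mutate(across(c(diff, diff2), ~ coalesce(.x, 1))) %>%
  filter(diff != 0 & diff2 != 0) %>%
  select(-diff, -diff2)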
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
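Note:
This solution first creates two helper columns: diff adds the previous VALUE in the group (lag) and diff2 adds the next one (lead). Only rows that belong to an offsetting pair have a zero in diff or in diff2, so I filtered out those rows. The result shown has 50, 60 and 70 in rows D, G and J; with df as defined in the question (10, 20 and 30), those rows also get diff == 0 against the preceding negative row and are filtered out, so that edge case is not handled.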
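Both neighbour-sum approaches above drop a row whenever it sums to zero with either neighbour, so a value that follows an offsetting pair (row D after rows B and C) is lost too. Here is a minimal sketch of one way around this. It assumes that an offsetting pair always means two adjacent rows within the same ID/SEQ group and that each row can cancel at most once; keep_unoffset() is a hypothetical helper written for this example, not a dplyr function:

```r
library(dplyr)

df <- tibble(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  SEQ   = c("a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
  VALUE = c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30),
  OTHER = LETTERS[1:10]
)

# Hypothetical helper: returns TRUE for rows to keep.
# Walk the values in order; if a row negates the row directly before it and
# that previous row has not already been cancelled, flag both for removal.
keep_unoffset <- function(x) {
  keep <- rep(TRUE, length(x))
  for (i in seq_along(x)[-1]) {
    if (keep[i - 1] && isTRUE(x[i] == -x[i - 1])) {
      keep[c(i - 1, i)] <- FALSE
    }
  }
  keep
}

df2 <- df %>%
  group_by(ID, SEQ) %>%
  filter(keep_unoffset(VALUE)) %>%
  ungroup()

df2$OTHER
#> [1] "A" "D" "G" "J"
```

This still uses a loop, but only inside each group, and it reproduces the requested df2 for the sample data. Under this rule two adjacent zeros would also count as an offsetting pair, which may or may not be what you want.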

Related

Reorder one row in tibble - move it to the last row

How do I rearrange the rows in tibble?
I wish to reorder rows such that: row with x = "c" goes to the bottom of the tibble, everything else remains same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
              y = 1:8)
An alternative to dplyr::arrange(), using base R:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
The dplyr equivalent sorts on the same logical condition:
tbl |> dplyr::arrange(x == "c")
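Both one-liners rely on logical values sorting as FALSE before TRUE, with ties left in their original order. A small base R sketch of that behaviour (the vector x here is made up for illustration):

```r
x <- c("a", "b", "c", "d")

is_c <- x == "c"      # FALSE FALSE TRUE FALSE
idx  <- order(is_c)   # FALSE entries first, original order kept within ties

idx
#> [1] 1 2 4 3
x[idx]
#> [1] "a" "b" "d" "c"
```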
Using forcats, convert x to a factor with "c" as the last level inside arrange(), then sort by it. Because the conversion happens only inside arrange(), the class of column x does not change.
library(forcats)
tbl %>%
  arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important later on, it is better to store it as a factor. The code below changes the class of x from character to factor, with "c" as the last level:
tbl %>%
  mutate(x = fct_relevel(x, "c", after = Inf)) %>%
  arrange(x)

How to remove unique rows from dataframe using tidyverse

I want to remove unique rows based on a variable:
Letters Val
A 1
A 1
B 1
B 3
In this case, the entries with A are removed because their Val values repeat, while the B entries each have a different Val, resulting in:
Letters Val
B 1
B 3
I have tried to use count() and then filter on n, but in this process Val is lost.
In essence, how do I filter on the count of each Letters/Val combination while keeping the other columns?
md <- tibble::tribble(
~Letters, ~Val,
"A", 1,
"A", 1,
"B", 1,
"B", 3
)
library(dplyr)
md |>
  group_by(Letters, Val) |>
  filter(n() == 1)
#> # A tibble: 2 × 2
#> # Groups: Letters, Val [2]
#> Letters Val
#> <chr> <dbl>
#> 1 B 1
#> 2 B 3
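If you would rather avoid grouping, a similar result can be sketched with add_count(), which appends the count as a column instead of collapsing the data the way count() does, so Val survives. This is an alternative sketch, not the method used in the answer above:

```r
library(dplyr)

md <- tibble::tribble(
  ~Letters, ~Val,
  "A", 1,
  "A", 1,
  "B", 1,
  "B", 3
)

res <- md |>
  add_count(Letters, Val) |>  # adds column n: occurrences of each Letters/Val pair
  filter(n == 1) |>           # keep only combinations that occur exactly once
  select(-n)                  # drop the helper column again

res
```

The result is an ungrouped tibble containing the two B rows.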

Adding dataset identifier variable in full_join in R

I want to automatically add a new dataset identifier variable when using full_join() in R.
df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)
df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)
combined <- df1 %>% dplyr::full_join(df2)
I know from ?full_join that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.
Currently, I'm adding an extra variable in df1 first
df1 <- tribble(~ID, ~x, ~dataset,
"A", 1, 1,
"B", 2, 1,
"C", 3, 1)
and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))
Any suggestions to do it in a better way?
I'm not sure whether this is more efficient than your approach, but if the two data frames never share any columns other than ID, you may try
df1 %>%
  full_join(df2) %>%
  mutate(dataset = as.numeric(is.na(x)) + 1)
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
But to be safe, it might be better to define the identifier column in both data frames beforehand.
df1 %>%
  mutate(dataset = 1) %>%
  full_join(df2 %>% mutate(dataset = 2))
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
New data
df1 <- tribble(~ID, ~x,~y,
"A", 1,1,
"B", 2,1,
"C", 3,1)
df2 <- tribble(~ID, ~x,~y,
"D", 4,1,
"E", 5,1,
"F", 6,1)
full_join(df1, df2)
ID x y
<chr> <dbl> <dbl>
1 A 1 1
2 B 2 1
3 C 3 1
4 D 4 1
5 E 5 1
6 F 6 1
Instead of a "join", maybe try bind_rows from dplyr:
library(dplyr)
bind_rows(df1, df2, .id = "dataset")
This will bind rows, and the missing columns are filled in with NA. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).
Output
dataset ID x y
<chr> <chr> <dbl> <dbl>
1 1 A 1 NA
2 1 B 2 NA
3 1 C 3 NA
4 2 D NA 4
5 2 E NA 5
6 2 F NA 6
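To illustrate the point about labels coming from list names, here is a short sketch that passes a named list to bind_rows(). The names "first" and "second" are arbitrary labels chosen for this example:

```r
library(dplyr)

df1 <- tribble(~ID, ~x,
               "A", 1,
               "B", 2,
               "C", 3)
df2 <- tribble(~ID, ~y,
               "D", 4,
               "E", 5,
               "F", 6)

# The list names become the values of the identifier column
combined <- bind_rows(list(first = df1, second = df2), .id = "dataset")

combined$dataset
#> [1] "first"  "first"  "first"  "second" "second" "second"
```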

How to alter a variable in every row after a condition has been met

Essentially, I need to alter every row that occurs after a certain condition has been met. Though I also need the loop to obey a grouping variable. A simplified version of my data (shown below), is the grouping variable (Groups), followed by a value (N) and then the conditional variable (R). You can create a simplified version of my data as follows:
Groups <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C")
N <- c(1,1,1,1,1,1,1,1,1,1)
R <- c("N", "N", "Y", "N", "N", "N", "Y", "N", "N", "N")
Dat <- as.data.frame(cbind(Groups, N, R))
What I need is this: when R == "Y", that row and every later row in the same group should have 1 added to the N variable. So the solution should look like this:
Groups N R
1 A 1 N
2 A 1 N
3 A 2 Y
4 A 2 N
5 B 1 N
6 B 1 N
7 B 2 Y
8 B 2 N
9 C 1 N
10 C 1 N
So the loop needs to restart with each new group. Ideally, a solution within dplyr is preferred but I have not been able to find one yet.
Any help or guidance would be much appreciated!
Within each group, take the cumulative sum of the logical vector R == "Y" and add it to N:
library(dplyr)
Dat %>%
  group_by(Groups) %>%
  mutate(N = cumsum(R == "Y") + N) %>%
  ungroup()
Output:
# A tibble: 10 × 3
Groups N R
<chr> <dbl> <chr>
1 A 1 N
2 A 1 N
3 A 2 Y
4 A 2 N
5 B 1 N
6 B 1 N
7 B 2 Y
8 B 2 N
9 C 1 N
10 C 1 N
Data:
Dat <- data.frame(Groups, N, R)
# NOTE: cbind() builds a matrix, and a matrix can hold only one type, so N would become character. Create the data frame directly with data.frame() instead.
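You may do this with the help of match(), which returns the position of the first "Y" in each group. If a group has no "Y", nomatch = n() + 1 puts that position past the last row, so nothing is added.
library(dplyr)
Dat %>%
  group_by(Groups) %>%
  mutate(N = N + as.integer(row_number() >= match('Y', R, nomatch = n() + 1))) %>%
  ungroup()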
# Groups N R
# <chr> <dbl> <chr>
# 1 A 1 N
# 2 A 1 N
# 3 A 2 Y
# 4 A 2 N
# 5 B 1 N
# 6 B 1 N
# 7 B 2 Y
# 8 B 2 N
# 9 C 1 N
#10 C 1 N
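The two answers agree on the sample data but differ when a group contains more than one "Y": cumsum() adds 1 for every "Y" seen so far, while the match() version adds exactly 1 from the first "Y" onward. A base R sketch on a single made-up group (the vectors below are illustrative, not from the question); cummax() is included as one more way to get the "at most 1" behaviour:

```r
# One hypothetical group with two "Y" rows
R <- c("N", "Y", "N", "Y", "N")
N <- rep(1, 5)

by_cumsum <- N + cumsum(R == "Y")   # counts every "Y" so far
by_cummax <- N + cummax(R == "Y")   # switches to +1 at the first "Y" and stays there

first_y  <- match("Y", R, nomatch = length(R) + 1)
by_match <- N + as.integer(seq_along(R) >= first_y)

by_cumsum
#> [1] 1 2 2 3 3
by_cummax
#> [1] 1 2 2 2 2
by_match
#> [1] 1 2 2 2 2
```

Which behaviour is right depends on whether a second "Y" in the same group should increase N again.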

R dplyr calculating group and column percentages

I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
The test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), row.names = c(NA, -13L), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(a,resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
  group_by(group) %>%
  summarise(no_resp = sum(resp, na.rm = TRUE),
            all = n_distinct(id),
            resp_rate = round(no_resp/all*100)) %>%
  mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <dbl> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
Using base R, we may apply the tapply and table functions.
res <- transform(
  with(test, data.frame(no_resp = tapply(resp, group, sum, na.rm = TRUE),
                        all = colSums(table(id, group) > 0))),
  resp_rate = round(no_resp/all*100),
  overall_perc = prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
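For the ungrouped figure the question asks about (total responses over the distinct count of id), one possible sketch is to run the same summarise() without group_by(). It assumes test holds the data from the question; the data is rebuilt here as a plain tibble so the block runs on its own:

```r
library(dplyr)

test <- tibble(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
            "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by(): every column is summarised over the whole table
overall <- test %>%
  summarise(no_resp = sum(resp, na.rm = TRUE),
            all = n_distinct(id),
            resp_rate = round(no_resp / all * 100))

overall
# no_resp = 8, all = 11, resp_rate = 73
```

Note that the overall distinct count (11) is lower than the sum of the per-group counts (12), because the id "hfg" appears in both group C and group D.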
