There is a similar question for data.table, "Replace sets of rows in a data.table with a single row", but I am looking for an equivalent solution in the tidyverse. I have a tibble like this:
DT <- tibble (
id=c(1,1,1,1,2,2,2,2),
location = c("a","b","c","d","a","b","d","e"),
seq = c(1,2,3,4,1,2,3,4))
For every id, I want to look for the sequence b, c, d. If it occurs, I want to replace the b and c rows with a single row, let's say z. The other variables (in this case id and seq) should keep the values from the b row.
So in this case, the new tibble should be
DT.Tobe <- tibble (
id=c(1,1,1,2,2,2,2),
location = c("a","z","d","a","b","d","e"),
seq = c(1,2,4,1,2,3,4))
I was not able to find even a starting point for this...
library(dplyr)
# library(zoo) # rollapply
DT %>%
  group_by(id) %>%
  mutate(
    isseq = zoo::rollapply(location, 3, FUN = function(z) identical(z, c("b", "c", "d")), align = "left", partial = TRUE),
    isseq = isseq | lag(isseq, default = FALSE)
  ) %>%
  group_by(id, isseq) %>%
  summarize(
    across(everything(), ~ {
      if (cur_group()$isseq) {
        if (cur_column() == "location") "z" else first(.)
      } else .
    })
  ) %>%
  ungroup() %>%
  select(-isseq)
# # A tibble: 7 x 3
# id location seq
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 d 4
# 3 1 z 2
# 4 2 a 1
# 5 2 b 2
# 6 2 d 3
# 7 2 e 4
The order changes because group_by(id, isseq) keeps rows with the same isseq value together. This is easy to fix, either by re-ordering afterwards (assuming "seq" is meaningful) or by adding an order variable beforehand and sorting on it at the end.
If a single id can contain several such sequences (if so, say something), then run-length encoding will be needed here as well, to differentiate between different b-c-d sequences in the same id.
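If several b-c-d runs can occur within one id, one alternative to run-length encoding is to flag the starting b of each run directly with lead(). This is only a sketch and not the approach used in the answer above: it assumes the rows are already ordered within each id, and the helper columns `starts` and `is_c` are names made up for illustration.

```r
library(dplyr)

DT <- tibble(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  location = c("a", "b", "c", "d", "a", "b", "d", "e"),
  seq = c(1, 2, 3, 4, 1, 2, 3, 4)
)

res <- DT %>%
  group_by(id) %>%
  mutate(
    # TRUE on a "b" that is immediately followed by "c" and then "d"
    starts = location == "b" &
      lead(location, 1, default = "") == "c" &
      lead(location, 2, default = "") == "d",
    # TRUE on the "c" that comes right after such a "b"
    is_c = lag(starts, default = FALSE)
  ) %>%
  ungroup() %>%
  # relabel the "b" row as "z" and drop the "c" row
  mutate(location = if_else(starts, "z", location)) %>%
  filter(!is_c) %>%
  select(-starts, -is_c)

res
```

Because the data is never regrouped by the flag, the original row order is preserved, and every b-c-d run in an id is handled independently.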
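One possible option would be to use a for-loop. Here is a sketch, assuming the rows of each id are stored next to each other:
keep <- rep(TRUE, nrow(DT))                      # rows to retain
for (i in seq_len(max(nrow(DT) - 2, 0))) {       # every row that has two rows after it
  same_id <- DT$id[i] == DT$id[i + 2]            # the run must stay within one id
  if (same_id && identical(DT$location[i:(i + 2)], c("b", "c", "d"))) {
    DT$location[i] <- "z"                        # replace item b with z
    keep[i + 1] <- FALSE                         # mark item c's row for deletion
  }
}
DT <- DT[keep, ]                                 # drop the marked rows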
The dplyr cheat sheet has some useful functions that may help with finding the right tools for the if-statement part of this loop.
What are your thoughts?
I have two dataframes df1 and df2 which I have merged together into another dataframe df3
df1 <- data.frame(
Name = c("A", "B", "C"),
Value = c(1, 2, 3),
Method = c("Indirect"))
df2 <- data.frame(
Name = c("A", "B"),
Value = c(4, 5),
Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 contains the three "Indirect" rows (A, B, C) followed by the two "Direct" rows (A, B).
Now I need to identify every entry in the Name column that occurs only once (C in this case). For each such entry, a row is to be added with the same "Name", a "Value" of 0, and the opposite "Method". For this example, that means adding the row C, 0, Direct.
Finally, rows with the same "Name" are to be arranged one below the other.
I have a huge dataframe and I need to achieve the above mentioned outcome in the most efficient way in R. How do I proceed?
One way:
# keep only the rows whose Name occurs exactly once
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
# flip the method
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")
tmp
#   Name Value Method
# 3    C     0 Direct
You can now rbind this to your original data (and sort it).
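The final rbind-and-sort step could look like the following sketch. It rebuilds the example data so that it runs on its own; the name `out` is chosen here for illustration only.

```r
df1 <- data.frame(Name = c("A", "B", "C"), Value = c(1, 2, 3), Method = "Indirect")
df2 <- data.frame(Name = c("A", "B"), Value = c(4, 5), Method = "Direct")
df3 <- rbind(df1, df2)

# rows whose Name occurs exactly once, with Value zeroed and Method flipped
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")

# append the new rows, then sort so that equal names sit next to each other
out <- rbind(df3, tmp)
out <- out[order(out$Name), ]
rownames(out) <- NULL
out
```

Sorting on Name alone leaves rows with the same name in their original relative order, so the added zero-valued row ends up directly below the row it complements.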
Please find another solution using data.table
Reprex
Code
library(data.table)
library(magrittr) # for the pipe!
setDT(df3)
df3 <- rbindlist(list(df3,
df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)
I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
add_count(Name) |>
filter(n == 1) |>
mutate(
Value = 0,
Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
n = NULL
) |>
bind_rows(df3) |>
arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect
I'm trying to write a function for the first time. It's supposed to split a string into several pieces and return each piece in its own tibble row.
For example, let's say I have this kind of data.
nasty_entry <- tibble(ID = 1:3, Var = c("ABC", "AB", "A"))
I would like to get that.
nice_entry <- tibble(ID = c(1, 1, 1, 2, 2, 3), var = c("A", "B", "C", "A", "B", "A"))
So, I tried to write a function using different kinds of loops (for practice), because my original data has about 300 entries.
nice_entry <- function(data, var, pattern)
#--------------------DECLARATION--------------------#
# data : The tibble containing the data to split.
# var : The variable containing the data to split.
# pattern : The pattern to use for the spliting.
if(!require(tidyverse)){install.packages("tidyverse")}
library(tidyverse)
if(!require(magrittr)){install.packages("tidyverse")}
library(magrittr)
c1 <- 0 # Reset the counter #1
c2 <- 0 # Reset the counter #2
unchanged_rows <- 0 # The number of rows that has been unchanged.
changed_rows <- 0 # The number of rows that has been changed.
new_data <- tibble() # The tibble where the data will be stored.
repeat{
c1 <- c1 +1 # Increase the counter #1 by one at each loop.
c2 <- 0 # Reset the counter #2 at each loop.
# Split the string into several strings.
splited_str <- str_split(string = data %>% select({{ var }}) %>% slice(c1), pattern = pattern) %>%
unlist()
# Add the row into the "new_data" variable if the original string hasn't been splited.
if(length(splited_str) <= 1) {
unchanged_rows <- unchanged_rows +1
new_data <- new_data %>%
bind_rows(slice(data, c1))
next
}
# Duplicate the row of the original string. It duplicates it several times according to the
# number of times the original string has been splited.
if(length(splited_str) > 1){
changed_rows <- changed_rows +1
duplicated_rows <- data %>%
slice(rep(c1, each = length(splited_str)))
# Replace each original string with the new splited strings.
while (c2 < length(splited_str)) {
c2 <- c2 +1
duplicated_rows <- duplicated_rows %>%
mutate({{ var }} = replace(x = {{ var }}, list = c2, values = splited_str[c2]))
new_data <- new_data %>%
bind_rows(slice(duplicated_rows, c2))
}
}
# Break the loop if the entire tibble has been analyse and return the "new_data" variable.
if(c1 == length(nrow(data))) {
break
return(new_data)
}
}
}
I tried the same code by using "real variables" inside the loops and it seems to work. The problem comes when I embrace them into the function. I get this error.
Error: object 'c1' not found
}
Error: unexpected '}' in " }"
}
Error: unexpected '}' in "}"
What am I doing wrong? Maybe it's an indexing problem?
I would also like some advice on writing functions, and to know whether there are alternative ways to do the same thing.
Thank you very much!
Mathieu
Here is another approach you may want to try:
library(tidyverse)
nasty_entry2 <- nasty_entry %>%
mutate(Var = strsplit(as.character(Var), "")) %>%
tidyr::unnest(Var)
# A tibble: 6 x 2
# ID Var
# <int> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 2 A
# 5 2 B
# 6 3 A
We can use separate_rows. Specify a regex lookaround to match the position between two characters. The . in a regex matches any character, so this basically splits between every pair of adjacent characters.
library(dplyr)
library(tidyr)
nasty_entry %>%
separate_rows(Var, sep="(?<=.)(?=.)")
# A tibble: 6 x 2
# ID Var
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 2 A
#5 2 B
#6 3 A
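Since the question asked for a reusable function, either of the approaches above can be wrapped in one. The following is a minimal sketch, assuming stringr's str_split and tidyr's unnest; `split_rows` is a hypothetical helper name, not a function from any package.

```r
library(dplyr)
library(tidyr)
library(stringr)

# split_rows() is a hypothetical helper, written for illustration
split_rows <- function(data, var, pattern = "") {
  data %>%
    # `:=` is required when the column name on the left-hand side is embraced
    mutate({{ var }} := str_split({{ var }}, pattern)) %>%
    # expand the list-column so that each piece gets its own row
    unnest({{ var }})
}

nasty_entry <- tibble(ID = 1:3, Var = c("ABC", "AB", "A"))
nice_entry <- split_rows(nasty_entry, Var)
nice_entry
```

Writing `mutate({{ var }} = ...)` with a plain `=` is not valid R syntax, which is why the embraced name has to be paired with `:=`. The vectorised verbs also remove the need for explicit counters and loops.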
I have a data set as follows:
a <- data.frame(X1="A", X2="B", X3="C", X4="D", X5="0",
X6="0", X7="0", X8="0", X9="0", X10="0")
Basically it is a data.frame with 1 row and 10 columns.
The resulting data.frame should have the column elements of a as rows rather than columns, and any columns in a that are equal to "0" should not be present in the new data.frame. For example:
# b
# [1] A
# [2] B
# [3] C
# [4] D
Use a transpose and subset with a logical condition:
data.frame("b" = t(a)[t(a) != 0])
On a second look, the transpose is not needed:
data.frame("b" = a[a != 0])
You could unlist and then subset
subset(data.frame(b = unlist(a), row.names = NULL), b != 0)
# b
#1 A
#2 B
#3 C
#4 D
Using the pivot_longer function, you can reshape your dataframe into a longer format and then filter out the values that are "0". With the column_to_rownames function from the tibble package, you can turn the first column into row names.
Altogether, you can do something like this:
library(tidyr)
library(dplyr)
library(tibble)
a %>% pivot_longer(everything(), names_to = "Row", values_to = "b") %>%
filter(b != "0") %>%
column_to_rownames("Row")
#    b
# X1 A
# X2 B
# X3 C
# X4 D
I have some data that I would like to investigate, and I would like to pull out all features (columns) that have a certain number of unique values, whether that's 2, 5, 10, etc.
I'm not sure how to go about doing this though.
For example :
tst = data.frame(
a = c(1,1,1,0,0),
b = c(1,2,3,3,3),
c = c(1,2,3,4,4),
d = c(1,2,3,4,5)
)
tst
tst %>%
filter(<variables with x unique values>)
Where x=2 would just filter to a, x=3 filter to b, etc
You can use select_if with the n_distinct function.
tst %>%
select_if(~n_distinct(.) == 2)
# a
# 1 1
# 2 1
# 3 1
# 4 0
# 5 0
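In dplyr 1.0.0 and later, select_if is superseded by select() combined with where(). A sketch of the same idea with the number of distinct values as a parameter follows; `select_n_unique` is a hypothetical helper name used only for illustration.

```r
library(dplyr)

tst <- data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)

# select_n_unique() is a hypothetical helper, written for illustration
select_n_unique <- function(data, x) {
  # keep the columns that have exactly x distinct values
  select(data, where(~ n_distinct(.x) == x))
}

names(select_n_unique(tst, 2))
# "a"
names(select_n_unique(tst, 3))
# "b"
```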
Here is one way in base R:
x <- 2
# count the distinct values in each column and keep the columns with exactly x of them
tst[, sapply(tst, function(col) length(unique(col))) == x, drop = FALSE]
This example code creates a variable that combines a, b, c and d, identifies which combinations are duplicates, and returns only the combinations that are not duplicates. I hope this is what you were asking for...
library(dplyr)
library(tidyr) # unite()
tst = data.frame(
a = c(1,1,1,0,0),
b = c(1,2,3,3,3),
c = c(1,2,3,4,4),
d = c(1,2,3,4,5)
)
tst %>%
  unite(new, a, b, c, d, sep = "") %>%
  mutate(duplicate = duplicated(new)) %>%
  filter(!duplicate)
I'm looking to subset rows by the value of the next row for one column.
df <- data.frame(t = c(1,2,3,4,5,6,7,8),
b = c(1,2,1,0,1,0,1,2))
So I want to subset df and get the rows where b == 1 that are immediately followed by a row where b == 2. The subset should therefore return 2 rows (t = 1 and t = 7).
I tried using which and lag from dplyr, as mentioned in other answers, but I couldn't get that to work.
We can get the next value with lead, build a condition that checks whether it is equal to 2 while the current value is 1, and use that expression in filter:
library(dplyr)
df %>%
filter(b == 1, lead(b)==2)
# t b
#1 1 1
#2 7 1
Or use subset from base R
subset(df, c(b[-1] == 2, FALSE) & b == 1)