R: group_id by changing row values

1) Firstly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
desired_id=c(1,1,1,2,2,2,3,3,3))
How do I generate the desired_id column?
My groups are assigned by row order.
That is, every time the value column changes, I want the next higher group index to be assigned.
I tried df$desired_id_replicate <- df %>% group_by(value) %>% group_indices
but that doesn't work as all value=="a" will be assigned the same group indices.
2) Secondly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
value2=c("a","a","c", "b", "b", "c", "a", "a", "d"),
desired_id=c(1,1,2,3,3,4,5,5,6))
How do I generate the desired_id from the value and value2 columns?
My groups are assigned row-wise again. That is, every time the combination of value and value2 changes, the next higher desired_id should be assigned.
Similar to the above, I tried df$desired_id_replicate <- df %>% group_by(value, value2) %>% group_indices
but that doesn't work as all value=="a"&value2=="a" will be assigned the same group indices.
Thank you!

We can use rleid (run-length-encoding id) from data.table, which increments the id by 1 whenever an element differs from the previous element
library(data.table)
library(dplyr)
df %>%
mutate(newcol = rleid(value))
and for the second dataset, it would be
df %>%
mutate(new = rleid(value, value2))
# value value2 desired_id new
#1 a a 1 1
#2 a a 1 1
#3 a c 2 2
#4 b b 3 3
#5 b b 3 3
#6 b c 4 4
#7 a a 5 5
#8 a a 5 5
#9 a d 6 6
Or with rle from base R (as.character guards against value being a factor, which rle does not accept)
df$newcol <- with(rle(as.character(df$value)), rep(seq_along(values), lengths))
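For the second data frame, the same idea can be sketched in base R without data.table: flag every row whose (value, value2) pair differs from the row above, then take the cumulative sum of the flags. The column name new and the "|" separator are illustrative assumptions; the separator must not occur in the data.

```r
df <- data.frame(
  value  = c("a", "a", "a", "b", "b", "b", "a", "a", "a"),
  value2 = c("a", "a", "c", "b", "b", "c", "a", "a", "d"),
  stringsAsFactors = FALSE
)

# Combine both columns into one key per row (assumes "|" never appears in the data)
key <- paste(df$value, df$value2, sep = "|")

# TRUE for the first row and for every row whose key differs from the previous row
changed <- c(TRUE, key[-1] != key[-length(key)])

# Each TRUE starts a new run, so the running total is the run id
df$new <- cumsum(changed)
df
```

Because only neighbouring rows are compared, a pair that reappears later (such as "a", "a" in rows 7 and 8) gets a fresh id instead of reusing the earlier one.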

Related

How to count the frequency of unique factor across each row in r dataframe

I have a dataset like the following:
Age   Monday  Tuesday  Wednesday
6-9   a       b        a
6-9   b       b        c
6-9           c        a
9-10  c       c        b
9-10  c       a        b
Using R, I want to get the following data set/ results (where each column represents the total frequency of each of the unique factor):
Age a b c
6-9 2 1 0
6-9 0 2 1
6-9 1 0 1
9-10 0 1 2
9-10 1 1 1
Note: My data also contains missing values
A couple of quick and dirty tidyverse solutions; there should be a way to reduce the steps, though.
library(tidyverse) # install.packages("tidyverse")
input <- tribble(
~Age, ~Monday, ~Tuesday, ~Wednesday,
"6-9", "a", "b", "a",
"6-9", "b", "b", "c",
"6-9", "", "c", "a",
"9-10", "c", "c", "b",
"9-10", "c", "a", "b"
)
# pivot solution
input %>%
rowid_to_column() %>%
mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # only character columns; na_if(<integer>, "") errors in recent dplyr
pivot_longer(cols = -c(rowid, Age), values_drop_na = TRUE) %>%
count(rowid, Age, value) %>%
pivot_wider(id_cols = c(rowid, Age), names_from = value, values_from = n, values_fill = list(n = 0)) %>%
select(-rowid)
# manual solution (if only a, b, c are expected as options)
input %>%
unite(col = "combine", Monday, Tuesday, Wednesday, sep = "") %>%
transmute(
Age,
a = str_count(combine, "a"),
b = str_count(combine, "b"),
c = str_count(combine, "c")
)
In base R (with the data stored in a data frame df), we can replace empty values with NA, collect the unique values in the data frame, then use apply row-wise and count the occurrences of each value with table.
df[df == ''] <- NA
vals <- unique(na.omit(unlist(df[-1])))
cbind(df[1], t(apply(df, 1, function(x) table(factor(x, levels = vals)))))
# Age a b c
#1 6-9 2 1 0
#2 6-9 0 2 1
#3 6-9 1 0 1
#4 9-10 0 1 2
#5 9-10 1 1 1
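A vectorised variant of the same count can be sketched with rowSums: for each distinct value, compare the whole day block against it and sum the matches per row. This assumes the missing entry is already NA and that the first column is the only non-day column; the names vals and counts are illustrative.

```r
df <- data.frame(
  Age       = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday    = c("a", "b", NA, "c", "c"),
  Tuesday   = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Distinct non-missing values across the day columns
vals <- sort(unique(na.omit(unlist(df[-1]))))

# One column of counts per value; na.rm = TRUE skips the missing cells
counts <- sapply(vals, function(v) rowSums(df[-1] == v, na.rm = TRUE))

cbind(df["Age"], counts)
```

Missing cells simply contribute nothing to any count, so no separate NA column is produced.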

Organize subgroup strings (text)

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
Using tidyverse, after the group_by use summarise to either paste the words of each group into a single string
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))
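The aggregate attempt from the question is also close: swapping FUN = list for FUN = toString collapses each group into one comma-separated string. A minimal base R sketch, assuming the columns are character rather than factor (hence stringsAsFactors = FALSE); the name df1 follows the question.

```r
df <- data.frame(
  first = c("a", "a", "b", "b", "b", "c"),
  words = c("about", "among", "blue", "but", "both", "cat"),
  stringsAsFactors = FALSE
)

# toString() joins the words of each group with ", "
df1 <- aggregate(words ~ first, data = df, FUN = toString)
df1
```

With factor columns, the same call would paste the integer codes unless words is converted with as.character first.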

How to delete rows if repeated more than 5 times?

I'm really desperately looking for an answer.
I have only one column with duplicated IDs.
I want to end up with this kind of output:
ID
a
a
a
a
a
b
b
b
b
b
So if there are 6 a's, the 6th row should be deleted.
Here are a couple of options. Grouped by the 'ID' column, slice the first 5 rows (with head and row_number())
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(head(row_number(), 5))
or with filter to create a logical expression based on row_number() after grouping by the 'ID' column
df1 %>%
group_by(ID) %>%
filter(row_number() < 6)
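In base R, we can use ave with seq_along to number the rows within each ID and subset on that number. Passing seq_along(ID) instead of the character column ID keeps the result numeric, so the comparison stays correct for groups with ten or more rows.
subset(df, ave(seq_along(ID), ID, FUN = seq_along) <= 5)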
# ID
#1 a
#2 a
#3 a
#4 a
#5 a
#7 b
#8 b
#9 b
#10 b
#11 b
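The row-numbering idea can be wrapped in a small function when the cut-off or the column varies. This is only a sketch: keep_first_n is a hypothetical helper name, not part of base R, and the 12-row group is made-up data to show that larger groups are handled.

```r
# Hypothetical helper: keep at most n rows for each distinct value of column `col`
keep_first_n <- function(data, col, n) {
  # Position of each row within its group, as integers
  pos <- ave(seq_len(nrow(data)), data[[col]], FUN = seq_along)
  data[pos <= n, , drop = FALSE]
}

# Illustrative data: 12 a's and 3 b's
df <- data.frame(ID = rep(c("a", "b"), times = c(12, 3)),
                 stringsAsFactors = FALSE)

out <- keep_first_n(df, "ID", 5)
table(out$ID)
```

drop = FALSE keeps the result a data frame even though there is only one column.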
data
df <- structure(list(ID = c("a", "a", "a", "a", "a", "a", "b", "b",
"b", "b", "b")), class = "data.frame", row.names = c(NA, -11L))

Count number of duplicates in other dataframe

I have two data.frames, A and B. Both of them have a column called key.
Now I'd like to know how many duplicates of each A$key value there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. What's the easiest way to do this?
Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor is needed here so that we also 'count' entries that do not appear in B$key, that is, C.
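To see what the factor step buys, compare a plain table with one built on explicit levels. A short sketch; the names plain and full and the new column n are illustrative.

```r
A <- data.frame(key = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
B <- data.frame(key = c("A", "A", "B", "B", "B", "D"), stringsAsFactors = FALSE)

# Without explicit levels, values absent from B$key are silently dropped
plain <- table(B$key)
names(plain)

# With the levels taken from A$key, absent values show up with a count of 0
full <- table(factor(B$key, levels = A$key))

# The counts line up with the rows of A, so they can be attached directly
A$n <- as.integer(full)
A
```

Setting the levels from A$key also means any value that occurs only in B is ignored rather than counted.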
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
filter(key %in% A$key) %>% # keep values that appear in A
count(key) %>% # count values
complete(key = A$key, fill = list(n = 0)) # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1
Using tidyverse you can do:
A %>%
left_join(B %>% #Merging df A with df B for which the count in "key" was calculated
group_by(key) %>%
tally(), by = c("key" = "key")) %>%
mutate(n = ifelse(is.na(n), 0, n)) #Replacing NA with 0
key n
1 A 2
2 B 3
3 C 0
4 D 1
Do you actually mean how many occurrences of each value of A$key there are in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0

How to add new column to R dataframe based on values in multiple columns

I have created the following dataframe
df <- data.frame(A = 1:5, B = c("A", "B", "C", "B", "C"),
C = c("A", "A", "B", "B", "B"))
I am trying to obtain the duplicated values between columns A and C, as in the following output, and add the corresponding values in column B. The expected dataframe should be
df2<- "B" "Dupvalues"
1 A
4 B
I am unable to do this. I request some help here
df<-data.frame(A = (1:5),
B = c("A","B", "C", "B",'C' ),
C = c("A", "A","B", 'B', "B"), stringsAsFactors = FALSE)
library(dplyr)
df %>%
filter(B == C) %>% # keep rows when B equals C
group_by(A) %>% # for each A
transmute(DupValues = B) %>% # keep the duplicate value
ungroup() # forget the grouping
# # A tibble: 2 x 2
# A DupValues
# <int> <chr>
# 1 1 A
# 2 4 B
Note that this works if your variables are character variables, not factors.
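The same filter can be sketched in base R. Wrapping both columns in as.character is an added precaution, not part of the answer above: it makes the comparison work even when B and C are factors with different level sets. The name df2 follows the question.

```r
df <- data.frame(A = 1:5,
                 B = c("A", "B", "C", "B", "C"),
                 C = c("A", "A", "B", "B", "B"),
                 stringsAsFactors = FALSE)

# Rows where B and C hold the same value, compared as plain strings
same <- as.character(df$B) == as.character(df$C)

# Keep A and the matching value, then rename the value column
df2 <- df[same, c("A", "B")]
names(df2)[2] <- "DupValues"
df2
```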
