Paste column content by group into a new group - r

Here is my data frame:
a <- data.frame(x=c(rep("A",2),rep("B",4)),
y=c("AA","BB","CC","AA","DD","AA"))
What I want is group the data frame by x and for each member of the group (here A or B), I would like to paste the content of column y into a single element, separated by _. I would like to sort it by alphabetical order and remove identical characters. Here is the desired result:
out <- data.frame(x=c(rep("A",1),rep("B",1)),
y=c("AA_BB","AA_CC_DD"))
I tried the following code, which produces an error message:
library(dplyr)
a %>% group_by(x) %>% mutate(y_comb=paste(as.character(sort(unique(y))))) %>%
slice(1) %>% ungroup()

We get the distinct element of 'x', 'y' column (as there is only two columns, simply use distinct on the entire data), then arrange the rows by 'x', 'y' column, grouped by 'x', paste (str_c) the 'y' elements into a single string by collapseing with _
library(dplyr)
library(stringr)
a %>%
distinct %>%
arrange(x, y) %>%
group_by(x) %>%
summarise(y = str_c(y, collapse="_"))
-output
# A tibble: 2 x 2
# x y
#* <chr> <chr>
#1 A AA_BB
#2 B AA_CC_DD
The error in OP's code is because of the difference in length after doing the unique and paste by itself doesn't do anything. We need to either collapse (or sep - in this case it is collapse). mutate is particular about returning the same length as the number of rows of original data while summarise is not

Perhaps we can do like this
a %>%
group_by(x) %>%
summarise(y = paste0(sort(unique(y)), collapse = "_"))
which gives
# A tibble: 2 x 2
x y
<chr> <chr>
1 A AA_BB
2 B AA_CC_DD

Base R option with aggregate :
aggregate(y~x, unique(a), function(x) paste0(sort(x), collapse = '_'))
# x y
#1 A AA_BB
#2 B AA_CC_DD

Related

R: unique column values, combine rows of second column

From a data frame I need a list of all unique values of one column. For possible later check we need to keep information from a second column, though for simplicity combined.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
+ group_by(id) %>%
+ summarise(vlist = paste0(source, collapse = ","))
or
df %>%
+ distinct(id) %>%
+ summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
Your second approach doesn't work because you call distinct before you aggregate the data. Also, you need to use .keep_all = TRUE to also keep the other column.
Your first approach was missing the distinct.
aggregate(source ~ id, df, toString)

R - Identifying only strings ending with A and B in a column

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B at the end, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried to strip the A's and B's at the end and then sum how many times each of those names sum > 2, but again, this does not give me the desired results.
Any suggestions?
We could use substring to get the last character after grouping by substring not including the last character, and check if there are both 'A', and 'B' in the substring
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character from Sample in different column, keep only those values that have both 'A' and 'B' and keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028

Sum duplicated columns in dataframe in R

Hello i have the following dataframe :
colnames(tv_viewing time) <-c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives a the viewing time for an individual on channel 1 and channel 2, for instance for individual 1 i get :
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is actually a dataframe that sums up the values of duplicated columns.
I.e. I would get
colnames(tv_viewing time) <-c("channel_1", "channel_2")
Where for instance for individual 1 i would get :
tv_viewing_time[1,] <- c(5,7)
As all two row entries are summed when they correspond to duplicated column names.
I have looked for an answer but all suggested on other threads did not work for my dataframe case.
Note that there are many more duplicated columns, so i am looking for a solution that can be efficiently applied to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)

Get the unique elements from the data frame by comparing two column values

I have extracted some elements from regrex and combined them. Now in the final df , I have got two columns in a data frame. I have to get unique elements from f1 column based on f2 column.
df <- as.data.frame(rbind(c('11061002','11862192','11083069'),
c(" ",'1234567','452589')))
df$f1 <-paste0(df$V1,
',',
df$V2,
',',
df$V3)
df_1 <- as.data.frame(rbind(c('11862192'),
c('145')))
names(df_1)[1] <-'f2'
df <- as.data.frame(cbind(df,df_1))
df <-df[,c(4,5)]
The expected output is the third column with values :
11061002,11083069 as 11862192 was present in both.
,1234567,452589 as there is not 145 present in second column.
Please guide.
You can split the string on , in f1 and use setdiff to get values that are not present in f2 after removing empty values.
mapply(function(x, y) toString(setdiff(x[x!=' '], y)),
strsplit(df$f1, ','), df$f2)
#[1] "11061002, 11083069" "1234567, 452589"
If there could be multiple comma-separated values in f2, we can split f2 as well.
mapply(function(x, y) toString(setdiff(x[x!=' '], y)),
strsplit(df$f1, ','), strsplit(df$f2, ','))
We can use tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(f1, f2) %>%
group_by(rn)%>%
summarise(new = toString(setdiff(setdiff(f1, f2), ""))) %>%
select(-rn) %>%
bind_cols(df, .)
# A tibble: 2 x 6
# V1 V2 V3 f1 f2 new
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 "11061002" 11862192 11083069 "11061002,11862192,11083069" 11862192 11061002, 11083069
#2 " " 1234567 452589 " ,1234567,452589" 145 1234567, 452589

issues calculating rowwise maximum

suppose I have a tibble dat below, what I would like to do is to calculate maximum of (x 2, x 3) and then minus x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. tried many things, seems not working as I wanted them to, still struggling with the string vs column name thing..
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package to convert the data frame from the wide format to long format. I used gather twice because the first column of each group is different than other columns in the same group. ends_with(" 1") is a select helper function from the dplyr, which select the column with a name ending in " 1". Notice that the space in " 1" is important because "1" may select other columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, which similar to Solution 2. It uses do to conduct operation to each Group hence making the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(data_frame(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5

Resources