I have a data frame in R. I would like to get the unique rows based on the first three columns and also append the minimum value of the fourth column for each unique row.
dat <- tibble(
x = c("a", "a", "k", "k"),
y = c("a", "a", "l", "l"),
z = c("e", "e", "m" ,"m"),
t = c("4", "3", "8" ,"9"))
What I would like to see is below.
x y z t
a a e 3
k l m 8
I believe there is a very easy way to do that, but I cannot see it at the moment.
With tidyverse, use group_by with summarise
library(dplyr)
dat %>%
group_by(across(x:z)) %>%
summarise(t = min(t), .groups = 'drop')
-output
# A tibble: 2 × 4
x y z t
<chr> <chr> <chr> <chr>
1 a a e 3
2 k l m 8
Or do an arrange and use distinct
dat %>%
arrange(across(everything())) %>%
distinct(across(x:z), .keep_all = TRUE)
# A tibble: 2 × 4
x y z t
<chr> <chr> <chr> <chr>
1 a a e 3
2 k l m 8
We can call apply() to collect the unique values in each row of the first three columns of dat. We then use duplicated() to flag duplicates and negate it with ! to keep the rows that are not duplicates. which() returns the integer positions of those rows, and we use these integers (unique_rows) to extract the unique rows from dat, so nothing has to be appended. Note that duplicated() keeps the first occurrence of each combination, so t is the group minimum only if dat is already sorted by t (here the first group returns 4, not 3).
unique_rows <- which(!duplicated(apply(dat[, 1:3], 1, unique)))
out <- dat[unique_rows, ]
Output
> out
x y z t
1 a a e 4
3 k l m 8
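Because duplicated() keeps the first occurrence, sorting by t beforehand is one way to make that occurrence the minimum. A minimal base R sketch, assuming t holds numbers stored as text (as in the example); the name sorted is only illustrative, and the key columns are compared directly with duplicated() instead of going through apply():

```r
# Same example data as in the question, built as a plain data frame.
dat <- data.frame(
  x = c("a", "a", "k", "k"),
  y = c("a", "a", "l", "l"),
  z = c("e", "e", "m", "m"),
  t = c("4", "3", "8", "9"),
  stringsAsFactors = FALSE
)

# Sort by the key columns, then by t as a number (assumption: t is numeric text).
sorted <- dat[order(dat$x, dat$y, dat$z, as.numeric(dat$t)), ]

# Keep the first row of each (x, y, z) combination, which now has the smallest t.
out <- sorted[!duplicated(sorted[, c("x", "y", "z")]), ]
out
```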
Another way to deal with this is to take the minimum of the t column and use the remaining columns as grouping variables in the aggregate function.
aggregate(t~., dat, min)
# x y z t
#1 a a e 3
#2 k l m 8
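One caveat for the min()-based answers: in the example, t is stored as character, so min() compares text rather than numbers. That happens to work for single digits but not in general. A small sketch, assuming the values are meant to be numeric (the value "10" is added here only to show the difference):

```r
# Character strings compare as text, so "10" sorts before "8".
t_chr <- c("8", "9", "10")
min_as_text   <- min(t_chr)              # compares as text
min_as_number <- min(as.numeric(t_chr))  # compares as numbers

# Same layout as the example data, but with a two-digit value in t.
dat <- data.frame(
  x = c("a", "a", "k", "k"),
  y = c("a", "a", "l", "l"),
  z = c("e", "e", "m", "m"),
  t = c("4", "3", "8", "10"),
  stringsAsFactors = FALSE
)

# Convert t to numeric first, then take the minimum per (x, y, z) group.
dat$t <- as.numeric(dat$t)
res <- aggregate(t ~ x + y + z, data = dat, FUN = min)
res
```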
Related
Hi, I have two data frames. Where the id matches, I want to replace table a's values with those of table b.
Sample dataset:
a = tibble(id = c(1, 2,3),
type = c("a", "x", "y"))
b= tibble(id = c(1,3),
type =c("d", "n"))
I'm expecting an output like the following:
c= tibble(id = c(1,2,3),
type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
# A tibble: 3 × 2
# id type
# <dbl> <chr>
#1 1 d
#2 2 x
#3 3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
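The same join-then-overwrite idea can be sketched in base R with match(). This is only an illustration, not a replacement for the dplyr answers; the names idx, found and updated are made up for the example, and ids in b that do not occur in a are simply skipped:

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 3), type = c("d", "n"),
                stringsAsFactors = FALSE)

# Position in a of each id from b (NA when the id is missing from a).
idx <- match(b$id, a$id)
found <- !is.na(idx)

# Overwrite only the matched rows; all other rows of a stay as they are.
updated <- a
updated$type[idx[found]] <- b$type[found]
updated
```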
How can I combine rows and keep the information I want from the other columns?
For example:
I want to combine rows with duplicate names, keeping the row that has 'g' in column 'a'. If another column (e.g. column 'b') is NA in that row, the value from the duplicate row should replace the NA.
name a b c
xy w h i
xy g NA k
x m l o
x g q r
z n o p
The result I'm looking for is:
name a b c
xy g h k
x g q r
z n o p
Similar to DPH's solution, but I use filtering to extract the rows containing g:
library(dplyr)
library(tidyr)
df %>%
group_by(name) %>%
fill(where(~ any(is.na(.))), .direction="down") %>%
filter((any(a=="g") & a == "g") | !any(a=="g")) %>%
ungroup()
returns
# A tibble: 3 x 4
name a b c
<chr> <chr> <chr> <chr>
1 xy g h k
2 x g q r
3 z n o p
One possible solution to your task is this (correct me if I got something wrong):
library(tidyverse)
df <- data.frame(name = c("xy", "xy", "x", "x", "z"),
a = c("w", "g", "m", "g", "n") ,
b = c("h", NA, "l", "q", "o"),
c = c("i", "k", "o", "r", "p"))
df %>%
# build grouping
dplyr::group_by(name) %>%
# fill the groups downwards
tidyr::fill(where(is.character), .direction = "down") %>%
# get the last row of each group
dplyr::summarise(across(everything(), ~last(.x))) %>%
# ungroup as this prevents unwanted behaviour down stream
dplyr::ungroup()
# A tibble: 3 x 4
name a b c
<chr> <chr> <chr> <chr>
1 x g q r
2 xy g h k
3 z n o p
There might be a *_join version for this I'm missing here, but I have two data frames, where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
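For comparison, the same join-then-fill logic can be written in base R with merge() and ifelse(). This is a sketch under the assumption that both data frames share the columns id and a, as in the example; m and res are illustrative names:

```r
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"),
                stringsAsFactors = FALSE)

# Left join: keep every row of x; the shared column a becomes a.x and a.y.
m <- merge(x, y, by = "id", all.x = TRUE, suffixes = c(".x", ".y"))

# Take a from x where present, otherwise fall back to the value from y.
m$a <- ifelse(is.na(m$a.x), m$a.y, m$a.x)

res <- m[, c("id", "a", "q")]
res
```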
I have a data frame with two columns and many rows.
The first column is a character vector where each element P is a string concatenating a number (K) of strings with a comma. K is unknown in advance and can vary across rows, such that K = 5 for the first row and K = 3 for the second. The values that are concatenated themselves may or may not be the same across rows, although they do not repeat within a row. We can call these "variable names."
The second column - we can call this "variable values" - is a character vector where each element is also a string concatenating K strings with commas. Importantly, the number of strings concatenated is identical to that of the variable names. Put another way, the variable names column contains a string containing the names of variables and the variable values column contains the values that correspond to the variable names for that row.
Here's a minimal example of my data. Note that the number of substrings in e.g. var_names[i] equals the number in values[i] but need not equal the number in var_names[j]:
# Example data
data <-
data.frame(
var_names = c(
paste("a", "b", "c", "e", "j", sep = ","),
paste("d", "a", "f", sep = ","),
paste("f", "k", "b", "a", sep = ",")
),
values = c(
paste("212", "12", "sfd", "3", "1", sep = ","),
paste("fds", "23", "g", sep = ","),
paste("df", "sdf", "w2", "w", sep = ",")
),
stringsAsFactors = FALSE
)
Given this data, I am trying to create a data frame where each of the unique values in var_names is a column name and the values for each column are based on the corresponding index in values for each row in the data. Specifically, I am looking to produce:
data.frame(a = c("212","23","w"),
b = c("12",NA,"w2"),
c = c("sfd",NA,NA),
d = c(NA,"fds",NA),
e = c("3", NA, NA),
f = c(NA, "g", "df"),
j = c("1", NA, NA),
k = c(NA,NA,"sdf"))
I was able to produce what I wanted using the code below. However, I was wondering whether there might be some function or package that would let me skip some of these steps and accomplish this more quickly. Currently, I use a loop that generates an entire data frame for each row and then combine them into a single data frame. My initial thought was to take the var_val object in my code and use tidyr::pivot_wider() to generate each row's data frame, but that did not work due to a spec error.
# Split variable names and values into a list
# where each element is a row's values/names
vars_name_l <- strsplit(data$var_names, split = ",")
values_l <- strsplit(data$values, split = ",")
# Initialize a list to store each row's
# data frame
combined <- list()
# Loop through each row's data and generate a
# list of data frames
for (i in seq_len(nrow(data))) {
# Get a row's variable names and values into
# a data frame.
var_val <- data.frame(var_names = vars_name_l[[i]],
values = values_l[[i]],
stringsAsFactors = FALSE)
# Create an empty data frame then add variable
# names and the values for the variables, store in
# our list
df <- as.data.frame(matrix(numeric(), nrow = 0, ncol = length(var_val$var_names)))
colnames(df) <- var_val$var_names
df[1, ] <- var_val$values
combined[[i]] <- df
}
# Collapse list to a single data frame (bind_rows() is from dplyr), rearrange
library(dplyr)
result <- bind_rows(combined)
result[ ,order(colnames(result))]
We can do this with bind_rows easily
library(dplyr)
bind_rows(do.call(Map, c(f = setNames, lapply(unname(data)[2:1], strsplit, ","))))
# A tibble: 3 x 8
# a b c e j d f k
#* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 212 12 sfd 3 1 <NA> <NA> <NA>
#2 23 <NA> <NA> <NA> <NA> fds g <NA>
#3 w w2 <NA> <NA> <NA> <NA> df sdf
Or it can be
bind_rows(do.call(Map, c(f = function(x, y)
setNames(as.list(x), y), lapply(unname(data)[2:1], strsplit, ","))))
Or another option is unnest_wider from tidyr
library(tidyr)
library(purrr)
data %>%
mutate_all(strsplit, ",") %>%
transmute(new = map2(values, var_names, ~ set_names(as.list(.x), .y))) %>%
unnest_wider(c(new))
# A tibble: 3 x 8
# a b c e j d f k
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 212 12 sfd 3 1 <NA> <NA> <NA>
#2 23 <NA> <NA> <NA> <NA> fds g <NA>
#3 w w2 <NA> <NA> <NA> <NA> df sdf
Or using rbindlist from data.table
library(data.table)
rbindlist(do.call(Map, c(f = function(x, y)
setNames(as.list(x), y), lapply(unname(data)[2:1], strsplit, ","))),
fill = TRUE)
# a b c e j d f k
#1: 212 12 sfd 3 1 <NA> <NA> <NA>
#2: 23 <NA> <NA> <NA> <NA> fds g <NA>
#3: w w2 <NA> <NA> <NA> <NA> df sdf
We can first get data in separate rows from column var_names and values and then get data in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
separate_rows(var_names, values) %>%
pivot_wider(names_from = var_names, values_from = values) %>%
select(-row)
# a b c e j d f k
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 212 12 sfd 3 1 NA NA NA
#2 23 NA NA NA NA fds g NA
#3 w w2 NA NA NA NA df sdf
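If no extra packages are wanted, the same reshaping can be sketched in base R by naming each row's values and indexing them by the full set of variable names. The object names (names_l, values_l, all_names, rows, mat) are illustrative only:

```r
data <- data.frame(
  var_names = c("a,b,c,e,j", "d,a,f", "f,k,b,a"),
  values    = c("212,12,sfd,3,1", "fds,23,g", "df,sdf,w2,w"),
  stringsAsFactors = FALSE
)

# Split both columns into per-row character vectors.
names_l  <- strsplit(data$var_names, ",", fixed = TRUE)
values_l <- strsplit(data$values, ",", fixed = TRUE)

# Every variable name that occurs anywhere, in alphabetical order.
all_names <- sort(unique(unlist(names_l)))

# Name each row's values, then index by all_names; missing names give NA.
rows <- Map(function(n, v) setNames(v, n)[all_names], names_l, values_l)

# Stack the rows into a matrix and label the columns explicitly.
mat <- do.call(rbind, rows)
dimnames(mat) <- list(NULL, all_names)
result <- as.data.frame(mat, stringsAsFactors = FALSE)
result
```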
I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of column a, grouped by variable c, plus a column with the counts of each b type within each group of c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
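The output above lists zero counts (because b is a factor there) and puts the name before the count. A sketch that matches the format asked for in the question ("2 u", "1 u, 2 r"), written in base R; count_label is a hypothetical helper, and it assumes the types should appear in order of first occurrence within each group:

```r
df <- data.frame(a = c(10, 20, 20, 20, 30),
                 b = c("u", "u", "u", "r", "r"),
                 c = c("a", "a", "b", "b", "b"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: "count name" pairs for the values present in v,
# ordered by first appearance.
count_label <- function(v) {
  tab <- table(factor(v, levels = unique(v)))
  paste(tab, names(tab), collapse = ", ")
}

# tapply() returns one value per group of c, in sorted group order.
df2 <- data.frame(
  c        = sort(unique(df$c)),
  a_m      = as.vector(tapply(df$a, df$c, mean)),
  counts_b = as.vector(tapply(df$b, df$c, count_label)),
  stringsAsFactors = FALSE
)
df2
```

The same helper can also be called inside summarise(), e.g. counts_b = count_label(b), if the dplyr pipeline is preferred.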
Here is another solution using reshape2. The output format may be more convenient to work with, each value of b has its own column with the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by="c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333