Combining rows with repeated ids by means in R

I would like to collapse the following data frame so that rows with a repeated id are combined into a single row holding the mean of each column.
id V1 V2
AA 21.76410 1
BB 25.57568 0
BB 20.91222 0
CC 21.71828 1
CC 22.89878 1
FF 22.20535 0
structure(list(id = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("AA",
"BB", "CC", "FF"), class = "factor"), V1 = c(21.7640981693372,
25.575675904744, 20.9122208946358, 21.7182828011676, 22.8987775530191,
22.2053520672232), V2 = c(1, 0, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA,
-6L))
After reducing by the mean, it should look like this:
id V1 V2
AA 21.76410 1
BB 23.24395 0 # mean reduction for BB in V1 and V2
CC 22.30853 1 # same as above
FF 22.20535 0
structure(list(id = structure(1:4, .Label = c("AA", "BB", "CC",
"FF"), class = "factor"), V1 = c(21.7641, 23.24395, 22.30853,
22.20535), V2 = c(1, 0, 1, 0)), class = "data.frame", row.names = c(NA,
-4L))
How can I do that in R?
Any package function or custom function code you want to share with me would be of great help.
Thanks.

With base R, it can be done with aggregate
aggregate(.~ id, df1, mean)
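A minimal self-contained sketch of the call above, using the data from the question. The name df1 is taken from the answer, and res is a name chosen here for illustration. Ids are built as plain strings rather than factors, which should not change the means.

```r
# Data from the question; the name df1 matches the aggregate() call above.
df1 <- data.frame(
  id = c("AA", "BB", "BB", "CC", "CC", "FF"),
  V1 = c(21.7640981693372, 25.575675904744, 20.9122208946358,
         21.7182828011676, 22.8987775530191, 22.2053520672232),
  V2 = c(1, 0, 0, 1, 1, 0)
)

# Average every non-id column within each id.
res <- aggregate(. ~ id, data = df1, FUN = mean)
print(res)
```

Each repeated id ends up on one row, and ids that appear only once keep their original values because the mean of a single value is that value.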

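Using dplyr:
library(dplyr)
# Replace V1 with its group mean, then drop the duplicated rows.
# V2 is left untouched, which works here because V2 is constant within each id.
df %>%
group_by(id) %>%
mutate(V1 = ifelse(n() > 1, mean(V1), V1)) %>%
unique()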
# A tibble: 4 x 3
# Groups: id [4]
# id V1 V2
#<fct> <dbl> <dbl>
#1 AA 21.8 1
#2 BB 23.2 0
#3 CC 22.3 1
#4 FF 22.2 0

Another way of using aggregate from base R
dfout <- aggregate(df[-1],df[1],FUN = mean)
such that
> dfout
id V1 V2
1 AA 21.76410 1
2 BB 23.24395 0
3 CC 22.30853 1
4 FF 22.20535 0

Related

Merging data frame and filling missing values [duplicate]

I want to merge the following 3 data frames and fill the missing values with -1. I think I should use the merge() function, but I don't know exactly how to do it.
> df1
Letter Values1
1 A 1
2 B 2
3 C 3
> df2
Letter Values2
1 A 0
2 C 5
3 D 9
> df3
Letter Values3
1 A -1
2 D 5
3 B -1
The desired output would be:
Letter Values1 Values2 Values3
1 A 1 0 -1
2 B 2 -1 -1 # fill missing values with -1
3 C 3 5 -1
4 D -1 9 5
code:
> dput(df1)
structure(list(Letter = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Values1 = c(1, 2, 3)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(Letter = structure(1:3, .Label = c("A", "C", "D"
), class = "factor"), Values2 = c(0, 5, 9)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df3)
structure(list(Letter = structure(c(1L, 3L, 2L), .Label = c("A",
"B", "D"), class = "factor"), Values3 = c(-1, 5, -1)), class = "data.frame", row.names = c(NA,
-3L))

You can put the data frames in a list and use merge with Reduce. Missing values in the merged data frame can then be replaced with -1.
new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1
new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5
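A tidyverse way with the same logic:
library(dplyr)
library(purrr)
library(tidyr)  # for replace_na()
list(df1, df2, df3) %>%
reduce(full_join) %>%
# only the numeric columns can take -1; Letter has no missing values
mutate(across(where(is.numeric), ~ replace_na(.x, -1)))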
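To make the Reduce step less opaque, here is a base R sketch showing that it amounts to two nested merge calls. The names nested, folded and num_cols are chosen here for illustration, and the join column is named explicitly with by = "Letter", which the original call leaves implicit. The NA replacement is restricted to numeric columns as a precaution, not because the original answer requires it.

```r
# Data from the question, with Letter as plain strings.
df1 <- data.frame(Letter = c("A", "B", "C"), Values1 = c(1, 2, 3))
df2 <- data.frame(Letter = c("A", "C", "D"), Values2 = c(0, 5, 9))
df3 <- data.frame(Letter = c("A", "D", "B"), Values3 = c(-1, 5, -1))

# Reduce() folds merge() over the list from left to right ...
folded <- Reduce(function(x, y) merge(x, y, by = "Letter", all = TRUE),
                 list(df1, df2, df3))

# ... which is the same as nesting the calls by hand.
nested <- merge(merge(df1, df2, by = "Letter", all = TRUE),
                df3, by = "Letter", all = TRUE)

# Fill missing values with -1 in the numeric columns only.
num_cols <- vapply(nested, is.numeric, logical(1))
nested[num_cols][is.na(nested[num_cols])] <- -1
print(nested)
```

With all = TRUE every Letter found in any input is kept, so D survives even though it is absent from df1.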

Here's a dplyr solution:
library(dplyr)
library(tidyr)  # for replace_na()
df1 %>%
full_join(df2, by = "Letter") %>%
full_join(df3, by = "Letter") %>%
mutate_if(is.numeric, function(x) replace_na(x, -1))
output:
Letter Values1 Values2 Values3
<chr> <dbl> <dbl> <dbl>
1 A 1 0 -1
2 B 2 -1 -1
3 C 3 5 -1
4 D -1 9 5

How to sum df when it contains characters?

I am trying to prep my data and I am stuck on one issue. Let's say I have the following data frame:
df1
Name C1 Val1
A a x1
A a x2
A b x3
A c x4
B d x5
B d x6
...
and I want to narrow down the df to
df2
Name C1 Val
A a,b,c x1+x2+x3+x4
B d x5+x6
...
where a, b, c and d are character values and x1 to x6 are numeric values.
I have been trying sapply, rowsum and
df2<- aggregate(df1, list(df1[,1]), FUN= summary)
but none of them collects the character values into one entry for each Name.
Can someone help me get df2?

# sum numeric columns; otherwise list the distinct values
m <- function(x) if (is.numeric(x <- type.convert(x, as.is = TRUE))) sum(x) else toString(unique(x))
aggregate(. ~ Name, df1, m)
Name C1 Val1
1 A a, b, c 10
2 B d 11
where
df1
Name C1 Val1
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B d 5
6 B d 6
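If guessing each column's type inside one function feels fragile, a sketch of an alternative is to aggregate each column with its own function and merge the pieces. This is not the method from the answer above; the names labels and totals are chosen here for illustration, and Val1 is assumed to hold the numbers 1 to 6 as in the example data.

```r
# Example data as shown above, with Val1 already numeric.
df1 <- data.frame(
  Name = c("A", "A", "A", "A", "B", "B"),
  C1   = c("a", "a", "b", "c", "d", "d"),
  Val1 = 1:6
)

# Collapse the distinct character values per Name.
labels <- aggregate(C1 ~ Name, data = df1,
                    FUN = function(x) paste(unique(x), collapse = ","))

# Sum the numeric values per Name.
totals <- aggregate(Val1 ~ Name, data = df1, FUN = sum)

# Put both summaries side by side.
df2 <- merge(labels, totals, by = "Name")
print(df2)
```

The cost is one aggregate call per kind of summary, but each function only ever sees the column type it was written for.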

This is your data frame, with the numbers 1 to 6 in Val1:
df <-
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), C1 = structure(c(1L, 1L, 2L, 3L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Val1 = 1:6), row.names = c(NA,
-6L), class = "data.frame")
Then we just use summarise from dplyr:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(C1=paste(unique(C1),collapse=","),Val1=sum(Val1))
# A tibble: 2 x 3
Name C1 Val1
<fct> <chr> <int>
1 A a,b,c 10
2 B d 11

Quick and easy dplyr solution:
library(dplyr)
library(stringr)
df1 %>%
mutate(Val1_num = as.numeric(str_extract(Val1, "\\d+"))) %>%
group_by(Name) %>%
summarise(C1 = paste(unique(C1), collapse = ","),
Val1 = paste(unique(Val1), collapse = "+"),
Val1_num = sum(Val1_num))
#> # A tibble: 2 x 4
#> Name C1 Val1 Val1_num
#> <chr> <chr> <chr> <dbl>
#> 1 A a,b,c x1+x2+x3+x4 10
#> 2 B d x5+x6 11
Or in base:
df2 <- aggregate(df1, list(df1[,1]), FUN = function(x) {
if (all(grepl("\\d", x))) {
sum(as.numeric(gsub("[^[:digit:]]", "", x)))
} else {
paste(unique(x), collapse = ",")
}
})
df2
#> Group.1 Name C1 Val1
#> 1 A A a,b,c 10
#> 2 B B d 11
data
df1 <- read.csv(text = "
Name,C1,Val1
A,a,x1
A,a,x2
A,b,x3
A,c,x4
B,d,x5
B,d,x6", stringsAsFactors = FALSE)

Is there a way to capture the sequence of values based on their rank

Hi all, I have a data frame. I need to create additional columns that tell at which position each category appears in ColA. Please refer to the expected output below.
df
ColB ColA
X A>B>C
U B>C>A
Z C>A>B
Expected output
df1
ColB ColA A B C
X A>B>C 1 2 3
U B>C>A 3 1 2
Z C>A>B 2 3 1

We can first split ColA into separate rows, group_by ColB, give each entry a unique row number within its group, and then convert the data into wide format using pivot_wider.
library(dplyr)
library(tidyr)
df %>%
mutate(ColC = ColA) %>%
separate_rows(ColC, sep = ">") %>%
group_by(ColB) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = ColC, values_from = row)
# ColB ColA A B C
# <fct> <fct> <int> <int> <int>
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))

We can do this in base R
df[LETTERS[1:3]] <- t(sapply(regmatches(df$ColA, gregexpr("[A-Z]",
df$ColA)), match, x = LETTERS[1:3]))
df
# ColB ColA A B C
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame",
row.names = c(NA,
-3L))
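The base R answer relies on match returning, for each category, its position within the split string. A small sketch of that idea using strsplit instead of regmatches, under the assumption that categories are separated by ">" and that every row contains each of A, B and C exactly once. The names cats, parts and pos are chosen here for illustration.

```r
# Data from the question, with plain strings instead of factors.
df <- data.frame(ColB = c("X", "U", "Z"),
                 ColA = c("A>B>C", "B>C>A", "C>A>B"))

cats <- c("A", "B", "C")

# Split each sequence into its ordered categories.
parts <- strsplit(df$ColA, ">", fixed = TRUE)

# For every row, look up where each category sits in that row's sequence.
pos <- t(vapply(parts, function(p) match(cats, p), integer(3)))

# One new column per category, holding its position.
df[cats] <- pos
print(df)
```

For the second row, "B>C>A" splits into B, C, A, so A is found at position 3, B at 1 and C at 2.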

Replace a subset of data frame

I have a data frame with some errors:
T item V1 V2
1 a 2 .1
2 a 5 .8
1 b 1 .7
2 b 2 .2
I have another data frame with corrections for items concerning V1 only
T item V1
1 a 2
2 a 6
How do I get the final data frame? Should I use merge or rbind? Note: the actual data frames are big.

An option would be a data.table join on 'T' and 'item', assigning to 'V1' the corresponding 'V1' column (i.V1) from the second dataset:
library(data.table)
setDT(df1)[df2, V1 := i.V1, on = .(T, item)]
df1
# T item V1 V2
#1: 1 a 2 0.1
#2: 2 a 6 0.8
#3: 1 b 1 0.7
#4: 2 b 2 0.2
data
df1 <- structure(list(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b",
"b"), V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L)),
class = "data.frame", row.names = c(NA,
-2L))

This should work -
library(dplyr)
df1 %>%
left_join(df2, by = c("T", "item")) %>%
mutate(
V1 = coalesce(as.numeric(V1.y), as.numeric(V1.x))
) %>%
select(-V1.x, -V1.y)
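For comparison, a base R sketch that updates V1 in place without merge or rbind. It builds a combined key from T and item and uses match to locate the rows to correct. The names key1, key2, idx and found are chosen here for illustration, and the "_" separator assumes that neither key column contains an underscore.

```r
# Data from the question.
df1 <- data.frame(T = c(1L, 2L, 1L, 2L),
                  item = c("a", "a", "b", "b"),
                  V1 = c(2L, 5L, 1L, 2L),
                  V2 = c(0.1, 0.8, 0.7, 0.2))
df2 <- data.frame(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L))

# One key per row, built from the two identifying columns.
key1 <- paste(df1$T, df1$item, sep = "_")
key2 <- paste(df2$T, df2$item, sep = "_")

# Row of df1 that each correction refers to (NA if there is none).
idx <- match(key2, key1)
found <- !is.na(idx)

# Overwrite only the matched rows; everything else stays as it was.
df1$V1[idx[found]] <- df2$V1[found]
print(df1)
```

Note that match returns only the first matching row, so this sketch assumes the key is unique within df1.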

How to add columns that have same name

I have two large data frames and want to add the values of the second one to the columns of the first that have the same name. How do I write the code?
df1
a b c
0 0 0
0 0 0
df2
a c d
1 1 0
0 1 0
what I expected is:
a b c
1 0 1
0 0 1
That is, I want to keep the column names of df1 but add in the values from df2. Thanks for the help. Have a good day.

Works with data.frame
data.frame(lapply(X = split.default(x = cbind(df1, df2),
f = c(names(df1), names(df2))),
FUN = rowSums))[names(df1)]
# a b c
#1 1 0 1
#2 0 0 1
Works with data.frame and matrix
nm = intersect(colnames(df1), colnames(df2))
nm1 = colnames(df1)[!colnames(df1) %in% nm]
m = cbind(df1[, nm1, drop = FALSE], df1[, nm, drop = FALSE] + df2[, nm, drop = FALSE])
colnames(m) = c(nm1, nm)
m[,colnames(df1)]
# a b c
#1 1 0 1
#2 0 0 1
#DATA
df1 = structure(list(a = c(0L, 0L), b = c(0L, 0L), c = c(0L, 0L)),
class = "data.frame",
row.names = c(NA, -2L))
df2 = structure(list(a = 1:0, c = c(1L, 1L), d = c(0L, 0L)),
class = "data.frame",
row.names = c(NA, -2L))
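When both inputs are data frames with the same number of rows in the same order, a shorter sketch is to add only the shared columns and leave the rest of df1 alone. This is a simplification of the second approach above, and the names common and out are chosen here for illustration.

```r
# Data from the question.
df1 <- data.frame(a = c(0L, 0L), b = c(0L, 0L), c = c(0L, 0L))
df2 <- data.frame(a = 1:0, c = c(1L, 1L), d = c(0L, 0L))

# Columns present in both data frames.
common <- intersect(names(df1), names(df2))

# Start from df1 so its column names and order are preserved,
# then add the matching df2 columns element by element.
out <- df1
out[common] <- df1[common] + df2[common]
print(out)
```

Column d is dropped because it does not exist in df1, and column b is unchanged because df2 has nothing to add to it.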
