Aggregate function to group, count and mean - R

I have a dataset with three variables: a, b and c.
a 45 345
a 45 345
a 34 234
a 35 456
b 45 123
b 65 345
b 34 456
c 23 455
c 54 567
c 34 345
c 87 567
c 67 345
I want to aggregate the data set by a and b and get the count and the mean of c for each group. The desired output is shown below. Is there a function that does both together?
A B numobs c
a 34 1 234
a 35 1 456
a 45 2 345
b 34 1 456
b 45 1 123
b 65 1 345
c 23 1 455
c 34 1 345
c 54 1 567
c 67 1 345
c 87 1 567
numobs is the count and c is the mean value

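We can use dplyr. Note that summarise (rather than mutate) is needed to collapse each group to a single row, as in the expected output:
library(dplyr)
df1 %>%
  group_by(A, B) %>%
  summarise(numobs = n(), C = mean(C))
Or with data.table
library(data.table)
setDT(df1)[, .(numobs = .N, C = mean(C)), by = .(A, B)]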
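Since the question asks about an aggregate function, here is a minimal base R sketch with aggregate(). It assumes the data frame is called df1 with columns A, B and C (names taken from the answer above; the values are retyped from the question). The function passed to FUN returns both the count and the mean, and do.call(data.frame, ...) flattens the resulting matrix column.

```r
# Data retyped from the question; the column names A, B, C are assumed.
df1 <- data.frame(
  A = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"),
  B = c(45, 45, 34, 35, 45, 65, 34, 23, 54, 34, 87, 67),
  C = c(345, 345, 234, 456, 123, 345, 456, 455, 567, 345, 567, 345)
)

# FUN returns two values per group, so aggregate() stores them in a matrix column.
res <- aggregate(C ~ A + B, data = df1,
                 FUN = function(x) c(numobs = length(x), mean = mean(x)))

# Flatten the matrix column into ordinary columns, then rename and sort.
res <- do.call(data.frame, res)
names(res) <- c("A", "B", "numobs", "C")
res <- res[order(res$A, res$B), ]
rownames(res) <- NULL
res
```

The dplyr and data.table versions are usually easier to read; this variant is only useful when no extra packages are available.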

Related

How to stack multiple columns using tidyverse

I have a data frame like this in wide format:
set.seed(1)
df = data.frame(item=letters[1:6], field1a=sample(6,6),field1b=sample(60,6),
field1c=sample(200,6),field2a=sample(6,6),field2b=sample(60,6),
field2c=sample(200,6))
What would be the best way to stack all the a columns together, all the b columns together and all the c columns together, like this:
items fielda fieldb fieldc
a 2 52 121
a 1 44 57
using base R:
cbind(item=df$item,unstack(transform(stack(df,-1),ind=sub("\\d+","",ind))))
item fielda fieldb fieldc
1 a 2 57 138
2 b 6 39 77
3 c 3 37 153
4 d 4 4 99
5 e 1 12 141
6 f 5 10 194
7 a 3 17 97
8 b 4 23 120
9 c 5 1 98
10 d 1 22 37
11 e 2 49 163
12 f 6 19 131
Or you can use the reshape function in Base R:
reshape(df,varying = split(names(df)[-1],rep(1:3,2)),idvar = "item",direction = "long")
item time field1a field1b field1c
a.1 a 1 2 57 138
b.1 b 1 6 39 77
c.1 c 1 3 37 153
d.1 d 1 4 4 99
e.1 e 1 1 12 141
f.1 f 1 5 10 194
a.2 a 2 3 17 97
b.2 b 2 4 23 120
c.2 c 2 5 1 98
d.2 d 2 1 22 37
e.2 e 2 2 49 163
f.2 f 2 6 19 131
You can also rename the columns yourself first (for example field1a becomes fielda.1), so that reshape can split the names automatically:
names(df)=sub("(\\d)(.)","\\2.\\1",names(df))
reshape(df,varying= -1,idvar = "item",direction = "long")
If we are using tidyverse, then gather into 'long' format, do some rearrangements with the column name and spread
library(tidyverse)
out <- df %>%
gather(key, val, -item) %>%
mutate(key1 = gsub("\\d+", "", key),
key2 = gsub("\\D+", "", key)) %>%
select(-key) %>%
spread(key1, val) %>%
select(-key2)
head(out, 2)
# item fielda fieldb fieldc
#1 a 2 57 138
#2 a 3 17 97
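gather() and spread() are superseded in current tidyr, so as a hedged alternative here is a sketch with pivot_longer(), assuming tidyr 1.0.0 or later. The column names are first rewritten to the form fielda_1 so that the special ".value" entry of names_to can pick up the output column name. The "_" separator and the helper column name set are my own choices, not part of the original answer.

```r
library(tidyr)

set.seed(1)
df <- data.frame(item = letters[1:6],
                 field1a = sample(6, 6), field1b = sample(60, 6),
                 field1c = sample(200, 6), field2a = sample(6, 6),
                 field2b = sample(60, 6), field2c = sample(200, 6))

# Rename field1a -> fielda_1, so the part before "_" is the target column.
long_names <- sub("(\\d)(.)$", "\\2_\\1", names(df))
wide <- setNames(df, long_names)

# ".value" turns the first name part into output columns; "set" holds the digit.
out <- pivot_longer(wide, cols = -item,
                    names_to = c(".value", "set"), names_sep = "_")
out$set <- NULL
head(out, 2)
```

Keeping the set column instead of dropping it preserves which original block (1 or 2) each row came from.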
Or a similar option is melt/dcast from data.table, where we melt into 'long' format, substring the 'variable' and then dcast to 'wide' format
library(data.table)
dcast(melt(setDT(df), id.var = "item")[, variable := sub("\\d+", "", variable)
], item + rowid(variable) ~ variable, value.var = 'value')[
, variable := NULL][]
# item fielda fieldb fieldc
# 1: a 2 57 138
# 2: a 3 17 97
# 3: b 6 39 77
# 4: b 4 23 120
# 5: c 3 37 153
# 6: c 5 1 98
# 7: d 4 4 99
# 8: d 1 22 37
# 9: e 1 12 141
#10: e 2 49 163
#11: f 5 10 194
#12: f 6 19 131
NOTE: This should also work when the lengths are not balanced across cases.
data
set.seed(1)
df = data.frame(item = letters[1:6],
field1a=sample(6,6),
field1b=sample(60,6),
field1c=sample(200,6),
field2a=sample(6,6),
field2b=sample(60,6),
field2c=sample(200,6))

How to identify data which does not show link between two data sets? [duplicate]

Dataset1:
id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1
Dataset2:
id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f
I have two data sets, linked by id1 and id2. How can I identify the rows in each data set that fail to link to the other?
We can use anti_join from the dplyr package to filter the rows with no match.
library(dplyr)
Dataset1_anti <- Dataset1 %>% anti_join(Dataset2, by = c("id1", "id2"))
Dataset1_anti
# id1 id2 abc n
# 1 2 121 no 1
# 2 3 122 yes 2
Dataset2_anti <- Dataset2 %>% anti_join(Dataset1, by = c("id1", "id2"))
Dataset2_anti
# id1 id2 age gen
# 1 2 1 52 f
# 2 121 122 41 f
# 3 121 122 44 m
# 4 4 221 56 m
DATA
Dataset1 <- read.table(text = "id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1 ",
header = TRUE, stringsAsFactors = FALSE)
Dataset2 <- read.table(text = "id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f ",
header = TRUE, stringsAsFactors = FALSE)
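For readers without dplyr, the same unmatched rows can be found in base R. This is a minimal sketch that recreates the Dataset1 and Dataset2 objects from above so that it runs on its own; the helper name make_key and the "|" separator are illustrative choices.

```r
Dataset1 <- read.table(text = "id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1", header = TRUE, stringsAsFactors = FALSE)

Dataset2 <- read.table(text = "id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f", header = TRUE, stringsAsFactors = FALSE)

# Build one composite key per row from the two linking columns.
make_key <- function(d) paste(d$id1, d$id2, sep = "|")

# Keep the rows whose key does not occur in the other data set.
Dataset1_only <- Dataset1[!make_key(Dataset1) %in% make_key(Dataset2), ]
Dataset2_only <- Dataset2[!make_key(Dataset2) %in% make_key(Dataset1), ]

Dataset1_only
Dataset2_only
```

The separator only needs to be a string that cannot appear inside the id values; with numeric ids any non-digit character works.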

Splitting columns of a dataframe to merge a repetitive variable

I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe in which one column contains repeated values. I would like to reshape it so that each value appears only once in the first column, with the other columns split into additional columns (so the result has more columns than the original dataframe).
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
It would turn into a dataframe with 5 rows and 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?
I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!
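A base R alternative, sketched under the assumption that the occurrences of each test value should be numbered in row order: once such an occurrence index exists, reshape() can spread time and score into separate columns. The helper column occ and the seed are illustrative additions, not part of the question.

```r
set.seed(42)  # seed added only to make the example reproducible
df <- data.frame(test = rep(1:5, 3),
                 time = sample(1:100, 15),
                 score = sample(1:500, 15))

# Number the 1st, 2nd, 3rd occurrence of each test value.
df$occ <- ave(df$test, df$test, FUN = seq_along)

# Spread time and score into time1..time3 and score1..score3.
wide <- reshape(df, idvar = "test", timevar = "occ",
                direction = "wide", sep = "")

# reshape() interleaves the columns, so reorder them as asked in the question.
wide <- wide[, c("test", paste0("time", 1:3), paste0("score", 1:3))]
wide
```

Unlike the split/cbind approach, this does not require every test value to occur the same number of times; missing occurrences are filled with NA.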

Transposing of Dataset

My dataset looks like this:
A B C B E
1 144 119 120 52
2 102 44 97 40
3 128 81 88 39
Now I want to transpose the dataset in the following format:
A Vars Values
1 B 43
2 B 78
3 B 110
1 C 46
2 C 49
3 C 130
1 B 39
2 B 86
3 B 143
1 E 59
2 E 134
3 E 49
But when I use the following code, one of the duplicated variables is missing from the result.
df_transpose<-reshape2::melt(df,id.vars="A")
A Vars Values
1 B 43
2 B 78
3 B 110
1 C 46
2 C 49
3 C 130
1 E 59
2 E 134
3 E 49
I can't rename the duplicate variable "B" before transposing, as the location of B is dynamic: there might be 2, 3 or more variables before "B". Finding the location of "B" each time, renaming it and then transposing is a bit of a hassle.
Can anyone please help me to resolve the problem?
Thanks a lot!
How about something like,
data.frame(df1$A, stack(df1[,-1]))
# df1.A values ind
#1 1 144 B
#2 2 102 B
#3 3 128 B
#4 1 119 C
#5 2 44 C
#6 3 81 C
#7 1 120 B.1
#8 2 97 B.1
#9 3 88 B.1
#10 1 52 E
#11 2 40 E
#12 3 39 E
This is not pretty but it works:
data.frame(A=df$A, Vars=rep(names(df)[-1], each=nrow(df)), Values=c(as.matrix(df[-1])))
or:
data.frame(A=df$A, Vars=rep(names(df)[-1], each=nrow(df)), Values=stack(df[-1])[[1]])
Data used:
df <- read.table(header=TRUE, check.names=FALSE, text=
"A B C B E
1 144 119 120 52
2 102 44 97 40
3 128 81 88 39")

Indexing subgroups by sorted positions in R dataframe

I have a dataframe which contains information about several categories, and some associated variables. It is of the form:
ID category sales score
227 A 109 21
131 A 410 24
131 A 509 1
123 B 2 61
545 B 19 5
234 C 439 328
654 C 765 41
What I would like to do is introduce two new columns, salesRank and scoreRank, holding each row's position within its category when the rows are ordered by sales and by score, respectively. I can solve the general case like this:
dF <- dF[order(-dF$sales),]
dF$salesRank<-seq.int(nrow(dF))
but this doesn't account for the categories and so far I've only solved this by breaking up the dataframe. What I want would result in the following:
ID category sales score salesRank scoreRank
227 A 109 21 3 2
131 A 410 24 2 1
131 A 509 1 1 3
123 B 2 61 2 1
545 B 19 5 1 2
234 C 439 328 2 1
654 C 765 41 1 2
Many thanks!
Try:
library(dplyr)
df %>%
group_by(category) %>%
mutate(salesRank = row_number(desc(sales)),
scoreRank = row_number(desc(score)))
Which gives:
#Source: local data frame [7 x 6]
#Groups: category
#
# ID category sales score salesRank scoreRank
#1 227 A 109 21 3 2
#2 131 A 410 24 2 1
#3 131 A 509 1 1 3
#4 123 B 2 61 2 1
#5 545 B 19 5 1 2
#6 234 C 439 328 2 1
#7 654 C 765 41 1 2
From the help:
row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
desc(): transform a vector into a format that will be sorted in descending
order.
As @thelatemail pointed out, for this particular dataset you might want to use min_rank() instead of row_number(), which handles ties in sales/score more appropriately:
> row_number(c(1,2,2,4))
#[1] 1 2 3 4
> min_rank(c(1,2,2,4))
#[1] 1 2 2 4
Use ave in base R with rank (the - is to reverse the rankings from low-to-high to high-to-low):
dF$salesRank <- with(dF, ave(-sales, category, FUN=rank) )
#[1] 3 2 1 2 1 2 1
dF$scoreRank <- with(dF, ave(-score, category, FUN=rank) )
#[1] 2 1 3 1 2 1 2
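Note that rank() defaults to ties.method = "average", so tied values would receive fractional ranks. A minimal sketch with the question's data, using ties.method = "min" to mirror min_rank(); the helper name rank_desc is hypothetical.

```r
# Data retyped from the question.
dF <- data.frame(
  ID = c(227, 131, 131, 123, 545, 234, 654),
  category = c("A", "A", "A", "B", "B", "C", "C"),
  sales = c(109, 410, 509, 2, 19, 439, 765),
  score = c(21, 24, 1, 61, 5, 328, 41)
)

# Descending rank; tied values share the smallest rank of the tie.
rank_desc <- function(x) rank(-x, ties.method = "min")

# ave() applies the function within each category and keeps the row order.
dF$salesRank <- ave(dF$sales, dF$category, FUN = rank_desc)
dF$scoreRank <- ave(dF$score, dF$category, FUN = rank_desc)
dF
```

Because ave() returns its result in the original row order, the data frame does not need to be sorted by category first.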
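I have just a base R solution with tapply. It uses rank on the negated values (order would return the sorting permutation rather than the ranks) and assumes the rows are already sorted by category:
salesRank <- tapply(dat$sales, dat$category, function(x) rank(-x, ties.method = "first"))
scoreRank <- tapply(dat$score, dat$category, function(x) rank(-x, ties.method = "first"))
cbind(dat, salesRank = unlist(salesRank), scoreRank = unlist(scoreRank))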
ID category sales score salesRank scoreRank
A1 227 A 109 21 3 2
A2 131 A 410 24 2 1
A3 131 A 509 1 1 3
B1 123 B 2 61 2 1
B2 545 B 19 5 1 2
C1 234 C 439 328 2 1
C2 654 C 765 41 1 2
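A caution when ranking with order(): it returns the permutation that sorts the vector, which is not the same as each element's rank. The two agree on the question's data but differ in general, as this small sketch with made-up values shows. Applying order() a second time recovers the ranks.

```r
x <- c(21, 1, 24)  # illustrative values, not from the question

# Positions of the largest, second largest, ... element.
ord <- order(x, decreasing = TRUE)

# Descending rank of each element, in the original positions.
rnk <- rank(-x, ties.method = "first")

ord         # 3 1 2
rnk         # 2 3 1
order(ord)  # 2 3 1, the same as rnk
```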

Resources