Mean by factor level for last three rows - r

I have a dataframe, which looks like this (but has more factor levels and values)
ID <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
Value <- rep(1:5)
test <- cbind.data.frame(ID, Value)
I would like to calculate the mean of the first 3 and last 3 values (rows) of each factor level.
For the first 3 values I used ddply:
library(plyr)
mean_start <- ddply(test, .(ID), summarise, mean_start = mean(Value[1:3]))
This works great. But how can I refer to the last 3 rows, given that each factor level has a different amount of rows?

Using headand tail:
library(plyr)
(means <- ddply(test, .(ID), summarise, mean_start = mean(head(Value, 3)), mean_end = mean(tail(Value, 3))))
# ID mean_start mean_end
# 1 A 2.000000 4
# 2 B 2.000000 3
# 3 C 2.666667 4

Related

How do I compare the order of two lists in R?

Hi I'm trying to determine the change in ordering of two lists in R.
e.g. Comparing rankings of tennis players from two different months.
Feb <- c("A", "B", "C", "D")
Mar <- c("D", "B", "C", "A")
orderChange(Feb, Mar)
I would like to get a result which shows the difference in ordering/ranking.
(-3, 0, 0, 3)
I've tried which() but that only tells me whether an element is present and doesn't compare the ordering.
which(Mar %in% Feb)
[1] 1 2 3 4
You can use seq_along and subtract match.
Feb <- c("A", "B", "C", "D")
Mar <- c("D", "B", "C", "A")
Apr <- c("C", "B", "D", "A")
seq_along(Feb) - match(Feb, Mar)
#[1] -3 0 0 3
seq_along(Feb) - match(Feb, Apr)
#[1] -3 0 2 1
and can pack this in a function if needed.
orderChange <- function(x, y) seq_along(x) - match(x, y)

Matching rows to columns and counting same occurences R

I have a dataset which is of the following form:-
a <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B")
)
And I have another set of the following form:-
b <- data.frame(col_1=c("ASD", "ASD", "BSD", "BSD"),
col_2=c(1, 1, 1, 1),
col_3=c(12, 12, 31, 21),
col_4=("A", "B", "B", "A")
)
What I want to do is to take the column col_4 from set b and match row wise in set a, so that it tell me which row has how many elements from col_4 in a new column. The name of the new column does not matters.
For ex:- The first and fifth row in set a has all the elements of col_4 from set b.
Also, duplicates shouldn't be found. For ex. sixth row in set a has 3 "B"s. But since col_4 from set b has only two "B"s, it should tell me 2 and not 3.
Expected output is of the form:-
c <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B"),
found=c(4, 1, 2, 2, 4, 2)
)
We can use vecsets::vintersect which takes care of duplicates.
Using apply row-wise we can count how many common values are there between b$col4 and each row in a.
apply(a, 1, function(x) length(vecsets::vintersect(b$col_4, x)))
#[1] 4 1 2 2 4 2
An option using data.table:
library(data.table)
#convert a into a long format
m <- melt(setDT(a)[, rn:=.I], id.vars="rn", value.name="col_4")
#order by row number and create an index for identical occurrences in col_4
setorder(m, rn, col_4)[, vidx := rowid(col_4), rn]
#create a similar index for b
setDT(b, key="col_4")[, vidx := rowid(col_4)]
#count occurrences and lookup this count into original data
a[b[m, on=.(col_4, vidx), nomatch=0L][, .N, rn], on=.(rn), found := N]
output:
X1 X2 X3 X4 X5 rn found
1: A B B E A 1 4
2: B C E C C 2 1
3: C C A A C 3 2
4: A A A A A 4 2
5: B A A A B 5 4
6: C B B C B 6 2
Another idea to operate on sets efficiently is to count and compare the element occurences of b$col_4 in each row of a:
b1 = c(table(b$col_4))
#b1
#A B
#2 2
a1 = table(factor(as.matrix(a), names(b1)), row(a))
#a1
#
# 1 2 3 4 5 6
# A 2 0 2 5 3 0
# B 2 1 0 0 2 3
Finally, identify the least amount of occurences per element (for each row) and sum:
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
In case of a larger dimension a "data.frame" and more elements, Matrix::sparseMatrix offers an appropriate alternative:
library(Matrix)
a.fac = factor(as.matrix(a), names(b1))
.i = as.integer(a.fac)
.j = c(row(a))
noNA = !is.na(.i) ## need to remove NAs manually
.i = .i[noNA]
.j = .j[noNA]
a1 = sparseMatrix(i = .i, j = .j, x = 1L, dimnames = list(names(b1), 1:nrow(a)))
a1
#2 x 6 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6
#A 2 . 2 5 3 .
#B 2 1 . . 2 3
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2

Count number of matching paired values in array in R

Rookie question. I'm looking for a simple way in R to count the number of matching pair of values in an array such as
c("A","A","A") # 3 matched pairs
c("A","B","A") # 1 matched pair
c("A","B") # 0 matched pair
etc
Thank you
It seems like you want to find all possible pairs of identical elements, where their order does not matter. Then:
matchPairs <- function(x) sum(choose(table(x), 2))
matchPairs(c("A", "A", "A"))
# [1] 3
matchPairs(c("A", "B", "A"))
# [1] 1
matchPairs(c("A", "B"))
# [1] 0
matchPairs(c("A", "A", "A", "B"))
# [1] 3
matchPairs(c("A", "A", "A", "B", "B"))
# [1] 4
matchPairs(c("A", "A", "A", "B", "B", "A"))
# [1] 7

R: select top products in groups

I need to select 3 top selling products in each category, but if category dose not have 3 products I should add more products from best available category ("a" being the best category, "c" worst).
Every day the products change so I would like to this automatically. Previously I did choose top 3 products and if there was not available I did not bothered, but unfortunately the conditions changed. For that I used code as follows:
Selected <- items %>% group_by(Cat) %>% dplyr:: filter(row_number() < 3) %>% ungroup
Sample data:
items <- data.frame(Cat = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results:
"a", "a", "a", "b", "b", "c", "c", "c", "c"
Sample data - 2:
items <- data.frame(Cat = c("a", "a", "a", "a", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results - 2:
"a", "a", "a", "a", "b", "c", "c", "c", "c"
Here is a possible answer. I'm not entirely sure if I'm getting what you are after - if not let me know.
items <- data.frame(Cat = c("a", "a", "a",
"b", "b",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
First we order the data according from best to worst category and add the count number within category.
Selected <- items %>% group_by(Cat) %>%
mutate(id = row_number()) %>%
ungroup() %>% arrange(Cat)
Then we can make the filter and fill up with remaining rows from best to worst
Selected %>% filter(id<=3) %>% # Select top 3 in each group
bind_rows(Selected %>% filter(id>3)) %>% # Merge with the ones that weren't selected
mutate(id=row_number()) %>%
filter(id <= 3*length(unique(Cat))) # Extract the right number
This produces
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 4 4
5 b 5 5
6 c 6 6
7 c 7 7
8 c 8 8
9 c 9 9
The second data example yields
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 5 4
5 c 6 5
6 c 7 6
7 c 8 7
8 a 4 8
9 c 9 9
which seems to be what you were after.

How to remove duplicate pair-wise columns [duplicate]

This question already has an answer here:
Select equivalent rows [A-B & B-A] [duplicate]
(1 answer)
Closed 7 years ago.
Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove duplicate pair-wise columns so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.
Not sure if this is the simplest way of doing it (transposing can be computationally expensive) but this would work with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
If your column names and order are important, you'll have to re-establish those.
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
Another option would be to reshape (xtabs(n~..)) the dataset ('df') to wide format, set the lower triangular matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2

Resources