Normalizing matched data in R

Consider:
Pair X
1 2
1 3
2 1
2 2
If I have a dataframe that is structured like the example above, how do I normalize column X given the Pair column?
In other words, the two elements should be translated to 0.4 (2/(2+3)) and 0.6 (3/(2+3)) for Pair 1, and to 0.333 (1/(1+2)) and 0.667 (2/(1+2)) for Pair 2.

We can use base R: rowsum() computes the per-group sums of X, and indexing the result with [Pair] expands those sums back to one value per row:
df$norm <- with(df, X/rowsum(X, Pair)[Pair])
df$norm
#[1] 0.4000000 0.6000000 0.3333333 0.6666667
data
df <- structure(list(Pair = c(1L, 1L, 2L, 2L), X = c(2L, 3L, 1L, 2L
)), class = "data.frame", row.names = c(NA, -4L))

You can also do this in base R with ave(), which returns each group's sum aligned to the original rows:
df$norm <- with(df, X/ave(X, Pair, FUN = sum))
df
# Pair X norm
#1 1 2 0.40
#2 1 3 0.60
#3 2 1 0.33
#4 2 2 0.67
Or with dplyr:
library(dplyr)
df %>% group_by(Pair) %>% mutate(norm = X/sum(X))
and data.table:
library(data.table)
setDT(df)[, norm := X/sum(X), Pair]
data
df <- structure(list(Pair = c(1L, 1L, 2L, 2L), X = c(2L, 3L, 1L, 2L
)), class = "data.frame", row.names = c(NA, -4L))
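All three approaches give the same norm column. As a quick sanity check (a minimal sketch, assuming the base R df with the norm column computed above), the normalized values should sum to 1 within each Pair:
# per-group sums of the normalized column should all equal 1
with(df, tapply(norm, Pair, sum))
# 1 2
# 1 1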


How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I cannot figure out how to calculate the sum of the absolute differences between all adjacent columns:
sum_diff = |A - B| + |B - C| + |C - D| + ...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
In base R, we can subtract the data frame without its first column from the data frame without its last column, then take rowSums of the absolute values. This should be very efficient compared to a package solution:
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
Output:
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats:
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
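For comparison, here is a plain base R sketch (an addition, not from the original answers) that makes the adjacent-differences idea explicit by applying diff() to each row; it is simpler but slower than the vectorized rowSums call above:
# diff(x) returns the differences between adjacent elements of each row
apply(df1[c("A", "B", "C", "D")], 1, function(x) sum(abs(diff(x))))
# [1] 3 4 2 5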
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column pairs to compare; I tried it with combn, but that yields all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling after that:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~ {
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Here is a dplyr version of akrun's elegant approach that calculates the difference between the data frame and its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1)) -
                                identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time each row is used as a vector from which its shifted self is subtracted:
df %>%
  rowwise() %>%
  mutate(sum_diff = map2_int(c_across(1:last_col(1)),
                             c_across(2:last_col()),
                             ~ abs(.x - .y)) %>% sum())

Merge (make mean) columns with partially matched header name

I have a data which look like:
AAA_1 AAA_2 AAA_3 BBB_1 BBB_2 BBB_3 CCC
1 1 1 1 2 2 2 1
2 3 1 4 0 0 0 0
3 5 3 0 1 1 1 1
For each row, I want to take the mean of the columns that share a common feature, as follows:
feature <- c("AAA","BBB","CCC")
the desired output should look like:
AAA BBB CCC
1 1 2 1
2 2.67 0 0
3 2.67 1 1
For each pattern separately I was able to do it:
data <- read.table("data.txt", header = TRUE, row.names = 1)
AAA <- as.matrix(rowMeans(data[, grepl("AAA", names(data))]))
But I do not know how to do the partial matching for all the different patterns at once.
I also tried some other things, like:
for (i in 1:length(feature)) {
  feature[i] <- as.matrix(rowMeans(data[, grepl(feature[i], names(data))]))
}
Assuming your colnames are always structured as shown in your example, you can split the names and aggregate.
new_names <- unlist(strsplit(names(df),"\\_.*"))
colnames(df) <- new_names
#Testing with your data, we need to prevent the loss of dimension by using drop = FALSE
sapply(unique(new_names), function(i) rowMeans(df[, new_names==i, drop = FALSE]))
# AAA BBB CCC
#[1,] 1.000000 2 1
#[2,] 2.666667 0 0
#[3,] 2.666667 1 1
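An equivalent base R sketch uses split.default(), which splits the data frame column-wise by the cleaned names, and then averages each group:
# split.default() treats the data frame as a list of columns,
# so each element holds the columns sharing one prefix
sapply(split.default(df, new_names), rowMeans)
# same matrix as above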
Data:
df <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
Here is another option for you. Given your column pattern, I chose to use gsub() to get the first three letters. With ind, which contains AAA, BBB, and CCC, I used lapply() to subset the data for each element of ind, calculate the row means, and keep only the row-mean column. Then I used bind_cols() to create foo. The last step was to assign column names to foo.
library(dplyr)
ind <- unique(gsub("_\\d+$", "", names(mydf)))
lapply(ind, function(x) {
  select(mydf, contains(x)) %>%
    transmute(out = rowMeans(.))
}) %>%
  bind_cols() -> foo
names(foo) <- ind
# AAA BBB CCC
# (dbl) (dbl) (dbl)
#1 1.000000 2 1
#2 2.666667 0 0
#3 2.666667 1 1
DATA
mydf <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
A dplyr/tidyr alternative reshapes to long format, averages by the cleaned-up column name, and spreads back to wide:
library(dplyr)
library(tidyr)
data %>%
  add_rownames() %>%
  gather("variable", "value", -rowname) %>%
  mutate(variable = gsub("_.*$", "", variable)) %>%
  group_by(rowname, variable) %>%
  summarise(mean = mean(value)) %>%
  spread(variable, mean)
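With more recent tidyr and dplyr (assuming versions 1.0+, where pivot_longer()/pivot_wider() replace gather()/spread()), the same reshape can be sketched as:
library(dplyr)
library(tidyr)
data %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "variable") %>%
  mutate(variable = gsub("_.*$", "", variable)) %>%
  group_by(rowname, variable) %>%
  summarise(mean = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = variable, values_from = mean)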

Merge data frame using non-unique probabilistic key

The goal is to merge df2 into df1, where the key values in df2 are not unique but come in groups in which each code has a probability. A simple example:
df1
# key
#1 A
#2 B
#3 C
#4 C
#5 A
#6 A
#7 D
df2
# key code prob
#1 A 1 0.75
#2 A 2 0.25
#3 B 1 0.95
#4 B 2 0.05
#5 C 1 0.20
#6 C 2 0.25
#7 C 3 0.55
#8 D 1 0.33
#9 D 2 0.33
#10 D 3 0.33
The expected result would be something like the following where code has been assigned based upon the probabilities in df2:
# key code
#1 A 1
#2 B 1
#3 C 3
#4 C 3
#5 A 2
#6 A 1
#7 D 2
Data:
df1 <- structure(list(key = structure(c(1L, 2L, 3L, 3L, 1L, 1L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor")), .Names = "key", class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(key = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L), prob = c(0.75,
0.25, 0.95, 0.05, 0.2, 0.25, 0.55, 0.33, 0.33, 0.33)), .Names = c("key",
"code", "prob"), class = "data.frame", row.names = c(NA, -10L
))
Using apply: for each row in df1, sample one of the codes available in df2 for that row's key, weighted by prob:
df1$code <- apply(df1, 1, function(x) {
  sample(df2$code[df2$key == x["key"]], 1, prob = df2$prob[df2$key == x["key"]])
})
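Because the assignment is random, results differ between runs; here is a seeded sketch of the same logic with vapply() (an alternative, since apply() coerces each row to character):
set.seed(1)  # fix the RNG so the sampled codes are reproducible
df1$code <- vapply(as.character(df1$key), function(k) {
  idx <- df2$key == k
  sample(df2$code[idx], 1, prob = df2$prob[idx])
}, integer(1))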
I'm pretty sure you just want the following. Note that it samples one code per key and then reuses that code for every matching row of df1, rather than sampling independently for each row:
library(dplyr)
df2 %>%
  group_by(key) %>%
  sample_n(1, weight = prob) %>%
  right_join(df1)
I think this is what you want.
library(dplyr)
df1$id <- seq(nrow(df1))
df3 <- merge(df1, df2, by = "key", all.x = TRUE)
df3 %>% group_by(id) %>% sample_n(1, weight = prob)
I generated an id variable for df1 and merged df1 with all possible codes in df2. Then dplyr::sample_n performs a weighted sampling within each id. A typical outcome will be:
Source: local data frame [7 x 4]
Groups: id
key id code prob
1 A 1 1 0.75
2 B 2 1 0.95
3 C 3 3 0.55
4 C 4 3 0.55
5 A 5 1 0.75
6 A 6 1 0.75
7 D 7 1 0.33
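Whichever variant you use, you can check that the weighted sampling behaves as intended by repeating it many times; a sketch for key A, whose codes should come up in roughly a 0.75/0.25 split:
set.seed(123)
idx <- df2$key == "A"
draws <- replicate(10000, sample(df2$code[idx], 1, prob = df2$prob[idx]))
prop.table(table(draws))
# draws
#     1     2
# ~0.75 ~0.25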

R: Unique count by first occurrence of grouping variable

I would like to create a new variable "Count" that is a count of the unique values of a factor "Period", by grouping variable "ID". The following data includes a column with the values I would want in "Count":
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
I tried to use mutate with Count = 1:length(Period), but it creates a running count within each value of "Period", whereas I want a cumulative count of only the unique values. This is what I tried:
library(plyr)
samp1<-ddply(samp, .(ID, Period), mutate, Count = 1:length(Period))
Could anyone provide the correct function to use?
Edit - new answer
Now that I think about it some more, my initial approach won't return correct results if each group's elements aren't stored contiguously. For example, for
v <- c(1, 3, 2, 2, 1, 2)
my function will put the non-consecutive 1s and 2s in different groups:
myrleid(v)
## [1] 1 2 3 3 4 5
Thus, the best approach seems to be
match(v, unique(v))
## [1] 1 2 3 3 1 3
which will both preserve the order of appearance and keep non-adjacent occurrences in the same group.
Thus, I would recommend just doing
library(data.table)
setDT(df)[, Count2 := match(Period, unique(Period)), by = ID]
or (with base R)
with(df, ave(Period, ID, FUN = function(x) match(x, unique(x))))
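Either call reproduces the Count column from the question for the sample data:
## [1] 1 1 2 3 1 2 3 3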
Old answer
Looks like a good candidate for the rleid function from the data.table devel version on GH
### Devel version installation instructions
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table) # v 1.9.5+
setDT(df)[, Count2 := rleid(Period), by = ID]
df
# ID Period Count Count2
# 1: a 1.1 1 1
# 2: a 1.1 1 1
# 3: a 1.2 2 2
# 4: a 1.3 3 3
# 5: b 1.2 1 1
# 6: b 1.3 2 2
# 7: b 1.5 3 3
# 8: b 1.5 3 3
Or, if you don't want to load external packages, we could define this function ourselves:
myrleid <- function(x) {
  temp <- rle(x)$lengths
  rep.int(seq_along(temp), temp)
}
with(df, ave(Period, ID, FUN = myrleid))
## [1] 1 1 2 3 1 2 3 3
Or, if the groups are in increasing order, you could try ranking them instead:
library(data.table) ## V1.9.5+
setDT(df)[, Count2 := frank(Period, ties.method = "dense"), by = ID]
Or
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Count2 = dense_rank(Period))
samp <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
library(dplyr)
select(samp, -Count) %>%
  arrange(ID, Period) %>%
  group_by(ID) %>%
  mutate(dup = !duplicated(Period),
         Count = cumsum(dup))
The key steps are to arrange by ID and Period, and then to flag the first occurrence of each Period as "not duplicated".
A solution in base R with transform():
transform(df, Count2 = unlist(
  tapply(df$Period, df$ID, function(x) as.numeric(factor(x)))
))
ID Period Count Count2
a1 a 1.1 1 1
a2 a 1.1 1 1
a3 a 1.2 2 2
a4 a 1.3 3 3
b1 b 1.2 1 1
b2 b 1.3 2 2
b3 b 1.5 3 3
b4 b 1.5 3 3
As David suggested, this solution does not work well if Period is not monotonically increasing within each ID.

Finding the max number of occurrences from the available result

I have a dataframe which looks like this:
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to count how many 1s and 2s there are for each Id and then select the value whose frequency of occurrence is greatest:
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in a few ways but have not been able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN=function(x){which.max(c(sum(x==1), sum(x==2)))})
res
Result:
Id Result
1 A 1
2 B 2
3 C 1
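The which.max(c(sum(x==1), sum(x==2))) trick hardcodes the two possible values. If Result can take other values, a more general sketch picks the most frequent value via table(), with ties going to the first value in the table:
res <- aggregate(Result ~ Id, df,
                 FUN = function(x) as.integer(names(which.max(table(x)))))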
With data.table you can try (df is your data.frame):
require(data.table)
dt<-as.data.table(df)
dt[,list(times=.N),by=list(Id,Result)][,list(Result=Result[which.max(times)]),by=Id]
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try
library(dplyr)
df %>%
  group_by(Id, Result) %>%
  summarize(n = n()) %>%
  group_by(Id) %>%
  filter(n == max(n)) %>%
  summarize(Result = Result)
Id Result
1 A 1
2 B 2
3 C 1
An option using table and ave:
subset(as.data.frame(table(df1)), ave(Freq, Id, FUN = max) == Freq, select = -3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2
