I need to aggregate a number between two factors, but I need the output of the aggregation to be a vector the same length as the original data frame, rather than a summary table, so I can attach it and eventually output it as an .xlsx report.
data <- data.frame(A = c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","C","C","C","C","C","C"),
B = c(1,1,2,2,2,3,1,1,1,1,2,2,2,3,3,1,1,1,1,2),
X=c(0.17,0.15,0.30,0.36,0.47,0.43,0.50,0.38,0.38,0.47,0.40,0.29,0.46,0.14,0.03,0.34,0.42,0.35,0.19,0.27))
I need to sum X grouped by A alone, and by each unique combination of A and B, and append both sums to the data frame as new columns.
I'm aware of the aggregate function, which calculates the quantities I need but outputs them in a summary table format which I can't then append to the data frame.
So far this is the only method I've come up with: it takes 10 minutes to run on my actual 13,000-row data frame, it seems very hacky, and it also seems to be causing some other bugs that I'm hoping redoing this bit will solve.
TBL <- as.data.frame(table(data$A, data$B))
colnames(TBL) <- c("A", "B", "Freq")
# TBL contains every unique combination of A and B
for (i in 1:NROW(TBL)) {
  INDEX <- which(data$A == TBL$A[i] & data$B == TBL$B[i])
  data$`X by AB`[INDEX] <- sum(data$X[INDEX])
}
It seems you need to group by A alone, and by A and B together, and take the sum of X in each case. With dplyr, we can chain two group_by statements with mutate:
library(dplyr)
data %>%
  group_by(A, B) %>%
  mutate(XbyAB = sum(X)) %>%
  group_by(A) %>%
  mutate(XbyA = sum(X))
# A tibble: 20 x 5
# Groups:   A [3]
#    A         B     X XbyAB  XbyA
#    <fct> <dbl> <dbl> <dbl> <dbl>
# 1 A         1  0.17  0.32  1.88
# 2 A         1  0.15  0.32  1.88
# 3 A         2  0.3   1.13  1.88
# 4 A         2  0.36  1.13  1.88
# 5 A         2  0.47  1.13  1.88
# 6 A         3  0.43  0.43  1.88
# 7 B         1  0.5   1.73  3.02
# 8 B         1  0.38  1.73  3.02
# 9 B         1  0.38  1.73  3.02
#10 B         1  0.47  1.73  3.02
# ... with 10 more rows
Or in base R, two ave calls inside transform:
transform(data, XbyAB = ave(X, A, B, FUN = sum), XbyA = ave(X, A, FUN = sum))
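For the example data, the first six rows of the result should look like this (the sums follow from the grouped totals above):
head(transform(data, XbyAB = ave(X, A, B, FUN = sum),
               XbyA  = ave(X, A, FUN = sum)))
#  A B    X XbyAB XbyA
#1 A 1 0.17  0.32 1.88
#2 A 1 0.15  0.32 1.88
#3 A 2 0.30  1.13 1.88
#4 A 2 0.36  1.13 1.88
#5 A 2 0.47  1.13 1.88
#6 A 3 0.43  0.43 1.88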
A data.table solution:
library("data.table")
data <- as.data.table(data)
First, let's sum X by A:
data[, .( `X by A`=sum(X) ), by=A]
#    A X by A
# 1: A   1.88
# 2: B   3.02
# 3: C   1.60
We merge this summary data.table with the original one on column A:
data[data[, .( `X by A`=sum(X) ), by=A], on=.(A)]
We can also summarize and then merge on two columns:
data[data[, .( `X by AB`=sum(X) ), by=.(A, B)], on=.(A, B)]
The problem is that, to the uninitiated, data.table syntax isn't very readable, but I swear by its speed (compared to dplyr and especially base data.frame operations), although the difference shouldn't be very noticeable with only 13K rows.
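As an aside, data.table can also skip the merge entirely and assign the grouped sums back by reference with :=; a minimal sketch, equivalent to the two merges above:
data[, `X by A`  := sum(X), by = A]        # grouped sum, assigned by reference
data[, `X by AB` := sum(X), by = .(A, B)]  # likewise for each A/B combination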
I have a column of IDs which contains duplicates. What I want to do is create a new column which, in each row, divides 1 by the number of occurrences of that row's ID in the column.
So for example in row 2 the ID is 101, and 101 appears three times in the column. Therefore, in a new column, I want row 2 to contain 1 divided by the number of occurrences of the 101 ID (1/3), which gives a value of 0.33. How would I go about this? Sorry if this doesn't make sense, happy to clarify.
Thanks so much in advance!
We can use ave:
# create data
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
# id
# 1 3
# 2 3
# 3 3
# 4 2
# 5 3
# 6 2
# 7 2
# 8 2
# 9 3
# 10 1
# use ave to count instances of id
1/ave(dat$id, dat$id, FUN = length)
# 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00
If your id variable is a character, we can use the seq_along function within our ave call:
dat$id <- as.character(dat$id)
1/ave(seq_along(dat$id), dat$id, FUN = length)
# [1] 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00
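If you are already in the tidyverse, dplyr's add_count() attaches the per-group count as a column n, which gives the same values; a sketch using the dat from above:
library(dplyr)
dat %>%
  add_count(id) %>%        # adds n, the number of rows sharing each id
  mutate(value = 1 / n)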
In the following code, I assume the data frame to be named df.
require("tidyverse")
df <- table(df$ID) %>%
  enframe() %>%
  mutate(value = 1 / value) %>%
  left_join(df, ., by = c("ID" = "name"))
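Note that enframe() returns its name column as character, so this join assumes df$ID is a character column; convert it with as.character() first if it is numeric.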
Using data.table
library(data.table)
setDT(dat)[, value := 1/.N, id]
Data:
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
In R I'm trying to multiply selected columns in a data frame (df1) by the matching columns in a second data frame (df2). The number of rows is unequal (df2 has just one row), and there are columns that I wish to retain but do not wish to multiply. An example follows:
df1 <- data.frame(group = c('A','B','C','A'), var1 = c(1,0,1,0), var2 = c(0,1,1,0))
df2 <- data.frame(var1 = 0.06, var2 = 0.04)
The expected result would be:
group var1 var2
A     0.06 0
B     0    0.04
C     0.06 0.04
A     0    0
I'm happy to adjust the format of df2 if required. A tidyverse solution would be great.
I've read several other questions attempting to do something similar, but I can't get them to work in my situation, e.g.
data.frame(Map(function(x, y) if (all(is.numeric(x), is.numeric(y))) x * y
               else x, df1, df2))
# multiplies by position, not by name
Thanks in advance.
We can make the lengths the same with rep or col and then do the multiplication:
nm1 <- intersect(names(df1), names(df2))
df1[nm1] <- df1[nm1] * unlist(df2[nm1])[col(df1[nm1])]
df1
# group var1 var2
#1 A 0.06 0.00
#2 B 0.00 0.04
#3 C 0.06 0.04
#4 A 0.00 0.00
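To see why the col() indexing works: col() returns the column index of every cell, so it picks the matching element of unlist(df2[nm1]) for each cell of df1[nm1]. A small illustration:
col(matrix(0, nrow = 4, ncol = 2))
#     [,1] [,2]
#[1,]    1    2
#[2,]    1    2
#[3,]    1    2
#[4,]    1    2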
Or using Map
df1[nm1] <- Map(`*`, df1[nm1], df2[nm1])
Or using tidyverse
library(dplyr)
library(purrr)
map2_dfc(select(df1, nm1), select(df2, nm1), `*`) %>%
bind_cols(select(df1, -one_of(nm1)), .)
# group var1 var2
#1 A 0.06 0.00
#2 B 0.00 0.04
#3 C 0.06 0.04
#4 A 0.00 0.00
We can use sweep in base R after getting the common column names from both data frames:
cols <- intersect(colnames(df1), colnames(df2))
df1[cols] <- sweep(df1[cols], 2, unlist(df2[cols]), `*`)
df1
# group var1 var2
#1 A 0.06 0.00
#2 B 0.00 0.04
#3 C 0.06 0.04
#4 A 0.00 0.00
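Here sweep() with MARGIN = 2 applies `*` between each column and the corresponding element of the stats vector; a minimal illustration:
sweep(matrix(1, nrow = 2, ncol = 2), 2, c(10, 100), `*`)
#     [,1] [,2]
#[1,]   10  100
#[2,]   10  100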
I have a data frame with multiple columns as follows:
Frequency Alels
0.5 C
0.6 C,G
0.02 A,T,TTT
And I want to split the values of the second column into separate rows, with the new rows getting Frequency = 0.
I'm trying with separate() from the tidyr package, but I can't change the Frequency column in the new rows; I get the following result:
Frequency Alels
0.5 C
0.6 C
0.6 G
0.02 A
0.02 T
0.02 TTT
But I want the output as follows:
Frequency Alels
0.5 C
0.6 C
0 G
0.02 A
0 T
0 TTT
This should work:
d <- read.table(text = "Frecuency Alels
0.5 C
0.6 C,G",
header = T, stringsAsFactors = F)
counts <- sapply(strsplit(d$Alels, split = ","), length)
data.frame(
  Frecuency = unlist(lapply(seq_along(d$Frecuency),
                            function(x) c(d$Frecuency[x], rep(0, counts[x] - 1)))),
  Alels = unlist(strsplit(d$Alels, split = ","))
)
Not pretty, but I think it works.
# Create data frame
df <- data.frame(frequency = c(0.5, 0.6),
alels = c("C", "C, G, T"),
stringsAsFactors = FALSE)
# Duplicate the alels column, separate rows
# Requires magrittr, dplyr, tidyr
library(magrittr)
library(dplyr)
library(tidyr)
df %<>%
  mutate(alels_check = alels) %>%
  separate_rows(alels, sep = ",", convert = TRUE)
# Check for duplicated (frequency, alels_check) pairs and set them to zero
df[duplicated(df[c("frequency", "alels_check")]), ]$frequency <- 0
# Remove the duplicated alels column
df %<>% select(-alels_check)
Original:
# frequency alels
# 1 0.5 C
# 2 0.6 C, G, T
Result:
# frequency alels
# 1 0.5 C
# 2 0.6 C
# 3 0.0 G
# 4 0.0 T
Using your data:
# frequency alels
# 1 0.50 C
# 2 0.60 C, G
# 3 0.02 A, T, TTT
# frequency alels
# 1 0.50 C
# 2 0.60 C
# 3 0.00 G
# 4 0.02 A
# 5 0.00 T
# 6 0.00 TTT
The data from your example:
df <- read.table(text = " Frequency Alels
0.5 C
0.6 C,G
0.02 A,T,TTT",
header = T, stringsAsFactors = F)
And another solution for you to consider:
library(dplyr)
df <- lapply(1:nrow(df),
             function(row_num) {
               s <- strsplit(df$Alels[row_num], ",") %>% unlist
               data.frame(Frequency = c(df$Frequency[row_num],
                                        rep(0, length(s) - 1)),
                          Alels = s)
             }) %>%
  do.call(rbind, .)
df
Instead of do.call(rbind, .) you can also use rbindlist() from the data.table package.
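A sketch of that variant, with rbindlist() doing the row-binding (same loop as above):
library(data.table)
df <- rbindlist(lapply(1:nrow(df),
                       function(row_num) {
                         s <- strsplit(df$Alels[row_num], ",") %>% unlist
                         data.frame(Frequency = c(df$Frequency[row_num],
                                                  rep(0, length(s) - 1)),
                                    Alels = s)
                       }))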
I have a data frame whose values are ordered from smallest to largest. I want to compute the differences between adjacent rows, combine rows whose difference is small (e.g., smaller than 1), and return the averaged values of the combined rows. I could check each row's difference with a for loop, but that seems very inefficient. Any better ideas? Thanks.
library(dplyr)
DF <- data.frame(ID=letters[1:12],
Values=c(1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14))
DF <- DF %>%
  mutate(Diff = c(0, diff(Values)))
The expected output of DF would be
ID Values
a 1.0
b/c 2.6 # (2.2+3.0)/2
d 5.0
e/f/g 6.67 # (6.2+6.8+7.0)/3
h 8.5
i 10.0
j/k 12.6 # (12.2+13.0)/2
l 14.0
Here is an option with data.table
library(data.table)
setDT(DF)[, .(ID = toString(ID), Values = round(mean(Values), 2)),
by = .(Diff = cumsum(c(TRUE, diff(Values)>=1)))][, -1, with = FALSE]
# ID Values
#1: a 1.00
#2: b, c 2.60
#3: d 5.00
#4: e, f, g 6.67
#5: h 8.50
#6: i 10.00
#7: j, k 12.60
#8: l 14.00
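The grouping trick here: diff() flags steps of at least 1, and the cumulative sum turns those flags into group ids. A small illustration on the first four Values:
Values <- c(1, 2.2, 3, 5)
diff(Values)                        # 1.2 0.8 2.0
cumsum(c(TRUE, diff(Values) >= 1))  # 1 2 2 3 -> rows 2 and 3 share a group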
Calculate the difference between the Values of adjacent rows and check whether it is >= 1. The cumulative sum of those flags yields distinct groups, which you can then summarise to get the desired result.
library(dplyr)
DF %>%
  arrange(Values) %>%
  group_by(Diff = cumsum(c(1, diff(Values)) >= 1)) %>%
  summarise(ID = paste0(ID, collapse = "/"), Values = mean(Values)) %>%
  ungroup() %>%
  select(-Diff)
# # A tibble: 8 x 2
# ID Values
# <chr> <dbl>
# 1 a 1.00
# 2 b/c 2.60
# 3 d 5.00
# 4 e/f/g 6.67
# 5 h 8.50
# 6 i 10.0
# 7 j/k 12.6
# 8 l 14.0
library(magrittr)
df <- DF[order(DF$Values),]
df$Values %>%
  # Find candidate groups: which values fall within [value, value + 1)
  outer(., ., function(x, y) x >= y & x < y + 1) %>%
  # Remove sub-groups
  `[<-`(apply(., 1, cumsum) > 1, FALSE) %>%
  # Remove sub-group columns
  .[, colSums(.) > 0] %>%
  # Select these groups from the data
  apply(2, function(x) data.frame(ID = paste(df$ID[x], collapse = '/'),
                                  Values = mean(df$Values[x]))) %>%
  # Bind results by row
  do.call(what = rbind)
# ID Values
# 1 a 1.000000
# 2 b/c 2.600000
# 4 d 5.000000
# 5 e/f/g 6.666667
# 8 h 8.500000
# 9 i 10.000000
# 10 j/k 12.600000
# 12 l 14.000000
Note:
This method differs from those using diff because it groups rows together only if all Values lie within 1 of each other.
Example:
Change the dataset so that the Value at ID g is 7.3.
Above method: IDs e, f, and g are no longer grouped together, because the value at ID e is 6.2 and 7.3 - 6.2 > 1.
Diff method: IDs e, f, and g are still grouped together, because the diff between e and f is < 1 and the diff between f and g is < 1.
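A quick check of the distinction with those three values:
Values <- c(6.2, 6.8, 7.3)
diff(Values) >= 1               # FALSE FALSE: the diff method keeps one group
max(Values) - min(Values) > 1   # TRUE: not all within 1, so the outer method splits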
I have a data frame df1 which contains 6 columns, two of which (var1 & var3) I am using to split df1 by, resulting in a list of data frames, ls1.
For each sub dataframe in ls1 I want to sample() x$var2, x$num times with x$probs probabilities as follows:
Create data:
var1 <- rep(LETTERS[seq( from = 1, to = 3 )], each = 6)
var2 <- rep(LETTERS[seq( from = 1, to = 3 )], 6)
var3 <- rep(1:2,3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))
Have a look at the first couple of list elements:
$A.1
var1 var2 var3 num probs
1 A A 1 10 0.06
2 A B 1 10 0.27
3 A C 1 10 0.23
$B.1
var1 var2 var3 num probs
7 B A 1 13 0.93
8 B B 1 13 0.36
9 B C 1 13 0.04
lapply over ls1:
ls1 <- lapply(ls1, function(x) {
  res <- table(sample(x$var2, size = as.numeric(as.character(x$num)),
                      replace = TRUE,
                      prob = as.numeric(as.character(x$probs))))
  res <- as.data.frame(res)
  cbind(x, res = res$Freq)
})
df2 <- do.call("rbind", ls1)
df2
Have a look at the first couple of list elements of the result:
$A.1
var1 var2 var3 num probs res
1 A A 1 10 0.06 2
2 A B 1 10 0.27 4
3 A C 1 10 0.23 4
$B.1
var1 var2 var3 num probs res
7 B A 1 13 0.93 10
8 B B 1 13 0.36 3
9 B C 1 13 0.04 0
So for each dataframe a new variable res is created, the sum of res equals num and the elements of var2 are represented in res in proportions relating to probs. This does what I want but it becomes very slow when there is a lot of data.
My Question: is there a way to replace the lapply piece of code with something more efficient/faster?
I am just beginning to learn about vectorization and am guessing this could be vectorized, but I am unsure how to achieve it.
ls1 is eventually converted back to a data frame, so if it doesn't need to become a list in the first place, all the better (although it doesn't really matter how the data is structured for this step).
Any help would be much appreciated.
First, you should create df1 using data.frame() rather than converting from a matrix, because the matrix forces all data types to be the same even though you have both numeric and character variables.
df1 <- data.frame(var1, var2, var3, num, probs)
Next, instead of using the sample function, the rmultinom function is much more efficient because it directly outputs the number of draws for each value in x$var2:
# assumes ls1 was re-split from the numeric df1 above
ls1 <- lapply(ls1, function(x) {
  # c() flattens the one-column matrix that rmultinom() returns
  x$res <- c(rmultinom(1, x$num[1], x$probs))
  x
})
This should be noticeably faster than using the sample approach.
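For reference, rmultinom(1, size, prob) returns a one-column integer matrix with one row per probability (prob is normalized internally, so it need not sum to 1), and the column always sums to size:
draws <- rmultinom(1, size = 10, prob = c(0.06, 0.27, 0.23))
dim(draws)   # 3 1: one row per probability, one column per draw
sum(draws)   # always 10, by construction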
Rather than splitting your data frame into groups, I would use the {dplyr} package with a group_by + mutate:
library(dplyr)
df1 %>%
  # factor columns must go through character before numeric conversion
  mutate_at(vars(num, probs), ~ as.numeric(as.character(.))) %>%
  group_by(var1, var3) %>%
  mutate(res = c(rmultinom(1, num[1], probs)))
This should be fast and you can keep the original data structure.
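The c() around rmultinom() is needed because rmultinom() returns a one-column matrix; c() flattens it into the plain vector that mutate() expects.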