We have a data frame with one column for a category and one column for discrete values. We want to get all possible intersections (number of common values) for all combinations of categories.
I came up with the following code. However, is there something shorter out there? I am sure there is a better way of doing this, a specialized function that does exactly this. The code below can be shortened, of course, for example with purrr::map, but that is not my question.
## prepare an example data set
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
cats <- unique(df$category)
n <- length(cats)
## all combinations of 1...n unique elements from category
combinations <- lapply(1:n, function(i) combn(cats, i, simplify=FALSE))
combinations <- unlist(combinations, recursive=FALSE)
names(combinations) <- sapply(combinations, paste0, collapse="")
## for each combination of categories, get the values which belong
## to this category
intersections <- lapply(combinations,
function(co)
lapply(co, function(.x) df$value[ df$category == .x ]))
intersections <- lapply(intersections,
function(.x) Reduce(intersect, .x))
intersections <- sapply(intersections, length)
This brings us to my desired outcome:
> intersections
A B C D E AB AC AD AE BC
20 20 20 20 20 10 8 8 9 8
BD BE CD CE DE ABC ABD ABE ACD ACE
8 9 7 8 8 8 8 9 7 8
ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE
8 7 8 8 7 7 8 8 7 7
ABCDE
7
Question: is there a way of achieving the same result with less fuss?
Here is a possible approach that uses data.table to cast the data.frame to wide format and model.matrix to count the higher-order interactions:
Cast to wide format, grouping all matching values between categories in the rows (credits to @chinsoon12 for the dcast syntax).
Identify all higher-order interactions with model.matrix and sum over the columns.
library(data.table)
df_wide <- dcast(setDT(df), value + rowid(category, value) ~ category, fun.aggregate = length, fill = 0)
head(df_wide)
#> value category A B C D E
#> 1: a 1 1 1 1 1 1
#> 2: a 2 1 0 0 1 1
#> 3: a 3 0 0 0 1 0
#> 4: b 1 1 1 1 0 1
#> 5: b 2 1 0 1 0 1
#> 6: c 1 1 1 1 1 1
colSums(model.matrix(~(A + B + C + D + E)^5, data = df_wide))[-1]
#> A B C D E A:B A:C
#> 20 20 20 20 20 13 11
#> A:D A:E B:C B:D B:E C:D C:E
#> 12 12 11 13 13 11 13
#> D:E A:B:C A:B:D A:B:E A:C:D A:C:E A:D:E
#> 10 8 9 9 7 9 7
#> B:C:D B:C:E B:D:E C:D:E A:B:C:D A:B:C:E A:B:D:E
#> 8 9 7 8 5 7 5
#> A:C:D:E B:C:D:E A:B:C:D:E
#> 5 6 4
Data
set.seed(1)
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
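One thing to keep in mind when comparing this with the output in the question: as far as I can tell, the wide table has one row per k-th occurrence of a value, so the interaction columns count matched occurrences (for each value, the smaller of the per-category counts), whereas intersect() counts distinct common values. A tiny hand-made sketch of the difference (the data frame x and the names n_set and n_multi are made up for illustration):

```r
# Hypothetical toy data, not the data from the question
x <- data.frame(category = c("A", "A", "A", "B", "B", "B"),
                value    = c("a", "a", "b", "a", "a", "c"))
vals <- split(x$value, x$category)

# Distinct common values, as intersect() in the question would count them
n_set <- length(intersect(vals$A, vals$B))

# Matched occurrences: for every value, the smaller of the two counts
tab <- table(x$value, x$category)
n_multi <- sum(pmin(tab[, "A"], tab[, "B"]))

c(set = n_set, multiset = n_multi)
```

Here "a" occurs twice in both categories, so the set-style count is 1 while the occurrence-style count is 2.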
Hi, I have a data frame with multiple columns, i.e. the first 5 columns are my metadata and the remaining
columns (their count will always be even) are the actual columns that need to be calculated.
Formula: (col6*col9) + (col7*col10) + (col8*col11)
country<-c("US","US","US","US")
name <-c("A","B","c","d")
dob<-c(2017,2018,2018,2010)
day<-c(1,4,7,9)
hour<-c(10,11,2,4)
a <-c(1,3,4,5)
d<-c(1,9,4,0)
e<-c(8,1,0,7)
f<-c(10,2,5,6)
j<-c(1,4,2,7)
m<-c(1,5,7,1)
df=data.frame(country,name,dob,day,hour,a,d,e,f,j,m)
How can I get the final sum if I have more columns?
I have tried the code below:
df$final <-(df$a*df$f)+(df$d*df$j)+(df$e*df$m)
Here is one way to generalize the computation:
x <- ncol(df) - 5
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
# country name dob day hour a d e f j m final
# 1 US A 2017 1 10 1 1 8 10 1 1 19
# 2 US B 2018 4 11 3 9 1 2 4 5 47
# 3 US c 2018 7 2 4 4 0 5 2 7 28
# 4 US d 2010 9 4 5 0 7 6 7 1 37
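The same idea can be spelled out with explicit index vectors, which may be easier to read. This is only a sketch: it assumes the value columns start right after the 5 metadata columns and that their count is even, and the names n_meta, value_cols, first_half and second_half are mine, not from the question.

```r
df <- data.frame(country = c("US", "US", "US", "US"),
                 name = c("A", "B", "c", "d"),
                 dob = c(2017, 2018, 2018, 2010),
                 day = c(1, 4, 7, 9), hour = c(10, 11, 2, 4),
                 a = c(1, 3, 4, 5), d = c(1, 9, 4, 0), e = c(8, 1, 0, 7),
                 f = c(10, 2, 5, 6), j = c(1, 4, 2, 7), m = c(1, 5, 7, 1))

n_meta <- 5                                   # assumed number of metadata columns
value_cols <- seq(n_meta + 1, ncol(df))       # columns that take part in the formula
stopifnot(length(value_cols) %% 2 == 0)       # the pairing only works for an even count

half <- length(value_cols) / 2
first_half  <- value_cols[seq_len(half)]          # a, d, e
second_half <- value_cols[half + seq_len(half)]   # f, j, m

# Multiply the i-th column of the first half by the i-th of the second, then sum per row
final <- rowSums(df[first_half] * df[second_half])
final
```

For the first row this gives 1*10 + 1*1 + 8*1 = 19, matching the hand-written formula.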
I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a and summing the columns that share a name, so that in the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the matching columns.
My dataframe is pretty big so if the solution can speed up things that would even be better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
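If building the call as a string feels fragile, one alternative that avoids eval(parse()) is to stack the rows and sum by ID. Note that this is a base R sketch, not the data.table approach above, and I have not benchmarked it; the names stacked and z are only illustrative.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# Keep all rows of a plus only those rows of b whose ID occurs in a
stacked <- rbind(a, b[b$ID %in% a$ID, ])

# Sum every value column within each ID (aggregate() returns the IDs sorted)
z <- aggregate(stacked[-1], by = list(ID = stacked$ID), FUN = sum)

# Restore the row order of a
z <- z[match(a$ID, z$ID), ]
z
```

Because all rows of a are stacked first, IDs that exist only in a would simply keep their own values.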
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row-indices in both data frames). Then, simply use vectorized additions for those matching rows, but omit ID, as factor cannot be summed up; add ID back manually.
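The positional addition relies on the shared IDs appearing in the same relative order in a and b, which happens to hold in the example. A slightly more defensive sketch uses a single match() call so that b is reordered to follow a. Here b is deliberately shuffled to show the point, and idx and found are illustrative names; rows of a without a partner in b are dropped in this version.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
# Same rows as b in the question, but in a different (made-up) order
b <- data.frame(ID = c("H", "W", "B", "Z", "E", "A"),
                a = c(0, 39, 10, 5, 11, 9), b = c(34, 54, 2, 12, 7, 4),
                c = c(14, 23, 0, 13, 34, 12))

idx   <- match(a$ID, b$ID)   # row of b that holds each ID of a (NA if absent)
found <- !is.na(idx)         # IDs of a that actually occur in b

# Add the value columns row by row, with b aligned to the order of a
z <- cbind(ID = a$ID[found], a[found, -1] + b[idx[found], -1])
z
```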
You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a and then add element-wise (this assumes the shared IDs appear in the same order in both data frames).
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
I play with the following two simple datasets:
(myData <- data.frame(ID=c(1:7, 5), Sum=c(10, 20, 30, 40, 50, 60, 700, 200)))
# ID Sum
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 700
# 8 5 200
and
(myMap <- data.frame(ID=c(1:5, 7), Name=c("a", "b", "c", "d", "e", "g")))
# ID Name
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 7 g
I will map the data with the map, this way:
myData$Name<-myMap$Name[match(myData$ID, myMap$ID)]
However since there is no map entry for the ID == 6, the output is:
ID Sum Name
1 1 10 a
2 2 20 b
3 3 30 c
4 4 40 d
5 5 50 e
6 6 60 <NA>
7 7 700 g
8 5 200 e
What I am trying to do now: in the record where Name is NA, the Name should become ID.
My attempts:
myData$Dummy<-ifelse( is.na(myData$Name),myData$ID, myData$Name)
or
for (i in 1:length(myData$Name) )
if (is.na(myData$Name[i]))
{
x <- myData$ID[i]
# print(x)
myData$Name[i]<- as.factor(x)
print(myData$Name[i])
}
are wrong. Could you please give me a hint?
It's the fact that the column you think is character is really a factor. Either use stringsAsFactors=FALSE when creating the data frame or you'll need to account for it when manipulating the data. I've provided dplyr + piping and base R solutions below. Note the use of left_join (dplyr) or merge (base) vs your subset & matching:
library(dplyr)
myData <- read.csv(text="ID;Sum
1;10
2;20
3;30
4;40
5;50
6;60
7;700
5;200", sep=";")
myMap <- read.csv(text="ID;Name
1;a
2;b
3;c
4;d
5;e
7;g", sep=";")
# dplyr -------------------------------------------------------------------
myData %>%
left_join(myMap) %>%
mutate(Name=as.character(Name),
Name=ifelse(is.na(Name), ID, Name)) -> dplyr_myData
## Joining by: "ID"
dplyr_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 6 60 6
## 7 7 700 g
## 8 5 200 e
# base --------------------------------------------------------------------
base_myData <- merge(myData, myMap, all.x=TRUE)
base_myData$Name <- as.character(base_myData$Name)
base_myData$Name <- ifelse(is.na(base_myData$Name),
base_myData$ID, base_myData$Name)
base_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 5 200 e
## 7 6 60 6
## 8 7 700 g
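The factor issue can be seen in isolation with a small sketch (the vectors name and id are made up for illustration): when the "no" branch of ifelse() is a factor, its underlying integer codes end up in the result, not its labels.

```r
name <- factor(c("a", "b", NA, "g"))   # levels are a, b, g with codes 1, 2, 3
id   <- c(1, 2, 6, 7)

# The factor is treated as its integer codes, so "g" comes back as 3
wrong <- ifelse(is.na(name), id, name)

# Converting to character first keeps the labels; the ID is coerced to "6"
right <- ifelse(is.na(name), id, as.character(name))

wrong
right
```

This is why both solutions above call as.character() on Name before the ifelse().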
An option using data.table
library(data.table)#1.9.5+
setkey(setDT(myData), ID)[myMap, Name:=i.Name][is.na(Name),
Name:= as.character(ID)]
# ID Sum Name
#1: 1 10 a
#2: 2 20 b
#3: 3 30 c
#4: 4 40 d
#5: 5 50 e
#6: 5 200 e
#7: 6 60 6
#8: 7 700 g
NOTE: As commented by @Arun, in the devel version v1.9.5, we can also set the key as an argument inside setDT, i.e. setDT(myData, key='ID')
Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in the comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
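As far as I understand it, split() on a data frame partitions whole rows, while on a matrix it partitions the individual cells, which is what is needed here. A self-contained sketch of the idea, starting from a plain matrix (diag_id and sums are illustrative names):

```r
# Same counts as in the question, stored as a numeric matrix
m <- matrix(c(8, 12, 5, 14,
              1,  6, 4,  3,
              2,  5, 4, 11,
              5,  3, 7,  2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("100", "99", "98", "97"), as.character(1:4)))

# row + column index is constant along each anti-diagonal:
# 2 for the top-left cell, 8 for the bottom-right one
diag_id <- row(m) + col(m)

# Group the cells by that id and sum each group
sums <- vapply(split(m, diag_id), sum, numeric(1))
names(sums) <- LETTERS[seq_along(sums)]
sums
```

For instance, group D collects the cells with row + column equal to 5, i.e. 14 + 4 + 5 + 5 = 28.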
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2
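With the same df1 and df2, a shorter variant might be to flatten both data frames and let tapply() do the grouping. This is just a sketch; it assumes df2 holds exactly one label for every cell of df1, in the same layout.

```r
df1 <- data.frame(a = c(8, 1, 2, 5), b = c(12, 6, 5, 3),
                  c = c(5, 4, 4, 7), d = c(14, 3, 11, 2))
df2 <- data.frame(a = c("A", "B", "C", "D"), b = c("B", "C", "D", "E"),
                  c = c("C", "D", "E", "F"), d = c("D", "E", "F", "G"),
                  stringsAsFactors = FALSE)

# unlist() flattens both data frames column by column, so counts and labels stay aligned
sums <- tapply(unlist(df1), unlist(df2), sum)
sums
```

The result is a named array rather than a two-column data frame; data.frame(group = names(sums), sum = as.vector(sums)) would reshape it if needed.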
I have the data frame df and I want to subset df based on a number sequence within each level of a categorical variable.
x <- c(1,2,3,4,5,7,9,11,13)
x2 <- x+77
df <- data.frame(x=c(x,x2),y= c(rep("A",9),rep("B",9)))
df
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 7 A
7 9 A
8 11 A
9 13 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
15 84 B
16 86 B
17 88 B
18 90 B
I want only the rows where x increments by 1 and not the rows where x increases by two: e.g.
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
I figured I have to do some sort of subtraction between elements and check if the difference is >1 and combine this with a ddply, but this seems cumbersome. Is there a sort of sequence function I am missing?
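Using diff, applied within each level of y so that the jump from one group to the next is not counted as a step:
df[ave(df$x, df$y, FUN = function(v) c(1, diff(v))) == 1, ]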
Your example seems to behave well and can be nicely handled by @agstudy's answer. Should your data act up one day, though...
myfun <- function(d, whichDiff = 1) {
# d is the data.frame you'd like to subset, containing the variable 'x'
# whichDiff is the difference between values of x you're looking for
theWh <- which(!as.logical(diff(d$x) - whichDiff))
# Take the diff of x, subtract whichDiff to get the desired values equal to 0
# Coerce this to a logical vector and take the inverse (!)
# which() gets the indexes that are TRUE.
# allWh <- sapply(theWh, "+", 1)
# Since the desired rows may be disjoint, use sapply to get each index + 1
# Seriously? sapply to add 1 to a numeric vector? Not even on a Friday.
allWh <- theWh + 1
return(d[sort(unique(c(theWh, allWh))), ])
}
> library(plyr)
>
> ddply(df, .(y), myfun)
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 78 B
7 79 B
8 80 B
9 81 B
10 82 B