We have a data frame with one column for a category and one column for discrete values. We want to get all possible intersections (number of common values) for all combinations of categories.
I came up with the following code. However, is there something shorter out there? I am sure there is a better way of doing this, ideally a specialized function that does exactly this. The code below can of course be shortened, for example with purrr::map, but that is not my question.
## prepare an example data set
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
cats <- unique(df$category)
n <- length(cats)
## all combinations of 1...n unique elements from category
combinations <- lapply(1:n, function(i) combn(cats, i, simplify=FALSE))
combinations <- unlist(combinations, recursive=FALSE)
names(combinations) <- sapply(combinations, paste0, collapse="")
## for each combination of categories, get the values which belong
## to this category
intersections <- lapply(combinations,
function(co)
lapply(co, function(.x) df$value[ df$category == .x ]))
intersections <- lapply(intersections,
function(.x) Reduce(intersect, .x))
intersections <- sapply(intersections, length)
This brings us to my desired outcome:
> intersections
A B C D E AB AC AD AE BC
20 20 20 20 20 10 8 8 9 8
BD BE CD CE DE ABC ABD ABE ACD ACE
8 9 7 8 8 8 8 9 7 8
ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE
8 7 8 8 7 7 8 8 7 7
ABCDE
7
Question: is there a way of achieving the same result with less fuss?
Here is a possible approach that uses data.table to cast the data.frame to wide format and model.matrix to count the higher-order interactions:
Cast to wide format, placing matching values across categories in the same rows (credit to @chinsoon12 for the dcast syntax).
Identify all higher-order interactions with model.matrix and sum over the columns.
library(data.table)
df_wide <- dcast(setDT(df), value + rowid(category, value) ~ category, fun.aggregate = length, fill = 0)
head(df_wide)
#> value category A B C D E
#> 1: a 1 1 1 1 1 1
#> 2: a 2 1 0 0 1 1
#> 3: a 3 0 0 0 1 0
#> 4: b 1 1 1 1 0 1
#> 5: b 2 1 0 1 0 1
#> 6: c 1 1 1 1 1 1
colSums(model.matrix(~(A + B + C + D + E)^5, data = df_wide))[-1]
#> A B C D E A:B A:C
#> 20 20 20 20 20 13 11
#> A:D A:E B:C B:D B:E C:D C:E
#> 12 12 11 13 13 11 13
#> D:E A:B:C A:B:D A:B:E A:C:D A:C:E A:D:E
#> 10 8 9 9 7 9 7
#> B:C:D B:C:E B:D:E C:D:E A:B:C:D A:B:C:E A:B:D:E
#> 8 9 7 8 5 7 5
#> A:C:D:E B:C:D:E A:B:C:D:E
#> 5 6 4
Data
set.seed(1)
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
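Note that the dcast/model.matrix counts above treat repeated values as separate matches, which is why some pairwise counts exceed the 10 distinct letters. If the goal is the number of distinct shared values, as in the question's Reduce(intersect, ...) code, a presence/absence matrix is one possible sketch (the names present, combos and counts are illustrative and not from the original posts):

```r
set.seed(1)
df <- data.frame(category = rep(LETTERS[1:5], each = 20),
                 value = sample(letters[1:10], 100, replace = TRUE))

# TRUE where a value occurs at least once in a category
# (rows: values, columns: categories)
present <- unclass(table(df$value, df$category)) > 0
cats <- colnames(present)

# all combinations of 1..n categories
combos <- unlist(lapply(seq_along(cats),
                        function(i) combn(cats, i, simplify = FALSE)),
                 recursive = FALSE)

# a value is shared by a combination if it is present in every member
counts <- vapply(combos,
                 function(co) sum(rowSums(present[, co, drop = FALSE]) == length(co)),
                 numeric(1))
names(counts) <- vapply(combos, paste, character(1), collapse = "")
counts
```

Unlike the question's code, the single-category entries here report the number of distinct values rather than all 20 rows, because Reduce() on a one-element list returns that element unchanged.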
I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a, and sum the columns that have the same name, so that in the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the matching columns.
My data frame is pretty big, so a solution that also speeds things up would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
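The same sums can be obtained without building and evaluating a query string. A minimal base R sketch, assuming the columns to be summed have the same names in both data frames (stacked and z are illustrative names):

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# keep only the rows of b whose ID occurs in a, then stack both data frames
stacked <- rbind(a, b[b$ID %in% a$ID, ])

# sum every non-ID column within each ID
z <- aggregate(. ~ ID, data = stacked, FUN = sum)
z
```

aggregate() returns the rows sorted by ID. If stacked were a data.table, the grouping step could presumably be written as stacked[, lapply(.SD, sum), by = ID] instead.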
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, match the IDs of b against a and vice versa, so that only rows appearing in both data frames are kept (and their row indices in each data frame are known). Then use vectorized addition on those matching rows, omitting ID, since a factor cannot be summed; ID is added back manually. Note that this relies on the shared IDs appearing in the same relative order in both data frames, as they do here.
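A variant that does not depend on the row order of b, sketched under the assumption that every ID in a also occurs in b (idx and z are illustrative names):

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# position in b of each ID of a, in the order of a
idx <- match(a$ID, b$ID)

# add the aligned numeric columns and put ID back in front
z <- cbind(ID = a$ID, a[-1] + b[idx, -1])
z
```

If some ID of a were missing from b, idx would contain NA and the corresponding sums would be NA, so that case would need separate handling.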
You cannot add the two data frames directly because they are of unequal size. To make them the same size, you can keep only the rows of b whose IDs are present in a and then add the data frames element-wise.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
I am working with the following two simple datasets:
(myData <- data.frame(ID=c(1:7, 5), Sum=c(10, 20, 30, 40, 50, 60, 700, 200)))
# ID Sum
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 700
# 8 5 200
and
(myMap <- data.frame(ID=c(1:5, 7), Name=c("a", "b", "c", "d", "e", "g")))
# ID Name
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 7 g
I map the names onto the data using the lookup table, like this:
myData$Name<-myMap$Name[match(myData$ID, myMap$ID)]
However since there is no map entry for the ID == 6, the output is:
ID Sum Name
1 1 10 a
2 2 20 b
3 3 30 c
4 4 40 d
5 5 50 e
6 6 60 <NA>
7 7 700 g
8 5 200 e
What I am trying to do now: in the record where Name is NA, the Name should become ID.
My attempts:
myData$Dummy<-ifelse( is.na(myData$Name),myData$ID, myData$Name)
or
for (i in 1:length(myData$Name) )
if (is.na(myData$Name[i]))
{
x <- myData$ID[i]
# print(x)
myData$Name[i]<- as.factor(x)
print(myData$Name[i])
}
are wrong. Could you please give me a hint?
The problem is that the column you think is character is really a factor (the default for string columns in data.frame() before R 4.0.0). Either use stringsAsFactors=FALSE when creating the data frame, or account for the factor when manipulating the data. I've provided dplyr + piping and base R solutions below. Note the use of left_join (dplyr) or merge (base) instead of your subsetting and matching:
library(dplyr)
myData <- read.csv(text="ID;Sum
1;10
2;20
3;30
4;40
5;50
6;60
7;700
5;200", sep=";")
myMap <- read.csv(text="ID;Name
1;a
2;b
3;c
4;d
5;e
7;g", sep=";")
# dplyr -------------------------------------------------------------------
myData %>%
left_join(myMap) %>%
mutate(Name=as.character(Name),
Name=ifelse(is.na(Name), ID, Name)) -> dplyr_myData
## Joining by: "ID"
dplyr_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 6 60 6
## 7 7 700 g
## 8 5 200 e
# base --------------------------------------------------------------------
base_myData <- merge(myData, myMap, all.x=TRUE)
base_myData$Name <- as.character(base_myData$Name)
base_myData$Name <- ifelse(is.na(base_myData$Name),
base_myData$ID, base_myData$Name)
base_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 5 200 e
## 7 6 60 6
## 8 7 700 g
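The original match() lookup also works once the factor is converted to character before indexing. A minimal sketch (stringsAsFactors = TRUE is set explicitly only to reproduce the factor column; nm is an illustrative name):

```r
myData <- data.frame(ID = c(1:7, 5),
                     Sum = c(10, 20, 30, 40, 50, 60, 700, 200))
myMap <- data.frame(ID = c(1:5, 7),
                    Name = c("a", "b", "c", "d", "e", "g"),
                    stringsAsFactors = TRUE)

# look up names as character strings; IDs without a map entry give NA
nm <- as.character(myMap$Name)[match(myData$ID, myMap$ID)]

# fall back to the ID (as text) where no name was found
myData$Name <- ifelse(is.na(nm), as.character(myData$ID), nm)
myData
```

This keeps the original row order of myData, whereas merge() sorts the result by ID.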
An option using data.table
library(data.table)#1.9.5+
setkey(setDT(myData), ID)[myMap, Name:=i.Name][is.na(Name),
Name:= as.character(ID)]
# ID Sum Name
#1: 1 10 a
#2: 2 20 b
#3: 3 30 c
#4: 4 40 d
#5: 5 50 e
#6: 5 200 e
#7: 6 60 6
#8: 7 700 g
NOTE: As commented by @Arun, in the devel version v1.9.5 we can also set the key as an argument inside setDT, i.e. setDT(myData, key='ID')
Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In the data frame above, the values are counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in the comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
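To see why row() + col() identifies the diagonals, it can help to look at the index matrix itself. A small sketch on a plain matrix holding the counts from the question (idx and sums are illustrative names):

```r
m <- matrix(c(8, 1, 2, 5,  12, 6, 5, 3,  5, 4, 4, 7,  14, 3, 11, 2),
            nrow = 4,
            dimnames = list(c("100", "99", "98", "97"), as.character(1:4)))

# cells on the same anti-diagonal share the same row + column index
idx <- row(m) + col(m) - 1
idx

# sum the counts within each diagonal and label the groups A..G
sums <- tapply(m, idx, sum)
data.frame(group = LETTERS[as.integer(names(sums))], sum = as.vector(sums))
```

Here idx runs from 1 (top-left cell) to 7 (bottom-right cell), so each value maps directly onto one of the letters A to G.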
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack() and aggregate(), although it requires that the second data.frame contain character vectors rather than factors (this could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2
I have a dataframe with columns A, B and C.
I want to apply a function on each row of a dataframe in which a function will check the value of row$A and row$B and will update row$C based on those values. How can I achieve that?
Example:
A B C
1 1 10 10
2 2 20 20
3 NA 30 30
4 NA 40 40
5 5 50 50
Now I want to update all rows in C column to B/2 value in that same row if value in A column for that row is NA.
So the dataframe after changes would look like:
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
I would like to know if this can be done without using a for loop.
Or, if you want to update the column by reference (without copying the whole data set when updating the column), you could also try data.table:
library(data.table)
setDT(dat)[is.na(A), C := B/2]
dat
# A B C
# 1: 1 10 10
# 2: 2 20 20
# 3: NA 30 15
# 4: NA 40 20
# 5: 5 50 50
Edit:
Regarding @Arun's comment: checking the address before and after the change suggests that the column was still updated by reference.
library(pryr)
address(dat$C)
## [1] "0x2f85a4f0"
setDT(dat)[is.na(A), C := B/2]
address(dat$C)
## [1] "0x2f85a4f0"
Try this:
your_data <- within(your_data, C[is.na(A)] <- B[is.na(A)] / 2)
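For a self-contained run, the question's example can be rebuilt first; the constructor call below is an assumption based on the printed table:

```r
# example data reconstructed from the table in the question
your_data <- data.frame(A = c(1, 2, NA, NA, 5),
                        B = c(10, 20, 30, 40, 50),
                        C = c(10, 20, 30, 40, 50))

# replace C by B/2 only in the rows where A is NA
your_data <- within(your_data, C[is.na(A)] <- B[is.na(A)] / 2)
your_data
```

Rows 3 and 4 are the only ones with a missing A, so only their C values change (to 15 and 20).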
Try
indx <- is.na(df$A)
df$C[indx] <- df$B[indx]/2
df
# A B C
#1 1 10 10
#2 2 20 20
#3 NA 30 15
#4 NA 40 20
#5 5 50 50
Here is a simple example using library(dplyr).
Fictional dataset:
df <- data.frame(a=c(1, NA, NA, 2), b=c(10, 20, 50, 50))
You only want to change the rows where a is NA, so you can use ifelse with is.na:
df <- mutate(df, c=ifelse(is.na(a), b/2, b))
Another approach:
dat <- transform(dat, C = B / 2 * (i <- is.na(A)) + C * !i)
# A B C
# 1 1 10 10
# 2 2 20 20
# 3 NA 30 15
# 4 NA 40 20
# 5 5 50 50
Try:
> ddf$C = with(ddf, ifelse(is.na(A), B/2, C))
>
> ddf
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50