Merging and summarizing two dataframes - r

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1 A 9 4 12
2 B 10 2 0
3 E 11 7 34
4 W 39 54 23
5 Z 5 12 13
6 H 0 34 14
I want to merge both data frames, keeping only the rows whose ID appears in data.frame a, and sum the columns that share a name, so that at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot work out how to sum the columns.
My data frame is pretty big, so a solution that also speeds things up would be even better.

If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
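To see what the query-building step produces, the string can be assembled on its own and inspected before it is evaluated. This is only a sketch of the same paste calls; it assumes the summed columns are named a, b and c as in the example, and it needs no data.table because nothing is evaluated.

```r
# Column names to sum; assumed to be a, b, c as in the example data
col_names <- c("a", "b", "c")

# merge() suffixes clashing columns with .x (from a) and .y (from b)
pairs <- paste(paste0(col_names, ".x"), paste0(col_names, ".y"), sep = ",")

# One "name = sum(name.x,name.y)" term per column, then the full call
terms <- paste0(col_names, " = sum(", pairs, ")")
query <- paste0("z <- c[,.(", paste(terms, collapse = ","), "),by=ID]")

print(query)
# [1] "z <- c[,.(a = sum(a.x,a.y),b = sum(b.x,b.y),c = sum(c.x,c.y)),by=ID]"
```

Since the sum is grouped by ID, each term is simply the row-wise total of the two merged columns, as long as every ID occurs once in each table.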

This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get the matching rows from a in b and vice versa, so we can be sure that we only keep rows that appear in both data frames (and we know their row indices in each). Then simply use vectorized addition on those matching rows, leaving out ID because a factor cannot be summed, and add ID back manually. Note that this relies on the shared IDs appearing in the same relative order in both data frames, as they do in the example.
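If the two data frames list the IDs in different orders, a single match() from a into b keeps the rows paired. The following is a minimal sketch of that variant, not the answer's original code; the reversed copy of b and the names idx, keep and z are introduced here purely for illustration.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))
b <- b[nrow(b):1, ]              # reverse b so the ID order differs (illustration only)

idx  <- match(a$ID, b$ID)        # for each row of a, the matching row of b (or NA)
keep <- !is.na(idx)              # rows of a that have a partner in b

# Add the numeric columns pairwise, then put ID back in front
z <- cbind(ID = a$ID[keep], a[keep, -1] + b[idx[keep], -1])
print(z)
```

Because both operands are indexed through the same idx and keep vectors, row i of the left side always meets its own ID on the right side.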

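You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a, and then add them element-wise. This assumes that the shared IDs are in the same order in both data frames and that every ID in a also occurs in b.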
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
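Another way to express "sum the columns per ID" is to stack the relevant rows and aggregate. This is a base R sketch under the assumption that all columns other than ID are numeric; the names stacked and z are illustrative. aggregate() returns the groups sorted by ID, so the last line restores the order of a.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# Stack a on top of the rows of b whose ID also occurs in a
stacked <- rbind(a, b[b$ID %in% a$ID, ])

# Sum every non-ID column within each ID
z <- aggregate(. ~ ID, data = stacked, FUN = sum)

# Put the rows back into the order in which the IDs appear in a
z <- z[match(a$ID, z$ID), ]
print(z)
```

This version does not depend on row order, and an ID of a that is missing from b simply keeps its own values.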

Related

Compare two columns of two different data frames with different length of rows return a third row

I have two different df which have the same columns: "O" for place and "date" for time.
Df 1 gives different information for a certain place (O) and time (date) in one row, and df 2 has several pieces of information for the same year and place spread over many rows. Now I want to extract one value from the first df and apply it to all the rows of the second df where the values for "O" and "date" are equal.
To make it clearer:
I have one row in df 1: krnqm=250 for O=1002 and date=1885. Now I want a new column "krnqm" in df 2 where df2$krnqm = 250 for all rows where df2$O=1002 and df2$date=1885.
Unfortunately I have no idea how to put that condition into a line of code and would be grateful for your help.
You can do this quite easily in base R using the merge function. Here's an example.
Simulate some data from your description:
df1 <- expand.grid(O = letters[c(2:4,7)], date = c(1,3))
df2 <- data.frame(O = rep(letters[1:6], c(2,3,3,6,2,2)), date = rep(1:3, c(3,2,4)))
df1$krnqm <- sample(1:1000, size = nrow(df1), replace=T)
> df1
O date krnqm
1 b 1 833
2 c 1 219
3 d 1 773
4 g 1 514
5 b 3 118
6 c 3 969
7 d 3 704
8 g 3 914
> df2
O date
1 a 1
2 a 1
3 b 1
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
9 d 3
10 d 1
11 d 1
12 d 1
13 d 2
14 d 2
15 e 3
16 e 3
17 f 3
18 f 3
Now let's combine the two data frames in the manner you describe.
df2 <- merge(df2, df1, all.x=T)
> df2
O date krnqm
1 a 1 NA
2 a 1 NA
3 b 1 833
4 b 2 NA
5 b 2 NA
6 c 3 969
7 c 3 969
8 c 3 969
9 d 1 773
10 d 1 773
11 d 1 773
12 d 2 NA
13 d 2 NA
14 d 3 704
15 e 3 NA
16 e 3 NA
17 f 3 NA
18 f 3 NA
So you can see, the krnqm column in the resulting data frame contains NAs for any combinations of 'O' and 'date' that were not found in the data frame where the krnqm values were extracted from. If your df1 has other columns, that you do not want to be included in the merge, just change the merge call slightly to only use those columns that you want: df2 <- merge(df2, df1[,c("O", "date", "krnqm")], all.x=T).
Good luck!
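The same merge can be checked on the values from the question. In this small sketch only the row O=1002, date=1885, krnqm=250 comes from the question; every other value, and the obs column, is made up for illustration.

```r
# df1: one krnqm value per place/year (second row is invented)
df1 <- data.frame(O = c(1002, 1003), date = c(1885, 1885), krnqm = c(250, 310))

# df2: several rows per place/year (all values invented)
df2 <- data.frame(O    = c(1002, 1002, 1003, 1004),
                  date = c(1885, 1885, 1886, 1885),
                  obs  = c("w", "x", "y", "z"))

# Left join on both key columns: every row of df2 is kept,
# krnqm is copied wherever O and date both match, NA elsewhere
out <- merge(df2, df1, by = c("O", "date"), all.x = TRUE)
print(out)
```

Naming the keys explicitly in by = c("O", "date") guards against other columns that happen to share a name being used in the join.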

Calculating all possible intersections

We have a data frame with one column for a category and one column for discrete values. We want to get all possible intersections (number of common values) for all combinations of categories.
I came up with the following code. However, is there something shorter out there? I am sure there is a better way of doing this, a specialized function that does exactly this. The code below can be shortened, of course, for example with purrr::map, but that is not my question.
## prepare an example data set
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
cats <- unique(df$category)
n <- length(cats)
## all combinations of 1...n unique elements from category
combinations <- lapply(1:n, function(i) combn(cats, i, simplify=FALSE))
combinations <- unlist(combinations, recursive=FALSE)
names(combinations) <- sapply(combinations, paste0, collapse="")
## for each combination of categories, get the values which belong
## to this category
intersections <- lapply(combinations,
function(co)
lapply(co, function(.x) df$value[ df$category == .x ]))
intersections <- lapply(intersections,
function(.x) Reduce(intersect, .x))
intersections <- sapply(intersections, length)
This brings us to my desired outcome:
> intersections
A B C D E AB AC AD AE BC
20 20 20 20 20 10 8 8 9 8
BD BE CD CE DE ABC ABD ABE ACD ACE
8 9 7 8 8 8 8 9 7 8
ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE
8 7 8 8 7 7 8 8 7 7
ABCDE
7
Question: is there a way of achieving the same result with less fuss?
Here is a possible approach with data.table to cast the data.frame and model.matrix to count the higher-order interactions:
Cast to wide-format by grouping all matching values between categories in the rows (credits to @chinsoon12 for the dcast syntax).
Identify all higher-order interactions with model.matrix and sum over the columns.
library(data.table)
df_wide <- dcast(setDT(df), value + rowid(category, value) ~ category, fun.aggregate = length, fill = 0)
head(df_wide)
#> value category A B C D E
#> 1: a 1 1 1 1 1 1
#> 2: a 2 1 0 0 1 1
#> 3: a 3 0 0 0 1 0
#> 4: b 1 1 1 1 0 1
#> 5: b 2 1 0 1 0 1
#> 6: c 1 1 1 1 1 1
colSums(model.matrix(~(A + B + C + D + E)^5, data = df_wide))[-1]
#> A B C D E A:B A:C
#> 20 20 20 20 20 13 11
#> A:D A:E B:C B:D B:E C:D C:E
#> 12 12 11 13 13 11 13
#> D:E A:B:C A:B:D A:B:E A:C:D A:C:E A:D:E
#> 10 8 9 9 7 9 7
#> B:C:D B:C:E B:D:E C:D:E A:B:C:D A:B:C:E A:B:D:E
#> 8 9 7 8 5 7 5
#> A:C:D:E B:C:D:E A:B:C:D:E
#> 5 6 4
Data
set.seed(1)
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
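Note that the model.matrix counts above treat repeated values as separate items, which is why pairwise counts such as A:B = 13 can exceed the 10 distinct letters. If plain set intersections are wanted, as in the question, the original code can be condensed somewhat in base R. The following is a sketch of that, with vals, combos and counts as illustrative names; no expected output is shown because it depends on the random sample.

```r
set.seed(1)
df <- data.frame(category = rep(LETTERS[1:5], each = 20),
                 value = sample(letters[1:10], 100, replace = TRUE))

# One vector of values per category
vals <- split(df$value, df$category)

# Every combination of 1..n category names, as a flat list
combos <- unlist(lapply(seq_along(vals),
                        function(i) combn(names(vals), i, simplify = FALSE)),
                 recursive = FALSE)

# Size of the common value set for each combination
counts <- sapply(combos, function(co) length(Reduce(intersect, vals[co])))
names(counts) <- sapply(combos, paste0, collapse = "")
print(counts)
```

As in the question's code, a single category is passed through Reduce() unchanged, so its count is the number of rows (20) rather than the number of distinct values.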

How to perform a conditional lookup in r?

I play with the following two simple datasets:
(myData <- data.frame(ID=c(1:7, 5), Sum=c(10, 20, 30, 40, 50, 60, 700, 200)))
# ID Sum
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 700
# 8 5 200
and
(myMap <- data.frame(ID=c(1:5, 7), Name=c("a", "b", "c", "d", "e", "g")))
# ID Name
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 7 g
I map the data with the lookup table like this:
myData$Name<-myMap$Name[match(myData$ID, myMap$ID)]
However since there is no map entry for the ID == 6, the output is:
ID Sum Name
1 1 10 a
2 2 20 b
3 3 30 c
4 4 40 d
5 5 50 e
6 6 60 <NA>
7 7 700 g
8 5 200 e
What I am trying to do now: in the record where Name is NA, the Name should become ID.
My attempts:
myData$Dummy<-ifelse( is.na(myData$Name),myData$ID, myData$Name)
or
for (i in 1:length(myData$Name) )
if (is.na(myData$Name[i]))
{
x <- myData$ID[i]
# print(x)
myData$Name[i]<- as.factor(x)
print(myData$Name[i])
}
are wrong. Could you please give me a hint?
It's the fact that the column you think is character is really a factor. Either use stringsAsFactors=FALSE when creating the data frame or you'll need to account for it when manipulating the data. I've provided dplyr + piping and base R solutions below. Note the use of left_join (dplyr) or merge (base) vs your subset & matching:
library(dplyr)
myData <- read.csv(text="ID;Sum
1;10
2;20
3;30
4;40
5;50
6;60
7;700
5;200", sep=";")
myMap <- read.csv(text="ID;Name
1;a
2;b
3;c
4;d
5;e
7;g", sep=";")
# dplyr -------------------------------------------------------------------
myData %>%
left_join(myMap) %>%
mutate(Name=as.character(Name),
Name=ifelse(is.na(Name), ID, Name)) -> dplyr_myData
## Joining by: "ID"
dplyr_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 6 60 6
## 7 7 700 g
## 8 5 200 e
# base --------------------------------------------------------------------
base_myData <- merge(myData, myMap, all.x=TRUE)
base_myData$Name <- as.character(base_myData$Name)
base_myData$Name <- ifelse(is.na(base_myData$Name),
base_myData$ID, base_myData$Name)
base_myData
## ID Sum Name
## 1 1 10 a
## 2 2 20 b
## 3 3 30 c
## 4 4 40 d
## 5 5 50 e
## 6 5 200 e
## 7 6 60 6
## 8 7 700 g
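The factor issue can also be fixed while keeping the match() lookup from the question. The sketch below builds Name as a factor explicitly, since data.frame() only did that by default before R 4.0.0; looked_up and codes are illustrative names.

```r
myData <- data.frame(ID = c(1:7, 5), Sum = c(10, 20, 30, 40, 50, 60, 700, 200))
# Name is made a factor on purpose to reproduce the problem
myMap <- data.frame(ID = c(1:5, 7), Name = factor(c("a", "b", "c", "d", "e", "g")))

looked_up <- myMap$Name[match(myData$ID, myMap$ID)]  # factor with NA for ID 6

# Passing the factor straight to ifelse() yields its integer codes, not its labels
codes <- ifelse(is.na(looked_up), myData$ID, looked_up)

# Converting to character first keeps the labels and lets ID fill the gap
myData$Name <- ifelse(is.na(looked_up), myData$ID, as.character(looked_up))
print(myData$Name)
# [1] "a" "b" "c" "d" "e" "6" "g" "e"
```

Unlike merge(), this keeps the rows of myData in their original order.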
An option using data.table
library(data.table)#1.9.5+
setkey(setDT(myData), ID)[myMap, Name:=i.Name][is.na(Name),
Name:= as.character(ID)]
# ID Sum Name
#1: 1 10 a
#2: 2 20 b
#3: 3 30 c
#4: 4 40 d
#5: 5 50 e
#6: 5 200 e
#7: 6 60 6
#8: 7 700 g
NOTE: As commented by @Arun, in the devel version v1.9.5, we can also set the key as an argument inside setDT, i.e. setDT(myData, key='ID')

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
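Why row() + col() works is easiest to see on a plain matrix: every cell on the same anti-diagonal has the same row index plus column index. The following is a small sketch using the counts from the question entered as a matrix, so no as.matrix() is needed; idx and vals are illustrative names.

```r
# Counts from the question, filled column by column
m <- matrix(c(8, 1, 2, 5,
              12, 6, 5, 3,
              5, 4, 4, 7,
              14, 3, 11, 2), nrow = 4)

# Anti-diagonal number of each cell: 1 for the top-left corner, 7 for the bottom-right
idx <- row(m) + col(m) - 1
print(idx)

# Sum the counts within each anti-diagonal, labelled A..G
vals <- tapply(m, LETTERS[idx], sum)
print(vals)
#  A  B  C  D  E  F  G
#  8 13 13 28 10 18  2
```

The printed idx matrix has the same layout as the letter grid in the question, with 1 standing for A, 2 for B, and so on.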
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance (dat is assumed to be the data frame of counts from the question):
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

subset data frame on vector sequence

I have the data frame df and I want to subset df based on a number sequence within a categorical variable.
x <- c(1,2,3,4,5,7,9,11,13)
x2 <- x+77
df <- data.frame(x=c(x,x2),y= c(rep("A",9),rep("B",9)))
df
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 7 A
7 9 A
8 11 A
9 13 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
15 84 B
16 86 B
17 88 B
18 90 B
I want only the rows where x increments by 1 and not the rows where x increases by two: e.g.
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
I figured I have to do some sort of subtraction between elements and check if the difference is >1 and combine this with a ddply but this seems cumbersome. Is there a sort of sequence function I am missing?
Using diff:
df[which(c(1,diff(df$x))==1),]
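One thing to watch with a single diff() over the whole column is the boundary between groups: the step from 13 (last A) to 78 (first B) is not 1, so row 10 is dropped even though it starts a run. A possible per-group variant is sketched below, assuming the rows are already ordered within each level of y; step and res are illustrative names.

```r
x  <- c(1, 2, 3, 4, 5, 7, 9, 11, 13)
df <- data.frame(x = c(x, x + 77), y = rep(c("A", "B"), each = 9))

# Difference to the previous row, computed separately within each level of y;
# the first row of every group is given a step of 1 so that it is kept
step <- ave(df$x, df$y, FUN = function(v) c(1, diff(v)))

res <- df[step == 1, ]
print(res)
```

This returns rows 1 to 5 and 10 to 14, which is the output requested in the question.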
Your example seems to behave well and can be nicely handled by @agstudy's answer. Should your data act up one day, though...
myfun <- function(d, whichDiff = 1) {
# d is the data.frame you'd like to subset, containing the variable 'x'
# whichDiff is the difference between values of x you're looking for
theWh <- which(!as.logical(diff(d$x) - whichDiff))
# Take the diff of x, subtract whichDiff to get the desired values equal to 0
# Coerce this to a logical vector and take the inverse (!)
# which() gets the indexes that are TRUE.
# allWh <- sapply(theWh, "+", 1)
# Since the desired rows may be disjoint, use sapply to get each index + 1
# Seriously? sapply to add 1 to a numeric vector? Not even on a Friday.
allWh <- theWh + 1
return(d[sort(unique(c(theWh, allWh))), ])
}
> library(plyr)
>
> ddply(df, .(y), myfun)
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 78 B
7 79 B
8 80 B
9 81 B
10 82 B
