Merge tables in R, combine cells if in both

Can you please explain how I can merge two tables so that they can be used to generate a pie chart?
#read input data
dat = read.csv("/ramdisk/input.csv", header = TRUE, sep="\t")
# pick needed columns and count the occurrences of each entry
df1 = table(dat[["C1"]])
df2 = table(dat[["C2"]])
# rename columns
names(df1) <- c("ID", "a", "b", "c", "d")
names(df2) <- c("ID", "e", "f", "g", "h")
# show data for testing purpose
df1
# ID a b c d
#241 18 17 28 29
df2
# ID e f g h
#230 44 8 37 14
# looks fine so far, now the problem:
# what I want to do is merge df1 and df2
# so that df contains the overall numbers of each entry
# df should print
# ID a b c d e f g h
#471 18 17 28 29 44 8 37 14
# need them to make a nice piechart in the end
#pie(df)
I assume it can be done with merge somehow, but I haven't found the right way. The closest solution I found was merge(df1,df2,all=TRUE), but it wasn't exactly what I needed.

An approach would be to stack, then rbind and do an aggregate
out <- aggregate(values ~ ., rbind(stack(df1), stack(df2)), sum)
To get a named vector
with(out, setNames(values, ind))
Or another approach is to concatenate the tables and then use tapply to do a group by sum
v1 <- c(df1, df2)
tapply(v1, names(v1), sum)
Or with rowsum
rowsum(v1, group = names(v1))
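As a self-contained illustration of the concatenate-and-sum idea, here is a minimal sketch. It assumes the two tables can be represented as named numeric vectors; the values are copied from the question, and `totals` and `slices` are names chosen only for this example.

```r
# Counts from the question, written as named vectors (assumed stand-ins for the tables)
df1 <- c(ID = 241, a = 18, b = 17, c = 28, d = 29)
df2 <- c(ID = 230, e = 44, f = 8, g = 37, h = 14)

# Concatenate, then sum the entries that share a name (here only "ID")
v1 <- c(df1, df2)
totals <- tapply(v1, names(v1), sum)

# Drop the ID total before plotting, since it is not a category
slices <- totals[names(totals) != "ID"]
# pie(slices)
```

Dropping the ID entry matters for the chart: left in, it would dominate the pie even though it is a total rather than a category.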

Another approach would be to use rbindlist from data.table and colSums to get the totals. rbindlist with fill=TRUE accepts all columns, even if they are not present in both tables.
df1<-read.table(text="ID a b c d
241 18 17 28 29 ",header=TRUE)
df2<-read.table(text="ID e f g h
230 44 8 37 14" ,header=TRUE)
library(data.table)
setDT(df1)
setDT(df2)
res <- rbindlist(list(df1,df2), use.names=TRUE, fill=TRUE)
colSums(res, na.rm=TRUE)
ID a b c d e f g h
471 18 17 28 29 44 8 37 14

I wrote the package safejoin, which handles this type of task in an intuitive way (I hope!). You just need a common id between your two tables (we'll use tibble::rowid_to_column for that), and then you can merge and resolve the column conflicts with sum.
Using @pierre-lapointe's data:
library(tibble)
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
res <- safe_inner_join(rowid_to_column(df1),
rowid_to_column(df2),
by = "rowid",
conflict = sum)
res
# rowid ID a b c d e f g h
# 1 1 471 18 17 28 29 44 8 37 14
Then, for a given row (here the first and only one), you can get your pie chart by converting to a vector with unlist and removing the first two elements, which are irrelevant:
pie(unlist(res[1,])[-(1:2)])

Related

How to make sure 3 separate dfs only contain the same columns?

I have 3 dataframes. All have some overlapping column names, but also some that are missing from at least one of the dataframes. I am trying to 1) select only columns that are present in all 3 dfs and 2) make sure all of the columns are in the same order (not looking for alphabetical per se).
df1
A B C D
4 5 2 9
df2
A D C F
13 23 94 1
df3
E C A D
3 83 12 7
**Ideal Output**
df1
A C D
4 2 9
df2
A C D
13 94 23
df3
A C D
12 83 7
I am honestly not too sure where to start. Intuitively I think something like this:
df1 <- apply(df1, 2, function(x) ifelse(colnames(x) %in% colnames(df2) & colnames(df1) %in% colnames(df3), x, subset(df1, select = -c(x))
Then repeat for the other 2 dfs. Once all three dfs have the same columns, then I would just order it using one of the dfs as a template.
col_order <- colnames(df1)
df2 <- df2[, col_order]
Where am I going wrong?
We can get the datasets in a list, and get the intersecting names and use that to subset
lst1 <- mget(ls(pattern = '^df\\d+$'))
nm1 <- Reduce(intersect, lapply(lst1, names))
lapply(lst1, subset, select = nm1)
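For reference, the same idea can be written as a self-contained sketch using the data from the question. Building the list by hand instead of with mget is an assumption made here so that stray objects named like df4 are not picked up by accident; `out` is a name chosen for this example.

```r
# Data from the question
df1 <- data.frame(A = 4, B = 5, C = 2, D = 9)
df2 <- data.frame(A = 13, D = 23, C = 94, F = 1)
df3 <- data.frame(E = 3, C = 83, A = 12, D = 7)

# Explicit named list of the data frames to align
lst1 <- list(df1 = df1, df2 = df2, df3 = df3)

# Column names present in every data frame, in the order they appear in df1
nm1 <- Reduce(intersect, lapply(lst1, names))

# Subsetting by name both drops the extra columns and fixes the column order
out <- lapply(lst1, function(d) d[, nm1, drop = FALSE])
```

Because every data frame is indexed with the same name vector, the second requirement (identical column order) is met without a separate reordering step.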

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1 A B
2 C B
3 H F
4 G B
df2
row destination n
1 B 26
2 F 44
3 L 12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1 A B 26
2 C B 26
3 H F 44
4 G B 26
The data that I'm actually working with is much larger, and is never the same number of rows when I run the program. The furthest I've gotten with this is using the which command to get the right values, but only each value once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
When what I would need is the list (26,26,44,26) so I can save it to df1$n
We can use merge or left_join; with data.table, an update join adds the column by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
df1,
n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
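The merge route mentioned above can be sketched in base R as follows. This is only one way to do it: it assumes that destinations missing from df2 should end up as NA (a left join), and `res` is a name chosen for this example.

```r
# Data from the answer above
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))

# Left join on destination; only the key and n are taken from df2 so that
# df2's own row column does not clash with the one in df1
res <- merge(df1, df2[, c("destination", "n")], by = "destination", all.x = TRUE)

# merge sorts by the key, so restore the original row and column order
res <- res[order(res$row), c("row", "source", "destination", "n")]
```

Unlike the match version, merge would duplicate rows of df1 if a destination appeared more than once in df2, so it is worth checking that the key is unique there.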

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
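One caveat with this pattern: sample(x, n) treats a single number x as 1:x, so a group consisting of exactly one column could return the wrong index. The sketch below is a variant that samples positions with sample.int instead. The helper name `sample_cols_per_group` is hypothetical, and set.seed is there only to make the example reproducible.

```r
set.seed(1)  # only to make the sketch reproducible
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# Hypothetical helper: sample up to `size` column indices from every name group
sample_cols_per_group <- function(df, size) {
  idx <- split(seq_along(df), names(df))
  picked <- lapply(idx, function(i) {
    # Sample positions within the group, so a one-column group stays intact
    i[sample.int(length(i), min(size, length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

keep <- sample_cols_per_group(dframe, 7)
res <- dframe[, keep]

# Number of sampled columns per group
counts <- table(names(dframe)[keep])
```

The counts are taken from names(dframe)[keep] rather than names(res), because subsetting a data frame with duplicated column names makes the names of the result unique.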
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it really just takes all columns of a group if there are at most nc of them, and samples nc random columns if there are more than nc.
And to restore your original column-name scheme, gsub does the trick by stripping the numeric suffixes that R adds to duplicated names:
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))

data.table execute function on groups of columns

If I have the following data table
m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)
is it possible to execute a function on sets of columns?
For example the following would be the sum of A1,A2 and B1,B2.
A B
1: 5 17
2: 7 19
3: 9 21
The solution would preferably work with a 500k x 100 matrix
Solution
A trick would be to split the columns into groups.
Then you can use rowSums as Frank suggests (see comments on question):
# using your data example
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)
# 1) group columns
groups <- split(colnames(d), substr(colnames(d), 1, 1))
# 2) group wise row sums
d[,lapply(groups, function(i) {rowSums(d[, i, with = FALSE])})]
Result
This will return the data.table:
A B
1: 5 17
2: 7 19
3: 9 21
Explanation
split creates a list of column names for each group, defined by a factor (or something coercible to one).
substr(colnames(m), 1, 1) takes the first letter as the group id; use a different approach (e.g. sub("^([A-Za-z]+).*", "\\1", colnames(m))) for a variable number of letters.
lapply is commonly used to apply functions over multiple columns in a data.table. Here we create a list output, named as the groups, containing the rowSums. with = FALSE is important to use the value of i to get the respective columns from d.
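To illustrate the grouping step with prefixes of varying length, here is a minimal sketch. It is a base R variant working on the matrix directly, not the data.table code above, and the column names AB1, AB2, C1, C2 are made up for this example.

```r
# Same values as in the question, but with prefixes of different lengths (assumed names)
m <- matrix(1:12, ncol = 4)
colnames(m) <- c("AB1", "AB2", "C1", "C2")

# Group id = leading run of letters in each column name
prefix <- sub("^([A-Za-z]+).*", "\\1", colnames(m))
groups <- split(colnames(m), prefix)

# Row sums within each group; drop = FALSE keeps single-column groups as matrices
sums <- sapply(groups, function(cols) rowSums(m[, cols, drop = FALSE]))
```

Here `sums` is a matrix with one column per group (AB and C). For a large data.table the same `groups` list could be passed to the lapply call shown in the answer.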
Definitely possible...
d[, ":=" (A = A1 + A2, B = B1 + B2)]
d
A1 A2 B1 B2 A B
1: 1 4 7 10 5 17
2: 2 5 8 11 7 19
3: 3 6 9 12 9 21
# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
A B
1: 5 17
2: 7 19
3: 9 21
Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).

How to make a code which finds the largest k cells and their locations, when given a table?

I am looking for code that finds the k largest cells and their locations, given a two-dimensional table.
for example, the given two dimensional table is as follows,
table_ex
A B C
F 99 693 515
I 722 583 37
M 186 817 525
The function I am looking for should give the following result:
function(table_ex, 2)
817, M B
722, I A
In the case described above, since k=2, the function gives two largest cells and their locations.
You can coerce to data.frame then just sort using order:
getTopCells <- function(tab, n) {
sort_df <- as.data.frame(tab)
sort_df <- sort_df[order(-sort_df$Freq),]
sort_df[1:n, ]
}
Example:
tab <- table(sample(c('A', 'B'), 200, replace=T),
rep(letters[1:5], 40))
# returns:
# a b c d e
# A 20 23 19 21 23
# B 20 17 21 19 17
getTopCells(tab, 3)
# returns:
# Var1 Var2 Freq
# 3 A b 23
# 9 A e 23
# 6 B c 21
A solution using only 'base' and without coercing into a data.frame :
First let's create a table:
set.seed(123)
tab <- table(sample(c('A', 'B'), 200, replace=T),
rep(letters[1:5], 40))
a b c d e
A 15 13 18 20 22
B 25 27 22 20 18
and now:
for (i in 1:nrow(tab)){
cat(dimnames(tab)[[1]][i], which.max(tab[i,]),max(tab[i,]),'\n')
}
A 5 22
B 2 27
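Note that the loop above reports the maximum of each row rather than the k largest cells overall. A base R sketch that returns the top k cells with their row and column labels, still without converting the whole table to a data.frame first, could look like this. The function name `top_k_cells` is hypothetical, and the example data is the table from the question entered as a matrix.

```r
# Table from the question, entered as a matrix with row and column labels
table_ex <- matrix(c(99, 693, 515,
                     722, 583, 37,
                     186, 817, 525),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("F", "I", "M"), c("A", "B", "C")))

# Hypothetical helper: the k largest cells with their row and column labels
top_k_cells <- function(tab, k) {
  k <- min(k, length(tab))
  # Linear indices of the k largest values
  idx <- order(tab, decreasing = TRUE)[seq_len(k)]
  # Convert linear indices to (row, column) positions
  pos <- arrayInd(idx, dim(tab))
  data.frame(value = as.vector(tab)[idx],
             row = rownames(tab)[pos[, 1]],
             col = colnames(tab)[pos[, 2]])
}

top2 <- top_k_cells(table_ex, 2)
```

For the question's data, `top2` holds 817 at row M, column B, followed by 722 at row I, column A. Ties are returned in the order that order() leaves them.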
I'm using a reshaping approach here. The key is to save your table in a data.frame format and then save your row names as another column in that data.frame. Then you can use something like:
df = read.table(text="
names A B C
F 99 693 515
I 722 583 37
M 186 817 525", header=T)
library(tidyr) # to reshape your dataset
library(dplyr) # to join commands
df %>%
gather(names2,value,-names) %>% # reshape your dataset
arrange(desc(value)) %>% # arrange your value column
slice(1:2) # pick top 2 rows
# names names2 value
# 1 M B 817
# 2 I A 722
PS: In case you don't want to use any packages, or don't want to use data.frames but your original table, I'm sure you'll find some great alternative replies here.
