R Merge tables with one identifier, and other columns with same name will add up - r

I am currently have multiple tables need to merge. For example, I have tbl_1, tbl_2, and tbl_3. And I want to reach the final result as result table.
tbl_1:
ID trx_1 Cre_counts Deb_counts
1 10 9 8
2 5 6 5
3 10 4 3
tbl_2:
ID trx_2 Unk_counts Deb_counts
1 10 1 2
2 5 6 5
3 10 3 7
tbl_3:
ID trx_3 Unk_counts Ckc_counts
1 3 4 4
2 2 4 3
3 8 7 6
result:
ID trx_1 tx_2 trx_3 Cre_counts Deb_counts Unk_counts Ckc_counts
1 10 10 3 9 10 5 4
2 5 5 2 6 10 10 3
3 10 10 8 4 10 10 6
I have tries merge three tables by "ID", but the column name will change to Deb_counts.x, Deb_counts.y... I can use transform(), rowSums() to take some extra step to make it work. But I am wondering is there a easier way to do it? Thank you!

Maybe not the most elegant but here is a way:
First, you need to put your tables into a list:
l_tbl <- mget(ls(pattern="^tbl"))
Then you go through the list, working with 2 tables at a time, thanks to Reduce, first adding the common columns, then merging:
Reduce(function(x, y) {
col_com <- setdiff(intersect(names(x), names(y)), "ID")
if(length(col_com)) {
x[, col_com] <- x[, col_com] + y[, col_com]
y <- y[, !(names(y) %in% col_com)] # you only keep the "not common" columns in the second table
}
return(merge(x, y, by="ID"))
}, l_tbl)
ID trx_1 Cre_counts trx_3 Ckc_counts trx_2 Deb_counts Unk_counts
1 1 10 9 3 4 10 10 5
2 2 5 6 2 3 5 10 10
3 3 10 4 8 6 10 10 10

Related

Placing multiple outputs from each function call using apply into a row in a dataframe in R

I have a function that I repeat, changing the argument each time, using apply/sapply/lapply.
Works great.
I want to return a data set, where each row contains two (or more) variables from each iteration of the function.
Instead I get an unusable list.
do <-function(x){
a <- x+1
b <- x+2
cbind(a,b)
}
over <- [1:6]
final <- lapply(over, do)
Any suggestions?
Without changing your function do, you can use sapply and transpose it.
data.frame(t(sapply(over, do)))
# X1 X2
#1 2 3
#2 3 4
#3 4 5
#4 5 6
#5 6 7
#6 7 8
If you want to use do in current form with lapply, we can do
do.call(rbind.data.frame, lapply(over, do))
You could also try
as.data.frame(Reduce(rbind, final))
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8
See ?Reduce and ?rbind for information about what they'll do.
You could also modify your final expression as
final <- as.data.frame(Reduce(rbind, lapply(over, do)))
#final
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8

Order/Sort/Rank a table

I have a table like this
table(mtcars$gear, mtcars$cyl)
I want to rank the rows by the ones with more observations in the 4 cylinder. E.g.
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
I have been playing with order/sort/rank without much success. How could I order tables output?
We can convert table to data.frame and then order by the column.
sort_col <- "4"
tab <- as.data.frame.matrix(table(mtcars$gear, mtcars$cyl))
tab[order(-tab[sort_col]), ]
# OR tab[order(tab[sort_col], decreasing = TRUE), ]
# 4 6 8
#4 8 4 0
#5 2 1 2
#3 1 2 12
If we don't want to convert it into data frame and want to maintain the table structure we can do
tab <- table(mtcars$gear, mtcars$cyl)
tab[order(-tab[,dimnames(tab)[[2]] == sort_col]),]
# 4 6 8
# 4 8 4 0
# 5 2 1 2
# 3 1 2 12
Could try this. Use sort for the relevant column, specifying decreasing=TRUE; take the names of the sorted rows and subset using those.
table(mtcars$gear, mtcars$cyl)[names(sort(table(mtcars$gear, mtcars$cyl)[,1], dec=T)), ]
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
In the same scope as Milan, but using the order() function, instead of looking for names() in a sort()-ed list.
The [,1] is to look at the first column when ordering.
table(mtcars$gear, mtcars$cyl)[order(table(mtcars$gear, mtcars$cyl)[,1], decreasing=T),]

Need help concatenating column names

I am generating 5 different prediction and adding those predictions to an existing data frame. My code is:
For j in i{
…
actual.predicted <- data.frame(test_data, predicted)
}
I am trying to concatenate words together to create new column names, in the loop. Specifically, I have a column named “predicted” and I am generating predictions in each iteration of the loop. So, in the first iteration, I want the new column name to be “predicted.1” and for the second iteration, the new column name should be “predicted.2” and so on.
Any thoughts would be greatly appreciated.
You may not even need to use a loop here, but assuming you do, one pattern which might work well here would be to use a list:
results <- list()
for j in i {
# do something involving j
name <- paste0("predicted.", j)
results[[name]] <- data.frame(test_data, predicted)
}
One option is to set the names after assigning new columns
actual.predicted <- data.frame(orig_col = sample(10))
for (j in 1:5){
new_col = sample(10)
actual.predicted <- cbind(actual.predicted, new_col)
names(actual.predicted)[length(actual.predicted)] <- paste0('predicted.',j)
}
actual.predicted
# orig_col predicted.1 predicted.2 predicted.3 predicted.4 predicted.5
# 1 1 4 4 9 1 5
# 2 10 2 3 7 5 9
# 3 8 6 5 4 2 3
# 4 5 9 9 10 7 7
# 5 2 1 10 8 3 10
# 6 9 7 6 6 8 6
# 7 7 8 7 2 4 2
# 8 3 3 1 1 6 8
# 9 6 10 2 3 9 4
# 10 4 5 8 5 10 1

R Looking up closest value in data.frame less than equal to another value

I have two data.frames, lookup_df and values_df. For each row in lookup_df I want to lookup the closest value in the values_df that is less than or equal to an index value.
Here's my code so far:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
What I'm wanting for the result_df is the following:
> result_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
I know how to do this with SQL fairly easily but I'm curious if there is an R way that is straightforward. I could iterate the the rows of the lookup_df and then loop through the rows of the values_df but that is not computationally efficient. I'm open to using dplyr library if someone knows how to use that to solve the problem.
If values_df is sorted by idx ascending, then findInterval will work:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
lookup_df$values <- values_df$values[findInterval(lookup_df$ids,values_df$idx)]
lookup_df
> ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8

generate sequence of numbers in R according to other variables

I have problem to generate a sequence of number according on two other variables.
Specifically, I have the following DB (my real DB is not so balanced!):
ID1=rep((1:1),20)
ID2=rep((2:2),20)
ID3=rep((3:3),20)
ID<-c(ID1,ID2,ID3)
DATE1=rep("2013-1-1",10)
DATE2=rep("2013-1-2",10)
DATE=c(DATE1,DATE2)
IN<-data.frame(ID,DATE=rep(DATE,3))
and I would like to generate a sequence of number according to the number of observation per each ID for each DATE, like this:
OUTPUT<-data.frame(ID,DATE=rep(DATE,3),N=rep(rep(seq(1:10),2),3))
Curiously, I try the following solution that works for the DB provided above, but not for the real DB!
IN$UNIQUE<-with(IN,as.numeric(interaction(IN$ID,IN$DATE,drop=TRUE,lex.order=TRUE)))#generate unique value for the combination of id and date
PROG<-tapply(IN$DATE,IN$UNIQUE,seq)#generate the sequence
OUTPUT$SEQ<-c(sapply(PROG,"["))#concatenate the sequence in just one vector
Right now, I can not understand why the solution doesn't work for the real DB, as always any tips is greatly appreciated!
Here there is an example (just one ID included) of the data-set:
id date
1 F2_G 2005-03-09
2 F2_G 2005-06-18
3 F2_G 2005-06-18
4 F2_G 2005-06-18
5 F2_G 2005-06-19
6 F2_G 2005-06-19
7 F2_G 2005-06-19
8 F2_G 2005-06-19
9 F2_G 2005-06-20
Here's one using ave:
OUT <- within(IN, {N <- ave(ID, list(ID, DATE), FUN=seq_along)})
This should do what you want...
require(reshape2)
as.vector( apply( dcast( IN , ID ~ DATE , length )[,-1] , 1:2 , function(x)seq.int(x) ) )
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6
[27] 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2
[53] 3 4 5 6 7 8 9 10
Bascially we use dcast to get the number of observations by ID and date like so
dcast( IN , ID ~ DATE , length )
ID 2013-1-1 2013-1-2
1 1 10 10
2 2 10 10
3 3 10 10
Then we use apply across each cell to make a sequence of integers as long as the count of ID for each date. Finally we coerce back to a vector using as.vector.

Resources