Keeping same name columns adjacent after merge - r

I have two data tables with lots of columns. The columns are the same but they are from different time points (one is from 2015 and one is from today). The structure of the data tables is roughly something like this:
library(data.table)
dt1 <- data.table(id = c("A", "B", "C"), i = c(2,4,6), a = c(1,2,3), w = c(2,3,4), f = c(2,3,5))
old_dt1 <- data.table(id = c("A", "B", "C"), i = c(1,2,6), a = c(1,1,1), w = c(2,1,2), f = c(1,3,1))
I would like to join them by id, but I want columns with the same name to be placed next to each other.
When I merge, I get the following result (which is expected):
> merge(dt1, old_dt1, by = "id", suffixes = c("", "-2015"))
id i a w f i-2015 a-2015 w-2015 f-2015
1: A 2 1 2 2 1 1 2 1
2: B 4 2 3 3 2 1 1 3
3: C 6 3 4 5 6 1 2 1
I know I can manually reorder the data table by setcolorder but I was wondering if I am missing something simple (unfortunately the columns are not in alphabetical order so that is not an option...)
What I would like to get is the following:
result <- merge(dt1, old_dt1, by = "id", suffixes = c("", "-2015"))
setcolorder(result, c(1,2,6,3,7,4,8,5,9))
> result
id i i-2015 a a-2015 w w-2015 f f-2015
1: A 2 1 1 1 2 2 2 1
2: B 4 2 2 1 3 1 3 3
3: C 6 6 3 1 4 2 5 1

If the columns are already in the same order in the two datasets, create a matrix with 2 rows from the column names (excluding the first, i.e. 'id'), concatenate it with 'id', and then set the column order:
setcolorder(result, c(names(result)[1], matrix(names(result)[-1], nrow=2, byrow=TRUE)))
result
# id i i-2015 a a-2015 w w-2015 f f-2015
#1: A 2 1 1 1 2 2 2 1
#2: B 4 2 2 1 3 1 3 3
#3: C 6 6 3 1 4 2 5 1
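If the two tables might not list their columns in the same order, the target order can be built from the names instead of from positions. This is only a sketch: the helper variables suffix, value_cols and new_order are illustrative names, and it assumes every non-id column of dt1 also exists in old_dt1.

```r
library(data.table)

dt1 <- data.table(id = c("A", "B", "C"), i = c(2, 4, 6), a = c(1, 2, 3),
                  w = c(2, 3, 4), f = c(2, 3, 5))
old_dt1 <- data.table(id = c("A", "B", "C"), i = c(1, 2, 6), a = c(1, 1, 1),
                      w = c(2, 1, 2), f = c(1, 3, 1))

suffix <- "-2015"
result <- merge(dt1, old_dt1, by = "id", suffixes = c("", suffix))

# Pair each current column with its suffixed counterpart, by name.
value_cols <- setdiff(names(dt1), "id")
# rbind() stacks the two name vectors; as.vector() reads the matrix
# column by column, which interleaves them: i, i-2015, a, a-2015, ...
new_order <- c("id", as.vector(rbind(value_cols, paste0(value_cols, suffix))))

setcolorder(result, new_order)
names(result)
```

Because the order is derived from names, it keeps working if columns are later added to both tables.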


cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns that hold a cumulative product of the columns a, b and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns? E.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop over the rows, reverse each row, apply cumprod, and then reverse the result again:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
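Or a vectorized option is rowCumprods from matrixStats (run this on the original three-column x):
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another option, again starting from the original x, is Reduce with accumulate = TRUE, which multiplies the columns from right to left:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))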
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
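For the follow-up about using only a subset of columns, the same row-wise idea can be restricted to the columns of interest. A minimal sketch, assuming the subset is given as a character vector (cols is an illustrative name) and contains at least two columns; the new columns are named after the source columns rather than results_e and results_f:

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include in the reverse cumulative product; "a" is ignored.
cols <- c("b", "c")

# For each row: reverse, take the cumulative product, reverse back.
# apply() returns one column per row, so transpose to get rows back.
rev_prod <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

x[paste0("result_", cols)] <- rev_prod
x
```

With a single column in cols, apply() would return a plain vector, so that case would need separate handling.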

Sort Data in the Table

For example, I currently have this table:
A B C
A 0 4 1
B 2 1 3
C 5 9 6
I would like to order the columns and rows by my own defined order, to achieve
B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = TRUE)
rownames(df) <- LETTERS[1:3]
df
A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows by indexing with a character vector of names:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
B A C
B 1 2 3
A 4 0 1
C 9 5 6
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
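# note that the row names are stored in their own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order: match() finds the row position of each id in defined_order
# (order(defined_order) only works here because this permutation is its own inverse)
dt[match(defined_order, id)]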
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6
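When reordering rows by a custom order, it may help to see how match() and order() behave on an order that is not its own inverse. The sketch below uses base R and a different, hypothetical order (C, A, B) on the same example table:

```r
df <- data.frame(A = c(0, 2, 5), B = c(4, 1, 9), C = c(1, 3, 6),
                 row.names = c("A", "B", "C"))

# A hypothetical custom order that is not its own inverse.
defined_order <- c("C", "A", "B")

# match() returns, for each wanted name, its position in the row names.
by_match <- rownames(df)[match(defined_order, rownames(df))]

# order() returns the permutation that would sort defined_order
# alphabetically, which is the inverse of what is wanted here.
by_order <- rownames(df)[order(defined_order)]

by_match  # "C" "A" "B"
by_order  # "B" "C" "A"
```

For the order B, A, C used above, both calls happen to give the same result, because swapping two elements is its own inverse.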

Remove observations for which there is not a duplicate

I would like to split a dataset into two data frames: one containing the observations that are duplicated according to a condition, and one containing those that are not. In the following example, I would like one frame for the ids that have two coders and one for the ids that have only one coder:
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
From this, I would like to produce the following:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
id coder y
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
If we negate that, the result is a logical matrix that is TRUE wherever a coder has no rows for an id:
!table(frame$coder, frame$id)
Taking colSums of that gives, for each id, the number of coders with no rows:
# 1 2 3
# 0 0 1
Negating again gives an index that is TRUE for the ids that have every coder.
From this you can subset frame by matching id against the names of that index.
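Another way to express the same split is to attach the number of distinct coders to every row and filter on it. This is a sketch in base R, not one of the answers above; n_coders, frame1 and frame2 are illustrative names:

```r
frame <- data.frame(id = c(1, 1, 1, 2, 2, 3),
                    coder = c("A", "A", "B", "A", "B", "A"),
                    y = c(4, 5, 4, 1, 1, 2))

# For each row, count the distinct coders within that row's id.
# ave() passes the row indices of each id group to FUN and recycles
# the single count over all rows of the group.
n_coders <- ave(seq_along(frame$id), frame$id,
                FUN = function(i) length(unique(frame$coder[i])))

frame1 <- frame[n_coders >= 2, ]  # ids coded by two or more coders
frame2 <- frame[n_coders < 2, ]   # ids coded by a single coder
```

Keeping the count as a per-row vector avoids the intermediate lookup table, at the cost of recomputing unique() per group.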

Create counter with multiple variables [duplicate]

I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
TripCounter should restart at 1 for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using @AnandaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution; for that, take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
Meanwhile, you can also use dplyr. If your data.frame is called mydata:
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
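One thing to keep in mind with the rle approach is that it counts within consecutive runs, so it matches a per-group counter only when all rows for an ID are adjacent. The sketch below compares it with ave on a made-up, unsorted ID vector (unsorted_ids is illustrative data, not from the question):

```r
idCounter <- function(x) {
  unlist(lapply(rle(x)$lengths, seq_len))
}

# Sorted IDs: each ID forms a single run, so the counter is per ID.
sorted_ids <- c(1, 1, 1, 2, 2, 3)
sorted_counter <- idCounter(sorted_ids)      # 1 2 3 1 2 1

# Unsorted IDs: the counter restarts at every new run.
unsorted_ids <- c(1, 2, 1, 1, 2, 3)
run_counter <- idCounter(unsorted_ids)       # 1 1 1 2 1 1

# ave() groups by value regardless of position.
group_counter <- ave(unsorted_ids, unsorted_ids, FUN = seq_along)  # 1 1 2 3 2 1
```

Sorting by the ID column first makes the two approaches agree.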
Here's procedural-style code. I don't believe in claims like "if you are using a loop in R, then you are probably doing something wrong".
x <- dataframe$CustomerID
y <- integer(length(x))
count <- 1
for (i in seq_along(x)) {
  # restart the counter whenever the customer changes (rows must be sorted by CustomerID)
  if (i > 1 && x[i] == x[i - 1]) {
    count <- count + 1
  } else {
    count <- 1
  }
  y[i] <- count
}
dataframe$counter <- y
This does not give the right answer, but it shows something interesting in comparison with for loops: sapply is fast, but it does not see sequential updates, because each call works on its own copy of res.
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
library(dplyr)
a <- a %>%
  group_by(CustomerID, TripDate) # rows must already be in this order
res <- rep(1, nrow(a)) # start every counter at 1
# each call reads the original res, so the counter never gets past 2
res[2:6] <- sapply(2:6, function(i) if (a$CustomerID[i] == a$CustomerID[i - 1]) res[i - 1] + 1 else res[i])
a$TripCounter <- res

Combine several row variables

Given data that looks like this:
Year<-c(1,1,1,1,2,2,2,2,3,3,3,3)
Tax<-c('A','B','C','D','A','B','C','D','A','B','C','D')
Count<-c(1,2,1,2,1,2,1,1,1,2,1,1)
Dummy<-data.frame(Year,Tax,Count)
Dummy
Year Tax Count
1 1 A 1
2 1 B 2
3 1 C 1
4 1 D 2
5 2 A 1
6 2 B 2
7 2 C 1
8 2 D 1
9 3 A 1
10 3 B 2
11 3 C 1
12 3 D 1
How would I go about combining some of the "Tax" elements- for instance if I wanted to combine A,B,C into a new variable "ABC". My end result should look like this
Year Tax Count
1 ABC 4
1 D 2
2 ABC 4
2 D 1
3 ABC 4
3 D 1
Another plyr solution. Just redefine your Tax variable and do a normal summary.
ddply(within(Dummy, {
Tax <- ifelse(Tax %in% c('A','B','C'), 'ABC', 'D')
}), .(Year, Tax), summarise, Count=sum(Count))
If you don't have plyr (or don't like it (!)), this problem is simple enough to handle in base R in a straightforward way.
aggregate(Count ~ Year + Tax, within(Dummy, {
Tax <- ifelse(Tax %in% c('A','B','C'), 'ABC', 'D')
}), sum)
Here is an option using ddply. It assumes that the last Tax value within each Year is the one to keep separate:
ddply(Dummy,.(Year),summarise,
Tax=c(Reduce(paste0,head(Tax,-1)),as.character(tail(Tax,1))),
Count=c(sum(head(Count,-1)),tail(Count,1)))
Year Tax Count
1 1 ABC 4
2 1 D 2
3 2 ABC 4
4 2 D 1
5 3 ABC 4
6 3 D 1
Alright, here is a much better solution than my original one. No empty dataframes, no rbinding, but it can still deal with arbitrary groups:
groups_list = list(c("A", "B", "C"), "D")
Dummy$TaxGroup = sapply(Dummy$Tax, function(tax_value) {
group_search = sapply(groups_list, function(group) tax_value %in% group)
group_num = which(group_search)
})
combined = ddply(
Dummy,
.(Year, TaxGroup),
summarize,
GroupName=paste(groups_list[[TaxGroup[1]]], sep="", collapse=""),
CombinedCount=sum(Count)
)
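The arbitrary-groups idea can also be sketched in base R with a named lookup vector that maps each Tax value to its group label. This is an alternative sketch, not the plyr answer above; lookup and combined are illustrative names, and every Tax value is assumed to appear in the lookup:

```r
Dummy <- data.frame(Year = rep(1:3, each = 4),
                    Tax = rep(c("A", "B", "C", "D"), times = 3),
                    Count = c(1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1))

# Map every Tax value to the label of the group it belongs to.
lookup <- c(A = "ABC", B = "ABC", C = "ABC", D = "D")
Dummy$TaxGroup <- unname(lookup[as.character(Dummy$Tax)])

# Sum the counts within each Year and group, then sort for display.
combined <- aggregate(Count ~ Year + TaxGroup, data = Dummy, FUN = sum)
combined <- combined[order(combined$Year, combined$TaxGroup), ]
combined
```

Adding or changing groups then only requires editing the lookup vector.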
