How do I add column values based on matching IDs in R? - r

I have two data frames:
A:
ID Var1 Var2 Var3
1 0 3 4
2 1 5 0
3 1 6 7
B:
ID Var1 Var2 Var3
1 2 4 2
2 2 1 1
3 0 2 1
4 1 0 3
I want to add the columns from A and B based on matching IDs to get data frame C, and keep row 4 from B (even though it has no matching ID in A):
ID Var1 Var2 Var3
1 2 7 6
2 3 6 1
3 1 8 8
4 1 0 3

rbind and aggregate by ID:
aggregate(. ~ ID, data=rbind(A,B), sum)
# ID Var1 Var2 Var3
#1 1 2 7 6
#2 2 3 6 1
#3 3 1 8 8
#4 4 1 0 3
In data.table you can similarly do:
library(data.table)
setDT(rbind(A,B))[, lapply(.SD, sum), by=ID]
Analogous solutions exist in dplyr, SQL, and other tools. The recipe is always the same: bind the rows, group by ID, and sum.
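As a further illustration of the same bind, group, sum recipe, here is a minimal base R sketch that uses rowsum instead of aggregate. rowsum is not part of the answer above; it is swapped in here only as an alternative, and the data frames are rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the example data frames from the question
A <- data.frame(ID = 1:3, Var1 = c(0, 1, 1), Var2 = c(3, 5, 6), Var3 = c(4, 0, 7))
B <- data.frame(ID = 1:4, Var1 = c(2, 2, 0, 1), Var2 = c(4, 1, 2, 0), Var3 = c(2, 1, 1, 3))

AB <- rbind(A, B)                       # stack the rows of both data frames
sums <- rowsum(AB[-1], group = AB$ID)   # sum every non-ID column within each ID

# rowsum stores the group labels as row names; turn them back into an ID column
C <- data.frame(ID = as.integer(rownames(sums)), sums)
C
```

Because ID 4 occurs only in B, its row passes through the sum unchanged, which is the behaviour asked for in the question.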

Related

How to triplicate and rearrange columns [duplicate]

Is there any efficient way, without using for loops, to duplicate the columns in a data frame? For example, if I have the following data frame:
Var1 Var2
1 1 0
2 2 0
3 1 1
4 2 1
5 1 2
6 2 2
And I specify that column Var1 should be repeated twice, and column Var2 three times, then I would like to get the following:
Var1 Var1 Var2 Var2 Var2
1 1 1 0 0 0
2 2 2 0 0 0
3 1 1 1 1 1
4 2 2 1 1 1
5 1 1 2 2 2
6 2 2 2 2 2
Any help would be greatly appreciated!
We can replicate the column names with rep and use the result as an index to duplicate the columns. Because subsetting a data frame makes duplicated column names unique, the copies in 'df2' get suffixes such as .1 and .2 (via make.unique). If we don't want that, we can strip the suffix with sub.
df2 <- df1[rep(names(df1), c(2,3))]
names(df2) <- sub('\\..*', '', names(df2))
df2
# Var1 Var1 Var2 Var2 Var2
#1 1 1 0 0 0
#2 2 2 0 0 0
#3 1 1 1 1 1
#4 2 2 1 1 1
#5 1 1 2 2 2
#6 2 2 2 2 2
Or, as @Frank mentioned in the comments, we can also do
`[.noquote`(df1,c(1,1,2,2,2))
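A self-contained sketch of the rep-based approach is shown below. The `times` vector is a name introduced here for illustration, and assigning the repeated names directly (instead of stripping suffixes with sub) is a variation on the answer, not part of it; it may be preferable when the original column names themselves contain dots, since the sub pattern would truncate those too.

```r
# Example data from the question
df1 <- data.frame(Var1 = c(1, 2, 1, 2, 1, 2), Var2 = c(0, 0, 1, 1, 2, 2))

# How many copies of each column we want (hypothetical helper vector)
times <- c(Var1 = 2, Var2 = 3)

# Repeat each column name the requested number of times and index with it
wanted <- rep(names(df1), times = times[names(df1)])
df2 <- df1[wanted]
suffixed <- names(df2)   # subsetting has made the duplicated names unique

# Restore the plain (duplicated) names without using a regular expression
names(df2) <- wanted
df2
```

Note that a data frame with duplicated column names is awkward to work with afterwards, because `df2$Var1` and `df2[["Var1"]]` only ever return the first match.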

R - Loop through a data table with combination of dcast of sum

I have a table similar to this one, but with more columns. I am trying to create a new table that shows, for each ID, the total Counts and the total Value for each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to loop over every column in the data table, and I have had no success: the result always adds up all the rows instead of summing within each group. I have tried
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)))
I want a new table like this:
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table has more columns, but the idea is the same. Am I over-complicating this, or is there an easy trick I am not aware of? Please help. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)
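For the loop-over-columns idea from the question, one possible base R sketch is given below. It is not taken from the answers above: it uses tapply to sum each value column by ID and Type, then binds the pieces side by side. The names `value_cols`, `wide`, and `result` are illustrative.

```r
df <- read.table(text = "ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3", stringsAsFactors = FALSE, header = TRUE)

value_cols <- c("Counts", "Value")   # columns to summarise; extend as needed

# For each value column, build an ID x Type matrix of group sums
wide <- lapply(value_cols, function(v) {
  m <- tapply(df[[v]], list(df$ID, df$Type), sum)
  colnames(m) <- paste(colnames(m), v, sep = "_")   # e.g. A_Counts, B_Counts
  m
})

# Bind the matrices column-wise and restore ID as a regular column
result <- data.frame(ID = as.integer(rownames(wide[[1]])), do.call(cbind, wide))
result
```

The columns come out grouped by value column (A_Counts, B_Counts, A_Value, B_Value), which is the layout requested in the question. An ID and Type combination that never occurs would show up as NA rather than 0.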

Rbind same data.frame with column switching

I am not new to R, but I cannot solve this problem: I have a data.frame and want to rbind it to a copy of itself with the columns switched. But R does not switch the columns.
Example:
set.seed(13)
df <- data.frame(var1 = sample(5), var2 = sample(5))
> df
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
> rbind(df, df[,c(2,1)])
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 4 1
7 1 3
8 2 4
9 5 2
10 3 5
As you can see, the columns are not switched (rows 6-10), whereas switching the columns alone works like a charm:
> df[,c(2,1)]
var2 var1
1 1 4
2 3 1
3 4 2
4 2 5
5 5 3
I guess this has something to do with the column names, but I cannot figure out what exactly.
Can anyone help?
Kind regards!
As pointed out by @Henrik, from ?rbind.data.frame: "The rbind data frame method [...] matches columns by name". So rename the reordered columns before binding:
> rbind(df, setNames(df[,c(2,1)], c("var1", "var2")))
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 1 4
7 3 1
8 4 2
9 2 5
10 5 3
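This also works, because rbind on matrices binds by position, although the result is a matrix rather than a data frame: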
> rbind(as.matrix(df), as.matrix(df[,c(2,1)]))
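The name-matching behaviour can be checked with a small self-contained sketch. The values are hard-coded from the question's printout rather than generated with sample(), since the output of set.seed(13) may differ between R versions; `swapped`, `by_name`, and `by_position` are illustrative names.

```r
# Values copied from the question's printed data frame
df <- data.frame(var1 = c(4, 1, 2, 5, 3), var2 = c(1, 3, 4, 2, 5))

swapped <- df[, c(2, 1)]         # columns reordered, but still named var2, var1

# rbind matches columns by name, so the reordering is undone
by_name <- rbind(df, swapped)

# Renaming the columns first makes the switch stick
names(swapped) <- names(df)
by_position <- rbind(df, swapped)
by_position
```

In `by_name`, rows 6-10 repeat rows 1-5 unchanged; in `by_position`, rows 6-10 hold the original var2 values under var1 and vice versa.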

Using `rank` across columns to create new variable

I have a question I can't figure out, which I'm almost certain involves rank. Let's say that I have a df in wide form with 3 variables with integer values.
id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60
I'd like to create three new variables with the rank of the values for var1, var2, and var3 from largest to smallest. For example,
id var1 var2 var3 var1_rank var2_rank var3_rank
1 23 8 30 2 3 1
2 1 2 3 3 2 1
3 4 5 1 2 1 3
4 100 80 60 1 2 3
How would I go about doing this? Thanks!
Get the example data:
test <- read.table(text="id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60",header=TRUE)
Get the ranks part and rename appropriately (notice the -x to reverse the rank so it relates to decreasing instead of increasing size - this will be generalisable to any size of data.frame used as input):
ranks <- t(apply(test[,-1], 1, function(x) rank(-x) ))
colnames(ranks) <- paste(colnames(ranks), "_rank", sep="")
Join with the old data frame.
data.frame(test, ranks)
Result:
> data.frame(test,ranks)
id var1 var2 var3 var1_rank var2_rank var3_rank
1 1 23 8 30 2 3 1
2 2 1 2 3 3 2 1
3 3 4 5 1 2 1 3
4 4 100 80 60 1 2 3
To get to @mnel's answer using base R, you could also do something like:
testres <- data.frame(test["id"],stack(test[2:4]))
testres$rank <- ave(testres$values,testres$id,FUN=function(x) rank(-x) )
> testres
id values ind rank
1 1 23 var1 2
2 2 1 var1 3
3 3 4 var1 2
4 4 100 var1 1
5 1 8 var2 3
6 2 2 var2 2
7 3 5 var2 1
8 4 80 var2 2
9 1 30 var3 1
10 2 3 var3 1
11 3 1 var3 3
12 4 60 var3 3
I think it is easier to work in long format (and more memory efficient, as apply will coerce to a matrix). Here is an approach using reshape and data.table:
library(data.table)
tlong <- reshape(data.table(test), direction ='long', varying = list(2:4),
times = paste0('var',1:3), v.names = 'value')
# calculate the rank within each `id`
tlong[, rank := rank(-value), by = id]
tlong
## id time value rank
## 1: 1 var1 23 2
## 2: 2 var1 1 3
## 3: 3 var1 4 2
## 4: 4 var1 100 1
## 5: 1 var2 8 3
## 6: 2 var2 2 2
## 7: 3 var2 5 1
## 8: 4 var2 80 2
## 9: 1 var3 30 1
## 10: 2 var3 3 1
## 11: 3 var3 1 3
## 12: 4 var3 60 3
# reshape to wide (if you want)
oldname <- paste0('var', 1:3)
twide <- reshape(tlong, direction = 'wide', timevar = 'time', idvar = 'id')
# reorder from value.var1, rank.var1,... to value.var1, value.var2,....rank.var1, rank.var2
setcolorder(twide, c('id', paste('value', oldname, sep ='.'), paste('rank', oldname, sep = '.')))
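Here's one approach, where dat is the example data frame and 4 is the number of ranked columns plus one, so the subtraction reverses the ranks: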
data.frame(dat, 4 - t(apply(dat[, -1], 1, rank)))
## > data.frame(dat, 4 - t(apply(dat[, -1], 1, rank)))
## id var1 var2 var3 var1.1 var2.1 var3.1
## 1 1 23 8 30 2 3 1
## 2 2 1 2 3 3 2 1
## 3 3 4 5 1 2 1 3
## 4 4 100 80 60 1 2 3
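The two row-wise ranking approaches above can be compared in one runnable sketch. Replacing the hard-coded 4 with `ncol(vars) + 1` is a generalisation added here for illustration, and `vars`, `r_neg`, `r_sub`, and `out` are illustrative names.

```r
dat <- read.table(text = "id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60", header = TRUE)

vars <- dat[, -1]   # the columns to rank within each row

# Approach 1: rank the negated values, so the largest value gets rank 1
r_neg <- t(apply(vars, 1, function(x) rank(-x)))

# Approach 2: subtract the ascending rank from (number of columns + 1)
r_sub <- (ncol(vars) + 1) - t(apply(vars, 1, rank))

# Name the rank columns and attach them to the original data
colnames(r_neg) <- paste0(colnames(vars), "_rank")
out <- data.frame(dat, r_neg)
out
```

Both approaches give the same ranks for this data; the second only needs its constant adjusted when the number of ranked columns changes, which `ncol(vars) + 1` takes care of.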
