Add two dataframes; same dimension; different order column - r

I have a dataframe, df1:
Type CA AR Total
alpha 2 3 5
beta 1 5 6
gamma 6 2 8
delta 8 1 9
and a dataframe, df2:
Type AR CA Total
alpha 3 4 7
beta 2 6 8
gamma 9 1 10
delta 4 1 5
I want to add the two dataframes such that the values under "CA" are added together and that the values under "AR" are added together. Basically, the values under each heading should be added together.
The resulting df should look like this:
Type AR CA Total
alpha 6 6 12
beta 7 7 14
gamma 11 7 18
delta 5 9 14
For example: (AR, gamma) = 2 + 9 = 11

The safest way would probably be to bind and aggregate
aggregate(.~Type, rbind(df1,df2), sum)
# Type CA AR Total
# 1 alpha 6 6 12
# 2 beta 7 7 14
# 3 delta 9 5 14
# 4 gamma 7 11 18
The rbind.data.frame function pays attention to column names so it will properly stack your values.
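As a self-contained check of the point about column names, here is a minimal sketch. The data.frame() calls are a reconstruction of the tables in the question, not code from the original post; the rest is the same bind-and-aggregate idea.

```r
# Rebuild the question's tables; df2 has AR and CA in the opposite order.
df1 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9))
df2 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  AR = c(3, 2, 9, 4), CA = c(4, 6, 1, 1), Total = c(7, 8, 10, 5))

# rbind() on data frames matches columns by name, not by position,
# so the stacked frame keeps df1's column order (Type, CA, AR, Total).
stacked <- rbind(df1, df2)

# Sum every non-Type column within each Type.
result <- aggregate(. ~ Type, data = stacked, FUN = sum)
result
```

Because the grouping is done on Type, this also works when the rows of the two data frames are in different orders.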

I'll repeat my suggestion from the comments last time -- consider putting Type in rownames:
DF1 <- data.frame(df1[-1],row.names=df1$Type)
DF2 <- data.frame(df2[-1],row.names=df2$Type)
From here, adding is straightforward:
DF1 + DF2[names(DF1)]
# CA AR Total
# alpha 6 6 12
# beta 7 7 14
# gamma 7 11 18
# delta 9 5 14
A couple of caveats: data frame addition is positional, so if your rows are not ordered the same way, this will not work correctly (that's why @MrFlick's approach is "safe"). Also, the extension to more data frames isn't so elegant here:
Reduce(`+`, lapply(list(DF2, DF3, DF4), `[`, names(DF1)), init = DF1) # here
aggregate(. ~ Type, rbind(df1, df2, df3, df4), sum) # @MrFlick
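One way to soften the row-order caveat is to index the second data frame by the first one's row names as well as its column names. The sketch below assumes every Type appears exactly once in both data frames; the shuffled df2 is made up for illustration.

```r
df1 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9))
# Same values as df2 in the question, but with the rows deliberately reversed.
df2 <- data.frame(Type = c("delta", "gamma", "beta", "alpha"),
                  AR = c(4, 9, 2, 3), CA = c(1, 1, 6, 4), Total = c(5, 10, 8, 7))

# Move Type into the row names so only numeric columns remain.
DF1 <- data.frame(df1[-1], row.names = df1$Type)
DF2 <- data.frame(df2[-1], row.names = df2$Type)

# Reorder DF2's rows and columns to match DF1 before adding.
total <- DF1 + DF2[rownames(DF1), names(DF1)]
total
```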

You can consider storing your data in a "long" form instead, which would make further operations more straightforward.
If you have your data.frames in a list, you can easily use melt from "reshape2" to get a "long" data.frame. For example:
melt(list(df1, df2), id.vars = "Type")
Once the data are in the long form, you can reshape it to a "wide" form using dcast, and perform whatever aggregation you want to at that stage.
Furthermore, you can generalize the creation of the list if you have similarly named data.frames in your workspace by using mget.
Here's an example, assuming we have two data.frames, one named "df1", and one named "df2":
library(reshape2)
dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "Type"),
      Type ~ variable, value.var = "value", fun.aggregate = sum)
# Type CA AR Total
# 1 alpha 6 6 12
# 2 beta 7 7 14
# 3 delta 9 5 14
# 4 gamma 7 11 18

Related

How to use a fulljoin on my dataframes and rename columns with the same name R

I have two dataframes and they both have the exact same column names, however the data in the columns is different in each dataframe. I am trying to join the two frames (as seen below) by a full join. However, the hard part for me is the fact that I have to rename the columns so that the columns corresponding to my one dataset have some text added to the end while adding different text to the end of the columns that correspond to the second data set.
combined_df <- full_join(any.drinking, binge.drinking, by = ?)
A look at one of my df's:
Without custom function and shorter:
df <- cbind(cars, cars)
colnames(df) <- c(paste0(colnames(cars), "_any"), paste0(colnames(cars), "_binge"))
Output:
> head(df)
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
Certainly not the most elegant way but maybe it is what you want:
custom_bind <- function(df1, suffix1, df2, suffix2){
  colnames(df1) <- paste(colnames(df1), suffix1, sep = "_")
  colnames(df2) <- paste(colnames(df2), suffix2, sep = "_")
  df <- cbind(df1, df2)
  return(df)
}
custom_bind(cars, "any", cars, "binge")
I made it as a function in case you want to do it with other tables. If not then it is not necessary.
Output:
> head(custom_bind(cars, "any", cars, "binge"))
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
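Since the question asks for a full join rather than a column bind, a base R sketch may also be useful. merge() with all = TRUE performs a full outer join, and its suffixes argument appends text to non-key columns that share a name (dplyr's full_join() has a similar suffix argument). The key column state, the rate column, and all values below are hypothetical, because the question's data is not shown.

```r
# Hypothetical stand-ins for the two data frames in the question.
any_drinking <- data.frame(state = c("AL", "AK", "AZ"), rate = c(10, 20, 30))
binge_drinking <- data.frame(state = c("AK", "AZ", "CA"), rate = c(5, 6, 7))

# Full outer join on the key; clashing column names get the suffixes.
combined <- merge(any_drinking, binge_drinking,
                  by = "state", all = TRUE,
                  suffixes = c("_any", "_binge"))
combined
```

Rows that exist in only one data frame are kept, with NA in the columns coming from the other one.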

How to create a function which loops through column index numbers in R?

Consider the following data frame (df):
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1"
a 2 6 0 x 2 19 12
a 4 2 7 x 3.5 7 11
b 1 9 4 x 7 1.5 4
b 7 5 11 x 8 12 5
I would like to create a new column which shows the sum of the product between two corresponding columns. To write less code I address the columns by their index number. Unfortunately I have no experience in writing functions, so I ended up doing this manually, which is extremely tedious and not very elegant.
Here is a reproducible example of the data frame and what I have tried so far:
id <- c("a","a","b","b")
df <- data.frame(id)
df$a1 <- as.numeric((c(2,4,1,7)))
df$b1 <- as.numeric((c(6,2,9,5)))
df$c1 <- as.numeric((c(0,7,4,11)))
df$not_relevant <- c("x","x","x","x")
df$p_a1 <- as.numeric((c(2,3.5,7,8)))
df$p_b1 <- as.numeric((c(19,7,1.5,12)))
df$p_c1 <- as.numeric((c(12,11,4,5)))
require(dplyr)
df %>% mutate(total = .[[2]]*.[[6]] + .[[3]] *.[[7]]+ .[[4]] *.[[8]])
This leads to the desired result, but as I mentioned is not very efficient:
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1" "total"
a 2 6 0 x 2 19 12 118.0
a 4 2 7 x 3.5 7 11 105.0
b 1 9 4 x 7 1.5 4 36.5
b 7 5 11 x 8 12 5 171.0
The real data I am working with has many more columns, so I would be glad if someone could show me a way to pack this operation into a function which loops through the column index numbers and matches the correct columns to each other.
Column indices are not a good way to do this. (Not a good way in general...)
Here's a simple dplyr method that assumes the columns are in the correct corresponding order (that is, it will give the wrong result if "x1", "x2", "x3" are in a different order than "p_x1", "p_x2", "p_x3"). The example uses columns named x1, x2, x3 and p_x1, p_x2, p_x3, as in the output below, so you will need to refine the selection criteria for your real data:
df$total = rowSums(select(df, starts_with("x")) * select(df, starts_with("p_")))
df
# id x1 x2 x3 not_relevant p_x1 p_x2 p_x3 total
# 1 a 2 6 0 x 2.0 19.0 12 118.0
# 2 a 4 2 7 x 3.5 7.0 11 105.0
# 3 b 1 9 4 x 7.0 1.5 4 36.5
# 4 b 7 5 11 x 8.0 12.0 5 171.0
The other good option would be to convert your data to a long format, where you have a single x column and a single p column, with an "index" column indicating the 1, 2, 3. Then the operation could be done by group, finally moving back to a wide format.
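If relying on column order feels fragile, the pairing can be derived from the names instead. The following is a minimal base R sketch using the question's own column names (a1, b1, c1 and p_a1, p_b1, p_c1); it assumes every "p_" column has a counterpart whose name is the same without the prefix.

```r
df <- data.frame(id = c("a", "a", "b", "b"),
                 a1 = c(2, 4, 1, 7), b1 = c(6, 2, 9, 5), c1 = c(0, 7, 4, 11),
                 not_relevant = "x",
                 p_a1 = c(2, 3.5, 7, 8), p_b1 = c(19, 7, 1.5, 12),
                 p_c1 = c(12, 11, 4, 5))

# Find the "p_" columns, then derive the matching column names from them.
p_cols <- grep("^p_", names(df), value = TRUE)
base_cols <- sub("^p_", "", p_cols)
stopifnot(all(base_cols %in% names(df)))  # fail early if a partner is missing

# Multiply each pair element-wise and sum the products per row.
df$total <- rowSums(df[base_cols] * df[p_cols])
df
```

Because base_cols is built from p_cols, the two selections are always in matching order, regardless of where the columns sit in the data frame.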

R: Add two data frames with same dimensions

I have df1:
Type CA AR Total
alpha 2 3 5
beta 1 5 6
gamma 6 2 8
delta 8 1 9
I have df2:
Type CA AR Total
alpha 3 4 7
beta 2 6 8
gamma 9 1 10
delta 4 1 5
I want to add the values in both the data frames to get 1 data frame with this result:
Type CA AR Total
alpha 5 7 12
beta 3 11 14
gamma 15 3 18
delta 12 2 14
Example --> (alpha, CA) = 2 (from df1) + 3 (from df2) = 5 (resulting df)
Does anyone know how to do this? I don't think it's exactly a merge, because merge will override the values, whereas I want to add them.
Thanks in advance!!
+ is vectorised, so this is just a simple operation in R:
cbind(df1[1], df1[-1] + df2[-1])
# Type CA AR Total
# 1 alpha 5 7 12
# 2 beta 3 11 14
# 3 gamma 15 3 18
# 4 delta 12 2 14
If your data sets are not ordered the same way, you could use match (as mentioned in the comments):
cbind(df1[1], df1[, -1] + df2[match(df1$Type, df2$Type), -1])
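You can just sum them and re-add the factor column. This relies on Type being a factor: adding the two factor columns gives NA with a warning, and the column is then overwritten. If Type is a character column, df1 + df2 raises an error instead.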
df_tot <- df1 + df2
df_tot$Type = df1$Type
You can do this with dplyr + magrittr, if you want to go that route (note that the column is named Type, with a capital T):
library("dplyr")
library("magrittr")
df1 %>% select(-Type) %>%
  add(df2 %>% select(-Type)) %>%
  mutate(Type = df1$Type)
Note: this assumes df1 and df2 are ordered in the same manner.
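To make the ordering assumption concrete, here is a small sketch of the match approach on a df2 whose rows have been deliberately reversed. The reversed ordering is invented for illustration; the values are those from the question.

```r
df1 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9))
# Same rows as df2 in the question, in reverse order.
df2 <- data.frame(Type = c("delta", "gamma", "beta", "alpha"),
                  CA = c(4, 9, 2, 3), AR = c(1, 1, 6, 4), Total = c(5, 10, 8, 7))

# For each row of df1, find the position of the same Type in df2.
idx <- match(df1$Type, df2$Type)
stopifnot(!anyNA(idx))  # every Type in df1 must also exist in df2

# Add the numeric columns after reordering df2, then put Type back in front.
result <- cbind(df1[1], df1[-1] + df2[idx, -1])
result
```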

reshape data into panel with multiple variables and no time variable in R

I'm new to reshaping data in R and can't figure out how to use reshape() (or another package) to create panel data. There are two time observations for each geographical unit; however, each time observation is stored in its own variable. For example:
subdistrict <- 1:4
control_t1 <- 5:8
control_t2 <- 9:12
motivation_t1 <- 12:15
motivation_t2 <- 16:19
data_mat <- as.data.frame(cbind(subdistrict, control_t1, control_t2, motivation_t1, motivation_t2))
data_mat
subdistrict control_t1 control_t2 motivation_t1 motivation_t2
1 1 5 9 12 16
2 2 6 10 13 17
3 3 7 11 14 18
4 4 8 12 15 19
Here, control_t1 and control_t2 each refer to a different period. My goal is to reshape the data so that a time variable is created and each pair of named variables is collapsed into one, producing the following frame:
subdistrict time control motivation
1 1 5 12
1 2 9 16
2 1 6 13
2 2 10 17
3 1 7 14
3 2 11 18
4 1 8 15
4 2 12 19
I'm not sure how to create the new time variable, and collapse and rename the variables to reshape the data as such. Thanks for any help.
You just have to use the reshape() function with the option direction = "long". Here is the code:
district <- 1:4
control_t1 <- 5:8
control_t2 <- 9:12
relax_t1 <- 12:15
relax_t2 <- 16:19
data_mat <- as.data.frame(cbind(district, control_t1, control_t2, relax_t1, relax_t2))
reshape(data = data_mat, direction = "long", idvar = "district", timevar = "time", varying = list(c(2:3), c(4:5)))
# district time control_t1 relax_t1
# 1.1 1 1 5 12
# 2.1 2 1 6 13
# 3.1 3 1 7 14
# 4.1 4 1 8 15
# 1.2 1 2 9 16
# 2.2 2 2 10 17
# 3.2 3 2 11 18
# 4.2 4 2 12 19
Have a look at the R Programming wikibooks to learn more.
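To get the column names and row order asked for in the question, reshape() also accepts a v.names argument for naming the collapsed columns. The sketch below uses the question's variable names (subdistrict, motivation) rather than the renamed ones in the answer above; the final sort and row-name reset are optional tidying.

```r
data_mat <- data.frame(subdistrict = 1:4,
                       control_t1 = 5:8, control_t2 = 9:12,
                       motivation_t1 = 12:15, motivation_t2 = 16:19)

# Each element of `varying` is one set of columns to stack;
# `v.names` gives the name of the stacked column for each set.
long <- reshape(data_mat, direction = "long",
                idvar = "subdistrict", timevar = "time", times = 1:2,
                varying = list(c("control_t1", "control_t2"),
                               c("motivation_t1", "motivation_t2")),
                v.names = c("control", "motivation"))

# Sort by unit and period, and drop the "1.1"-style row names.
long <- long[order(long$subdistrict, long$time), ]
rownames(long) <- NULL
long
```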
A simple answer is to split and rebind the data frame into your new form, like so:
new_Data <- data.frame(
  subdistrict=data_mat[,1],
  control=unlist(data_mat[,2:3]),
  motivation=unlist(data_mat[,4:5]))
All we are doing here is collapsing the two columns of 'control' and 'motivation' into single columns of data by using the 'unlist' function and then binding it all into a new data frame. The 'subdistrict' data repeats, so there is no reason to specify it twice.

Operate over levels of two factors

I have a dataset that looks something like this, with many classes, each with many (5-10) subclasses, each with a value associated with it:
> data.frame(class=rep(letters[1:4], each=4), subclass=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8), value=1:16)
class subclass value
1 a 1 1
2 a 1 2
3 a 2 3
4 a 2 4
5 b 3 5
6 b 3 6
7 b 4 7
8 b 4 8
9 c 5 9
10 c 5 10
11 c 6 11
12 c 6 12
13 d 7 13
14 d 7 14
15 d 8 15
16 d 8 16
I want to first sum the values for each class/subclass, then take the median value for each class among all the subclasses.
I.e., the intermediate step would sum the values for each subclass for each class, and would look like this (note that I don't need to keep the data from this intermediate step):
> data.frame(class=rep(letters[1:4], each=2), subclass=1:8, sum=c(3,7,11,15,19,23,27,31))
class subclass sum
1 a 1 3
2 a 2 7
3 b 3 11
4 b 4 15
5 c 5 19
6 c 6 23
7 d 7 27
8 d 8 31
The second step would take the median for each class among all the subclasses, and would look like this:
> data.frame(class=letters[1:4], median=c(median(c(3,7)), median(c(11,15)), median(c(19,23)), median(c(27,31))))
class median
1 a 5
2 b 13
3 c 21
4 d 29
This is the only data I need to keep. Note that both $class and $subclass will be factor variables, and value will always be a non-missing positive integer. Each class will have a varying number of subclasses.
I'm sure I can do this with some nasty for loops, but I was hoping for a better way that's vectorized and easier to maintain.
Here is another example of using aggregate
temp <- aggregate(df$value,list(class=df$class,subclass=df$subclass),sum)
aggregate(temp$x,list(class=temp$class),median)
Output:
class x
1 a 5
2 b 13
3 c 21
4 d 29
Or if you like a one-liner solution, you can do:
aggregate(value ~ class, median, data=aggregate(value ~ ., sum, data=df))
You could try for your first step:
df_sums <- aggregate(value ~ class + subclass, sum, data=df)
Then:
aggregate(value ~ class, median, data=df_sums)
Here are two other alternatives.
The first uses ave within a within statement where we progressively reduce our source data.frame after adding in our aggregated data. Since this will result in many repeated rows, we can safely use unique as the last step to get the output you want.
unique(within(mydf, {
  Sum <- ave(value, class, subclass, FUN = sum)
  rm(subclass, value)
  Median <- ave(Sum, class, FUN = median)
  rm(Sum)
}))
# class Median
# 1 a 5
# 5 b 13
# 9 c 21
# 13 d 29
A second option is to use the "data.table" package and "compound" your statements as below. V1 is the name that will be automatically created by data.table if a name is not specified by the user.
library(data.table)
DT <- data.table(mydf)
DT[, sum(value), by = c("class", "subclass")][, median(V1), by = "class"]
# class V1
# 1: a 5
# 2: b 13
# 3: c 21
# 4: d 29
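The answers above refer to df or mydf without defining them. For a runnable end-to-end version, here is a base R sketch that builds the question's example data (assigned to df, a name assumed here) and applies the two-step aggregate.

```r
# The example data from the question, assigned to a name.
df <- data.frame(class = rep(letters[1:4], each = 4),
                 subclass = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8),
                 value = 1:16)

# Step 1: sum the values within each class/subclass combination.
sums <- aggregate(value ~ class + subclass, data = df, FUN = sum)

# Step 2: take the median of those sums within each class.
medians <- aggregate(value ~ class, data = sums, FUN = median)
medians
```

Only medians needs to be kept; sums is the intermediate table described in the question.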

Resources