R: Add two data frames with same dimensions

I have df1:
Type CA AR Total
alpha 2 3 5
beta 1 5 6
gamma 6 2 8
delta 8 1 9
I have df2:
Type CA AR Total
alpha 3 4 7
beta 2 6 8
gamma 9 1 10
delta 4 1 5
I want to add the values in both the data frames to get 1 data frame with this result:
Type CA AR Total
alpha 5 7 12
beta 3 11 14
gamma 15 3 18
delta 12 2 14
Example --> (alpha, CA) = 2 (from df1) + 3 (from df2) = 5 (resulting df)
Does anyone know how to do this? I don't think it's exactly a merge, because merge will override the values, whereas I want to add them.
Thanks in advance!!

+ is vectorised, so this is a simple operation in R:
cbind(df1[1], df1[-1] + df2[-1])
# Type CA AR Total
# 1 alpha 5 7 12
# 2 beta 3 11 14
# 3 gamma 15 3 18
# 4 delta 12 2 14
If the rows of your data sets are not in the same order, you could use match (as mentioned in the comments):
cbind(df1[1], df1[, -1] + df2[match(df1$Type, df2$Type), -1])
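
To see why match matters, here is a minimal self-contained sketch. The data frames are rebuilt from the tables in the question; df2_shuffled is a hypothetical reordered copy of df2 introduced only for illustration.

```r
# Rebuild the question's data (values copied from the tables in the question)
df1 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9))
df2 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(3, 2, 9, 4), AR = c(4, 6, 1, 1), Total = c(7, 8, 10, 5))

# Hypothetical: the same rows as df2, but in a different order
df2_shuffled <- df2[c(3, 1, 4, 2), ]

# For each Type in df1, match() gives its row position in df2_shuffled
idx <- match(df1$Type, df2_shuffled$Type)

# Reorder df2_shuffled to df1's row order before adding the numeric columns
res <- cbind(df1[1], df1[-1] + df2_shuffled[idx, -1])
res
```

Without the match step, the rows would be added purely by position and the sums would silently pair up the wrong Types.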

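You can just sum the numeric columns and re-add the Type column. (Adding the whole data frames with df1 + df2 fails on a character Type column and gives NAs with a warning on a factor one.)
df_tot <- df1[-1] + df2[-1]
df_tot$Type <- df1$Type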

You can do with dplyr + magrittr, if you want to go that route:
library("dplyr")
library("magrittr")
df1 %>% select(-Type) %>%
  add(df2 %>% select(-Type)) %>%
  mutate(Type = df1$Type)
Note: this assumes df1 and df2 are ordered in the same manner.

Related

mutate string into numeric, ignore alphabetical order of factor

I am trying to recode factor levels into numbers using the mutate function, but I want to ignore the alphabetical order of the levels. Some values appear more than once, and I want each value to get a number according to the order in which it first appears in the data frame.
Example:
library(stringi)
library(dplyr) # provides %>% and mutate()
set.seed(234)
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
data2<-data %>% mutate(num=(as.numeric(factor(data))))
data2
Expected outcome:
dat<-data2[,-2]
order<-c(1,2,3,2,4,5)
expected_result<-cbind.data.frame(head(dat), order)
expected_result
I think you can just create a new factor and set the levels as unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
new_fac new_order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
Or in your example with dplyr, you can do:
data %>%
mutate(num = as.numeric(factor(data, levels = unique(data))))
You could accomplish this with a helper table that contains, for each string, the order in which it first appears in your table. For example:
library(stringi)
library(tidyverse)
# generate data
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join to data
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
data order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
7 v 6
8 i 7
9 v 6
10 H 8
11 Y 9
12 X 10
13 a 11
14 a 11
15 0 12
16 R 13
17 J 14
18 j 15
19 8 16
20 s 17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
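
One possible one-liner, sketched here in base R as an alternative that is not taken from the answers above: unique() keeps values in order of first appearance, so matching each value against the unique values gives its first-appearance index. The vector x is a stand-in for data$data, using the first six values from the output shown above.

```r
# Stand-in for data$data (first six values from the output above)
x <- c("k", "m", "1", "m", "4", "d")

# unique(x) is ordered by first appearance, so match() returns
# the first-appearance rank of every element
num <- match(x, unique(x))
num
```

With dplyr, the same expression could be used inside mutate, for example mutate(num = match(data, unique(data))).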

How to choose between two replicated quantities in R

This is a simplified example.
I have a data frame with two variables like this:
a <- c(1,1,1,2,2,2,3,3,6,7,4,5,5,8)
b <- c(5,10,4,2,8,4,6,9,12,3,7,4,1,7)
D <- data.frame(a,b)
As you can see, a has 8 distinct values, but some of them are repeated, so my data frame has 14 observations. I want to create a data frame with 8 observations in which the a values are unique and each b is the minimum of the b values for that a, i.e., the result should look like:
a b
1 1 4
2 2 2
3 3 6
4 6 12
5 7 3
6 4 7
7 5 1
8 8 7
Here's how to do it with base R:
#both lines do the same thing, pick one
aggregate(D$b, by = D["a"], FUN = min)
aggregate(b ~ a, data = D, FUN = min)
Here's how to do it with data.table:
library(data.table)
setDT(D)
D[ , .(min(b)), by=a]
Here's how to do it with tidyverse functions:
library(tidyverse) #or just library(dplyr)
D %>% group_by(a) %>%
  summarize(min(b))
Using R base approach:
> D2 <- D[order(D$a, D$b ), ]
> D2 <- D2[ !duplicated(D2$a), ]
> D2
a b
3 1 4
4 2 2
7 3 6
11 4 7
13 5 1
9 6 12
10 7 3
14 8 7
A base R option would be
aggregate(b ~ a, D, min)
A dplyr option would be
library(dplyr)
D <- D %>% group_by(a) %>% summarize(min(b))
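
The answers above return the groups sorted by a, whereas the expected output in the question lists the a values in order of first appearance (1, 2, 3, 6, 7, 4, 5, 8). A minimal base R sketch that keeps that order, offered as one possible alternative rather than as any answerer's method (D_min is a name introduced here for illustration):

```r
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 6, 7, 4, 5, 5, 8)
b <- c(5, 10, 4, 2, 8, 4, 6, 9, 12, 3, 7, 4, 1, 7)
D <- data.frame(a, b)

# ave() returns the group minimum of b for every row, in the original row order,
# so this keeps only the rows whose b equals the minimum for their a
D_min <- D[D$b == ave(D$b, D$a, FUN = min), ]

# Guard against ties within a group: keep the first such row per value of a
D_min <- D_min[!duplicated(D_min$a), ]
rownames(D_min) <- NULL
D_min
```

Because the rows are filtered rather than aggregated, the result stays in the order in which each a first reaches its minimum in D, which for this data matches the order shown in the question.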

Moving down columns in data frames in R

Suppose I have the next data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? Bear in mind that more steps may be included (for example, 20 steps).
Thanks!!
We can write a function for this task; df_final is the final output. The bin argument lets the user specify how many consecutive columns are stacked into each new column.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
  # Calculate the number of new columns
  new_ncol <- (ncol(df) - bin) + 1
  # Create a list to store all data frames
  df_list <- lapply(1:new_ncol, function(num){
    return(df[, num:(num + bin - 1)])
  })
  # Convert each data frame to a vector
  dt_list2 <- lapply(df_list, unlist)
  # Convert dt_list2 to data frame
  df_final <- as.data.frame(dt_list2)
  # Set the column and row names of df_final
  colnames(df_final) <- paste0("col", 1:new_ncol)
  rownames(df_final) <- 1:nrow(df_final)
  return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[, 1:(ncol(df) - 1)] %>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the dataframe
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the job:
df2 <- data.frame(col1 = 1:(3 * nrow(df)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Things to note:
The first line must create a data frame with the right number of rows: three stacked columns of nrow(df) values each, so 12 rows here.
Assigning to a column that does not exist creates one with that name.
To delete a column in R, assign NULL to it: df2$col <- NULL
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[,-ncol(df)]),
                  col2 = unlist(df[,-1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
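
The question mentions that there may be more steps (for example, 20). As a rough check that this pattern does not depend on the number of columns, here is a sketch on a hypothetical 20-step data frame; the input values are made up purely for illustration.

```r
# Hypothetical wider input: 20 step columns with 4 rows each (values 1..80)
df <- as.data.frame(matrix(1:80, nrow = 4))
names(df) <- paste0("step", seq_len(ncol(df)))

# col1 stacks every column except the last;
# col2 stacks every column except the first
df2 <- data.frame(col1 = unlist(df[-ncol(df)], use.names = FALSE),
                  col2 = unlist(df[-1], use.names = FALSE))

dim(df2)
head(df2)
```

Each of the 19 retained columns contributes 4 values, so the result has 76 rows regardless of how the columns are named.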

R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?
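That is the documented behaviour of aggregate: because range returns two values per group, each of var1 and var2 is stored as a two-column matrix within the data frame. You can undo the effect with: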
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
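
To make the matrix-column structure visible, and to get more readable names after flattening, here is a small sketch. The helper rng and the grouping name group are choices made here for illustration, not part of the original answer.

```r
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1 = 1:6, var2 = 7:12)

df_range <- aggregate(df[, 2:3], list(group = df$group), range)

# Each of var1 and var2 is a 3 x 2 matrix stored as a single column
is.matrix(df_range$var1)
dim(df_range)

# Hypothetical helper: returning a *named* vector gives the matrix columns
# names, which data.frame() then uses when flattening
rng <- function(x) c(min = min(x), max = max(x))
flat <- do.call(data.frame,
                aggregate(df[, 2:3], list(group = df$group), rng))
names(flat)
flat
```

The flattened result has one ordinary column per statistic, so it behaves like a normal data frame in later steps.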
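Here's an approach using dplyr. Note that it returns the width of each range (max minus min) rather than the two endpoints:
library(dplyr)
df %>%
  group_by(group) %>%
  # across() requires dplyr 1.0.0 or later; it replaces the deprecated summarise_each() and funs()
  summarise(across(c(var1, var2), ~ max(.x) - min(.x)))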

Add two dataframes; same dimension; different order column

I have a dataframe, df1:
Type CA AR Total
alpha 2 3 5
beta 1 5 6
gamma 6 2 8
delta 8 1 9
and a dataframe, df2:
Type AR CA Total
alpha 3 4 7
beta 2 6 8
gamma 9 1 10
delta 4 1 5
I want to add the two dataframes such that the values under "CA" are added together and that the values under "AR" are added together. Basically, the values under each heading should be added together.
The resulting df should look like this:
Type AR CA Total
alpha 6 6 12
beta 7 7 14
gamma 11 7 18
delta 5 9 14
For example: (AR, gamma) = 2 + 9 = 11
The safest way would probably be to bind and aggregate
aggregate(.~Type, rbind(df1,df2), sum)
# Type CA AR Total
# 1 alpha 6 6 12
# 2 beta 7 7 14
# 3 delta 9 5 14
# 4 gamma 7 11 18
The rbind.data.frame function pays attention to column names so it will properly stack your values.
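
A minimal sketch of that name matching, with the data frames rebuilt from the tables in the question (stacked and res are names introduced here for illustration):

```r
df1 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9))
# Note the different column order: AR comes before CA
df2 <- data.frame(Type = c("alpha", "beta", "gamma", "delta"),
                  AR = c(3, 2, 9, 4), CA = c(4, 6, 1, 1), Total = c(7, 8, 10, 5))

# rbind() on data frames matches columns by name, not by position;
# the result uses the column order of the first data frame
stacked <- rbind(df1, df2)
names(stacked)
stacked[stacked$Type == "gamma", ]

# Summing within each Type then adds CA to CA and AR to AR
res <- aggregate(. ~ Type, stacked, sum)
res
```

Inspecting the gamma rows of stacked is a quick way to confirm that df2's AR values ended up under AR.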
I'll repeat my suggestion from the comments last time -- consider putting Type in rownames:
DF1 <- data.frame(df1[-1],row.names=df1$Type)
DF2 <- data.frame(df2[-1],row.names=df2$Type)
From here, adding is straightforward:
DF1 + DF2[names(DF1)]
# CA AR Total
# alpha 6 6 12
# beta 7 7 14
# gamma 7 11 18
# delta 9 5 14
A couple of caveats: If your rows are not ordered the same way, this will not work correctly (that's why @MrFlick's approach is "safe"). Also, the extension to more data frames isn't so elegant here:
Reduce(`+`,lapply(list(DF2,DF3,DF4),`[`,names(DF1)),init=DF1) # here
aggregate(.~Type, rbind(df1,df2,df3,df4), sum) # @MrFlick
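
If the row order may differ, one way to keep the rownames approach is to index by row name as well as by column name. This is a sketch under the assumption that both data frames contain exactly the same Types; the reordered DF2 below is hypothetical, built from the question's values.

```r
DF1 <- data.frame(CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9),
                  row.names = c("alpha", "beta", "gamma", "delta"))

# Hypothetical: same values as df2, but with rows and columns in another order
DF2 <- data.frame(AR = c(4, 9, 3, 2), CA = c(1, 1, 4, 6), Total = c(5, 10, 7, 8),
                  row.names = c("delta", "gamma", "alpha", "beta"))

# Align DF2 to DF1 by row name and column name before adding
res <- DF1 + DF2[rownames(DF1), names(DF1)]
res
```

A Type that is missing from DF2 would produce a row of NAs here, so the aggregate approach remains the more forgiving option when the sets of Types differ.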
You can consider storing your data in a "long" form instead, which would make further operations more straightforward.
If you have your data.frames in a list, you can easily use melt from "reshape2" to get a "long" data.frame. For example:
melt(list(df1, df2), id.vars = "Type")
Once the data are in the long form, you can reshape it to a "wide" form using dcast, and perform whatever aggregation you want to at that stage.
Furthermore, you can generalize the creation of the list if you have similarly named data.frames in your workspace by using mget.
Here's an example, assuming we have two data.frames, one named "df1", and one named "df2":
library(reshape2)
dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "Type"),
Type ~ variable, value.var = "value", fun.aggregate = sum)
# Type CA AR Total
# 1 alpha 6 6 12
# 2 beta 7 7 14
# 3 delta 9 5 14
# 4 gamma 7 11 18

Resources