How to combine two data frames using dplyr or other packages? - r

I have two data frames:
df1 = data.frame(index=c(0,3,4),n1=c(1,2,3))
df1
# index n1
# 1 0 1
# 2 3 2
# 3 4 3
df2 = data.frame(index=c(1,2,3),n2=c(4,5,6))
df2
# index n2
# 1 1 4
# 2 2 5
# 3 3 6
I want to join these to:
index n
1 0 1
2 1 4
3 2 5
4 3 8 (index 3 in two df, so add 2 and 6 in each df)
5 4 3
6 5 0 (index 5 not exists in either df, so set 0)
7 6 0 (index 6 not exists in either df, so set 0)
The given data frames are just part of large dataset. Can I do it using dplyr or other packages in R?

Using data.table (would be efficient for bigger datasets). I am not changing the column names, as the rbindlist uses the name of the first dataset ie. in this case n from the second column (Don't know if it is a feature or bug). Once you join the datasets by rbindlist, group it by column index i.e. (by=index) and do the sum of n column (list(n=sum(n)) )
library(data.table)
rbindlist(list(data.frame(index=0:6,n=0), df1,df2))[,list(n=sum(n)), by=index]
index n
#1: 0 1
#2: 1 4
#3: 2 5
#4: 3 8
#5: 4 3
#6: 5 0
#7: 6 0
Or using dplyr. Here, the column names of all the datasets should be the same. So, I am changing it before binding the datasets using rbind_list. If the names are different, there will be multiple columns for each name. After joining the datasets, group it by index and then use summarize and do the sum of column n.
library(dplyr)
nm1 <- c("index", "n")
colnames(df1) <- colnames(df2) <- nm1
rbind_list(df1,df2, data.frame(index=0:6, n=0)) %>%
group_by(index) %>%
summarise(n=sum(n))

This is something you could do with the base functions aggregate and rbind
df1 = data.frame(index=c(0,3,4),n=c(1,2,3))
df2 = data.frame(index=c(1,2,3),n=c(4,5,6))
aggregate(n~index, rbind(df1, df2, data.frame(index=0:6, n=0)), sum)
which returns
index n
1 0 1
2 1 4
3 2 5
4 3 8
5 4 3
6 5 0
7 6 0

How about
names(df1) <- c("index", "n") # set colnames of df1 to target
df3 <- rbind(df1,setNames(df2, names(df1))) # set colnnames of df2 and join
df <- df3 %>% dplyr::arrange(index) # sort by index
Cheers.

Related

How to interlacely merge two matrices?

I am facing with the other problem in coding with R-Studio. I have two dataframes (with the same number of rows and colunms). Now I want to merge them two into one, but the 6 columns of dataframe 1 would be columns 1,3,5,7,9.11 in the new matrix; while those of data frame 2 would be 2,4,6,8,10,12 in the new merged dataframe.
I can do it with for loop but is there any smarter way/function to do it? Thank you in advance ; )
You can cbind them and then reorder the columns accordingly:
df1 <- as.data.frame(sapply(LETTERS[1:6], function(x) 1:3))
df2 <- as.data.frame(sapply(letters[1:6], function(x) 1:3))
cbind(df1, df2)[, matrix(seq_len(2*ncol(df1)), 2, byrow=T)]
# A a B b C c D d E e F f
# 1 1 1 1 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3 3 3 3 3
The code below will produce your required result, and will also work if one data frame has more columns than the other
# create sample data
df1 <- data.frame(
a1 = 1:10,
a2 = 2:11,
a3 = 3:12,
a4 = 4:13,
a5 = 5:14,
a6 = 6:15
)
df2 <- data.frame(
b1=11:20,
b2=12:21,
b3=13:22,
b4=14:23,
b5=15:24,
b6=16:25
)
# join by interleaving columns
want <- cbind(df1,df2)[,order(c(1:length(df1),1:length(df2)))]
Explanation:
cbind(df1,df2) combines the data frames with all the df1 columns first, then all the df2 columns.
The [,...] element re-orders these columns.
c(1:length(df1),1:length(df2)) gives 1 2 3 4 5 6 1 2 3 4 5 6 - i.e. the order of the columns in df1, followed by the order in df2
order() of this gives 1 7 2 8 3 9 4 10 5 11 6 12 which is the required column order
So [, order(c(1:length(df1), 1:length(df2)] re-orders the columns so that the columns of the original data frames are interleaved as required.

How to compute the NAs with the column mean and then multiply columns of different lengths in R?

My question might be not so clear so I am putting an example.
My final goal is to produce
final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e)
I have five data frames (one column each) with different lengths as follows:
df1
a
1. 1
2. 2
3. 4
4. 2
df2
b
1. 2
2. 6
df3
c
1. 2
2. 4
3. 3
df4
d
1. 1
2. 2
3. 4
4. 3
df5
e
1. 4
2. 6
3. 2
So I want a final database which includes them all as follows
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 NA 3 4 2
4. 2 NA NA 3 NA
I want all the NAs for each column to be replaced with the mean of that column, so the finaldf has equal length of all the columns:
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 4 3 4 2
4. 2 4 3 3 4
and therefore I can produce a final result for final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e) as I need.
The easiest by far is to use the qpcR, dplyr and tidyr packages.
library(dplyr)
library(qpcR)
library(tidyr)
df1 <- data.frame(a=c(1,2,4,2))
df2 <- data.frame(b=c(2,6))
df3 <- data.frame(c=c(2,4,3))
df4 <- data.frame(d=c(1,2,4,3))
df5 <- data.frame(e=c(4,6,2))
mydf <- qpcR:::cbind.na(df1, df2, df3, df4,df5) %>%
tidyr::replace_na(.,as.list(colMeans(.,na.rm=T)))
> mydf
a b c d e
1 1 2 2 1 4
2 2 6 4 2 6
3 4 4 3 4 2
4 2 4 3 3 4
Depending on your rgl settings, you might need to run the following at the top of your script to make the qpcR package load (see https://stackoverflow.com/a/66127391/2554330 ):
options(rgl.useNULL = TRUE)
library(rgl)
With purrr and dplyr, we can first put all dataframes in a list with mget(). Second, use set_names to replace the dataframe names with their respective column names. As a third step, unlist the dataframes to get vectors with pluck. Then add the NAs by making all vectors the same length.
Finally, bind all vectors back into a dataframe with as.data.frame, then use mutate with ~replace_na and colmeans.
library(dplyr)
library(purrr)
mget(ls(pattern = 'df\\d')) %>%
set_names(map_chr(., colnames)) %>%
map(pluck, 1) %>%
map(., `length<-`, max(lengths(.))) %>%
as.data.frame %>%
mutate(across(everything(), ~replace_na(.x, mean(.x, na.rm=TRUE))))

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low MaintenancePrice_medium MaintenancePrice_high MaintenancePrice_vhigh)
How would I combine the columns buying price_low, medium, etc. into one column called "BuyingPrice" with the order and their respective data in each column and the same with the maintenanceprice column?
library(dplyr)
df <- data.frame(Buy_low=rep(c(0,1), 10),
Buy_high=rep(c(0,1), 10))
one_column <- df %>%
gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
It can be done with stack in base R :
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c

combine datasets by the value of multiple columns

I'm trying to enter values based on the value of multiple columns from two datasets.
I have my main dataset (df1), with lists of a location and corresponding dates and df2 consists of a list of temperatures at all locations on every possible date. Eg:
df1
Location Date
A 2
B 1
C 1
D 3
B 3
df2
Location Date1Temp Date2Temp Date3Temp
A -5 -4 0
B 2 0 2
C 4 4 5
D 6 3 4
I would like to create a temperature variable in df1, according to the location and date of each observation. Preferably I would like to carry this out with all Temperature data in the same dataframe, but this can be separated and added 'by date' if necessary. With the example data, I would want this to create something like this:
Location Date Temp
A 2 -4
B 1 2
C 1 4
D 3 4
B 3 2
I've been playing around with merge and ifelse, but haven't figured anything out yet.
is it what you need?
library(reshape2)
library(magrittr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),d1t=c(-5,5,4,6),d2t=c(-4,0,4,3),d3t=c(0,2,5,4))
merge(df1,df2) %>% melt(id.vars=c("Location","Date"))
Here's how to do that with dplyr and tidyr.
Basically, you want to use gather to melt the DateXTemp columns from df2 into two columns. Then, you want to use gsub to remove the "Date" and "Temp" strings to get numbers that are comparable to what you have in df1. Since DateXTemp were initially characters, you need to transform the remaining numbers to numeric with as.numeric. I then use left_join to join the tables.
library(dplyr);library(tidyr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),Date1Temp=c(-5,5,4,6),
Date2Temp=c(-4,0,4,3),Date3Temp=c(0,2,5,4))
df2_new <- df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))
df1%>%left_join(df2_new)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
EDIT
As suggested by #Sotos, you can do that in one piping like so:
df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))%>%
left_join(df1,.)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2

How to select unique point

I am a novice R programmer. I have a following series of points.
df <- data.frame(x = c(1 , 2, 3, 4), y = c(6 , 3, 7, 5))
df <- df %>% mutate(k = 1)
df <- df %>% full_join(df, by = 'k')
df <- subset(df, select = c('x.x', 'y.x', 'x.y', 'y.y'))
df
Is there way to select for "unique" points? (the order of the points do not matter)
EDIT:
x.x y.x x.y y.y
1 6 2 3
2 3 3 7
.
.
.
(I changed the 2 to 7 to clarify the problem)
With data.table (and working from the OP's initial df):
library(data.table)
setDT(df)
df[, r := .I ]
df[df, on=.(r > r), nomatch=0]
x y r i.x i.y
1: 2 3 1 1 6
2: 3 2 1 1 6
3: 4 5 1 1 6
4: 3 2 2 2 3
5: 4 5 2 2 3
6: 4 5 3 3 2
This is a "non-equi join" on row numbers. In x[i, on=.(r > r)] the left-hand r refers to the row in x and the right-hand one to a row of i. The columns named like i.* are taken from i.
Data.table joins, which are of the form x[i], use i to look up rows of x. The nomatch=0 option drops rows of i that find no matches.
In the tidyverse, you can save a bit of work by doing the self-join with tidyr::crossing. If you add row indices pre-join, reducing is a simple filter call:
library(tidyverse)
df %>% mutate(i = row_number()) %>% # add row index column
crossing(., .) %>% # Cartesian self-join
filter(i < i1) %>% # reduce to lower indices
select(-i, -i1) # remove extraneous columns
## x y x1 y1
## 1 1 6 2 3
## 2 1 6 3 7
## 3 1 6 4 5
## 4 2 3 3 7
## 5 2 3 4 5
## 6 3 7 4 5
or in all base R,
df$m <- 1
df$i <- seq(nrow(df))
df <- merge(df, df, by = 'm')
df[df$i.x < df$i.y, c(-1, -4, -7)]
## x.x y.x x.y y.y
## 2 1 6 2 3
## 3 1 6 3 7
## 4 1 6 4 5
## 7 2 3 3 7
## 8 2 3 4 5
## 12 3 7 4 5
You can use the duplicated.matrix() function from base, to find the rows which are no duplicator - which means in fact that there are unique. When you call the duplicated() function you have to clarify that you only want to use the to first colons. With this call you check which line is unique. In a second step you call in your dataframe for this rows, with all columns.
unique_lines = !duplicated.matrix(df[,c(1,2)])
df[unique_lines,]

Resources