How to use a full join on my dataframes and rename columns with the same name in R

I have two dataframes with exactly the same column names, but the data in the columns differs between them. I am trying to combine the two frames with a full join (as seen below). The hard part for me is renaming the columns: the columns from the first dataset need one piece of text added to the end of their names, and the columns from the second dataset need a different one.
combined_df <- full_join(any.drinking, binge.drinking, by = ?)
A look at one of my df's:

Without custom function and shorter:
df <- cbind(cars, cars)
colnames(df) <- c(paste0(colnames(cars), "_any"), paste0(colnames(cars), "_binge"))
Output:
> head(df)
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
Certainly not the most elegant way but maybe it is what you want:
custom_bind <- function(df1, suffix1, df2, suffix2){
  colnames(df1) <- paste(colnames(df1), suffix1, sep = "_")
  colnames(df2) <- paste(colnames(df2), suffix2, sep = "_")
  df <- cbind(df1, df2)
  return(df)
}
custom_bind(cars, "any", cars, "binge")
I made it as a function in case you want to do it with other tables. If not then it is not necessary.
Output:
> head(custom_bind(cars, "any", cars, "binge"))
speed_any dist_any speed_binge dist_binge
1 4 2 4 2
2 4 10 4 10
3 7 4 7 4
4 7 22 7 22
5 8 16 8 16
6 9 10 9 10
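If the two tables share a key column, a real full join with custom suffixes is also possible. The sketch below uses base R's merge with all = TRUE and its suffixes argument; dplyr's full_join accepts a similar suffix argument. The key column state, the column rate, and all values are invented for illustration, since the original data is not shown.

```r
# Hypothetical stand-ins for any.drinking and binge.drinking (invented values)
any_drinking <- data.frame(state = c("AL", "AK", "AZ"),
                           rate = c(45.1, 60.2, 52.3),
                           stringsAsFactors = FALSE)
binge_drinking <- data.frame(state = c("AK", "AZ", "CA"),
                             rate = c(20.5, 17.8, 16.9),
                             stringsAsFactors = FALSE)

# all = TRUE keeps unmatched rows from both sides (a full join);
# suffixes are appended only to non-key columns whose names clash
combined_df <- merge(any_drinking, binge_drinking,
                     by = "state", all = TRUE,
                     suffixes = c("_any", "_binge"))
combined_df
```

The key column keeps its name, and rows that exist in only one table get NA in the other table's columns.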

Related

Adding new columns to dataframe with suffix

I want to subtract one column from another and create a new one named with the corresponding suffix of the first column. I have approximately 50 columns.
I can do it "manually" as follows...
df$new1 <- df$col_a1 - df$col_b1
df$new2 <- df$col_a2 - df$col_b2
What is the easiest way to create a loop that does the job for me?
We can use grep to identify the columns that have "col_a" and "col_b" in their names and subtract them directly. This assumes the two groups of columns appear in the same order.
a_cols <- grep("col_a", names(df))
b_cols <- grep("col_b", names(df))
df[paste0("new", seq_along(a_cols))] <- df[a_cols] - df[b_cols]
df
# col_a1 col_a2 col_b1 col_b2 new1 new2
#1 10 15 1 5 9 10
#2 9 14 2 6 7 8
#3 8 13 3 7 5 6
#4 7 12 4 8 3 4
#5 6 11 5 9 1 2
#6 5 10 6 10 -1 0
data
Tested on this data
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)
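Since the question asked for a loop, here is a minimal sketch of one that pairs columns by their suffix rather than by position, so it does not depend on column order. It assumes every col_a column has a col_b partner with the same suffix.

```r
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)

# Extract the suffixes ("1", "2", ...) from the col_a column names
suffixes <- sub("^col_a", "", grep("^col_a", names(df), value = TRUE))

# For each suffix, subtract the matching col_b column from the col_a column
for (s in suffixes) {
  df[[paste0("new", s)]] <- df[[paste0("col_a", s)]] - df[[paste0("col_b", s)]]
}
df
```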

How to check if rows in one column present in another column in R

I have a data set, data1, with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set, data2, with check_email as follows:
check_email
A
D
S
V
I want to check whether the values in the check_email column appear in data1, and keep only those ids from data1 whose emails contain at least one check_email value from data2.
My desired output will be:
id
1
2
4
5
7
8
10
I have written code using a for loop, but it takes forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use a regular expression to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then create an index vector of the rows whose emails match the pattern:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
Alternatively, test each address separately and keep the rows with at least one match:
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split each element of the character vector as.character(data1$emails) on commas, then iterate over the resulting list with sapply, checking whether any of the pieces appears in data2$check_email. Finally, we extract the matching ids from data1.
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
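Note that the regular-expression answers match substrings, which is harmless for single letters but could give false hits with real addresses. A minimal sketch of exact matching without a per-row loop is shown below: the emails are expanded into one row per address and then compared with %in%. The helper names tokens and long are illustrative, and the data is rebuilt from the question.

```r
data1 <- data.frame(
  id = 1:10,
  emails = c("A,B,C,D,E", "F,G,H,A,C,D", "I,K,L,T", "S,V,F,R,D,S,W,A",
             "P,A,L,S", "Q,W,E,R,F", "S,D,F,E,Q", "Z,A,D,E,F,R",
             "X,C,F,G,H", "A,V,D,S,C,E"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(check_email = c("A", "D", "S", "V"), stringsAsFactors = FALSE)

# Split each emails string into its individual addresses
tokens <- strsplit(data1$emails, ",", fixed = TRUE)

# One row per (id, address) pair
long <- data.frame(id = rep(data1$id, lengths(tokens)),
                   email = unlist(tokens),
                   stringsAsFactors = FALSE)

# Keep ids that have at least one address listed in data2 (exact match)
matched_ids <- unique(long$id[long$email %in% data2$check_email])
data1[data1$id %in% matched_ids, "id", drop = FALSE]
```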

Moving down columns in data frames in R

Suppose I have the next data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? Note that more steps may be included (for example, 20 steps).
Thanks!!
We can design a function to achieve this task. df_final is the final output. Note that bin is an argument that lets the user specify how many columns to stack together.
# A function to conduct data transformation
trans_fun <- function(df, bin = 3){
  # Calculate the number of new columns
  new_ncol <- (ncol(df) - bin) + 1
  # Create a list to store all data frames
  df_list <- lapply(1:new_ncol, function(num){
    return(df[, num:(num + bin - 1)])
  })
  # Convert each data frame to a vector
  dt_list2 <- lapply(df_list, unlist)
  # Convert dt_list2 to data frame
  df_final <- as.data.frame(dt_list2)
  # Set the column and row names of df_final
  colnames(df_final) <- paste0("col", 1:new_ncol)
  rownames(df_final) <- 1:nrow(df_final)
  return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[, 1:(ncol(df) - 1)] %>%
melt() %>%
dplyr::select(col1=value) -> col1
Drop the first column from the dataframe
df %>%
dplyr::select(-step1) %>%
melt() %>%
dplyr::select(col2=value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the work:
df2 <- data.frame(col1 = 1:(length(df$step1) + length(df$step2) + length(df$step3)))
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Points to note:
The important thing in the first line of the code is that it creates a table with the right number of rows
Assigning to a column that does not exist creates one with that name
Columns in R can be deleted like this: df2$col <- NULL
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[, -ncol(df)]),
                  col2 = unlist(df[, -1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
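To see how this generalises to more steps, here is a minimal sketch of the same unlist idea on a frame that is not square (3 rows, 5 steps). The data is made up for illustration. The last column is dropped by its column count, ncol(df), so the result does not depend on how many rows there are.

```r
# Illustrative data: 3 rows, 5 step columns
df <- data.frame(step1 = 1:3, step2 = 4:6, step3 = 7:9,
                 step4 = 10:12, step5 = 13:15)

shifted <- data.frame(
  col1 = unlist(df[, -ncol(df)], use.names = FALSE),  # every column but the last
  col2 = unlist(df[, -1], use.names = FALSE)          # every column but the first
)
shifted
```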

Split a dataset into a list of dataframes with equal number of columns

I have a data set with 36 columns and a single observation. I want to split it into a list in which each dataframe has 3 columns, and then rbind them into a single data frame.
I have been using the following code:
m=12
nc<-ncol(df)
df1<-lapply(split(as.list(df), cut(1:nc, m, labels = FALSE)), as.data.frame)
df1<-do.call("rbind",df1)
This code works, but the problem comes when I try to run it in a Shiny app.
Can someone suggest a replacement for the above code?
We can split the one-row dataframe by generating a specific grouping sequence:
do.call("rbind", split(c(t(df)), rep(seq(1, ncol(df)/3), each = 3)))
where
rep(seq(1, ncol(df)/3), each = 3)
would generate
 [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  7  8  8  8
[25]  9  9  9 10 10 10 11 11 11 12 12 12
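A different route to the same 12 x 3 result, sketched here with placeholder data, is to refill the values into a matrix row by row. It assumes all 36 columns hold the same type and that the column count is a multiple of 3.

```r
# Placeholder one-row data frame with 36 columns (V1 ... V36)
df <- as.data.frame(t(1:36))

# Flatten the single row and refill it three values per row
out <- as.data.frame(matrix(unlist(df, use.names = FALSE), ncol = 3, byrow = TRUE))
out
```

Each row of out holds three consecutive columns of the original frame.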

R Looking up closest value in data.frame less than equal to another value

I have two data.frames, lookup_df and values_df. For each row in lookup_df I want to look up the value in values_df whose idx is the closest one that is less than or equal to that row's id.
Here's my code so far:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
What I want for result_df is the following:
> result_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
I know how to do this with SQL fairly easily, but I'm curious whether there is a straightforward R way. I could iterate over the rows of lookup_df and then loop through the rows of values_df, but that is not computationally efficient. I'm open to using the dplyr package if someone knows how to use it to solve the problem.
If values_df is sorted by idx ascending, then findInterval will work:
lookup_df <- data.frame(ids = 1:10)
values_df <- data.frame(idx = c(1,3,7), values = c(6,2,8))
lookup_df$values <- values_df$values[findInterval(lookup_df$ids,values_df$idx)]
lookup_df
ids values
1 1 6
2 2 6
3 3 2
4 4 2
5 5 2
6 6 2
7 7 8
8 8 8
9 9 8
10 10 8
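Two edge cases are worth guarding against: values_df may not be sorted, and an id smaller than every idx makes findInterval return 0, which silently drops that element when used as an index. The sketch below handles both; the unsorted values_df and the extra id 0 are made up to trigger those cases.

```r
lookup_df <- data.frame(ids = 0:8)
values_df <- data.frame(idx = c(3, 1, 7), values = c(2, 6, 8))  # deliberately unsorted

# findInterval needs the breakpoints in ascending order
values_df <- values_df[order(values_df$idx), ]

pos <- findInterval(lookup_df$ids, values_df$idx)
pos[pos == 0] <- NA  # no idx is <= this id, so there is nothing to look up

lookup_df$values <- values_df$values[pos]
lookup_df
```

Ids below the smallest idx end up with NA instead of causing a length mismatch.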
