How to move data from one dataframe to a second dataframe - r

I apologize if this is a duplicated question. I tried to find my question but I may not be using the right terminology. Feel free to change the title of this post if there is a better way to ask this question.
I have two dataframes
df <- data.frame("Location" = c("chr1:123", "chr6:2452", "chr8:4352", "chr11:8754", "chr3:76345", "chr7:23454", "chr18:23452"),
                 "Score" = c("tolerated(1)", "tolerated(2)", "", "", "deleterious(0.1)", "", "deleterious(0.2)"))
df2 <- data.frame("Location" = c("chr7:23454", "chr9:243256", "chr8:4352", "chr2:6795452", "chr11:8754", "chr18:23452", "chr3:76345"),
                  "Score" = c("", "", "", "", "", "", ""))
df has locations and values in the "Score" column that I want to keep.
df2 has the data from df plus some new data.
I want to carry over the scores from df for any locations that also appear in df2, and put the result in a new dataframe called df3.
Desired result:
df3 <- data.frame("Location" = c("chr7:23454", "chr9:243256", "chr8:4352", "chr2:6795452", "chr11:8754", "chr18:23452", "chr3:76345"),
                  "Score" = c("", "", "", "", "", "deleterious(0.2)", "deleterious(0.1)"))
I am just not sure what the best/fastest method to do this is, and I am not quite sure where to begin. I feel like this can be done with dplyr, but I have never done this before.

Using a left_join() from dplyr:
library(dplyr)
df3 <- df2 %>%
  dplyr::select(-Score) %>%
  left_join(df, by = "Location")
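For comparison, a rough base R sketch of the same left join using merge(); note that merge() does not preserve df2's row order the way left_join() does, and locations missing from df come back as NA rather than an empty string:
# base R equivalent of the left join above (row order may differ)
df3 <- merge(df2["Location"], df, by = "Location", all.x = TRUE, sort = FALSE)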

I was able to sort of force this. I started with this:
df3 <- anti_join(df2, df, by = "Location")
df3 <- rbind(df3, df)
but that gave me some extra data that I didn't want/need, so I filtered it back against df2:
df3 <- df3 %>%
  filter(Location %in% df2$Location)
This isn't the prettiest method so if anyone else has a cleaner method, please feel free to answer!
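A base R alternative with match() also keeps things short; a small sketch of that idea:
# look up each df2 Location in df and copy the Score wherever a match exists
df3 <- df2
idx <- match(df3$Location, df$Location)
df3$Score[!is.na(idx)] <- df$Score[idx[!is.na(idx)]]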

df
Location Score
1 A 1
2 B 2
3 C NA
4 D NA
5 E 5
6 F NA
7 G 7
df2
Location Score
1 E NA
2 F NA
3 G NA
4 H NA
5 I NA
6 J NA
7 K 11
df3
Location Score
1 H NA
2 I NA
3 J NA
4 K 11
5 E 5
6 F NA
7 G 7
Code
library(dplyr)
df3 <- df2 %>%
  anti_join(df, by = "Location") %>%
  bind_rows(inner_join(df, df2 %>% select(1), by = "Location"))
Data
df <- data.frame("Location" = LETTERS[1:7],
"Score" = c(1, 2, NA, NA, 5, NA, 7),
stringsAsFactors = FALSE)
df2 <- data.frame("Location" = LETTERS[5:11],
"Score" = c(rep(NA, 6), 11),
stringsAsFactors = FALSE)
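For what it's worth, on dplyr >= 1.0.0 the same update-by-key idea can be written with rows_update(); a sketch (it keeps df2's row order rather than the order shown above):
library(dplyr)
# semi_join() keeps only the df rows whose Location also appears in df2,
# so every row of the update table has a matching row in df2
df3 <- rows_update(df2, semi_join(df, df2, by = "Location"), by = "Location")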

Related

Adding a new column next to each existing column that matches a certain column name pattern in R / tidyverse

In a dataframe I want to add a new column next to each column whose name matches a certain pattern, for example whose name starts with "ip_" and is followed by a number. The names of the new columns should follow the pattern "newCol_" suffixed by that same number. The values of the new columns should be NAs.
So each "ip_" column in the dataframe below should get a matching "newCol_" column next to it. A tidyverse solution with use of regex is much appreciated!
Sample data:
df <- data.frame(
ID = c("1", "2"),
ip_1 = c(2,3),
ip_9 = c(5,7),
ip_39 = c(11,13),
in_1 = c("B", "D"),
in_2 = c("A", "H"),
in_3 = c("D", "A")
)
Getting the new columns is easy with across() -
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}'))
# ID ip_1 ip_9 ip_39 in_1 in_2 in_3 newCol_1 newCol_9 newCol_39
#1 1 2 5 11 B A D NA NA NA
#2 2 3 7 13 D H A NA NA NA
To get the columns in the required order -
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}')) %>%
select(ID, starts_with('in'),
order(suppressWarnings(readr::parse_number(names(.))))) %>%
select(ID, ip_1:newCol_39, everything())
# ID ip_1 newCol_1 ip_9 newCol_9 ip_39 newCol_39 in_1 in_2 in_3
#1 1 2 NA 5 NA 11 NA B A D
#2 2 3 NA 7 NA 13 NA D H A
To add the new NA columns :
df[, sub("^ip", "newCol", grep("^ip", names(df), value = TRUE))] <- NA
To reorder them :
df <- df[, order(c(grep("newCol", names(df), invert = TRUE), grep("^ip", names(df))))]
Edit:
If it's something you (or whoever stumbles here) plan on doing often, you can use this function:
insertCol <- function(x, ind, col.names = ncol(x) + seq_along(ind), data = NA){
  out <- x
  # append the new columns (filled with `data`) at the end
  out[, col.names] <- data
  # reorder so each new column lands right after the column position given in `ind`
  out[, order(c(seq_len(ncol(x)), ind))]
}
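A hypothetical call on the sample df above, inserting one NA column after each "ip_" column:
ip_pos <- grep("^ip", names(df))
insertCol(df, ind = ip_pos,
          col.names = sub("^ip", "newCol", names(df)[ip_pos]))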

R - applying calculation pairwise on columns of data frame/data table

Let's say I have the data frames with the same column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and I want to apply some basic calculation on them, for example adding the columns element-wise.
Since I also want the resulting data frame to display missing data, I appended the missing column (b, filled with NA) to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here was to create the data frame with
for(i in names(DF2)){
  DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind this together) but this obviously doesn't work since the order of the columns is mashed up.
SO,
what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
  DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
> DF3
    a  b   c
1 0+6 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both dataframes
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
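If there are more than two data frames, the same padding idea generalises; a small sketch starting from the original DF1 and DF2, assuming all data frames have the same number of rows:
dfs <- list(DF1, DF2)                         # can hold any number of data frames
all_cols <- Reduce(union, lapply(dfs, names))
dfs <- lapply(dfs, function(d) {
  d[setdiff(all_cols, names(d))] <- NA        # pad missing columns with NA
  d[all_cols]                                 # align column order
})
Reduce(`+`, dfs)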
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
  group_by(grp = rowid(grp)) %>%
  summarise(across(everything(), sum), .groups = 'drop') %>%
  select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
  . ~ rid,
  transform(
    reshape(
      transform(
        rbind(stack(DF1), stack(DF2)),
        rid = ave(seq_along(ind), ind, FUN = seq_along)
      ),
      direction = "wide",
      idvar = "rid",
      timevar = "ind"
    ),
    rid = 1:nrow(DF1)
  ),
  sum,
  na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14

Extracting and cbinding similarly named variables in a data.frame in R

I have a cbind of several data.frames called DATA. Using BASE R, I was wondering how I could extract, and then cbind, similarly named variables in DATA and store them as a list?
For the example below, I want all AA variables, and separately all BB variables, in DATA to be cbinded separately and stored as a list.
Note: names could be anything, and the number of variables could be any number. A function(al) solution is highly appreciated.
Note: suppose we have NO ACCESS to r, the only input is DATA.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,2,NA),
Z = rep(2, 6),
out = rep(2, 6)),
data.frame(Name = rep("Jim", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,2,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA: cbind of two data.frames
Here is an option with split. I wouldn't recommend having duplicate column names in a dataset, but if it is really needed, after the split change the column names by removing the . followed by one or more numbers at the end with sub.
nm1 <- Reduce(intersect, lapply(r, colnames)) # get the common names
lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))
Or, if we need to use only 'DATA' and not 'r' for finding the intersecting column names: it is harder, but we can tabulate the frequency of the column names and select the ones that occur most often.
tbl <- table(names(DATA))
nm1 <- names(which(tbl==max(tbl)))
Use that in the split.default as before
lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))
Using OP's new example
r <- list(
  data.frame(AA = c(2,2,1,1,3,2), BB = c(1,1,1,2,2,NA), CC = 1:6),
  data.frame(AA = c(1,NA,3,1,3,2), BB = c(1,1,1,2,2,2)),
  data.frame(AA = c(1,NA,3,1,3,2), BB = c(1,1,1,2,2,2), DD = 0:5)
)
DATA <- do.call(cbind, r)
tbl <- table(names(DATA))
nm1 <- names(which(tbl==max(tbl)))
lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))
#$AA
# AA AA AA
#1 2 1 1
#2 2 NA NA
#3 1 3 3
#4 1 1 1
#5 3 3 3
#6 2 2 2
#$BB
# BB BB BB
#1 1 1 1
#2 1 1 1
#3 1 1 1
#4 2 2 2
#5 2 2 2
#6 NA 2 2

Create a null column with given name, if missing from one dataset

I have 5 data sets, each containing some columns. The data sets have common column names, but not all columns are present in all the data sets. So whenever a column name (that appears in at least one of the data sets) is not present in some other data set, I want to create a column of all zeros with that column name in that data set, so that all the data sets have the same number of columns (and the same column names).
Put the dataframes in a list, get all the unique column names present across the dataframes combined, and add the columns that are absent from each dataframe, filled with 0.
all_names <- unique(unlist(sapply(list_df, names)))
lst1 <- lapply(list_df, function(x) {x[setdiff(all_names, names(x))] <- 0;x})
lst1
#[[1]]
# a b c
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[2]]
# a c b
#1 1 6 0
#2 2 7 0
#3 3 8 0
#4 4 9 0
#5 5 10 0
#[[3]]
# a c b
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
If you need separate dataframes you can use lst1[[1]], lst1[[2]] individually again.
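If you would rather have them back as standalone objects, one hedged possibility is to name the list and push it into the global environment (note this overwrites the original df1, df2 and df3):
names(lst1) <- c("df1", "df2", "df3")
list2env(lst1, envir = .GlobalEnv)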
data
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 1:5, c = 6:10)
df3 <- data.frame(a = 1:5, c = 6:10, b = 11:15)
list_df <- list(df1, df2, df3)
We can use a for loop to do this
un1 <- Reduce(union, lapply(lst1, names))
for(i in seq_along(lst1)) lst1[[i]][setdiff(un1, names(lst1[[i]]))] <- 0
data
lst1 <- list(
  structure(list(a = 1:5, b = 6:10, c = c(0, 0, 0, 0, 0)),
            row.names = c(NA, -5L), class = "data.frame"),
  structure(list(a = 1:5, c = 6:10, b = c(0, 0, 0, 0, 0)),
            row.names = c(NA, -5L), class = "data.frame"),
  structure(list(a = 1:5, c = 6:10, b = 11:15),
            class = "data.frame", row.names = c(NA, -5L))
)
I would use dplyr's bind_rows, which automatically fills missing values with NA. If you include .id = "df_id" a column will be added connecting each row to the original dataframe:
library(dplyr)
bind_rows(df1, df2, df3, .id = "df_id")
#### OUTPUT ####
df_id x y z
1 1 1 2 NA
2 2 3 NA 4
3 3 NA 5 6
If you want 0s instead of NAs just run df[is.na(df)] <- 0. If you want a more informative df_id column you can pass in a named list:
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = "df_id")
#### OUTPUT ####
df_id x y z
1 df1 1 2 NA
2 df2 3 NA 4
3 df3 NA 5 6
If you want your dataframes separate then simply split by df_id, which generates a list of dataframes:
df <- bind_rows(df1, df2, df3, .id = "df_id")
split(df, df$df_id)
#### OUTPUT ####
$`1`
df_id x y z
1 1 1 2 NA
$`2`
df_id x y z
2 2 3 NA 4
$`3`
df_id x y z
3 3 NA 5 6
Data:
df1 <- data.frame(x = 1, y = 2)
df2 <- data.frame(x = 3, z = 4)
df3 <- data.frame(y = 5, z = 6)
In addition to the previous answers, you can use the bind_rows function in order to quickly combine all your data frames, which will take care of differences in column names:
library(dplyr)
x <- data.frame(
a = 1:3,
b = 4:6
)
y <- data.frame(
a = 4:7
)
z <- data.frame(
c = 8:10
)
xyz <- bind_rows(x, y, z)
xyz %>% replace(., is.na(.), 0)

combination of pairs of columns BUT not rows in a data frame

How can I calculate the combinations of pairs of columns in a data frame, but restrict it so that it does not consider combinations across rows?
I have a data frame like the following, where each column is a variable.
ID A B C D E F G H I J
1 12 185 NA NA NA NA NA NA NA NA
2 35 20 11 NA NA NA NA NA NA NA
3 45 NA NA NA NA NA NA NA NA NA
I want an output like this:
Var1
12, 185
35, 20
35, 11
20, 11
45, 45
I tried the following code, but it considers ALL possible pairs of combinations among columns and rows. I want each row to be considered independently of the others. Does someone have an idea? Thanks.
numNetList <- read.csv2("abd.csv", sep=";")
comb <- lapply(numNetList, function(x) if (length(x) > 1)
  combn(sort(as.numeric(x)), 2))
combb <- do.call(cbind, comb)
pajek_list <- as.data.frame(table(paste(combb[1,], combb[2,], sep = ',')))
Not the most efficient method, but it solves the problem:
func <- function(x){
  t = as.character(x[!is.na(x)])
  if (length(t) == 1)
    t = rep(t, 2)
  t1 = combn(t, 2)
}
l = apply(df[-1], 1, func)
l1 <- as.data.frame(l)
colnames(l1) = NULL
l2 = data.frame(t(l1))
library(tidyr)
unite(l2, "new_col", X1, X2, sep = ",")
# new_col
# 12,185
# 35,20
# 35,11
# 20,11
# 45,45
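A slightly more compact base R sketch of the same per-row idea (assuming, as above, that the first column of df is the ID):
pairs <- apply(df[-1], 1, function(x) {
  x <- x[!is.na(x)]                          # drop the NAs in this row
  if (length(x) == 1) x <- rep(x, 2)         # a single value pairs with itself
  combn(x, 2, paste, collapse = ", ")
})
data.frame(Var1 = unlist(pairs))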
I would go with a combination of dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- tibble(A = c(12,35,45), B = c(185, 20, NA), C = c(NA, 11, NA))
df %>%
  mutate(group = 1:n()) %>%
  gather(col, val, -group) %>%
  group_by(group) %>%
  expand(col, val) %>%
  distinct(val) %>%
  summarise(val = toString(val))
