Reshape origin destination data - r

I need to turn this data frame:
df1 <- data.frame(A = c(1,2,3), B = c(2,1,4), Flow = c(50,30,20))
into a data frame like this:
df2 <- data.frame(A = c(1,3), B = c(2,4), AtoB = c(50,20), BtoA = c(30, NA))
I am trying to reshape it with dplyr. Is there an existing function or another way to do that?

One option is to create an identifier column with the labels 'AtoB'/'BtoA', depending on whether 'A' holds the smaller value in each row, then replace the values in 'A' and 'B' with the row-wise minimum and maximum (pmin/pmax), and finally spread the output back to 'wide' format:
library(dplyr)
library(tidyr)
df1 %>%
mutate(grpIdent = case_when(A == pmin(A, B) ~ 'AtoB', TRUE ~ 'BtoA'),
A1= pmin(A, B), B1 = pmax(A, B)) %>%
select(A = A1, B = B1, grpIdent, Flow) %>%
spread(grpIdent, Flow)
# A B AtoB BtoA
#1 1 2 50 30
#2 3 4 20 NA
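spread() has been superseded in tidyr 1.0.0 and later, so the same idea can also be sketched with pivot_wider(). This is only an illustration of the approach above, assuming a recent tidyr; the helper column names direction, lo and hi are made up for the example.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(A = c(1, 2, 3), B = c(2, 1, 4), Flow = c(50, 30, 20))

res <- df1 %>%
  # Label the direction of each flow, then order every pair as (smaller, larger)
  mutate(direction = if_else(A <= B, "AtoB", "BtoA"),
         lo = pmin(A, B),
         hi = pmax(A, B)) %>%
  select(A = lo, B = hi, direction, Flow) %>%
  # One column per direction; pairs without a return flow get NA
  pivot_wider(names_from = direction, values_from = Flow)

res
```

Pairs that only have a flow in one direction (here 3 to 4) end up with NA in the missing column, which matches the desired df2.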

Using base R (this may require padding the data with one or more blank rows). It also assumes that the A-to-B and B-to-A flows are entered in consecutive rows.
new_df <- cbind(df1[seq(1, nrow(df1), by = 2), ], df1[seq(2, nrow(df1), by = 2), ])[, -c(4, 5)]
names(new_df) <- c("A", "B", "AtoB", "BtoA")
new_df
Result (without padding, the BtoA value in the second row is recycled from the first pair instead of being NA):
# A B AtoB BtoA
#1 1 2 50 30
#3 3 4 20 30

Related

Taking a subset of a main dataset based on the values of another data frame that is a subset of the main data frame

I have these two datasets: df, the main data frame, and g, a data frame I created.
df = data.frame(x = seq(1,20,2),y = letters[1:10] )
df
g = data.frame(xx = c(2,3,4,5,7,8,9) )
and I want to take a subset of the data frame df based on the values xx of the data frame g as follows
m = df[df$x==g$xx,]
However, the result depends on the positions at which the values happen to line up in the two data frames, not on the matched values themselves.
Output:
> m
x y
2 3 b
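I don't know what error I am making.
You probably need %in% instead of ==. The == operator compares the two vectors element by element (recycling the shorter one), while %in% tests whether each value of df$x occurs anywhere in g$xx.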
> df[df$x %in% g$xx,]
x y
2 3 b
3 5 c
4 7 d
5 9 e
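To make the difference between the two operators concrete, here is a small base R sketch using the data from the question. The names eq_match and in_match are only illustrative.

```r
df <- data.frame(x = seq(1, 20, 2), y = letters[1:10])
g  <- data.frame(xx = c(2, 3, 4, 5, 7, 8, 9))

# == compares position by position and recycles g$xx (with a warning,
# since 10 is not a multiple of 7), so only values that line up match
eq_match <- suppressWarnings(df$x == g$xx)
which(eq_match)   # 2

# %in% asks, for every value of df$x, whether it occurs anywhere in g$xx
in_match <- df$x %in% g$xx
which(in_match)   # 2 3 4 5

m <- df[in_match, ]
m
```

Only the second element of df$x (3) happens to sit at the same position as a 3 in g$xx, which is why the == version returns a single row.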
You can also use inner_join from dplyr:
library(dplyr)
df %>%
inner_join(g, by = c("x" = "xx"))
intersect can be useful too, as long as its result is used to match values rather than as row indices:
df[df$x %in% intersect(df$x, g$xx), ]
Using merge:
merge(df, g, by.x = "x", by.y = 'xx')
x y
1 3 b
2 5 c
3 7 d
4 9 e

Convert data in 10 columns into some rows

I have a data set as follows:
a <- data.frame(X1="A", X2="B", X3="C", X4="D", X5="0",
X6="0", X7="0", X8="0", X9="0", X10="0")
Basically it is a 1-row by 10-column data.frame.
The resulting data.frame should have the elements of a as rows rather than columns, and any columns of a that equal "0" should be dropped. For example:
# b
# [1] A
# [2] B
# [3] C
# [4] D
Use a transpose and subset with a logical condition:
data.frame("b" = t(a)[t(a) != 0])
On a second look, the transpose is not needed:
data.frame("b" = a[a != 0])
You could unlist and then subset
subset(data.frame(b = unlist(a), row.names = NULL), b != 0)
# b
#1 A
#2 B
#3 C
#4 D
Using the pivot_longer function, you can reshape your data frame into a longer format and then filter out the values that are "0". With the column_to_rownames function from the tibble package, you can turn the first column into row names.
Altogether, you can do something like this:
library(tidyr)
library(dplyr)
library(tibble)
a %>% pivot_longer(everything(), names_to = "Row", values_to = "b") %>%
filter(b != "0") %>%
column_to_rownames("Row")
b
X1 A
X2 B
X3 C
X4 D

How to obtain minimum difference between 2 columns

I want to obtain the minimum distance between 2 columns; however, the same name may appear in both Column A and Column B. See the example below:
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, I just get the minimum distance for each column, i.e. C will have 2 results because it appears in both columns, rather than showing the closest patient considering both columns. Also, I only get a list of distances, so I can't see which patient is linked to which:
Patient1 SNP
1: A 8
The code below works.
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1); df$Patient2 <- as.character(df$Patient2);
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# Group pairs and keep the minimum distance of each pair
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), Distance = min(Distance))
df <- ungroup(df)
# Re-sort so the smallest distances come first
nearest <- df[order(df$Distance), ]
# Keep only the first (closest) row for each patient
nearest <- nearest[!duplicated(nearest$Patient1), ]
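Since the question started from data.table, the same mirror-then-pick-the-minimum idea can be sketched there as well. This is an illustration on the example data from the question, not a tested drop-in for Full_data; the object names both and nearest are made up for the example.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25)
)

# Add the mirrored pairs so every patient appears in Patient1
both <- rbind(
  DT,
  DT[, .(Patient1 = Patient2, Patient2 = Patient1, Distance)]
)

# Sort by distance, then keep the first (closest) row per patient
nearest <- both[order(Distance), .SD[1], by = Patient1]
setnames(nearest, c("Patient", "Patient_closest_distance", "Distance"))
nearest
```

Because each row keeps both patient names, the result shows which patient is closest as well as the distance, which the two separate by = calls in the question could not do.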

Combining values Boolean columns to one with Priority in R

I have gone through the links below, but they only partially solved my problem.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into a single result column, but it has to apply a priority when there are 2 "Y" values in a row.
The priority is A>B>C>D>E, which means that if there is a "Y" in A then the resulting value should be A. Similarly, in the example df above, row 6 has "Y" in both C and D, so the result should be "C".
Hence the output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.
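# Columns in priority order, highest priority first
col_order <- c("A", "B", "C", "D", "E")
tmp <- data.frame(ID = dat[, 1],
                  Result = col_order[apply(
                    X = dat[col_order],
                    MARGIN = 1,
                    # index of the first "Y" in priority order (NA if none)
                    FUN = function(x) which(x == "Y")[1])],
                  stringsAsFactors = FALSE)
tmp$Result[is.na(tmp$Result)] <- "Not Present"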
tmp
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
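As a possible alternative to the row-wise apply, the "first Y in priority order" rule can be sketched with base R's max.col, which with ties.method = "first" returns the left-most column holding the row maximum. The names is_y, first_y, has_y and result are illustrative, and rows without any "Y" are handled separately because max.col would otherwise report column 1 for them.

```r
dat <- data.frame(Id = c(1, 2, 3, 4, 5, 6, 7, 8),
                  A = c('Y','N','N','N','N','N','N','N'),
                  B = c('N','Y','N','N','N','N','Y','N'),
                  C = c('N','N','Y','N','N','Y','N','N'),
                  D = c('N','N','N','Y','N','Y','N','N'),
                  E = c('N','N','N','N','Y','N','Y','N'),
                  stringsAsFactors = FALSE)

# Columns in priority order, highest priority first
col_order <- c("A", "B", "C", "D", "E")

# Logical matrix: TRUE where a cell is "Y"
is_y <- dat[col_order] == "Y"

# Left-most TRUE per row (ties resolved in favour of the first column)
first_y <- max.col(is_y * 1, ties.method = "first")

# Rows with no "Y" at all must not inherit column 1
has_y <- rowSums(is_y) > 0

result <- data.frame(Id = dat$Id,
                     Result = ifelse(has_y, col_order[first_y], NA_character_),
                     stringsAsFactors = FALSE)
result
```

Changing the priority only requires reordering col_order.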

How to modify a single column with joins using dplyr

I'm trying to add a new column to a data frame, based on the levels of one (or a few) factors. I start with a data frame with two factors and a single variable
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
And I want to add a new column, four, that has values for certain levels of one and two. For convenience, I keep these new values in their own little tables:
new_fourth_a <- data_frame(one = "b", four = 47)
new_fourth_b <- data_frame(two = c("C","E"), four = 42)
The correct answer would be
one two three four
(chr) (chr) (int) (dbl)
1 a A 6 NA
2 b B 7 47
3 c C 8 42
4 d D 9 NA
5 e E 10 42
And the best way I could think of to accomplish this is via left_join():
test %>%
left_join(new_fourth_a, by = "one") %>%
left_join(new_fourth_b, by = "two")
But this ends up duplicating the four column. This could be a good thing: it would allow for easy checking of whether any joins introduce more than one value for the new column (i.e. checking that there is only one non-NA value in each row across all the columns that start with four). Still, I think there must be an easier way?
Here is a solution that uses join
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
new_fourth_a <- data_frame(one = "b", extra_a = 47)
new_fourth_b <- data_frame(two = c("C","E"), extra_b = 42)
test %>%
left_join(new_fourth_a, by = "one") %>%
left_join(new_fourth_b, by = "two") %>%
mutate(four = pmax(extra_a, extra_b, na.rm = TRUE)) %>%
select(-extra_a, -extra_b)
If you want to handle an arbitrary number of lookup tables, then you have to handle them one at a time:
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
new_fourth_a <- data_frame(one = "b", extra = 47)
new_fourth_b <- data_frame(two = c("C","E"), extra = 42)
test %>%
left_join(new_fourth_a, by = "one") %>%
mutate(four = extra) %>%
select(-extra) %>%
left_join(new_fourth_b, by = "two") %>%
mutate(four = ifelse(is.na(extra), four, extra)) %>%
select(-extra)
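Another way to merge the duplicated columns is dplyr's coalesce(), which takes the first non-NA value across its arguments. The sketch below keeps the question's original lookup tables (both with a column named four) and assumes the suffixes _a and _b, which are chosen here only for illustration. Plain data.frame() is used so the example does not depend on data_frame().

```r
library(dplyr)

test <- data.frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10,
                   stringsAsFactors = FALSE)
new_fourth_a <- data.frame(one = "b", four = 47, stringsAsFactors = FALSE)
new_fourth_b <- data.frame(two = c("C", "E"), four = 42,
                           stringsAsFactors = FALSE)

res <- test %>%
  left_join(new_fourth_a, by = "one") %>%
  # Both sides now have a column `four`, so the join renames them
  left_join(new_fourth_b, by = "two", suffix = c("_a", "_b")) %>%
  # Take the first non-NA value of the two candidates
  mutate(four = coalesce(four_a, four_b)) %>%
  select(-four_a, -four_b)

res
```

If a row could match both tables, coalesce() silently prefers four_a, so the check described in the question (at most one non-NA value per row) would need to run before the columns are dropped.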
Instead of creating two more data_frames, we could use %in% with some arithmetic to get a numeric index to create the column 'four' with values NA, 47, and 42.
test %>%
mutate(four = c(NA, 47, 42)[1+(one %in% 'b') +
2*(two %in% c('C', 'E'))])
# one two three four
# (chr) (chr) (int) (dbl)
#1 a A 6 NA
#2 b B 7 47
#3 c C 8 42
#4 d D 9 NA
#5 e E 10 42
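The index arithmetic in that last answer may be easier to follow when the intermediate vector is printed. This base R sketch uses the same values; idx and lookup are illustrative names.

```r
one <- letters[1:5]
two <- LETTERS[1:5]

# Each logical test contributes 0 or 1; the second is weighted by 2,
# so the index is 1 (no match), 2 (`one` matches) or 3 (`two` matches)
idx <- 1 + (one %in% "b") + 2 * (two %in% c("C", "E"))
idx   # 1 2 3 1 3

lookup <- c(NA, 47, 42)
four <- lookup[idx]
four  # NA 47 42 NA 42

# A row matching both conditions would give index 4, which lies past
# the end of `lookup` and therefore yields NA
lookup[1 + 1 + 2]
```

This keeps everything in one mutate() call, at the cost of hard-coding the values and of returning NA when both conditions match.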
