How to obtain minimum difference between 2 columns - r

I want to obtain the minimum distance between 2 columns; however, the same name may appear in both Column A and Column B. See the example below:
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, this only gives the minimum distance within each column separately. For example, C gets two results because it appears in both columns, rather than a single closest patient across both columns. I also only get a list of distances, so I can't see which patient is linked to which:
Patient1 SNP
1: A 8

The code below works.
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1)
df$Patient2 <- as.character(df$Patient2)
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# Group pairs and keep the minimum distance of each pair
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), Distance = min(Distance))
# Re-sort so the smallest distances come first
nearest <- df[order(df$Distance), ]
# Keep only the first (closest) row for each Patient1
nearest <- nearest[!duplicated(nearest$Patient1), ]
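For comparison, here is a minimal base R sketch of the same mirror-then-deduplicate idea, applied to the data from the question. The names `pairs_df`, `both`, and `closest` are only illustrative, and the sketch assumes every pair should be treated as symmetric.

```r
# Data from the question
pairs_df <- data.frame(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25),
  stringsAsFactors = FALSE
)

# Add the mirrored pairs so every patient appears in the first column
both <- rbind(
  pairs_df,
  data.frame(Patient1 = pairs_df$Patient2,
             Patient2 = pairs_df$Patient1,
             Distance = pairs_df$Distance,
             stringsAsFactors = FALSE)
)

# Sort by patient, then by distance, and keep the first row per patient
both <- both[order(both$Patient1, both$Distance), ]
closest <- both[!duplicated(both$Patient1), ]
names(closest) <- c("Patient", "Patient_closest_distance", "Distance")
rownames(closest) <- NULL
closest
```

With this data, A is paired with B (8), B with F (6), and C with A (11), which matches the rows requested in the question; D to G also get a row each because of the mirroring.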

Related

Reshape origin destination data

I need to turn this data frame :
df1 <- data.frame(A = c(1,2,3), B = c(2,1,4), Flow = c(50,30,20))
into a data frame like this :
df2 <- data.frame(A = c(1,3), B = c(2,4), AtoB = c(50,20), BtoA = c(30, NA))
I am trying to reshape it with dplyr. Is there an existing function or a way to do that ?
An option would be to create an Identifier column between 'A' and 'B' with labels 'AtoB/BtoA' based on the minimum value in each row, then change the values in 'A', 'B' by taking the min/max for each row (pmin/pmax) and spread the output back to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
  mutate(grpIdent = case_when(A == pmin(A, B) ~ 'AtoB', TRUE ~ 'BtoA'),
         A1 = pmin(A, B), B1 = pmax(A, B)) %>%
  select(A = A1, B = B1, grpIdent, Flow) %>%
  spread(grpIdent, Flow)
# A B AtoB BtoA
#1 1 2 50 30
#2 3 4 20 NA
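Using base R (this might require introducing a blank row or rows). It also assumes that the to and fro values are entered in succession. Note that with an odd number of rows the shorter half is recycled by cbind, which is why the second row of the result below shows 30 rather than NA.
new_df <- cbind(df1[seq(1, nrow(df1), by = 2), ], df1[seq(2, nrow(df1), by = 2), ])[, -c(4, 5)]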
names(new_df)<-c("A","B","AtoB","BtoA")
new_df
Result:
# A B AtoB BtoA
#1 1 2 50 30
#3 3 4 20 30
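A base R variant that does not depend on row order could split the rows by direction and join them back together. This is only a sketch under the assumption that "A to B" means the smaller id comes first; `fwd`, `bwd`, and `out` are illustrative names.

```r
df1 <- data.frame(A = c(1, 2, 3), B = c(2, 1, 4), Flow = c(50, 30, 20))

# Rows already stored as (smaller, larger) are the "A to B" direction
fwd <- df1[df1$A <= df1$B, ]
names(fwd)[names(fwd) == "Flow"] <- "AtoB"

# The remaining rows are flipped so both directions share the same key
bwd <- df1[df1$A > df1$B, ]
bwd <- data.frame(A = bwd$B, B = bwd$A, BtoA = bwd$Flow)

# A full outer join keeps pairs that have only one direction (NA for the other)
out <- merge(fwd, bwd, by = c("A", "B"), all = TRUE)
out
```

Because the join is keyed on the pair itself, a missing return flow stays NA instead of being filled by recycling.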

how to show shared features between items in the same column of a data frame in R

If a data frame contains a column of items, their features, and the values of those features, how can I show the features shared between items in the same column?
This is the input dataframe
df <- data.frame(group = c(rep(c('A', 'B', 'C'), 3)),
feature = c('x','x','x','y','y','z','z','w','t'),
value=c(1,2,1,3,2,1,2,2,3))
This is a sample of the desired output, which compares each pair (A&B, A&C, B&C) and filters for common features:
df_desired <- data.frame(group1 =c('A','A','B'), group2 = c('B','C','C'), shared_feature = c('x','x','x'), value1 = c(1, 1,2), value2 = c(2, 1,1))
You could try the following:
# Added stringsAsFactors = FALSE argument
df <- data.frame(group = c(rep(c('A', 'B', 'C'), 3)),
                 feature = c('x','x','x','y','y','z','z','w','t'),
                 value = c(1,2,1,3,2,1,2,2,3), stringsAsFactors = FALSE)
df_desired <- data.frame(group1 =c('A','A','B'), group2 = c('B','C','C'), shared_feature = c('x','x','x'), value1 = c(1, 1,2), value2 = c(2, 1,1))
# For rbindlist function
library(data.table)
# Keep only features that are available for every group
df_agg = aggregate(value ~ feature, data = df, FUN = length)
shared_feats = df_agg$feature[df_agg$value==length(unique(df$group))]
df = df[df$feature %in% shared_feats,]
# A function that takes a feat_df containing the values of one feature for each group,
# and converts it to our expected output.
create_comb_df <- function(feat_df)
{
  df2 = as.data.frame(t(combn(feat_df$group,2)))
  colnames(df2) = c('group1','group2')
  df2$feature = feat_df$feature[1]
  df2$value1 = feat_df$value[match(df2$group1,feat_df$group)]
  df2$value2 = feat_df$value[match(df2$group2,feat_df$group)]
  return(df2)
}
# Create final output
rbindlist(lapply(split(df,as.character(df$feature)),create_comb_df))
Output:
group1 group2 feature value1 value2
1: A B x 1 2
2: A C x 1 1
3: B C x 2 1
Or, to get all shared features, replace
shared_feats = df_agg$feature[df_agg$value==length(unique(df$group))]
with
shared_feats = df_agg$feature[df_agg$value>1]
and the results are:
group1 group2 feature value1 value2
1: A B x 1 2
2: A C x 1 1
3: B C x 2 1
4: A B y 3 2
5: C A z 1 2
Hope this helps!
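Another way to get all shared features is a self-join on the feature column. The following is a rough base R sketch, not the approach from the answer above; `joined` and `shared` are illustrative names, and the `group1 < group2` filter assumes that group labels sort alphabetically in a sensible way.

```r
df <- data.frame(group = rep(c("A", "B", "C"), 3),
                 feature = c("x", "x", "x", "y", "y", "z", "z", "w", "t"),
                 value = c(1, 2, 1, 3, 2, 1, 2, 2, 3),
                 stringsAsFactors = FALSE)

# Join the table to itself on feature: one row per ordered pair of groups
joined <- merge(df, df, by = "feature", suffixes = c("1", "2"))

# Keep each unordered pair once and drop self-pairs
shared <- joined[joined$group1 < joined$group2,
                 c("group1", "group2", "feature", "value1", "value2")]
rownames(shared) <- NULL
shared
```

Unlike the combn version, each pair is always reported with the alphabetically smaller group first, so the z row appears as A, C rather than C, A.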

Is it possible to merge rows in R data.frame?

If I have the following data.frame:
> df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c f
Is there a clear way to identify rows in which the df$x entries include the string value *, then use this condition to force the string entries of that row to be merged with the row preceding itself, resulting in a data.frame like the following:
> df
x y
1 a b* d e
2 c f
I assume that the first part of the problem (identifying the x row values that include *) can be done in a fairly straightforward way using regular expressions. I'm having trouble identifying how to force a data.frame row merge with the row preceding it.
One particularly tricky challenge is if multiple entries in a row have the pattern, e.g.
> df <- data.frame(x = c('a', 'b*', 'c*'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c* f
In this case, the resulting data.frame should look like this:
> df
x y
1 a b* c* d e f
The main issue that I find is that after running one iteration of a loop that pastes the strings from df[2,] into df[1,], the data.frame index does not adapt to the new data.frame size:
> df
x y
1 a b* d e
3 c* f
So, subsequent indexing is disrupted.
Here is an initial solution:
# Creating the data frame
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'),stringsAsFactors = FALSE)
df
# Creating a vector of rows with *
ast <- grepl("\\*",df$x)
# For loop
for(i in seq(length(ast),1,-1)){
  if(ast[i]){
    df[i-1,"x"] <- paste(df[i-1,"x"],df[i,"x"],sep=" ")
    df[i-1,"y"] <- paste(df[i-1,"y"],df[i,"y"],sep=" ")
    df <- df[-i,]
  }
}
It is only an initial solution because you still have to handle cases such as the first row containing a *. I hope it helps already.
Here are 3 alternatives (for the base R one, I assumed x and y are characters rather than factors; I also made your data more complicated in order to cover different scenarios)
(A bit more complicated data set)
df <- data.frame(x = c('p','a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n'),
y = c('r','d', 'e', 'f', 'g', 'i', 'k', 'm', 'o'),
stringsAsFactors = FALSE)
Base R
aggregate(. ~ ID,
transform(df, ID = cumsum(!grepl("*", x, fixed = TRUE))),
paste, collapse = " ")
# ID x y
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
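To see why the grouping works, the ID vector can be inspected on its own. In this small sketch (`starred` and `id` are illustrative names), every row without a * starts a new group, and the starred rows inherit the ID of the row before them.

```r
x <- c('p', 'a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n')

# TRUE where the entry contains a literal asterisk
starred <- grepl("*", x, fixed = TRUE)

# The running count of non-starred rows gives one ID per merged group
id <- cumsum(!starred)
id
# [1] 1 2 2 2 3 3 3 3 4
```

If the very first row were starred, it would get ID 0 and form a group of its own.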
data.table
library(data.table)
setDT(df)[, lapply(.SD, paste, collapse = " "),
by = .(ID = cumsum(!grepl("*", df[["x"]], fixed = TRUE)))]
# ID x y
# 1: 1 p r
# 2: 2 a b* c* d e f
# 3: 3 d h* j* l* g i k m
# 4: 4 n o
dplyr
library(dplyr)
df %>%
  group_by(ID = cumsum(!grepl("*", x, fixed = TRUE))) %>%
  summarise_all(~ paste(., collapse = " "))
# # A tibble: 4 x 3
# ID x y
# <int> <chr> <chr>
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
This does not actually merge the rows: for each row that has a *, it pastes in the values of the previous row, and then it drops the rows that were followed by a * row. It handles one starred row at a time, not runs of consecutive starred rows.
library(dplyr)
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'), stringsAsFactors = FALSE)
df <- mutate(df,
Operator = grepl("\\*",x), # Check for *
lagged.x = lag(x, n = 1), # Get x value from 1 row ago
lagged.y = lag(y, n = 1), # Get y value from 1 row ago
x = ifelse(Operator, paste(lagged.x, x),x), # if there is * paste lagged x
y = ifelse(Operator, paste(lagged.y, y),y), # if there is * paste lagged y
lead.Operator = lead(Operator, n = 1) # Check if next row has a *
)
# keep only rows that had no * in following row and that had no following row (last row)
df <- filter(df, !lead.Operator | is.na(lead.Operator))
# Select just the x and y columns
df <- select(df, x, y)

Sum observations from two columns, looping over many columns in R

I have searched high and low, but am stuck on how to approach this. I have two sets of columns that I want to sum, row by row, but which I want to loop over many columns. If I were to do this manually, I would want:
df1[1,1]+df2[1,1]
df1[2,1]+df2[2,1]
etc... I've found many helpful examples on how to do something like:
apply(df[,c("a","d")], 1, sum)
though I want to do this over lots of columns. Also, while it's not entirely relevant, I want to phrase my question as close to my reality as possible, so my example below includes NA's, since my actual data contains many missing values.
# make a data frame, df1, with three columns
a <- sample(1:100, 50, replace = T)
b <- sample(100:300, 50, replace = T)
c <- sample(2:50, 50, replace = T)
df1 <- data.frame(a, b, c)
# make another data frame, df2, with three columns
x <- sample(1:100, 50, replace = T)
y <- sample(100:300, 50, replace = T)
z <- sample(2:50, 50, replace = T)
df2 <- data.frame(x, y, z)
To randomly insert a few NAs, I use this function from http://www.r-bloggers.com/function-to-generate-a-random-data-set/:
NAins <- NAinsert <- function(df, prop = .1){
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop*n*m)
  id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x){
    df[rows[x], cols[x]] <<- NA
  }
  )
  return(df)
}
Add the NAs to the frames (the function returns a modified copy, so the result has to be assigned back)
df1 <- NAins(df1, .2)
df2 <- NAins(df2, .14)
Then, I tried to seq along the columns in each data frame, and used apply setting the index to 1, meaning to sum each row entry. This doesn't work.
for(i in seq_along(df1)){
for(j in seq_along(df2)){
apply(c(df1[,i], col2[j]), 1, function(x) sum(x, na.rm = T))}}
Thanks for any help!
You should be able to just replace NA with 0, and then add with "+":
replace(df1, is.na(df1), 0) + replace(df2, is.na(df2), 0)
# X Y Z
# 1 7 19 6
# 2 11 12 1
# 3 16 14 11
# 4 13 7 13
# 5 10 2 11
Alternatively, if you have more than just two data.frames, you can collect them in a list and use Reduce:
Reduce("+", lapply(mget(c("df1", "df2", "df3")), function(x) replace(x, is.na(x), 0)))
Here's some sample data (and what I think is an easier way to create it):
set.seed(1) ## Set a seed so others can reproduce your sample data
dfmaker <- function() {
setNames(
data.frame(
replicate(3, sample(c(NA, 1:10), 5, TRUE), FALSE)),
c("X", "Y", "Z"))
}
df1 <- dfmaker()
df1
# X Y Z
# 1 2 9 2
# 2 4 10 1
# 3 6 7 7
# 4 9 6 4
# 5 2 NA 8
df2 <- dfmaker()
df2
# X Y Z
# 1 5 10 4
# 2 7 2 NA
# 3 10 7 4
# 4 4 1 9
# 5 8 2 3
df3 <- dfmaker()
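As a small deterministic check of the replace-and-add idea, here is a sketch with hand-written values instead of random ones. The helper name `zero_na` and the toy frames are only illustrative.

```r
df1 <- data.frame(X = c(1, NA, 3), Y = c(4, 5, NA))
df2 <- data.frame(X = c(10, 20, NA), Y = c(NA, 50, 60))

# Treat missing values as 0 so they do not propagate through the sum
zero_na <- function(d) replace(d, is.na(d), 0)

# Element-wise sum over any number of equally shaped data frames
total <- Reduce(`+`, lapply(list(df1, df2), zero_na))
total
#    X  Y
# 1 11  4
# 2 20 55
# 3  3 60
```

One consequence of this design is that a cell that is NA in every input comes out as 0 rather than NA, which may or may not be what you want.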
You can transform the data.frame to an array and sum them using apply function.
install.packages('abind')
library(abind)
df <- abind(list(df1,df2), along = 3)
results <- apply(df, MARGIN = c(1,2), FUN = function(x) sum(x, na.rm = TRUE))
results

Count number of time combination of events appear in dataframe columns ext

This is an extension of the question asked in Count number of times combination of events occurs in dataframe columns, I will reword the question again so it is all here:
I have a data frame and I want to calculate the number of times each combination of events in two columns occur (in any order), with a zero if a combination doesn't appear.
For example say I have
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
So
x y
a c
b c
c a
c a
c b
a and b do not occur together, a and c occur together 3 times (rows 1, 3 and 4) and b and c twice (rows 2 and 5), so I would want to return
x-y num
a-b 0
a-c 3
b-c 2
I hope this makes sense? Thanks in advance
This should do it:
res = table(df)
To convert to data frame:
resdf = as.data.frame(res)
The resdf data.frame looks like:
x y Freq
1 a a 0
2 b a 0
3 c a 2
4 a b 0
5 b b 0
6 c b 1
7 a c 1
8 b c 1
9 c c 0
Note that this answer takes order into account. If ordering of the columns is unimportant, then modifying the original data.frame prior to the process will remove the effect of ordering (a-c treated the same as c-a).
df1 = as.data.frame(t(apply(df,1,sort)))
As said, you can do this with factor() and expand.grid() (or another way to get all possible combinations)
all.possible <- expand.grid(c('a','b','c'), c('a','b','c'))
all.possible <- all.possible[all.possible[, 1] != all.possible[, 2], ]
all.possible <- unique(apply(all.possible, 1, function(x) paste(sort(x), collapse='-')))
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
table(factor(apply(df , 1, function(x) paste(sort(x), collapse='-')), levels=all.possible))
An alternative, because I was a bit bored. Perhaps a bit more generalised? But probably still uglier than it could be...
library(plyr)  # provides ddply
df2 <- as.data.frame(table(df))
df2$com <- apply(df2[,1:2],1,function(x) if(x[1] != x[2]) paste(sort(x),collapse='-'))
df2 <- df2[df2$com != "NULL",]
ddply(df2, .(unlist(com)), summarise,
num = sum(Freq))
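A more compact base R sketch of the same idea builds the full list of unordered pairs with combn and counts against it, so that pairs which never occur show up as 0. The names `items`, `all_pairs`, `observed`, and `counts` are illustrative, and the sketch assumes both columns are character vectors.

```r
df <- data.frame(x = c("a", "b", "c", "c", "c"),
                 y = c("c", "c", "a", "a", "b"),
                 stringsAsFactors = FALSE)

# Every possible unordered pair of distinct items, e.g. "a-b"
items <- sort(unique(c(df$x, df$y)))
all_pairs <- combn(items, 2, paste, collapse = "-")

# Label each row with its pair, smaller item first, so order is ignored
observed <- paste(pmin(df$x, df$y), pmax(df$x, df$y), sep = "-")

# Counting against the full set of levels keeps the zero-count pairs
counts <- table(factor(observed, levels = all_pairs))
counts
# a-b a-c b-c 
#   0   3   2 
```

Rows where x and y are the same item would not match any level and would be dropped from the table.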
