How to modify a single column with joins using dplyr - r

I'm trying to add a new column to a data frame, based on the levels of one (or a few) factors. I start with a data frame with two factors and a single variable
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
And I want to add a new column, four, that has values for certain levels of one and two. For convenience, I keep these new values in their own little tables:
new_fourth_a <- data_frame(one = "b", four = 47)
new_fourth_b <- data_frame(two = c("C","E"), four = 42)
The correct answer would be
one two three four
(chr) (chr) (int) (dbl)
1 a A 6 NA
2 b B 7 47
3 c C 8 42
4 d D 9 NA
5 e E 10 42
And the best way I could think of to accomplish this is via left_join():
test %>%
left_join(new_fourth_a, by = "one") %>%
left_join(new_fourth_b, by = "two")
But this ends up duplicating the four column. This could be a good thing: it would allow easy checking for joins that introduce more than one value for the new column (i.e., checking that there is at most one non-NA value in each row across all the columns that start with four). Still, I think there must be an easier way?

Here is a solution that uses join
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
new_fourth_a <- data_frame(one = "b", extra_a = 47)
new_fourth_b <- data_frame(two = c("C","E"), extra_b = 42)
test %>%
left_join(new_fourth_a, by = "one") %>%
left_join(new_fourth_b, by = "two") %>%
mutate(four = pmax(extra_a, extra_b, na.rm = TRUE)) %>%
select(-extra_a, -extra_b)
If you want to handle an arbitrary number of lookup tables, then you have to handle them one at a time
library(dplyr)
test <- data_frame(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
new_fourth_a <- data_frame(one = "b", extra = 47)
new_fourth_b <- data_frame(two = c("C","E"), extra = 42)
test %>%
left_join(new_fourth_a, by = "one") %>%
mutate(four = extra) %>%
select(-extra) %>%
left_join(new_fourth_b, by = "two") %>%
mutate(four = ifelse(is.na(extra), four, extra)) %>%
select(-extra)
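A variant of the join approach, as a minimal sketch. It assumes a dplyr version that provides coalesce() and the suffix argument of left_join(), keeps the column name four in both lookup tables, and uses the intermediate name joined, which is not in the original. The stopifnot() line is the check mentioned in the question: no row may receive a value from more than one lookup table.

```r
library(dplyr)

test <- tibble(one = letters[1:5], two = LETTERS[1:5], three = 6:10)
new_fourth_a <- tibble(one = "b", four = 47)
new_fourth_b <- tibble(two = c("C", "E"), four = 42)

# After the second join the clashing columns are named four_a and four_b
joined <- test %>%
  left_join(new_fourth_a, by = "one") %>%
  left_join(new_fourth_b, by = "two", suffix = c("_a", "_b"))

# Check that no row gets a value from more than one lookup table
stopifnot(all(rowSums(!is.na(joined[c("four_a", "four_b")])) <= 1))

# coalesce() takes the first non-NA value in each row
result <- joined %>%
  mutate(four = coalesce(four_a, four_b)) %>%
  select(-four_a, -four_b)
result
```

Unlike pmax(), coalesce() does not depend on the values being comparable, so the same pattern should also work for character columns.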

Instead of creating two more data_frames, we could use %in% with some arithmetic to get a numeric index to create the column 'four' with values NA, 47, and 42.
test %>%
mutate(four = c(NA, 47, 42)[1+(one %in% 'b') +
2*(two %in% c('C', 'E'))])
# one two three four
# (chr) (chr) (int) (dbl)
#1 a A 6 NA
#2 b B 7 47
#3 c C 8 42
#4 d D 9 NA
#5 e E 10 42
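The same rule can probably be written more readably with case_when(). This is only a sketch, assuming a dplyr version in which rows that match no condition default to NA. One difference from the index arithmetic above: if a row matched both conditions, the arithmetic would produce index 4 and therefore NA, whereas case_when() returns the value of the first matching condition.

```r
library(dplyr)

test <- tibble(one = letters[1:5], two = LETTERS[1:5], three = 6:10)

# Conditions are tried in order; rows matching none of them get NA
result <- test %>%
  mutate(four = case_when(
    one == "b"           ~ 47,
    two %in% c("C", "E") ~ 42
  ))
result
```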

Related

R - Adding vector from one dataframe as column to another dataframe

I have two dataframes and want to add a specific vector from one as a column to another, for multiplication purposes OR how can I multiply data from one dataframe to a specific vector from another?
Example
library (dplyr)
df <- data.frame (name = c("A", "B", "C", "D", "E"),
area = c(1,2,3,4,5),
yield = c(10, 20, 30, 40, 50))
df2 <- data.frame (application = c("test", "current", "future"),
number = c(5,10,15))
The intended result is, for example, to get the value for "current" in df2 and create a new column in df named "number", which is then multiplied with the other columns in df to generate the column "calculation" (an Excel example of the final df was attached).
I tried
df$number <- df2 %>%
filter(application == "current") %>%
select(number)
But I get an error:
Error in `$<-.data.frame`(`*tmp*`, number, value = list(number = 10)) :
replacement has 1 row, data has 5
I know that I could do
df$number <- df2[2,2]
But I want it to be specifically related to "current" (as I tried to do with dplyr). This is only an example - in reality, df2 is a big file and the order can change when people are adding more data.
Here is a base R approach -
df$number <- df2$number[df2$application == "current"]
df$calculation <- with(df, area * yield * number)
df
Or if you prefer dplyr -
library(dplyr)
df <- df %>%
bind_cols(df2 %>%
filter(application == "current") %>%
select(number)) %>%
mutate(calculation = area * yield * number)
df
# name area yield number calculation
#1 A 1 10 10 100
#2 B 2 20 10 400
#3 C 3 30 10 900
#4 D 4 40 10 1600
#5 E 5 50 10 2500
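The attempt in the question fails because select() returns a one-column data frame rather than a vector. A minimal sketch of the same idea using pull(), which extracts the column as a plain vector; the names current_number and result are introduced here for illustration only.

```r
library(dplyr)

df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 area = c(1, 2, 3, 4, 5),
                 yield = c(10, 20, 30, 40, 50))
df2 <- data.frame(application = c("test", "current", "future"),
                  number = c(5, 10, 15))

# pull() returns a vector, whereas select() returns a data frame
current_number <- df2 %>%
  filter(application == "current") %>%
  pull(number)

# Guard against zero or several rows labelled "current"
stopifnot(length(current_number) == 1)

# The single value is recycled over all rows of df
result <- df %>%
  mutate(number = current_number,
         calculation = area * yield * number)
result
```

The length check matters when df2 is large and edited by several people, since a duplicated "current" row would otherwise cause an error or a silent mismatch.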

How to obtain minimum difference between 2 columns

I want to obtain the minimum distance between 2 columns, however the same name may appear in both Column A and Column B. See example below;
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, I just get the minimum distance for each column, i.e. C will have 2 results as it is in both columns rather than showing the closest patient considering both columns. Also, I only get a list of distances, so I can't see which patient is linked to which;
Patient1 SNP
1: A 8
The code below works.
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1); df$Patient2 <- as.character(df$Patient2);
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# group pairs by min distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), min(Distance))
# Re-sort so that the smallest distances come first.
nearest <- df[order(df$`min(Distance)`), ]
# Keep only the first of each group
nearest <- nearest[!duplicated(nearest$Patient1),]
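For comparison, a sketch of the same mirror-then-pick-the-minimum idea applied to the data from the question. It assumes dplyr 1.0 or later, where slice_min() is available; the object names full_data, mirror and nearest are chosen here for illustration.

```r
library(dplyr)

full_data <- data.frame(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25)
)

# Add the reverse direction of every pair so that each patient
# appears in Patient1 together with all of its partners
mirror <- data.frame(Patient1 = full_data$Patient2,
                     Patient2 = full_data$Patient1,
                     Distance = full_data$Distance)

# Keep the single closest partner per patient
nearest <- bind_rows(full_data, mirror) %>%
  group_by(Patient1) %>%
  slice_min(Distance, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(Patient = Patient1, Patient_closest_distance = Patient2)
nearest
```

With with_ties = FALSE exactly one row is kept per patient, so a patient with two equally close partners reports only one of them.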

Reshape origin destination data

I need to turn this data frame:
df1 <- data.frame(A = c(1,2,3), B = c(2,1,4), Flow = c(50,30,20))
into a data frame like this:
df2 <- data.frame(A = c(1,3), B = c(2,4), AtoB = c(50,20), BtoA = c(30, NA))
I am trying to reshape it with dplyr. Is there an existing function or a way to do that?
An option would be to create an Identifier column between 'A' and 'B' with labels 'AtoB/BtoA' based on the minimum value in each row, then change the values in 'A', 'B' by taking the min/max for each row (pmin/pmax) and spread the output back to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
mutate(grpIdent = case_when(A == pmin(A, B) ~ 'AtoB', TRUE ~ 'BtoA'),
A1= pmin(A, B), B1 = pmax(A, B)) %>%
select(A = A1, B = B1, grpIdent, Flow) %>%
spread(grpIdent, Flow)
# A B AtoB BtoA
#1 1 2 50 30
#2 3 4 20 NA
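The same reshaping can be sketched with pivot_wider(), which tidyr provides as the successor to spread(). This assumes tidyr 1.0 or later; the helper columns direction, lo and hi and the result name wide are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(A = c(1, 2, 3), B = c(2, 1, 4), Flow = c(50, 30, 20))

wide <- df1 %>%
  mutate(
    # A flow counts as "AtoB" when it runs from the smaller id to the larger
    direction = if_else(A <= B, "AtoB", "BtoA"),
    lo = pmin(A, B),
    hi = pmax(A, B)
  ) %>%
  select(A = lo, B = hi, direction, Flow) %>%
  # One row per unordered pair, one column per direction
  pivot_wider(names_from = direction, values_from = Flow)
wide
```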
Using base R (this may require inserting a blank row for any pair that lacks a return flow; otherwise values are recycled). It also assumes that the to and fro values are entered in succession.
new_df <- cbind(df1[seq(1, nrow(df1), by = 2), ], df1[seq(2, nrow(df1), by = 2), ])[, -c(4, 5)]
names(new_df) <- c("A", "B", "AtoB", "BtoA")
new_df
Result:
# A B AtoB BtoA
#1 1 2 50 30
#3 3 4 20 30

Combining values Boolean columns to one with Priority in R

I have gone through the links below, but they solved my problem only partially.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into a single column, but it has to apply a priority when there are two "Y" values in a row.
The priority is A > B > C > D > E, which means that if there is a "Y" in A, the resulting value should be A. Similarly, in the example df above, row 6 has "Y" in both C and D, so the result should be "C".
Hence the output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.
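# Columns in priority order (highest priority first)
col_order = c("A", "B", "C", "D", "E")
# For each row, take the first column in priority order that holds "Y"
tmp = data.frame(ID = dat[,1],
                 Result = col_order[apply(
                     X = dat[col_order],
                     MARGIN = 1,
                     FUN = function(x) which(x == "Y")[1])],
                 stringsAsFactors = FALSE)
# Rows without any "Y" get a placeholder label
tmp$Result[is.na(tmp$Result)] = "Not Present"
tmp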
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
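A vectorised alternative that avoids the row-wise apply(), given as a sketch only. It relies on max.col() with ties.method = "first" to find the first column holding "Y"; the names is_yes, first_yes and result are introduced here, and rows without any "Y" are set to NA rather than "Not Present".

```r
dat <- data.frame(Id = c(1, 2, 3, 4, 5, 6, 7, 8),
                  A = c('Y','N','N','N','N','N','N','N'),
                  B = c('N','Y','N','N','N','N','Y','N'),
                  C = c('N','N','Y','N','N','Y','N','N'),
                  D = c('N','N','N','Y','N','Y','N','N'),
                  E = c('N','N','N','N','Y','N','Y','N'))

# Columns in priority order (highest priority first)
col_order <- c("A", "B", "C", "D", "E")

# Logical matrix: TRUE where a cell holds "Y"
is_yes <- as.matrix(dat[col_order]) == "Y"

# Index of the first TRUE in each row (ties resolved towards the left)
first_yes <- max.col(is_yes * 1, ties.method = "first")

result <- col_order[first_yes]
# Rows with no "Y" at all would otherwise be reported as column 1
result[rowSums(is_yes) == 0] <- NA

resultant_dat <- data.frame(Id = dat$Id, Result = result)
resultant_dat
```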

Joint Occurrence of variables in R

I want to count the individual and combined occurrences of variables (1 represents presence and 0 represents absence). This can be obtained by multiple uses of the table function (see the MWE below). Is there a more efficient approach to get the required output given below?
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
table(A)
A
 0  1
48 52
table(B)
B
 0  1
53 47
table(C)
C
 0  1
34 66
table(A, B)
   B
A    0  1
  0 25 23
  1 28 24
table(A, C)
   C
A    0  1
  0 12 36
  1 22 30
table(B, C)
   C
B    0  1
  0 21 32
  1 13 34
table(A, B, C)
, , C = 0
   B
A    0  1
  0  8  4
  1 13  9
, , C = 1
   B
A    0  1
  0 17 19
  1 15 15
Required Output
I am requiring something like the following:
A = 52
B = 47
C = 66
A + B = 24
A + C = 30
B + C = 34
A + B + C = 15
Expanding on Sumedh's answer, you can also do this dynamically without having to specify the filter every time. This is useful if you have more than 3 columns to combine.
You can do something like this:
library(dplyr)
# Note: filter_() is deprecated in recent dplyr releases
lapply(seq_len(ncol(df)), function(i){
  # Generate all the combinations of i elements over all columns
  tmp_i = utils::combn(names(df), i)
  # Each column of tmp_i holds the elements of one combination
  apply(tmp_i, 2, function(x){
    dynamic_formula = as.formula(paste("~", paste(x, "== 1", collapse = " & ")))
    df %>%
      filter_(.dots = dynamic_formula) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()
This will:
1) Generate all the combinations of the columns of df: first the combinations with one element (A, B, C), then the ones with two elements (AB, AC, BC), etc. This is the outer lapply.
2) For every combination, create a dynamic formula. For AB, for instance, the formula will be A == 1 & B == 1, exactly as Sumedh suggested. This is the dynamic_formula bit.
3) Filter the data frame with the dynamically generated formula and count the number of rows.
4) Bind everything together (the two bind_rows calls).
The output will be
Count type
1 52 A
2 47 B
3 66 C
4 24 AB
5 30 AC
6 34 BC
7 15 ABC
EDITED TO ADD: I see now that you don't want to get the exclusive counts (i.e. A and AB should both include all As).
I got more than a little nerd-sniped by this today, particularly as I wanted to solve it using base R with no packages. The below should do that.
There is a very easy (in principle) solution that simply uses xtabs(), which I've illustrated below. However, to generalize it for any potential number of dimensions, and then to apply it to a variety of combinations, actually was harder. I strove to avoid using the dreaded eval(parse()).
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
# Keep strings as character vectors instead of converting them to factors
options(stringsAsFactors = FALSE)
# Obtain the n-way frequency table
# This table can be directly subset using []
# It is a little tricky to pass the arguments
# I'm trying to avoid eval(parse())
# But still give a solution that isn't bound to a specific size
xtab_freq <- xtabs(formula = formula(x = paste("~",paste(names(df),collapse = " + "))),
data = df)
# Demonstrating what I mean
# All A
sum(xtab_freq["1",,])
# [1] 52
# AC
sum(xtab_freq["1",,"1"])
# [1] 30
# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3
# The output of combn() goes through list(), then is unlisted with recursive FALSE
# This gives us a list of vectors
# Each one being a combination in which we are interested
lst_combs <- unlist(lapply(X = 1:3,FUN = combn,x = names(df),list),recursive = FALSE)
# For nice output naming, I just paste the values together
names(lst_combs) <- sapply(X = lst_combs,FUN = paste,collapse = "")
# This is a function I put together
# Generalizes process of extracting values from a crosstab
# It does it in this fashion to avoid eval(parse())
uFunc_GetMargins <- function(crosstab,varvector,success) {
# Obtain the dimnames (the level names within each dimension)
# From that, get the names of the dimensions themselves
xtab_dnn <- dimnames(crosstab)
xtab_dn <- names(xtab_dnn)
# Use match() to get a numeric vector for the margins
# This can be used in margin.table()
tgt_margins <- match(x = varvector,table = xtab_dn)
# Obtain a margin table
marginal <- margin.table(x = crosstab,margin = tgt_margins)
# To extract the value, figure out which marginal cell contains
# all variables of interest set to success
# sapply() goes over all the elements of the dimname names
# Finds numeric index in that dimension where the name == success
# We subset the resulting vector by tgt_margins
# (to only get the cells in our marginal table)
# Then, use prod() to multiply them together and get the location
tgt_cell <- prod(sapply(X = xtab_dnn,
FUN = match,
x = success)[tgt_margins])
# Return as named list for ease of stacking
return(list(count = marginal[tgt_cell]))
}
# Doing a call of mapply() lets us get the results
do.call(what = rbind.data.frame,
args = mapply(FUN = uFunc_GetMargins,
varvector = lst_combs,
MoreArgs = list(crosstab = xtab_freq,
success = "1"),
SIMPLIFY = FALSE,
USE.NAMES = TRUE))
# count
# A 52
# B 47
# C 66
# AB 24
# AC 30
# BC 34
# ABC 15
I ditched the prior solution that used aggregate.
Using dplyr,
Occurrence of only A:
library(dplyr)
df %>% filter(A == 1) %>% summarise(Total = nrow(.))
Occurrence of A and B:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
Occurrence of A, B, and C:
df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.))
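The three filter() calls above can be generalised to any number of 0/1 columns without building formulas. The following is a base R sketch under the assumption that every column of the data frame is a 0/1 indicator; the function name joint_counts and the small data frame toy are hypothetical and only serve as an illustration.

```r
# Count, for every combination of columns, the rows where all of them equal 1
joint_counts <- function(data) {
  # All non-empty combinations of column names, smallest combinations first
  combos <- unlist(
    lapply(seq_along(data),
           function(k) combn(names(data), k, simplify = FALSE)),
    recursive = FALSE
  )
  # A row counts for a combination when every listed column equals 1
  counts <- vapply(
    combos,
    function(vars) sum(rowSums(data[vars] == 1) == length(vars)),
    integer(1)
  )
  names(counts) <- vapply(combos, paste, character(1), collapse = " + ")
  counts
}

# Small hand-checkable example
toy <- data.frame(A = c(1, 1, 0, 1),
                  B = c(1, 0, 0, 1),
                  C = c(1, 1, 1, 0))
toy_counts <- joint_counts(toy)
toy_counts
```

Calling joint_counts(df) on the simulated data from the question should give the counts in the requested "A + B" style labelling.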
