Combine data.frames of different dimensions creating duplicates where needed (R, dplyr)

I am looking for a way to combine two tables of different dimensions by ID. But the final table should have some duplicated values depending on each table.
Here is a random example:
IDx = c("a", "b", "c", "d")
sex = c("M", "F", "M", "F")
IDy = c("a", "a", "b", "c", "d", "d")
status = c("single", "children", "single", "children", "single", "children")
salary = c(30, 80, 50, 40, 30, 80)
x = data.frame(IDx, sex)
y = data.frame(IDy, status, salary)
Here is x:
IDx sex
1 a M
2 b F
3 c M
4 d F
Here is y:
IDy status salary
1 a single 30
2 a children 80
3 b single 50
4 c children 40
5 d single 30
6 d children 80
I am looking for this:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Basically, sex should be matched to fit the needs of table y. All values in both tables should be used, the actual table is a lot larger. Not all IDs will need to duplicate.
This should be fairly simple, but I cannot find a good answer anywhere online.
Note, I don't want NAs to be introduced.
I am new to R and since I have been focused on dplyr it would help if the example comes from there. It might be simple with base R, too.
UPDATE
The bolded sentences above might be confusing to the final answer. Sorry, it has been a confusing case which I realised should include one extra column that complicates things, but more on that later.
First, I tried to see what is happening in my actual table and to find which suggested answer fits my needs. I removed any problematic columns for the following result. So, I checked this:
dim(x)
> [1] 231 2
dim(y)
> [1] 199 8
# left_join joins matching rows from y to x
suchait <- left_join(x, y, by= c("IDx" = "IDy"))
# inner_join retains only rows in both sets
jdobres <- inner_join(x, y, by = c("IDx" = "IDy"))
dim(suchait) # actual table used
> [1] 225 9
dim(jdobres)
> [1] 219 9
But why and where do they differ?
The following shows the 6 rows that appear in suchait's table but not in jdobres's; the difference comes from the type of join used.
setdiff(suchait, jdobres )
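The gap between the two row counts can be reproduced on a small scale. The sketch below uses base R's merge() instead of dplyr so that it runs without extra packages; the toy tables and the extra ID "e" are made up for illustration and are not taken from the actual data. With all.x = TRUE (the counterpart of left_join) an unmatched ID is kept and padded with NA, while the default (the counterpart of inner_join) drops it.

```r
# Toy data (hypothetical): ID "e" exists in x but has no match in y
x <- data.frame(IDx = c("a", "b", "e"), sex = c("M", "F", "M"))
y <- data.frame(IDy = c("a", "a", "b"),
                status = c("single", "children", "single"))

# Left join: every row of x is kept, unmatched IDs get NA in y's columns
left <- merge(x, y, by.x = "IDx", by.y = "IDy", all.x = TRUE)

# Inner join: only IDs present in both tables survive
inner <- merge(x, y, by.x = "IDx", by.y = "IDy")

# IDs that explain the difference in row counts
only_in_left <- setdiff(left$IDx, inner$IDx)
print(only_in_left)
```

If NAs must not be introduced, the inner join is probably the one to keep, at the cost of losing the unmatched IDs.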

Using dplyr:
library(dplyr)
df <- left_join(x, y, by = c("IDx" = "IDy"))
Your result would be:
IDx sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Or you could do:
df <- left_join(y, x, by = c("IDy" = "IDx"))
It would give:
IDy status salary sex
1 a single 30 M
2 a children 80 M
3 b single 50 F
4 c children 40 M
5 d single 30 F
6 d children 80 F
You can also reorder your columns to get it exactly the way you wanted:
df <- df[, c("IDy", "sex", "status", "salary")]
result:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
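Since the question mentions that base R might work too, here is a minimal sketch with merge(), using the example data from the question. It assumes, as in the example, that every ID in y has a match in x, so no NAs appear.

```r
# Example data from the question
x <- data.frame(IDx = c("a", "b", "c", "d"), sex = c("M", "F", "M", "F"))
y <- data.frame(
  IDy = c("a", "a", "b", "c", "d", "d"),
  status = c("single", "children", "single", "children", "single", "children"),
  salary = c(30, 80, 50, 40, 30, 80)
)

# Join on the differently named ID columns; sex is repeated for each row of y
df <- merge(y, x, by.x = "IDy", by.y = "IDx")

# Reorder the columns to match the requested layout
df <- df[, c("IDy", "sex", "status", "salary")]
print(df)
```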

Related

Converting letter vector into numeric vector

If I want to convert the letter vector c("A","B","C") into c(10,20,30), what function could I use?
Sorry for asking a question that seems to be trivial. I am a self-taught beginner and I am still getting familiar with the functions.
Edit:
I explain why I ask such strange question.
So here is the background:
A standard deck of playing cards can be created in R as a data frame with the following
command.
Note that D = Diamond, C = Club, H = Heart, S = Spade
deck <- data.frame(
suit = rep(c("D","C","H","S"), 13),
rank = rep(2:14, 4) # 11 = Jack, 12 = Queen, 13 = King, 14 = Ace
)
A poker hand is a set of five playing cards. Sample a poker hand using the data frame
deck and name it as hand.
hand<-deck[sample(nrow(deck),5),]
hand
A flush is a hand that contains five cards all of the same suit. Create a logical value named
is.flush which is TRUE if and only if hand is a flush.
is.flush<-length(unique(hand[,1]))==1
is.flush
And here starts the problem:
"A straight is a hand that contains five cards of sequential rank. Note that both
A K Q J 10 and 5 4 3 2 A are considered to be straight, but Q K A 2 3 is
not. Create a logical value named is.straight which is TRUE if and only if the hand is
straight."
Hint: The all() function could be useful.
So here is my attempt:
I can set:
y <- read.table(text = "
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10")
apply(y, 1, function(x) all(diff(sort(x[ x != 2 ])) == 1))
This gives me a TRUE/FALSE value for each row.
But I cannot input letters in the function above.
Hence I am stuck here, and I have to convert the letter to numbers.
(Unless there is a smarter way)
P.S.
The background code I have so far:
deck <- data.frame(
suit = rep(c("D","C","H","S"), 13),
rank = rep(2:14, 4)
)
deck
hand<-deck[sample(nrow(deck),5),]
hand
is.flush<-length(unique(hand[,1]))==1
is.flush
Sounds like you want case_when inside a custom function
library(tidyverse)
my_func <- function(letter) {
case_when(letter == 'A' ~ 10,
letter == 'B' ~ 20,
letter == 'C' ~ 30,
TRUE ~ 0)
}
my_func(c("A","B","C"))
Will give you
[1] 10 20 30
If you want to map each letter to an arbitrary output value, you can use a named vector as a dictionary, for example:
dictionary <- (1:26) * 10 # 10, 20, 30 .. 260
names(dictionary) <- LETTERS # built-in vector of uppercase letters
dictionary
A B C D E F G H I J K L M N O P Q R S T U V
10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220
W X Y Z
230 240 250 260
You can then use letters to index the dictionary and return the mapped value:
test <- c("B", "L", "A")
dictionary[test]
B L A
20 120 10
The function that is actually performing the mapping here is the [ operator, see the Extract docs.
You could do:
library(tidyverse)
x <- c("A","B","C")
recode(x, A = 10, B = 20, C = 30, .default = 0)
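Tying the lookup idea back to the straight exercise: the sketch below is one possible approach, not the official solution. rank_of and is_straight are hypothetical names introduced here. Since the deck in the question already stores ranks as 2 to 14, the lookup is only needed when cards are written as labels such as "K" or "A".

```r
# Hypothetical lookup: card label -> numeric rank (Ace high = 14)
rank_of <- c(setNames(2:10, 2:10), J = 11, Q = 12, K = 13, A = 14)

# Hypothetical helper: TRUE if five numeric ranks form a straight
is_straight <- function(ranks) {
  high <- all(diff(sort(ranks)) == 1)           # Ace counted as 14
  low_ranks <- ifelse(ranks == 14, 1, ranks)    # Ace counted as 1
  low <- all(diff(sort(low_ranks)) == 1)
  high || low
}

is_straight(unname(rank_of[c("A", "K", "Q", "J", "10")]))  # TRUE
is_straight(unname(rank_of[c("5", "4", "3", "2", "A")]))   # TRUE
is_straight(unname(rank_of[c("Q", "K", "A", "2", "3")]))   # FALSE
```

With the deck from the question, the requested value would then be is.straight <- is_straight(hand$rank).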

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1 A B
2 C B
3 H F
4 G B
df2
row destination n
1 B 26
2 F 44
3 L 12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1 A B 26
2 C B 26
3 H F 44
4 G B 26
The data that I'm actually working with is much larger, and is never the same number of rows when I run the program. The furthest I've gotten with this is using the which command to get the right values, but only each value once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
When what I would need is the list (26,26,44,26) so I can save it to df1$n
We can use a merge or left_join. With data.table, an update join adds n to df1 by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
df1,
n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
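To see why match succeeds where the which/%in% attempt returned each value only once, it can help to look at the intermediate index. This is a small sketch on the sample data; idx is a name introduced here for illustration.

```r
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"),
                  n = c(26, 44, 12))

# For each destination in df1, the position of its first match in df2
idx <- match(df1$destination, df2$destination)
idx            # 1 1 2 1

# Indexing df2$n with those positions repeats values as needed
df1$n <- df2$n[idx]
df1$n          # 26 26 44 26
```

The %in% version filters the rows of df2, so it can never return more rows than df2 has, whereas match returns one position per row of df1. A destination that is missing from df2 would get NA.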

Merge several data frames named inside a vector

I am trying to merge some data frames, 4 to be precise, but I would like my command to work with any number of them. Those data frames are named in a vector:
dataframes<-c('df1', 'data2', 'd3', 'samples4')
All data frames present the same data, but they belong to different samples. As an example, the first data frame is as follows:
ID count1
A 0
B 67
C 200
D 12
E 0
My desired output would contain a column with the counts of each ID for each sample:
ID count1 count2 count3 count4
A 0 2 0 30
B 67 100 300 500
C 200 2 1025 0
D 12 4 0 10
E 0 0 20 2
I have tried the following commands:
Reduce(function(x, y) merge(x, y, by="ID"), list(unname(get(dataframes))))
as.data.frame(do.call(cbind, unname(get(dataframes))))
But in both cases I get just the first data frame. No merging is occurring.
How can I solve this?
Assuming:
df1 <- data.frame(ID = c("A", "B", "C", "D"), count = c(2,45,24,21))
df2 <- data.frame(ID = c("A", "B", "C", "D"), count = c(11,35,4,2))
I'd suggest just adding a column with the sample name to each data frame, e.g.:
df1["sample"] <- "sample1"
df2["sample"] <- "sample2"
Then merge them "vertically" by using something like rbind().
all_data <- rbind(df1, df2) # this can take more dataframes
This "long format" should also make it easier to filter rows by sample.
If you still need to have the wider structure you describe above (with a column for each sample), you can use reshape2::dcast() to construct it:
library(reshape2)
all_data <- dcast(all_data, ID ~ sample, value.var="count")
Result:
ID sample1 sample2
1 A 2 11
2 B 45 35
3 C 24 4
4 D 21 2
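For the setup in the question, where the data frame names sit in a character vector, a sketch along these lines may also work. get() looks up a single name, which would explain why only the first data frame came back; mget() is the base R counterpart that takes a vector of names and returns a named list. The sample values below are taken from the desired output in the question, and the count2, count3 and count4 column names are assumed to follow the count1 pattern.

```r
# Sample data reconstructed from the desired output in the question
ids <- c("A", "B", "C", "D", "E")
df1      <- data.frame(ID = ids, count1 = c(0, 67, 200, 12, 0))
data2    <- data.frame(ID = ids, count2 = c(2, 100, 2, 4, 0))
d3       <- data.frame(ID = ids, count3 = c(0, 300, 1025, 0, 20))
samples4 <- data.frame(ID = ids, count4 = c(30, 500, 0, 10, 2))

dataframes <- c("df1", "data2", "d3", "samples4")

# Fetch all data frames by name as a list, then merge them pairwise on ID
df_list <- mget(dataframes)
merged <- Reduce(function(a, b) merge(a, b, by = "ID"), df_list)
print(merged)
```

This relies on each data frame having a distinct count column name; if they all shared the name count, merge() would add suffixes and the long format described above would likely be easier to handle.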

merge tables in R, combine cells if in both

Hi can you please explain how I can merge two tables that they can be used to generate a piechart?
#read input data
dat = read.csv("/ramdisk/input.csv", header = TRUE, sep="\t")
# pick needed columns and count the occurrences of each entry
df1 = table(dat[["C1"]])
df2 = table(dat[["C2"]])
# rename columns
names(df1) <- c("ID", "a", "b", "c", "d")
names(df2) <- c("ID", "e", "f", "g", "h")
# show data for testing purpose
df1
# ID a b c d
#241 18 17 28 29
df2
# ID e f g h
#230 44 8 37 14
# looks fine so far, now the problem:
# what I want to do is merge df1 and df2
# so that df will contain the overall numbers of each entry
# df should print
# ID a b c d e f g h
#471 18 17 28 29 44 8 37 14
# need them to make a nice piechart in the end
#pie(df)
I assume it can be done with merge somehow, but I haven't found the right way. The closest solution I found was merge(df1,df2,all=TRUE), but it wasn't exactly what I've needed.
An approach would be to stack, then rbind and do an aggregate
out <- aggregate(values ~ ., rbind(stack(df1), stack(df2)), sum)
To get a named vector
with(out, setNames(values, ind))
Or another approach is to concatenate the tables and then use tapply to do a group by sum
v1 <- c(df1, df2)
tapply(v1, names(v1), sum)
Or with rowsum
rowsum(v1, group = names(v1))
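As a self-contained illustration of the tapply variant, the sketch below uses two plain named vectors as stand-ins for the tables in the question (an assumption, since the original input file is not available). Entries that share a name are summed, and all other entries are carried over unchanged.

```r
# Hypothetical named vectors standing in for the two tables in the question
df1 <- c(ID = 241, a = 18, b = 17, c = 28, d = 29)
df2 <- c(ID = 230, e = 44, f = 8, g = 37, h = 14)

# Concatenate, then sum the values that share a name
v1 <- c(df1, df2)
totals <- tapply(v1, names(v1), sum)
totals["ID"]   # 471

# Drop the ID total before plotting
slices <- totals[names(totals) != "ID"]
# pie(slices)
```

The order of the names in totals follows the sort order of the current locale, so it is safer to index by name than by position.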
Another approach would be to use rbindlist from data.table and colSums to get the totals. rbindlist with fill=TRUE accepts all columns, even if they are not present in both tables.
df1<-read.table(text="ID a b c d
241 18 17 28 29 ",header=TRUE)
df2<-read.table(text="ID e f g h
230 44 8 37 14" ,header=TRUE)
library(data.table)
setDT(df1)
setDT(df2)
res <- rbindlist(list(df1,df2), use.names=TRUE, fill=TRUE)
colSums(res, na.rm=TRUE)
ID a b c d e f g h
471 18 17 28 29 44 8 37 14
I wrote the package safejoin, which handles this type of task in an intuitive way (I hope!). You just need to have a common id between your 2 tables (we'll use tibble::rowid_to_column for that) and then you can merge and handle the column conflict with sum.
Using @pierre-lapointe's data:
library(tibble)
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
res <- safe_inner_join(rowid_to_column(df1),
rowid_to_column(df2),
by = "rowid",
conflict = sum)
res
# rowid ID a b c d e f g h
# 1 1 471 18 17 28 29 44 8 37 14
Then, for a given row (here the first and only one), you can get your pie chart by converting to a vector with unlist and removing the irrelevant first 2 elements:
pie(unlist(res[1,])[-(1:2)])

R - How to apply different functions to certain rows in a column

I am trying to apply different functions to different rows based on the value of a string in an adjacent column. My dataframe looks like this:
type size
A 1
B 3
A 4
C 2
C 5
A 4
B 32
C 3
and I want to apply different functions to types A, B, and C, to give a third column, "size2". For example, let's say the following functions apply to A, B, and C:
for A: size2 = 3*size
for B: size2 = size
for C: size2 = 2*size
I'm able to do this for each type separately using this code
df$size2 <- ifelse(df$type == "A", 3*df$size, NA)
df$size2 <- ifelse(df$type == "B", 1*df$size, NA)
df$size2 <- ifelse(df$type == "C", 2*df$size, NA)
However, I can't seem to do it for all of the types without erasing all of the other values. I tried to use this code to limit the application of the function to only those values that were NA (i.e., keep existing values and only fill in NA values), but it didn't work using this code:
df$size2 <- ifelse(is.na(df$size2), ifelse(df$type == "C", 2*df$size, NA), NA)
Does anyone have any ideas? Is it possible to use some kind of AND statement with "is.na(df$size2)" and "ifelse(df$type == "C""?
Many thanks!
This might be a mite more R-ish (and I called my dataframe 'dat' instead of 'df', since df is a commonly used function).
> facs <- c(3,1,2)
> dat$size2= dat$size* facs[ match( dat$type, c("A","B","C") ) ]
> dat
type size size2
1 A 1 3
2 B 3 3
3 A 4 12
4 C 2 4
5 C 5 10
6 A 4 12
7 B 32 32
8 C 3 6
The match function is used to construct indexes to supply to the extract function [.
If you want, you can nest the ifelse calls:
df$size2 <- ifelse(df$type == "A", 3*df$size,
ifelse(df$type == "B", 1*df$size,
ifelse(df$type == "C", 2*df$size, NA)))
# > df
# type size size2
#1 A 1 3
#2 B 3 3
#3 A 4 12
#4 C 2 4
#5 C 5 10
#6 A 4 12
#7 B 32 32
#8 C 3 6
You could do it like this, creating separate logical vectors for each type:
As <- df$type == 'A'
Bs <- df$type == 'B'
Cs <- df$type == 'C'
df$size2[As] <- 3*df$size[As]
df$size2[Bs] <- df$size[Bs]
df$size2[Cs] <- 2*df$size[Cs]
but a more direct approach would be to create a separate lookup table like this:
df$size2 <- c(A=3,B=1,C=2)[as.character(df$type)] * df$size
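A runnable version of the lookup-table idea, using the data from the question, might look like this. The name multipliers is introduced here for illustration; the rest follows the one-liner above.

```r
df <- data.frame(type = c("A", "B", "A", "C", "C", "A", "B", "C"),
                 size = c(1, 3, 4, 2, 5, 4, 32, 3))

# Named vector acting as a lookup table: type -> multiplier
multipliers <- c(A = 3, B = 1, C = 2)

# as.character() makes sure the lookup is by name, even if type is a factor
df$size2 <- unname(multipliers[as.character(df$type)]) * df$size
df$size2   # 3 3 12 4 10 12 32 6
```

A type that is missing from the lookup table yields NA in size2, which makes unexpected categories easy to spot.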

Resources