Merge several data frames named inside a vector - r

I am trying to merge some data frames, precisely 4, but I would like my command to work with any number of them. The data frames are named in a vector:
dataframes<-c('df1', 'data2', 'd3', 'samples4')
All data frames hold the same kind of data, but they belong to different samples. As an example, the first data frame looks as follows:
ID count1
A 0
B 67
C 200
D 12
E 0
My desired output would contain a column with the counts of each ID for each sample:
ID count1 count2 count3 count4
A 0 2 0 30
B 67 100 300 500
C 200 2 1025 0
D 12 4 0 10
E 0 0 20 2
I have tried the following commands:
Reduce(function(x, y) merge(x, y, by="ID"), list(unname(get(dataframes))))
as.data.frame(do.call(cbind, unname(get(dataframes))))
But in both cases I get just the first data frame. No merging is occurring.
How can I solve this?

Assuming:
df1 <- data.frame(ID = c("A", "B", "C", "D"), count = c(2,45,24,21))
df2 <- data.frame(ID = c("A", "B", "C", "D"), count = c(11,35,4,2))
I'd suggest just adding a column with the sample name to each data frame, e.g.:
df1["sample"] <- "sample1"
df2["sample"] <- "sample2"
Then merge them "vertically" by using something like rbind().
all_data <- rbind(df1, df2) # this can take more dataframes
This "long format" should also make it easier to filter rows by sample.
If you still need to have the wider structure you describe above (with a column for each sample), you can use reshape2::dcast() to construct it:
library(reshape2)
all_data <- dcast(all_data, ID ~ sample, value.var="count")
Result:
ID sample1 sample2
1 A 2 11
2 B 45 35
3 C 24 4
4 D 21 2
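If you would rather keep the original approach of merging by ID, here is a minimal sketch, assuming every data frame named in the vector exists in the current environment and has an ID column. The sample data is made up for illustration. get() looks up a single name, whereas mget() returns a named list with one element per name, which is what Reduce() needs:

```r
# Hypothetical sample data; the object names follow the question's vector
df1   <- data.frame(ID = c("A", "B", "C"), count1 = c(0, 67, 200))
data2 <- data.frame(ID = c("A", "B", "C"), count2 = c(2, 100, 2))
d3    <- data.frame(ID = c("A", "B", "C"), count3 = c(0, 300, 1025))
dataframes <- c("df1", "data2", "d3")

# mget() fetches every named object and returns them as a named list
df_list <- mget(dataframes)

# Merge pairwise on ID, accumulating one count column per data frame
merged <- Reduce(function(x, y) merge(x, y, by = "ID"), df_list)
merged
```

This works for any number of names in the vector, as long as the count columns have distinct names.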

Related

Convert dataframe to a matrix based on events' frequency

I am trying to create a matrix from a dataframe based on the frequency of interaction of pairs of individuals. In the dataframe (example below), I have a list of names under the GIVER and RECIPIENT columns. Each row with a combination of GIVER and RECIPIENT corresponds to one (directed) interaction between the two individuals (interaction dyad).
The matrix I would like to obtain should have all the names of the individuals listed in the columns "GIVER" and "RECIPIENT" (not all individuals appear in both columns). The matrix's rows should represent the number of interactions that an individual (each rowname) gives to any other individual (each colname). The columns should instead represent the number of interactions that each individual (each colname) receives from any other individual (each rowname).
This is an example of my dataframe:
GIVER RECIPIENT
A B
A C
D A
E B
C E
B D
I used this function to obtain the matrix:
my_matrix = function(df){
tablei = as.data.frame(table(union(df$GIVER, df$RECIPIENT), union(df$GIVER, df$RECIPIENT)))
nameVals <- sort(unique(unlist(tablei[1:2])))
matrixi <- matrix(0, length(nameVals), length(nameVals), dimnames = list(nameVals,nameVals))
matrixi[as.matrix(tablei[c("Var1", "Var2")])] <- tablei[["Freq"]]
as.data.frame(matrixi)}
However, there is a problem in the second row, which returns the frequency values as all 0 (any interaction of an individual with others) or 1 (interaction of an individual with itself).
tablei = as.data.frame(table(union(df$GIVER, df$RECIPIENT), union(df$GIVER, df$RECIPIENT)))
Do you have any idea on how to fix the problem?
Thank you for your help!
A tidyverse approach to your problem.
Data
df <-
tibble::tribble(
~GIVER, ~RECIPIENT,
"A", "B",
"A", "C",
"D", "A",
"E", "B",
"C", "E",
"B", "D"
)
Code
library(dplyr)
library(tidyr)
df %>%
count(GIVER,RECIPIENT,name = "freq") %>%
complete(GIVER,RECIPIENT,fill = list(freq = 0))
Output
# A tibble: 25 x 3
GIVER RECIPIENT freq
<chr> <chr> <dbl>
1 A A 0
2 A B 1
3 A C 1
4 A D 0
5 A E 0
6 B A 0
7 B B 0
8 B C 0
9 B D 1
10 B E 0
# ... with 15 more rows
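If a square matrix is what you need in the end, a base R sketch along these lines may also do. It assumes the same six example rows, and the names ids, tab and mat are only illustrative. The key point is to give both columns the same factor levels, so that table() keeps every individual on both axes, including the zero counts:

```r
df <- data.frame(
  GIVER     = c("A", "A", "D", "E", "C", "B"),
  RECIPIENT = c("B", "C", "A", "B", "E", "D")
)

# All individuals that appear in either column
ids <- sort(union(df$GIVER, df$RECIPIENT))

# Shared factor levels force a square table with zeros for absent pairs
tab <- table(
  factor(df$GIVER, levels = ids),
  factor(df$RECIPIENT, levels = ids),
  dnn = c("GIVER", "RECIPIENT")
)

# Drop the table class to get a plain integer matrix (rows give, columns receive)
mat <- unclass(tab)
mat
```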

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1 A B
2 C B
3 H F
4 G B
df2
row destination n
1 B 26
2 F 44
3 L 12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1 A B 26
2 C B 26
3 H F 44
4 G B 26
The data that I'm actually working with is much larger, and is never the same number of rows when I run the program. The furthest I've gotten with this is using the which command to get the right values, but only each value once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
What I need instead is the list (26,26,44,26), so I can save it to df1$n.
We can use a join. With data.table, an update join adds the column by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
df1,
n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
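For completeness, a plain merge() can be sketched as follows, using the same data. This is just one way to do it: only the key and value columns of df2 are kept so that its row column does not clash with the one in df1, and because merge() sorts by the key, the original order is restored afterwards:

```r
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))

# Left join: keep every row of df1, pull in n where the destination matches
merged <- merge(df1, df2[c("destination", "n")],
                by = "destination", all.x = TRUE)

# merge() reorders rows and columns, so put them back in df1's layout
merged <- merged[order(merged$row), c("row", "source", "destination", "n")]
merged
```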

Combine data.frames of different dimensions creating duplicates where needed /r dplyr

I am looking for a way to combine two tables of different dimensions by ID. But the final table should have some duplicated values depending on each table.
Here is a random example:
IDx = c("a", "b", "c", "d")
sex = c("M", "F", "M", "F")
IDy = c("a", "a", "b", "c", "d", "d")
status = c("single", "children", "single", "children", "single", "children")
salary = c(30, 80, 50, 40, 30, 80)
x = data.frame(IDx, sex)
y = data.frame(IDy, status, salary)
Here is x:
IDx sex
1 a M
2 b F
3 c M
4 d F
Here is y:
IDy status salary
1 a single 30
2 a children 80
3 b single 50
4 c children 40
5 d single 30
6 d children 80
I am looking for this:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Basically, sex should be matched to fit the needs of table y. All values in both tables should be used, the actual table is a lot larger. Not all IDs will need to duplicate.
This should be fairly simple, but I cannot find a good answer anywhere online.
Note, I don't want NAs to be introduced.
I am new in R and since I have been focused in dplyr it would help if the example comes from there. It might be simple with base R, too.
UPDATE
The bolded sentences above might be confusing to the final answer. Sorry, it has been a confusing case which I realised should include one extra column that complicates things, but more of that later.
First, I tried to see what is happening on my actual table and to find which suggested answer fits my needs. I removed any problematic columns for the following result. So, I checked this:
dim(x)
> [1] 231 2
dim(y)
> [1] 199 8
# left_join joins matching rows from y to x
suchait <- left_join(x, y, by= c("IDx" = "IDy"))
# inner_join retains only rows in both sets
jdobres <- inner_join(y, anno2, by = c(IDx = "IDy"))
dim(suchait) # actual table used
> [1] 225 9
dim(jdobres)
> [1] 219 9
But why/where do they look different?
This shows the 6 rows that appear in suchait's table but not in jdobres's. The difference comes from the join type: left_join keeps rows of the first table that have no match, while inner_join drops them.
setdiff(suchait, jdobres )
Using dplyr:
library(dplyr)
df <- left_join(x, y, by = c("IDx" = "IDy"))
Your result would be:
IDx sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Or you could do:
df <- left_join(y, x, by = c("IDy" = "IDx"))
It would give:
IDy status salary sex
1 a single 30 M
2 a children 80 M
3 b single 50 F
4 c children 40 M
5 d single 30 F
6 d children 80 F
You can also reorder your columns to get it exactly the way you wanted:
df <- df[, c("IDy", "sex", "status", "salary")]
result:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
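To see why a left join and an inner join can return different numbers of rows, here is a small sketch in base R, where merge(..., all.x = TRUE) plays the role of left_join() and the default merge() plays the role of inner_join(). The data is made up: ID "e" is an assumption added so that one row of x has no match in y.

```r
# Hypothetical data: "e" exists in x but not in y
x <- data.frame(IDx = c("a", "b", "c", "e"), sex = c("M", "F", "M", "F"))
y <- data.frame(IDy = c("a", "a", "b", "c"),
                status = c("single", "children", "single", "children"),
                salary = c(30, 80, 50, 40))

# Left join: every row of x is kept; unmatched rows get NA in y's columns
left <- merge(x, y, by.x = "IDx", by.y = "IDy", all.x = TRUE)

# Inner join: only IDs present in both tables survive
inner <- merge(x, y, by.x = "IDx", by.y = "IDy")

# The extra rows of the left join are exactly the unmatched IDs
unmatched <- left[is.na(left$status), ]
unmatched
```

So if NAs must not be introduced, the inner join is the safer choice; if every row of the first table must be kept, the left join is.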

conditional replace using match of column/row names from other data frame [duplicate]

I have two data frames:
id <- c("a", "b", "c")
a <- 0
b <- 0
c <- 0
df1 <- data.frame(id, a, b, c)
id a b c
1 a 0 0 0
2 b 0 0 0
3 c 0 0 0
num <- c("a", "c", "c")
partner <- c("b", "b", "a")
value <- c("10", "20", "30")
df2 <- data.frame(num, partner, value)
num partner value
1 a b 10
2 c b 20
3 c a 30
I'd like to replace zeroes in df1 with df2$value in every instance df1$id==df2$num & colnames(df1)==df2$partner. So the output should look like:
a <- c(0, 0, 30)
b <- c(10, 0, 20)
c <- c(0, 0, 0)
df.nice <- data.frame(id, a, b, c)
id a b c
1 a 0 10 0
2 b 0 0 0
3 c 30 20 0
I can replace individual cells with the following:
df1$b[df1$id=="a"] <- ifelse(df2$num=="a" & df2$partner=="b", df2$value, 0)
but I need to cycle through all possible df1 row/column combinations for a large data frame. I suspect this involves plyr and match together, but can't quite figure out how.
Update
Thanks to @MikeH., I've turned to using reshape. This seems to work:
df.nice <- melt(df2, id=c("num", "partner"))
df.nice <- dcast(df.nice, num ~ partner, value.var="value")
to produce this:
num a b
1 a <NA> 10
2 c 30 20
I do need all possible row/column combinations, however, with the missing ones represented as zero. Is there a way to ask reshape to obtain rows and columns from another data frame (e.g., df1), or should I bind those after reshaping?
If you want a replace (rather than a reshape) I think a simple base R solution would be to do:
idxs <- t(mapply(cbind, match(df2$num, df1$id), match(df2$partner, names(df1))))
df1[idxs] <- df2$value
df1
id a b c
1 a 0 10 0
2 b 0 0 0
3 c 30 20 0
Note that I build the row/column combination lookups to replace using the t(mapply(...)). When you select like df1[idxs] this converts to matrix (to select specific row/column combinations) and then converts back to data.frame.
I had to read in your data using stringsAsFactors = FALSE so the values would register properly (rather than as factor level codes).
Data:
df2 <- data.frame(num, partner, value, stringsAsFactors = FALSE)
df1 <- data.frame(id, a, b, c, stringsAsFactors = FALSE)
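A slightly shorter variant of the same idea, as a sketch: the two match() results can be bound directly with cbind() into a two-column (row, column) index matrix. Here value is assumed to be stored as numbers rather than strings, so the columns of df1 stay numeric:

```r
id  <- c("a", "b", "c")
df1 <- data.frame(id, a = 0, b = 0, c = 0, stringsAsFactors = FALSE)
df2 <- data.frame(num = c("a", "c", "c"),
                  partner = c("b", "b", "a"),
                  value = c(10, 20, 30),   # numeric here, unlike the question
                  stringsAsFactors = FALSE)

# One row per replacement: row position in df1, column position in df1
idxs <- cbind(match(df2$num, df1$id), match(df2$partner, names(df1)))

# Matrix indexing assigns each value to its (row, column) cell
df1[idxs] <- df2$value
df1
```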

Same function over multiple data frames in R - not over a list of data frames

This issue is almost what I wanted to do, except that the output is given as a list of data frames. Let's reproduce the example from the SE issue mentioned above.
Let's say I have 2 data frames:
df1
ID col1 col2
x 0 10
y 10 20
z 20 30
df2
ID col1 col2
a 0 10
b 10 20
c 20 30
What I want is a 4th column with an ifelse result. My rationale is:
if col1>=20 in any data.frame I could have named with the pattern "df", then the new column res=1, else res=0.
But I want to create a new column in each data.frame with the same name pattern, not put all of those data.frames in a list and apply the function, except if I could "extract" each 3rd dimension of this list back to individual data frames.
Thanks
Per @Frank... if my understanding of what you are looking for is correct, consider using data.table. MWE:
library(data.table)
addcol <- function(x) x[,res:=ifelse(col1>=20,1,0)]
df1 <- data.table(ID=c("x","y","z"),col1=c(0,10,20),col2=c(10,20,30))
df2 <- data.table(ID=c("x","y","z"),col1=c(20,10,20),col2=c(10,20,30))
#modified df2 so you can see different effects
lapply(list(df1,df2),addcol)
> df1
ID col1 col2 res
1: x 0 10 0
2: y 10 20 0
3: z 20 30 1
> df2
ID col1 col2 res
1: x 20 10 1
2: y 10 20 0
3: z 20 30 1
This works because data.table operates by reference on tables, so inside the function you're actually updating the underlying table, not only the scoped reference to the table.
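If you want to stay with plain data frames, a base R sketch of the same idea is to loop over the object names with get() and assign(). This swaps the by-reference update for an explicit write-back. The pattern "^df[0-9]+$" is an assumption about how the data frames are named, and the loop is meant to run in the environment where they live (for example the global workspace):

```r
df1 <- data.frame(ID = c("x", "y", "z"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))
df2 <- data.frame(ID = c("a", "b", "c"), col1 = c(20, 10, 20), col2 = c(10, 20, 30))

# Find every object whose name is "df" followed by digits
for (nm in ls(pattern = "^df[0-9]+$")) {
  d <- get(nm)                          # fetch the data frame by name
  d$res <- ifelse(d$col1 >= 20, 1, 0)   # add the new column
  assign(nm, d)                         # write the modified copy back under the same name
}

df1
df2
```

Each data frame keeps its own name and gains the res column, so nothing has to be extracted from a list afterwards.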
