Union of data frames in R

I have 4 data frames in a list L, like below:
L[[1]]:
V1 V2
B C
A B
Z B
L[[2]]:
V1 V2
B D
A B
Z B
L[[3]]:
V1 V2
Z Y
X Z
N Z
L[[4]]:
V1 V2
Z J
X Z
N Z
These come from graphs with heads C, D, Y, and J.
Obviously, C and D are from the same graph, as are Y and J.
How can I merge C with D, and Y with J, given that these data frames are in a list L?
What I'm thinking is to iterate over the list and compare pairwise: if dfx intersects with dfy, merge them. Can anyone help with the R code?
Edit:
What I'm thinking is like this:
Take the first element and compare it to the second; if they overlap, merge them into the first element, remove the second element, and move on to the next element until the last. Repeat until no remaining element is removed. With this, the list will consist of the remaining merged elements. Does anyone know how to implement this in code?
Output expected :
L[[1]]:
V1 V2
B C
B D
A B
Z B
L[[2]]:
V1 V2
Z Y
Z J
X Z
N Z

Could this be an approach to a solution for you?
# create list of data.frames
ld <- list(
data.frame(V1 = c("B","A","Z"), V2 = c("C","B","B")),
data.frame(V1 = c("B","A","Z"), V2 = c("D","B","B")),
data.frame(V1 = c("Z","X","N"), V2 = c("Y","Z","Z")),
data.frame(V1 = c("Z","X","N"), V2 = c("J","Z","Z"))
)
# suggested solution
union_ld <- data.table::rbindlist(ld)
unique(union_ld)
Results:
V1 V2
1: B C
2: A B
3: Z B
4: B D
5: Z Y
6: X Z
7: N Z
8: Z J
Update 1
Quick hack: two data frames in a list, as requested by the OP. According to the OP's comment, the order of the rows within each result data frame doesn't matter.
list(
unique(data.table::rbindlist(ld[1:2])),
unique(data.table::rbindlist(ld[3:4]))
)
results in:
[[1]]
V1 V2
1: B C
2: A B
3: Z B
4: B D
[[2]]
V1 V2
1: Z Y
2: X Z
3: N Z
4: Z J
The proposed solution combines the first two data frames in the list into one data frame and removes the duplicate rows. This is repeated for the last two data frames in the list. Then the resulting data frames are combined into a list again.
Update 2
This solution uses rbindlist from the package data.table. If you don't like this, the result can be returned as "pure" data frames like this:
library(data.table)
list(
setDF(unique(rbindlist(ld[1:2]))),
setDF(unique(rbindlist(ld[3:4])))
)
Update 3
According to the OP's comment, there are more data frames which need to be combined into several groups.
# set up a list of vectors of numbers of data.frames to combine
dfs_to_combine <- list(c(1:2), c(3:4))
dfs_to_combine
[[1]]
[1] 1 2
[[2]]
[1] 3 4
# now, combine data.frames as specified
library(data.table)
lapply(dfs_to_combine, function(x) setDF(unique(rbindlist(ld[x]))))
[[1]]
V1 V2
1 B C
2 A B
3 Z B
4 B D
[[2]]
V1 V2
1 Z Y
2 X Z
3 N Z
4 Z J
This is just to reproduce your initial example. If you want to combine differently, change the numbers, e.g.,
dfs_to_combine <- list(c(1), c(2, 4), c(3))
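If the grouping is not known in advance, the iterative pairwise comparison described in the question's edit can also be implemented directly. Below is a minimal sketch (merge_by_overlap is a hypothetical helper, not from any package): it treats two data frames as belonging to the same graph when they share at least one row (edge), merges them, and rescans until no remaining pair overlaps.
merge_by_overlap <- function(ld) {
  ld <- lapply(ld, unique)
  i <- 1
  while (i < length(ld)) {
    j <- i + 1
    while (j <= length(ld)) {
      if (nrow(merge(ld[[i]], ld[[j]])) > 0) { # inner join: do they share an edge?
        ld[[i]] <- unique(rbind(ld[[i]], ld[[j]]))
        ld[[j]] <- NULL
        j <- i + 1 # rescan: the merged frame may now overlap frames skipped earlier
      } else {
        j <- j + 1
      }
    }
    i <- i + 1
  }
  ld
}
merge_by_overlap(ld) # returns a two-element list matching the expected output (up to row order)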

Related

adding the frequency column in R without using dplyr

I have a "wide" dataset where for each observation I measure a value from a bunch of categorical variables. It is presented just like this:
V1 V2 V3
a  z  f
a  z  f
b  y  g
b  y  g
a  y  g
b  y  f
This means that V1 has two categories, "a" and "b", V2 has two categories, "z" and "y", and so on. But suppose that I have 30 variables (a much bigger dataset).
I want to obtain a dataset in this form:
V1 V2 V3 Freq
a  z  f  2
b  y  g  2
a  y  g  1
b  y  f  1
How can I get it in R? With smaller datasets I use transform(table(data.frame(data))), but it doesn't work with bigger datasets since it requires building giant tables. Can somebody help, please?
I would like "general" code that does not depend on the variable names, since I will be using it in a function. Moreover, since the datasets will be big, I prefer to avoid the function table.
Thanks
In base R, with interaction:
as.data.frame(table(interaction(df, sep = "", drop = TRUE)))
Or, with table:
subset(data.frame(table(df)), Freq > 0)
# V1 V2 V3 Freq
#2 b y f 1
#3 a z f 2
#5 a y g 1
#6 b y g 2
With dplyr:
library(dplyr)
df %>%
count(V1, V2, V3, name = "Freq")
# V1 V2 V3 Freq
#1 a y g 1
#2 a z f 2
#3 b y f 1
#4 b y g 2
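Since the question asks for "general" code that does not depend on the variable names, the columns can also be selected programmatically inside count; a sketch, assuming dplyr 1.0 or later:
library(dplyr)
df %>%
  count(across(everything()), name = "Freq") # count over all columns, whatever they are named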
I assume your dataset dt contains only categorical variables and Freq represents the number of observations for each unique combination of the categorical variables.
As you want code "without using dplyr", here is an alternative using data.table.
library(data.table)
dt[, Freq:=.N, by=c(colnames(dt))]
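Note that this adds a Freq column to every row without removing the duplicate rows themselves; to reproduce the desired collapsed output, a unique() step can follow (a sketch):
unique(dt) # collapse duplicate rows, keeping one row per combination with its Freq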

R - Merging and aligning two CSVs using common values in multiple columns

I currently have two .csv files that look like this:
File 1:
Attempt         Result
Intervention 1  B
Intervention 2  H
and File 2:
Name      Outcome 1  Outcome 2  Outcome 3
Sample 1  A          B          C
Sample 2  D          E          F
Sample 3  G          H          I
I would like to merge and align the two .csvs such that each row of File 1 is aligned, by its "Result" cell, against any of the three "Outcome" columns in File 2, leaving blanks or NAs where there are no matches.
Ideally, the result would look like this:
Attempt         Result  Name      Outcome 1  Outcome 2  Outcome 3
Intervention 1  B       Sample 1  A          B          C
                        Sample 2  D          E          F
Intervention 2  H       Sample 3  G          H          I
I've looked and only found answers for merging two .csv files with one common column. Any help would be much appreciated.
I will assume that "Result" in File 1 is unique, since multiple File 1 rows with the same result value (e.g., "B") would force us to consider new columns in the final data frame.
With that assumption:
Attempt <- c("Intervention 1", "Intervention 2")
Result <- c("B", "H")
df1 <- as.data.frame(cbind(Attempt, Result))
one <- c("Sample 1", "A", "B", "C")
two <- c("Sample 2", "D", "E", "F")
three <- c("Sample 3", "G", "H", "I")
df2 <- as.data.frame(rbind(one, two, three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name", "Outcome 1", "Outcome 2", "Outcome 3")
vec_at <- rep(NA, nrow(df2)); vec_res <- rep(NA, nrow(df2)) # define NA vectors
for (j in 1:nrow(df2)) {
  a <- which(is.element(df1$Result, df2[j, 2:4])) # rows of df1 whose Result appears in row j of df2
  if (length(a) >= 1) { # "a" may be empty if no element satisfies the condition
    vec_at[j] <- df1$Attempt[a] # record the matching Attempt
    vec_res[j] <- df1$Result[a] # and the matching Result
  }
}
desired_df <- as.data.frame(cbind(vec_at, vec_res, df2)) # the desired data frame
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
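The same matching can also be written without an explicit loop; a minimal sketch under the same assumption that each File 2 row matches at most one Result:
# for each row of df2, find the (at most one) row of df1 whose Result
# appears among that row's Outcome columns
hit <- sapply(seq_len(nrow(df2)), function(j) {
  a <- which(df1$Result %in% unlist(df2[j, -1]))
  if (length(a)) a[1] else NA_integer_
})
cbind(df1[hit, ], df2) # rows of df1 become NA where nothing matched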
I wonder if you could use fuzzyjoin for something like this.
Here, you can provide a custom function for matching between the two data.frames.
library(fuzzyjoin)
fuzzy_left_join(
df2,
df1,
match_fun = NULL,
multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
multi_match_fun = function(x, y) {
y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
}
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H

Group a data.table using a column which is list

I have a really big problem, and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(j), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all the rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this format instead of using a list column. From there, as in @MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
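For completeness, the final aggregation on either expanded table then gives the desired result:
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2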

trying to subset a data table in R by removing items that are in a 2nd table

I have two data frames (from a csv file) in R as such:
df1 <- data.frame(V1 = 1:9, V2 = LETTERS[1:9])
df2 <- data.frame(V1 = 1:3, V2 = LETTERS[1:3])
I convert both to data.table as follows:
dt1 <- data.table(df1, key="V1")
dt2 <- data.table(df2, key="V1")
I want to now return a table that looks like dt1 but without any rows where the key is found in dt2. So in this instance I would like to get back:
4 D
5 E
...
9 I
I'm using the following code in R:
dt3 <- dt1[!dt2$V1]
This works on this example; however, when I try it on a large data set (100k rows) it does not work: it only removes 2 rows, and I know it should be a lot more than that. Is there a limit to this type of operation, or is there something else I haven't considered?
Drop the column name "V1" to do a not-join. The tables are already keyed by V1.
dt3 <- dt1[!dt2]
Because the tables are keyed, you can do this with a "not-join"
dt1 <- data.table(rep(1:3,2), LETTERS[1:6], key="V1")
# V1 V2
# 1: 1 A
# 2: 1 D
# 3: 2 B
# 4: 2 E
# 5: 3 C
# 6: 3 F
dt2 <- data.table(1:2, letters[1:2], key="V1")
# V1 V2
# 1: 1 a
# 2: 2 b
dt1[!.(dt2$V1)]
# V1 V2
# 1: 3 C
# 2: 3 F
According to the documentation, . or J should not be necessary, since the ! alone is enough:
All types of i may be prefixed with !. This signals a not-join or not-select should be performed.
However, the OP's code does not work:
dt1[!(dt2$V1)]
# V1 V2
# 1: 2 B
# 2: 2 E
# 3: 3 C
# 4: 3 F
In this case, dt2$V1 is read as a vector of row numbers, not as part of a join. Looks like this is what is meant by a "not-select", but I think it could be more explicit. When I read the sentence above, for all I know "not-select" and "not-join" are two terms for the same thing.
You could try:
dt1[!(dt1$V1 %in% dt2$V1)]
This assumes that you don't care about ordering.
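As a side note, newer versions of data.table (1.9.6+) also let you spell out the anti-join with an explicit on= clause, which works whether or not the tables are keyed; a minimal sketch:
library(data.table)
dt1 <- data.table(V1 = 1:9, V2 = LETTERS[1:9])
dt2 <- data.table(V1 = 1:3, V2 = LETTERS[1:3])
dt1[!dt2, on = "V1"] # rows of dt1 whose V1 has no match in dt2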

Sampling elements in data frame [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I'm trying to do resampling of the elements of a data frame. I'm open to using other data structures if recommended, but my understanding is that a data frame would be better for combining strings, numbers, etc.
Let's say my input is this data frame:
16 x y z 2
11 a b c 1
.........
And I'd like to build as output another data structure (say, another data frame) like this:
16 x y z
16 x y z
11 a b c
.........
I guess my main issue is the way to append the content, which is in columns df[, 1:4].
Thanks in advance, p.
It's unclear from your description, but your desired output implies that you want to duplicate columns 1:4 according to column 5; this should do the job:
df[rep(seq_len(nrow(df)), df[, 5]), -5]
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
Assuming you're starting with something like:
mydf
# V1 V2 V3 V4 V5
# 1 16 x y z 2
# 2 11 a b c 1
Then, you can just use expandRows from my "splitstackshape" package, like this:
library(splitstackshape)
expandRows(mydf, count = "V5")
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
By default, the function assumes that you are expanding your dataset based on an existing column, but you can just as easily add a numeric vector as the count argument, and set count.is.col = FALSE.
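For example, a minimal sketch passing the counts as an external vector (and dropping the count column first):
library(splitstackshape)
expandRows(mydf[, 1:4], count = c(2, 1), count.is.col = FALSE)
# expands row 1 twice and row 2 once, as before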
If you want to sample n rows with replacement from the data frame df:
df[sample(nrow(df), n, replace=TRUE), ]
