I have a "wide" dataset where for each observation I measure a value from a bunch of categorical variables. It is presented just like this:
V1 V2 V3
a  z  f
a  z  f
b  y  g
b  y  g
a  y  g
b  y  f
This means that V1 has two categories, "a" and "b", V2 has two categories, "z" and "y", and so on. But suppose I have 30 variables (a much bigger dataset).
I want to obtain a dataset in this form
V1 V2 V3 Freq
a  z  f  2
b  y  g  2
a  y  g  1
b  y  f  1
How can I get this in R? With smaller datasets I use transform(table(data.frame(data))), but it doesn't work with bigger datasets since it requires building giant tables. Can somebody help, please?
I would like "general" code that does not depend on the variable names, since I will be using it in a function. Moreover, since the datasets will be big, I would prefer to do it without the table function.
Thanks
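For reference, here is a reproducible version of the example data, assuming it is stored in a data frame called df (the name the answers below use):
df <- data.frame(
  V1 = c("a", "a", "b", "b", "a", "b"),
  V2 = c("z", "z", "y", "y", "y", "y"),
  V3 = c("f", "f", "g", "g", "g", "f")
)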
In base R, with interaction:
as.data.frame(table(interaction(df, sep = "", drop = TRUE)))
Or, with table:
subset(data.frame(table(df)), Freq > 0)
# V1 V2 V3 Freq
#2 b y f 1
#3 a z f 2
#5 a y g 1
#6 b y g 2
With dplyr:
library(dplyr)
df %>%
count(V1, V2, V3, name = "Freq")
# V1 V2 V3 Freq
#1 a y g 1
#2 a z f 2
#3 b y f 1
#4 b y g 2
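Since the question asks for code that does not depend on the variable names, a hedged variant of the same count call (assuming dplyr >= 1.0, which provides across()):
# count every combination without hard-coding the column names
df %>%
  count(across(everything()), name = "Freq")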
I assume your dataset dt contains only categorical variables and Freq represents the number of observations for each unique combination of the categorical variables.
As you want code "without using dplyr", here is an alternative using data.table.
library(data.table)
dt[, Freq := .N, by = colnames(dt)]
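If you would rather have one row per combination instead of a Freq value repeated on every row, a small sketch run on the original dt (before Freq is added by reference):
# one row per unique combination, with its count
collapsed <- dt[, .(Freq = .N), by = names(dt)]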
I currently have two .csv files that look like this:
File 1:
Attempt        Result
Intervention 1 B
Intervention 2 H
and File 2:
Name     Outcome 1 Outcome 2 Outcome 3
Sample 1 A         B         C
Sample 2 D         E         F
Sample 3 G         H         I
I would like to merge and align the two .csvs such that each row of File 1 is aligned by its "Result" cell against any of the three "Outcome" columns in File 2, leaving blanks or NAs if there are no matches.
Ideally, it would look like this:
Attempt        Result Name     Outcome 1 Outcome 2 Outcome 3
Intervention 1 B      Sample 1 A         B         C
                      Sample 2 D         E         F
Intervention 2 H      Sample 3 G         H         I
I've looked and only found answers when merging two .csv files with one common column. Any help would be very appreciated.
I will assume that Result in File 1 is unique, since multiple File 1 rows with the same Result value (e.g., "B") would force us to consider new columns in the final data frame.
With that assumption:
Attempt <- c("Intervention 1","Intervention 2")
Result <- c("B","H")
df1 <- as.data.frame(cbind(Attempt,Result))
one <- c("Sample 1","A","B","C")
two <- c("Sample 2","D","E","F")
three <- c("Sample 3","G","H","I")
df2 <- as.data.frame(rbind(one,two,three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name","Outcome 1","Outcome 2","Outcome 3")
vec_at <- rep(NA, nrow(df2)); vec_res <- rep(NA, nrow(df2)) # define NA vectors
for (j in 1:nrow(df2)){
  a <- which(is.element(df1$Result, df2[j, 2:4])) # which rows of df1 have a Result appearing in this row of df2?
  if (length(a) >= 1){ # "a" may be empty if no element satisfies the condition
    vec_at[j] <- df1$Attempt[a] # fill in the matching Attempt
    vec_res[j] <- df1$Result[a] # and the matching Result
  }
}
desired_df <- as.data.frame(cbind(vec_at, vec_res, df2)) # build the desired data frame
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
I wonder if you could use fuzzyjoin for something like this.
Here, you can provide a custom function for matching between the two data.frames.
library(fuzzyjoin)
fuzzy_left_join(
  df2,
  df1,
  match_fun = NULL,
  multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
  multi_match_fun = function(x, y) {
    y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
  }
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H
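For reproducibility, these are the data frames the fuzzyjoin example appears to assume (note the underscores in the names and values, which differ from the original files):
df1 <- data.frame(Attempt = c("Intervention_1", "Intervention_2"),
                  Result = c("B", "H"))
df2 <- data.frame(Name = paste0("Sample_", 1:3),
                  Outcome_1 = c("A", "D", "G"),
                  Outcome_2 = c("B", "E", "H"),
                  Outcome_3 = c("C", "F", "I"))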
I am trying to "tidy" a large dataset where multiple different types of data are merged in columns, with some data in column names. This is a common scenario in biological datasets.
My data table has replicate measurements which I want to collapse into a mean. Converting the data into tidy format, these replicate values become additional rows. If I try to aggregate/group by several columns and calculate the mean of the replicates:
collapsed.data <- tidy.dt[, mean(expression, na.rm = T), by=list(Sequence.window,Gene.names,ratio,enrichment.type,condition)]
I get a resulting table that has only the columns used in the by statement, followed by mean(expression) as column V1. Is it possible to keep all the other (unchanged) columns as well?
A minimalist example showing what I am trying to achieve is as follows:
library(data.table)
dt <- data.table(a = c("a", "a", "b", "b", "c", "a", "c", "a"), b = rnorm(8),
c = c(1,1,1,1,1,2,1,2), d = rep('x', 8), e = rep('test', 8))
dt[, mean(b), by = list(a, c)]
# a c V1
#1: a 1 -0.7597186
#2: b 1 -0.3001626
#3: c 1 -0.6893773
#4: a 2 -0.1589146
As you can see, the columns d and e are dropped.
One possibility is to include d and e in the grouping:
res <- dt[, mean(b), by = list(a, c, d, e)]
res
# a c d e V1
#1: a 1 x test 0.9271986
#2: b 1 x test -0.3161799
#3: c 1 x test 1.3709635
#4: a 2 x test 0.1543337
If you want to keep all columns except the one you want to aggregate, you can do this in a more programmatic way:
cols_to_group_by <- setdiff(colnames(dt), "b")
res <- dt[, mean(b), by = cols_to_group_by]
The result is the same as above.
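If you prefer a descriptive column name over the default V1, a minor variant (the name mean_b is just an example):
res <- dt[, .(mean_b = mean(b)), by = cols_to_group_by]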
This reduces the number of rows. If you want to keep all rows instead, you can add an additional column:
dt[, mean_b := mean(b), by = list(a, c)]
dt
# a b c d e mean_b
#1: a 1.1127632 1 x test 0.9271986
#2: a 0.7416341 1 x test 0.9271986
#3: b 0.9040880 1 x test -0.3161799
#4: b -1.5364479 1 x test -0.3161799
#5: c 1.9846982 1 x test 1.3709635
#6: a 0.2615139 2 x test 0.1543337
#7: c 0.7572287 1 x test 1.3709635
#8: a 0.0471535 2 x test 0.1543337
Here, dt is modified by reference, i.e., without copying all of dt, which might save time on large data.
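If you later want to collapse to one row per group while keeping d and e, a sketch that assumes those columns are constant within each group:
# drop the raw values and keep one row per combination
unique(dt[, !"b"])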
I have 4 dataframes in a list L like below:
L[[1]]:
V1 V2
B C
A B
Z B
L[[2]]:
V1 V2
B D
A B
Z B
L[[3]]:
V1 V2
Z Y
X Z
N Z
L[[4]]:
V1 V2
Z J
X Z
N Z
These come from graphs with the heads C, D, Y, and J.
Obviously, C and D are from the same graph, as are Y and J.
How can I merge C with D and Y with J, given that these data frames are in a list L?
What I'm thinking is to iterate over the list and do pairwise comparisons: if dfx intersects with dfy, merge them. Can anyone help with the R code?
Edit:
What I'm thinking is like this:
Take the first element and compare it to the second; if they intersect, merge them, save the result to the first element, remove the second element, and move on to the next element until the last one. Repeat until no remaining element can be removed. With this, the list will consist of the remaining, merged elements. Does anyone know how to implement this in code?
Output expected :
L[[1]]:
V1 V2
B C
B D
A B
Z B
L[[2]]:
V1 V2
Z Y
Z J
X Z
N Z
Could this be an approach to a solution for you?
# create list of data.frames
ld <- list(
  data.frame(V1 = c("B","A","Z"), V2 = c("C","B","B")),
  data.frame(V1 = c("B","A","Z"), V2 = c("D","B","B")),
  data.frame(V1 = c("Z","X","N"), V2 = c("Y","Z","Z")),
  data.frame(V1 = c("Z","X","N"), V2 = c("J","Z","Z"))
)
# suggested solution
union_ld <- data.table::rbindlist(ld)
unique(union_ld)
Results:
V1 V2
1: B C
2: A B
3: Z B
4: B D
5: Z Y
6: X Z
7: N Z
8: Z J
Update 1
Quick hack: two data frames in a list, as requested by the OP. According to the OP's comment, the order of the rows within each result data frame doesn't matter.
list(
unique(data.table::rbindlist(ld[1:2])),
unique(data.table::rbindlist(ld[3:4]))
)
results in:
[[1]]
V1 V2
1: B C
2: A B
3: Z B
4: B D
[[2]]
V1 V2
1: Z Y
2: X Z
3: N Z
4: Z J
The proposed solution combines the first two data frames in the list into one data frame and removes the duplicate rows. This is repeated for the last two data frames in the list. Then, the resulting data frames are combined into a list again.
Update 2
This solution uses rbindlist from package data.table. If you don't like this, the result can be returned as "pure" data frames like this
library(data.table)
list(
setDF(unique(rbindlist(ld[1:2]))),
setDF(unique(rbindlist(ld[3:4])))
)
Update 3
According to the OP's comment, there are more data frames which need to be combined into several groups.
# set up a list of vectors of numbers of data.frames to combine
dfs_to_combine <- list(c(1:2), c(3:4))
dfs_to_combine
[[1]]
[1] 1 2
[[2]]
[1] 3 4
# now, combine data.frames as specified
library(data.table)
lapply(dfs_to_combine, function(x) setDF(unique(rbindlist(ld[x]))))
[[1]]
V1 V2
1 B C
2 A B
3 Z B
4 B D
[[2]]
V1 V2
1 Z Y
2 X Z
3 N Z
4 Z J
This just reproduces your initial example. If you want to combine the data frames differently, change the numbers, e.g.,
dfs_to_combine <- list(c(1), c(2, 4), c(3))
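If the groups are not known in advance, here is a hedged sketch of how dfs_to_combine could be derived automatically by grouping data frames that share at least one identical row, as the question describes. Note that a single pass like this does not merge groups that only become connected later, which is enough for this example:
# do two data frames share at least one identical row?
shares_row <- function(x, y) nrow(merge(x, y)) > 0
groups <- list()
for (i in seq_along(ld)) {
  # index of the first existing group that shares a row with ld[[i]], if any
  hit <- which(vapply(groups, function(idx)
    any(vapply(idx, function(k) shares_row(ld[[i]], ld[[k]]), logical(1))),
    logical(1)))
  if (length(hit)) {
    groups[[hit[1]]] <- c(groups[[hit[1]]], i)
  } else {
    groups[[length(groups) + 1]] <- i
  }
}
dfs_to_combine <- groups
# for the example data this gives list(c(1, 2), c(3, 4))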
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I'm trying to do resampling of the elements of a data frame. I'm open to using other data structures if recommended, but my understanding is that a data frame would be better for combining strings, numbers, etc.
Let's say my input is this data frame:
16 x y z 2
11 a b c 1
.........
And I'd like to build as output another data structure (I assume another data frame) like this:
16 x y z
16 x y z
11 a b c
.........
I guess my main issue is the way to append the content, which is on columns df[,1:4].
Thanks in advance, p.
It's unclear from your description, but your desired output implies that you want to duplicate columns 1:4 according to column 5. This should do the job:
df[rep(seq_len(nrow(df)), df[, 5]), -5]
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
Assuming you're starting with something like:
mydf
# V1 V2 V3 V4 V5
# 1 16 x y z 2
# 2 11 a b c 1
Then, you can just use expandRows from my "splitstackshape" package, like this:
library(splitstackshape)
expandRows(mydf, count = "V5")
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
By default, the function assumes that you are expanding your dataset based on an existing column, but you can just as easily add a numeric vector as the count argument, and set count.is.col = FALSE.
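The count.is.col = FALSE route would look roughly like this (a sketch; here the counts are supplied as an external vector rather than read from column V5):
library(splitstackshape)
expandRows(mydf[, 1:4], count = c(2, 1), count.is.col = FALSE)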
If you want to sample n rows with replacement from the data frame df:
df[sample(nrow(df), n, replace=TRUE), ]
I would like to create a new data frame that borrows an ID variable from another data frame. The data frame I would like to merge has repeated observations in the ID column which is causing me some problems.
DF1<-data.frame(ID1=rep(c("A","B", "C", "D", "E") , 2), X1=rnorm(10))
DF2<-data.frame(ID1=c("A", "B", "C", "D", "E"), ID2=c("V","W","X","Y" ,"Z"), X2=rnorm(5), X3=rnorm(5))
What I would like is to append DF2$ID2 onto DF1 by the ID1 column. My goal is something that looks like this (I do not want DF2$X2 and DF2$X3 in the 'Goal' data frame):
Goal<-data.frame(ID2=DF2$ID2, DF1)
I have tried merge, but it complains because DF1$ID1 is not unique. I know R can handle this in one line of code, but I can't seem to make the functions I know work. Any help would be greatly appreciated!
There should be no problem with a simple merge. Using your sample data
merge(DF1, DF2[,c("ID1","ID2")], by="ID1")
produces
ID1 X1 ID2
1 A 0.03594331 V
2 A 0.42814900 V
3 B -2.17161263 W
4 B -0.33403550 W
5 C 0.95407844 X
6 C -0.23186723 X
7 D 0.46395514 Y
8 D -1.49919961 Y
9 E -0.20342430 Z
10 E -0.49847569 Z
You could also use left_join from the dplyr package:
library(dplyr)
left_join(DF1, DF2[,c("ID1", "ID2")])
# ID1 X1 ID2
#1 A -1.20927237 V
#2 B -0.03003128 W
#3 C -0.75799708 X
#4 D 0.53946986 Y
#5 E -0.52009921 Z
#6 A 1.15822659 V
#7 B -0.91976194 W
#8 C 0.74620142 X
#9 D -2.46452560 Y
#10 E 0.80015219 Z
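Without a by argument, left_join joins on all common column names (here just ID1) and prints a message saying so; to make the key explicit, a minor variant of the same call:
left_join(DF1, DF2[, c("ID1", "ID2")], by = "ID1")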