Using the merge command in R for merging depending on column values

So, I have several dataframes like this
1 2 a
2 3 b
3 4 c
4 5 d
3 5 e
......
1 2 j
2 3 i
3 4 t
3 5 r
.......
2 3 t
2 4 g
6 7 i
8 9 t
......
What I want is to merge all of these files into one single file showing the value of the third column for each pair of values in columns 1 and 2, with a 0 if that pair is not present in a file.
So, since there are three files here (there are more), the output for this will be:
1 2 aj0
2 3 bit
3 4 ct0
4 5 d00
3 5 er0
6 7 00i
8 9 00t
......
What I did was combine all my .txt files into a single list L.
Then,
L <- lapply(seq_along(L), function(i) {
  L[[i]][, paste0('DF', i)] <- 1
  L[[i]]
})
This adds a dummy column that will indicate the presence of a value when the data frames are merged.
I don't know how to proceed further. Any inputs will be great. Thanks!

Here is one way to do it with Reduce
# function to generate dummy data
gen_data <- function(){
  data.frame(
    x = 1:3,
    y = 2:4,
    z = sample(LETTERS, 3, replace = TRUE),
    stringsAsFactors = FALSE  # keep z as character; on R < 4.0 a factor would break paste0() below
  )
}
# generate list of data frames to merge
L <- lapply(1:3, function(x) gen_data())
# function to merge by x and y and concatenate z
f <- function(x, y){
  d <- merge(x, y, by = c('x', 'y'), all = TRUE)
  # set merged column to zero if no match is found
  d[['z.x']] <- ifelse(is.na(d[['z.x']]), 0, d[['z.x']])
  d[['z.y']] <- ifelse(is.na(d[['z.y']]), 0, d[['z.y']])
  d$z <- paste0(d[['z.x']], d[['z.y']])
  d['z.x'] <- d['z.y'] <- NULL
  return(d)
}
# merge data frames
Reduce(f, L)
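Reduce folds f over the list from left to right, so for three data frames the call above is equivalent to:
f(f(L[[1]], L[[2]]), L[[3]])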

Related

bind columns with different number of rows

I want to create an iteration that takes a list (which is a column of another data frame) and adds it to the current data frame as a column. But the lengths of the columns are not equal, so I want to fill the unmatched rows with NA.
seq_actions = as.data.frame(x = NA)
for(i in 1:20){
  temp_seq = another_df$c1[some conditions]
  seq_actions = cbind(temp_seq, seq_actions)
}
To simplify, let's say I have
df
1 3
3 4
2 2
Adding the list 5, 6 as a new column to df, I want:
df
1 3 5
3 4 6
2 2 NA
Another list to add is 7 7 7 8, so my df will be:
df
1 3 5 7
3 4 6 7
2 2 NA 7
NA NA NA 8
How can I do it?
Here's one way. The merge function by design will add NA values whenever you combine data frames and no match is found (e.g., if you have fewer values in one data frame than in the other).
If you assume that you're matching your data frames (what rows go together) based on the row number, just output the row number as a column in your data frames. Then merge on that column. Merge will automatically add the NA values you want and deal with the fact that the data frames have different numbers of rows.
#test data frame 1
a <- c(1, 3, 2)
b <- c(3, 4, 2)
dat <- as.data.frame(cbind(a, b))
#test data frame 2 (this one has fewer rows than the first data frame)
c <- c(5, 6)
dat.new <- as.data.frame(c)
#add column to each data frame with row number
dat$number <- row.names(dat)
dat.new$number <- row.names(dat.new)
#merge data frames
#"all = TRUE" will mean that NA values will be added whenever there is no match
finaldata <- merge(dat, dat.new, by = "number", all = TRUE)
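With the sample data above, the merged result should look roughly like this (drop the row-number column afterwards if you don't need it):
finaldata
#   number a b  c
# 1      1 1 3  5
# 2      2 3 4  6
# 3      3 2 2 NA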
If you know the maximum possible size of df and the total number of columns you want to append, you can create df in advance filled with NA values and fill each column in up to its length. Everything after that length stays NA.
e.g.
max_col_num <- 20
max_col_size <- 10 # this could be the number of rows in the largest data frame you have
df <- as.data.frame(matrix(ncol = max_col_num, nrow = max_col_size))
for(i in 1:20){
  temp_seq = another_df$c1[some conditions]
  df[1:length(temp_seq), i] <- temp_seq
}
This would only work if you knew the total possible number of rows and columns.
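Here is a minimal, self-contained sketch of that pre-allocation idea, with a made-up list new_cols standing in for another_df$c1[some conditions]:
new_cols <- list(c(5, 6), c(7, 7, 7, 8))
df <- as.data.frame(matrix(ncol = length(new_cols), nrow = max(lengths(new_cols))))
for (i in seq_along(new_cols)) {
  df[seq_along(new_cols[[i]]), i] <- new_cols[[i]]
}
df
#   V1 V2
# 1  5  7
# 2  6  7
# 3 NA  7
# 4 NA  8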
I think the best option could be to write a custom function based on the nrow of the data frame and the length of the vector/list.
One such function can be written as:
#Function to add vector as column
addToDF <- function(df, v){
  nRow <- nrow(df)
  lngth <- length(v)
  if(nRow > lngth){
    length(v) <- nRow           # pad v with NA up to nrow(df)
  } else if(nRow < lngth){
    df[(nRow+1):lngth, ] <- NA  # add NA rows to df up to length(v)
  }
  cbind(df, v)
}
Let's test the above function with the data.frame provided by the OP.
df <- data.frame(A= c(1,3,2), B = c(3, 4, 2))
v <- c(5,6)
w <-c(7,7,8,9)
addToDF(df, v)
# A B v
# 1 1 3 5
# 2 3 4 6
# 3 2 2 NA
addToDF(df, w)
# A B v
# 1 1 3 7
# 2 3 4 7
# 3 2 2 8
# 4 NA NA 9
Following MKR's response, if you want to add a specific name to the newly added column, you can try:
addToDF <- function(df, v, col_name){
  nRow <- nrow(df)
  lngth <- length(v)
  if(nRow > lngth){
    length(v) <- nRow
  } else if(nRow < lngth){
    df[(nRow+1):lngth, ] <- NA
  }
  df_new <- cbind(df, v)
  colnames(df_new)[ncol(df_new)] <- col_name
  return(df_new)
}
where col_name is the name of the added column.
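Using df and v from the test above, a quick check of the named version (the column name "v_col" is just an example):
addToDF(df, v, "v_col")
#   A B v_col
# 1 1 3     5
# 2 3 4     6
# 3 2 2    NA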

Sum the values of a 2 dimensional table according to labels in R

This follows on from Sum the values according to labels in R.
I've been told that working with 2-dimensional tables is significantly different from working with 1-dimensional ones, for example:
a a,b a,b,c c
d 5 2 1 2
d,e 2 1 1 1
And we want to achieve:
a b c
d 12 5 5
e 4 2 2
So how can this be achieved using R?
A little bit convoluted, but it should work:
m <- as.matrix(data.frame('a' = c(5,2), 'a,b' = c(2,1),
                          'a,b,c' = c(1,1), 'c' = c(2,1),
                          check.names = FALSE, row.names = c('d','d,e')))
colNamesSplits <- strsplit(colnames(m), ',')
rowNamesSplits <- strsplit(rownames(m), ',')
colNms <- unique(unlist(colNamesSplits))
rowNms <- unique(unlist(rowNamesSplits))
colIdxs <- unlist(sapply(1:length(colNamesSplits),
                         function(i) rep.int(i, length(colNamesSplits[[i]]))))
rowIdxs <- unlist(sapply(1:length(rowNamesSplits),
                         function(i) rep.int(i, length(rowNamesSplits[[i]]))))
colIdxsMapped <- unlist(sapply(colNamesSplits, function(n) match(n, colNms)))
rowIdxsMapped <- unlist(sapply(rowNamesSplits, function(n) match(n, rowNms)))
# let's create the fully expanded matrix
expanded <- as.matrix(m[rowIdxs, colIdxs])
rownames(expanded) <- rowNms[rowIdxsMapped]
colnames(expanded) <- colNms[colIdxsMapped]
# aggregate expanded by columns:
expanded <- do.call(cbind, lapply(split(1:ncol(expanded), colnames(expanded)),
                                  function(ii) rowSums(expanded[, ii, drop = FALSE])))
# aggregate expanded by rows:
expanded <- do.call(rbind, lapply(split(1:nrow(expanded), rownames(expanded)),
                                  function(ii) colSums(expanded[ii, , drop = FALSE])))
> expanded
a b c
d 12 5 5
e 4 2 2
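The key step is that indexing a matrix with repeated row/column indices duplicates those rows and columns before the sums are taken. A tiny standalone illustration (m2 is just a made-up example matrix):
m2 <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
m2[c(1, 2, 2), c(1, 1, 2)]
#    c1 c1 c2
# r1  1  1  3
# r2  2  2  4
# r2  2  2  4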

put duplicated rows in different data.frame(s)

Let
x=c(1,2,2,3,4,1)
y=c("A","B","C","D","E","F")
df=data.frame(x,y)
df
x y
1 1 A
2 2 B
3 2 C
4 3 D
5 4 E
6 1 F
How can I put the duplicated rows of this data frame into different data frames,
like this:
df1
x y
1 A
1 F
df2
x y
2 B
2 C
Thank you for your help.
You could use split
split(df, f = df$x)
f = df$x is used to specify the grouping column
check ?split for more details
To keep only the duplicated groups (named as in the question) you could use
mylist = split(df, f = df$x)[as.character(sort(unique(df$x[duplicated(df$x)])))]
names(mylist) = c('df1', 'df2')
list2env(mylist,envir=.GlobalEnv) # to separate the data frames
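If you prefer not to index by the duplicated values at all, an equivalent base R sketch keeps only the groups with more than one row:
mylist <- Filter(function(d) nrow(d) > 1, split(df, df$x))
names(mylist) <- paste0("df", seq_along(mylist))
list2env(mylist, envir = .GlobalEnv)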

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for(i in 1:dim(a)[1]){
  s <- seq(a[i,1], a[i,2])
  df <- rbind(df, data.frame(s, rep(a[i,3], length(s))))
}
colnames(df) <- c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
1. Use mapply to create a list of your ranges from "start" to "end".
2. Use rep + lengths to repeat the "group" column to the expected number of rows.
Unlike the data.table approach, this base R approach won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
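Called on the original a from the question, this should reproduce the 12-row result shown there, just with column names group/values instead of V1/V2 (output truncated with head):
head(myFun(a))
#   group values
# 1     A      1
# 2     A      2
# 3     A      3
# 4     B      2
# 5     B      3
# 6     B      4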
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)

Assign results of apply to multiple columns of data frame

I would like to process all rows in data frame df by applying function f to every row. As function f returns a numeric vector with two elements, I would like to assign the individual elements to new columns in df.
Here are a sample df, a trivial function f returning two elements, and my attempt using apply:
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
  c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work; the results are assigned by column:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15
You could also use Reduce instead of apply, as it is generally more efficient. You just need to slightly modify your function to use cbind instead of c:
f <- function (a, b) {
  cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: this will only work nicely if you have exactly two columns (as in your example); if you have more columns in your data set, run this only on a subset, as in the sketch below.
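For instance, if df had additional columns, you could pass just the two input columns to Reduce (a minimal sketch using the column names from the example):
df[c('apb', 'amb')] <- Reduce(f, df[c('a', 'b')])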
You need to transpose the apply results to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
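As a side note, a mapply-based variant avoids apply's coercion of the whole data frame to a matrix (which can turn rows into character vectors when column types are mixed). A sketch with the same f and df from the question:
df[, c('apb', 'amb')] <- t(mapply(f, df$a, df$b))
df
#   a b apb amb
# 1 1 3   4   3
# 2 2 4   6   8
# 3 3 5   8  15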
