Merge two rows into one column in R

How can I merge two rows into one row, like
...
type1,-2.427,-32.962,-61.097
type2,0.004276057,0.0015271631,-0.005192355
type1,-2.427,-32.962,-60.783
type2,0.0018325958,0.0033597588,-0.0021380284
...
to
type1,-2.427,-32.962,-61.097,type2,0.004276057,0.0015271631,-0.005192355
type1,-2.427,-32.962,-60.783,type2,0.0018325958,0.0033597588,-0.0021380284
or
-2.427,-32.962,-61.097,0.004276057,0.0015271631,-0.005192355
-2.427,-32.962,-60.783,0.0018325958,0.0033597588,-0.0021380284
in GNU R?

Do they always alternate 1/2? If so, you can just use basic indexing. Using this sample data.frame
dd<-structure(list(V1 = structure(c(1L, 2L, 1L, 2L), .Label = c("type1",
"type2"), class = "factor"), V2 = c(-2.427, 0.004276057, -2.427,
0.0018325958), V3 = c(-32.962, 0.0015271631, -32.962, 0.0033597588
), V4 = c(-61.097, -0.005192355, -60.783, -0.0021380284)), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 4L), class = "data.frame")
you can do
cbind(dd[seq(1,nrow(dd), by=2),], dd[seq(2,nrow(dd), by=2),])
# V1 V2 V3 V4 V1 V2 V3 V4
# 1 type1 -2.427 -32.962 -61.097 type2 0.004276057 0.001527163 -0.005192355
# 3 type1 -2.427 -32.962 -60.783 type2 0.001832596 0.003359759 -0.002138028
to include the "type" column or you can do
cbind(dd[seq(1,nrow(dd), by=2),-1], dd[seq(2,nrow(dd), by=2),-1])
# V2 V3 V4 V2 V3 V4
# 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
to leave it off
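Since the indexing approach relies on the rows strictly alternating, it may be worth checking that before binding. The following is only a sketch under that assumption: the data frame mirrors dd from above (with V1 as plain character), and the names odd, even and merged are illustrative, not part of the original answer.

```r
# Sample data mirroring dd from the answer (V1 kept as character here).
dd <- data.frame(
  V1 = c("type1", "type2", "type1", "type2"),
  V2 = c(-2.427, 0.004276057, -2.427, 0.0018325958),
  V3 = c(-32.962, 0.0015271631, -32.962, 0.0033597588),
  V4 = c(-61.097, -0.005192355, -60.783, -0.0021380284),
  stringsAsFactors = FALSE
)

odd  <- seq(1, nrow(dd), by = 2)
even <- seq(2, nrow(dd), by = 2)

# Fail early if the rows do not strictly alternate type1/type2.
stopifnot(
  nrow(dd) %% 2 == 0,
  all(dd$V1[odd] == "type1"),
  all(dd$V1[even] == "type2")
)

merged <- cbind(dd[odd, -1], dd[even, -1])

# cbind leaves duplicated column names; make them unique for later use.
names(merged) <- make.unique(names(merged))
merged
```

If the check fails, one of the grouping-based answers is probably the safer choice.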

Here's an alternative using @MrFlick's sample data:
## Use `ave` to create an indicator variable
dd$ind <- with(dd, ave(as.numeric(V1), V1, FUN = seq_along))
## use `reshape` to "merge" your rows by indicator
reshape(dd, direction = "wide", idvar = "ind", timevar = "V1")
# ind V2.type1 V3.type1 V4.type1 V2.type2 V3.type2 V4.type2
# 1 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 2 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
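To see what the indicator variable looks like on its own, here is a small sketch. It uses seq_along(V1) as the first argument to ave instead of as.numeric(V1), which is an assumption made so that it also works when V1 is a character vector rather than a factor.

```r
V1 <- c("type1", "type2", "type1", "type2")

# Within each level of V1, number the occurrences 1, 2, ...
# The values of the first argument are overwritten by FUN's result,
# so only its length and type matter here.
ind <- ave(seq_along(V1), V1, FUN = seq_along)
ind
```

Rows that share the same ind value are the ones reshape puts side by side.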

You could use split to split the data by type, and then use cbind to bring the pieces back together. The following method drops the first column (dd[-1]) before splitting, and also uses MrFlick's data, dd:
do.call(cbind, split(dd[-1], dd[[1]]))
# type1.V2 type1.V3 type1.V4 type2.V2 type2.V3 type2.V4
# 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
On my machine, I have this as the fastest among the current three answers.

Related

Matching values from two column pairs in different data frames in R [duplicate]

I have two data frames which are edge lists containing columns "source" and "target" as their first two columns and the second data frame includes a third column with edge attributes. The two data frames are not of the same length and I want (1) to retrieve the edges from one data frame that are not in the other and (2) to get the values from the second data frame for matching edges.
Example:
> A <- data.frame(source=c("v1", "v1", "v2", "v2"), target=c("v2", "v4", "v3", "v4"))
> B <- data.frame(source=c("V1", "V2", "v1", "V4", "V4", "V5"), target=c("V2", "V5", "V3", "V3", "V2", "V4"), variable=c(3,4,0,2,1,0))
> A
source target
1 v1 v2
2 v1 v4
3 v2 v3
4 v2 v4
> B
source target variable
1 V1 V2 3
2 V2 V5 4
3 v1 V3 0
4 V4 V3 2
5 V4 V2 1
6 V5 V4 0
desirable outcome (1):
source target
1 V2 V5
2 V1 V3
3 V4 V3
4 V5 V4
desirable outcome (2):
source target variable
1 V1 V2 3
2 V2 V4 1
How can this be achieved with R?
The first you will get with an anti_join, though you will need to anti-join on both combinations of source and target since direction appears not to matter in your example. Note I've had to use toupper because the capitalization in your example was erratic and the example suggested case should be ignored.
library(dplyr)
anti_join(anti_join(B, A %>% mutate_all(toupper),
by = c("source", "target")),
A %>% mutate_all(toupper),
by = c(target = "source", source = "target")) %>%
select(-variable)
#> source target
#> 1 V2 V5
#> 2 v1 V3
#> 3 V4 V3
#> 4 V5 V4
The second result you can get from binding two inner_joins:
bind_rows(inner_join(B, A %>% mutate_all(toupper),
by = c("source", "target")),
inner_join(B, A %>% mutate_all(toupper),
by = c(source = "target", target = "source")))
#> source target variable
#> 1 V1 V2 3
#> 2 V4 V2 1
Using data.table:
# Load data.table and convert the data.frames to data.tables
library(data.table)
setDT(A)
setDT(B)
# If direction doesn't matter sort "source/target"
# Also need to standardise the data format, toupper()
cols <- c("source", "target")
foo <- function(x) paste(toupper(sort(unlist(x))), collapse="-")
A[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(A))]
B[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(B))]
# Do anti-join and inner join
B[!A, .SD, on="oedge", .SDcols=cols]
# source target
# 1: V2 V5
# 2: v1 V3
# 3: V4 V3
# 4: V5 V4
B[A, .SD, on="oedge", .SDcols=c(cols, "variable"), nomatch = NULL]
# source target variable
# 1: V1 V2 3
# 2: V4 V2 1
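The same "undirected edge key" idea can be sketched in base R without extra packages. This is a minimal illustration, assuming (as both answers above do) that case and direction should be ignored; edge_key, in_A, only_in_B and matched are hypothetical names introduced here.

```r
A <- data.frame(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"),
                stringsAsFactors = FALSE)
B <- data.frame(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0),
                stringsAsFactors = FALSE)

# Build a key that ignores case and edge direction:
# upper-case both endpoints, then put the smaller one first.
edge_key <- function(df) {
  s <- toupper(df$source)
  t <- toupper(df$target)
  paste(pmin(s, t), pmax(s, t), sep = "-")
}

in_A <- edge_key(B) %in% edge_key(A)

only_in_B <- B[!in_A, c("source", "target")]  # outcome (1)
matched   <- B[in_A, ]                        # outcome (2)
```

Because the key is computed once per data frame with vectorised pmin/pmax, this avoids a per-row function call.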

How to use a dataset to extract specific columns from another dataset?

Use intersect to find common names between two data sets.
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[,snp.common]
It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
Edit:
Would you know how to do this so that it retains the "i..POP" column in data2?
df1 <- data.frame(ID = letters[1:3],
V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9
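Indexing with a character vector fails with an "undefined columns selected" error if any requested name is absent, so combining the two answers (intersect first, then index) can be safer. The sketch below assumes a plain "ID" column and a made-up missing name "V9" purely for illustration.

```r
df1 <- data.frame(ID = letters[1:3],
                  V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9,
                  stringsAsFactors = FALSE)

# "V9" is a hypothetical name that does not exist in df1.
wanted <- c("V2", "V3", "V9")

keep      <- intersect(wanted, names(df1))  # names present in df1
not_found <- setdiff(wanted, names(df1))    # names that would cause an error

# drop = FALSE keeps a data.frame even if only one column survives.
result <- df1[, c("ID", keep), drop = FALSE]
result
```

Inspecting not_found afterwards shows which requested columns were silently skipped.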

Efficient way to cbind list by groups in data.table

I have a data.frame
data
data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
"ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring",
"class"), row.names = c(NA, -2L), class = "data.frame")
which looks like
#> data
# mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD cat
#2 ASDSDFJSKADDKJSJKDFKSADDLKJFLAK dog
I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of the strings under mystring, with class as the group.
I am doing this using str_locate_all from the stringr package. Here is my attempt
library(data.table)
library(stringr)
setDT(data)[,
cbind(list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,1]),
list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,2])),
by = class]
This gives my desired output
# class V1 V2
#1: cat 8 10
#2: cat 16 18
#3: dog 10 12
Question:
I would like to know whether this is a standard approach or whether it can be done more efficiently. str_locate_all gives the start and end positions of the matched pattern in separate columns, and I am putting them in separate lists to cbind them together within the data.table. Also, how can I specify the column names for the cbinded columns here?
I think you first should reduce your operations per group, so I would first create a substring for all groups at once.
setDT(data)[, submystring := .Internal(substr(mystring, 1L, 20L))]
Then, using the stringi package (I don't like wrappers), you could do (though can't currently vouch for efficiency)
library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
# class V1 V2
# 1: cat 8 10
# 2: cat 16 18
# 3: dog 10 12
Alternatively, you could avoid the matrix and data.table calls per group and instead spread the data after all the locations have been detected
res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(seq_len(.N/2), 2)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12
A third, similar option is to first run stri_locate_all_fixed over the whole data set and only then unlist per group (instead of running both unlist and stri_locate_all_fixed per group)
res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(seq_len(N[1L]), 2)), by = class]
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12
We could change the matrix output from str_locate_all to data.frame and use rbindlist to create the columns.
setDT(data)[,rbindlist(lapply(str_locate_all(substr(mystring, 1, 20),
'ADD'), as.data.frame)) , class]
# class start end
#1: cat 8 10
#2: cat 16 18
#3: dog 10 12
Here's how I did it.
library(stringi)
library(dplyr)
library(magrittr)
data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
"ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring",
"class"), row.names = c(NA, -2L), class = "data.frame")
my_function = function(row)
row$mystring %>%
stri_sub(to = 20) %>%
stri_locate_all_fixed(pattern = "ADD") %>%
extract2(1) %>%
as.data.frame
test =
data %>%
group_by(mystring) %>%
do(my_function(.)) %>%
left_join(data)
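For comparison, the same start/end table can be sketched in base R with gregexpr, which also makes it easy to choose the column names directly. This is a minimal sketch, not a benchmarked alternative; locate_fixed is a hypothetical helper name, and it assumes a fixed (non-regex) pattern.

```r
data <- data.frame(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class = c("cat", "dog"),
  stringsAsFactors = FALSE
)

# Locate a fixed pattern within the first `width` characters of one string.
# Returns zero rows when there is no match (gregexpr reports -1 in that case).
locate_fixed <- function(s, cls, pattern = "ADD", width = 20L) {
  m <- gregexpr(pattern, substr(s, 1L, width), fixed = TRUE)[[1]]
  start <- as.integer(m)
  start <- start[start > 0L]
  data.frame(class = rep(cls, length(start)),
             start = start,
             end   = start + nchar(pattern) - 1L,
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, Map(locate_fixed, data$mystring, data$class))
rownames(res) <- NULL
res
```

The column names are set where the per-string data.frame is built, so no renaming step is needed afterwards.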

How can I divide a dataframe on the basis of values in its columns?

So, I have a dataframe like this,
1 2 110 10 NA NA
2 3 101 100 NA NA
3 4 10 NA NA NA
3 2 110 100 101 NA
.................
Now, I want to divide this dataframe into individual files as 110,10,101,100,10,101..
Each file should contain the first two columns of the rows in which that value appears.
For example,
The file 110 will contain,
1 2
3 2
And the file, 10 will contain,
1 2
3 4
Like this, I want to divide it. I know how to divide it on the basis of only column value, but since the file contains multiple columns, I don't know how to do it?
Any help would be appreciated.
The code I was able to write to split on a single column and then create the text files was,
X <- split(myFile, myFile[, 4])
invisible(lapply(names(X), function(y)
write.table(X[[y]], file = paste0(y, ".txt"))))
Make the dataset in long rather than wide form, then split it:
vals <- apply(dat[3:6], 1, function(x) x[!is.na(x)] )
df <- cbind(dat[1:2][rep(rownames(dat), sapply(vals,length)),], val=unlist(vals))
split(df, df$val)
#$`10`
# V1 V2 val
#1.1 1 2 10
#3 3 4 10
#
#$`100`
# V1 V2 val
#2.1 2 3 100
#4.1 3 2 100
#
#$`101`
# V1 V2 val
#2 2 3 101
#4.2 3 2 101
#
#$`110`
# V1 V2 val
#1 1 2 110
#4 3 2 110
You may do it this way:
Here dat is your data.frame.
dat110 <- dat[which(dat[, 3:4] == 110, arr.ind = TRUE)[, 1], 1:2]
First, which(dat[, 3:4] == 110, arr.ind = TRUE) finds the array indices at which columns 3:4 hold the value 110 (change 3:4 to the indices of your own columns).
Next, the [, 1] following which(...) keeps only the row indices.
Finally, dat[which(...), 1:2] picks the first two columns of dat, restricted to the rows selected in the previous step.
You could use for loop to change condition value, i.e. 110.
My example:
dat <- data.frame(x=1:3,y=2:4,z=0:2,w=2:4)
for(i in unique(unlist(dat[,3:4])))
{
tmp <- dat[which(dat[, 3:4] == i, arr.ind = T)[, 1], 1:2]
print(i)
print(tmp)
}
You could also try:
library(dplyr)
library(tidyr)
dat1 <- dat %>%
mutate(indx=row_number()) %>%
gather(Var, Val, V3:V6) %>%
filter(!is.na(Val))%>%
arrange(Val, indx) %>%
select(-indx, -Var)
lst1 <- split(dat1, dat1$Val)
lst1
#$`10`
# V1 V2 Val
#1 1 2 10
#2 3 4 10
#$`100`
# V1 V2 Val
#3 2 3 100
#4 3 2 100
#$`101`
# V1 V2 Val
#5 2 3 101
#6 3 2 101
#$`110`
# V1 V2 Val
#7 1 2 110
#8 3 2 110
If you need them as individual datasets in the global environment, one option is list2env, or you could use assign (though this is not recommended, as it will create a lot of objects in the global environment). Instead, you could do all the necessary calculations within the list itself and use lapply with write.table/write.csv to save the pieces as individual files. But if you do need individual datasets:
list2env(setNames(lst1, paste("dat", names(lst1), sep="_")), envir=.GlobalEnv)
<environment: R_GlobalEnv>
dat_10
# V1 V2 Val
#1 1 2 10
#2 3 4 10
data
dat <- structure(list(V1 = c(1L, 2L, 3L, 3L), V2 = c(2L, 3L, 4L, 2L),
V3 = c(110L, 101L, 10L, 110L), V4 = c(10L, 100L, NA, 100L
), V5 = c(NA, NA, NA, 101L), V6 = c(NA, NA, NA, NA)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6"), class = "data.frame", row.names = c(NA,
-4L))
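The file-writing step that the question asks for can be sketched in base R as follows. This is only an illustration: it writes into a temporary directory (out_dir is a made-up location), stacks columns V3 to V6 by hand, and keeps only the first two columns in each file.

```r
dat <- data.frame(V1 = c(1L, 2L, 3L, 3L), V2 = c(2L, 3L, 4L, 2L),
                  V3 = c(110L, 101L, 10L, 110L), V4 = c(10L, 100L, NA, 100L),
                  V5 = c(NA, NA, NA, 101L), V6 = c(NA, NA, NA, NA))

# Stack V3:V6 into one long column; repeat V1/V2 once per stacked column.
long <- data.frame(dat[rep(seq_len(nrow(dat)), times = 4), 1:2],
                   val = unlist(dat[3:6], use.names = FALSE))
long <- long[!is.na(long$val), ]

# One data frame of (V1, V2) pairs per distinct value.
groups <- split(long[c("V1", "V2")], long$val)

# Hypothetical output location; replace with a real directory as needed.
out_dir <- file.path(tempdir(), "split_files")
dir.create(out_dir, showWarnings = FALSE)

for (nm in names(groups)) {
  write.table(groups[[nm]],
              file = file.path(out_dir, paste0(nm, ".txt")),
              row.names = FALSE, col.names = FALSE)
}
```

Keeping everything in the groups list until the final loop avoids creating one global object per value.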

Merging different columns with the same name into single columns

I have a data.frame in which some columns have the same name. Now I want to merge/add up these columns into single columns. So for example I want to turn....
v1 v1 v1 v2 v2
1 0 2 4 1
3 1 1 1 0
...into...
v1 v2
3 5
5 1
I only found threads dealing with two data.frames supposed to be merged into one but none dealing with this (rather simple?) problem.
The data can be recreated with this:
df <- structure(list(v1 = c(1L, 3L), v1 = 0:1, v1 = c(2L, 1L),
v2 = c(4L, 1L), v2 = c(1L, 0L)),
.Names = c("v1", "v1", "v1", "v2", "v2"),
class = "data.frame", row.names = c(NA, -2L))
as.data.frame(lapply(split.default(df, names(df)), function(x) Reduce(`+`, x)))
produces:
v1 v2
1 3 5
2 5 1
split.default(...) breaks up the data frame into groups with equal column names, then we use Reduce on each of those groups to sum the values of each column in the group iteratively until there is only one column left per group (see ?Reduce, that is what the function does), and finally we convert back to data frame with as.data.frame.
We have to use split.default because split (or really, split.data.frame, which it would dispatch to) splits on rows, not columns.
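A closely related variant replaces Reduce with rowSums on each group of columns. This is a sketch of the same split.default idea, not a different algorithm; the name summed is introduced here for illustration.

```r
df <- data.frame(c(1L, 3L), 0:1, c(2L, 1L), c(4L, 1L), c(1L, 0L))
names(df) <- c("v1", "v1", "v1", "v2", "v2")

# split.default groups the columns by name; rowSums adds each group row-wise.
# sapply collects the per-group sums into a matrix with one column per name.
summed <- sapply(split.default(df, names(df)), rowSums)
as.data.frame(summed)
```

Unlike `+`, rowSums accepts na.rm = TRUE, which may matter if the duplicated columns contain missing values.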
You can do this quite easily with melt and dcast from "reshape2". Since there's no "id" variable, I've used melt(as.matrix(df)) instead of melt(df, id.vars="id"). This automatically creates a long version of your data that has "Var1" as representing your rownames and "Var2" as your colnames. Using that knowledge, you can do:
library(reshape2)
dcast(melt(as.matrix(df)), Var1 ~ Var2,
value.var = "value", fun.aggregate=sum)
# Var1 v1 v2
# 1 1 3 5
# 2 2 5 1
