combining rows in sequence file

combining rows in sequence file - r

I have a data frame, in which each individual has two rows and I want to combine these two row in on row.
Code lines:
dat <- read.table("cbin.csv",sep="\t", row.names=1)
dat
V2 V3 V4 V5
1_1 A B C D
1_2 a b c d
2_1 E F G H
2_2 e f g h
3_1 J K L M
3_2 j k l m
d <- apply( dat[ , colnames(dat) ] , 2 , paste , collapse = " " )
d
V2 V3 V4 V5
"A a E e J j" "B b F f K k" "C c G g L l" "D d H h M m"
But I want to combine each two rows like this
1 A a B b C c D d
2 E e F f G g H h
3 I i J j K k L l
How can I do this?

This will get you more or less the data.frame you want. I just pull out the even rows and cbind them next to the odd rows.
dat2 <- cbind(dat[seq(1, nrow(dat), by = 2), ],
dat[seq(2, nrow(dat), by = 2), ])
I'll leave reordering the columns (or pasting them together, if you want to combine them into individual strings) as an exercise for the reader.

Here are a couple of options:
Option 1: Use stack to get a long data.frame, then use paste within aggregate to get the output you want.
Here's how you make your "long" data.frame.
Long <- cbind(rn = rownames(dat), stack(dat))
head(Long)
# rn values ind
# 1 1_1 A V2
# 2 1_2 a V2
# 3 2_1 E V2
# 4 2_2 e V2
# 5 3_1 J V2
# 6 3_2 j V2
If the values in "dat" are factors, you might need to do:
Long <- cbind(rn = rownames(dat), stack(lapply(dat, as.character)))
Once your data are in a long form, use aggregate along with substr (among other choices) to get the values you need to paste together.
aggregate(values ~ substr(rn, 1, 1), Long, paste, collapse = " ")
# substr(rn, 1, 1) values
# 1 1 A a B b C c D d
# 2 2 E e F f G g H h
# 3 3 J j K k L l M m
An alternative is a similar approach to what #Gregor is suggesting. This is basically an alternative approach to getting every alternate row and binding it, but goes the extra step to reorder and paste the values together.
do.call(paste,
cbind(dat[c(TRUE, FALSE), ],
dat[c(FALSE, TRUE), ])[order(rep(names(dat), 2))])
# [1] "A a B b C c D d" "E e F f G g H h" "J j K k L l M m"

Related

avoiding nested sapply when collapsing variable in data.frame with multiple factors

I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).
In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
> var
A B C D E
1 a c e 1.1601720731 -0.57092435
2 a c f -0.0120178626 1.05003748
3 a d e 0.5311032778 1.67867806
4 a d f -0.3399901000 0.01459940
5 a c e -0.2887561691 -0.03847519
6 a c f 0.0004299922 -0.36695879
7 a d e 0.8124655890 0.05444033
8 a d f -0.3777058654 1.34074427
9 b c e 0.7380720821 0.37708543
10 b c f -0.3163496271 0.10921373
11 b d e -0.5543252191 0.35020193
12 b d f -0.5753686426 0.54642790
13 b c e -1.9973216646 0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729
sapply(4:ncol(var), function(i){
sapply(1:length(levels(var$A)), function(j){
sapply(1:length(levels(var$B)), function(t){
sapply(1:length(levels(var$C)), function(z){
mean(var[var$A == levels(var$A)[j] &
var$B == levels(var$B)[t] &
var$C == levels(var$C)[z],i])
})
})
})
})
[,1] [,2]
[1,] 0.435707952 -0.3046998
[2,] -0.005793935 0.3415393
[3,] 0.671784433 0.8665592
[4,] -0.358847983 0.6776718
[5,] -0.629624791 0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947
Is there a way to do this without this many sapply? maybe with mapply or outer

Maybe just,
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
library(dplyr)
var %>%
group_by(A,B,C) %>%
summarise_if(is.numeric,mean)
(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)

For inline aggregation (keeping same number of rows of data frame), consider ave:
var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))
For full aggregation (collapsed to factor groups), consider aggregate:
aggregate(. ~ A + B + C, var, mean)

I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.
library(data.table)
setDT(var)
var[, lapply(.SD, mean), by = .(A, B, C)]
# A B C D E
# 1: a c e 0.07465822 0.032976115
# 2: a c f 0.40789460 -0.944631574
# 3: a d e 0.72054938 0.039781185
# 4: a d f -0.12463910 0.003363382
# 5: b c e -1.64343115 0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e 0.03937829 0.048136471
# 8: b d f -0.43447899 0.028266455

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain total for the numeric row (it will be best if somehow I could add the Excel total row feature, like I did with firstColumn while writing the table to the workbook object rather than to manually add a total row).
I searched for a solution both in StackOverflow and the official openxslx documentation but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na

library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.

Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
ifelse(!any(!is.numeric(col)), sum(col), NA)
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>

Multiple String Matching in R

Consider A,B,C,D .... as words.
I have two DFs.
df1:
ColA
A B
B C
C D
E F
G H
A M
M
df2:
ColB
A B C D X Y Z
C D M N F K L
S H A F R M T U
Operation:
I want to search all element of df1 in df2 then append all the matching values in a new column OR may be create multiple rows.
Output 1:
ColB COlB
A B C D X Y Z A,A B,B C,C D
C D M N F K L C D,M
S H A F R M T U A,A M
Output2:
ColB Output
A B C D X Y Z A
A B C D X Y Z A B
A B C D X Y Z B C
A B C D X Y Z C D
C D M N F K L C D
C D M N F K L M
S H A F R M T U A
S H A F R M T U A M

I think this will do it, although it differs a bit from your expected answer, which I think is wrong.
First set up the input data frames:
# set up the data
df1 <- data.frame(ColA = c("A B",
"B C",
"C D",
"E F",
"G H",
"A M",
"M"),
stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z",
"C D M N F K L",
"S H A F R M T"),
stringsAsFactors = FALSE)
Next we will form all the pairwise combinations of the things to search with the things to be searched:
# create a vector of patterns and items to search
intermediate <- as.vector(outer(df2$ColB, df1$ColA, paste, sep = "|"))
# split it into a list
intermediate <- strsplit(intermediate, "|", fixed = TRUE)
Then we can create a function to match the elements for each row of this full combination dataset The core is the foundMatch which returns a logical indicating whether all elements in ColA were present in ColB. In your examples, order does not matter, so here we split the elements and look for all of the first to be in the second.
# set up the output data.frame
Output2 <- data.frame(do.call(rbind, intermediate))
names(Output2) <- c("ColB", "Output")
# here is the core, which does the element matching
foundMatch <- apply(Output2, 1, function(x) {
tokens <- strsplit(x, " ", fixed = TRUE)
all(tokens[[2]] %in% tokens[[1]])
})
# filter out the ones with the match
Output2 <- Output2[foundMatch, ]
Output2
## ColB Output
## 1 A B C D X Y Z A B
## 2 C D M N F K L A B
## 3 S H A F R M T A B
## 10 A B C D X Y Z E F
## 14 C D M N F K L G H
## 20 C D M N F K L M
## 21 S H A F R M T M
Not exactly what you have above but I think it's correct.

It is not obvious for me how your data.frames df1 and df2 are built. But you can try to vectorise your data and match both sets.
d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
# get the intersection/difference without duplicates
intersect(d1,d2)
setdiff(d1,d2)
# get all values matching with the first or with the second dataset, respectively
d1[ d1 %in% d2 ]
d2[ d2 %in% d1 ]

split data.frame into list based on row values across columns

I would like to split a data.frame into a list based on row values/characters across all columns of the data.frame.
I wrote lists of data.frames to file using write.list {erer}
So now when I read them in again, they look like this:
dummy data
set.seed(1)
df <- cbind(data.frame(col1=c(sample(LETTERS, 4),"col1",sample(LETTERS, 7))),
data.frame(col2=c(sample(LETTERS, 4),"col2",sample(LETTERS, 7))),
data.frame(col3=c(sample(LETTERS, 4),"col3",sample(LETTERS, 7))))
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
5 col1 col2 col3
6 F M A
7 W R J
8 Y X U
9 P I H
10 N Y K
11 B T M
12 E E Y
And I would like to split into lists by c("col1","col2","col3") producing
[[1]]
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
[[2]]
col1 col2 col3
1 F M A
2 W R J
3 Y X U
4 P I H
5 N Y K
6 B T M
7 E E Y
Feels like it should be straightforward using split, but my attempts so far have failed. Also, as you see, I can't split by a certain row interval.
Any pointers would be highly appreciated, thanks!

Try
lapply(split(d1, cumsum(grepl(names(d1)[1], d1$col1))), function(x) x[!grepl(names(d1)[1], x$col1),])
#$`0`
# col1 col2 col3
#1 G E Q
#2 J R D
#3 N J G
#4 U Y I
#$`1`
# col1 col2 col3
#6 F M A
#7 W R J
#8 Y X U
#9 P I H
#10 N Y K
#11 B T M
#12 E E Y

This should be general, if you want to split if a line is exactly like the colnames:
dfSplit<-split(df,cumsum(Reduce("&",Map("==",df,colnames(df)))))
for (i in 2:length(dfSplit)) dfSplit[[i]]<-dfSplit[[i]][-1,]
The second line can be written a little more R-style as #DavidArenburg suggested in the comments.
dfSplit[-1] <- lapply(dfSplit[-1], function(x) x[-1, ])
It has also the added benefit of doing nothing if dfSplit has length 1 (opposite to my original second line, which would throw an error).

Count unique elements in data frame row and return one with maximum occurrence [duplicate]

This question already has answers here:
How to find mode across variables/vectors within a data row in R
(3 answers)
Closed 9 years ago.
Is it possible to count unique elements in data frame row and return one with maximum occurrence and as result form the vector.
example:
a a a b b b b -> b
c v f w w r t -> w
s s d f b b b -> b

You can use apply to use table function on every row of dataframe.
df <- read.table(textConnection("a a a b b b b\nc v f w w r t\ns s d f b b b"), header = F)
df$result <- apply(df, 1, function(x) names(table(x))[which.max(table(x))])
df
## V1 V2 V3 V4 V5 V6 V7 result
## 1 a a a b b b b b
## 2 c v f w w r t w
## 3 s s d f b b b b

Yes with table
x=c("a", "a", "a", "b" ,"b" ,"b" ,"b")
table(x)
x
a b
3 4
EDIT with data.table
DT = data.table(x=sample(letters[1:5],10,T),y=sample(letters[1:5],10,T))
#DT
# x y
# 1: d a
# 2: c d
# 3: d c
# 4: c a
# 5: a e
# 6: d c
# 7: c b
# 8: a b
# 9: b c
#10: c d
f = function(x) names(table(x))[which.max(table(x))]
DT[,lapply(.SD,f)]
# x y
#1: c c

Note that if you want to keep ALL max's, you need to ask for them explicitly.
You can save them as a list inside the data.frame. If there is only one per row, then the list will be simplified to a common vector
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
With Ties for max:
df2 <- df[, 1:6]
df2$result <- apply(df2, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df2
V1 V2 V3 V4 V5 V6 result
1 a a a b b b 3, 3
2 c v f w w r 2
3 s s d f b b 2, 2
With No Ties for max:
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df
V1 V2 V3 V4 V5 V6 V7 result
1 a a a b b b b 4
2 c v f w w r t 2
3 s s d f b b b 3

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

combining rows in sequence file - r

Related

avoiding nested sapply when collapsing variable in data.frame with multiple factors

R - Adding a total row in Excel output

Multiple String Matching in R

split data.frame into list based on row values across columns

Count unique elements in data frame row and return one with maximum occurrence [duplicate]

Categories

Resources