I want to transpose a column into several smaller parts based on another column's values, e.g.
1 ID1 V1
2 ID1 V2
3 ID1 V3
4 ID2 V4
5 ID2 V5
6 ID3 V6
7 ID3 V7
8 ID3 V8
9 ID3 V9
I wish to have all V values for each ID in one row, e.g.
ID1 V1 V2 V3
ID2 V4 V5
ID3 V6 V7 V8 V9
Each ID has a different number of rows to transpose, as shown in the example. If it is easier to use the serial-number column to perform this, that is fine too.
Can anyone help?
Here is a simple awk one-liner that does the trick:
awk '{if ($2 in a) a[$2] = a[$2] " " $3; else a[$2] = $3} END {for (i in a) print i, a[i]}' file.txt
Output:
ID1 V1 V2 V3
ID2 V4 V5
ID3 V6 V7 V8 V9
If you like coding in JavaScript, this is how to do it on the command line using jline: https://github.com/bitdivine/jline/
mmurphy@violet:~$ cat ,,, | jline-foreach 'begin::global.all={}' line::'fields=record.split(/ +/);if(fields.length==3)tm.incrementPath(all,fields.slice(1))' end::'tm.find(all,{maxdepth:1},function(path,val){console.log(path[0],Object.keys(val).join(","));})'
ID1 V1,V2,V3
ID2 V4,V5
ID3 V6,V7,V8,V9
where the input is:
mmurphy@violet:~$ cat ,,,
1 ID1 V1
2 ID1 V2
3 ID1 V3
4 ID2 V4
5 ID2 V5
6 ID3 V6
7 ID3 V7
8 ID3 V8
9 ID3 V9
mmurphy@violet:~$
Explanation: This builds a tree where the first level of branches is the user ID and the second is the V (version?). You could do this for any number of levels. The leaves are just counters. First we create an empty tree:
'begin::global.all={}'
Then each line that comes in is split into counter, ID, and version number. The counter is sliced off, leaving just the array [userID, version]. incrementPath creates those branches in the tree, a bit like mkdir -p, and increments the leaf counter, although you don't actually need to know how often each (userID, version) combination has been seen:
line::'fields=record.split(/ +/);if(fields.length==3)tm.incrementPath(all,fields.slice(1))' end::'tm.find(all,{maxdepth:1},function(path,val){console.log(path[0],Object.keys(val).join(","));})'
At the end we have tm.find, which behaves just like UNIX find and prints every path in the tree, except that we limit the depth of the search to the desired breakdown (1, but if you're like me you'll be wanting a breakdown of 2, 3, 5, or 8 variables next). That way you have separated the breakdown from your list of values, and you can print your answer.
If you are never going to need deeper breakdowns you will probably want to stick with awk, as it's probably preinstalled.
I have a dataset in R that I am trying to subset into a second data frame.
I'm not really sure it's relevant, but just in case, the data is something along the lines of this:
V1 V2 V3 V4 V5 V6
ab 10 98 0.9 0.1 abc
cd 11 99 0.8 0.05 cde
So I was trying to subset it by doing the following:
df_new = data.frame(data$V2, data$V5, data$V6)
This has actually worked in the past so I didn't think anything of using it here, but for some reason, the output of this was:
data.V2 data.V5 data.V6
10 0.1 abc
11 0.05 cde
So, for some reason the function was adding the name of the original data frame to the column names when I was subsetting it. I checked the documentation and couldn't see an option for preventing this (I just want to keep the original names). So I'm not really sure what exactly was going wrong here.
When you use, e.g., data$V2, you get a bare vector that doesn't carry a name:
data$V2
# [1] 10 11
So, this kind of behaviour is expected. The best option would probably be
data[, c("V2", "V5", "V6")]
# V2 V5 V6
# 1 10 0.10 abc
# 2 11 0.05 cde
or, if you want to stick with data.frame,
with(data, data.frame(V2, V5, V6))
# V2 V5 V6
# 1 10 0.10 abc
# 2 11 0.05 cde
Something longer, but with the possibility of assigning any names you like, would be
data.frame(A = data$V2, B = data$V5, C = data$V6)
# A B C
# 1 10 0.10 abc
# 2 11 0.05 cde
or
with(data, data.frame(A = V2, B = V5, C = V6))
I have a file that is laid out in the following way:
# Query ID 1
# note
# note
tab delimited data across 12 columns
# Query ID 2
# note
# note
tab delimited data across 12 columns
I'd like to import this data into R so that each query is its own data frame, ideally as a list of data frames with the query ID as the name of each item in the list. I've been searching for a while, but I haven't seen a good way to do this. Is this possible?
Thanks
We have used commas instead of tabs to make the example easier to read, and have put the body of the file in a string, but aside from those obvious changes, try this. First we use readLines to read in the file, then determine where the headers are and create a grp vector with one element per line of the file, whose values are the header for that line. Finally we split the lines by group and apply Read to each group.
# test data
Lines <- "# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12"
L <- readLines(textConnection(Lines)) # L <- readLines("myfile")
isHdr <- grepl("Query", L)
grp <- L[isHdr][cumsum(isHdr)]
# Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, comment = "#")
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment = "#")
Map(Read, split(L, grp))
giving:
$`# Query ID 1`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
$`# Query ID 2`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
No packages needed.
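Outside R, the same header-tagging idea can be sketched in plain awk: remember the most recent "# Query" header, drop the other comment lines, and prefix each data row with its query ID. This is a rough command-line equivalent, not a drop-in replacement for the list-of-data-frames result:

```shell
# Recreate the comma-separated sample file from the answer above
cat > myfile <<'EOF'
# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
EOF

# Keep the latest "# Query" header in hdr, skip other comments,
# and tag each data line with its query header.
awk '/^# Query/ {hdr = $0; next}
     /^#/      {next}
     {print hdr ": " $0}' myfile
```

Each of the four data lines comes out prefixed with its query header, so downstream tools can group on that field.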
I have this line in one of my functions, result[result > 0.05] <- "", that replaces all values in my data frame greater than 0.05, including the row names in the first column. How do I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
I have a data.frame with 16 columns. Here's one example row.
> data[16,]
V1 V2 V3 V4
16 comp27182_c0_seq4 ENSP00000442096 ENSG00000011143 ENSFCAP00000011376
V5 V6 V7 V8
16 ENSFCAG00000012261 comp48601_c0_seq1 comp19130_c0_seq3 comp22796_c2_seq3
V9 V10 V11 V12
16 comp146901_c0_seq1 comp157916_c0_seq1 comp158124_c0_seq1
V13 V14 V15 V16
16 comp229797_c0_seq1 comp61875_c0_seq2
I'm only interested in columns 1 and 6-16. The first column contains the name I would like to use as a column name in the matrix, 6 to 16 may contain either a string or '' (nothing).
I would like to transform this data.frame into a matrix showing 1 or 0, reflecting the content in columns 6-16.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
comp27182_c0_seq4 1 1 1 1 0 1 1 1 1 0 0
I've been trying to use a mask without success. I'm sure there's a very easy option out there.
Thanks for any help.
Try this:
do.call(cbind, lapply(c(1,6:16),
function(x) as.numeric(nchar(as.character(data[,x])) > 0)))
I slightly modified your code to my exact needs. Now the first column is naming the rows.
a<-do.call(cbind, lapply(c(6:16),
function(x) as.numeric(nchar(as.character(data[,x])) > 0)))
rownames(a)<-data[,1]
It works great, thanks!
I have a simulation dataset that explores a parameter space, and each set of parameters is run multiple times (iterations). It looks like so:
p1 p2 p3 iteration result
=================================
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36
...
As can be seen from this example, both (v3, v2, v1) and (v2, v1, v3) are run twice. I am trying to extract only the rows with the max result for each parameter setting; in this example, only rows 3 and 4 should be kept, as they represent the best results from those parameter sets. Is there an easy way to accomplish that in R? Thanks
df <- read.table(textConnection("p1 p2 p3 iteration result
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36"), header = T)
library(plyr)
ddply(df, .(p1, p2, p3), function(x) x[x$result == max(x$result), ])
p1 p2 p3 iteration result
1 v2 v1 v3 1 29.36
2 v3 v2 v1 2 28.80
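For completeness, the same grouped-max extraction can also be done outside R with a short awk script, in the spirit of the one-liner earlier in the thread. Note that, unlike the ddply version (which returns every row tied for the max), this sketch keeps only one row per group:

```shell
# Recreate the sample data from the question (header plus four rows)
cat > params.txt <<'EOF'
p1 p2 p3 iteration result
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36
EOF

# For each (p1,p2,p3) key, remember the row with the largest result.
awk 'NR > 1 {
  key = $1 FS $2 FS $3
  if (!(key in max) || $5 > max[key]) { max[key] = $5; row[key] = $0 }
}
END { for (k in row) print row[k] }' params.txt
```

As with the earlier one-liner, the END-block traversal order is unspecified, so pipe through sort if the group order matters.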