Conditional input using read.table or readLines - r

I'm struggling to use readLines() and read.table() to get a well-formatted data frame in R.
I want to read files like this, which contain hockey stats. I'd like to get a nicely formatted data frame; however, specifying a fixed number of lines to read is difficult because the number of players differs from file to file. Also, non-players, marked as #.AC, #.HC and so on, should not be read in.
I tried something like this
LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
encoding="UTF-8", nrows=LINES)
but as mentioned above, the value for LINES is different each time.
I also tried readLines as in this post, but had no luck with it.
Is there a way to integrate a condition in read.table, like (pseudo code)
if (first character == "AC") {
break read.table
}
Sorry if this looks strange, I don't have that much experience in scripting or coding.
Any help is appreciated, thanks a lot!
Greetz!

Your data present a couple of difficulties that should be handled in sequence, which means you should not try to read the entire file with one command:
Read plain lines and find start and stop row
Depending on the specification of the files you read, my suggestion is to first find the first row you actually want to read by some indicator. This can be a line number which is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line containing only whitespace after the start line:
lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) -2
# keep 'lines' intact: start and end are reused as indices into it below
block <- lines[start:end]
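The start/stop search can be tried without the real file. The following is a minimal sketch on made-up lines (the sample content and the names `demo_lines`, `first_row`, `last_row` and `blank` are assumptions for illustration). It uses `^\\s*$` instead of `^\\s+$` so that completely empty lines also count as the end marker, and it falls back to the last line if no blank line follows the table.

```r
# Made-up stand-in for readLines(Datei); the real file layout is assumed to be similar
demo_lines <- c("Some report header",
                "TEAM STATS",
                "",
                "#  PLAYER              POS GP G",
                "93 Eichelkraut, Flori  F   8  10",
                "91 Mueller, Lars       F   6  1",
                "   ",
                "GOALIE STATS")

# Header row of the table: two lines after the "TEAM STATS" marker
first_row <- which(demo_lines == "TEAM STATS") + 2

# Blank (empty or whitespace-only) lines that come after the header row
blank <- grep("^\\s*$", demo_lines)
blank <- blank[blank > first_row]

# Last table row: the line before the first blank line, or the end of the file
last_row <- if (length(blank) > 0) min(blank) - 1 else length(demo_lines)

demo_lines[first_row:last_row]
```

Because the end is searched for rather than hard-coded, the same code should work for files with a different number of players.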
Read the data to data.frame
In your case you meet a couple of complications:
Your header line starts with a #, which is by default recognized as a comment character, so the line is ignored. But even if you switch this behavior off (comment.char = ""), it's not a valid column name.
If we tell read.table to split the columns along whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in the cells. So the best option at the moment is to just ignore the header line and let read.table do this with its default behavior (comment.char = "#"). We also let the PLAYER column be split into two and will fix this later.
You won't be able to use the first column as row.names since the values are not unique.
The rows have unequal length, since the POS column is not filled everywhere.
So:
tab <- read.table( text = lines[ start:end ], fill = TRUE, stringsAsFactors=FALSE )
# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]
Fix the header
Just split the start line at runs of whitespace and replace the first entry (#) with a valid column name:
colns <- strsplit( lines[start], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns
Fix cases where "POS" was empty
This is done by finding the rows whose last cell contains NA and shifting their values one cell to the right:
rowsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ rowsToFix, 4:ncol(tab) ] <- tab[ rowsToFix, 3:(ncol(tab)-1) ]
tab[ rowsToFix, 3 ] <- NA
> str(tab)
'data.frame': 25 obs. of 20 variables:
$ code : chr "93" "91" "61" "88" ...
$ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
$ POS : chr "F" "F" "D" "F" ...
$ GP : chr "8" "6" "7" "8" ...
$ G : int 10 1 4 3 4 2 0 2 1 0 ...
$ A : int 5 11 5 5 3 4 6 3 3 4 ...
$ PTS : int 15 12 9 8 7 6 6 5 4 4 ...
$ PIM : int 12 10 12 6 2 36 37 29 6 0 ...
$ PPG : int 3 0 1 1 1 1 0 0 1 0 ...
$ PPA : int 1 5 2 2 1 2 4 2 1 1 ...
$ SHG : int 0 1 0 1 1 0 0 0 0 0 ...
$ SHA : int 0 0 1 0 1 0 0 1 0 0 ...
$ GWG : int 2 0 1 0 0 0 0 0 0 0 ...
$ FG : int 1 0 1 1 1 0 0 0 0 0 ...
$ OTG : int 0 0 0 0 0 0 0 0 0 0 ...
$ UAG : int 1 0 1 0 0 0 0 0 0 0 ...
$ ENG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOA : num 0 0 0 0 0 0 0 0 0 0 ...
$ SHO% : num 0 0 0 0 0 0 0 0 0 0 ...
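The question also asks to leave out non-player rows, and columns such as GP are still character after the shift. A possible follow-up is sketched below on a made-up table (the data and the names `demo_tab`, `is_player` and `players` are assumptions). It assumes that player codes consist only of digits while staff codes do not; check this against the real files before relying on it.

```r
# Made-up stand-in for 'tab' after the header has been fixed
demo_tab <- data.frame(code   = c("93", "91", "AC", "HC"),
                       PLAYER = c("Eichelkraut, Flori", "Mueller, Lars",
                                  "Coach, Assistant", "Coach, Head"),
                       GP     = c("8", "6", NA, NA),
                       stringsAsFactors = FALSE)

# Assumption: jersey numbers are all digits, staff codes are not
is_player <- grepl("^[0-9]+$", demo_tab$code)
players <- demo_tab[is_player, , drop = FALSE]

# Re-guess the type of every character column (e.g. "8" becomes the integer 8)
is_chr <- vapply(players, is.character, logical(1))
players[is_chr] <- lapply(players[is_chr], type.convert, as.is = TRUE)

str(players)
```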

Related

Change a character column to numeric in a data frame

I want to change a column from a dataframe from character to numeric.
My data frame was a .txt file with 12 columns and 1000 rows.
When I passed the .txt file to R, one of my columns is now character.
I tried to use
as.numeric(my_data$iw)
But I get a warning message:
NAs introduced by coercion
Here is the data frame structure:
'data.frame': 1000 obs. of 12 variables:
$ im : num 0 15396 16537 20252 17967 ...
$ iw : chr "20064.97" "7397.191" "18380.77" "14042.25" ...
$ r : num 5984 0 0 0 0 ...
$ am : num 0 42 33 38 24 62 27 38 0 29 ...
$ af : num 38 30 28 38 39 42 18 33 24 35 ...
$ a1c: num 0 1 1 1 1 0 0 1 0 1 ...
$ a2c: num 0 0 0 1 0 0 0 1 0 1 ...
$ a3c: num 0 0 0 0 0 0 0 1 0 0 ...
$ a4c: num 0 0 0 0 0 0 0 0 0 0 ...
$ a5c: num 0 0 0 0 0 0 0 0 0 0 ...
$ a6c: num 0 0 0 0 0 0 0 0 0 0 ...
$ a7c: num 0 0 0 0 0 0 0 0 0 0 ...
May I change it with gsub?
structure(list(im = c(0, 15395.61, 16536.74, 20251.87, 17967.04,
12686.43, 16833.22, 16919.34, 0, 20515.88, 17991.9, 15528.29,
16683.96, 14485.19, 17957.98, 19923.31, 13526.9, 16516.68, 16337.52,
12904.97, 17418.99, 12419.21, 14561.9, 12309.77, 21138.87, 0,
17315.74, 17762.09, 12678.82, 13883.37, 11140.66, 16502.91, 18293.78,
12533.36, 16536.61, 4336.741, 22449.17, 16532.1, 0, 15905.14,
0, 8542.03, 12589.29, 15154.76, 15441.59, 18575.05, 15915.47,
0, 15085.51, 16597.42, 15358.47, 22480.95, 10555.28, 21771.2,
22863.56, 15937.55, 12230.58, 17814.67, 7972.471, 10286.75, 15335.8,
10762.59, 18583.2, 12167.99, 21723.37, 15670.79, 13045.83, 13305.73,
14305.99, 10353.15, 4504.009, 10157.7, 15967.28, 23640.21, 15053.78,
21404.11, 8509.353, 15693.39, 9009.99, 17249.29, 9115.844, 16057.39,
14069.93, 0, 0, 16840.09, 0, 15289.29, 12223.93, 13048.58, 18524.13,
14344.22, 20658.66, 0, 0, 13984.69, 21636.72, 13969.12, 12919.83,
13214.16, 17066.98, 20060.25, 11414.15, 12907.53, 11289.97, 17600.97,
14741.77, 12089.57, 13603.85, 9330.662, 0, 16191.81, 12029.75,
12666.29, 8138.166, 10636.2, 22570.1, 12833.66, 12585.56, 20197.42,
12621.56, 19021.65, 9948.49, 25772.41, 15102.54, 19225.57, 11188.96,
11707.66, 9766.824, 16082.82, 17693.2......
To read in the .txt file, I wrote:
my_data <- read.table("project.txt", header=TRUE);
As some comments already said, this is because some values in the column cannot be coerced to numeric, for example an unsuitable marker for missing data or a comma used as the decimal separator, as in
expl <- read.table(text = "1.0 2.0 2,3
2.0 2.1 2.5
. 2.2 2.1")
str(expl)
which leads to
> str(expl)
'data.frame': 3 obs. of 3 variables:
$ V1: chr "1.0" "2.0" "."
$ V2: num 2 2.1 2.2
$ V3: chr "2,3" "2.5" "2.1"
for the reasons stated above.
It is not always easy to find the culprit in 1000 lines, but something like this may help:
> which(is.na(as.numeric(expl$V1)))
[1] 3
This will provide you with the row numbers that produce NA in conversion.
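Once the offending values are known, the column can be repaired in place. The following is a minimal sketch on a made-up vector (the values and the names `iw`, `bad_rows`, `cleaned` and `fixed` are assumptions); it treats a lone "." as missing and a comma as the decimal separator, which may or may not match what is actually wrong in your file.

```r
# Made-up stand-in for my_data$iw
iw <- c("20064.97", "7397,191", "18380.77", ".", NA)

# Rows that are not NA to begin with but become NA when converted
as_num <- suppressWarnings(as.numeric(iw))
bad_rows <- which(is.na(as_num) & !is.na(iw))
iw[bad_rows]   # inspect the offending strings before changing anything

# Repair: "." means missing, "," is the decimal separator
cleaned <- iw
cleaned[cleaned %in% "."] <- NA
fixed <- as.numeric(sub(",", ".", cleaned, fixed = TRUE))
fixed
```

If the whole file uses the same conventions, it may be simpler to handle them while reading, e.g. with the dec and na.strings arguments of read.table.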

Stacked barplot in UpSetR

I have been looking for a way to get a stacked bar plot in an UpSetR graph.
I downloaded the movies data set (from here) and added a column having only two values "M" and "C".
Below, information on how I loaded the data and added the "x" column.
Edit:
m <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = T, sep = ";")
nrow(m)
[1] 3883
x<-c(rep("M", 3000), rep("C", 883))
m<-cbind(m, x)
unique(m$x)
[1] M C
This is the structure of the data frame:
str(m)
'data.frame': 3883 obs. of 22 variables:
$ Name : Factor w/ 3883 levels "$1,000,000 Duck (1971)",..: 3577 1858 1483 3718 1175 1559 3010 3548 3363 1420 ...
$ ReleaseDate: int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
$ Action : int 0 0 0 0 0 1 0 0 1 1 ...
$ Adventure : int 0 1 0 0 0 0 0 1 0 1 ...
$ Children : int 1 1 0 0 0 0 0 1 0 0 ...
$ Comedy : int 1 0 1 1 1 0 1 0 0 0 ...
$ Crime : int 0 0 0 0 0 1 0 0 0 0 ...
$ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
$ Drama : int 0 0 0 1 0 0 0 0 0 0 ...
$ Fantasy : int 0 1 0 0 0 0 0 0 0 0 ...
$ Noir : int 0 0 0 0 0 0 0 0 0 0 ...
$ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
$ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
$ Romance : int 0 0 1 0 0 0 1 0 0 0 ...
$ SciFi : int 0 0 0 0 0 0 0 0 0 0 ...
$ Thriller : int 0 0 0 0 0 1 0 0 0 1 ...
$ War : int 0 0 0 0 0 0 0 0 0 0 ...
$ Western : int 0 0 0 0 0 0 0 0 0 0 ...
$ AvgRating : num 4.15 3.2 3.02 2.73 3.01 3.88 3.41 3.01 2.66 3.54 ...
$ Watches : int 2077 701 478 170 296 940 458 68 102 888 ...
$ x : Factor w/ 2 levels "M","C": 1 1 1 1 1 1 1 1 1 1 ...
Now I tried to implement the stacked bar plot as follows:
upset(m,
queries = list(
list(query = elements,
params = list("x", "M"), color = "#e69f00", active = T),
list(query = elements,
params = list("x", "C"), color = "#cc79a7", active = T)))
The result looks like this:
As you can see, the proportions are wrong: each bar should contain only two colors, one for each level of the factor ("M" or "C").
This issue seems not to be a trivial one, as also pointed out here.
Does anyone have an idea on how to implement this in UpSetR?
Thanks a lot
Here is a way to create an upset plot with stacked barplot, but using my ComplexUpset rather than UpSetR:
library(ComplexUpset)
movies = as.data.frame(ggplot2movies::movies)
genres = colnames(movies)[18:24]
# for simplicity of examples, only use the complete data points
movies[movies$mpaa == '', 'mpaa'] = NA
movies = na.omit(movies)
upset(
movies,
genres,
base_annotations=list(
'Intersection size'=intersection_size(
counts=FALSE,
mapping=aes(fill=mpaa)
)
),
width_ratio=0.1
)
Please see more examples in the documentation.
The Installation instructions are available on GitHub: krassowski/complex-upset (there is also a comparison to UpSetR and other packages).
I had a similar problem and found this workaround:
library("UpSetR")
m <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = T, sep = ";")
x<-c(rep("M", 2000), rep("Q", 1000), rep("C", 883))
m<-cbind(m, x)
upset(m,
queries = list(
list(query = elements,
params = list("x", c("M","Q", "C")), color = "#e69f00", active = T),
list(query = elements,
params = list("x", c("Q","C")), color = "#cc79a7", active = T),
list(query = elements,
params = list("x", "C"), color = grey(0.7), active = T)))
The problem in the original example is that every query overlays over the total bar separately and starts at y=0. Thus, the remaining black part of the bar always has the exact same height as the purple part at the bottom. The workaround is to systematically add queries of combinations of the different values the variable can take:
Start with a query and a respective color for the combination of all possible values (here c("M","Q","C") as the second parameter to params = list()).
Successively leave out one of the possible values (e.g. c("Q","C") in the first step here). The value left out will be represented by the color of the query, the last one that still included it ("M" in this example).
Continue adding queries until you have only one value left for the second parameter to params = list().
It should be possible to do this programmatically for larger numbers of possible values by providing some color palette. But this remains a workaround, and a native implementation of stacking the queries would be nice to have, so if you would like to see this functionality, you might consider bumping up the respective issue over at the GitHub repo.
Below is the nice answer by @dlaehnemann, slightly modified to create the list of lists in a loop and to link the wanted colors to it.
m <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = T, sep = ";")
x<-c(rep("M", 2000), rep("Q", 1000), rep("C", 883))
m<-cbind(m, x)
i<-0
mylist<-list()
vectorUniqueValue <- unique(m$x)
colors = colorRampPalette(c("#332288",'#fdff00','#FF0000',"#CC6677","#88CCEE",'#36870c','#b786d2','#7c3c06',"#DDCC77",'#192194','#52cff4','#4f9c8b',"#4477AA",'#808080'))(length(vectorUniqueValue))
while ( length(vectorUniqueValue)>0 ){
i<-i+1
mylist[[i]]<-list(query = elements, params = list("x",as.character(vectorUniqueValue)), color = colors[i], active = T)
vectorUniqueValue<-vectorUniqueValue[-1]
}
upset(m, queries = mylist)
Hope it helps a bit until maybe one day someone works on the issue on GitHub!

how to read a specific .Matrix file in R

I have a .Matrix file. I have been told it is similar to a .csv file, and when I look at it in a web browser, it looks like this:
%TransMat_H0004.E1.L1.S1.B1.T1
CLUSTER,,3,3,2,2,1,1,3,1,1,1,1,3,2,3,1,2,2,1,1,3,3,1,2,1,3,1,1,2,1,3,3,2,3,3,1,1,1,1,1,3,3,1,2,3,2,1,1,1,1,2,1,2,2,3,1,3,2,2,2,1,3,3,2,3,3,1,2,3,3,2,2,2,3,2,2,2,1,1,2,1,1,2,1,1,1,1,2,3,1,3,2,3,3,3,3,2,1,1,3,3,3,1,1,1,2,1,3,1,2,1,1,1,1,1,1,1,3,1,3,2,3,1,1,3,2,2,3,3,1,3,1,1,2,1,2,2,1,1,3,3,1,2,1,2,2,2,2,2,1,3,1,2,3,2,2,2,2,3,2,1,1,2,3,3,2,1,3,1,1,1,1,3,3,3,1,3,3,1,2,2,3,2,3,2,2,3,1,2,2,1,3,1,2,2,3,1,2,3,2,3,3,1,3,2,3,1,1,2,3,1,1,3,2,1,2,1,1,3,1,1,3,1,1,2,1,2,2,2,3,1,3,3,3,1,3,1,1,3,2,3,1,3,2,1,3,1,1,1,2,3,3,3,1,3,3,3,1,1,2,2,3,2,3,3,3,1,3,3,1,1,2,3,2,1,1,3,1,1,1,1,1,3,3,2,2,1,1,1,1,1,3,1,1,2,3,3,1,1,3,2,2,1,1,2,1,1,3,2,1,2,1,2,3,2,1,1,3,2,1,3,2,1,2,2,1,3,3,1,3,3,2,3,2,3,1,3,3,3,3,2,1,3,2,3,3,3,2,1,2,1,2,3,1,1,3,3,3,3,3,2,3,3,1,3,1,1,2,3,3,3,3,3,3,2,2,2,3,1,2,3,3,3,3,2,1,2,2,3,2,3,2,3,2,3,3,2,1,2,3,3,2,1,2,3,3,3,1,3,2,3,3,1,2,2,3,1,1,2,2,3,2,1,1,2,2,1,3,1,2,3,1,3,1,1,2,3,3,1,2,3,2,2,1,1,2,3,2,2,2,1,2,1,2,2,3,2,1,2,1,3,1,2,3,1,2,3,1,2,1,1,2,1,3,3,3,1,3,3,2,2,2,1,2,3,1,3,1,2,1,3,1,2,2,1,2,3,1,1,3,3,2,2,3,1,1,2,1,1,1,2,1,2,3,3,2,2,1,2,3,2,3,1,2,2,2,1,3,3,3,3,3,3,2,3,2,1,2,1,3,3,1,3,3,1,3,2,3,3,1,2,3,3,3,3,3,1,2,1,2,1,1,1,1,2,2,3,1,1,2,3,2,3,2,2,3,3,1,2,1,3,2,3,2,2,3,2,3,1,1,1,3,1,2,3,1,3,2,3,2,2,1,2,3,1,3,2,1,2,3,1,3,1,2,2,1,3,3,2,1,3,3,1,2,3,1,2,1,1,3,1,3,2,3,3,3,3,2,2,1,1,3,3,2,1,3,1,1,3,3,3,1,3,3,1,1,3,3,3,1,1,3,3,2,1,3,2,3,1,3,2,2,2,2,2,3,3,1,2,2,3,2,3,3,1,3,1,3,3,1,3,2,1,2,3,1,3,1,3,2,2,1,1,1,1,3,2,3,3,2,2,3,2,3,1,3,2,1,2,3,1,2,2,1,1,1,3,3,2,3,3,3,3,2,3,1,1,3,3,3,1,1,3,2,1,2,3,2,3,1,3,3,2,1,1,1,1,3,3,2,3,1,2,1,3,3,3,2,2,2,2,3,3,1,1,2,3,2,2,3,3,2,2,3,3,3,2,2,1,2,2,3,3,3,3,1,2,2,3,2,2,2,2,3,2,2,2,1,1,2,2,2,1,2,3,2,2,3,3,2,1,3,1,2,2,1,3,2,3,1,1,3,1,2,2,2,3,3,1,3,3,1,2,1,2,3,1,3,2,3,1,1,3,3,3,1,2,3,3,3,1,3,3,1,3,2,2,2,3,2,1,2,3,3,2,1,2,1,2,1,1,3,3,1,1,3,2,1,3,2,1,3,3,3,2,2,2,1,3,2,3,2,3,1,2,3,1,3,3,1,1,3,2,1,2,3,2,1,1,2,3,1,3,2,1,2,2,3,2,2,1,3,2,1,1,3,3,2,1,3,1,2,2,1,2,2,3,2,2,2,3,1,1,3,3,3,3,1,2,2,3,3,3,2,1,3,2,1,2,3,3,1,3,2,1,2,1,1,2,2,3,2,2,3,1,2,3,2,3,1,2,3,3,2,3,3,1
,1,2,1,1,1,3,1,3,1,3,3,2,3,1,2,2,1,2,3,3,2,3,2,3,2,1,1,3,2,3,2,3,1,1,3,1,3,2,1,3,2,2,2,3,1,1,2,3,1,1,1,2,3,3,3,1,2,3,3,3,3,2,3,1,3,1,3,2,3,2,3,3,1,1,2,3,1,1,3,3,2,3,3,1,2,3,1,2,3,3,2,3,3,2,1,2,3,3,2,3,1,2,2,3,1,2,1,3,2,3,1,2,2,3,3,2,2,3,1,3,3,3,3,2,3,2,2,1,3,1,2,1,1,1,3,2,3,1,1,1,1,3,3,2,3,1,1,2,1,3,1,2,3,3,2,2,1,1,3,2,2,3,1,2,3,3,3,2,1,2,2,3,1,3,3,2,1,2,2,3,3,2,2,3,2,1,1,3,1,3,3,1,3,2,3,3,3,1,1,1,3,1,2,2,3,2,3,2,3,1,1,2,1,2,1,3,3,1,3,3,2,2,1,3,1,2,2,3,2,2,2,3,3,2,1,1,1,1,3,1,1,2,1,2,2,3,3,2,3,3,3,2,1,1,3,2,2,2,3,1,3,3,3,2,2,3,1,3,3,3,1,3,3,3,2,3,1,2,1,1,3,1,2,3,2,1,3,3,2,1,3,2,3,2,3,1,2,2,3,3,2,3,3,3,1,2,3,3,3,3,3,1,1,2,3,1,2,1,1,1,1,2,1,1,2,3,1,3,3,2,2,3,2,2,1,3,2,2,3,1,1,1,1,1,3,1,3,1,1,3,2,2,3,3,3,1,2,2,3,3,2,3,2,3,3,2,1,2,3,3,1,3,1,2,1,1,2,2,2,2,2,2,1,3,1,3,2,3,2,2,2,2,2,3,2,2,1,3,1,1,1,2,1,2,1,2,1,3,1,3,3,1,3,1,3,3,1,3,2,3,3,3,3,1,3,3,2,3,2,3,3,3,1,1,2,2,3,3,3,2,2,3,3,1,3,1,2,1,2,2,1,1,3,3,1,1,3,1,1,1,2,2,3,2,2,2,3,3,1,2,1,2,2,2,3,2,2,1,2,1,1,1,3,3,3,2,1,3,3,3,2,2,3,1,2,1,3,1,3,3,1,3,2,3,2,2,1,1,1,3,3,2,3,1,3,2,2,2,2,2,3,1,3,2,3,1,3,1,3,1,2,3,2,2,3,3,3,3,3,1,1,2,3,3,2,3,1,3,3,1,3,3,2,2,1,3,3,3,3,2,1,3,2,2,2,3,3,1,1,3,3,3,1,3,1,1,2,3,1,3,3,3,2,1,3,1,2,1,3,2,2,3,1,3,1,2,3,3,3,2,2,3,1,2,1,1,1,2,3,1,2,3,2,3,3,2,1,1,2,3,3,1,2,3,1,1,1,3,1,2,3,1,2,3,2,2,3,2,3,2,3,1,2,3,3,1,3,3,2,2,1,1,2,3,2,2,3,3,2,1,1,1,3,3,3,2,2,1,3,2,2,1,3,2,3,3,1,1,3,2,3,3,2,3,1,3,3,1,3,3,2,3,3,2,3,1,3,3,3,3,3,1,1,3,2,2,3,3,3,3,1,1,1,1,3,2,3,3,1,3,2,2,1,1,1,1,3,2,2,3,2,2,3,3,2,3,1,1,1,3,3,3,3,2,3,1,3,3,1,1,3,3,1,3,3,3,1,3,2,1,1,3,3,2,3,3,3,2,2,1,3,3,3,1,2,2,2,2,1,2,2,1,2,3,2,1,2,2,3,3,3,3,3,2,2,3,2,2,3,2,1,3,1,1,2,2,3,1,2,3,2,1,3,1,1,2,1,2,2,3,1,2,2,3,3,1,3,2,1,3,3,2,1,3,3,3,1,3,2,3,3,2,3,2,2,3,2,1,3,3,3,3,2,1,3,3,3,1,3,3,1,3,1,3,3,3
tSNE-1,,8.13846968090103,12.8635212043927,10.3864480425066,7.17083119797853,-72.7452686458686,-49.7960088439495,45.63460621346,-50.3693843293848,-53.2415432674881,-54.6891175204711,-46.4635164735514,4.49644447816871,3.98243750756555,-9.99729157677144,-98.1041739031645,14.4129117311442,21.8090838800674,-46.5547640077783,-65.8379505581324,39.8907136841164,45.2453417297103,-43.4054353275594,5.58370171555427,-82.6419520577671,42.7647608862027,-91.125151907502,-37.9838559192307,62.9924569510685,-69.108888726706,62.7774653919852,60.3873481045592,62.825
I tried to read it by read.csv:
test=read.csv('TransMat_H0004.E1.L1.S1.B1.T1.Matrix',sep='' )
str(test)
'data.frame': 33141 obs. of 1 variable:
$ X.TransMat_H0004.E1.L1.S1.B1.T1: Factor w/ 33141 levels "A1BG,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"| truncated,..: 13453 31099 31100 1 2 3 4 5 6 7 ...
How should I read it in the right format, say, with the first entry of each 'sequence' (a list, I guess?) as the row name?
Thanks in advance!
Sorry, I cannot provide a link to the data because it is unpublished, but I can tell you what the data look like:
%TransMat_H0004.E1.L1.S1.B1.T1
cluster,1,2,3,2,3….
tsne-1,-41,-80…..
tsne-2,-41,-80…..
tsne-3,-41,-80…..
(and the rest all start with a gene name followed by numbers, such as)
genea, 0,2,1,0…
….
genez,0,2,1,0
My desired output is to remove the first 4 rows (cluster, tsne-1, tsne-2, tsne-3) and extract the gene transcript matrix, such as:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
I figured this out with:
read.csv("E2.Matrix", skip=1)
since the first row is an annotation, according to the bioinformatics technician who prepared the .Matrix file.
Thanks, @Stephan!
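To also get the first field as row names and drop the four annotation rows, something along these lines may work. It is a sketch on made-up text that follows the simplified layout described in the question (the sample values and the names `txt`, `raw`, `meta_rows` and `expr` are assumptions); the real file appears to contain extra empty fields, so it may need adjusting.

```r
# Made-up text following the simplified layout described in the question
txt <- "%TransMat_example
cluster,1,2,3,2
tsne-1,-41,-80,12,5
tsne-2,-41,-80,12,5
tsne-3,-41,-80,12,5
genea,0,2,1,0
geneb,1,0,0,3
genez,0,2,1,0"

# Skip the annotation line; use the first field of every line as the row name
raw <- read.csv(text = txt, skip = 1, header = FALSE, row.names = 1)

# Drop the annotation rows and keep only the gene rows as a numeric matrix
meta_rows <- c("cluster", "tsne-1", "tsne-2", "tsne-3")
expr <- as.matrix(raw[!(rownames(raw) %in% meta_rows), ])
expr
```

For the real file, replace `text = txt` with the file name. Note that row.names = 1 requires the first field to be unique across lines.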

Reorder a list of dataframes before rbind (R)

I'm working with R and I have a problem with rbinding data frames.
My data come from a JSON file, and the first thing I did was to split it according to chromosome number
#Input
Control <- fromJSON(file=O5)
RNAi <- fromJSON(file=s25p5)
#Loop through each chromosome
Control.1 <- lapply(Control, function(I)
{
data.frame(matrix(unlist(I),ncol = 1, byrow = TRUE))
})
The problem is that I now have a list of 6 data frames, but in a random order
str(Control.1)
List of 6
$ II :'data.frame': 1771887 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1771887] 0 0 0 0 0 0 0 0 0 0 ...
$ I :'data.frame': 1507243 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1507243] 0 0 0 0 0 0 0 0 0 0 ...
$ III :'data.frame': 1378370 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1378370] 0 0 0 0 0 0 0 0 0 0 ...
etc.
I would like to reorder them so that $I is the first data frame, then $II, etc.
My aim is to use rbind afterwards
Control.2 <-do.call(rbind,Control.1)
in order to have one data frame containing all the data frames, but in the correct order.
Does anybody have any idea how it could be done?
Thank you!
For alphabetical order you can use:
Control.2 <-do.call(rbind,Control.1[order(names(Control.1))])
or you can use any function other than order to sort the names vector.
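If alphabetical order does not happen to match the chromosome order, the list can also be indexed by an explicit vector of names. The following is a minimal sketch with a made-up three-element list (the data and the names `chrom_list`, `wanted`, `ordered_list` and `combined` are assumptions; extend `wanted` to the chromosome names actually present).

```r
# Made-up stand-in for Control.1: a named list of one-column data frames
chrom_list <- list(II  = data.frame(value = c(3, 4)),
                   I   = data.frame(value = c(1, 2)),
                   III = data.frame(value = 5))

# Desired order, spelled out explicitly
wanted <- c("I", "II", "III")

# Indexing a list by a character vector returns its elements in that order
ordered_list <- chrom_list[wanted]
combined <- do.call(rbind, ordered_list)
combined
```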

R: filling matrix with values does not work

I have a data frame vec that I need to prepare for an image.plot() plot. The structure of vec is as follows:
> str(vec)
'data.frame': 31212 obs. of 5 variables:
$ x : int 8 24 40 56 72 88 104 120 136 152 ...
$ y : int 8 8 8 8 8 8 8 8 8 8 ...
$ dx: num 0 0 0 0 0 0 0 0 0 0 ...
$ dy: num 0 0 0 0 0 0 0 0 0 0 ...
$ d : num 0 0 0 0 0 0 0 0 0 0 ...
Note: the values in $dx, $dy and $d are not zero but only too small to be shown in this overview.
Background: the data is the output of a pixel tracking software. $x and $y are pixel coordinates while in $d are the displacement vector lengths (in pixels) of that pixel.
image.plot() expects as first and second argument the dimension of the matrix as ordered vectors, so I think sort(unique(vec$x)) and sort(unique(vec$y)) respectively should be good. So, I would like to end up with image.plot(sort(unique(vec$x)),sort(unique(vec$y)), data)
The third argument is the actual data. To build this I tried:
# spanning an empty matrix
data = matrix(NA,length(unique(vec$x)),length(unique(vec$y)))
# filling the matrix
data[match(vec$x, sort(unique(vec$x))), match(vec$y, sort(unique(vec$y)))] = vec$d
But, unfortunately, this isn't working. It reports no errors but data contains no values! This works:
for(i in c(1:length(vec$x))) data[match(vec$x[i], sort(unique(vec$x))), match(vec$y[i], sort(unique(vec$y)))] = vec$d[i]
But is very slow.
a) is there a better way to build data?
b) is there a better way to deal with my problem, anyways?
R allows indexing of a matrix by a two-column matrix, where the first column of the index is interpreted as the row index, and the second column as the column index. So create the indexes into data as a two-column matrix
idx = cbind(match(vec$x, sort(unique(vec$x))),
match(vec$y, sort(unique(vec$y))))
and use that
data[idx] = vec$d
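A small worked example of this kind of indexing, using a made-up 2 x 2 grid (the values and the names `demo_vec`, `xs`, `ys`, `demo_idx` and `demo_data` are assumptions for illustration):

```r
# Made-up stand-in for 'vec': four pixels on a 2 x 2 grid
demo_vec <- data.frame(x = c(8, 24, 8, 24),
                       y = c(8, 8, 24, 24),
                       d = c(0.1, 0.2, 0.3, 0.4))

xs <- sort(unique(demo_vec$x))
ys <- sort(unique(demo_vec$y))

# Empty matrix with one row per x value and one column per y value
demo_data <- matrix(NA_real_, length(xs), length(ys))

# Two-column index matrix: each row is one (row, column) pair
demo_idx <- cbind(match(demo_vec$x, xs), match(demo_vec$y, ys))

# One assignment fills exactly the cells named by the pairs
demo_data[demo_idx] <- demo_vec$d
demo_data
```

In contrast, passing two separate index vectors, as in data[rows, cols], addresses every combination of the given rows and columns rather than one cell per pair, which is why the original attempt did not fill the matrix as intended.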

Resources