I have a matrix, saved as a file (no extension) looking like this:
Peter Westons NH 54 RTcoef level B matrix from L70 Covstats.
2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01
7.41276375E-01
2.27966995E+00 2.31885465E+00 1.53558372E+00 4.87789344E-01 2.90254400E-01
2.56963125E-01
1.68120147E+00 1.53558372E+00 1.26129096E+00 8.18048022E-01 5.66120186E-01
3.23866166E-01
9.88238464E-01 4.87789344E-01 8.18048022E-01 1.38558423E+00 1.21272607E+00
7.20283781E-01
8.38279026E-01 2.90254400E-01 5.66120186E-01 1.21272607E+00 1.65314082E+00
1.35926028E+00
7.41276375E-01 2.56963125E-01 3.23866166E-01 7.20283781E-01 1.35926028E+00
1.74777330E+00
How do I go about reading this in as a fixed 6*6 matrix, skipping the first header? I don't see any options for the amount of columns in read.matrix, I tried with the scan() -> matrix() option but I can't read in the file as the skip parameter in scan() doesn't seem to work. I feel there must be a simple option to do this.
My original file is larger, and has 17 full rows of 5 elements and 1 row of 1 element in this structure, example of what needs to be in one row:
[1] " 2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01"
[2] " 7.41276375E-01 5.23588785E-01 1.09559244E-01 -9.58430529E-02 -3.24544839E-02"
[3] " 1.96694874E-02 3.39249911E-02 1.54438478E-02 2.38380549E-03 9.59475077E-03"
[4] " 8.02748175E-03 1.63922615E-02 4.51778592E-04 -1.32080759E-02 -2.06313988E-02"
[5] " -1.56037533E-02 -3.35496588E-03 -4.22450803E-03 -3.17468525E-03 3.23012615E-03"
[6] " -8.68914773E-03 -5.94151619E-03 2.34059840E-04 -2.76737270E-03 -4.90334584E-03"
[7] " 1.53812087E-04 5.69891977E-03 5.33816835E-03 3.32982333E-03 -2.62856968E-03"
[8] " -5.15188677E-03 -4.47782553E-03 -5.49510247E-03 -3.71780229E-03 9.80192203E-04"
[9] " 4.18101180E-03 5.47513662E-03 4.14679058E-03 -2.81461574E-03 -4.67580613E-03"
[10] " 3.41841523E-04 4.07771227E-03 7.06154094E-03 6.61650765E-03 5.97925136E-03"
[11] " 3.92987162E-03 1.72895946E-03 -3.47249017E-03 9.90977857E-03 -2.36066909E-31"
[12] " -8.62803933E-32 -1.32472387E-31 -1.02360189E-32 -5.11800943E-33 -4.16409844E-33"
[13] " -5.11800943E-33 -2.52126889E-32 -2.52126889E-32 -4.16409844E-33 -4.16409844E-33"
[14] " -5.11800943E-33 -5.11800943E-33 -4.16409844E-33 -2.52126889E-32 -2.52126889E-32"
[15] " -2.52126889E-32 -1.58614773E-33 -1.58614773E-33 -2.55900472E-33 -1.26063444E-32"
[16] " -7.93073863E-34 -1.04102461E-33 -3.19875590E-34 -3.19875590E-34 -3.19875590E-34"
[17] " -2.60256152E-34 -1.30128076E-34 0.00000000E+00 1.78501287E-02 -1.14423068E-11"
[18] " 3.00625863E-02"
So the full matrix should be 86*86.
Thanks a bunch
Try this option :
Read the file with readLines removing the first line. ([-1]).
Split values on whitespace and create 1 X 6 matrix from every combination of two rows.
Combine them together in one matrix with do.call(rbind, ..).
rows <- readLines('filename')[-1]
result <- do.call(rbind,
tapply(rows, ceiling(seq_along(rows)/2), function(x)
strsplit(paste0(trimws(x), collapse = ' '), '\\s+')[[1]]))
Related
I have this text file called textdata.txt:
TREATMENT DATA
------------------------------------
A: Text1
B: Text2
C: Text3
D: Text4
E: Text5
F: Text6
G: Text7
I would like to remove the whole line with --------- and the empty lines using readLines or read_lines:
When I use readLines("textdata.txt") I get:
[1] "TREATMENT DATA"
[2] ""
[3] "------------------------------------"
[4] "A: Text1"
[5] "B: Text2"
[6] ""
[7] "C: Text3"
[8] "D: Text4"
[9] ""
[10] "E: Text5"
[11] "F: Text6"
[12] "G: Text7"
I would like to have, expected output:
[1] "TREATMENT DATA"
[2] "A: Text1"
[3] "B: Text2"
[4] "C: Text3"
[5] "D: Text4"
[6] "E: Text5"
[7] "F: Text6"
[8] "G: Text7"
Background:
I have de facto no experience handling files with R. The basic idea is to get a .txt format from which I can load multiple text files stored in a folder to one dataframe.
1) read.table If we can assume that the only occurrence of - is where shown in the question and if ? does not occur anywhere in the file then this will read in the data regarding every line as a single field and throwing away the header. Since - is the comment character lines with only - are regarded as blank and those will be thrown away. This reads the file into a one columnn data frame and the [[1]] returns that column as a character vector. If you want to keep the header omit header=TRUE.
read.table("myfile", sep = "?", comment.char = "-", header = TRUE)[[1]]
2) grep Another possibility is to read in the file and then remove lines that are empty or contain only - characters.
grep("^-*$", readLines("myfile"), invert = TRUE, value = TRUE)
3) pipe We could process the input using a filter and then pipe that into R. On Windows grep is found in C:\Rtools40\usr\bin if you have Rtools40 installed but if it is not on your path either use the complete path or if you don't have it at all replace grep with findstr. If on UNIX/Linux the escaping may vary according to which shell you are using.
readLines(pipe('grep -v "^-*$" myfile'))
How would I go about extracting, for each row (there are ~56,000 records in an Excel file) in a specific column, only part of a string? I need to keep all text to the left of the last '/' forward slash. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) at the end of the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?
You can use the stringr package str_remove(string,pattern) function like:
str = "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
Then you can just iterate over all other strings:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
You have to substract strings using this method:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
In which this regex regexpr("\\/[^\\/]*$", strings) gives you the position of the last "/"
Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# I define a function that separates a string at each "/"
# throws the last piece and reattaches the pieces
cut_str <- function(s) {
st <- head((unlist(strsplit(s, "\\/"))), -1)
r <- paste(st, collapse = "/")
return(r)
}
# through the sapply function I get the desired result
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
You could use
dirname(strings)
If there is no /, this returns ., which you could remove afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
``
You could start the match with / followed by 1 or more times any char except a forward slash or a whitespace char using a negated character class [^\\s/]+
Then match .wav at the end of the string using $
Replace the match with an empty string using sub for example.
[^\\s/]+\\.wav$
See the regex matches | R demo
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
sub("/[^\\s/]+\\.wav$", "", strings)
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Using my trusty firebug and firepath plug-ins I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.
First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5 or I'm misunderstanding what you are looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
html_text(trim = T)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
html_nodes("font > table font") %>%
html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile then that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
Looks for the "1st Sec." column, then jump up to its row, grab every other row's 1st td value.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "
I imported some data with no column names, so now I have just over a million rows, and 1 column (instead of 5 columns).
Each row is formatted like this:
x <- "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
strsplit( x , split = c(" ", " ", "%", " "))
and got
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct" "19"
[5] "2012" "23:59:01:"
[7] "%FWSM-6-305011:" "Built"
[9] "dynamic" "tcp"
[11] "translation" "from"
[13] "Inside:10.2.45.62/56455" "to"
[15] "outside:192.101.136.224/9874"
I know that it has to do with recycling the split argument but I can't seem to figure how to get it how I want:
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01 "%FWSM-6-305011
[5] Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
Each row has a different message as the fifth element, but after the 4th element I just want to keep the rest of the string together.
Any help would be appreciated.
You can use paste with the collapse argument to combine every element starting with the fifth element.
A <- strsplit( x = "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874", split = c(" ", " ", "%", " "))
c(A[[1]][1:4], paste(A[[1]][5:length(A[[1]])], collapse=" "))
As #DWin points out, split = c(" ", " ", "%", " ") is not used in order - in other words it's identical to split = c(" ", "%")
I think here you don't need to use strsplit. I use read.table to read the lines using text argument. Then you aggregate columns using paste. Since you have a lot of rows, it is better to do the column aggregation within a data.table.
dt <- read.table(text=x)
library(data.table)
DT <- as.data.table(dt)
DT[ , c('V3','V8') := list(paste(V3,V4,V5),
V8=paste(V8,V9,V10,V11,V12,V13,V14,V15))]
DT[,paste0('V',c(1:3,6:7,8)),with=FALSE]
V1 V2 V3 V6 V7
1: 2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011:
V8
1: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874
Here is a function that I think works in the way that you thought strsplit functioned:
split.seq<-function(x,delimiters) {
break.point<-regexpr(delimiters[1], x)
first<-mapply(substring,x,1,break.point-1,USE.NAMES=FALSE)
second<-mapply(substring,x,break.point+1,nchar(x),USE.NAMES=FALSE)
if (length(delimiters)==1) return(lapply(1:length(first),function(x) c(first[x],second[x])))
else mapply(function(x,y) c(x,y),first, split.seq(second, delimiters[-1]) ,USE.NAMES=FALSE, SIMPLIFY=FALSE)
}
split.seq(x,delimiters)
A test:
x<-rep(x,2)
delimiters=c(" ", " ", "%", " ")
split.seq(x,delimiters)
[[1]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
[[2]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
I am trying to write a list to file as one row and without quotes in R.
Content of the list is:
[1] "X4775495036_J" "X4775495036_F" "X5147722015_F" "X5067554009_F"
[5] "X5067554063_B" "X4954590047_A" "X5067554063_G" "X5067554009_L"
[9] "X5147722015_D" "X5511045011_D" "X5067554063_A" "X4805447025_F"
[13] "X5455362015_K" "X4805447025_L" "X5147722015_B" "X5067554009_G"
[17] "X5147722014_K" "X5067554063_H" "X5147722009_G" "X5067554008_H"
[21] "X5067554054_H" "X4805447016_K" "X5147722014_E" "X4954590051_K"
[25] "X5067554008_E" "X5147722015_H" "X5147722009_H" "X5067554063_D"
[29] "X5147722015_A" "X5511045022_E" "X5067554054_I" "X5067554063_J"
[33] "X5067554007_F" "X4775495036_E" "X4775495036_H" "X4805447025_H"
[37] "X5067554009_I" "X4805447025_K" "X4954590051_C" "X4805447025_E"
[41] "X5067554063_E" "X5147722009_J" "X5067554054_C" "X5067554054_G"
[45] "X4805447016_I" "X5455362015_B" "X5067554009_H" "X5147722014_A"
[49] "X4775495036_I" "X5067554063_L" "X5455362015_J" "X4954590047_J"
[53] "X5067554009_A" "X4954590051_D" "X5455362015_I" "X5511045011_E"
[57] "X5147722014_F"
I want something like this (all elements in one row):
X4775495036_J X4775495036_F X5147722015_F X5067554009_F ...
I have tried with write.table, write but with no result.
Note that you don't have a list, you have a character vector.
cat(your_vector, "\n", file="your_file.txt")
The "\n" is an optional newline at the end.
You could use the ncolumns argument of write:
n <- LETTERS[1:10] # create example values
write(n, "letters.txt", ncolumns=length(n))
Or you could concatenate your names before:
nc <- paste0(n, collapse=" ")
write(nc, "letters.txt")