Checking a very big tab separated file for unique values - r

I have a very large report file, and I want to check one column, Sample ID, for unique values. My data looks as follows:
[1] "[Header]"
[2] "GSGT Version\t1.9.4"
[3] "Processing Date\t7/6/2012 11:41 AM"
[4] "Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa"
[5] "Num SNPs\t5858"
[6] "Total SNPs\t5858"
[7] "Num Samples\t132"
[8] "Total Samples\t132"
[9] "[Data]"
[10] "SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw"
[11] "rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378"
[12] "rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192"
[13] "rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391"
[14] "rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"
[15] "rs1517342\t106N\t0.4839\tA\tB\t2\t169218217\t0.4839\t316\t210"
[16] "rs1517343\t106N\t0.5980\tA\tB\t2\t169218519\t0.5980\t312\t165"
[17] "rs1868071\t106N\t0.5518\tA\tB\t2\t30219358\t0.5518\t355\t229"
[18] "rs761162\t106N\t0.6923\tA\tB\t1\t13733834\t0.6923\t315\t257"
[19] "rs911903\t106N\t0.6053\tA\tA\t1\t46982589\t0.6096\t383\t158"
[20] "rs753646\t106N\t0.6676\tA\tB\t1\t208765509\t0.6688\t341\t169"
So my question is how to check the Sample ID column for unique values with R. I already know it has something to do with unique(), but how do I select the right column in a tab-separated file?

First read the file:
sample_data <- read.table(file = "filename", sep = "\t", skip = 9, header = TRUE)
Then extract the unique values (the spaces in the column names were automatically converted to dots):
unique(sample_data[, "Sample.ID"])
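For a very large file it may help to read only the column of interest. The sketch below writes a small toy file in the same layout (the 107N row is invented for illustration) and uses colClasses to skip every column except Sample ID. It assumes the data section always has the ten columns shown in the question.

```r
# Toy file mimicking the report layout: 9 preamble lines, then the header row.
# The "107N" row is made up so that unique() has two values to report.
report <- c(
  "[Header]",
  "GSGT Version\t1.9.4",
  "Processing Date\t7/6/2012 11:41 AM",
  "Content\t\tGS0005701-OPA.opa",
  "Num SNPs\t5858",
  "Total SNPs\t5858",
  "Num Samples\t132",
  "Total Samples\t132",
  "[Data]",
  "SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw",
  "rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378",
  "rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192",
  "rs1867749\t107N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378"
)
path <- tempfile(fileext = ".txt")
writeLines(report, path)

# "NULL" in colClasses drops a column while reading; only column 2 is kept
ids <- read.table(path, sep = "\t", skip = 9, header = TRUE,
                  colClasses = c("NULL", "character", rep("NULL", 8)))
unique(ids$Sample.ID)
```

Skipping the unused columns reduces memory use, which matters more than speed for a report with thousands of SNPs per sample.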

If the data is in an R object, say it's named "Lines", then you would need to apply the sensible solution offered by cafe876 to a textConnection call or use the text= argument which was added to R in recent versions:
samp_dat <- read.table(file = textConnection(Lines), sep = "\t", skip = 9, header=TRUE)
OR:
samp_dat <- read.table(text= Lines, sep = "\t", skip = 9, header=TRUE)
Here's a test case:
Lines <-
c("[Header] ",
"GSGT Version\t1.9.4 ",
"Processing Date\t7/6/2012 11:41 AM ",
"Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa ",
"Num SNPs\t5858 ",
"Total SNPs\t5858 ",
"Num Samples\t132 ",
"Total Samples\t132 ",
"[Data] ",
"SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw",
"rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378 ",
"rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192 ",
"rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391 ",
"rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"
)

Related

How to filter specific vector using matching condition in R

I have two kinds of input vector. Each one contains either a Policy Number or a Contract Number, never both.
I need to extract whichever number is present and store it in the policy_nr variable. The code works fine if I extract each one explicitly.
But when I combine the two extractions with the OR operator, I get this error:
invalid 'x' type in 'x || y'
My Input Type 1
kk
[1] "DUPLICATE x"
[2] "ifa’-"
[3] "UPLICAT EXIDE Life"
[4] "Insurance"
[5] "02-01-2020"
[6] "Mr DEV WHITE T"
[7] "Contract Number : 123456"
My Input Type 2
kk
[1] "DUPLICATE x"
[2] "ifa’-"
[3] "UPLICAT EXIDE Life"
[4] "Insurance"
[5] "02-01-2020"
[6] "Mr CRAIG WHITE T"
[7] "Policy Number : 7890"
Code
policy_nr <- str_trim(sub(".*:", "", kk[grepl("Contract Number",kk)] ), side = c("both")) || str_trim(sub(".*:", "", kk[grepl("Policy Number",kk)] ), side = c("both"))
Just concatenate the character vector into one using paste, because the number might be at any element. You can encode the OR operator in a regular expression like this: "(Policy|Contract)"
library(tidyverse)
record1 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr DEV WHITE T",
"Contract Number : 123456"
)
record2 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr CRAIG WHITE T",
"Policy Number : 7890"
)
extract_number <- function(x) {
x %>%
paste0(collapse = " ") %>%
str_extract("(Policy|Contract) Number : [0-9]+") %>%
str_extract("[0-9]+") %>%
as.numeric()
}
extract_number(record1)
#> [1] 123456
extract_number(record2)
#> [1] 7890
Created on 2021-12-03 by the reprex package (v2.0.1)
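For reference, the one-liner in the question fails because || is a logical operator and cannot combine two character vectors. A base R sketch that stays close to the question's code, assuming exactly one element of the vector carries the label; the helper name get_policy_nr and the shortened input vectors are illustrative only:

```r
# Hypothetical helper: pick the element holding either label, then strip the label
get_policy_nr <- function(kk) {
  # The alternation (Policy|Contract) plays the role of the OR
  hit <- grep("(Policy|Contract) Number", kk, value = TRUE)
  # Remove everything up to the colon and trim the surrounding whitespace
  trimws(sub(".*:", "", hit))
}

kk1 <- c("Insurance", "02-01-2020", "Contract Number : 123456")
kk2 <- c("Insurance", "02-01-2020", "Policy Number : 7890")
get_policy_nr(kk1)
get_policy_nr(kk2)
```

The result is a character string; wrap it in as.numeric() if a number is needed.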
You can use this pattern: '(?<=(Policy|Contract) Number : ).\\d+' to extract numbers after Policy Number or Contract Number
library(tidyverse)
kk1 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr DEV WHITE T",
"Contract Number : 123456"
)
kk2 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr CRAIG WHITE T",
"Policy Number : 7890"
)
kk1 %>% paste(collapse = ' ') %>%
str_extract_all('(?<=(Policy|Contract) Number : ).\\d+')
kk2 %>% paste(collapse = ' ') %>%
str_extract_all('(?<=(Policy|Contract) Number : ).\\d+')

How to read a matrix in R with set size

I have a matrix, saved as a file (no extension) looking like this:
Peter Westons NH 54 RTcoef level B matrix from L70 Covstats.
2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01
7.41276375E-01
2.27966995E+00 2.31885465E+00 1.53558372E+00 4.87789344E-01 2.90254400E-01
2.56963125E-01
1.68120147E+00 1.53558372E+00 1.26129096E+00 8.18048022E-01 5.66120186E-01
3.23866166E-01
9.88238464E-01 4.87789344E-01 8.18048022E-01 1.38558423E+00 1.21272607E+00
7.20283781E-01
8.38279026E-01 2.90254400E-01 5.66120186E-01 1.21272607E+00 1.65314082E+00
1.35926028E+00
7.41276375E-01 2.56963125E-01 3.23866166E-01 7.20283781E-01 1.35926028E+00
1.74777330E+00
How do I read this in as a fixed 6*6 matrix, skipping the header line? I don't see an option for the number of columns in read.matrix. I tried the scan() -> matrix() route, but I can't read in the file because the skip parameter in scan() doesn't seem to work. I feel there must be a simple way to do this.
My original file is larger: each matrix row is spread over 17 full lines of 5 elements plus 1 line of 1 element. An example of what needs to go in one row:
[1] " 2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01"
[2] " 7.41276375E-01 5.23588785E-01 1.09559244E-01 -9.58430529E-02 -3.24544839E-02"
[3] " 1.96694874E-02 3.39249911E-02 1.54438478E-02 2.38380549E-03 9.59475077E-03"
[4] " 8.02748175E-03 1.63922615E-02 4.51778592E-04 -1.32080759E-02 -2.06313988E-02"
[5] " -1.56037533E-02 -3.35496588E-03 -4.22450803E-03 -3.17468525E-03 3.23012615E-03"
[6] " -8.68914773E-03 -5.94151619E-03 2.34059840E-04 -2.76737270E-03 -4.90334584E-03"
[7] " 1.53812087E-04 5.69891977E-03 5.33816835E-03 3.32982333E-03 -2.62856968E-03"
[8] " -5.15188677E-03 -4.47782553E-03 -5.49510247E-03 -3.71780229E-03 9.80192203E-04"
[9] " 4.18101180E-03 5.47513662E-03 4.14679058E-03 -2.81461574E-03 -4.67580613E-03"
[10] " 3.41841523E-04 4.07771227E-03 7.06154094E-03 6.61650765E-03 5.97925136E-03"
[11] " 3.92987162E-03 1.72895946E-03 -3.47249017E-03 9.90977857E-03 -2.36066909E-31"
[12] " -8.62803933E-32 -1.32472387E-31 -1.02360189E-32 -5.11800943E-33 -4.16409844E-33"
[13] " -5.11800943E-33 -2.52126889E-32 -2.52126889E-32 -4.16409844E-33 -4.16409844E-33"
[14] " -5.11800943E-33 -5.11800943E-33 -4.16409844E-33 -2.52126889E-32 -2.52126889E-32"
[15] " -2.52126889E-32 -1.58614773E-33 -1.58614773E-33 -2.55900472E-33 -1.26063444E-32"
[16] " -7.93073863E-34 -1.04102461E-33 -3.19875590E-34 -3.19875590E-34 -3.19875590E-34"
[17] " -2.60256152E-34 -1.30128076E-34 0.00000000E+00 1.78501287E-02 -1.14423068E-11"
[18] " 3.00625863E-02"
So the full matrix should be 86*86.
Thanks a bunch
Try this option:
Read the file with readLines and drop the header line with [-1].
Paste every pair of consecutive lines together, split the values on whitespace, and turn each pair into one row of 6 values.
Combine the rows into one matrix with do.call(rbind, ..).
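rows <- readLines('filename')[-1]  # drop the header line
# Group the lines in pairs; each pair becomes one matrix row of 6 values
result <- do.call(rbind,
  tapply(rows, ceiling(seq_along(rows)/2), function(x)
    strsplit(paste0(trimws(x), collapse = ' '), '\\s+')[[1]]))
mode(result) <- "numeric"  # strsplit returns character; convert to numbers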
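Because the values are simply wrapped across lines, another possibility is to read all numbers in one go and reshape them, which does not depend on how many lines each row occupies. A minimal sketch, assuming the header is exactly one line and the matrix is stored row by row; the toy file and its numbers are made up:

```r
# Toy file: one header line, then a 2 x 3 matrix wrapped over several lines
path <- tempfile()
writeLines(c("some header line",
             " 1.0E+00 2.0E+00",
             " 3.0E+00",
             " 4.0E+00 5.0E+00",
             " 6.0E+00"), path)

n <- 3                                       # number of columns (86 for the full file)
vals <- scan(path, skip = 1, quiet = TRUE)   # all numbers after the header line
stopifnot(length(vals) %% n == 0)            # guard against a wrong column count
m <- matrix(vals, ncol = n, byrow = TRUE)
m
```

If scan() fails on the real file, inspecting readLines(path, n = 3) first should show whether the header spans more than one line.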

xpathSApply webscrape returns NULL

Using my trusty firebug and firepath plug-ins I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.
First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5 or I'm misunderstanding what you are looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
html_text(trim = T)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
html_nodes("font > table font") %>%
html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile than that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
This looks for the "1st Sec." column, jumps up to its row, and then grabs the first td value of every other following row.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "
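The column-lookup idea can be tried offline on a small table. The sketch below uses made-up HTML whose column names mimic the page; it has no alternating rows, so the position() mod 2 filter is left out. The variable names col and xp are illustrative.

```r
library(XML)

# Toy table (invented markup): a header row followed by two data rows
html <- paste0(
  "<table>",
  "<tr><td>Horse</td><td>1st Sec.</td><td>3rd Sec.</td></tr>",
  "<tr><td>A</td><td>29.4</td><td>23.3</td></tr>",
  "<tr><td>B</td><td>28.7</td><td>23.7</td></tr>",
  "</table>")
doc <- htmlParse(html, asText = TRUE)

col <- "3rd Sec."
# Find the header cell, go up to its row, then take from each following row
# the cell whose position is (number of cells before the header cell) + 1
xp <- sprintf(
  "//tr/td[starts-with(.,'%s')]/../following-sibling::*/td[count(//tr/td[starts-with(.,'%s')]/preceding-sibling::*)+1]",
  col, col)
times <- xpathSApply(doc, xp, xmlValue)
times
```

Changing col to "1st Sec." selects the second column with the same expression.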

knitr vs. interactive R behaviour

I am reposting my problem here, after I noticed that was the approach advised by knitr's author to get more help.
I am a bit puzzled by a .Rmd file that I can run line by line in an interactive R session, and also with R CMD BATCH, but that fails when using knit("test.Rmd"). I am not sure where the problem lies, and I tried to narrow it down as much as I could. Here is the example (in test.Rmd):
```{r Rinit, include = FALSE, cache = FALSE}
opts_knit$set(stop_on_error = 2L)
library(adehabitatLT)
```
The functions to be used later:
```{r functions}
ld <- function(ltraj) {
if (!inherits(ltraj, "ltraj"))
stop("ltraj should be of class ltraj")
inf <- infolocs(ltraj)
df <- data.frame(
x = unlist(lapply(ltraj, function(x) x$x)),
y = unlist(lapply(ltraj, function(x) x$y)),
date = unlist(lapply(ltraj, function(x) x$date)),
dx = unlist(lapply(ltraj, function(x) x$dx)),
dy = unlist(lapply(ltraj, function(x) x$dy)),
dist = unlist(lapply(ltraj, function(x) x$dist)),
dt = unlist(lapply(ltraj, function(x) x$dt)),
R2n = unlist(lapply(ltraj, function(x) x$R2n)),
abs.angle = unlist(lapply(ltraj, function(x) x$abs.angle)),
rel.angle = unlist(lapply(ltraj, function(x) x$rel.angle)),
id = rep(id(ltraj), sapply(ltraj, nrow)),
burst = rep(burst(ltraj), sapply(ltraj, nrow)))
class(df$date) <- c("POSIXct", "POSIXt")
attr(df$date, "tzone") <- attr(ltraj[[1]]$date, "tzone")
if (!is.null(inf)) {
nc <- ncol(inf[[1]])
infdf <- as.data.frame(matrix(nrow = nrow(df), ncol = nc))
names(infdf) <- names(inf[[1]])
for (i in 1:nc) infdf[[i]] <- unlist(lapply(inf, function(x) x[[i]]))
df <- cbind(df, infdf)
}
return(df)
}
ltraj2sldf <- function(ltr, proj4string = CRS(as.character(NA))) {
if (!inherits(ltr, "ltraj"))
stop("ltr should be of class ltraj")
df <- ld(ltr)
df <- subset(df, !is.na(dist))
coords <- data.frame(df[, c("x", "y", "dx", "dy")], id = as.numeric(row.names(df)))
res <- apply(coords, 1, function(dfi) Lines(Line(matrix(c(dfi["x"],
dfi["y"], dfi["x"] + dfi["dx"], dfi["y"] + dfi["dy"]),
ncol = 2, byrow = TRUE)), ID = format(dfi["id"], scientific = FALSE)))
res <- SpatialLinesDataFrame(SpatialLines(res, proj4string = proj4string),
data = df)
return(res)
}
```
I load the object and apply the `ltraj2sldf` function:
```{r fail}
load("tr.RData")
juvStp <- ltraj2sldf(trajjuv, proj4string = CRS("+init=epsg:32617"))
dim(juvStp)
```
Using knit("test.Rmd") fails with:
label: fail
Quitting from lines 66-75 (test.Rmd)
Error in SpatialLinesDataFrame(SpatialLines(res, proj4string =
proj4string), (from <text>#32) :
row.names of data and Lines IDs do not match
Using the call directly in the R console after the error occurred works as expected...
The problem is related to the way format produces the ID (in the apply call of ltraj2sldf), just before ID 100,000: using an interactive call, R gives "99994", "99995", "99996", "99997", "99998", "99999", "100000"; using knitr R gives "99994", " 99995", " 99996", " 99997", " 99998", " 99999", "100000", with additional leading spaces.
Is there any reason for this behaviour to occur? Why should knitr behave differently from a direct call in R? I have to admit I'm having a hard time with this one, since I cannot debug it (it works in an interactive session)!
Any hint will be much appreciated. I can provide the .RData if it helps (the file is 4.5 MB), but I'm mostly interested in why such a difference happens. I tried without success to come up with a self-contained reproducible example, sorry about that. Thanks in advance for any contribution!
After a comment of baptiste, here are some more details about IDs generation. Basically, the ID is generated at each line of the data frame by an apply call, which in turn uses format like this: format(dfi["id"], scientific = FALSE). Here, the column id is simply a series from 1 to the number of rows (1:nrow(df)). scientific = FALSE is just to ensure that I don't have results such as 1e+05 for 100000.
Based on an exploration of the ID generation, the problem only occurred for those presented in the first message, i.e. 99995 to 99999, for which a leading space is added. This should not happen with this format call, since I did not ask for a specific number of digits in the output. For instance:
> format(99994:99999, scientific = FALSE)
[1] "99994" "99995" "99996" "99997" "99998" "99999"
However, if the IDs are generated in chunks, it might occur:
> format(99994:100000, scientific = FALSE)
[1] " 99994" " 99995" " 99996" " 99997" " 99998" " 99999" "100000"
Note that the same processed one at a time gives the expected result:
> for (i in 99994:100000) print(format(i, scientific = FALSE))
[1] "99994"
[1] "99995"
[1] "99996"
[1] "99997"
[1] "99998"
[1] "99999"
[1] "100000"
In the end, it is exactly as if the IDs were not prepared one at a time (as I would expect from an apply call by line), but in this case, 6 at a time, and only when close to 1e+05... And of course, only when using knitr, not interactive or batch R.
Here is my session information:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.2 adehabitatLT_0.3.12 CircStats_0.2-4
[4] boot_1.3-9 MASS_7.3-27 adehabitatMA_0.3.6
[7] ade4_1.5-2 sp_1.0-11 basr_0.5.3
loaded via a namespace (and not attached):
[1] digest_0.6.3 evaluate_0.4.4 formatR_0.8 fortunes_1.5-0
[5] grid_3.0.1 lattice_0.20-15 stringr_0.6.2 tools_3.0.1
Both Jeff and baptiste were indeed right! This is an options problem, related to the digits option. I managed to come up with a working minimal example (e.g. in test.Rmd):
Simple reproducible example : df1 is a data frame of 110,000 rows,
with 2 random normal variables + an `id` variable which is a series
from 1 to the number of row.
```{r example}
df1 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
```
From this, we create a `id2` variable using `format` and `scientific =
FALSE` to have results with all numbers instead of scientific
notations (e.g. 100,000 instead of 1e+05):
```{r example-continued}
df1$id2 <- apply(df1, 1, function(dfi) format(dfi["id"], scientific = FALSE))
df1$id2[99990:100010]
```
It works as expected using R interactively, resulting in:
[1] "99990" "99991" "99992" "99993" "99994" "99995" "99996"
[8] "99997" "99998" "99999" "100000" "100001" "100002" "100003"
[15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
However, the results are quite different using knit:
> library(knitr)
> knit("test.Rmd")
[...]
## [1] "99990" "99991" "99992" "99993" "99994" " 99995" " 99996"
## [8] " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
## [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
Note the additional leading spaces after 99994. The difference actually comes from the digits option, as rightly suggested by Jeff: R uses 7 by default, while knitr uses 4. This difference affects the output of format, although I don't really understand what's going on here. R-style:
> options(digits = 7)
> format(99999, scientific = FALSE)
[1] "99999"
knitr-style:
> options(digits = 4)
> format(99999, scientific = FALSE)
[1] " 99999"
But it should affect all numbers, not just after 99994 (well, to be honest, I don't even understand why it's adding leading spaces at all):
> options(digits = 4)
> format(c(1:10, 99990:100000), scientific = FALSE)
[1] " 1" " 2" " 3" " 4" " 5" " 6" " 7"
[8] " 8" " 9" " 10" " 99990" " 99991" " 99992" " 99993"
[15] " 99994" " 99995" " 99996" " 99997" " 99998" " 99999" "100000"
From this, I have no idea which is at fault: knitr, apply or format? At least, I came up with a workaround, using the argument trim = TRUE in format. It doesn't solve the cause of the problem, but did remove the leading space in the results...
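The trim = TRUE workaround can be checked outside knitr by setting the digits option by hand. A small sketch, assuming the IDs are whole numbers; sprintf is shown as an alternative that does not consult the digits option at all:

```r
old <- options(digits = 4)   # mimic the setting knitr applies inside chunks
ids <- c(99994, 99999, 100000)

# One value at a time, as apply() does row by row; trim = TRUE drops any padding
a <- vapply(ids, function(i) format(i, scientific = FALSE, trim = TRUE),
            character(1))

# Alternative for whole-number IDs: format them as integers
b <- sprintf("%d", as.integer(ids))

options(old)                 # restore the previous setting
a
b
```

Either form gives IDs that compare equal to the row names regardless of the digits setting.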
I added a comment to your knitr GitHub issue with this information.
format() adds the extra whitespace when the digits option is not sufficient to display a value but scientific=FALSE is also specified. knitr sets digits to 4 inside code blocks, which causes the behavior you describe:
options(digits=4)
format(99999, scientific=FALSE)
Produces:
[1] " 99999"
While:
options(digits=5)
format(99999, scientific=FALSE)
Produces:
[1] "99999"
Thanks to Aleksey Vorona and Duncan Murdoch, this bug is now fixed in R-devel!
See: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15411

Splitting a string in R, different split argument elements

I imported some data with no column names, so now I have just over a million rows, and 1 column (instead of 5 columns).
Each row is formatted like this:
x <- "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
strsplit( x , split = c(" ", " ", "%", " "))
I tried the strsplit call above and got:
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct" "19"
[5] "2012" "23:59:01:"
[7] "%FWSM-6-305011:" "Built"
[9] "dynamic" "tcp"
[11] "translation" "from"
[13] "Inside:10.2.45.62/56455" "to"
[15] "outside:192.101.136.224/9874"
I know that it has to do with recycling the split argument, but I can't figure out how to get the result I want:
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01" "%FWSM-6-305011"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
Each row has a different message as the fifth element, but after the 4th element I just want to keep the rest of the string together.
Any help would be appreciated.
You can use paste with the collapse argument to combine every element starting with the fifth element.
A <- strsplit( x = "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874", split = c(" ", " ", "%", " "))
c(A[[1]][1:4], paste(A[[1]][5:length(A[[1]])], collapse=" "))
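As @DWin points out, the elements of split = c(" ", " ", "%", " ") are not applied in sequence within a string. strsplit recycles split along x, one element per input string, so with a single string only the first element, " ", is used.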
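If every row follows the layout shown in the question, one regular expression with capture groups can pull out the five fields in a single pass. This is a sketch under that assumption; the pattern and the name fields are illustrative and not taken from the answers here.

```r
x <- "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"

# Groups: timestamp, host, everything before the "%" token, the "%" token, the rest
pattern <- "^(\\S+) (\\S+) ([^%]*) (%\\S+) (.*)$"
m <- regmatches(x, regexec(pattern, x))

# Drop the full match and keep the five captured groups for each input string
fields <- lapply(m, function(g) g[-1])
fields[[1]]
```

Both regexec and regmatches are vectorised over x, so the same call works on the whole column; rows that do not match the pattern come back as empty vectors.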
I don't think you need strsplit here. You can use read.table with the text argument to read the lines, then combine columns with paste. Since you have a lot of rows, it is better to do the column aggregation within a data.table.
dt <- read.table(text=x)
library(data.table)
DT <- as.data.table(dt)
DT[ , c('V3','V8') := list(paste(V3,V4,V5),
V8=paste(V8,V9,V10,V11,V12,V13,V14,V15))]
DT[,paste0('V',c(1:3,6:7,8)),with=FALSE]
V1 V2 V3 V6 V7
1: 2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011:
V8
1: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874
Here is a function that I think works in the way that you thought strsplit functioned:
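split.seq <- function(x, delimiters) {
  # Position of the first match of the current delimiter in each string
  break.point <- regexpr(delimiters[1], x)
  # Text before and after that delimiter
  first <- mapply(substring, x, 1, break.point - 1, USE.NAMES = FALSE)
  second <- mapply(substring, x, break.point + 1, nchar(x), USE.NAMES = FALSE)
  # Last delimiter: return the two pieces for each string
  if (length(delimiters) == 1)
    return(lapply(seq_along(first), function(i) c(first[i], second[i])))
  # Otherwise split the remainder on the remaining delimiters, in order
  mapply(function(a, b) c(a, b), first, split.seq(second, delimiters[-1]),
         USE.NAMES = FALSE, SIMPLIFY = FALSE)
}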
A test:
x<-rep(x,2)
delimiters=c(" ", " ", "%", " ")
split.seq(x,delimiters)
[[1]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
[[2]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
