How to separate a text file into columns - r

This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values are continuous and each character indicates something. I am unable to figure out how to separate each value into columns and specify a heading for all these new columns which will be created.
I used this code, however it does not seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngthā€¯))
Any help would be appreciated.

Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In R, we need to scape "\" that's why we write \d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the text matched, and the following columns the data captured. So, we can get rid of the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now, every column in result is of type character, you can transform each of them into other types. For example:
require(dplyr)
result <- result %>% mutate(ethnic=as.factor(ethnic),
age=as.integer(age),
smoke=as.factor(smoke),
preweight=as.numeric(preweight),
delweight=as.numeric(delweight),
breastfed=as.factor(breastfed),
brthwght=as.integer(brthwght),
brthlngth=as.numeric(brthlngth)
)

Related

Splicing names in a list, and adding characters to a specific string?

Hello there: I currently have a list of file names (100s) which are separated by multiple "/" at certain points. I would like to find the last "/" in each name and replace it with "/Old". A quick example of what I have tried:
I have managed to do it for a single file name in the list but can't seem to apply it to the whole list.
Test<- "Cars/sedan/Camry"
Then I know I tried finding the last "/" in the name I tried the following :
Last <- tail(gregexpr("/", Test)[[1]], n= 1)
str_sub(Test, Last, Last)<- "/Old"
Which gives me
Test[1] "Cars/sedan/OldCamry"
Which is exactly what I need but I am having troubling applying tail and gregexpr to my list of names so that it does it all at the same time.
Thanks for any help!
Apologies for my poor formatting still adjusting.
If your file names are in a character vector you can use str_replace() from the stringr package for this:
items <- c(
"Cars/sedan/Camry",
"Cars/sedan/XJ8",
"Cars/SUV/Cayenne"
)
stringr::str_replace(items, pattern = "([^/]+$)", replacement = "Old\\1")
[1] "Cars/sedan/OldCamry" "Cars/sedan/OldXJ8" "Cars/SUV/OldCayenne"
Keeping a stringi function as an alternative.
If your dataframe is "df" and your text is in column named "text.
library(stringi)
df %>%
mutate(new_text=stringi::stri_replace_last_fixed(text, '/', '/Old '))

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try the str_extract method which utilizes the Stringr package.
If your data is separated by line, you can do:
str_extract("(?<=\\bsample_id=)([:digit:]+)") #this tells the extraction to target anything that is proceeded by a sample_id= and is a series of digits, the + captures all of the digits
This would extract just the numbers per line, if your data is all collected like that, it becomes a tad more difficult because you will have to tell the extraction to continue even if it has extracted something. The code would look something like this:
str_extract_all("((?<=sample_id=)\\d+)")
This code will extract all of the numbers you're looking for and the output will be a list. From there you can manipulate the list as you see fit.

Assign names to list elements without titled quotes

I am interested to assign names to list elements. To do so I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with ``.
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to receive the latter naming pattern without quotes?
gsub() and paste () seem to produce the same class () object. What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll realize you have the same problem when invoking the naming pattern. It basically has to do with the fact that R doesn't want to be confused about whether a number is really a number or a variable because that would be chaos (how can 1 refer to a variable and also the number 1?), so what it does in such cases is change a number 1 into the character "1", which can be given names. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
Should do the trick (or just change the "file_" into whatever you want as long as it's not empty, cause then you just have numbers left and the same problem)!

Changing column names or exceptions to strsplit

I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R such that the for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically names. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I run the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to split by . and extract all the pieces except for the last one:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
function(x) paste(x[1:(length(x)-1)], collapse=""))))
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=F,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$",replace="",unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column re-naming with use of the check.names=FALSE argument to read.table. If you refer explicitly to columns you'll need to provide a back-quoted strings
df$`Penta C`

R - Remove commas from values in a column and place separated values into new rows

I have a column of gene symbols that I have retrieved directly from a database, and some of the rows contain two or more symbols which are comma separated (see example below).
SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26
I would like to remove the commas, and place the separated symbols into new rows like so:
SLC6A13
ATP5J2-PTCD1
BUD31
PTCD1
ACOT7
BUD3
PDAP1
TTC26
I haven't been able to find a straight forward way to do this in R, does anyone have any suggestions?
You can use this vector result to put into a matrix or a data.frame:
vec <- scan(text="SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26", what=character(), sep=",")
Read 8 items
vec
[1] "SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7" "BUD31" "PDAP1"
[8] "TTC26"
Perhaps:
as.matrix(vec)
(The scan function can also read from files. The "text" parameter was only added relatively recently, but it saves typing file=textConnection("...").)
Another option is to use readLines and strsplit :
unlist(strsplit(readLines(textConnection(txt)),','))
"SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7"
"BUD31" "PDAP1" "TTC26"

Resources