grep-like function in R

I'm trying to write a program in R which would take in a .pdb file and give out a .xyz-file.
I'm having problems with erasing some rows that contain useless data. There are around 30-40 thousand rows, from which I would only need about 3000. The rows that contain the useful information start with the word "ATOM".
In unix terminal I would just use the command
grep ATOM < filename.pdb > newfile.xyz
but I have no idea how to achieve the same result with R.
Thank you for your help!

You should be able to use grep, and depending on your specific situation, perhaps substr.
For example
# Example character vector
stringVar <- c("abcdefg", "defg", "eff", "abc")
# Find the positions of the elements that start with "abc"
abcLoc <- grep("abc", substr(stringVar, 1, 3))
# Extract the matching elements
out <- stringVar[abcLoc]
out
Note that the substr part limits the search to only the first three characters of each element of stringVar (e.g., "abc", "def", etc.). This may not be strictly necessary but I've found it to be very useful at times. For example, if you had an element like "defabc" that you didn't want to include, using substr would ensure it wouldn't be "found" by grep.
Hope it's helpful.
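Applied to the file from the question, a minimal sketch could look like the following. It assumes the goal is only to copy the lines that begin with "ATOM", as the shell command does; the sample lines and temporary file names are made up for illustration, and a real .xyz file would need further reformatting. Anchoring the pattern with ^ plays the same role as substr above.

```r
# Hypothetical input: a few PDB-style lines written to a temporary file.
pdb_file <- tempfile(fileext = ".pdb")
writeLines(c(
  "HEADER    EXAMPLE",
  "ATOM      1  N   MET A   1      27.340  24.430   2.614",
  "HETATM   10  O   HOH A 101      10.000  11.000  12.000",
  "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
  "REMARK ATOM appears here but not at the start",
  "END"
), pdb_file)

# Read every line, then keep only those that start with "ATOM".
# value = TRUE returns the matching lines instead of their positions.
pdb_lines <- readLines(pdb_file)
atom_lines <- grep("^ATOM", pdb_lines, value = TRUE)

# Write the kept lines to a new file.
xyz_file <- tempfile(fileext = ".xyz")
writeLines(atom_lines, xyz_file)
```

startsWith(pdb_lines, "ATOM") would select the same lines without a regular expression.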

Related

How to modify i in an R loop?

I have several large R objects saved as .RData files: "this.RData", "that.RData", "andTheOther.RData" and so on. I don't have enough memory, so I want to load each in a loop, extract some rows, and unload it. However, once I load(i), I need to strip the ".RData" part of (i) before I can do anything with objects "this", "that", "andTheOther". I want to do the opposite of what is described in How to iterate over file names in a R script? How can I do that? Thx
Edit: I omitted to mention the files are not in the working directory and have a filepath as well. I came across Getting filename without extension in R and file_path_sans_ext takes out the extension but the rest of the path is still there.
Do you mean something like this?
i <- c("/path/to/this.RDat", "/another/path/to/that.RDat")
f <- gsub(".*/([^/]+)", "\\1", i)
f1 <- gsub("\\.RDat", "", f)
f1
[1] "this" "that"
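For Windows paths written with backslashes, match "\\\\" in the pattern instead of "/" (a literal backslash has to be escaped twice: once for the R string and once for the regexp).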
Edit: Explanation. Technically, these are called "regular expressions" (regexps), not "patterns".
. any character
.* arbitrary number (including 0) of any kind of characters
.*/ arbitrary number of any kind of characters, followed by a /
[^/] any character but not /
[^/]+ arbitrary number (1 or more) of any kind of characters, but not /
( and ) enclose groups. You can use the groups when replacing as \\1, \\2 etc.
So, look for any kind of character, followed by /, followed by anything but not the path separator. Replace this with the "anything but not separator".
There are many good tutorials for regexps; just search for one.
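To see the pieces of that regexp at work, here is a small sketch. The example paths are hypothetical, and converting backslashes to forward slashes with chartr first is just one way to handle Windows-style paths, not the only one.

```r
# Hypothetical paths, one Unix-style and one Windows-style.
paths <- c("/path/to/this.RData", "C:\\data\\that.RData")

# Turn every backslash into a forward slash so one pattern covers both.
normalized <- chartr("\\", "/", paths)

# ".*/" consumes everything up to the last slash; the group "([^/]+)"
# captures the file name, which "\\1" puts back as the replacement.
file_names <- sub(".*/([^/]+)$", "\\1", normalized)

# Drop the extension; "\\." matches a literal dot and "$" anchors at the end.
object_names <- sub("\\.RData$", "", file_names)
object_names
```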
A simple way to do this is to extract the base name from the file paths with base::basename() and then remove the file extension with tools::file_path_sans_ext().
paths_to_files <- c("./path/to/this.RData", "./another/path/to/that.RData")
tools::file_path_sans_ext(
  basename(
    paths_to_files
  )
)
## Returns:
## [1] "this" "that"
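Putting this into the loop from the question, a possible sketch follows. It assumes each .RData file holds one data frame named after the file; the demo objects, the temporary directory, and the choice to keep the first two rows are all made up for illustration. Loading into a throwaway environment is one way to make unloading easy.

```r
# Hypothetical setup: save two small objects so the sketch is self-contained.
tmp_dir <- tempfile("rdata_demo")
dir.create(tmp_dir)
this <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
that <- data.frame(id = 1:3, value = c(7, 8, 9))
save(this, file = file.path(tmp_dir, "this.RData"))
save(that, file = file.path(tmp_dir, "that.RData"))
rm(this, that)

paths_to_files <- file.path(tmp_dir, c("this.RData", "that.RData"))
first_rows <- list()

for (path in paths_to_files) {
  # Derive the object name from the file name, without path or extension.
  obj_name <- tools::file_path_sans_ext(basename(path))

  # Load into a temporary environment so the object is easy to discard.
  env <- new.env()
  load(path, envir = env)
  obj <- get(obj_name, envir = env)

  # Extract the rows of interest (here, arbitrarily, the first two).
  first_rows[[obj_name]] <- obj[1:2, ]

  # Drop the large object before loading the next file.
  rm(env, obj)
}
```

load() also returns the names of the objects it restored, which can be used instead of the file name when the two do not match.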

How to replace all values starting with certain characters in R with one value?

I hope this isn't a duplicate, I was unable to find a question that refers to the exact same issue.
I have a data frame in R, where within one column (let's call it 'Task') there are 170 items named EC1:EC170, I would like to replace them so that they just say 'EC' and don't have a number following them.
The important thing is that this column also has other types of values, that do not start with EC, so I don't just want to change the names of all values in the column, but only those that start with 'EC'.
In linux I would use 'sed' and replace 'EC*' with 'EC', but I don't know how to do that in R.
Rich Scriven's startswith suggestion worked great, I did write df$task instead of just 'task'. Thanks a lot! this is what I used: df$task[startsWith(df$task, "EC")] <- "EC"
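I'd recommend a regex as well. You want to match values that start with "EC" followed by 1 to 3 digits and replace them with "EC". Anchoring the pattern with ^ and $ keeps it from touching values that merely contain such a string:
df$Task = sub("^EC\\d{1,3}$", "EC", df$Task)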
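The two approaches are not quite equivalent, which a small sketch can show. The data frame below is invented for illustration; the column name Task follows the question.

```r
# Hypothetical data: some EC items plus values that should stay untouched.
df <- data.frame(
  Task = c("EC1", "EC170", "Other", "XEC12", "ECabc"),
  stringsAsFactors = FALSE
)

# Prefix test: every value that starts with "EC" becomes "EC".
by_prefix <- df$Task
by_prefix[startsWith(by_prefix, "EC")] <- "EC"

# Anchored regexp: only "EC" followed by nothing but digits becomes "EC".
by_regex <- sub("^EC\\d+$", "EC", df$Task)

by_prefix
by_regex
```

The prefix test also rewrites "ECabc", while the anchored regexp leaves it alone; which behaviour is wanted depends on the data.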

Finding number of occurrences of a word in a file using R functions

I am using the following code for finding number of occurrences of a word memory in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"!
NOTE2: What I have realized is that they are looking for exactly "memory"; even something like "memory," is not accepted! I guess that was the part which caused the confusion. I tried it for the word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems OK:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
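Since the question asks for exact occurrences, it may help to compare a substring match with an exact one. The sketch below uses a made-up sentence rather than the Hamlet text, so the counts say nothing about the real file.

```r
# Hypothetical text with apostrophes and punctuation attached to words.
txt <- "the memory of it, 'tis in my memory, and memory's end"

# quote = "" makes scan treat quotation marks as ordinary characters.
words <- scan(text = txt, what = character(), quote = "", quiet = TRUE)

# Substring match: counts every token that contains "memory".
n_substring <- length(grep("memory", words))

# Exact match: counts only tokens that are exactly "memory".
n_exact <- sum(words == "memory")

# Exact match after stripping punctuation, so "memory," also counts.
n_no_punct <- sum(gsub("[[:punct:]]", "", words) == "memory")

c(substring = n_substring, exact = n_exact, no_punct = n_no_punct)
```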

R programming, naming output file using variable

I would like to direct output to a file, using a write.csv statement. I am wanting to write 16 different output files, labeling each one with the extension 1 through 16.
Example as written now:
trackfilenums <- 1:16
for (i in trackfilenums) {
  # calculations etc.
  write.csv(max.hsi, 'Severity_Index.csv', row.names=F)
}
I would like for the output csv files to be labeled 'Severity_Index_1.csv', 'Severity_Index_2.csv', etc. Not sure how to do this in R language.
Thanks!
Kimberly
You will want to use the paste0 function:
write.csv(max.hsi, paste0("Severity_Index_", i,".csv"), row.names=F)
Some people like to have file names like Name_01 Name_02 etc instead of Name_1 Name_2 etc. This may, for example, make the alphabetical order more reasonable: with some software, otherwise, 10 would come after 1, 20 after 2, etc.
This kind of numbering can be achieved with sprintf:
sprintf("Severity_Index_%02d.csv", 7)
The interesting part is %02d: it says that the value is an integer (you could use %02i as well) that takes at least 2 positions, padded with a leading zero if necessary.
# try also
sprintf("Severity_Index_%03d.csv", 7)
sprintf("Severity_Index_%2d.csv", 7)
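In the loop from the question this could look like the sketch below. The placeholder data frame, the temporary output directory, and the shortened range 1:3 are assumptions made to keep the example self-contained.

```r
# Hypothetical output directory so the sketch does not clutter the working directory.
out_dir <- tempfile("severity_demo")
dir.create(out_dir)

track_file_nums <- 1:3
for (i in track_file_nums) {
  # Placeholder for the real calculations.
  max_hsi <- data.frame(track = i, hsi = i * 1.5)

  # Build a zero-padded file name such as "Severity_Index_01.csv".
  out_name <- sprintf("Severity_Index_%02d.csv", i)
  write.csv(max_hsi, file.path(out_dir, out_name), row.names = FALSE)
}

written <- sort(list.files(out_dir))
written
```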
To add to the other answers here, I find it's also a good idea to sanitise the pasted string to make sure it is ok for the file system. For that purpose I have the following function:
fsSafe <- function(string) {
safeString <- gsub("[^[:alnum:]]", "_", string)
safeString <- gsub("_+", "_", safeString)
safeString
}
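This replaces every character that is not a letter or a digit with an underscore and then collapses runs of underscores into one. Note that the dot before a file extension is replaced too, so apply the function to the base name before adding ".csv".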
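A short usage sketch, with a made-up label, shows what the function does to typical problem characters and to an extension:

```r
fsSafe <- function(string) {
  # Replace anything that is not a letter or digit with an underscore.
  safeString <- gsub("[^[:alnum:]]", "_", string)
  # Collapse runs of underscores into a single one.
  safeString <- gsub("_+", "_", safeString)
  safeString
}

# Hypothetical label containing spaces, a colon, a slash and parentheses.
label <- "run 7: max/HSI (v2)"
file_name <- paste0(fsSafe(label), ".csv")
file_name

# Sanitising a full file name also replaces the dot of the extension.
fsSafe("Severity_Index_1.csv")
```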

Using lists in R

Sorry for possibly a complete noob question but I have just started programming with R today and I am stuck already.
I am reading some data from a file which is in the format.
3.482373 8.0093238198371388 47.393873
0.32 20.3131 31.313
What I want to do is split each line then deal with each of the individual numbers.
I have imported the stringr package and using
x = str_split(line, " ")
This produces a list which I would like to index but don't know how.
I have learnt that x[[1:2]] gets the second element but that is about it. Ideally I would like something like
x1 = x[1]
x2 = x[2]
x3 = x[3]
But can't find anyway of doing this.
Thanks in advance
By using unlist you will get a vector instead of a list of vectors, and you will then be able to index it directly:
R> unlist(str_split("foo bar baz", " "))
[1] "foo" "bar" "baz"
But maybe you should read your file directly with read.table or one of its variants?
And if you are beginning with R, you really should read one of the available introductions if you want to understand subsetting, indexing, etc.
You can wrap your call to str_split with unlist to get the behavior you're looking for.
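The indexing itself can be sketched as follows. To stay self-contained this uses base R's strsplit rather than stringr's str_split; both return a list with one character vector per input string, so the indexing is the same.

```r
line <- "3.482373 8.0093238198371388 47.393873"

# strsplit returns a list with one element per input string.
parts <- strsplit(line, " ", fixed = TRUE)

# [[1]] extracts the character vector for the first (and only) string.
fields <- parts[[1]]

# Convert the pieces to numbers and index them like any vector.
values <- as.numeric(fields)
x1 <- values[1]
x2 <- values[2]
x3 <- values[3]
```

Single brackets, as in parts[1], would return a list of length one rather than the vector inside it.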
The usual way to get this in would be to import it into a data frame (a special sort of list). If the file name is "fil.dat" and it is in "C:/dir/":
dfrm <- read.table("C:/dir/fil.dat") # resist the temptation to use backslashes
dfrm[2,2] # would give you the second item on the second row.
By default the field separator in R is "white-space", and that seems to be what you have, so you do not need to supply a sep= argument and the read.table function will attempt to import the columns as numeric. To be on the safe side, you might consider forcing that with colClasses=rep("numeric", 3): without it, a single strange item (such as those often produced by Excel dumps) turns the whole column into a factor variable (a character column since R 4.0.0), and you will probably not understand how to recover gracefully.
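The same call can be tried without a file by passing the two lines from the question through the text argument of read.table; this is only a sketch for experimenting, and with real data you would pass the file path instead.

```r
# The two sample lines from the question, supplied as text instead of a file.
raw <- "3.482373 8.0093238198371388 47.393873
0.32 20.3131 31.313"

# Whitespace is the default separator; colClasses forces numeric columns.
dfrm <- read.table(text = raw, colClasses = rep("numeric", 3))

dfrm[2, 2]   # second item on the second row
dfrm[1, ]    # the whole first row
dfrm$V3      # the third column; default names are V1, V2, V3
```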
