Extract the ground-truth language code from the file name - R

Would you please help me?
I need to extract the ground-truth language code from a file name.
For example, I want to extract 'en' from the file name 'Dictionary_en.txt'.
I have tried many times without success.
Many thanks in advance.

Do you need something like this?
filename <- "Dictionary_en.txt"
gsub("(.*_)|(\\.txt)", "", filename)
The output in this example is "en" (note the escaped dot, so that ".txt" is matched literally). You can build a vector of your files with list.files and then apply the same gsub call to all of them.
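For a whole folder of dictionaries, a minimal sketch (assuming the files all follow the Dictionary_<code>.txt pattern and sit in the working directory) could be:
# list the dictionary files and keep only the part between the last "_" and ".txt"
files <- list.files(pattern = "^Dictionary_.*\\.txt$")
codes <- sub(".*_([^_.]+)\\.txt$", "\\1", files)
codes
# e.g. "en" "de" for Dictionary_en.txt, Dictionary_de.txt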
Best wishes

Related

including parameter values in save.image (R)

I am interested in knowing how to include the value of a parameter in the filename when saving the workspace in R. I use MATLAB, and I am looking for something similar to this in R:
save(['database_' num2str(parametervalue) '_' num2str(parameter2value) '.mat'])
Thus, is it possible to save different workspaces without changing the name by hand?
Thanks in advance
Try this:
save.image(paste('database_', as.character(parametervalue), '_', as.character(parameter2value), '.mat', sep = ""))
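A slightly more compact variant uses sprintf to build the name in one go (a sketch, assuming both parameters are plain numbers; .RData is the usual extension for an R workspace, but save.image accepts any name):
# build the file name from the current parameter values, then save the workspace
fname <- sprintf("database_%s_%s.RData", parametervalue, parameter2value)
save.image(fname)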
Hope it helps.

gzgrep help in multiple large archives - Solaris

In Solaris I need to perform a gzgrep of archives, but I need to filter so I am not searching ALL the archives, maybe just files with '09.30-12' in the name. Then I want to search IN that particular file or files for a particular expression. I have this close, but it takes WAY too long, because it searches unnecessary files first and matches on those, and only then moves on to the October archives and finds what I need in them. Basically, I need to search any files whose filename contains 'x', then look in those files for the text 'y', and send the output to > fileoutput. Perhaps I just need to change the *.gz so it matches only a set of files? I cannot figure out how, though. Any help is MUCH appreciated.
Something like this works, but I get way too much output and it takes way too long.
gzgrep 'firstexpression' *.gz > /fileoutput.file
maybe just files with '09.30-12' in the name..
You could say:
gzgrep 'firstexpression' *09.30-12*.gz > fileoutput.file
or
gzgrep pattern_to_search *filename_pattern*.gz > outfile

Parsing an XML file manually with R

For some reason, I cannot download the R XML package at work. I have an XML file that has contents like this:
x<-read.table("info.xml")
x
</name></content></item><item id="id-123"><content><name>
</name></content></item><item id="id-456"><content><name>
</name></content></item><item id="id-5559"><content><name>
I need to pick out the values that start with 'id-' followed by the numbers, like
id-123, id-456, id-5559, etc.
I tried this:
str_extract_all(x, "id-[0-9]")
but it is only printing id-1. I really need help quickly. Any ideas?
str_extract_all(x, "id-[0-9]+")
The regular expression "id-[0-9]" is missing a "+" at the end.
There may be more issues, but that one jumps out.
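If stringr is not available either, a base-R sketch along the same lines (assuming x holds the lines of the file, e.g. from readLines("info.xml")) is:
# find every "id-" followed by one or more digits, across all lines
m <- regmatches(x, gregexpr("id-[0-9]+", x))
unique(unlist(m))
# e.g. "id-123" "id-456" "id-5559"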

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborated version you could loop over all matches and append them to one dataframe or something like that.
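That loop could be sketched like this (assuming all matching files share the same columns):
# read every file whose name starts with "abc_" and stack them row-wise
files <- dir(pattern = '^abc_')
combined <- do.call(rbind, lapply(files, read.csv))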
Note that the pattern uses regular expressions and thus is a bit different from what you did expect (and what I wrongly assumed at my first shot to answer the question). Details can be found using ?regex
If you have a directory you want to submit, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
The submitted path in your case may be c:\\users\\user\\desktop, and then the pattern as above. full.names=T forces dir() to output a whole path and not only the file name. Try running dir(...) without the read.csv to understand what is happening there.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That approach will fail if your filename contains any regular expression metacharacters. You would have to substitute them with their corresponding escape sequences upfront. But that again is another topic.
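One way to side-step the escaping issue is utils::glob2rx(), which converts a wildcard pattern into the corresponding regular expression with special characters already escaped (a sketch, reusing the abc_ prefix from above):
# glob2rx("abc_*") yields the regex "^abc_", with any dots etc. escaped for you
read.csv(dir(pattern = glob2rx("abc_*"))[1])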

How to escape or sanitize a slash using regex in R?

I'm trying to read in a (tab-separated) csv file in R. When I want to access the column whose name includes a /, I get an error.
doSomething <- function(dataset) {
a <- dataset$data_transfer.Jingle/TCP.total_size_kb
...
}
The error says that this object cannot be found. I've tried escaping with a backslash but it did not work.
If anybody has got an idea, I'd really appreciate it!
Run
head(dataset)
and check the name the column has been given. Perhaps it will be something like:
dataset$data_transfer.Jingle.TCP.total_size_kb
Two ways:
dataset[["data_transfer.Jingle/TCP.total_size_kb"]]
or
dataset$`data_transfer.Jingle/TCP.total_size_kb`
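As a small illustration of why this happens: read.delim() (like read.csv()) runs the header through make.names() by default, which turns the slash into a dot; with check.names = FALSE the original name is kept, and you then need [[ or backticks to reach the column. A sketch with a made-up two-column table:
# default: the "/" in the header is converted to "."
d1 <- read.delim(text = "a/b\tc\n1\t2")
names(d1)           # "a.b" "c"
d1$a.b
# check.names = FALSE keeps the slash, so quote the name
d2 <- read.delim(text = "a/b\tc\n1\t2", check.names = FALSE)
d2[["a/b"]]
d2$`a/b`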
