reading configuration from text file - r

I have a txt file which has entries
indexUrl=http://192.168.2.105:9200
jarFilePath = /home/soumy/lib
How can I read this file from R and get the value of jarFilePath ?
I need this to set .jaddClassPath(). I have a problem copying the jar to the classpath because of the difference in slashes between Windows and Linux.
In Linux I want to use
.jaddClassPath(dir("target/mavenLib", full.names=TRUE))
but in Windows
.jaddClassPath(dir("target\\mavenLib", full.names=TRUE))
So I am thinking of reading the location of the jar from a property file!
If there is any other alternative, please let me know that as well.

As of Sept 2016, CRAN has the package properties.
It handles = in property values correctly (but does not handle spaces after the first = sign).
Example:
Contents of properties file /tmp/my.properties:
host=123.22.22.1
port=798
user=someone
pass=a=b
R code:
install.packages("properties")
library(properties)
myProps <- read.properties("/tmp/my.properties")
Then you can access the properties like myProps$host, etc. In particular, myProps$pass is a=b as expected.

I do not know whether a package offers a specific interface.
If not, I would first load the data in a data frame using read.table:
myProp <- read.table("path/to/file/filename.txt", header=FALSE, sep="=", row.names=1, strip.white=TRUE, na.strings="NA", stringsAsFactors=FALSE)
sep="=" is obviously the separator; this will nicely separate your property names and values.
row.names=1 says the first column contains your row names, so you can index your data properties this way to retrieve each property you want.
For instance: myProp["jarFilePath", 1] will return "/home/soumy/lib" (with row.names=1 consuming the first column, the values end up in column 1).
strip.white=TRUE will strip leading and trailing spaces you probably don't care about.
One could conveniently convert the loaded data frame into a named vector for a cleaner way to retrieve the property values: myPropVec <- setNames(myProp[[1]], rownames(myProp)).
Then to retrieve a property value from its name: myPropVec["jarFilePath"] will return "/home/soumy/lib" as well.
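A self-contained sketch of that approach, writing a throwaway properties file to a temporary location (the file contents mirror the OP's example):

```r
# Write a small properties file into the session's temp directory
propFile <- file.path(tempdir(), "my.properties")
writeLines(c("indexUrl=http://192.168.2.105:9200",
             "jarFilePath = /home/soumy/lib"), propFile)

# Property names become row names; strip.white removes the stray spaces
myProp <- read.table(propFile, header = FALSE, sep = "=",
                     row.names = 1, strip.white = TRUE,
                     stringsAsFactors = FALSE)

myProp["jarFilePath", 1]
# [1] "/home/soumy/lib"
```

Since R accepts forward slashes in paths on Windows as well, the value read from the file can be passed on to .jaddClassPath() unchanged on both systems.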

Related

Saving data with a built name in R

In an R script, I assign a name to some data. The name depends on parameters. I do this using
number<-1
assign(paste("variable", as.character(number), sep=""),2)
The above accomplishes the same as variable1<-2. Now I want to save the result for later
save(?,file=paste("variable",as.character(number),".RData",sep=""))
What code can go in the ? slot where it should say variable1 except I need to construct this name using paste or some similar technique. Simply putting get(paste("variable",as.character(number),".RData",sep="")) does not work.
save can also use list as parameter. According to ?save
list - A character vector containing the names of objects to be saved.
Thus, we specify the object name as a string (paste0('variable', number)) for the list argument, and file as the one used by the OP (or make it more concise with paste0; as.character is not necessary, as integer/numeric values are automatically converted to character by paste0):
save(list = paste0('variable', number),
file = paste0("variable", number, ".RData"))
Check for the file created in the working directory
list.files(getwd(), pattern = '\\.RData$')
#[1] "variable1.RData"
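A quick round trip in a temporary directory shows the constructed name surviving save() and load() (the scratch-directory dance is only for the demo):

```r
owd <- setwd(tempdir())           # work somewhere disposable

number <- 1
assign(paste0("variable", number), 2)

# Save the object under its constructed name
save(list = paste0("variable", number),
     file = paste0("variable", number, ".RData"))

# Load into a fresh environment to prove the name round-trips
e <- new.env()
load(paste0("variable", number, ".RData"), envir = e)
e$variable1
# [1] 2

setwd(owd)
```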

R: locating files whose names contain a specific string in a directory and matching them against my list of wanted files

It's me the newbie again with another messy file and folder situation (thanks to us biologists): I have a directory containing a huge number of .txt files (~900,000+), all previously named with an inconsistent format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you see the naming has no reliable consistency. What's important for me is the ID code embedded in them, such as S978765. Now I have got a list (100 ID codes) of these ID codes that I want.
The CSV file containing the list is below; mind you, the list does have repeated ID codes in the rows due to different CLnumber values in the second column:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve below task: find all the messy named files using the code IDs by matching to my list. And copy them into a new directory.
I have thought of using list.files() to get all the .txt files and their names, but then I got stuck at the next step of matching the ID code names. I know how to do it with one string, say "S978765", but doing it one by one is almost like manually digging through the folder.
How could I feed the ID code names in column1 as a list and compare/match them with the messy file title names in the directory and then copy them into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
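For comparison, a base-R version of the same filter using grepl (same hypothetical file names as above; no stringr required):

```r
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
           "ctrl_S978765_3S_Cookie_00_none.txt",
           "S59607_3S_goody_3M_V10.txt")
ids <- c("S978765", "S978765", "S306223")

# One alternation pattern over the unique IDs, then keep the matches
pattern <- paste(unique(ids), collapse = "|")
matched <- files[grepl(pattern, files)]
matched
# [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
```

Deduplicating with unique() before pasting keeps the pattern short despite the repeated ID codes in the list.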
You could use regex to extract the ID codes from the file name.
Here, I have used the pattern "S" followed by 5 or more digits. Once we extract the ID codes, we can compare them with the ones which we have in the csv.
Assuming the csv is called df and the column name is ID_Codes we can use %in% to filter them.
We can then use file.copy to move files from one folder to another folder.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files))
%in% unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
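A self-contained sketch of the extract-and-copy step, with temporary directories standing in for the real source and destination paths, and wanted standing in for unique(df$ID_Codes):

```r
src <- file.path(tempdir(), "messy_src")
dst <- file.path(tempdir(), "matched_dst")
dir.create(src, showWarnings = FALSE)
dir.create(dst, showWarnings = FALSE)
file.create(file.path(src, c("ctrl_S978765_uns_dummy_00_none.txt",
                             "S59607_3S_goody_3M_V10.txt")))

wanted <- "S978765"               # stands in for unique(df$ID_Codes)

all_files <- list.files(path = src, full.names = TRUE)
# Extract the "S" + 5-or-more-digits code from each base name
codes <- sub(".*(S\\d{5,}).*", "\\1", basename(all_files))
selected_files <- all_files[codes %in% wanted]
file.copy(selected_files, dst)

list.files(dst)
# [1] "ctrl_S978765_uns_dummy_00_none.txt"
```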

Using R to read all files in a specific format and with specific extension

I want to read all the files in the xlsx format whose names start with the string "csmom". I have used the list.files function, but I do not know how to combine the two patterns. Please see the code. I want to read all the files starting with "csmom", and they should all be in .xlsx format.
master1<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="^csmom")
master2<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="\\.xlsx$")
@jay.sf's solution works for creating a regular expression to pull out the condition that you want.
However, generally speaking if you want to cross two lists to find the subset of elements that are contained in both (in your case the files that satisfy both conditions), you can use intersect().
intersect(master1, master2)
Will show you all the files that satisfy pattern 1 and pattern 2.
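Alternatively, both conditions fit in a single regular expression; a small demo against a temporary directory (the file names here are made up):

```r
d <- file.path(tempdir(), "xlsx_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("csmom_jan.xlsx", "csmom_feb.xlsx",
                           "other.xlsx", "csmom_notes.txt")))

# "^csmom" anchors the prefix, "\\.xlsx$" anchors the extension
list.files(path = d, pattern = "^csmom.*\\.xlsx$")
# [1] "csmom_feb.xlsx" "csmom_jan.xlsx"
```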

R can't fix Umlaut encoding after trimming white spaces

I'm working with data from many different sources, so I'm creating a name bridge and a function to make it easier to join tables. One of the sources uses an umlaut for a value and (I think) the excel csv isn't UTF-8 encoded, so I'm getting strange results.
Since I can't control how the other source compiles their data, I'd like to make a universal function that fixes all the weird encoding rules. I'll use Dennis Schröder as an example name.
One particular source uses the Umlaut, and when I read it in with read.csv and view the table in RStudio, it shows up as Dennis Schr<f6>der. However, if I index the particular table to his value (table[i,j]), the console reads Dennis Schr\xf6der
So in my name-bridge csv, I made a row to map all Dennis Schr\xf6der to Dennis Schroder. I read this name bridge in (with the condition allowEscapes = TRUE), and he shows up exactly the same in my name-bridge table. Great! I should be able to left_join this to the other source to change the name to just Dennis Schroder.
But unfortunately the names still don't map unless I don't trim strings (I have to trim strings in general because other sources introduce white spaces). Here's the general function I use to fix names. The dataframe is the other source's table, VarUse is the name column that I want to fix from dataframe, and correctionTable is my name bridge.
nameUpdate <- dataframe %>%
mutate(name = str_trim(VarUse, 'both')) %>%
left_join(correctionTable, by = c('name' = 'WrongName'))
When I dig into the results of this mapping, I get the following:
correctionTable[14,1] is my name-bridge input of "Dennis Schr\xf6der".
nameUpdate[29,3] is the original name variable from the other source which reads "Dennis Schr\xf6der".
nameUpdate[29,19] is the mutated name variable from the other source after using str_trim, which also reads "Dennis Schr\xf6der".
However, for some reason the str_trim version is not equal to the name-bridge, so it won't map:
In writing this (non-reproducible, sorry) example, I've figured out a work-around by using a combo of str_trim and by not using it, but at this point I'm just confused why the name doesn't get fixed after I use str_trim. The values look exactly the same.
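One way to sidestep the mismatch is to normalize both sides to UTF-8 before trimming and joining. A sketch of such a helper, assuming the stray bytes are latin1 (which \xf6 for ö suggests):

```r
fix_name <- function(s) {
  # Declare the raw bytes as latin1 (an assumption), convert to UTF-8, trim
  Encoding(s) <- "latin1"
  trimws(enc2utf8(s))
}

a <- "Dennis Schr\xf6der "   # name from the source, with a trailing space
b <- "Dennis Schr\xf6der"    # name in the bridge table
identical(fix_name(a), fix_name(b))
# [1] TRUE
```

Applying the same helper to both the source's name column and the bridge's WrongName column before the left_join makes the compared strings byte-for-byte identical.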

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborated version you could loop over all matches and append them to one dataframe or something like that.
Note that the pattern uses regular expressions and thus is a bit different from what you might expect (and what I wrongly assumed in my first shot at answering the question). Details can be found using ?regex
If you have a directory you want to submit, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
The submitted path in your case may be c:\\users\\user\\desktop, and then the pattern as above. full.names=T forces dir() to output a whole path and not only the file name. Try running dir(...) without the read.csv to understand what is happening there.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That process will fail if your file name contains any regular expression metacharacters. You would have to substitute them with their corresponding escape sequences upfront. But that again is another topic.
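Base R already ships a helper for exactly this case: glob2rx() turns a shell-style wildcard into a regular expression with the metacharacters escaped, so the file name never needs manual escaping:

```r
# Wildcard to regular expression; a trailing ".*$" is trimmed by default
glob2rx("abc_*")
# [1] "^abc_"

# Metacharacters in the name (here ".") are escaped automatically
glob2rx("abc_*.csv")
# [1] "^abc_.*\\.csv$"
```

read.csv(dir(pattern = glob2rx("abc_*"))[1]) would then be safe regardless of which characters the prefix contains.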
