Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have a lot of txt files in a folder. I make a list of them and then put them all together. So far so good. Let's say I have file names like "1a", "1b", "2a", "3b", etc. I take a column from each file and build a data frame at the end.
What I cannot do is use the file names as the column names of my final data frame. For example, if I take a column from "1a", I want it to be named 1a in the final data frame.
Is there any way to do it?
Here are the names:
> head(filelist)
[1] "./1a.txt" "./1b.txt" "./2a.txt" "./2b.txt" "./3a.txt" "./3b.txt"
You probably don't want column names that begin with a number, so here is what I would suggest:
# example vector of file names
myFiles <- c("./1a.txt", "./1b.txt", "./2a.txt",
             "./2b.txt", "./3a.txt", "./3b.txt")
# in practice, get the names from disk instead; full.names = TRUE keeps
# the leading "./" that the pattern below expects
# myFiles <- list.files(path = ".", pattern = "\\.txt$", full.names = TRUE)
# strip the "./" prefix and the ".txt" suffix, then paste "file." in front:
myFiles <- paste0("file.", gsub("\\./(.*)\\.txt$", "\\1", myFiles))
# add names to your data.frame columns (df is your final data frame):
names(df) <- myFiles
The regular expression "\./(.*)\.txt$" (written with doubled backslashes inside an R string) can be broken down as follows:
"\." tells the regex engine to match a literal dot. In regex, "." by itself is the useful, yet dangerous, "match any character."
"/" and "txt" are literals: match those characters.
"$" is an anchor that forces the match to the end of the string.
"()" are capturing parentheses: they tell the engine to save that piece for later.
".*" within the parentheses matches anything between the adjacent subexpressions ("\./" and "\.txt$").
"\1" in the replacement returns the text saved by the capturing parentheses.
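To see the pattern from the breakdown in action, here is a minimal sketch run on a few of the example names. The `basename()` / `tools::file_path_sans_ext()` line is an alternative added here for comparison, not part of the original answer.

```r
# Example file names, as in the question
myFiles <- c("./1a.txt", "./1b.txt", "./2a.txt")

# The capture group keeps only the text between "./" and ".txt"
stems_regex <- gsub("\\./(.*)\\.txt$", "\\1", myFiles)

# Alternative without a regex (an assumption, not from the answer):
# drop the directory part first, then the extension
stems_base <- tools::file_path_sans_ext(basename(myFiles))

print(stems_regex)  # "1a" "1b" "2a"
print(stems_base)   # same values
```

The regex version depends on the names starting with "./", whereas the `basename()` version should work for any directory prefix.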
For more on the wonderful world of regular expressions, take a look here. Also, this site, which is linked in the SO link is where I learned much of what I use.
You will have to make sure that the order of the names matches the order of the columns, but from your description, it sounds like you have this already.
If the list that contains the files is a named list, it should be even easier:
names(df) <- paste0("file.", names(fileList))
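Putting the pieces together, the whole workflow might look roughly like the sketch below. The temporary folder, the file contents, and the names `folder` and `columns` are made up for illustration; only the naming step comes from the answer.

```r
# Hypothetical setup: write three small one-column text files to a temp folder
folder <- file.path(tempdir(), "txt_demo")
dir.create(folder, showWarnings = FALSE)
for (stem in c("1a", "1b", "2a")) {
  writeLines(as.character(1:3), file.path(folder, paste0(stem, ".txt")))
}

# Read the first column of every file into a list named after the files
fileList <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
columns <- lapply(fileList, function(f) read.table(f)[[1]])
names(columns) <- tools::file_path_sans_ext(basename(fileList))

# Names starting with a digit are not syntactically valid in R;
# make.names() shows what R would turn them into by default
print(make.names(names(columns)))  # "X1a" "X1b" "X2a"

# Build the data frame and apply the "file." prefix suggested above
df <- as.data.frame(columns)
names(df) <- paste0("file.", names(columns))
print(names(df))  # "file.1a" "file.1b" "file.2a"
```

Because the names are taken from the same `fileList` that drives the reading step, the column order and the name order cannot drift apart.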
Related
There is a vector of strings that looks like the following (text with two or more substrings in quotation marks):
vec <- 'ab"cd"efghi"j"kl"m"'
The text within the first pair of quotation marks (cd) contains a useful identifier (cd is the desired output). I have been studying how to use regular expressions but I haven't learned how to find the first and second occurrences of something like quotation marks.
Here's how I have been getting cd:
tmp <- strsplit(vec,split="")[[1]]
paste(tmp[(which(tmp=='\"')[1]+1):(which(tmp=='\"')[2]-1)],collapse="")
[1] "cd"
Is there another way to find "cd" using regular expressions? I ask in order to learn more about how to use them. I prefer base R solutions but will accept an answer using packages if that's the only way. Thanks for your help.
Match everything up to the first ", capture everything up to the next ", match the rest of the string, and replace the whole match with the captured group.
gsub( '[^"]*"([^"]*).*', '\\1', vec)
[1] "cd"
For detailed explanation of regex you can see this demo
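Another base R route, offered here as a sketch rather than as part of the answer above, is to extract the match instead of replacing around it. `regexec()` together with `regmatches()` returns the full match plus each capture group.

```r
vec <- 'ab"cd"efghi"j"kl"m"'

# regexec() locates the first match and each of its capture groups
hit <- regmatches(vec, regexec('"([^"]*)"', vec))

# hit[[1]] is c(full match, group 1); group 1 is the text between the quotes
first_quoted <- hit[[1]][2]
print(first_quoted)  # "cd"

# Every quoted substring: find all matches with gregexpr(), then strip quotes
all_quoted <- gsub('"', "", regmatches(vec, gregexpr('"[^"]*"', vec))[[1]])
print(all_quoted)  # "cd" "j" "m"
```

The extraction form makes it easy to pick the second or third quoted piece as well, by indexing into `all_quoted`.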
How can I rename a file from "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs" to "Pbs_d7_s2.fcs"?
I need to do this for multiple files, keeping in mind that the "_juliam_08July2020_02_1_0_live singlets" part is not the same for all files.
It's a bit unclear what you're asking for, but it looks like you only want to keep the chunks before the third underscore. If so, you can tackle this with regular expressions. The call
str_extract(input_string, "^(([^_])+_){3}")
extracts the first 3 blocks of non-underscore characters, each ending in an underscore. The first ^ "anchors" the match to the beginning of the string, and "([^_])+_" matches one or more non-underscore characters followed by an underscore. The {3} repeats the preceding group 3 times.
So for "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs" you'll end up with "Pbs_d7_s2_". Now you just replace the last underscore with ".fcs" like so:
str_replace(modified_string, "_$", ".fcs")
The $ "anchors" the match to the end of the string, so in this case it replaces the trailing underscore. The full sequence is
library(stringr)  # provides str_extract(), str_replace() and the %>% pipe
string1 <- "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs"
str_extract(string1, "^(([^_])+_){3}") %>%
  str_replace("_$", ".fcs")
[1] "Pbs_d7_s2.fcs"
Now let's assume your filenames are in a vector named stringvec.
output <- vector("character", length(stringvec))
for (i in seq_along(stringvec)) {
  output[[i]] <- str_extract(stringvec[[i]], "^(([^_])+_){3}") %>%
    str_replace("_$", ".fcs")
}
output
I'm making some assumptions here, namely that the naming convention is the same for all of your files. If that's not true, you'll need to modify the regex search pattern.
For replacing vectors of file names, I recommend the answers to "How do I rename files using R?". If you have a vector of original file names, you can use my for loop to generate a vector of new names, and then use the information in that link to replace one with the other. Perhaps there are other solutions not involving for loops.
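One loop-free possibility, sketched here in base R under the same assumption about the naming convention, is a single vectorised `sub()` call. The second file name is hypothetical, and the `file.rename()` line is left commented out because it assumes the files sit in the current working directory.

```r
stringvec <- c(
  "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs",
  "Pbs_d7_s3_annab_09July2020_01_2_0_live singlets.fcs"  # hypothetical second file
)

# Group 1: the first three underscore-separated chunks.
# Group 2: the extension. Everything in between is dropped.
new_names <- sub("^((?:[^_]+_){2}[^_]+)_.*(\\.[^.]+)$", "\\1\\2",
                 stringvec, perl = TRUE)
print(new_names)  # "Pbs_d7_s2.fcs" "Pbs_d7_s3.fcs"

# Rename only if no two files would end up with the same name:
# if (!anyDuplicated(new_names)) file.rename(stringvec, new_names)
```

Names that do not fit the pattern (for example, fewer than three underscores) are returned unchanged by `sub()`, so it is worth checking `new_names` before renaming anything.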
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
In the example below, I want to replace the string "lastName:JordanlastName:Jordan" with "lastName:Jordan", i.e. when the pattern repeats, I want to keep only the first occurrence. I want to do this for every record. How can I do this in R?
lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy
# Read in the example data:
x <- unname(unlist(c(read.table(text="lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy", stringsAsFactors=FALSE))))
# Delete everything after the first occurrence of the pattern:
sub('(?<=[a-z])lastName[A-Za-z:]+', '', x, perl=TRUE)
[1] "lastName:Portnoy" "lastName:Jordan" "lastName:Jordan"
[4] "lastName:Clifford" "lastName:Walker" "lastName:Portnoy"
In each string, this deletes the first "lastName" that is preceded by a lowercase letter, together with the run of letters and colons that follows it. Because that run extends to the end of the string, all repeats are removed at once.
Details
sub() has three mandatory arguments: pattern, replacement, and x. I've also used the optional perl=TRUE argument because the pattern I used is a Perl-style regular expression. I've told sub() to look in the character vector x for the pattern '(?<=[a-z])lastName[A-Za-z:]+' and replace it with '', or nothing (equivalent to deleting those characters). The (?<=[a-z]) part of the pattern is called a "look-behind assertion." It means the pattern matches 'lastName[A-Za-z:]+' if and only if a lowercase letter immediately precedes it, which is why the "lastName" at the very start of each string is left alone. 'lastName[A-Za-z:]+' looks for the exact characters "lastName" followed immediately by one or more characters from the set of uppercase letters, lowercase letters, and the colon. It keeps matching until it finds a character that is not in that set or reaches the end of the string.
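If the repeated text is not always "lastName", a backreference can collapse any string that consists of one chunk repeated end to end. This is an alternative sketch, not the approach described above.

```r
x <- c("lastName:Portnoy",
       "lastName:JordanlastName:JordanlastName:Jordan",
       "lastName:WalkerlastName:Walker")

# (.+?) lazily takes the shortest possible prefix; \\1+ then requires the
# rest of the string to be one or more exact copies of that prefix
deduped <- sub("^(.+?)\\1+$", "\\1", x, perl = TRUE)
print(deduped)  # "lastName:Portnoy" "lastName:Jordan" "lastName:Walker"
```

Strings without a repeat do not match the pattern at all, so `sub()` returns them unchanged.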
I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column with 32,000 cases, and what needs replacing/removing varies and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), meaning it's much faster to code what needs to be removed/replaced, it's not as simple or clean as a regular expression to remove all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
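Legend:
^ Start of the string.
() Capturing group.
\w* Zero or more word characters (letters, digits, underscore).
.* Zero or more occurrences of any character.
$ End of the string.
\1 Returns the text captured by the group.
Note that this keeps only the first word, so a multi-word name such as "East Baton Rouge Parish" would become "East".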
Maybe I'm missing something, but I don't see why you can't simply use alternation (|) in your regular expression and then trim the leftover whitespace.
string <- c("Acadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Acadia" "Fifth"
dataframename$varname <- gsub(" Parish","", dataframename$varname)
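As a closer analogue of the SPSS REPLACE syntax, the sketch below applies literal replacements to a single column. The data frame `counties` and the vector `remove_phrases` are made-up names for illustration; `fixed = TRUE` makes `gsub()` treat each phrase as plain text rather than as a regular expression.

```r
# Hypothetical data: one column to clean, other columns would be untouched
counties <- data.frame(
  county = c("Acadia Parish", "East Baton Rouge Parish", "Fifth District", "Orleans"),
  stringsAsFactors = FALSE
)

# Phrases to delete, including the leading space
remove_phrases <- c(" Parish", " District")

# One literal replacement per phrase; values without a match stay as they are
for (phrase in remove_phrases) {
  counties$county <- gsub(phrase, "", counties$county, fixed = TRUE)
}
print(counties$county)  # "Acadia" "East Baton Rouge" "Fifth" "Orleans"
```

Unlike a pattern that keeps only the first word, this leaves multi-word names such as "East Baton Rouge" intact.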
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have a folder containing some 2000 CSVs whose file names include the characters '[ ]', e.g. [Residential]20151001_0000_1.csv
I want to:
1. Remove '[]' from the names so that the file name becomes Residential_20151001_0000_1.csv, and place the new files in a new folder.
2. Then read all the files from that new folder into one data frame (without header), skipping the first row of each file.
3. Also extract 20151001 as a date (e.g. 2015-10-01) into a new table such that it looks like:
File Name                         Date
Residential_20151001_0000_1.csv   2015-10-01
This code will answer your first question, albeit with a small change in logic.
Firstly, let's create a backup of all the CSVs containing [] by copying them to another folder. For example, if your CSVs are in the directory "/Users/xxxx/Desktop/Sub", we will copy them to the folder Backup.
Therefore,
library(tools)  # for file_path_sans_ext()
setwd("/Users/xxxx/Desktop/Sub")
dir.create("Backup")
# pattern is a regular expression, not a glob
files <- data.frame(file = list.files(path = ".", pattern = "\\.csv$"),
                    stringsAsFactors = FALSE)
# file.copy() is vectorised, so no loop is needed
file.copy(from = file.path("/Users/xxxx/Desktop/Sub", files$file),
          to = "/Users/xxxx/Desktop/Sub/Backup")
This has now copied all the csv files to folder Backup.
Now let's rename the files in your original working directory by removing the "[]".
I have taken a slightly longer route by creating a data frame with the old names and new names to make things easier for you.
Name <- file_path_sans_ext(files$file)
files <- cbind(files, Name)
files$Name <- gsub("\\[", "", files$Name)
files$Name <- gsub("\\]", "_", files$Name)
files$Name <- paste(files$Name, ".csv", sep = "")
This dataframe looks like:
files
file Name
1 [Residential]20150928_0000_4.csv Residential_20150928_0000_4.csv
2 [Residential]20151001_0000_1.csv Residential_20151001_0000_1.csv
3 [Residential]20151101_0000_3.csv Residential_20151101_0000_3.csv
4 [Residential]20151121_0000_2.csv Residential_20151121_0000_2.csv
5 [Residential]20151231_0000_5.csv Residential_20151231_0000_5.csv
Now let's rename the files to remove the "[]". The idea here is to replace file with Name. file.rename() is vectorised, so a single call renames every file:
file.rename(from = file.path("/Users/xxxx/Desktop/Sub", files$file),
            to = file.path("/Users/xxxx/Desktop/Sub", files$Name))
You've renamed your files now. If you run list.files(path = ".", pattern = "\\.csv$"), you will get the new file names:
"Residential_20150928_0000_4.csv"
"Residential_20151001_0000_1.csv"
"Residential_20151101_0000_3.csv"
"Residential_20151121_0000_2.csv"
"Residential_20151231_0000_5.csv"
Try it!
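The question also asked for the renamed files to land in a new folder. A minimal sketch of doing the copy and the rename in one step follows; it uses temporary folders and two made-up files, so the paths and the names `src`, `dst`, `old` and `new` are assumptions for illustration only.

```r
# Hypothetical setup: two bracketed files in a temporary source folder
src <- file.path(tempdir(), "csv_src")
dst <- file.path(tempdir(), "csv_renamed")
dir.create(src, showWarnings = FALSE)
dir.create(dst, showWarnings = FALSE)
old <- c("[Residential]20151001_0000_1.csv", "[Residential]20151101_0000_3.csv")
for (f in old) writeLines(c("id,value", "1,10"), file.path(src, f))

# Drop "[" and turn "]" into "_"; fixed = TRUE avoids any regex escaping
new <- sub("]", "_", sub("[", "", old, fixed = TRUE), fixed = TRUE)

# file.copy() pairs the i-th source with the i-th destination,
# so the originals stay untouched and the copies get the clean names
copied <- file.copy(file.path(src, old), file.path(dst, new))
print(list.files(dst))
```

Since the originals are never modified, this variant does not need a separate backup folder.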
In order:
1. After googling "r replace part of string" I found: "R - how to replace parts of variable strings within data frame". This should get you up and running for this issue.
2. For skipping the first line, read the documentation of read.csv. There you will find the skip argument.
3. Have a look at the strftime/strptime functions. Alternatively, have a look at lubridate.
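The reading and date steps could be sketched as follows. The temporary folder and file contents are made up for illustration, and `as.Date()` with a format string is used here in place of `strptime()`.

```r
# Hypothetical setup: two already-renamed files, each with a first row to skip
folder <- file.path(tempdir(), "csv_read")
dir.create(folder, showWarnings = FALSE)
fnames <- c("Residential_20151001_0000_1.csv", "Residential_20151101_0000_3.csv")
for (f in fnames) writeLines(c("id,value", "1,10", "2,20"), file.path(folder, f))

paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# skip = 1 drops the first row of each file; header = FALSE names columns V1, V2
combined <- do.call(rbind, lapply(paths, read.csv, header = FALSE, skip = 1))

# Take the 8 digits after the first underscore and parse them as a date
date_text <- sub("^[^_]*_([0-9]{8})_.*$", "\\1", basename(paths))
file_dates <- data.frame(file_name = basename(paths),
                         date = as.Date(date_text, format = "%Y%m%d"))
print(file_dates)
```

For roughly 2000 files, `do.call(rbind, ...)` should be workable, though it assumes every file has the same number of columns.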