Extracting identifiers without matching files in a folder - r

How can I extract the identifiers that do not have corresponding generated files?
Identifiers given as input for generating the files:
fileIden <- c('a-1','a-2','a-3','b-1','b-2','c-1','d-1','d-2','d-3','d-4')
Checking the files generated:
files <- list.files(".")
files
# [1] "a-2.csv" "a-3.csv" "b-1.csv" "c-1.csv" "d-3.csv"
# Generated here for reproducibility.
# files <- c("a-2.csv", "a-3.csv", "b-1.csv", "c-1.csv", "d-3.csv")
Expected files if the whole process completes successfully:
fileExp <- paste(fileIden, ".csv", sep = "")
# [1] "a-1.csv" "a-2.csv" "a-3.csv" "b-1.csv" "b-2.csv" "c-1.csv" "d-1.csv" "d-2.csv" "d-3.csv" "d-4.csv"
Are any expected files missing?
fileMiss <- fileExp[!fileExp %in% files]
# [1] "a-1.csv" "b-2.csv" "d-1.csv" "d-2.csv" "d-4.csv"
Expected output
# "a-1" "b-2" "d-1" "d-2" "d-4"
I am sure there is an easier, more direct way to get the above output without creating the intermediate vectors fileExp and fileMiss. Could you please guide me there?

You can do this:
fileIden <- c('a-1','a-2','a-3','b-1','b-2','c-1','d-1','d-2','d-3','d-4')
file <- c("a-2.csv", "a-3.csv" ,"b-1.csv", "c-1.csv", "d-3.csv")
setdiff(fileIden, trimws(gsub("\\.csv","", file)))
Another approach:
setdiff(fileIden, stringr::str_extract(file,"(.*)(?=\\.csv)"))
Logic:
setdiff returns the elements of the first vector that are not in the second, and gsub replaces ".csv" with nothing. Combining the two gives the identifiers that have no matching file.
Output:
#[1] "a-1" "b-2" "d-1" "d-2" "d-4"
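As a further sketch, not taken from the answers above, the extension can also be stripped with tools::file_path_sans_ext() from the tools package that ships with R, which avoids writing a regular expression by hand. The name missing_ids is only illustrative.

```r
fileIden <- c("a-1", "a-2", "a-3", "b-1", "b-2", "c-1", "d-1", "d-2", "d-3", "d-4")
files <- c("a-2.csv", "a-3.csv", "b-1.csv", "c-1.csv", "d-3.csv")

# Strip the extension from each file name, then keep the identifiers
# that have no corresponding file.
missing_ids <- setdiff(fileIden, tools::file_path_sans_ext(files))
missing_ids
# [1] "a-1" "b-2" "d-1" "d-2" "d-4"
```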

A less elegant approach:
result <- ifelse(fileIden %in% substr(file, 1, 3), "", fileIden)
result[result != ""]
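Note that substr(file, 1, 3) assumes every identifier is exactly three characters long. A minimal sketch that drops that assumption, using made-up identifiers of different lengths (ids, present and missing_ids are hypothetical names and data):

```r
ids <- c("a-1", "a-22", "b-1")   # hypothetical identifiers
present <- c("a-22.csv")         # hypothetical generated file

# Remove the trailing ".csv" instead of cutting at a fixed width,
# then keep the identifiers that are not among the stripped names.
missing_ids <- ids[!(ids %in% sub("\\.csv$", "", present))]
missing_ids
# [1] "a-1" "b-1"
```

With the fixed-width version, "a-22.csv" would be cut to "a-2", so "a-22" would wrongly be reported as missing.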

Related

Beginner using pipes

I am a beginner and I'm trying to find the most efficient way to change the name of the first column for many CSV files that I will be creating. Once I have created the CSV files, I am loading them into R as follows:
data <- read.csv('filename.csv')
I have used the names() function to do the name change of a single file:
names(data)[1] <- 'Y'
However, I would like to find the most efficient way of combining/piping this name change to read.csv so the same name change is applied to every file when they are opened. I tried to write a 'simple' function to do this:
addName <- function(data) {
names(data)[1] <- 'Y'
data
}
However, I do not yet fully understand the syntax for writing a function and I can't get this to work.
Note
If you were expecting your original addName function to "mutate" an existing object like so
x <- data.frame(Column_1 = c(1, 2, 3), Column_2 = c("a", "b", "c"))
# Try (unsuccessfully) to change title of "Column_1" to "Y" in x.
addName(x)
# Print x.
x
please be aware that R passes by value rather than by reference, so x itself would remain unchanged:
Column_1 Column_2
1 1 a
2 2 b
3 3 c
Any "mutation" would be achieved by overwriting x with the return value of the function
x <- addName(x)
# Print x.
x
in which case x itself would obviously be changed:
Y Column_2
1 1 a
2 2 b
3 3 c
Answer
Anyway, here's a solution that compactly incorporates pipes (%>% from the magrittr package) and a custom function. Please note that without the linebreaks and comments, which I have added for clarity, this could be condensed to only a few lines of code.
# The dplyr package helps with easy renaming, and it includes the magrittr pipe.
library(dplyr)
# ...
filenames <- c("filename1.csv", "filename2.csv", "filename3.csv")
# A function to take a CSV filename and give back a renamed dataset taken from that file.
addName <- function(filename) {
return(# Read in the named file as a data.frame.
read.csv(file = filename) %>%
# Take the resulting data.frame, and rename its first column as "Y";
# quotes are optional, unless the name contains spaces: "My Column"
# or `My Column` are needed then.
dplyr::rename(Y = 1))
}
# Get a list of all the renamed datasets, as taken by addName() from each of the filenames.
all_files <- sapply(filenames, FUN = addName,
# Keep the list structure, in which each element is a
# data.frame.
simplify = FALSE,
# Name each list element by its filename, to help keep track.
USE.NAMES = TRUE)
In fact, you could easily rename any columns you desire, all in one fell swoop:
dplyr::rename(Y = 1, 'X' = 2, "Z" = 3, "Column 4" = 4, `Column 5` = 5)
This will read a vector of filenames, change the name of the first column of each one to "Y" and store all of the files in a list.
filenames <- c("filename1.csv","filename2.csv")
addName <- function(filename) {
data <- read.csv(filename)
names(data)[1] <- 'Y'
data
}
files <- list()
for (i in seq_along(filenames)) {
files[[i]] <- addName(filenames[i])
}
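To try the idea end to end without real data, here is a small self-contained sketch in base R. It writes two throwaway CSV files to a temporary directory (the directory name, file contents and the use of setNames() are assumptions for illustration, not part of the answers above) and reads them back with the first column renamed.

```r
# Create two small CSV files in a temporary directory (illustrative data only).
tmp_dir <- tempfile("csv_demo")
dir.create(tmp_dir)
filenames <- file.path(tmp_dir, c("filename1.csv", "filename2.csv"))
for (f in filenames) {
  write.csv(data.frame(Column_1 = 1:3, Column_2 = c("a", "b", "c")),
            file = f, row.names = FALSE)
}

# Read one file and rename its first column to "Y".
addName <- function(filename) {
  data <- read.csv(filename)
  names(data)[1] <- "Y"
  data
}

# lapply() always returns a list; name each element by its file name.
all_files <- setNames(lapply(filenames, addName), basename(filenames))
names(all_files)
# [1] "filename1.csv" "filename2.csv"

# The first column of every data.frame in the list is now called "Y".
first_cols <- vapply(all_files, function(d) names(d)[1], character(1))
```

Each data.frame can then be retrieved by name, for example all_files[["filename1.csv"]].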

Remove the unmatched csv file between two folders before reading csv file

I have two folders, and the file names in each follow a certain pattern.
The "post" folder has 5 files: "aab.csv, bbc.csv, cfd.csv, f.csv, g.csv"
The "comment" folder has 4 files: "aab_comment.csv, bbc_comment.csv, cfd_comment.csv, dgh_comment.csv"
These are big data files, so I want to read only the matched files and skip the unmatched ones, that is, files whose leading word does not appear in the other folder.
For example, aab, bbc and cfd in the "post" folder have the same leading word as aab_comment, bbc_comment and cfd_comment in the "comment" folder. So I want the file list for the post folder to contain only the 3 files "aab.csv, bbc.csv, cfd.csv".
How can I make the modified_post_list (aab.csv, bbc.csv, cfd.csv)?
Below is what I have tried so far.
post_dir <- "c:/post/"
comment_dir <- "c:/comment/"
post <- list.files(post_dir)
#> 'aab.csv', 'bbc.csv', 'cfd.csv', 'f.csv', 'efg.csv', 'fgg.csv', 'gda.csv'
comment <- list.files(comment_dir)
#> 'abc_comment.csv', 'bcc_comment.csv', 'efg_comment.csv', 'fgg_comment.csv'
You can use sub to extract the front word of the file names and %in% to find the matches:
x <- sub("(.*)\\..*", "\\1", post)
y <- sub("(.*)_.*", "\\1", comment)
post[x %in% y]
#[1] "aab.csv" "bbc.csv" "cfd.csv"
comment[y %in% x]
#[1] "aab_comment.csv" "bbc_comment.csv" "cfd_comment.csv"
Data:
post <- c("aab.csv", "bbc.csv", "cfd.csv", "f.csv", "g.csv")
comment <- c("aab_comment.csv", "bbc_comment.csv", "cfd_comment.csv", "dgh_comment.csv")
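A possible variation, sketched here under the assumption that every comment file ends in "_comment.csv": compute the shared leading words once with intersect() and rebuild both file lists from them. The names post_stem, comment_stem, common and modified_comment_list are illustrative.

```r
post <- c("aab.csv", "bbc.csv", "cfd.csv", "f.csv", "g.csv")
comment <- c("aab_comment.csv", "bbc_comment.csv", "cfd_comment.csv", "dgh_comment.csv")

# Leading word of each file name in both folders.
post_stem <- tools::file_path_sans_ext(post)
comment_stem <- sub("_comment\\.csv$", "", comment)

# Leading words present in both folders.
common <- intersect(post_stem, comment_stem)

# Rebuild the matched file names for each folder.
modified_post_list <- paste0(common, ".csv")
modified_comment_list <- paste0(common, "_comment.csv")
modified_post_list
# [1] "aab.csv" "bbc.csv" "cfd.csv"
```

The full paths could then be built with file.path(post_dir, modified_post_list) before reading.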

Regex to filter, then determine latest date

Say I have a directory with four files:
someText.abcd.xyz.10Sep16.csv
someText.xyz.10Sep16.csv
someText.abcd.xyz.23Oct16.csv
someText.xyz.23Oct16.csv
This is how the names are formatted. I cannot change them, and the format will remain the same except that the dates will change. All of the names begin with someText. Next, there is either a four-letter code (abcd) or a three-letter code (xyz). If the file name has a four-letter code, it will always have a three-letter code after it. Finally, there is a date value.
I have two tasks. First, I need to filter out the files that have the "abcd" component. This will always be a four-character code that appears after the someText. in the name. Is there a way to write a regular expression to remove these values?
That leaves two files:
someText.xyz.10Sep16.csv
someText.xyz.23Oct16.csv
I need only the file with the later date. Is there a second regex I could use to extract the dates, find the latest, and then keep only that file? This is how I get the file set down to the four CSV files:
myDir <- "\\\\myDir\\folder\\"
files <- list.files(path = myDir, pattern = "\\.csv$")
Here's a vector with the file names if someone wants to try it out:
files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv", "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")
Here's my attempt at a simple base R answer
# regex subset
files <- files[!grepl("^.*?\\.[[:alpha:]]{4}\\.", files)]
# get date
dates <- unlist(lapply(strsplit(files, "\\."), "[[", 3))
files[which.max(as.Date(dates, format = "%d%b%y"))]
# [1] "someText.xyz.23Oct16.csv"
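One caveat: the %b format matches month abbreviations in the current locale, so the as.Date() call may return NA on a non-English system. The sketch below avoids this by looking the month up in R's built-in month.abb constant. It assumes that all two-digit years mean 20xx, which the question does not state; keep, date_txt and latest are illustrative names.

```r
files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv",
           "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")

# Drop the names that have a four-letter code right after "someText.".
keep <- files[!grepl("^someText\\.[[:alpha:]]{4}\\.", files)]

# Pull out the ddMmmyy part that sits just before ".csv".
date_txt <- sub("^.*\\.([0-9]{2}[A-Za-z]{3}[0-9]{2})\\.csv$", "\\1", keep)

# Build ISO dates by hand so the result does not depend on the locale.
day <- as.integer(substr(date_txt, 1, 2))
mon <- match(substr(date_txt, 3, 5), month.abb)
yr  <- 2000L + as.integer(substr(date_txt, 6, 7))  # assumption: years are 20xx
dates <- as.Date(sprintf("%04d-%02d-%02d", yr, mon, day))

latest <- keep[which.max(as.numeric(dates))]
latest
# [1] "someText.xyz.23Oct16.csv"
```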
I think this should be robust enough to work reliably. I used dplyr to pass the results through and manipulate them, and lubridate for a convenient date extraction (dmy). Almost forgot: you need to load magrittr to get the %$% pipe.
I split the file names by the "."s, then slide over the results if they are missing the four-letter code section. Bind them into a data.frame for easy filtering etc. Here, filter for those missing the four-letter section, then select the one with the latest date.
library(dplyr)     # %>%, bind_rows, mutate, filter
library(lubridate) # dmy
library(magrittr)  # %$%
strsplit(files, "\\.") %>%
setNames(files) %>%
lapply(function(x){
if(length(x) == 4){
x[3:5] <- x[2:4]
x[2] <- "noCode"
}
rbind(x) %>%
as.data.frame()
}) %>%
bind_rows(.id = "fileName") %>%
mutate(date = dmy(V4)) %>%
filter(V2 == "noCode") %$%
c(fileName[which.max(date)])
returns: "someText.xyz.23Oct16.csv"
I am sure that this can be made more compact, but here is a base R answer:
# file names
file_names =c(
"someText.abcd.xyz.10Sep16.csv",
"someText.xyz.10Sep16.csv",
"someText.abcd.xyz.23Oct16.csv",
"someText.xyz.23Oct16.csv"
)
# the pattern to be tested: names without the four-letter code, date captured
reg_file_names = regexec(
pattern = "^someText\\.[a-z]{3}\\.(.*)\\.csv$",
file_names
)
# parse out the matched dates, and look for the maximum
file_names[
which.max(
sapply(
regmatches(
x = file_names, m = reg_file_names
),
function(match) {
as.Date(
ifelse(
length(match) == 0,
NA,
match[2]
),
format = "%d%b%y"
)
}
)
)
]
The regular expression that you need is fairly straightforward, and the rest of the code is just to handle the cases where there is no match, and to format the dates so that they can be compared.

How to add an attribute to grep () Output

What I'm trying to do is the following:
use a grep() function to search for a pattern (a list of numbers, which I called "toMatch") in a data.frame ("News"). So, what I want it to do is search for those numbers in the news and return the matches (in the form "number", "corresponding news"). Unfortunately, I could so far only get a list of the corresponding news as a result. Any idea how I can add an attribute with the corresponding number from the match to the output? (in a way create a key-value pairs as an output)
Here is a short example of my code:
News <- c ("AT000000STR2 is schwierig", "AT", "ATI", "AT000000STR1")
toMatch <- c("AT000000STR1","AT000000STR2","DE000000STR1","DE000000STR2")
matches <- unique (grep(paste(toMatch,collapse="|"),News, value=TRUE))
matches
And here the result:
> matches
[1] "AT000000STR2 is schwierig" "AT000000STR1"
What I would like to have is a list or, better yet, an Excel file, looking like this:
AT000000STR2 "AT000000STR2 is schwierig"
AT000000STR1 "AT000000STR1"
Help is much appreciated.
Something like this might be of help:
#name toMatch with its names
names(toMatch) <- toMatch
#create a list with the format you request
myl <-
lapply(toMatch, function(x) {
grep(x, News, value=TRUE)
})
#or in a more compact way as #BenBolker says in the comments below
#myl <- lapply(toMatch, grep, x=News, value=TRUE)
#remove the unmatched
myl[lapply(myl,length)>0]
Output:
$AT000000STR1
[1] "AT000000STR1"
$AT000000STR2
[1] "AT000000STR2 is schwierig"
Your current approach returns the unique matches, but then you have no way of linking them to the relevant 'toMatch'.
This might be a start for you: using lapply we create a list of matches for all elements of toMatch, and then bind those together with toMatch.
matched <- lapply(toMatch, function(x){grep(x,News,value=T)})
#turn unfound matches to missings. You can remove these, but I don't like
#generating implicit missings
matched[sapply(matched,length)==0]<-NA
res <- cbind(toMatch,matched)
res
toMatch matched
[1,] "AT000000STR1" "AT000000STR1"
[2,] "AT000000STR2" "AT000000STR2 is schwierig"
[3,] "DE000000STR1" NA
[4,] "DE000000STR2" NA
writing to csv is then trivial:
write.csv(res,"yourfile.csv")
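If a plain two-column table is preferred, one possible sketch (not from the answers above) is to flatten the named list of matches with stack() into a data.frame with one row per key and matching news item. The column names key and news, the use of fixed = TRUE, and the temporary output file are illustrative choices.

```r
News <- c("AT000000STR2 is schwierig", "AT", "ATI", "AT000000STR1")
toMatch <- c("AT000000STR1", "AT000000STR2", "DE000000STR1", "DE000000STR2")

# One list element per search term, named by the term itself.
# fixed = TRUE treats each term as a literal string, not a regex.
hits <- lapply(setNames(toMatch, toMatch), grep, x = News,
               value = TRUE, fixed = TRUE)

# Drop the terms that matched nothing.
hits <- hits[lengths(hits) > 0]

# stack() turns the named list into a long table (values, ind).
long <- stack(hits)
res <- data.frame(key = as.character(long$ind),
                  news = as.character(long$values),
                  stringsAsFactors = FALSE)

# Write the key-value pairs to a CSV file that Excel can open.
out_file <- tempfile(fileext = ".csv")
write.csv(res, out_file, row.names = FALSE)
```

A term that matches several news items produces several rows, one per match.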

change the sequence of numbers in a filename using R

I am sorry, I could not find an answer to this question anywhere and would really appreciate your help.
I have .csv files for each hour of a year. The filename is written in the following way:
hh_dd_mm.csv (e.g. for February 1st 00:00--> 00_01_02.csv). In order to make it easier to sort the hours of a year I would like to change the filename to mm_dd_hh.csv
How can I write in R to change the filename from the pattern HH_DD_MM to MM_DD_HH?
a <- list.files(path = ".", pattern = "HH_DD_MM")
b<-paste(pattern="MM_DD_HH")
file.rename(a,b)
Or you could do:
a <- c("00_01_02.csv", "00_02_02.csv")
gsub("(\\d{2})\\_(\\d{2})\\_(\\d{2})(.*)", "\\3_\\2_\\1\\4", a)
#[1] "02_01_00.csv" "02_02_00.csv"
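The gsub() call only builds the new names. To apply them on disk they still have to be passed to file.rename(). A minimal sketch in a throwaway temporary directory follows (tmp_dir and the two example file names are made up for illustration):

```r
# Set up a temporary directory with two empty example files.
tmp_dir <- tempfile("rename_demo")
dir.create(tmp_dir)
invisible(file.create(file.path(tmp_dir, c("00_01_02.csv", "13_15_07.csv"))))

# Current names in hh_dd_mm.csv form.
a <- list.files(tmp_dir, pattern = "^[0-9]{2}_[0-9]{2}_[0-9]{2}\\.csv$")

# Swap the first and third number groups to get mm_dd_hh.csv.
b <- sub("^([0-9]{2})_([0-9]{2})_([0-9]{2})(\\.csv)$", "\\3_\\2_\\1\\4", a)

# Rename on disk; file.rename() returns TRUE for each success.
ok <- file.rename(file.path(tmp_dir, a), file.path(tmp_dir, b))
list.files(tmp_dir)
# [1] "02_01_00.csv" "07_15_13.csv"
```

When renaming in place, a new name can coincide with a file that has not been renamed yet (for example 02_01_00.csv and 00_01_02.csv both existing), so writing the renamed files to a separate folder may be safer.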
Not sure if this is the best solution, but it seems to work:
a <- c("00_01_02.csv", "00_02_02.csv")
b <- unname(sapply(a, function(x) {temp <- strsplit(x, "(_|[.])")[[1]] ; paste0(temp[[3]], "_", temp[[2]], "_", temp[[1]], ".", temp[[4]])}))
b
## [1] "02_01_00.csv" "02_02_00.csv"
You can use chartr to create the new file name. Here's an example:
> write.csv(c(1,1), "12_34_56")
> list.files()
# [1] "12_34_56"
> file.rename("12_34_56", chartr("1256", "5612", "12_34_56"))
# [1] TRUE
> list.files()
# [1] "56_34_12"
In chartr, you can replace the elements of a string, so long as it doesn't change the number of characters in the original string. In the above code, I basically just swapped "12" with "56", which is what it looks like you are trying to do.
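Keep in mind that chartr translates single characters wherever they occur, not two-character groups, so the swap above only works because the digits 1, 2, 5 and 6 each appear exactly once in that name. A short sketch with a made-up name that repeats digits shows the difference (unique_digits and repeated_digits are illustrative names):

```r
# Works: each of 1, 2, 5, 6 occurs once, so the groups appear swapped.
unique_digits <- chartr("1256", "5612", "12_34_56")
unique_digits
# [1] "56_34_12"

# Every 1 becomes 5, every 2 becomes 6, every 5 becomes 1, every 6 becomes 2,
# so this is not the group swap "62_25_11" one might expect.
repeated_digits <- chartr("1256", "5612", "11_25_62")
repeated_digits
# [1] "55_61_26"
```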
Or, you can write a short string swapping function
> strSwap <- function(x) paste(rev(strsplit(x, "[_]")[[1]]), collapse = "_")
> ( files <- c("84_15_45", "59_95_21", "31_51_49",
"51_88_27", "21_39_98", "35_27_14") )
# [1] "84_15_45" "59_95_21" "31_51_49" "51_88_27" "21_39_98" "35_27_14"
> sapply(files, strSwap, USE.NAMES = FALSE)
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"
You could also do it with the substr<- assignment function
> s1 <- substr(files,1,2)
> substr(files,1,2) <- substr(files,7,8)
> substr(files,7,8) <- s1
> files
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"
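The strSwap function is shown on names without an extension. Applied to "00_01_02.csv" it would return "02.csv_01_00", because ".csv" stays attached to the last group. A sketch of a variant that sets the extension aside first, using tools::file_path_sans_ext() and tools::file_ext() from the tools package bundled with R (swap_keep_ext is a hypothetical helper name):

```r
swap_keep_ext <- function(x) {
  stem <- tools::file_path_sans_ext(x)  # name without the extension
  ext  <- tools::file_ext(x)            # "" when there is no extension
  # Reverse the "_"-separated groups of each stem.
  swapped <- vapply(strsplit(stem, "_", fixed = TRUE),
                    function(p) paste(rev(p), collapse = "_"),
                    character(1))
  # Put the extension back only where there was one.
  ifelse(nzchar(ext), paste0(swapped, ".", ext), swapped)
}

swapped_names <- swap_keep_ext(c("00_01_02.csv", "84_15_45"))
swapped_names
# [1] "02_01_00.csv" "45_15_84"
```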

Resources