I'm hoping someone can give me some advice on importing and parsing .eml files in R. I have a folder with around 1000 .eml files containing text which includes entries like the one below:
Return-Path: < fake.name#stuff.com>
What I would like to do is import all of these files into a data.frame or data.table in R, and parse the email addresses out into a separate field.
I think I've seen something like this done before with text files and using grep.
Any tips would be very much appreciated.
I started with an mbox file that I downloaded from Gmail, broke it down into a bunch of individual messages in EML format, and then pulled the lines I needed from each file and assembled them into a data frame.
library(tm.plugin.mail)
# Split the mbox file into individual .eml files under "emlfile2"
mbf <- "mboxfile"
convert_mbox_eml(mbf, "emlfile2")
maildir <- "emlfile2"
mailfiles <- dir(maildir, full.names = TRUE)
readmsg <- function(fname) {
  l <- readLines(fname)
  # Pull out the Subject: and Date: header lines and strip the labels
  subj <- grep("Subject: ", l, value = TRUE)
  subj <- gsub("Subject: ", "", subj)
  date <- grep("Date: ", l, value = TRUE)
  date <- gsub("Date: ", "", date)
  # Keep the first two of the last three lines as the message text
  text1 <- tail(l, 3)[1]
  text2 <- tail(l, 3)[2]
  return(c(subj, date, text1, text2))
}
mdf <- do.call(rbind, lapply(mailfiles, readmsg))
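Since the question also asked for the email addresses, here is a minimal sketch that pulls them out of the header lines, assuming every message carries a Return-Path: line like the sample above (extract_address is just an illustrative name):
# Extract "fake.name#stuff.com" from a line like "Return-Path: < fake.name#stuff.com>"
extract_address <- function(lines) {
  rp <- grep("^Return-Path:", lines, value = TRUE)
  sub(".*<\\s*([^>[:space:]]+)\\s*>.*", "\\1", rp)
}
You could call extract_address(l) inside readmsg and add the result to the returned vector.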
I have a list of approximately 500 csv files, each with a filename that consists of a six-digit number followed by a year (e.g. 123456_2015.csv). I would like to append together all files that have the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R, but I want the appended data to be saved as new csv files in the same directory as the original files. I have also tried to implement the code below, but the csv files it produces contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
  curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
  NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID, "+?)\\_.+?$", sep = "")
  NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
  write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),
            file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED",
                         "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""),
            row.names = FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)
for (curr_NAPS_ID in Unique_NAPS_ID) {
  NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
  output_file <- paste0(
    "C:/Users/smithma/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
  )
  for (fname in NAPS_filenames) {
    line_text <- readLines(fname)
    # Write the header from the first file
    if (fname == NAPS_filenames[1]) {
      cat(line_text[1], '\n', sep = '', file = output_file)
    }
    # Append every line in the file except the header
    line_text <- line_text[-1]
    cat(line_text, file = output_file, sep = '\n', append = TRUE)
  }
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions (added in base R 3.3.0), and they're great; there's a quick illustration after this list.
You only care about copying lines, so just use readLines to get them in and cat to get them out.
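For reference, a minimal illustration of startsWith with filenames like the ones in the question:
startsWith(c("123456_2015.csv", "654321_2014.csv"), "123456")
# [1]  TRUE FALSE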
You might consider something like this:
path <- "C:/Users/smithma/Desktop/PM25_test/"
filenames <- list.files(path)
## take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1, 6)
unique.numbers <- unique(six.digit.filenames)
for(j in unique.numbers){
  sub <- filenames[which(substr(filenames, 1, 6) == j)]
  data.for.output <- c()
  for(file in sub){
    ## read each matching file and append it to the output
    data <- read.csv(paste0(path, file))
    data.for.output <- rbind(data.for.output, data)
  }
  write.csv(data.for.output, paste0(path, j, '.csv'), row.names = FALSE)
}
I would like to work on several csv files to make some comparisons, so I wrote this code to read the different csv files I have:
path <- "C:\\data\\"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
My csv files are something like this:
Start Time,End Time,Total,Diffuse,Direct,Reflected
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:02:00,04/09/14 00:02:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:03:00,04/09/14 00:03:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
(...)
Using my code, R reads all the files correctly, but for each of them it creates a table whose columns are shifted by an extra empty field at the beginning:
|Start Time |End Time |Total |Diffuse |Direct |Reflected
04/09/14 00:01:00|04/09/14 00:01:00|2.221220E-003|5.797364E-004|0.000000E+000|1.641484E-003|NA
...
How can I fix it?
Moreover, considering that the original name of each file is really long, is it possible to name each data.frame using the last letters of the file? Or just a cardinal number?
I would suggest using the data.table package: it's faster, and in my experience it converts blank columns at the end to NA. Here's some code I wrote for a similar task:
library(data.table)
read_func <- function(z) {
  dat <- fread(z, stringsAsFactors = FALSE)
  names(dat) <- c("start_time", "end_time", "Total", "Diffuse", "Direct", "Reflect")
  dat$start_time <- as.POSIXct(strptime(dat$start_time,
                               format = "%d/%m/%y %H:%M:%S"), tz = "Pacific/Easter")
  # Tag each row with an id built from the three digits before ".csv"
  patrn <- "([0-9][0-9][0-9])\\.csv"
  dat$type <- paste("Dataset", gsub(".csv", "", regmatches(z, regexpr(patrn, z))), sep = "")
  return(as.data.table(dat))
}
path <- "./Data/"
file_list <- dir(path, pattern = "csv")
file_names <- unname(sapply(file_list, function(x) paste(path, x, sep = "")))
data_list <- lapply(file_names, read_func)
dat <- rbindlist(data_list, use.names = TRUE)
rm(path, file_list, file_names)
This will give you a list in which each item is the data.table from the corresponding file name. I assumed that all file names have a three-digit number before the extension, which I used to assign a type variable to each data.table. You can change patrn to match your specific use case. This way, when you combine all of them into a single data.table dat, you can always sort/filter based on type. For example, if you wanted to plot Diffuse vs Direct for Dataset158 and Dataset222, you could do the following:
library(ggplot2)
ggplot(data = dat[type == 'Dataset158' | type == 'Dataset222'],
       aes(x = Diffuse, y = Direct)) + geom_point()
Hope this helps!
You're having a problem because your csv files have a blank column at the end... which makes your data end in a comma:
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
This leads R to think your data consists of 7 columns rather than 6. The correct solution is to resave all your csv files correctly. Otherwise, R will see 7 columns but only 6 column names, and will logically think that the first column holds the row names. Here you can apply the patch that @konradrudolph and I came up with:
library(tibble)
library(dplyr)
df %>% rownames_to_column() %>% setNames(c(colnames(.)[-1], 'DROP')) %>% select(-DROP)
where df is the data from the csv. But patches like this can lead to unexpected results, so it's better to save the csv files correctly.
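If resaving every file isn't practical, another workaround is to skip the too-short header and drop the empty trailing column when reading. A minimal sketch, assuming the six column names shown above (read_fixed is just an illustrative name):
read_fixed <- function(file) {
  # The header row has one field fewer than the data rows, so skip it
  dat <- read.csv(file, header = FALSE, skip = 1)
  dat <- dat[, 1:6]  # drop the empty 7th column caused by the trailing comma
  names(dat) <- c("Start.Time", "End.Time", "Total", "Diffuse", "Direct", "Reflected")
  dat
}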
I am trying to concatenate text files from a URL, but I don't know how to do this given the HTML and the different folders.
This is the code I tried, but it only lists the text file names mixed in with a lot of HTML code. How do I fix this so that I can combine the text files into one csv file?
library(RCurl)
library(readr)
url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir, "\n"))  # split into filenames
# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste(url, filenames[i], sep = '')  # concatenate for the url
  if (i == 1) {
    cp <- read_delim(file, delim = ',', col_names = FALSE)
  } else {
    temp <- read_delim(file, delim = ',', col_names = FALSE)
    cp <- rbind(cp, temp)  # append to the existing data
    rm(temp)  # remove the temporary copy
  }
}
Here is a code snippet that I got to work for me. I like to use rvest over RCurl, just because that's what I've learned. In this case, I was able to use the html_nodes function to isolate each link ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)
url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))
# define column types of fwf data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n",11), collapse = ""))
data <- data.frame()
for (i in 1:2) {  # first two files as a demo; use seq_along(text) for all of them
  file <- paste0(url, text[i])
  date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
  # Read file to determine widths
  columns <- fwf_empty(file, skip = 3)
  # Manually expand `solar` column to be 3 spaces wider
  columns$begin[8] <- columns$begin[8] - 3
  data <- rbind(data, cbind(date, read_fwf(file, columns,
                                           skip = 3, col_types = ctypes)))
}
I downloaded data from the internet and wanted to extract it into a data frame. You can find the data in the following filtered data set link: http://www.esrl.noaa.gov/gmd/dv/data/index.php?category=Ozone&type=Balloon . At the bottom of the site page, from the 9 filtered data sets, you can choose any station, say Suva, Fiji (SUV).
I have written the following code to create a data frame that has Launch date as part of the data frame for each file.
setwd("C:/Users/")
path = "~C:/Users/"
files <- lapply(list.files(pattern = '\\.l100'), readLines)
test.sample<-do.call(rbind, lapply(files, function(lines){
data.frame(datetime = as.POSIXct(sub('^.*Launch Date : ', '', lines[grep('Launch Date :', lines)])),
# and the data, read in as text
read.table(text = lines[(grep('Sonde Total', lines) + 1):length(lines)]))
}))
The files are from an FTP server. The pattern of the files doesn't look familiar to me; even though I tried it with .txt, it didn't work. Can you please tweak the above code, or suggest any other code, to get a data frame?
Thank you in advance.
I think the problem is that the search string "Launch Date :" does not match what is in the files (at least the one I checked).
This should work:
lines <- "Launch Date : 11 June 1991"
lubridate::dmy(sub('^.*Launch Date.*: ', '', lines[grep('Launch Date', lines)]))
The code would also probably be easier to debug if you broke the problem down into steps rather than writing it all as one statement.
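For instance, the same extraction split into named steps, reusing lines from above:
launch_line <- lines[grep("Launch Date", lines)]         # find the header line
date_text <- sub("^.*Launch Date.*: ", "", launch_line)  # strip the label
lubridate::dmy(date_text)                                # parse "11 June 1991"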
I took the following approach:
td <- tempdir()
setwd(td)
ftp <- 'ftp://ftp.cmdl.noaa.gov/ozwv/Ozonesonde/Suva,%20Fiji/100%20Meter%20Average%20Files/'
files <- RCurl::getURL(ftp, dirlistonly = T)
files <- strsplit(files, "\n")
files <- unlist(files)
dat <- list()
for (i in 1:length(files)) {
  download.file(paste0(ftp, files[i]), 'data.txt')
  # The data block starts after 17 header lines
  df <- read.delim('data.txt', sep = "", skip = 17)
  # Row 9 of the file, read as a one-column table, holds the launch date line
  ld <- as.character(read.delim('data.txt')[9, ])
  ld <- strsplit(ld, ":")[[1]][2]
  df$launch.date <- stringr::str_trim(ld)
  dat[[i]] <- df ; rm(df)
}
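The loop leaves one data frame per file in dat. Assuming all files share the same columns, you can combine them into a single data frame (result is just an arbitrary name):
result <- do.call(rbind, dat)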
I have 330 files that I would like to rename using R. I saved the original names and the new names in a .csv file. I used a script which does not give an error, but it does not change the names.
Here is a sample of the new names (df1):
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS_EVI_20010101.tif
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS_EVI_20010117.tif
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS_EVI_20010201.tif
And a sample of the original names (df2):
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS.2001001.yL1600.EVI.tif
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS.2001033.yL1600.EVI.tif
D:\Modis_EVI\Original\EVI_Smoothed\ MODIS.2001049.yL1600.EVI.tif
Then here is the script I'm using:
csv_dir <- "D:\\"
df1 <- read.csv(paste(csv_dir,"New_names.csv",sep=""), header=TRUE, sep=",") # read csv
hdfs <- df1$x
hdfs <- as.vector(hdfs)
df2 <- read.csv(paste(csv_dir,"smoothed.csv",sep=""), header=TRUE, sep=",") # read csv
tifs <- df2$x
tifs <- as.vector(tifs)
for (i in 1:length(hdfs)){
  setwd("D:\\Modis_EVI\\Original\\EVI_Smoothed\\")
  file.rename(from = tifs[i], to = hdfs[i])
}
Any advice please?
I think you mixed up the old and the new files: you are trying to rename from the new file names, which do not exist yet, to the old file names. This might work:
file.rename(from = hdfs[i], to = tifs[i])
A general approach would go like this:
setwd("D:\\Modis_EVI\\Original\\EVI_Smoothed\\")
fin <- list.files(pattern='tif$')
fout <- gsub("_EVI_", ".", fin)
fout <- gsub(".tif", "yL1600.EVI.tif", fout)
for (i in 1:length(fin)) {
  file.rename(from = fin[i], to = fout[i])
}
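Since file.rename is vectorized, the loop can also be collapsed into a single call:
file.rename(from = fin, to = fout)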
To fix your script (do you really need .csv files?):
setwd("D:\\Modis_EVI\\Original\\EVI_Smoothed\\")
froms <- read.csv("d:/New_names.csv", stringsAsFactors=FALSE)
froms <- as.vector(froms$x)
First check if they exist:
all(file.exists(froms))
Perhaps you need to trim the names (remove the whitespace); that is what the examples you give suggest:
library(raster)
froms <- trim(froms)
all(file.exists(froms))
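Base R's trimws, available since R 3.2.0, also strips leading and trailing whitespace, if you prefer to avoid the raster dependency:
froms <- trimws(froms)
all(file.exists(froms))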
If they all exist:
tos <- read.csv("d:/smoothed.csv", stringsAsFactors=FALSE)
tos <- as.vector(tos$x)
# tos <- trim(tos)
for (i in 1:length(froms)) {
file.rename(froms[i], tos[i])
}