I'm new to R and trying to generate a lot of graphs from one file, with headers between different data sets.
I have a tab-delimited plaintext file, formatted like this:
Header: Boston city data
Month Data1 Data2 Data3
1 1.5 9.1342 8.1231
2 12.3 12.31 1.129
3 (etc...)
Header: Chicago city data
Month Data1 Data2 Data3
1 1.5 9.1342 8.1231
2 12.3 12.31 1.129
...
I would like to create a graph of Month vs. Data1, Month vs. Data2, and Month vs. Data3 for each city.
I know that in Python I could iterate through each line, do something different if the line starts with 'Header', and then somehow process the numbers. In R, I would like to simply do this:
for (data block starting with header) in inf:
data = read.delim()
barplot(data, main=header, ylab="Data1", xlab="Month")
# repeat for Data2, Data3
but I'm not sure how to actually iterate through the file, or if I should just split up my file by city into lots of small files, then run through a list of small files to read.
You could use a combination of gsub, grep and strsplit:
## get city name
nameSet <- function(x) {
return(gsub(pattern="Header: (.*) city data", replacement="\\1", x=x))
}
## extract monthly numbers
singleSet <- function(x) {
l <- lapply(x, function(y) {
## split single line by spaces
s <- strsplit(y, "[[:space:]]+")
## turn characters into doubles
return(as.double(s[[1]]))
})
## turn list into a matrix
m <- do.call(rbind, l)
return(m)
}
## read file
con <- file("data.txt", "r")
lines <- readLines(con)
close(con)
## determine header lines and calculate begin/end lines for each dataset
headerLines <- grep(pattern="^Header", x=lines)
beginLines <- headerLines+2
endLines <- c(headerLines[-1]-1, length(lines))
## layout plotting region
par(mfrow=c(length(beginLines), 3))
## loop through all datasets
for (i in seq_along(headerLines)) {
city <- nameSet(lines[headerLines[i]])
data <- singleSet(lines[beginLines[i]:endLines[i]])
for (j in 2:ncol(data)) {
barplot(data[,j], main=city, xlab="Month", ylab=paste0("Data", j-1))
}
}
par(mfrow=c(1, 1))
Here is a slightly modified version of the function referred to in my comment.
read.funkyfile = function(funkyfile, expression, ...) {
temp = readLines(funkyfile)
temp.loc = grep(expression, temp)
temp.loc = c(temp.loc, length(temp)+1)
temp.nam = gsub("[[:punct:]][[:space:]]", "",
grep(expression, temp, value=TRUE))
temp.nam = gsub(expression, "", temp.nam)
temp.out = vector("list")
for (i in 1:length(temp.nam)) {
temp.out[[i]] = read.table(textConnection(
temp[seq(from = temp.loc[i]+1,
to = temp.loc[i+1]-1)]),
...)
names(temp.out)[i] = temp.nam[i]
}
temp.out
}
Assuming your file is named "File.txt", load the function and read in the data like this. You can add any of the arguments to read.table that you need to:
temp = read.funkyfile("File.txt", "Header", header=TRUE, sep="\t")
Now, plot:
# to plot everything on one page (used for this example), uncomment the next line
# par(mfcol = c(length(temp), 1))
lapply(names(temp), function(x) barplot(as.matrix(temp[[x]][-1]),
beside=TRUE, main=x,
legend=TRUE))
# dev.off() or par(mfcol = c(1, 1)) if par was modified
With par(mfcol = c(length(temp), 1)), the plots for your small sample data all appear on a single page.
I have 280 *.csv files in a directory. Each file has 3 columns and 1000 rows. I want to estimate Pearson's correlation between column 2 and 3 of each file and put the correlation value in the first cell of column 4, and also all 280 correlation values in a separate file. How can I do this in R?
I have tried several attempts, including the code below, which I know is incorrect but do not know how to fix. Please help.
files <- list.files(path="mydirectory", pattern="*.csv", full.names=TRUE,
recursive=FALSE)
function(files)
lapply(files,function(x){
x <- read.csv(files, header = TRUE)
out <- function(cor(files[,2:3])
write.csv(out, sep = "\t", quote = FALSE, row.names = FALSE)
})
As for the first part, that's easy. You can calculate the correlations in an lapply loop and write them to a new file:
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
csv_data[, 4] <- cor(csv_data[, 2], csv_data[, 3])
# Create a new filename by replacing the ending of the
# input file (.csv) with (_cor.csv)
newfile <- gsub("\\.csv$", "_cor.csv", f)
write.csv(csv_data, file = newfile, quote = FALSE)
})
Since R requires all columns of a data.frame to have the same number of rows, this will fill every row of the 4th column with the (recycled) correlation value. I would roll with this, but if you have a lot of data it can waste storage. Here's a not very elegant solution that keeps the correlation only in the first row:
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
csv_data[, 4] <- cor(csv_data[, 2], csv_data[, 3])
# Now delete duplicate values of cor
csv_data[2:nrow(csv_data), 4] <- NA
# Create a new filename by replacing the ending of the
# input file (.csv) with (_cor.csv)
newfile <- gsub("\\.csv$", "_cor.csv", f)
# Now when we write, we tell R to write an empty string when it encounters
# missing values
write.csv(csv_data, file = newfile, quote = FALSE, na = "")
})
Also:
You do not need to call function() when you use functions that already exist (like lapply() or cor()). You only need to use that when you want to define a new function yourself.
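To make the distinction concrete (cor23 below is just an illustrative helper name):
## calling existing functions directly -- no function() needed
csv_list <- lapply(files, read.csv)
## function() is only for defining something new yourself:
cor23 <- function(x) cor(x[, 2], x[, 3])
sapply(csv_list, cor23)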
If you want to have the output in a single data.frame try:
my_df <- do.call(rbind,
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
data.frame(File=f, Correlation=cor(csv_data[, 2], csv_data[, 3]))
})
)
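And since you also want all 280 correlation values in one separate file, you can then write my_df out once (the filename is just an example):
write.csv(my_df, file = "all_correlations.csv", row.names = FALSE)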
I have a large dataframe my_df in R containing 1983000 records. The following lines of sample code take the chunk of 1000 rows starting from 25001, do some processing, and write the processed data into a file to the local disk.
my_df1 <- my_df[25001:26000,]
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
str_data <- my_df1
setwd("path_to_local_dir/data25001_26000")
write.table(str_data, file = "data25001-26000.csv",row.names = F,col.names = F,quote = F)
and so on like this:
my_df2 <- my_df[26001:27000,]
...
I would like to automate this task so that chunks of 1000 records are processed and written to a new directory each time. Any advice on how this could be done?
Consider generalizing your process in a function, data_to_disk, and calling it with an iterator method like lapply, passing a sequence of starting indices built with seq() for each subsequent thousand. Also, incorporate dynamic directory creation (though maybe dump all 1,000+ files into one directory instead of 1,000+ dirs?).
data_to_disk <- function(num) {
str_data <- within(my_df[num:(num + 999), ], {
end <- as.POSIXct(paste0(end, "23:59"))
year <- lubridate::year(start)
})
my_dir <- paste0("path_to_local_dir/data", num, "_", num + 999)
if(!dir.exists(my_dir)) dir.create(my_dir)
write.table(str_data, file = paste0(my_dir, "/", "data", num, "-", num + 999, ".csv"),
row.names = FALSE, col.names = FALSE, quote = FALSE)
return(str_data)
}
seqs <- seq(25001, nrow(my_df), by=1000)
head(seqs)
# [1] 25001 26001 27001 28001 29001 30001
tail(seqs)
# [1] 1977001 1978001 1979001 1980001 1981001 1982001
# LIST OF 1,958 DATA FRAMES
df_list <- lapply(seqs, data_to_disk)
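If you take the single-directory suggestion, a minimal variant creates one folder up front and drops the per-chunk dir.create() (the folder name below is just an example):
out_dir <- "path_to_local_dir/all_chunks"   # example name
if (!dir.exists(out_dir)) dir.create(out_dir)
data_to_disk_flat <- function(num) {
str_data <- within(my_df[num:(num + 999), ], {
end <- as.POSIXct(paste0(end, "23:59"))
year <- lubridate::year(start)
})
write.table(str_data, file = file.path(out_dir, paste0("data", num, "-", num + 999, ".csv")),
row.names = FALSE, col.names = FALSE, quote = FALSE)
str_data
}
df_list <- lapply(seqs, data_to_disk_flat)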
Here is my code doing the sliced loop:
step1 = 1000
runto = nrow(my_df)
nsteps = ceiling(runto/step1)
for( part in seq_len(nsteps) ) { # part = 1
cat( part, 'of', nsteps, '\n')
fr = (part-1)*step1 + 1
to = min(part*step1, runto)
my_df1 = my_df[fr:to,]
# ... process my_df1 into str_data, as in the question
write.table(str_data, file = paste0("data",fr,"-",to,".csv"))
}
rm(part, step1, runto, nsteps, fr, to)
You can add a grouping variable to your data first (e.g., to identify every 1000 rows), then use plyr's d_ply() to split the data and write each group to file.
library(plyr)   # load plyr before dplyr to avoid function masking
library(dplyr)
library(readr)
df <- data.frame(var=runif(1000000))
df$fold <- cut(seq(1,nrow(df)),breaks=1000,labels=FALSE) # 1000 folds of 1000 rows each
df %>% filter(fold<=2) %>% # only writes first two files
d_ply(.,.(fold), function(i){
# make filenames 'data1.csv', 'data2.csv'
write_csv(i,paste0('data',unique(i$fold),'.csv'))
})
This is similar to @Parfait's answer but moves most of the work out of the function. Specifically, it makes one copy of the entire dataset and performs the time manipulations up front, once.
my_df1 <- my_df
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
lapply(seq(25001, nrow(my_df1), by = 1000),
function(i) write.table(my_df1[i:(i+1000-1),]
, file = paste0('path_to_local_dir/data'
, i, '-', i+1000-1, '.csv')
,row.names = F,col.names = F,quote = F)
)
For me, I'd probably just do:
write.table(my_df1, file = ...)
and be done with it. I don't see the advantage of splitting it up; roughly 2 million rows really isn't that many.
I have created a series of commands in R that get a job done using a specific URL. I would like to iterate the series of commands over a list of URLS that reside in a separate text file. How do I call the list into the commands one at a time?
I do not know the proper terminology for this programming action. I've looked into scripting and batch programming, but that is not what I want to do.
# URL that comes from list
URL <- "http://www.urlfromlist.com"
# Load URL
theurl <- getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- readHTMLTable(theurl)
# Create a list
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind.data.frame, tables)
# Save dataframe out as a csv file
write.csv(df, file = dynamicname, row.names=FALSE)
The above code is what I am doing. The URL variable needs to take a different value from the list each time, rinse and repeat. Thanks!
UPDATED CODE: this runs, but is still not writing out any files.
# Function to pull tables from list of URLs
URLfunction<- function(x){
# URL that comes from list
URL <- x
# Load URL
theurl <- RCurl::getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- XML::readHTMLTable(theurl)
# Create a list
tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind,tables)
# Split date and time column out
df2 <- separate(df, "Date / Time", c("Date", "Time"), sep = " ")
# Fill the missing column with text, in this case shapename
shapename <- qdapRegex::ex_between(URL, "ndxs", ".html")
df2$Shape <- shapename
# Save dataframe out as a csv file
write.csv(result, paste0(shapename, '.csv', row.names=FALSE))
return(df2)
}
URL <- read.csv("PATH", header = FALSE)
purrr::map_df(URL, URLfunction) ## Also tried purrr::map_df(URL[,1], URLfunction)
If I understand your question correctly, the following should work for your problem.
Libraries used:
library(RCurl)
library(XML)
library(rlist)
library(purrr)
Define the function:
URLfunction<- function(x){
# URL that comes from list
URL <- x
# Load URL
theurl <- RCurl::getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- XML::readHTMLTable(theurl)
# Create a list
tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind,tables)
# Return the data frame; write.csv comes later
return(df)
}
Assume you have data like the following (I am not sure what your data looks like):
URL <- c("https://stackoverflow.com/questions/56139810/how-to-call-a-script-in-another-script-in-r",
"https://stackoverflow.com/questions/56122052/labelling-points-on-a-highcharter-scatter-chart/56123057?noredirect=1#comment98909916_56123057")
result<- purrr::map(URL, URLfunction)
result <- do.call(rbind, result)
write.csv is the last step. If you want a separate CSV for each URL, please move the write.csv into URLfunction:
write.csv(result, file = dynamicname, row.names=FALSE)
Additionally, here is the list version:
URL <- list("https://stackoverflow.com/questions/56139810/how-to-call-a-script-in-another-script-in-r",
"https://stackoverflow.com/questions/56122052/labelling-points-on-a-highcharter-scatter-chart/56123057?noredirect=1#comment98909916_56123057")
result<- purrr::map_df(URL, URLfunction)
>result
asked today yesterday
1 viewed 35 times <NA>
2 active today <NA>
3 viewed <NA> 34 times
4 active <NA> today
And the CSV version:
URL <- read.csv("PATH",header = FALSE)
result<- purrr::map_df(URL[,1], URLfunction)
>result
asked today yesterday
1 viewed 35 times <NA>
2 active today <NA>
3 viewed <NA> 34 times
4 active <NA> today
Here is an edited version of your code.
URLfunction<- function(x){
# URL that comes from list
URL <- x
# Load URL
theurl <- RCurl::getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- XML::readHTMLTable(theurl)
# Create a list
tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind,tables)
# Split date and time column out
df2 <- tidyr::separate(df, "Date / Time", c("Date", "Time"), sep = " ")
# Fill the missing column with text, in this case shapename
shapename <- unlist(qdapRegex::ex_between(URL, "ndxs", ".html"))
# qdapRegex::ex_between returns a list; when added to df2 as a column it could not be saved,
# so I added unlist().
df2$Shape <- shapename
# Save dataframe out as a csv file
write.csv(df2, paste0(shapename, '.csv'), row.names=FALSE)
# There were two errors here.
# First, you created a data frame named 'df2', not 'result', so I changed result to df2.
# Second, row.names is an argument of write.csv, not of paste0.
return(df2)
}
After defining the above function, run:
URL = c("nuforc.org/webreports/ndxsRectangle.html",
"nuforc.org/webreports/ndxsRound.html")
RESULT = purrr::map_df(URL, URLfunction)
Finally, I get the results below:
1. Rectangle.csv and Round.csv files in the save path.
2. A returned row-bound data frame that looks like this (2011 x 8):
> RESULT[1,]
Date Time City State Shape Duration
1 5/2/19 00:20 Honolulu HI Rectangle 3 seconds
Summary
1 Several of rectangles connected in different LED like colors. Such as red, green, blue, etc. ;above Waikiki. ((anonymous report))
Posted
1 5/9/19
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv() function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
# do clean-up stuff
datalist[[i]] <- d01
}
One other base R solution is the following: read the file in by lines, get the indices of the rows that begin with "D," and of the header row, then split those lines by "," and put them into a data.frame, assigning it the names from the header row.
lines <- readLines(i)
dataRows <- grep("^D,", lines)
names <- unlist(strsplit(lines[grep("Type,", lines)], split = ","))
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")), nrow = length(dataRows), byrow=T), stringsAsFactors=FALSE)
names(data) <- names
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
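For what it's worth, the skip idea from the question also works once the missing comma in grep() is fixed; a rough sketch (assuming i is a single filename, as in the question's loop):
lines <- readLines(i)
headerLine <- grep("^Type,", lines)[1] # find the header row
d <- read.csv(i, skip = headerLine - 1, header = TRUE, check.names = FALSE)
d <- d[d[[1]] == "D", ] # keep only the data rows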
You can use a custom function to loop over each file, keep only the rows whose type column starts with D, and bind them all together at the end. Drop the bind_rows() if you want them as separate data frames in a list.
load_data <-function(path) {
require(dplyr)
setwd(path)
files <- dir()
read_files <- function(x) {
data_file <- read.csv(paste(path, "/", x, sep = ""), stringsAsFactors = FALSE, na.strings=c("","NA")) # dir() already includes the .csv extension
row.number <- grep("^Type$", data_file[,1])
colnames(data_file) <- data_file[row.number,]
data_file <- data_file[-(1:row.number), ] # drop everything up to and including the header row
data_file <- data_file %>%
filter(grepl("^D", Type))
return(data_file)
}
data <- lapply(files, read_files)
return(data)
}
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))
If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which( d01[,1] == 'Type' )
d01 <- d01[-(1:headerRow),] # This keeps all rows after the header row.
# do clean-up stuff
datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which( d01[,1] == 'Type' )
header <- d01[headerRow,] # Get names from the header row first.
d01 <- d01[-(1:headerRow),] # Then keep all rows after the header row.
d01 <- setNames( d01, header ) # Assign names.
# do clean-up stuff
datalist[[i]] <- d01
}
After searching for help in different threads on this topic, I am still none the wiser. Therefore: here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, performing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, the mean of column 2, and the mean of column 4 to a data frame.
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply... Here's my take:
getMeans <- function(fpath,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
fcsv <- f[grepl("\\.csv$", f)]
fcsv <- file.path(fpath, fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
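A minimal sketch of that workflow (the single-letter commands are typed at the browser prompt):
debug(moist.each.mean)
moist.each.mean() # execution now stops at the function's first line
# at the Browse[2]> prompt: n steps to the next line, c continues, Q quits
undebug(moist.each.mean) # turn stepping off again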
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Set to empty data frame if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 23), # the 'yyyymmdd hh_mm_ss' part of the filename
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely valuable, as it saves a whole lot of code! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)),
nrow = 1, ncol = 3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
The only thing I did not get to work was wrapping this in a function: then mdf contained only one row with the last file's data. Somehow it did not add rows but overwrote row 1 with each iteration. Using it without a function wrapper worked fine, though...
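For the record, the likely reason the function version kept only one row is that a function works on local copies: assignments to mdf inside the body are discarded on return unless the function returns mdf and you assign the result. A minimal sketch of the fix, reusing the loop above:
moist.each.mean <- function(directory) {
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(file.path(directory, filelist[i]), header = FALSE, skipNul = TRUE)
if (nrow(file.in) >= 3) {
newrow <- matrix(c(filetitles[[i]], round(mean(file.in$V2, na.rm = TRUE), 1),
round(mean(file.in$V4, na.rm = TRUE), 1)), nrow = 1)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
return(mdf) # without returning mdf and assigning the result, the rows are lost
}
mdf <- moist.each.mean(tk_choose.dir("", "Choose folder for Humidity data files"))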