Merging multiple CSVs in R

Hi, I was merging CSV files downloaded from NSE Bhavcopy. Different dates have different numbers of rows: say, 26-12-2006 had 998 rows and 27-12-2006 has 1003 rows. Each file has 8 columns. I use cbind to create a and b with just 2 columns, symbol and close price, and I name the columns with colnames so that I can merge by SYMBOL.
Questions:
1) When I use the merge function with by = "SYMBOL", all = F, I was surprised to see the resulting c having 1011 rows. Everywhere I read, merging with all = F should give 998 rows, or at most 1003 rows. I also analyzed the data and found there were 5 symbols in 27-12-2006 that were not in 26-12-2006, and 3 in 26-12-2006 that were not in 27-12-2006. So when we merge by "SYMBOL", will new symbols from both sides be added, or will it merge only with the rows already existing in a? (See the small demonstration after the code below.)
2) NSEmerg is a function using a for loop to read a new file each time and merge it with the existing c. I have about 1535 files with data from Dec 2006 till Apr 2013. However, I was not able to merge more than 12 files: it throws an error that a vector of 12 MB cannot be allocated, along with warning messages saying 1535 MB of memory allocation has been used up. Also, at the 12th file I found nrow(c) to be 1508095, implying the result is growing out of control. Of all 1535 files, the highest row count was 1435. Even if we add all stocks that were delisted or not traded on a specific date, I believe it would not cross 2200 stocks. Why does nrow reach 1.5 million?
3) Is there a better way of merging CSVs? I am on Stack Overflow for the first time, otherwise I would have attached, say, 10 files.
Code:
a <- read.csv("C://Users/home/desktop/061226.csv", stringsAsFactors = F, header = T)
b <- read.csv("C://Users/home/desktop/061227.csv", stringsAsFactors = F, header = T)
a_date <- a[2, 1]
b_date <- b[2, 1]
a <- cbind(a[, 2], a[, 6])
b <- cbind(b[, 2], b[, 6])
colnames(a) <- c("SYMBOL", a_date)
colnames(b) <- c("SYMBOL", b_date)
c <- merge(a, b, by = "SYMBOL", all = F)

NSEmerg <- function(x, y) {
  y_date <- y[2, 1]
  y <- cbind(y[, 2], y[, 6])
  colnames(y) <- c("SYMBOL", y_date)
  merge(x, y, by = "SYMBOL", all = F)
}

filenames <- list.files(path = "C:/Users/home/Documents/Rest data",
                        pattern = "\\.csv$", full.names = TRUE)
for (i in 1:length(filenames)) {
  y <- read.csv(filenames[i], header = T, stringsAsFactors = F)
  c <- NSEmerg(c, y)
}
write.csv(c, file = "NSE.csv")
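A minimal demonstration, with made-up data, of the merge() behaviour behind 1) and 2): when a SYMBOL occurs more than once on either side, merge(..., all = F) returns one row per matching pair, so duplicated symbols inflate the row count, and the inflation compounds at every merge in the loop.
# hypothetical tables: "B" is duplicated on both sides
a <- data.frame(SYMBOL = c("A", "B", "B"), p26 = 1:3)
b <- data.frame(SYMBOL = c("B", "B", "C"), p27 = 4:6)
m <- merge(a, b, by = "SYMBOL", all = F)
nrow(m)  # 4: the two "B" rows in a pair with the two "B" rows in b (2 x 2)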

Are you sure you want to cbind and not rbind? To answer your last question: first you list all the .csv files in your folder:
listfiles <- list.files(path="C:/Users/home/desktop", pattern='\\.csv$', full.names=TRUE)
Next use do.call to read in the different csv files and combine them with rbind.
df <- do.call(rbind, lapply(listfiles, read.csv))
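If the wide layout from the question (one close-price column per date) is what you are really after, here is a sketch that avoids the repeated merges, assuming, as in the question, that column 1 holds the date, column 2 the symbol and column 6 the close price:
listfiles <- list.files(path = "C:/Users/home/desktop", pattern = '\\.csv$', full.names = TRUE)
# read every file into long format: one row per symbol per date
long <- do.call(rbind, lapply(listfiles, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  data.frame(SYMBOL = d[, 2], DATE = d[2, 1], CLOSE = d[, 6],
             stringsAsFactors = FALSE)
}))
# pivot to wide: one row per SYMBOL, one CLOSE.<date> column per date
wide <- reshape(long, idvar = "SYMBOL", timevar = "DATE", direction = "wide")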

You'd probably be better off just using a perl one-liner:
perl -pe1 file1 file2 file3 ... > newfile
and then you can cut out the columns you need:
cut -f1,2 -d"," newfile > result

Related

Multiple text files into dataframe in R

I have 50 txt files, each with multiple words, like this:
View(file1.txt)
one
two
three
four
cuatro
View(file2)
uno
five
seis
dos
Each file has a single column of words, and the files have different lengths.
I want to create a dataframe in R that has the content of each file in one column, with the file name as the column name.
file1 file2 ...........etc
1 one uno
2 two five
3 three seis
4 four dos
5 cuatro
So far I have loaded all the files into a list like this:
files <- lapply(list.files(pattern = "\\.txt$"), read.csv, header = F)
> class(files)
[1] "list"
df <- data.frame(matrix(unlist(files), ncol= length(files)))
which is definitely close, but wrong, because there are no holes (some columns should have less data than others) and it also doesn't name the columns automatically.
Hope someone can help!
Try this: get the file names, read them in, find the maximum number of rows, then extend each file to that number of rows. Finally, convert to data.frame:
f <- list.files(pattern = "\\.txt$", full.names = TRUE)
names(f) <- tools::file_path_sans_ext(basename(f))
res <- lapply(f, read.table)
maxRow <- max(sapply(res, nrow))
data.frame(lapply(res, function(i) i[seq(maxRow), ]))
# file1 file2
# 1 one uno
# 2 two five
# 3 three seis
# 4 four dos
# 5 cuatro <NA>
The idea is to take the file with the maximum length and use that length to complete the others (those with fewer lines), filling up with NA, in order to make it possible to combine vectors of different lengths. You can achieve that with different approaches; here is one way to do it.
files <- sapply(list.files(pattern = "\\.txt$"), readLines)
max_len <- max(sapply(files, length))
df <- data.frame(sapply(seq_along(files), function(i) {
  len <- length(files[[i]])
  if (len < max_len) {
    append(files[[i]], rep(NA, max_len - len))
  } else {
    files[[i]]
  }
}))
names(df) <- basename(tools::file_path_sans_ext(names(files)))

Skip rows until a certain 'value' is found - bulk CSV upload, inconsistent pattern of row numbers

I am trying to bulk load and merge two dozen CSV files in R. I am adding a column to my data frame to identify each file by its file name in a column called 'file_name'.
Each file has a number of rows on top that are superfluous. Originally I thought it was a consistent pattern of 16 rows. I leveraged the 'skip' argument of the read_csv function to address that issue.
However, after closer inspection I discovered that the pattern of 16 rows is inconsistent from file to file. The only consistent pattern I found was that all the 'good data' in each file comes right after a line that says "Attendee Details".
I am trying to figure out a way to skip all the rows before and including the row containing the text "Attendee Details".
Here is my code so far:
library(tidyverse)

list_of_files <- list.files(pattern = '\\.csv$')
df <- list_of_files %>%
  setNames(nm = .) %>%
  map_df(~ read_csv(.x, col_types = cols(), col_names = FALSE), .id = "file_name")
(Screenshots showing the raw data and the desired per-file result are omitted here.)
I already tried to redesign the code snippets I found in the thread "Skip over all lines in a data file before and including a regular string in a loop in R" - unfortunately I hit errors and am a bit stuck on where to go from here:
for (x in list.files(pattern = "*.csv", recursive = TRUE)) {
  all_content <- readLines(x)
  skip <- all_content[-c(1:grep("Attendee Details", all_content))]
  input <- read.table(textConnection(skip))
  df <- rbind(df, input)
}
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 13 elements
Any pointers? Also, I know it is preferred to post a reproducible example here - I couldn't quite figure out how to do that either.
1) As a pointer:
did you initialize df, as in the example? (rbind expects the same number of columns, which you apparently do not have here)
are there files without data after the header string? (you might want to exclude those before performing rbind, so maybe read them into a list first)
do the files have a varying number of columns after the header string? (same idea as before: either exclude those or use fill = TRUE)
Here is one possible solution with data.table that also illustrates a reproducible example (but may not consider the specifics of your files, obviously):
library(data.table)

makeHeader <- function(x) {
  fn <- paste0("myfile_", sprintf("%02d", x), ".csv")
  cat("Attendee report\n", file = fn, append = FALSE)
  cat("Report Generated,01/01/2020\n", file = fn, append = TRUE)
  cat("Topic,topic ", x, "\n", file = fn, append = TRUE)
  cat("Attendee Details\n", file = fn, append = TRUE)
}

# generate files:
set.seed(1)
bigdt <- data.table(col1 = sample(1:12, 1.2e4, replace = TRUE),
                    col2 = sample(LETTERS[1:26], 1.2e4, replace = TRUE),
                    col3 = sample(20:50, 1.2e4, replace = TRUE))
biglist <- split(bigdt, ceiling(seq_len(dim(bigdt)[1]) / 1e3))
rm(bigdt)
invisible(lapply(seq_along(biglist), makeHeader))
invisible(lapply(seq_along(biglist),
                 function(x) fwrite(biglist[[x]],
                                    file = paste0("myfile_", sprintf("%02d", x), ".csv"),
                                    append = TRUE, col.names = TRUE)))

# read files and combine; add column for file name
files <- list.files(pattern = "myfile_.*.csv")
DT <- lapply(files, function(x) fread(x, skip = "Attendee Details", sep = ","))
names(DT) <- files
rbindlist(DT, idcol = "file", fill = TRUE)
#> file col1 col2 col3
#> 1: myfile_01.csv 9 N 48
#> 2: myfile_01.csv 4 V 33
#> 3: myfile_01.csv 7 X 22
#> 4: myfile_01.csv 1 W 49
#> 5: myfile_01.csv 2 X 34
#> ---
#> 11996: myfile_12.csv 10 S 29
#> 11997: myfile_12.csv 5 T 25
#> 11998: myfile_12.csv 8 B 34
#> 11999: myfile_12.csv 11 P 43
#> 12000: myfile_12.csv 1 O 37
Created on 2020-05-31 by the reprex package (v0.3.0)
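For comparison, a base-R sketch of the same idea, staying close to the question's own loop. The original read.table call failed mainly because it lacked sep = "," and a header; this sketch assumes each file contains exactly one "Attendee Details" line, followed directly by the header row:
read_after_marker <- function(f) {
  all_content <- readLines(f)
  marker <- grep("Attendee Details", all_content)[1]
  body <- all_content[-seq_len(marker)]          # drop the marker line and everything above it
  read.csv(text = paste(body, collapse = "\n"))  # the real header row comes right after the marker
}

files <- list.files(pattern = "\\.csv$")
df <- do.call(rbind, Map(cbind, file_name = files, lapply(files, read_after_marker)))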

R: Read in random rows from file using fread or equivalent?

I have a very large multi-gigabyte file which is too costly to load into memory. The ordering of the rows in the file, however, is not random. Is there a way to read in a random subset of the rows using something like fread?
Something like this, for example?
data <- fread("data_file", nrows_sample = 90000)
This GitHub post suggests one possibility is to do something like this:
fread("shuf -n 5 data_file")
This does not work for me, however. Any ideas?
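One side note before the answers: fread can read from a shell command via its cmd argument, so where shuf is actually installed (it is part of GNU coreutils and typically absent on Windows, which may be why the snippet above fails) a sketch like this should work:
library(data.table)
# sample 90,000 random lines; header = FALSE because shuf is
# very unlikely to emit the header line first
sampled <- fread(cmd = "shuf -n 90000 data_file", header = FALSE)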
Using the tidyverse (as opposed to data.table), you could do:
library(readr)
library(purrr)
library(dplyr)

# generate some random numbers between 1 and however many rows your file has,
# assuming you can ballpark the number of rows in your file
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final data frame
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10)))

# sort the index sequentially
start_at <- start_at[order(start_at)]

# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame;
# col_names = FALSE so the first row of each chunk isn't consumed as a header
sample_of_rows <- map_dfr(start_at,
                          ~ read_csv("data_file", n_max = 10, skip = .x,
                                     col_names = FALSE))
If your data file happens to be a text file this solution using the package LaF could be useful:
library(LaF)

# Prepare dummy data
mat <- matrix(sample(letters, 10 * 1000000, TRUE), nrow = 1000000)
dim(mat)
#[1] 1000000 10
write.table(mat, "tmp.csv",
            row.names = FALSE,
            sep = ",",
            quote = FALSE)

# Read 90,000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
                           n = 90000,
                           nlines = 1000000)
random_mat <- do.call("rbind", strsplit(random_mat, ","))
Sys.time() - start
#Time difference of 1.135546 secs
dim(random_mat)
#[1] 90000 10

R: How to change data in a column across multiple files. Help understanding lapply

I have a folder with about 160 files that are formatted with three columns: onset time, variable 1 'x', and variable 2 'y'. Onset is read into R as a string, but it is a time variable of the form Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. If I could round, that would be great, but it would be okay to just remove the fractional second using something like substr(file$onset, 1, 8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code written below returns an error that the final line doesn't have 3 elements. For my final output it is important that my last line only has the value for onset.
lapply(files, function(x) {
  t <- read.table(x, header = T)  # load file
  t$onset <- substr(t$onset, 1, 8)
  # write to file
  write.table(t, "filepath", sep = "\t", quote = F, row.names = F, col.names = T)
})
First create a data frame from all the text files; then you can apply the strptime and format functions to the onset vector to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list()  # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
  dat <- read.table(file, header = TRUE, fill = TRUE)  # fill = TRUE pads the short final line
  alltxt.files[[count]] <- dat  # create a list of tables from the txt files
  count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset, "%H:%M:%S")
allfiles$onset <- format(allfiles$onset, "%H:%M:%S")
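Since the question also asks about understanding lapply and about writing each file back out individually, here is a hedged per-file sketch; the file pattern is an assumption based on the names in the question, the tab-separated output mirrors the question's write.table call, and each corrected table is written back over its original file:
files <- list.files(pattern = "^File", full.names = TRUE)
invisible(lapply(files, function(x) {
  t <- read.table(x, header = TRUE, fill = TRUE)  # fill = TRUE tolerates the onset-only last line
  t$onset <- substr(t$onset, 1, 8)                # drop the fractional second
  write.table(t, x, sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
}))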

How can I read a CSV file containing some additional text data?

I need to read a CSV file in R, but the file contains some text information in some rows instead of comma-separated values, so I cannot read it with read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values under each name/date pair as a data frame. To do that, how should I read the file?
My required output is:
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings staring with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
                            dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings that do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
  lapply(x, function(y)
    if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
                                                       ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would read the entire file first as a vector of characters, i.e. one string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for the lines containing only ,, (see grep for that). Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that. A sketch of this recipe follows.
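Here is a sketch of that recipe in base R (using cumsum over the name: lines to form the blocks rather than str_extract, so no extra package is needed; text is the same string as in the answer above):
lines <- readLines(textConnection(text))  # or readLines("yourfile.csv")
grp <- cumsum(grepl("^name:", lines))     # a new block starts at every name: line
blocks <- split(lines, grp)
dataFrames <- lapply(blocks, function(b) {
  b <- b[-1]                              # drop the name:/date: line
  b <- b[b != ",,"]                       # drop the ',,' separator
  as.data.frame(do.call(rbind, strsplit(b, ",")), stringsAsFactors = FALSE)
})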
