Multiple text into dataframe in R - r

I have 50 txt files all with multiple words like this
View(file1.txt)
one
two
three
four
cuatro
View(file2)
uno
five
seis
dos
Each file has a single column of words, and the files have different lengths.
I want to create a dataframe in R in which each column holds the content of one file and the column name is the file name.
file1 file2 ...........etc
1 one uno
2 two five
3 three seis
4 four dos
5 cuatro
So far I have loaded all the files into a list like this:
files<- lapply(list.files(pattern = "\\.txt$"),read.csv,header=F)
> class(files)
[1] "list"
df <- data.frame(matrix(unlist(files), ncol= length(files)))
which is definitely close but wrong, because there are no holes (some columns should have more data than others) and it is also not automatically naming the columns.
Hope someone can help!

Try this: get the file names, read the files in, find the maximum number of rows, then extend every column to that number of rows. Finally, convert to a data.frame:
f <- list.files(pattern = "\\.txt$", full.names = TRUE)
names(f) <- tools::file_path_sans_ext(basename(f))
res <- lapply(f, read.table)
maxRow <- max(sapply(res, nrow))
data.frame(lapply(res, function(i) i[seq(maxRow), ]))
# file1 file2
# 1 one uno
# 2 two five
# 3 three seis
# 4 four dos
# 5 cuatro <NA>

The idea is to find the file with the maximum length and use that length to complete the shorter ones, filling them up with NA so that all the vectors can be combined into one data frame.
You can achieve that with different approaches. Here is one way to do it.
# simplify = FALSE keeps a named list even when all files have the same length
files <- sapply(list.files(pattern = "\\.txt$"), readLines, simplify = FALSE)
max_len <- max(lengths(files))
df <- data.frame(lapply(files, function(x) {
  # pad shorter vectors with NA up to the longest length
  c(x, rep(NA, max_len - length(x)))
}))
names(df) <- tools::file_path_sans_ext(basename(names(files)))
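The padding step can also be written with `length<-`, which extends a vector with NA. Here is a minimal, self-contained sketch of that idea. The helper name read_padded and the two throwaway files are made up for illustration, so adjust them to your own folder:

```r
# Hypothetical helper: read every file in `paths` into one NA-padded data frame.
read_padded <- function(paths) {
  cols <- lapply(paths, readLines)
  names(cols) <- tools::file_path_sans_ext(basename(paths))
  max_len <- max(lengths(cols))
  # `length<-` pads a shorter vector with NA up to max_len
  cols <- lapply(cols, `length<-`, max_len)
  data.frame(cols, stringsAsFactors = FALSE)
}

# Demo with two throwaway files of different lengths
tmp <- tempfile("words")
dir.create(tmp)
writeLines(c("one", "two", "three", "four", "cuatro"), file.path(tmp, "file1.txt"))
writeLines(c("uno", "five", "seis", "dos"), file.path(tmp, "file2.txt"))

df <- read_padded(list.files(tmp, pattern = "\\.txt$", full.names = TRUE))
print(df)
```

Because the column names are set on the list before data.frame() is called, no renaming is needed afterwards.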

Related

I would like to automate the reading of PDF documents into R using pdf_text

I currently have code to extract certain details from a PDF document. However, as I have thousands of other PDF documents to extract information from, I would like to automate this process. I am using the pdf_text function to read PDFs into R. My code looks something like this:
library(pdftools)
library(stringr) # provides str_split()
x <- pdf_text("Test.pdf")
y1 <- str_split(x, "\r")
#pdf output contains a total of 7 lists
a <- y1 [[4]]
b <- c(a[4],a[11:13]) #Obtain only rows 4, 11 to 13 from list 4
n2 <- y1[[3]]
n3 <- c(n2[3]) #Obtain only rows 3 from list 3
n <- y1[[5]]
n1 <- c(n[3]) #Obtain only rows 3 from list 5
c <- y1[[6]]
d <- c(c[4:18]) #Obtain only rows 4 to 18 from list 6
e <- c(n3,b,d,n1) #Combining all necessary information into one list
z <- substr(e[1:21], start = 15, stop = 200) #to remove white spaces between quotes
Name <- z[1]
InterestedParty <- z[2]
TotalOwnBefore <- substr(z[11], start = 97, stop = 120)
Ownership <- list(NM = Name, Party = InterestedParty, OwnBefore = TotalOwnBefore)
write.csv(Ownership, file="MyData.csv")
The above code allows me to output a file for a single company. However, I have thousands of other PDFs ("Test_1.pdf" to "Test_1000.pdf") to read. Is there a way to automate the reading of the PDF files into R with pdf_text? It would also be great if there were a way to store all results in a single file instead of one firm per file.
I have since managed to automate the process using a for loop as follows:
for (i in 1:1000) {
  x <- paste("Test_", i, ".pdf", sep = "")
  print(x) # show which file is being read
  y <- pdf_text(x)
  total <- strsplit(y, "\r")
  print(total)
}
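To get all results into a single file, one option is to return one row per PDF and stack the rows at the end. The sketch below only shows that collecting pattern. extract_all and fake_extract are hypothetical names, and fake_extract is a stand-in so the example runs without any PDFs; in practice it would wrap the pdf_text() steps from the question:

```r
# Hypothetical driver: apply `extract_one` to each path and stack the results.
# `extract_one` must return a named list (or one-row data frame) per file.
extract_all <- function(paths, extract_one) {
  rows <- lapply(paths, function(p) {
    # a file that fails becomes NULL instead of stopping the whole run
    tryCatch(
      data.frame(File = basename(p), extract_one(p), stringsAsFactors = FALSE),
      error = function(e) NULL
    )
  })
  rows <- Filter(Negate(is.null), rows)  # drop the failed files
  do.call(rbind, rows)
}

# Stand-in extractor (an assumption, not part of pdftools)
fake_extract <- function(path) {
  if (grepl("bad", path)) stop("unreadable file")
  list(NM = paste("Company", nchar(path)), Party = "Someone")
}

paths <- c("Test_1.pdf", "Test_22.pdf", "bad.pdf")
all_rows <- extract_all(paths, fake_extract)
print(all_rows)
# write.csv(all_rows, "MyData.csv", row.names = FALSE)
```

For the real files the paths could be built with sprintf("Test_%d.pdf", 1:1000), and a single write.csv() call at the end gives one file for all firms.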

R: How to change data in a column across multiple files. Help understanding lapply

I have a folder with about 160 files that are formatted with three columns: onset time, variable1 'x', and variable 2 'y'. Onset is listed in R as a string, but it is a time variable which is Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. If I could round that would be great, but it would be okay to just remove the fractional second using something like substr(file$onset,1,8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code written below returns an error that the final line doesn't have 3 elements. For my final output it is important that my last line only have the value for onset.
lapply(files, function(x) {
t <- read.table(x, header=T) # load file
t$onset<-substr(t$onset,1,8)
out <- function(t)
# write to file
write.table(out, "filepath", sep="\t", quote=F, row.names=F, col.names=T)
})
First combine all the text files into one data frame; then you can apply strptime and format to the onset column to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list() # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
  # fill = TRUE pads the final line, which only has an onset value
  dat <- read.table(file, header = T, fill = TRUE)
  alltxt.files[[count]] <- dat # store the data frame read from each txt file
  count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset,"%H:%M:%S")
allfiles$onset <- format(allfiles$onset,"%H:%M:%S")
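If each file should stay separate, as in the question, the same trimming can be done per file with lapply. This is a rough sketch under a few assumptions: trim_onset is a made-up helper name, the output name with a _trimmed suffix is arbitrary, and the demo file is written to a temporary folder:

```r
# Hypothetical helper: trim the fractional second in one file and write it back out.
trim_onset <- function(infile, outfile) {
  # fill = TRUE tolerates the final line that only has an onset value
  t <- read.table(infile, header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  t$onset <- substr(t$onset, 1, 8)  # keep HH:MM:SS
  # na = "" leaves the missing X and Y on the last line blank
  write.table(t, outfile, sep = "\t", quote = FALSE, row.names = FALSE, na = "")
  invisible(t)
}

# Demo file in a temporary folder
tmp <- tempfile("onset")
dir.create(tmp)
infile <- file.path(tmp, "File001.txt")
writeLines(c("onset X Y", "00:55:17:95 3 3", "00:55:29:66 3 4", "01:02:00:03"), infile)

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
out <- lapply(files, function(f) trim_onset(f, sub("\\.txt$", "_trimmed.txt", f)))
print(out[[1]])
```

Note that substr() truncates rather than rounds, which the question says is acceptable.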

Using R to list and mark multiple csv files with characters from the title of those files, and put those in a dataframe

I have a large number of files that are all numbered and labeled from a CTD cast. These files all contain 3 columns, for bottle number fired, Depth, and Conductivity, and 3 rows, one for each water bottle fired.
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547
These files are named after the cast number as such "OS1505xxx.csv", where the xxx is the cast number. I would like to take the data from multiple casts, label the data with the cast number(which I presume would go in another column for each bottle sample), and then merge that data together in one dataframe.
1,68.93,0.2123,001
2,14.28,0.3139,001
3,8.683,0.3547,001
1,109.5,0.2062,002
2,27.98,0.4842,002
3,5.277,0.3705,002
One other thing: some files only have 1 or 2 bottles fired, while others have 4 bottles fired. I tried finding the files with only 3 rows, making a list of the filenames repeated three times, and then merging that with the row-bound csv files that had three rows into a dataframe, but I am very new to R and couldn't figure it out. Any help is appreciated.
This gets all of them into one data frame in order (001-100), and from there you can export it however you want.
df <- data.frame() # start empty so the result has no all-NA first row
for (i in 1:100) {
  id <- sprintf("%03d", i)
  file_name <- paste0("OS1505", id, ".csv")
  if (file.exists(file_name)) {
    print("match found")
    df_tmp <- read.csv(file_name, header = FALSE, sep = ",", fill = TRUE)
    df_tmp$file <- id # columns end up as V1, V2, V3, file
    df <- rbind(df, df_tmp)
  }
}
Try this:
files <- list.files(pattern = "OS1505")
lst <- lapply(files, read.csv, header = FALSE) # the files have no header row
ids <- substr(files, 7, 9)
for(i in 1:length(lst)) lst[[i]][,4] <- ids[i]
do.call(rbind, lst)
#   V1      V2     V3  V4
# 1  1  68.930 0.2123 001
# 2  2  14.280 0.3139 001
# 3  3   8.683 0.3547 001
# 4  1 109.500 0.2062 002
# 5  2  27.980 0.4842 002
# 6  3   5.277 0.3705 002
To test this, we first create two dummy data frames and save them as csv files, named to match your files (i.e. "OS1505001.csv"). They are written without a header or row names so that they have the same layout as the files in the question:
file1 <- read.table(text="
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547", sep=',')
file2 <- read.table(text="
1,109.5,0.2062
2,27.98,0.4842
3,5.277,0.3705", sep=',')
write.table(file1, "OS1505001.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(file2, "OS1505002.csv", sep = ",", row.names = FALSE, col.names = FALSE)
Going through the code, files checks the directory for any files that have OS1505 in them. There are two files that match that description "OS1505001.csv" "OS1505002.csv". We bring those two files into R with read.csv. It is wrapped in lapply so that the process can happen to all of the files in the files vector at once and saved in a list called lst. Now ids is a way to grab the id numbers from the file names. In a for loop we assign each id to the 4th column of the data frames. Lastly, do.call brings it all together with the rbind function.
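The same steps also work when casts have different numbers of bottles, because rbind only needs matching columns, not matching row counts. A small sketch, assuming headerless files; the helper name read_casts and the column names Bottle, Depth, Conductivity and Cast are my own labels, not something from the question:

```r
# Hypothetical helper: read headerless cast files and tag each row with its cast number.
read_casts <- function(files) {
  ids <- substr(basename(files), 7, 9)  # "OS1505001.csv" -> "001"
  lst <- Map(function(f, id) {
    d <- read.csv(f, header = FALSE,
                  col.names = c("Bottle", "Depth", "Conductivity"))
    d$Cast <- id
    d
  }, files, ids)
  do.call(rbind, unname(lst))
}

# Demo: one cast with three bottles, one with a single bottle
tmp <- tempfile("ctd")
dir.create(tmp)
writeLines(c("1,68.93,0.2123", "2,14.28,0.3139", "3,8.683,0.3547"),
           file.path(tmp, "OS1505001.csv"))
writeLines("1,109.5,0.2062", file.path(tmp, "OS1505002.csv"))

casts <- read_casts(list.files(tmp, pattern = "^OS1505.*\\.csv$", full.names = TRUE))
print(casts)
```

The cast number is kept as a character string so that the leading zeros survive.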

cbind data without specifying inputs

I have a piece of code in which I want to cbind my data. The catch is that I will not always have eight data files to cbind. I would like to keep the code below and import just five, if I have five.
The reason is this: I will always have between 1 and 100 dataframes to cbind, and I don't want to manually tell R each time whether to cbind one or 100. I want to just have something like cbind(1:100) that always binds whatever is there.
finaltable<- cbind(onea, twoa, threea, foura, fivea, sixa, sevena, eighta)
Without more data, here's a contrived example. First, I'll make some example files with the same number of rows in each:
filenames <- paste0(c('onea', 'twoa', 'threea', 'foura'), '.csv')
for (fn in filenames)
write.csv(matrix(runif(5), nc = 1), file = fn, row.names = FALSE)
Let's first dynamically derive a list of filenames to process. (This code is assuming that the previous lines making these files did not happen.)
(filenames <- list.files(pattern = '\\.csv$'))
## [1] "foura.csv" "onea.csv" "threea.csv" "twoa.csv"
This is the "hard" part, reading the files:
(ret <- do.call(cbind, lapply(filenames,
function(fn) read.csv(fn, header = TRUE))))
## V1 V1 V1 V1
## 1 0.9091705 0.4934781 0.7607488 0.4267438
## 2 0.9692987 0.4349523 0.6066990 0.9134305
## 3 0.6444404 0.8639983 0.1473830 0.9844336
## 4 0.7719652 0.1492200 0.7731319 0.9689941
## 5 0.9237107 0.6317367 0.2565866 0.1084299
For proof of concept, here's the same thing but operating on a subset of the vector of filenames, showing that the length of the vector is not a concern:
(ret <- do.call(cbind, lapply(filenames[1:2],
function(fn) read.csv(fn, header = TRUE))))
## V1 V1
## 1 0.9091705 0.4934781
## 2 0.9692987 0.4349523
## 3 0.6444404 0.8639983
## 4 0.7719652 0.1492200
## 5 0.9237107 0.6317367
You may want/need to redefine the names of the columns (with names(ret) <- filenames, for example), but you can always reference the columns by numbered indexing (e.g., ret[,2]) without worrying about names.
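Putting the pieces together, here is a rough sketch of a wrapper that binds however many files it is given and names each column after its file. The name cbind_files is made up, and it assumes one data column per file with the same number of rows in each:

```r
# Hypothetical helper: column-bind however many CSV files are passed in.
cbind_files <- function(filenames) {
  tables <- lapply(filenames, read.csv, header = TRUE)
  ret <- do.call(cbind, tables)
  # assumes one column per file, so each column can take its file's name
  names(ret) <- tools::file_path_sans_ext(basename(filenames))
  ret
}

# Demo with three throwaway files in a temporary folder
tmp <- tempfile("cbind")
dir.create(tmp)
for (nm in c("onea", "twoa", "threea")) {
  write.csv(data.frame(V1 = runif(5)),
            file.path(tmp, paste0(nm, ".csv")), row.names = FALSE)
}

ret <- cbind_files(list.files(tmp, pattern = "\\.csv$", full.names = TRUE))
print(dim(ret))
```

The call is the same whether the folder holds one file or a hundred, since do.call passes the whole list to cbind.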

merging multiple csv's in R

Hi, I am merging CSV files downloaded from NSE Bhavcopy. Different dates have different numbers of rows: for example, the file for 26-12-2006 has 998 rows and the one for 27-12-2006 has 1003 rows. Each file has 8 columns. I use cbind to create a and b with just 2 columns, symbol and close price, and I name the columns with colnames so that I can merge by SYMBOL.
Questions:
1) When I use the merge function with by = "SYMBOL", all = F, I was surprised to see that the resulting c has 1011 rows. Everything I have read says that merging with all = F should give 998 rows, or at most 1003. I also analyzed the data and found 5 different symbols in 27-12-2006 and 3 different symbols in 26-12-2006. So when we merge by "SYMBOL", are the new symbols from both files added, or does it merge only with the rows that already exist in a?
2) NSEmerg is a function used in a for loop to read a new file each time and merge it with the existing c. I have about 1535 files with data from Dec 2006 to Apr 2013. However, I was not able to merge more than 12 files, as it throws an error that a vector of size 12 MB cannot be allocated. It also shows warning messages saying the memory allocation of 1535 MB is used up. Also, at the 12th file I found nrow of c to be 1508095, implying the loop is running infinitely. Of all the 1535 files, the largest has 1435 rows. Even if we add all stocks that were delisted or not traded on a specific date, I believe it would not exceed 2200 stocks. Why does this show an nrow of 1.5 million?
3) Is there a better way of merging CSV files? This is my first time on Stack Overflow; otherwise I would have attached, say, 10 files.
Code:
a <- read.csv("C://Users/home/desktop/061226.csv", stringsAsFactors = F, header = T)
b <- read.csv("C://Users/home/desktop/061227.csv", stringsAsFactors = F, header = T)
a_date <- a[2,1]
b_date <- b[2,1]
a <- cbind(a[,2],a[,6])
b <- cbind(b[,2], b[,6])
colnames(a) <- c("SYMBOL", a_date)
colnames(b) <- c("SYMBOL", b_date)
c <- merge(a,b,by = "SYMBOL", all = F)
NSEmerg <- function(x,y) {
y_date <- y[2,1]
y <- cbind(y[,2], y[,6])
colnames(y) <- c("SYMBOL", y_date)
c <- merge(c, y, by = "SYMBOL", all = F)
}
filenames = list.files(path = "C:/Users/home/Documents/Rest data", pattern = "*csv")
for (i in 1:length(filenames)){
y <- read.csv(filenames[i], header = T, stringsAsFactors = F)
c <- NSEmerg(c,y)
}
write.csv(c, file = "NSE.csv")
Are you sure you want to cbind and not rbind? To answer your last question: first, list all the .csv files in your folder:
listfiles <- list.files(path="C:/Users/home/desktop", pattern='\\.csv$', full.names=TRUE)
Next use do.call to read in the different csv files and combine them with rbind.
df <- do.call(rbind, lapply(listfiles , read.csv))
You'd probably be better off just using a perl one-liner:
perl -pe1 file1 file2 file3 ... > newfile
and then you can cut the columns you need out
cut -f1,2 -d"," newfile > result
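On questions 1 and 2: with all = F, merge keeps only symbols present in both tables, so new symbols are dropped rather than added. The only way an inner merge can return more rows than either input is when the key is duplicated, since every matching pair produces a row. That would also explain the row count exploding after a dozen merges, assuming SYMBOL is not unique within a day's file (an assumption; the question does not show the data). A small sketch with made-up data:

```r
# Made-up data: "ABC" appears twice on each day (e.g. two series of one symbol)
a <- data.frame(SYMBOL = c("ABC", "ABC", "XYZ"), d1 = c(10, 11, 20))
b <- data.frame(SYMBOL = c("ABC", "ABC", "XYZ", "NEW"), d2 = c(12, 13, 21, 5))

# 2 x 2 rows for "ABC" plus 1 for "XYZ"; "NEW" is dropped
inner <- merge(a, b, by = "SYMBOL", all = FALSE)
print(nrow(inner))

# After removing duplicate keys, an inner merge cannot exceed the smaller table
a1 <- a[!duplicated(a$SYMBOL), ]
b1 <- b[!duplicated(b$SYMBOL), ]
tables <- list(a1, b1, data.frame(SYMBOL = c("ABC", "XYZ"), d3 = c(14, 22)))

# Reduce merges the list pairwise, however many tables it contains
merged <- Reduce(function(x, y) merge(x, y, by = "SYMBOL", all = FALSE), tables)
print(merged)
```

Whether dropping duplicates with duplicated() is the right choice depends on what the repeated symbols mean in the real files, so check them before merging.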
