Load all R data files from a specific folder

I've got a lot of RData files which I want to combine into one dataframe.
My files, as an example, are:
file1.RData
file2.RData
file3.RData
All the data files have the same structure: datafile$a and datafile$b. From each of the files above I would like to take the variable $a and add it to an already existing dataframe called md. My problem isn't loading the files into the global environment, but processing the data in the RData files.
My code so far, which obviously doesn't work:
library(dplyr)
files <- list.files("correct directory", pattern="*.RData")
This returns the correct list of files.
I also know I need to lapply over a function.
lapply(files, myFun)
My problem is in the function. What I've got at the moment:
myFun <- function(files) {
  load(files)
  df <- data.frame(datafile$a)
  md <- bind_rows(md, df)
}
The code above doesn't work; any idea how I can get it to work?

Try
library(dplyr)
bind_rows(lapply(files, myFun))
# a
#1 1
#2 2
#3 3
#4 4
#5 5
#6 1
#7 2
#8 3
#9 4
#10 5
#11 6
#12 7
#13 8
#14 9
#15 10
#16 11
#17 12
#18 13
#19 14
#20 15
where
myFun <- function(file) {
  #load() creates the object 'datafile' inside the function's environment
  load(file)
  #return only column 'a' as a one-column data frame
  data.frame(a = datafile$a)
}
data
datafile <- data.frame(a=1:5, b=6:10)
save(datafile, file='file1.RData')
datafile <- data.frame(a=1:15, b=16:30)
save(datafile, file='file2.RData')
files <- list.files(pattern='file\\d+.RData')
files
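If the object saved inside each .RData file is not always named datafile, a more general version of myFun (a sketch of mine, assuming each file contains exactly one object) loads into a fresh environment and looks the object up by name:
myFunGeneral <- function(file) {
  e <- new.env()
  #load() returns the names of the objects it restored
  nm <- load(file, envir = e)
  #assumes exactly one object per file
  data.frame(a = e[[nm[1]]]$a)
}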

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read around 130,000 images (each H x W pixels) into R and then to build a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
get_rgb_table <- function(link){
  img <- readJPEG(toString(link))
  #creating the data frame of channel values
  rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
                          g = as.vector(img[1:H, 1:W, 2]),
                          b = as.vector(img[1:H, 1:W, 3]))
  #add pixel id
  rgb_image$pixel_no <- row.names(rgb_image)
  #add image id (strip the directory part of the path)
  train_rgb <- cbind(sub('.*/', '', link), rgb_image)
  colnames(train_rgb)[1] <- "image_no"
  return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(link = matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
  train_data <- rbind(train_data, get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick; the problem is the loop and the rbind. I have tried using apply() but can't manage to run it on each row and collect the results in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x) randu[1:x, ]  #returns the first x rows of the built-in 'randu' data set
And I have a data frame df that will act as an input to the function.
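(For reproducibility, a df matching the print below can be built as, inferred from that output: df <- data.frame(a = 1:10, b = 21:30).)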
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So I apply the function to each value of that column using lapply, then bind the results together using do.call and rbind, like this:
do.call(rbind, lapply(df$a, MyFun))
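Applied to the image problem above, the same idea replaces the loop that grows train_data one rbind at a time. A sketch, assuming get_rgb_table and train are defined as in the question and that the data.table package is installed:
library(data.table)
#build one data frame per image, then bind them all at once
train_data <- rbindlist(lapply(train[, 1], get_rgb_table))
rbindlist allocates the result once instead of copying all previously accumulated rows on every iteration, which is where the original loop loses its time.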

Append data to different sheets in an excel in R

I have a dataframe like
All_DATA
ID Name Age
1 xyz 10
2 pqr 20
5 abc 15
6 pqr 19
8 xyz 10
9 pqr 12
10 abc 20
54 abc 41
Right now I have code which subsets the data based on Name and writes each subset to a different Excel file, but now I want them in the same Excel file on different sheets.
Here is the code for putting them into different excel
library("xlsx")
library("openxlsx")
All_DATA = read.xlsx("D:/test.xlsx")
data.list=list()
for(i in unique(All_DATA$Name)){
data.list[[i]] = subset(All_DATA,NAME==i)
write.xlsx( data.list[[i]],file=paste0("D:/Admin/",i,".xlsx"),row.names=F)
}
Is there any way by which a single Excel file with data on multiple sheets can be generated?
Thanks
Domnick
You can use
write.xlsx(data.list[[i]], file="file.xlsx", sheetName=paste0("Sheet_", i), row.names=FALSE, append=TRUE)
inside the loop. With the xlsx package, append=TRUE adds each subset as a new sheet rather than overwriting the file; note the sheet name should not include a .xlsx extension.
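Alternatively (a suggestion of mine, not part of the original answer), openxlsx::write.xlsx accepts a named list of data frames and writes each element to its own sheet in a single call:
#sheet names are taken from the names of data.list
openxlsx::write.xlsx(data.list, file = "file.xlsx")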

Ignoring specific characters in read table of list

I have the following string:
> str1<-'cor [1] 0.8832846 0.8880517 0.8881286 0.8845148 0.8832846 0.8880517 0.8818238 0.8767492 0.8876672 0.8822851 0.8854375 0.8850531 0.8835153
[14] 0.8832846 0.8908965 0.8803629'
I use the following command:
> df1 <- read.table(text=scan(text=str1, what='', quiet=TRUE), header=TRUE)
However, [1] and [14] are included in df1. What can I change in order to ignore all [x] (where x is a number)?
We can remove the square brackets and the numbers inside them with gsub (the pattern '\\[\\d+\\]' matches a literal '[', one or more digits, and a literal ']'), then scan and read.table as in the OP's post.
read.table(text=scan(text=gsub('\\[\\d+\\]', '', str1),
                     what='', quiet=TRUE), header=TRUE)
# cor
#1 0.8832846
#2 0.8880517
#3 0.8881286
#4 0.8845148
#5 0.8832846
#6 0.8880517
#7 0.8818238
#8 0.8767492
#9 0.8876672
#10 0.8822851
#11 0.8854375
#12 0.8850531
#13 0.8835153
#14 0.8832846
#15 0.8908965
#16 0.8803629
Or without using scan, as @Richard Scriven mentioned:
read.table(text=gsub('\\s+(\\[\\d+\\]\\s+)?', '\n', str1), header=TRUE)
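A third option (my own sketch, not from either answer above): scan the string into character tokens, drop the bracketed indices and the cor header, and convert the rest to numeric directly:
tokens <- scan(text=str1, what='', quiet=TRUE)
df1 <- data.frame(cor = as.numeric(tokens[!grepl('^\\[\\d+\\]$', tokens)][-1]))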

R reading values of numeric field in file wrongly

R is reading the values from a file wrongly. One can check if this statement is true with the following example:
(1) Copy and paste the following 10 numbers into a test file (sample.csv)
1000522010609612
1000522010609613
1000522010609614
1000522010609615
1000522010609616
1000522010609617
971000522010609612
1501000522010819466
971000522010943717
1501000522010733490
(2) Read these contents into R using read.csv
X <- read.csv("sample.csv", header=FALSE)
(3) Print the output
print(head(X, n=10), digits=22)
The output I got was
V1
1 1000522010609612.000000
2 1000522010609613.000000
3 1000522010609614.000000
4 1000522010609615.000000
5 1000522010609616.000000
6 1000522010609617.000000
7 971000522010609664.000000
8 1501000522010819584.000000
9 971000522010943744.000000
10 1501000522010733568.000000
The problem is that rows 7, 8, 9 and 10 are not correct (compare them with the 10 numbers we started with).
What could be the problem? Is there some setting I am missing in my R terminal?
It isn't a terminal setting: read.csv stores these values as doubles, which carry only 53 bits of precision (about 15-16 significant digits), so integers as large as those in rows 7-10 cannot be represented exactly. You could try the bit64 package, which provides a 64-bit integer class:
library(bit64)
x <- read.csv('sample.csv', header=FALSE, colClasses='integer64')
x
# V1
#1 1000522010609612
#2 1000522010609613
#3 1000522010609614
#4 1000522010609615
#5 1000522010609616
#6 1000522010609617
#7 971000522010609612
#8 1501000522010819466
#9 971000522010943717
#10 1501000522010733490
If you have bit64 loaded, you can also try fread from data.table:
library(data.table)
x1 <- fread('sample.csv')
x1
# V1
#1: 1000522010609612
#2: 1000522010609613
#3: 1000522010609614
#4: 1000522010609615
#5: 1000522010609616
#6: 1000522010609617
#7: 971000522010609612
#8: 1501000522010819466
#9: 971000522010943717
#10: 1501000522010733490
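If the values are identifiers rather than quantities to compute with, another option (my suggestion) is to read them as character, which sidesteps the floating-point precision limit entirely:
x2 <- read.csv('sample.csv', header=FALSE, colClasses='character')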

Manipulating multiple files in R

I am new to R and am looking for code to manipulate the hundreds of files that I have at hand. They are .txt files with a few rows of unwanted text, followed by columns of data, looking something like this:
XXXXX
XXXXX
XXXXX
Col1 Col2 Col3 Col4 Col5
1 36 37 35 36
2 34 34 36 37
.
.
1500 34 35 36 35
I wrote the code below to extract selected rows of columns 1 and 5 from an individual .txt file, and would like to loop over all the files that I have.
data <- read.table("/Users/tan/Desktop/test/01.txt", skip = 264, nrows = 932)
selcol <- c("V1", "V5")
write.table(data[selcol], file="/Users/tan/Desktop/test/01ed.txt", sep="\t")
With the above code, the .txt file now looks like this:
Col1 Col5
300 34
.
.
700 34
If possible, I would like to combine the Col5 values from all the .txt files with a single copy of Column 1 (which is the same for all files), so that it looks something like this:
Col1 Col5a Col5b Col5c Col5d ...
300 34 34 36 37
.
.
700 34 34 36 37
Thank you!
Tan
Alright - I think I hit on all your questions here, but let me know if I missed something. The general process that we will go through here is:
1. Identify all of the files that we want to read in and process in our working directory
2. Use lapply to iterate over each of those file names to create a single list object that contains all of the data
3. Select your columns of interest
4. Merge them together by the common column
For the purposes of the example, consider I have four files named file1.txt through file4.txt that all look like this:
x y y2
1 1 2.44281173 -2.32777987
2 2 -0.32999022 -0.60991623
3 3 0.74954561 0.03761497
4 4 -0.44374491 -1.65062852
5 5 0.79140012 0.40717932
6 6 -0.38517329 -0.64859906
7 7 0.92959219 -1.27056731
8 8 0.47004041 2.52418636
9 9 -0.73437337 0.47071120
10 10 0.48385902 1.37193941
##1. identify files to read in
filesToProcess <- dir(pattern = "file.*\\.txt$")
> filesToProcess
[1] "file1.txt" "file2.txt" "file3.txt" "file4.txt"
##2. Iterate over each of those file names with lapply
listOfFiles <- lapply(filesToProcess, function(x) read.table(x, header = TRUE))
##3. Select columns x and y2 from each of the objects in our list
listOfFiles <- lapply(listOfFiles, function(z) z[c("x", "y2")])
##NOTE: you can combine steps 2 and 3 by passing in the colClasses parameter to read.table.
#That code would be:
listOfFiles <- lapply(filesToProcess, function(x) read.table(x, header = TRUE,
                      colClasses = c("integer", "NULL", "numeric")))
##4. Merge all of the objects in the list together with Reduce.
# x is the common columns to join on
out <- Reduce(function(x,y) {merge(x,y, by = "x")}, listOfFiles)
#clean up the column names
colnames(out) <- c("x", sub("\\.txt", "", filesToProcess))
Results in the following:
> out
x file1 file2 file3 file4
1 1 -2.32777987 -0.671934857 -2.32777987 -0.671934857
2 2 -0.60991623 -0.822505224 -0.60991623 -0.822505224
3 3 0.03761497 0.049694686 0.03761497 0.049694686
4 4 -1.65062852 -1.173863215 -1.65062852 -1.173863215
5 5 0.40717932 1.189763270 0.40717932 1.189763270
6 6 -0.64859906 0.610462808 -0.64859906 0.610462808
7 7 -1.27056731 0.928107752 -1.27056731 0.928107752
8 8 2.52418636 -0.856625895 2.52418636 -0.856625895
9 9 0.47071120 -1.290480033 0.47071120 -1.290480033
10 10 1.37193941 -0.235659079 1.37193941 -0.235659079
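To adapt the same pattern to the .txt files described in the question, a sketch (reusing the skip and nrows values from the question's own code; the path and pattern are my guesses):
##read each file, skipping the unwanted header text at the top
filesToProcess <- dir(path = "/Users/tan/Desktop/test", pattern = "\\.txt$", full.names = TRUE)
listOfFiles <- lapply(filesToProcess, function(f) read.table(f, skip = 264, nrows = 932))
##keep the shared first column and the fifth column of each file
listOfFiles <- lapply(listOfFiles, function(z) z[c("V1", "V5")])
##merge on the common first column, then label the columns (letters covers up to 26 files)
out <- Reduce(function(x, y) merge(x, y, by = "V1"), listOfFiles)
colnames(out) <- c("Col1", paste0("Col5", letters[seq_along(listOfFiles)]))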
