Combining CSV files in R [duplicate] - r

I have multiple CSV files that I have already read into R. Now I want to append them all into one data frame. I tried a few things but got different errors. Can anyone please help me with this?
Try 1:
mydata <- rbind(x1,x2,x3,x4,x5,x6,x7,x8)
where x1, x2, ..., x8 are the CSV files I read into R. The error I get is:
Error 1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
invalid factor level, NA generated
Try 2: Then I tried another way:
mydata1 <- c(x1,x2,x3,x4,x5,x6,x7,x8)
mydata2 <- do.call('rbind',lapply(mydata1,read.table,header=T))
Error 2: in FUN(X[[i]], ...) :
'file' must be a character string or connection
Can anyone please tell me the right way to do this?

To import all files from a single folder at once and bind them by row (assuming each file has the same format):
library(tidyverse)
list.files(path = "location_of/data/folder_you_want/",
  pattern = "\\.csv$",  # pattern is a regular expression, not a glob
  full.names = TRUE) %>%
  map_df(~read_csv(.))
If there is a file that you want to exclude:
list.files(path = "location_of/data/folder_you_want/",
  pattern = "\\.csv$",
  full.names = TRUE) %>%
  .[ !grepl("data/folder/name_of_file_to_remove.csv", .) ] %>%
  map_df(~read_csv(.))

Sample CSV Files
Note
CSV files to be merged here have
- equal number of columns
- same column names
- same order of columns
- number of rows can be different
1st csv file abc.csv
A,B,C,D
1,2,3,4
2,3,4,5
3,4,5,6
1,1,1,1
2,2,2,2
44,44,44,44
4,4,4,4
4,4,4,4
33,33,33,33
11,1,11,1
2nd csv file pqr.csv
A,B,C,D
1,2,3,40
2,3,4,50
3,4,50,60
4,4,4,4
5,5,5,5
6,6,6,6
List FILENAMES of CSV Files
Note
The path below E:/MergeCSV/ has just the files to be merged. No other csv files. So in this path, there are only two csv files, abc.csv and pqr.csv
## List filenames to be merged.
filenames <- list.files(path="E:/MergeCSV/",pattern="\\.csv$")
## Print filenames to be merged
print(filenames)
## [1] "abc.csv" "pqr.csv"
FULL PATH to CSV Files
## Full path to csv filenames
fullpath=file.path("E:/MergeCSV",filenames)
## Print Full Path to the files
print(fullpath)
## [1] "E:/MergeCSV/abc.csv" "E:/MergeCSV/pqr.csv"
MERGE CSV Files
## Merge listed files from the path above
dataset <- do.call("rbind",lapply(filenames,FUN=function(files){ read.csv(files)}))
## Print the merged csv dataset; if it's large, use the `head()` function to get a glimpse of the merged dataset
dataset
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
# 7 4 4 4 4
# 8 4 4 4 4
# 9 33 33 33 33
# 10 11 1 11 1
# 11 1 2 3 40
# 12 2 3 4 50
# 13 3 4 50 60
# 14 4 4 4 4
# 15 5 5 5 5
# 16 6 6 6 6
head(dataset)
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
## Print dimension of merged dataset
dim(dataset)
## [1] 16 4

The accepted answer above generates the error shown in the comments because lapply() is given the bare file names rather than the full paths, so read.csv() looks for the files in the current working directory. Pass fullpath instead to read from the directory of your choice:
dataset <- do.call("rbind",lapply(fullpath,FUN=function(files){ read.csv(files)}))
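As a minimal, self-contained sketch of why the full path matters: list.files() returns bare file names unless full.names = TRUE, and read.csv() resolves bare names against the working directory. The temporary folder and the two small files below are made up for illustration and are not the asker's data.

```r
# Hypothetical demo folder with two small CSV files (illustration only)
csv_dir <- file.path(tempdir(), "merge_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(A = 1:3, B = 4:6), file.path(csv_dir, "abc.csv"), row.names = FALSE)
write.csv(data.frame(A = 7:8, B = 9:10), file.path(csv_dir, "pqr.csv"), row.names = FALSE)

# Bare names: only readable if csv_dir happens to be the working directory
bare_names <- list.files(csv_dir, pattern = "\\.csv$")

# Full paths: readable from any working directory
full_paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the rows into one data frame
merged <- do.call(rbind, lapply(full_paths, read.csv))
dim(merged)
```

Using full.names = TRUE replaces the separate file.path() step and avoids changing the working directory with setwd().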

You can use a combination of lapply() and do.call().
## cd to the csv directory
setwd("mycsvs")
## read in csvs (only files ending in .csv)
csvList <- lapply(list.files("./", pattern = "\\.csv$"), read.csv, stringsAsFactors = FALSE)
## bind them all with do.call
csv <- do.call(rbind, csvList)
You can also use the fread() function from the data.table package together with rbindlist() for a performance boost.
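A hedged sketch of the fread()/rbindlist() variant could look like this. It assumes the data.table package is installed and falls back to base R otherwise. The demo folder and its files are invented for illustration.

```r
# Hypothetical demo folder with two small CSV files (illustration only)
csv_dir <- file.path(tempdir(), "fread_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(A = 1:3, B = 4:6), file.path(csv_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(A = 7:8, B = 9:10), file.path(csv_dir, "two.csv"), row.names = FALSE)

paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

if (requireNamespace("data.table", quietly = TRUE)) {
  # fread() reads each file; rbindlist() stacks the resulting tables
  combined <- data.table::rbindlist(lapply(paths, data.table::fread))
} else {
  # Base R fallback when data.table is not available
  combined <- do.call(rbind, lapply(paths, read.csv))
}
nrow(combined)
```

The speed difference is mostly noticeable with many or large files; for a handful of small files either branch is fine.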


R tidyr regex: extract ordered numbers from character column

Suppose I have a data frame like this
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
It looks like this
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both) I would like to extract 0. Essentially I would like the final result to be like this
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using stuff like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got so lost.
We can use a regex lookahead in str_extract to extract one or more digits (\\d+) that are followed by a space and 'vis' or 'data file(s)' into two columns:
library(dplyr)
library(stringr)
library(tidyr)  # for replace_na()
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 1
#5 0 0
#6 9 28
#7 1 1
In the first case, the pattern matches one or more digits (\\d+) followed by a lookahead ((?=...)) that requires a space and then 'vis'. In the second, it extracts the digits that are followed by a space and 'data file' or 'data files'.
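The same lookahead idea can be sketched in base R without any packages, using regexpr() with perl = TRUE. The helper name count_before is hypothetical, and the input is a shortened version of the question's data.

```r
x <- c("This script outputs 10 visualizations.",
       "This script outputs 5 data files.",
       "This script outputs 9 visualizations and 28 data files.",
       "This script doesn't output any visualizations or data files")

# Hypothetical helper: digits that are directly followed by `lookahead`,
# or 0 when the text contains no such match
count_before <- function(text, lookahead) {
  pattern <- paste0("\\d+(?=", lookahead, ")")
  hit <- regexpr(pattern, text, perl = TRUE)   # -1 where nothing matches
  out <- numeric(length(text))                 # defaults to 0
  out[hit > 0] <- as.numeric(regmatches(text, hit))
  out
}

viz <- count_before(x, " vis")
files <- count_before(x, " data files?")
data.frame(viz, files)
```

Because the lookahead is not part of the match, only the digits are returned, so no further trimming is needed.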
You could use the unglue package to get a readable solution, since you have a limited number of possible patterns, and then replace the NAs with 0:
library(unglue)
patterns <-
c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
"This script outputs {viz} visualization{=s{0,1}}.",
"This script outputs {files} data file{=s{0,1}}.")
res <- unglue_unnest(df, x, patterns, convert = TRUE)
res[is.na(res)] <- 0
res
#> viz files
#> 1 10 0
#> 2 1 0
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1
A base R approach, using a lazy .*? (with perl = TRUE) so that each capture group takes the whole number directly before its keyword:
df$viz <- suppressWarnings(as.numeric(sub(".*?(\\d+) visualization.*", "\\1", df$x, perl = TRUE)))
df$files <- suppressWarnings(as.numeric(sub(".*?(\\d+) data file.*", "\\1", df$x, perl = TRUE)))
# rows without a match come back unchanged, become NA in as.numeric(), and are set to 0 here
df[is.na(df)] <- 0
df
# x viz files
# 1 This script outputs 10 visualizations. 10 0
# 2 This script outputs 1 visualization. 1 0
# 3 This script outputs 5 data files. 0 5
# 4 This script outputs 1 data file. 0 1
# 5 This script doesn't output any visualizations or data files 0 0
# 6 This script outputs 9 visualizations and 28 data files. 9 28
# 7 This script outputs 1 visualization and 1 data file. 1 1

Split a data frame by rows and save as csv

I have a data frame and want to split it by rows, assign the resulting data frames to new variables, and save them as CSV files.
a <- rep(1:5,each=3)
b <-rep(1:3,each=5)
c <- data.frame(a,b)
# a b
# 1 1 1
# 2 1 1
# 3 1 1
# 4 2 1
# 5 2 1
# 6 2 2
# 7 3 2
# 8 3 2
# 9 3 2
# 10 4 2
# 11 4 3
# 12 4 3
# 13 5 3
# 14 5 3
# 15 5 3
I want to split c by column a, i.e. all rows where column a is 1 are split from c, assigned to A, and saved as A.csv. The same goes for B.csv with all rows where column a is 2.
What I can do is
A<-c[c$a%in%1,]
write.csv (A, "A.csv")
B<-c[c$a%in%2,]
write.csv (B, "B.csv")
...
If I have 1000 rows there will be lots of subsets, so I wonder if there is a simple way to do this with a for loop.
The split() function is very useful for splitting a data frame. You can also use lapply() here, which is more concise than a loop.
dfs <- split(c, c$a) # list of dfs
# use numbers as file names
lapply(names(dfs),
function(x){write.csv(dfs[[x]], paste0(x,".csv"),
row.names = FALSE)})
# or use letters (max 26!) as file names
names(dfs) <- LETTERS[1:length(dfs)]
lapply(names(dfs),
function(x){write.csv(dfs[[x]],
file = paste0(x,".csv"),
row.names = FALSE)})
vals <- unique(c$a) # actual group values, which need not be 1, 2, 3, ...
for (i in seq_along(vals)) {
  write.csv(c[c$a == vals[i], ], paste0(LETTERS[i], ".csv"))
}
You should consider, however, what happens if you have more than 26 subsets. What will those files be named?
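One possible way around the 26-letter limit, sketched here under assumptions, is to build the file names from the group values themselves. The "group_" prefix, the zero padding, and the temporary output folder are illustrative choices that are not part of the original answers.

```r
# Toy data with 30 groups, i.e. more than the 26 letters available
df <- data.frame(a = rep(1:30, each = 2), b = seq_len(60))

# Hypothetical output folder inside the session's temp directory
out_dir <- file.path(tempdir(), "split_demo")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per value of column a; list names are the group values
parts <- split(df, df$a)

# Zero-padded names keep the files sorted: group_001.csv, group_002.csv, ...
file_paths <- file.path(out_dir, sprintf("group_%03d.csv", as.integer(names(parts))))

for (i in seq_along(parts)) {
  write.csv(parts[[i]], file_paths[i], row.names = FALSE)
}
```

Naming files after the group value also makes it possible to tell which subset a file holds without opening it.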

R Error: readColumns ...: attempt to apply non-function in xlsx package

I want to read all the files and load multiple sheets from an Excel file into R using the xlsx package.
I have pasted the code below:
filelist <- list.files(pattern = "\\.xls") # list all the xlsx files
library(xlsx)
allxlsx.files <- list() # create a list to populate with xlsx data
for (file in filelist) {
wb <- loadWorkbook(file)
sheets <- getSheets(wb)
sheet <- sheets[['_']] # get sheets with field section reading
res <- readColumns(sheet, 1, 2, 114, 120, colClasses=c("character", "numeric"))
}
traceback()
1: readColumns(sheet, 1, 2, 114, 120, colClasses = c("character",
"numeric")) at #6
Can someone enlighten me how to proceed?
I think you are subsetting the sheets incorrectly: sheets[['_']] looks for a sheet named exactly "_" and returns NULL if there is none.
You can use grep on the sheet names to get all the sheets whose names contain "_".
I have created and used a single xlsx file with hypothetical data having 5 sheets with names as below for demonstration.
> names(sheets)
[1] "Sheet_1" "Sheet2" "Sheet_3" "Sheet4" "sheet_4"
Getting the required sheets can be done using
sheet = sheets[grep("_",names(sheets))]
You can check it by using
> names(sheet)
[1] "Sheet_1" "Sheet_3" "sheet_4"
So your final code will look like the following:
filelist <- "sheeetLoadTrial1.xlsx" # a single xlsx file
library(xlsx)
allxlsx.sheets <- list() # create a list to populate with xlsx sheet data
for (file in filelist) {
  wb <- loadWorkbook(file)
  sheets <- getSheets(wb)
  sheet <- sheets[grep("_", names(sheets))] # keep sheets whose names contain "_"
  for (i in seq_along(sheet)) {
    # rows 1 to 8 of columns 1 and 2, without a header row
    res <- readColumns(sheet[[i]], 1, 2, 1, 8, header = FALSE)
    allxlsx.sheets[[i]] <- res
  }
  names(allxlsx.sheets) <- names(sheet)
}
After this, your final required list will be:
> allxlsx.sheets
$Sheet_1
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
$Sheet_3
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
$sheet_4
X1 X2
1 1 2
2 2 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
For more than one file, you can append allxlsx.sheets to the allxlsx.files list at the end of each pass through the outer loop.

Read csv file in R

I am trying to read a .csv file in R.
My file looks like this-
A,B,C,D,E
1,2,3,4,5
6,7,8,9,10
.
.
.
number of rows.
All are strings. First line is the header.
I am trying to read the file using-
mydata=read.csv("devices.csv",sep=",",header = TRUE)
But mydata ends up with X observations of 1 variable, where X is the number of rows: each whole row becomes a single column.
I want every header field in a separate column, and I cannot work out what the problem is.
If each line is wrapped in quotes ("), the code in the OP's post reads the whole line as a single column:
str(read.csv("devices.csv",sep=",",header = TRUE))
#'data.frame': 2 obs. of 1 variable:
#$ A.B.C.D.E: Factor w/ 2 levels "1,2,3,4,5","6,7,8,9,10": 1 2
We could read the data with readLines, remove the " with gsub, and then use read.csv
read.csv(text=gsub('"', '', readLines('devices.csv')), sep=",", header=TRUE)
# A B C D E
#1 1 2 3 4 5
#2 6 7 8 9 10
Another option, if we are using Linux, would be to remove the quotes with awk and pipe the result into read.csv
read.csv(pipe("awk 'gsub(/\"/,\"\",$1)' devices.csv"))
# A B C D E
#1 1 2 3 4 5
#2 6 7 8 9 10
Or
library(data.table)
fread(cmd = "awk 'gsub(/\"/,\"\",$1)' devices.csv") # recent data.table versions expect shell commands via cmd=
# A B C D E
#1: 1 2 3 4 5
#2: 6 7 8 9 10
data
v1 <- c("A,B,C,D,E", "1,2,3,4,5", "6,7,8,9,10")
write.table(v1, file='devices.csv', row.names=FALSE, col.names=FALSE)
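To see the effect of the quotes in isolation, here is a small sketch that writes a quoted file to a temporary path (a stand-in for devices.csv, chosen for illustration) and reads it both ways.

```r
# Temporary stand-in for devices.csv in which every line is wrapped in quotes
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"A,B,C,D,E"', '"1,2,3,4,5"', '"6,7,8,9,10"'), csv_path)

# Read as-is: each quoted line counts as one field, so there is one column
quoted <- read.csv(csv_path)
ncol(quoted)

# Strip the quotes first, then parse the cleaned text
cleaned <- read.csv(text = gsub('"', "", readLines(csv_path), fixed = TRUE))
ncol(cleaned)
names(cleaned)
```

Opening the file in a plain text editor, or calling readLines() on it, is a quick way to check whether quotes like these are actually present.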
The code you've written should work unless your CSV file is corrupted.
Try giving the absolute path to devices.csv.
To test: data[1] will give you the first column.
Or you can try it this way:
data = read.table(text=gsub('"', '', readLines('//fullpath to devices.csv//')), sep=",", header=TRUE)
Good Luck!

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot reach the correct column by position, e.g. data[[152]], when I want to work with a specific question (some questions are filtered out of the data entirely because they are long-answer comments), but I'd like to be able to access them with data$152. Additionally, approximately half the column names in my data have loaded such that class(data$152) is NULL but class(data[[152]]) is integer (and if I rename the data[[152]] column, class(data$152) correctly shows integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
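If the goal is to name each column after its question number rather than its position, the number can be taken from the existing header. This is only a sketch: it assumes headers of the form Q001, Q002, ... as described in the question, and the toy data frame below is invented.

```r
# Toy data with headers in the Q001 style; Q002 is deliberately missing
dat <- data.frame(Q001 = 1:3, Q003 = 4:6, Q010 = 7:9)

# Question numbers taken from the headers, not from the column positions
q_number <- as.integer(sub("^Q", "", names(dat)))

# Option 1: keep a letter prefix but drop the zero padding: Q1, Q3, Q10
names(dat) <- paste0("Q", q_number)
dat$Q10

# Option 2: purely numeric names, which need backticks or [[ ]] with a string
names(dat) <- as.character(q_number)
dat$`10`
dat[["10"]]
```

Names that start with a letter are usually the more convenient choice, because they work with $ without backticks.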
