Reading a specific column of multiple files in R

I have used the following code to read multiple .csv files in R:
Assembly<-t(read.table("E:\\test\\exp1.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly","f"))[1:4416,"Assembly",drop=FALSE])
Top1<-t(read.table("E:\\test\\exp2.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top1","f"))[1:4416,"Top1",drop=FALSE])
Top3<-t(read.table("E:\\test\\exp3.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top3","f"))[1:4416,"Top3",drop=FALSE])
Top11<-t(read.table("E:\\test\\exp4.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top11","f"))[1:4416,"Top11",drop=FALSE])
Assembly1<-t(read.table("E:\\test\\exp5.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly1","f"))[1:4416,"Assembly1",drop=FALSE])
Area<-t(read.table("E:\\test\\exp6.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Area","f"))[1:4416,"Area",drop=FALSE])
data<-rbind(Assembly,Top1,Top3,Top11,Assembly1,Area)
All of the data is in the folder "test" on the E drive. Is there a simpler way in R to read multiple .csv files, with a couple of lines of code or some sort of function call, that replaces the code above?

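(Untested code; no working example available.) Use the list.files function to generate the file names (with full.names = TRUE so the paths resolve from any working directory), then pass colClasses to read.csv to throw away the first 4 columns (and since that vector is recycled, you will also throw away the 6th column). Because the files are pipe-delimited and have no header row, sep and header are passed along as well:
lapply(list.files("E:\\test\\", pattern = "^exp[1-6]", full.names = TRUE), read.csv,
sep = "|", header = FALSE, colClasses = c(rep("NULL", 4), "numeric"), nrows = 4416)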
If you want this to be returned as a dataframe, then wrap data.frame around it.
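A minimal, self-contained sketch of the same idea, using base R only. The temporary folder, the file contents, and the helper name read_fifth are assumptions made up for illustration; they stand in for E:\test and the real exp*.csv files. Each file contributes its fifth column as one row of the result, which mirrors the t(...) and rbind(...) calls in the question.

```r
# Hypothetical demo data: three small pipe-delimited files in a temp folder
demo_dir <- file.path(tempdir(), "colread_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  df <- data.frame(a = 1:4, b = 1:4, c = 1:4, d = 1:4,
                   value = i * (1:4), f = 1:4)
  write.table(df, file.path(demo_dir, paste0("exp", i, ".csv")),
              sep = "|", row.names = FALSE, col.names = FALSE)
}

# Full paths, so the code does not depend on the working directory
files <- list.files(demo_dir, pattern = "^exp[0-9]+\\.csv$", full.names = TRUE)

# Hypothetical helper: keep only column 5; "NULL" in colClasses skips a column
read_fifth <- function(path, n_rows = 4) {
  read.table(path, sep = "|", header = FALSE, nrows = n_rows,
             colClasses = c(rep("NULL", 4), "numeric", "NULL"))[[1]]
}

# One row per file, one column per observation
result <- do.call(rbind, lapply(files, read_fifth))
rownames(result) <- sub("\\.csv$", "", basename(files))
```

Naming the rows after the files keeps track of which file each row came from, instead of relying on the order of separate variables.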

Related

Picking last row data only from 2000 csv in the same directory and make single dataframe through R

Using R, I want to pick only the last row from each of over 2000 csv files in the same directory
and combine them into a single dataframe.
Directory = "C:\data"
File name, for example '123456_p' (6-digit number)
Each csv has a different number of rows, but the same number of columns (10 columns)
I know the tail and list functions, but with over 2000 dataframes, entering them manually is a waste of time.
Is there any way to do this with a loop in R?
As always, I really appreciate your help and support.
There are four things you need to do here:
Get all the filenames we want to read in
Read each in and get the last row
Loop through them
Bind them all together
There are many options for each of these steps, but let's use purrr for the looping and binding, and base-R for the rest.
Get all the filenames we want to read in
You can do this with the list.files() function.
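filelist = list.files(pattern = '\\.csv$')
will generate a vector of filenames for all CSV files in the working directory. The pattern is a regular expression, so the dot is escaped and the match is anchored to the end of the name. Edit as appropriate to specify the pattern further or target a different directory; when targeting another directory, also set full.names = TRUE so the returned paths can be passed straight to read.csv().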
Read each in and get the last row
The read.csv() function can read in each file (if you want it to go faster, use data.table::fread() instead), and as you mentioned tail() can get the last row. If you build a function out of this it will be easier to loop over, or change the process if it turns out you need another step of cleaning.
read_one_file = function(x) {
tail(read.csv(x), 1)
}
Loop through them
Bind them all together
You can do both of these steps at once with map_df() in the purrr package.
library(purrr)
final_data = map_df(filelist, read_one_file)
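For comparison, the four steps can also be sketched in base R alone, without purrr. The temporary folder, the two sample files, and the source_file column are assumptions added for illustration; they are not part of the original question.

```r
# Hypothetical demo data: two CSV files with different numbers of rows
demo_dir <- file.path(tempdir(), "lastrow_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)),
          file.path(demo_dir, "123456_p.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:5, value = c(1, 2, 3, 4, 5)),
          file.path(demo_dir, "654321_p.csv"), row.names = FALSE)

# Step 1: collect the file names (full paths)
filelist <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: read one file and keep only its last row
read_last_row <- function(path) {
  last <- tail(read.csv(path), 1)
  last$source_file <- basename(path)  # remember where the row came from
  last
}

# Steps 3 and 4: loop over the files and bind the rows together
final_data <- do.call(rbind, lapply(filelist, read_last_row))
rownames(final_data) <- NULL
```

Recording the file name in each row is optional, but with 2000 files it makes it much easier to trace a suspicious value back to its source.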

Loading multiple csv files at once and keep the column names unchanged

I have a folder containing different csv files. Below is the picture showing the csv files. I would like to import all of them at once and name them in one go. Also, I would like to keep the column names unchanged.
Here is what I tried:
#Loading the data
filenames <- list.files(path="C:/Users/Juste/Desktop/Customs Data",
pattern="Imports 201+.*csv")
filelist <- lapply(filenames, read.csv)
#assigning names to data.frames
names(filelist) <- paste0("Imports_201",2:length(filelist))
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
When I tried this, it only imported the first five csv files and left out “Imports 2017_anonymised”. The column names also change format: for example, column “Best country” becomes “Best.country”. How can I import all of the csv files and keep the column names unchanged?
You could try map() from the purrr package and read_csv() from the readr package (note that it is written with an underscore). This way your column names don't get changed.
library(purrr)
library(readr)
map(filenames, read_csv)
or if you automatically want to concatenate the dataframes use
map_df(filenames, read_csv)
Sorry, I can't add comments because I don't currently have enough reputation on here to do so. However, I think your regex might be a little off for the import. Try pattern = "^Imports\\s+201\\d_anonymised\\.csv$".
Regarding the "."s in the column names: by default, R's core data import functions replace spaces with dots, because otherwise you need backticks each time you refer to a column with a space in its name. You could try setting check.names = FALSE in your read.csv() call; this argument is what triggers make.names() to sanitize the column names on import. Type ?make.names to see what it does.
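A small base-R sketch of both suggestions, under the assumption that the files follow the "Imports 201x_anonymised.csv" naming seen in the question. The temporary folder and file contents are invented for the demo. It uses the stricter pattern, keeps the original column names with check.names = FALSE, and names the list elements after the files rather than building the names from a counter.

```r
# Hypothetical demo data: two files whose headers contain spaces
demo_dir <- file.path(tempdir(), "names_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (year in 2012:2013) {
  writeLines(c("Best country,Total value", "Kenya,10", "Uganda,20"),
             file.path(demo_dir, paste0("Imports ", year, "_anonymised.csv")))
}

files <- list.files(demo_dir,
                    pattern = "^Imports\\s+201\\d_anonymised\\.csv$",
                    full.names = TRUE)

# check.names = FALSE stops read.csv() from running make.names() on the header
filelist <- lapply(files, read.csv, check.names = FALSE)

# Name each element after its file, so no file is dropped or mislabelled
names(filelist) <- tools::file_path_sans_ext(basename(files))

# For contrast: the default behaviour replaces spaces with dots
default_names <- names(read.csv(files[1]))
```

Keeping the data frames in a named list (filelist[["Imports 2012_anonymised"]]) is usually easier to work with than assigning each one into the global environment.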

R / openxlsx / Finding the first non-empty cell in Excel file

I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is? Am I right in that these variables would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
require(tidyxl)
colnames <- c("Item","Score","Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)
# Find all cells with character values: return their address (i.e., Cell) and character (i.e., Value)
chars <- x[x$data_type == "character", c("address", "character")]
starting.positions <- unlist(
chars[which(chars$character %in% colnames), "address"]
)
# returns: c(C6, D6, E6)
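The located header cells can then drive the write step. The sketch below is an assumption-laden illustration, not the original poster's code: it builds a throwaway workbook with openxlsx whose header row starts at C6, finds the header cells with tidyxl, and uses the row and col columns of the xlsx_cells() result as startRow and startCol for openxlsx::writeData(). The sheet name, file, and sample values are made up.

```r
library(openxlsx)
library(tidyxl)

# Hypothetical template: a header row that starts at cell C6
template <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
addWorksheet(wb, "Sheet1")
writeData(wb, "Sheet1", data.frame(Item = "x", Score = 0, Mean = 0),
          startCol = 3, startRow = 6)
saveWorkbook(wb, template, overwrite = TRUE)

# Locate the header cells by their text
cells <- xlsx_cells(template)
hits <- cells[cells$data_type == "character" &
                cells$character %in% c("Item", "Score", "Mean"), ]
start_row <- min(hits$row)
start_col <- min(hits$col)

# Write the new data (including its header) starting at the located cell
new_data <- data.frame(Item = c("a", "b"), Score = c(1, 2), Mean = c(1.5, 2.5))
wb <- loadWorkbook(template)
writeData(wb, "Sheet1", new_data, startCol = start_col, startRow = start_row)
saveWorkbook(wb, template, overwrite = TRUE)
```

Using the numeric row and col values avoids having to parse addresses such as "C6" by hand.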

How to convert list of strings to list of objects or list of dataframes in R

I have written a program in R that takes all of the .csv files in a folder and imports them as data frames with the naming convention "main1," "main2," "main3" and so on for each data frame. The number of files in the folder may vary, so I was hoping the convention would make it easier to join the files later by being able to paste together the number of records. I successfully coded a way to find the folder and identify all of the files, as well as the total number of files.
agencyloc <- dirname(file.choose())
setwd(agencyloc)
listagencyfiles <- list.files(pattern = "*.csv")
numagencies <- 1:length(listagencyfiles)
I then created the individual dataframes without issue. I am not including this because it is long and does not relate to my problem. The problem is when I try to rbind these dataframes into one large dataframe, it says "Input to rbindlist must be a list of data.tables." Since there will be varying numbers of files, I can't just hard code this in, it has to be something similar to this. I tried the following, but it creates a list of strings and not a list of objects:
allfiles <- paste0("main", 1:length(numagencies))
However, this outputs a list of strings that can't be used to bind the files. Is there a way to change the data type from character strings to objects so that this will work when executed:
finaltable <- rbindlist(allfiles)
What I am looking for would almost be the opposite of as.character(objectname) if that makes any sense. I need to go from character to object instead of object to character.
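One way to go from names to objects is base R's mget(), which looks up several variables by name and returns them as a named list. The sketch below is an illustration under assumptions: the three small data frames stand in for the real main1, main2, ... objects, and do.call(rbind, ...) is used so the example runs without data.table (a list of data frames like this is also the kind of input rbindlist() expects).

```r
# Hypothetical stand-ins for the data frames created earlier in the script
main1 <- data.frame(agency = "A", records = 10)
main2 <- data.frame(agency = "B", records = 20)
main3 <- data.frame(agency = "C", records = 30)

# A character vector of object names, as in the question
allfiles <- paste0("main", 1:3)

# mget() turns the names into a named list of the objects themselves
tables <- mget(allfiles, envir = environment())

# Bind all data frames in the list into one
finaltable <- do.call(rbind, tables)
rownames(finaltable) <- NULL
```

If the import step can be changed, reading the files straight into a list with lapply(listagencyfiles, read.csv) avoids creating the numbered variables in the first place.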

Importing a .csv with headers that show up as first row of data

I'm importing a csv file into R. I read a post here that said in order to get R to treat the first row of data as headers I needed to include the call header=TRUE.
I'm using the import function for RStudio and there is a Code Preview section in the bottom right. The default is:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv")
View(existing_data)
I've tried placing header=TRUE in the following places:
read_csv(header=TRUE, "C:/Users...)
existing_data.csv", header=TRUE
after 2/existing_data.csv")
Would anyone be able to point me in the right direction?
You should use col_names instead of header. Try this:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv", col_names = TRUE)
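There are two different functions for reading csv files (actually far more than two): read.csv from the utils package and read_csv from the readr package. The first takes a header argument and the second takes col_names. Note that col_names = TRUE is already the default for read_csv, so if the first row still shows up as data, the cause is likely elsewhere in the file.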
You could also try fread function from data.table package. It may be the fastest of all.
Good luck!
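To see what the header setting actually changes, here is a small base-R sketch with read.csv(); the temporary file and its contents are invented for the demo. readr's col_names argument plays the analogous role for read_csv(), but is not run here.

```r
# Hypothetical demo file with one header line and two data rows
demo_file <- tempfile(fileext = ".csv")
writeLines(c("product,price", "apple,1.5", "pear,2"), demo_file)

# header = TRUE: the first line becomes the column names
with_header <- read.csv(demo_file, header = TRUE)

# header = FALSE: the first line is treated as data, columns are named V1, V2
without_header <- read.csv(demo_file, header = FALSE)
```

With header = FALSE the price column is read as character, because the word "price" sits in the same column as the numbers, which is a quick way to recognize a header that was read as data.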
It looks like there is one variable name that is correctly identified as a variable name (notice your first column). I would guess that your first row only contains the variable "Existing Product List", and that your other variable names are actually contained in the second row. Open the file in Excel or LibreOffice Calc to confirm.
If it is indeed the case that all of the variable names you've listed (including "Existing Product List") are in the first row, then you're in the same boat as me. In my case, the first row contains all of my variables, however they appear as both variable names and the first row of observations. Turns out the encoding is messed up (which could also be your problem), so my solution was simply to remove the first row.
library(readr)
mydat = read_csv("my-file-path-&-name.csv")
mydat = mydat[-1, ]
