I have a number of .txt files containing comma-separated data. There are no headers. Each contains the same information, but for a different year: the name, the gender and the number of names.
I can read them all in and rbind them, but I lose the year information, because the year appears only in the file name: y1920.txt, y1995.txt, y2002.txt and so on.
I am very new to R.
To rbind them, I used do.call(rbind, file), where file is the list of data.frames.
Plyr has a nice workflow for this, assuming your files are all in the current working directory:
library(plyr)
years <- ldply(list.files(pattern = "y\\d{4}\\.txt"),
               function(file) {
                 data <- read.csv(file, header = FALSE)
                 # strip the leading "y" and the ".txt" extension to recover the year
                 data$date <- gsub("y", "", gsub("\\.txt", "", file))
                 data
               })
If you want to specify your files instead, e.g. files <- c("y1995.txt", "y1996.txt"), you can replace the first argument to ldply (list.files(...)) with files instead.
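For comparison, the same idea can be sketched in base R without plyr. This is only an illustration, not the answerer's code: the temporary directory, the two sample files and the column names `name`, `gender` and `count` are assumptions made up so that the example runs on its own.

```r
# Create two tiny sample files in a temporary directory (made-up data).
tmp <- tempfile("names_")
dir.create(tmp)
writeLines(c("Mary,F,100", "John,M,90"), file.path(tmp, "y1920.txt"))
writeLines("Emma,F,80", file.path(tmp, "y1995.txt"))

# full.names = TRUE returns complete paths, so no setwd() is needed.
files <- list.files(tmp, pattern = "^y[0-9]{4}\\.txt$", full.names = TRUE)

read_year <- function(path) {
  # colClasses stops read.csv() from turning a column of "F" into logical FALSE.
  d <- read.csv(path, header = FALSE,
                col.names = c("name", "gender", "count"),
                colClasses = c("character", "character", "integer"))
  # Pull the four digits out of the file name, e.g. "y1920.txt" -> 1920.
  d$year <- as.integer(sub("^y([0-9]{4})\\.txt$", "\\1", basename(path)))
  d
}

all_years <- do.call(rbind, lapply(files, read_year))
```

Setting `colClasses` is a deliberate choice here: a file whose gender column contains only `F` would otherwise be read as logical.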
Related
I am trying to write a simple for loop to read in a number of .csv files. I have looked at list.files(pattern="data.*csv") which I do not think solves my problem.
I have a path to the data which looks like the following:
read.csv("C:/Users/user/Desktop/data/Year1/beer/beer.csv")
I have many years' worth of data, and I am trying to write something which changes Year1 to (for i in 1:15)...
Secondly, I have many products, and for now I am only interested in importing data for all years for the beer product. So I am trying to create a separate vector of products, i.e. products <- c("beer", "bread", "milk"), which I can load in at a later stage.
The product folders all have the same layout, so milk would be at C:/Users/user/Desktop/data/Year1/milk/milk.csv. The file names are also the same across all years, so milk.csv in year 1 is also called milk.csv in year 7, for example.
I can paste what I have currently
Leveraging the data that you previously posted in Loading Multiple Files into R at the same time with similar file names, here is one way to subset the result of list.files() or dir() for specific products.
We will subset the list to those containing products beer or milk.
aFileList <- c("Year1/beer/beer.csv",
"Year1/blades/blades.csv",
"Year1/carbbev/carbbev.csv",
"Year1/cigets/cigets.csv",
"Year1/mayo/mayo.csv",
"Year1/milk/milk.csv",
"Year1/mustketc/mustketc.csv",
"Year2/beer/beer.csv",
"Year2/blades/blades.csv",
"Year2/carbbev/carbbev.csv",
"Year2/cigets/cigets.csv",
"Year2/mayo/mayo.csv",
"Year2/milk/milk.csv",
"Year2/mustketc/mustketc.csv")
aFileList[grep("beer|milk",aFileList)]
The grep() function returns a vector of index numbers for elements of the input vector that contain the tokens requested in the regular expression that is the first argument to grep(). This is used to subset the original vector of file names.
...and the output:
> aFileList[grep("beer|milk",aFileList)]
[1] "Year1/beer/beer.csv" "Year1/milk/milk.csv" "Year2/beer/beer.csv" "Year2/milk/milk.csv"
>
If you use this technique, then you can use lapply() to read the files, per my answer to Loading Multiple Files into R at the same time with similar file names, eliminating the need for a for() loop.
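The pieces above could be combined roughly as follows. This is a sketch under assumptions: the `Year<i>/<product>/<product>.csv` layout is taken from the question, while the temporary root folder, the two-year range and the sample contents are invented for the demonstration.

```r
# Build a small fake directory tree so the example is self-contained.
root <- tempfile("data_")
for (yr in 1:2) {
  for (p in c("beer", "milk", "mayo")) {
    dir.create(file.path(root, paste0("Year", yr), p), recursive = TRUE)
    write.csv(data.frame(sales = yr * 10),
              file.path(root, paste0("Year", yr), p, paste0(p, ".csv")),
              row.names = FALSE)
  }
}

# recursive = TRUE walks the Year*/product sub-folders.
aFileList <- list.files(root, pattern = "\\.csv$", recursive = TRUE,
                        full.names = TRUE)

# Keep only the products of interest. Matching on basename() avoids
# accidental hits in the directory part of the path.
products <- c("beer", "milk")
wanted <- aFileList[grep(paste(products, collapse = "|"), basename(aFileList))]

# Read the selected files without a for() loop.
dataList <- lapply(wanted, read.csv)
names(dataList) <- wanted
```

Naming the list elements after their paths keeps track of which year and product each data frame came from.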
There are so many ways to merge multiple CSV files in a folder into one. Here are a few thoughts...
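# Option 1
setwd("C:/your_path_here/CSV Files/")
fnames <- list.files()
csv <- lapply(fnames, read.csv)
result <- do.call(rbind, csv)
# Option 2: same idea in one call (setwd() returns the previous directory, so keep the path in its own variable)
filedir <- "C:/your_path_here"
setwd(filedir)
file_names <- dir(filedir)
your_data_frame <- do.call(rbind, lapply(file_names, read.csv))
# Option 3: skip the header line of each file and read the rest as data
filedir <- "C:/your_path_here"
setwd(filedir)
file_names <- dir(filedir)
your_data_frame <- do.call(rbind, lapply(file_names, read.csv, skip = 1, header = FALSE))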
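To copy every file of one type from a folder and its subfolders to a single destination, try the following from the Windows command prompt:
xcopy *.ext destination /s
where ext identifies the type of file you want to copy, and destination is where you want it copied to. For instance, to copy all of your *.docx files to D:\alldocx, type xcopy *.docx d:\alldocx /s. Note that /s recreates the source subfolder structure under the destination rather than flattening it.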
I have 500 .csv files with data that looks like:
sample data
I want to extract one cell (e.g. B4, or 0.477) from each csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this
library(readr)  # provides read_lines() and write_lines()
# store the paths of the csv files as a character vector
all.fi <- list.files("/path/to/csvfiles", pattern = "\\.csv$", full.names = TRUE)
ans <- sapply(all.fi, function(i) {
  eachline <- read_lines(i, skip = 3, n_max = 1)  # read only the 4th line of the file
  strsplit(eachline, ",")[[1]][2]  # split on commas and keep the 2nd field (column B)
})
write_lines(ans, "/path/to/output.csv")
I cannot add a comment, so I will write my comment here.
Since your data is very large and difficult to load file by file, try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For the second part, try this:
You can save your data as a data.frame (as suggested in the comment by @Bruno Zamengo) and then use the select and merge functions in R. With these you can select all the values you need and then combine them into a single csv file. I used this idea in my project. Do not forget to use lapply.
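The data-frame route described above might look like the following sketch, using plain indexing rather than select and merge. It assumes every file has at least four rows and two columns and no header row; the temporary folder, the file contents and the output name `combined.csv` are invented for the example.

```r
# Placeholder input: three small header-less CSV files.
in_dir <- tempfile("cells_")
dir.create(in_dir)
for (k in 1:3) {
  m <- matrix(k * (1:8), nrow = 4)  # 4 rows x 2 columns, filled column by column
  write.table(m, file.path(in_dir, sprintf("f%d.csv", k)),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file as a data frame and keep row 4, column 2 (cell B4).
b4 <- sapply(files, function(f) read.csv(f, header = FALSE)[4, 2])

# One row per input file, then write the combined result.
out <- data.frame(file = basename(files), value = unname(b4))
write.csv(out, file.path(in_dir, "combined.csv"), row.names = FALSE)
```

If the real files do have a header row, the row index would need to shift by one.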
I am trying to make a function that outputs a dataframe from 8 different CSV files. They all have the same variables and the same sort of data. The only difference between them is the year. I have tried to write out the function, but I can't seem to make it work. I am thinking lapply would work, but I am not sure how to incorporate it.
These are the instructions:
Write a function named 'air' that takes a 'year' argument and returns a data.frame containing that data for that year, suppressing the automatic conversion to factors.
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
files <- list.files(path=path, pattern="*.csv")
for(y in files)
air <- function(year){
if (!exists(""))
}
}
If the filenames of each file varied, you might need to use list.files and search through the filenames to identify one matching the year. But with a fixed filename scheme, all you need to do is insert the year at the appropriate point in the filename:
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
year <- 2013
file_path <- paste0(path, "ad_viz_plotval_data-", year, ".csv")
I have left out the details of turning this into a function that takes the year as an argument, as I suspect this might be a homework question.
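For the other case mentioned above, where the names do not follow one fixed scheme, a lookup by year could be sketched like this. The file names below are hypothetical and only stand in for the result of `list.files()`.

```r
# Hypothetical file names standing in for list.files(path, pattern = "\\.csv$").
files <- c("ad_viz_plotval_data-2012.csv",
           "airdata_2013_final.csv",
           "plotval2014.csv")

year <- 2013
# fixed = TRUE treats the year as a literal string, not a regular expression.
hit <- files[grepl(as.character(year), files, fixed = TRUE)]
```

Checking that `hit` has exactly one element before reading it would guard against missing or ambiguous matches.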
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or fewer, as some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern = "\\.txt$")
data <- do.call(rbind, lapply(all_files, function(x) {
  a <- read.table(x, header = TRUE)
  # overwrite the ID with the file name, minus the ".txt" extension
  a$ID_Column <- tools::file_path_sans_ext(x)
  a
}))
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern = "\\.csv$")[10:25]  # [10:25] is a placeholder; use your function's id argument here
df_list <- lapply(X = csvFiles, read.csv, header = TRUE)
names(df_list) <- csvFiles  # OPTIONAL: prefixes the row names of the combined data frame with the file name
df <- do.call("rbind", df_list)
mean(df[, "columnName"])
These code snippets should be straightforward to adapt and incorporate into your routine.
You can aggregate your csv files into one big table like this :
bigtable <- NULL  # rbind(NULL, df) returns df, so this starts the accumulation
for (i in 100:250) {
  infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep = "")
  newtable <- read.csv(infile)
  # add the file number as a column so rows can be traced back after aggregation
  newtable <- cbind(newtable, file_id = rep(i, dim(newtable)[1]))
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099 because of the leading zeros in their names. You'll have to handle those differently from the others, but it's fixable with a little extra treatment.
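One way to handle the leading zeros, as a sketch: `sprintf()` with the `%03d` format pads whole numbers to three digits, so the same line works for every id. The `id` values below are just examples.

```r
id <- c(1, 25, 100)  # example ids, as they might be passed in by the user

# "%03d" pads each number with zeros to a width of three digits.
file_names <- sprintf("%03d.csv", id)
file_names
# [1] "001.csv" "025.csv" "100.csv"
```

The resulting names can then be passed to `file.path()` together with the directory before calling `read.csv()`.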
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE). Note that plain paste0(id, ".csv") would miss the zero-padded names such as 010.csv.
They should also teach you to never use <<-.