I have 3 text files, each of which has 14 similar columns. I want to first read these 3 files (data frames) and then combine them into one data frame. The following is what I have tried after finding some help on the R mailing list:
file_name <- list.files(pattern='sEMA*') # CREATING A LIST OF FILE NAMES OF FILES HAVING 'sEMA' IN THEIR NAMES
NGSim <- lapply (file_name, read.csv, sep=' ', header=F, strip.white=T) # READING ALL THE TEXT FILES
This code reads all the files in one go but does not combine them into one data frame. I have tried data.frame(NGSim), but R gives an error: cannot allocate vector of size 4.2 Mb. How can I combine the files into one single data frame?
Like this:
do.call(rbind, NGSim)
or, with plyr:
library(plyr)
rbind.fill(NGSim)
or:
ldply(NGSim)
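A note on the difference between these (my addition, with hypothetical data frames df1 and df2): do.call(rbind, ...) requires every data frame to have exactly the same columns, while rbind.fill() pads missing columns with NA:
library(plyr)
df1 <- data.frame(a = 1:2, b = 3:4)
df2 <- data.frame(a = 5:6) # no column 'b'
# do.call(rbind, list(df1, df2)) # fails: numbers of columns do not match
rbind.fill(df1, df2) # works: df2's missing 'b' is filled with NA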
If file size is an issue, you may want to use the data.table functions instead of less efficient base functions like read.csv():
library(data.table)
NGSim <- data.frame(rbindlist(lapply(list.files(pattern='sEMA'), fread))) # note: list.files() patterns are regular expressions, so 'sEMA', not the glob 'sEMA*'
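If you also need to keep track of which file each row came from, rbindlist() has an idcol argument (a sketch, assuming the same sEMA files):
library(data.table)
files <- list.files(pattern='sEMA')
NGSim <- rbindlist(lapply(files, fread), idcol="file_id") # file_id is the index into 'files'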
I have 500 .csv files with data that looks like:
[sample data]
I want to extract one cell (e.g. B4, the value 0.477) per csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this:
all.fi <- list.files("/path/to/csvfiles", pattern="\\.csv$", full.names=TRUE) # store names of csv files in path as a character vector
library(readr) # package for read_lines and write_lines
ans <- sapply(all.fi, function(i) {
  fourth <- read_lines(i, n_max=4)[4] # read the first 4 lines of the file and keep the 4th
  unlist(strsplit(fourth, ","))[2] # split the line on commas, then extract the 2nd field (column B)
})
write_lines(ans, "/path/to/output.csv")
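If you would rather avoid the readr dependency, a base-R sketch of the same idea (assuming the target value is always in row 4, column 2):
ans <- sapply(all.fi, function(i) read.csv(i, header=FALSE, skip=3, nrows=1)[[2]])
write.csv(data.frame(file=all.fi, value=ans), "/path/to/output.csv", row.names=FALSE)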
I cannot add a comment, so I will write my comment here.
Since your data is very large and difficult to load file by file, try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For the second part, try this:
You can save your data as a data.frame (as in the comment by @Bruno Zamengo) and then use the select and merge functions in R. Then you can easily combine everything into a single csv file. With select and merge you can pick out all the values you need and then combine them. I used this idea in my project. Do not forget to use lapply.
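For what it's worth, a minimal sketch of that workflow (the path is a placeholder, and the merge keys are assumed to be shared column names; adjust to your data):
files <- list.files("/path/to/csvfiles", pattern="\\.csv$", full.names=TRUE)
dfs <- lapply(files, read.csv) # import each file
combined <- Reduce(function(x, y) merge(x, y, all=TRUE), dfs) # merge them pairwise
write.csv(combined, "combined.csv", row.names=FALSE)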
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or fewer; some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern="\\.txt$")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern="\\.txt$")
data <- do.call(rbind, lapply(all_files, function(x) {
  a <- read.table(x, header=TRUE)
  a$ID_Column <- sub("\\.txt$", "", x) # overwrite the (possibly wrong) IDs with the filename, minus the extension
  return(a)
}))
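An equivalent approach with data.table (my suggestion, not part of the answer above): rbindlist() can record the source file via its idcol argument, which you can then map back to the filename:
library(data.table)
data <- rbindlist(lapply(all_files, read.table, header=TRUE), idcol="src")
data$ID_Column <- sub("\\.txt$", "", all_files[data$src]) # replace the IDs with the (correct) filenames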
I am using sapply(tk_choose.files) to produce an interactive window where I can choose which .csv files (multiple) to import. I then do some basic data manipulation so that the mean of one particular column can be plotted using ggplot.
So far my code looks something like this:
tfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose T files (hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
rfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose R files (hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
I have then calculated the mean of a particular column for both tfiles and rfiles so that I could plot 100-tfiles-rfiles.
While this is working fine for one set of data, I would like to now import more sets of data, preferably also using sapply(tk_choose.files). Essentially I need to get t/rfiles1, t/rfiles2...and repeat the data manipulation process after that, so that I could get a plot of multiple sets of data. I have no idea how to do this without having to copy and paste my code!
Sorry if this is a stupid question, I am very new to R so I am really stuck, your help is greatly appreciated!
Assuming that the files in the working directory are as follows:
all.files<-list.files(pattern="\\.csv")
all.files
[1] "R01.csv" "R02.csv" "R03.csv" "R04.csv" "T01.csv" "T02.csv" "T03.csv" "T04.csv"
And suppose you wish tfiles1 to be the merged data of T01 and T02, and tfiles2 the merged data of T03 and T04:
library(plyr) # for ldply()
t.files <- grep("T", all.files, value=TRUE) # avoid naming a variable T; T is shorthand for TRUE
t.files
[1] "T01.csv" "T02.csv" "T03.csv" "T04.csv"
t.list <- list(t.files[1:2], t.files[3:4])
all.T <- lapply(t.list, function(x) ldply(x, read.csv))
for (i in 1:length(all.T)) assign(paste0("tfiles", i), all.T[[i]]) # this will produce tfiles1 and tfiles2 in your R environment
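A small follow-up (my preference rather than part of the answer): instead of assign(), you can name the list elements and keep everything in one object, which is easier to iterate over later:
names(all.T) <- paste0("tfiles", seq_along(all.T))
all.T$tfiles1 # same data as the tfiles1 created by assign() above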
I have a number of .txt files, with the data comma separated. There are no headers. Each contains the same information, but by different years: the name, the gender and the number of names.
I can read them all in in one rbind okay, but I lose the year information - the year is contained only in the file name... y1920.txt, y1995.txt, y2002.txt and so on.
I am very new to R.
To rbind them, I used do.call(rbind, file), where file is the list of data frames.
plyr has a nice workflow for this, assuming your files are all in the current working directory:
library(plyr)
years <- ldply(list.files(pattern="y\\d{4}\\.txt"),
               function(file) {
                 data <- read.csv(file, header=FALSE)
                 data$date <- gsub("y", "", gsub("\\.txt", "", file)) # strip the "y" prefix and ".txt" suffix, leaving the year
                 data
               })
If you want to specify your files instead, e.g. files <- c("y1995.txt", "y1996.txt"), you can replace the first argument to ldply (the list.files(...) call) with files.
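As a minor variant (same result, one regular expression instead of two gsub() calls), the date line inside the function could be written as:
data$date <- gsub("\\D", "", file) # keep only the digits, e.g. "y1995.txt" -> "1995"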
I'm having trouble finding the documentation to answer my seemingly straightforward question.
For simplicity's sake, here is a list of 3 elements of differing lengths (standing in for data frames with differing numbers of rows).
mylist <- list()
mylist[[1]] <- c(1:10)
mylist[[2]] <- c(2:15)
mylist[[3]] <- c(20:54)
I'd like to write each element of the list to a separate sheet in an excel workbook, which I presumably can do with WriteXLS (?).
When I call
WriteXLS("mylist", ExcelFileName="mylist.xls")
Error in WriteXLS("mylist", ExcelFileName = "mylist.xls") :
One or more of the objects named in 'x' is not a data frame or does not exist
... does WriteXLS not support lists? If not, how do I get around this efficiently? I will be writing files as part of a large simulation.
I always create temporary data frames (rectangular arrays) ...
sheet0 <- data.frame(array.no.1) # usually a set of descriptions of the sheets
sheet1 <- data.frame(array.no.2) # the data
sheet2 <- data.frame(array.no.3) # more information
myxls <- c(sheet0="Index", sheet1="Results", sheet2="Notes")
WriteXLS(names(myxls), ExcelFileName="my.xls", SheetNames=myxls)
The documentation says "data.frames" so that led me to this solution.
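Applied to the mylist example above, a sketch of that workaround (the sheet names here are arbitrary): wrap each list element in a data frame, create one named object per sheet, and pass the vector of names:
library(WriteXLS)
sheet.names <- paste0("sheet", seq_along(mylist))
for (i in seq_along(mylist)) assign(sheet.names[i], data.frame(value = mylist[[i]]))
WriteXLS(sheet.names, ExcelFileName="mylist.xls", SheetNames=sheet.names)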