I'm new to R and I can't make this work with the information I'm finding.
I have many .txt files in a folder, each of them containing data from one subject. The files have identical columns, but the number of rows for each file varies. In addition, the column headers only start in row 9. What I want to do is
1. import the .txt files into RStudio in one go while skipping the first 8 rows, and
2. merge them all together into one data frame by their columns, so that the final data frame contains the data from all subjects in long format.
I managed to do 1 (I think) using the easycsv package and the following code:
fread_folder(directory = "C:/Users/path/to/my/files",
             extension = "TXT",
             sep = "auto",
             nrows = -1L,
             header = "auto",
             na.strings = "NA",
             stringsAsFactors = FALSE,
             verbose = getOption("datatable.verbose"),
             skip = 8L,
             drop = NULL,
             colClasses = NULL,
             integer64 = getOption("datatable.integer64"), # default: "integer64"
             dec = if (sep != ".") "." else ",",
             check.names = FALSE,
             encoding = "unknown",
             quote = "\"",
             strip.white = TRUE,
             fill = FALSE,
             blank.lines.skip = FALSE,
             key = NULL,
             Names = NULL,
             prefix = NULL,
             showProgress = interactive(),
             data.table = FALSE)
That worked; however, my problem now is that the data frames have been named after the very long path to my files and, obviously, after the .txt files themselves (without the .txt, though). So the names are long and unwieldy and contain characters they probably shouldn't, such as spaces.
So now I'm having trouble merging the data frames into one, because I don't know how to refer to them other than by the default names they were given, how to rename them, or how to specify what they should be called when importing them in the first place.
The code below looks at which files are in your directory, uses those names to fetch each object with get(), and then uses rbindlist() to combine the tables into a single table. Hope that helps. It assumes each .csv or .txt file in the directory has already been pulled into the current environment as a separate data.table.
library(data.table)

for (x in list.files(directory)) {
  # Remove the .txt extension from the filename to get the object name
  if (grepl("\\.txt$", x)) {
    x <- sub("\\.txt$", "", x)
  }
  thisTable <- get(x)  # use "get" to look the string up as a variable
  # now combine into a single data frame
  if (exists("combined")) {
    combined <- rbindlist(list(combined, thisTable))
  } else {
    combined <- thisTable
  }
}
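Alternatively, you can sidestep get() and the unwieldy auto-generated names entirely by reading the files into a named list yourself. A minimal sketch, assuming the same folder and 8 skipped rows as in the question (the "subject" column name is my own choice):
library(data.table)

files <- list.files("C:/Users/path/to/my/files", pattern = "\\.txt$", full.names = TRUE)
tables <- lapply(files, fread, skip = 8)
# use the bare file names (no path, no extension) as list names
names(tables) <- tools::file_path_sans_ext(basename(files))
# idcol adds a column recording which file each row came from
combined <- rbindlist(tables, idcol = "subject")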
The following should work well. However, without sample data or a clearer description of what you want, it's hard to know for certain whether this is what you are looking to accomplish.
# set working directory
setwd("C:/Users/path/to/my/files")
# read in all .txt files, skipping the first 8 rows of each
Data.in <- lapply(list.files(pattern = "\\.txt$"), read.csv, header = TRUE, skip = 8)
# combine all of the tables by row into one data frame
Data.in <- do.call(rbind, Data.in)
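Since the goal is one long-format data set covering all subjects, it can also help to record which file each row came from before stacking. A small sketch along the same lines (the "Subject" column name is my own choice):
files <- list.files(pattern = "\\.txt$")
Data.in <- lapply(files, read.csv, header = TRUE, skip = 8)
# tag every table with its source file, then stack them
Data.in <- Map(function(df, f) { df$Subject <- f; df }, Data.in, files)
Data.in <- do.call(rbind, Data.in)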
Related
I have a problem with one task where I have to load a data set and make sure that missing values are read in properly and that column names are unambiguous.
The format of the .txt file:
At the end, the data set should contain only the country column and the median age.
I tried using read.delim, specifically this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
And when I run it, I get this:
It confuses me that if a country name has multiple words (Turks and Caicos Islands), each word gets assigned to a different column.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs, which should be specified with sep = "\t", and (3) you have no headers, so header = FALSE. Your command should be:
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
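To finish the task from the question (keeping only the country and the median age), you could then subset rawdata by position. This is only a sketch: the indices 1 and 4 below are assumptions, so adjust them to wherever those columns actually sit in your file:
# hypothetical positions: country in column 1, median age in column 4
rawdata <- rawdata[, c(1, 4)]
names(rawdata) <- c("country", "median_age")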
It looks like your delimiter that you are specifying in the sep= argument is telling R to consider spaces as the column delimiter. Looking at your data as a .txt file, there is no apparent delimiter (like commas that you would find in a typical .csv). If you can put the data in a tabular form in something like a .csv or .xlsx file, R is much better at reading that data as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which is what I assume you want.
P.S. you can use read.csv() if you do end up putting the data in that format.
I have 15 data frames that were automatically imported from csv files, each consisting of 3 columns, which is why the column names are the same (V1, V2 and V3). I would like to distinguish them when they get joined together, so I'm asking for an idea of how to rename the columns automatically, e.g. to DataFrameName_V1 or something similar.
I have searched a lot for this, but haven't found any solution yet.
Btw I have this code for the csv import:
file_list <- list.files(pattern = "\\.csv$")
for (i in 1:length(file_list))
{
  assign(file_list[i],
         read.csv(file_list[i], na.strings = c("NA", "NULL", ""), header = FALSE,
                  sep = ",", stringsAsFactors = FALSE, encoding = "UTF-8"))
}
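One way to do this is to keep the data frames in a named list instead of assign()ing them one by one, and then prefix each set of column names with the list name. A sketch, assuming the file name (minus .csv) is the prefix you want:
file_list <- list.files(pattern = "\\.csv$")
df_list <- lapply(file_list, read.csv, na.strings = c("NA", "NULL", ""),
                  header = FALSE, sep = ",", stringsAsFactors = FALSE,
                  encoding = "UTF-8")
names(df_list) <- sub("\\.csv$", "", file_list)

# rename V1..V3 to e.g. "myfile_V1", "myfile_V2", "myfile_V3"
df_list <- Map(function(df, nm) {
  names(df) <- paste(nm, names(df), sep = "_")
  df
}, df_list, names(df_list))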
I want to construct a data frame by reading in a csv file for each day in the month. My daily csv files contain columns of characters, doubles, and integers of the same number of rows. I know the maximum number of rows for any given month and the number of columns remains the same for each csv file. I loop through each day of a month with fileListing, which contains the list of csv file names (say, for January):
output <- matrix(ncol = 18, nrow = 2976)
for (i in 1:length(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, sep = ",",
                 stringsAsFactors = FALSE, row.names = NULL)
  # each df is a data frame with 96 rows and 18 columns
  # now insert the data from the ith date for all its rows, appending as you go
  for (j in 1:18) {
    output[, j] <- df[[j]]
  }
}
Sorry for having revised my question as I figured out part of it (duh), but should I use rbind to progressively insert data at the bottom of the data frame, or is that slow?
Thank you.
BSL
You can read them into a list with lapply, then combine them all at once:
data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE, row.names = NULL)
df <- do.call(rbind.data.frame, data)
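If reading speed becomes an issue with a whole month of files, the same pattern can be written with data.table, which tends to read large csv files faster. A sketch, assuming you are willing to add that dependency:
library(data.table)

# fread is a fast drop-in replacement for read.csv here
df <- rbindlist(lapply(fileListing, fread, header = FALSE))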
First define a master data frame to hold all of the data. Then, as each file is read, append its data onto the master.
masterdf <- data.frame()
for (i in 1:length(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, sep = ",",
                 stringsAsFactors = FALSE, row.names = NULL)
  # each df is a data frame with 96 rows and 18 columns
  masterdf <- rbind(masterdf, df)
}
At the end of the loop, masterdf will contain all of the data. This code can be improved, but for the size of the dataset it should be quick enough.
If the data is fairly small relative to your available memory, just read it all in and don't worry about it. After you have read in all the data and done some cleaning, save the result using save() and have your analysis scripts read that file back in using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
One way to speed up read.csv is to use its nrows and colClasses arguments. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using
colClasses <- sapply(read.csv(file, nrows = 100), class)
then pass the result to the colClasses argument.
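Putting the two hints together, a sketch (the nrows value of 96 comes from the question's statement that each daily file has 96 rows):
# learn the column classes from the first file
colClasses <- sapply(read.csv(fileListing[1], nrows = 100, header = FALSE), class)
# reuse them for every daily file
data <- lapply(fileListing, read.csv,
               header = FALSE, nrows = 96, colClasses = colClasses)
df <- do.call(rbind.data.frame, data)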
If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions to managing memory on the site that cover this topic.
On memory usage tricks:
Tricks to manage the available memory in an R session
On using the garbage collector function:
Forcing garbage collection to run in R with the gc() command
Hi, so I have data in the following format
101,20130826T155649
------------------------------------------------------------------------
3,1,round-0,10552,180,yellow
12002,1,round-1,19502,150,yellow
22452,1,round-2,28957,130,yellow,30457,160,brake,31457,170,red
38657,1,round-3,46662,160,yellow,47912,185,red
and I have been reading them in and cleaning/formatting them with this code
b <- read.table("sid-101-20130826T155649.csv", sep = ",", fill = TRUE,
                col.names = paste("V", 1:18, sep = ""))
b$id <- b[1, 1]
b <- b[-1, ]
b <- b[-1, ]
b$yellow <- b$V6
and so on
There are about 300 files like this, and ideally they would all be compiled without the first two lines, since the first line is just the id and I made a separate column to identify these data. Does anyone know how to read these tables in quickly, clean and format them the way I want, and then compile them into one large file and export it?
You can use lapply to read all the files, clean and format them, and store the resulting data frames in a list. Then use do.call to combine all of the data frames into a single large data frame.
# Get vector of files names to read
files.to.load = list.files(pattern="csv$")
# Read the files
df.list = lapply(files.to.load, function(file) {
df = read.table(file, sep = ',', fill=TRUE, col.names=paste("V", 1:18,sep=""))
... # Cleaning and formatting code goes here
df$file.name = file # In case you need to know which file each row came from
return(df)
})
# Combine into a single data frame
df.combined = do.call(rbind, df.list)
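For the cleaning and formatting step elided above, here is a sketch of what could go in place of the ..., based on the snippet in the question (the column assignments are assumptions about your format):
  df$id <- df[1, 1]     # the first line of each file holds the id
  df <- df[-(1:2), ]    # drop the two non-data lines at the top
  df$yellow <- df$V6    # carry over the column of interest, as in the question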
I would like to write multiple data frames, stored in the list "neighbours_dataframe", to a single CSV file.
I use this line to write the multiple data frames to multiple files:
for (i in 1:vcount(karate)) {
  write.csv(neighbours_dataframe[[i]], file = as.character(V(karate3)$name[i]), row.names = FALSE)
}
If I use this code:
for (i in 1:vcount(karate)) {
  write.csv(neighbours_dataframe[[i]], file = "karate3.csv", row.names = FALSE)
}
this would give me just the last data frame in the CSV file.
I was wondering: how could I get a single CSV file in which the column header of the first data frame is written once, and all the other data frames are appended consecutively?
Thank you in advance.
Two methods; the first is likely to be a little faster if neighbours_dataframe is a long list (though I haven't tested this).
Method 1: Convert the list of data frames to a single data frame first
As suggested by jbaums.
library(dplyr)
# bind_rows() supersedes the rbind_all() originally suggested
neighbours_dataframe_all <- bind_rows(neighbours_dataframe)
write.csv(neighbours_dataframe_all, "karate3.csv", row.names = FALSE)
Method 2: use a loop, appending
As suggested by Neal Fultz.
for (i in seq_along(neighbours_dataframe)) {
  write.table(
    neighbours_dataframe[[i]],
    "karate3.csv",
    append = i > 1,     # overwrite with the first data frame, append the rest
    sep = ",",
    row.names = FALSE,
    col.names = i == 1  # write the column header only once
  )
}