I have a problem with one task where I have to load some data set, and I have to make sure that missing values are read in properly and that column names are unambiguous.
The format of .txt file:
At the end, data set should contain only country column and median age.
I tried using read.delim, precisely this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
And when I run it, I get this:
It confuses me that if country has multiple words (Turks and Caicos Islands) it assigns every word to another column.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs, which should be specified with sep = "\t", and (3) you have no headers, so header = FALSE. Your command should be:
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
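To get from there to a data set with only the country and median age, here is a minimal sketch. The real layout of rawdata_343.txt is not shown, so the sample file is made up: it assumes two free-text lines followed by tab-separated rows of row number, country and median age. The column names country and median_age are my own choice.

```r
# Hypothetical stand-in for rawdata_343.txt (the real layout is an assumption):
# two free-text lines, then tab-separated rows of row number, country, median age.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Median age by country",
  "Source: made-up sample",
  "1\tTurks and Caicos Islands\t34.6",
  "2\tFrance\t41.7",
  "3\tNiger\tNA"
), tmp)

# Tabs as separator keep multi-word country names in a single column.
rawdata <- read.delim(tmp, sep = "\t", header = FALSE, skip = 2,
                      row.names = 1, stringsAsFactors = FALSE,
                      na.strings = c("NA", ""))

# With header = FALSE the columns are called V2 and V3, so give them
# unambiguous names (these names are illustrative).
names(rawdata) <- c("country", "median_age")
```

If the real file marks missing values differently (for example with "-"), that string would have to be added to na.strings.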
Specifying sep = "" tells R to treat any run of whitespace as the column delimiter, which is why multi-word country names are split across columns. Looking at your data as a .txt file, there is no apparent delimiter (like the commas you would find in a typical .csv). If you can put the data in a tabular form in something like a .csv or .xlsx file, R is much better at reading that data as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which is what I assume you want.
P.s. you can use read.csv() if you do end up putting the data in that format.
I am trying to import multiple CSV files in a for loop. Working through the errors the code produced one at a time, I arrived at the code below.
for (E in EDCODES) {
Filename <- paste("$. Data/2. Liabilities/",
E,
sep="")
Framename <- gsub("\\..*",
"",
E)
assign(Framename,
read.csv(Filename,
header = TRUE,
sep = ",",
stringsAsFactors = FALSE,
na.strings = c("\"ND",
"ND,5",
"5\""),
colClasses = c("BAA35" = "double"),
encoding = "UTF-8",
quote = ""))}
First I realized that the code does not always recognize the most important column "BAA35" as numeric, so I added the colClasses argument. Then I realized that the data has multiple versions of "NA", so I added the na.strings argument. The most common NA value is "ND, 5", which contains the separator ",". So if I add the na.strings argument as defined above I get a lot of EOF within quoted string warnings. The others are also versions of "ND, [NUMBER]" or "ND, 4, [YYYY-MM]".
If I then try to treat that issue with the most common recommendation I could find, adding quote = "" I just end up with a more columns than column names issue.
The data has 78 columns, so I don't believe posting it here will display in a usable way.
Can somebody recommend a solution for how I can reliably import this column as a numeric value and have R recognize NAs in the data correctly?
I think the issue might be that the na.strings contain commas and in some cases the ND,5 is read as one column with ND and one with a 5 and in other cases it's seen as the na.string. Any way to tell R to not split "ND,5" into two columns?
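One possible approach, sketched under a strong assumption: the "ND,5" values are wrapped in double quotes in the file. In that case the default quote handling of read.csv keeps the embedded comma inside one field, and the column can be read as text and converted afterwards. The sample file and the columns ID and NOTE are made up for illustration.

```r
# Made-up sample; it assumes the ND values are double-quoted in the real files.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'ID,BAA35,NOTE',
  '1,12.5,ok',
  '2,"ND,5",missing',
  '3,"ND,4,2019-03",missing',
  '4,7,ok'
), tmp)

# Keep the default quote = "\"" so commas inside quotes do not split the field,
# and read BAA35 as text first instead of forcing it to double.
dat <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE,
                colClasses = c(BAA35 = "character"), encoding = "UTF-8")

# Treat every value that starts with "ND" as missing, then convert to numeric.
dat$BAA35[grepl("^\\s*ND", dat$BAA35)] <- NA
dat$BAA35 <- as.numeric(dat$BAA35)
```

If the ND values are not quoted in the files, this will not work as written, because the extra commas then really do create extra fields.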
I'm new to R and I can't make this work with the information I'm finding.
I have many .txt files in a folder, each of them containing data from one subject. The files have identical columns, but the number of rows for each file varies. In addition, the column headers only start in row 9. What I want to do is
import the .txt files into RStudio in one go while skipping the first 8 rows, and
merging them all together into one data frame by their columns, so that the final data frame is a data set containing the data from all subjects in long format.
I managed to do 1 (I think) using the easycsv package and the following code:
fread_folder(directory = "C:/Users/path/to/my/files",
extension = "TXT",
sep = "auto",
nrows = -1L,
header = "auto",
na.strings = "NA",
stringsAsFactors = FALSE,
verbose=getOption("datatable.verbose"),
skip = 8L,
drop = NULL,
colClasses = NULL,
integer64=getOption("datatable.integer64"),# default:"integer64"
dec = if (sep!=".") "." else ",",
check.names = FALSE,
encoding = "unknown",
quote = "\"",
strip.white = TRUE,
fill = FALSE,
blank.lines.skip = FALSE,
key = NULL,
Names=NULL,
prefix=NULL,
showProgress = interactive(),
data.table=FALSE
)
That worked, however now my problem is that the data frames have been named after the very long path to my files and obviously after the txt files (without the 7 though). So they are very long and unwieldy and contain characters that they probably shouldn't, such as spaces.
So now I'm having trouble merging the data frames into one, because I don't know how else to refer to the data frames other than the default names that have been given to them, or how to rename them, or how to specify how the data frames should be named when importing them in the first place.
The code below looks for what files are in your directory, uses those names to get each file's table as a variable, and then uses rbindlist to combine the tables into a single table. Hope that helps. It assumes each .csv or .txt file in the directory has been pulled into the current environment as a separate data.table.
library(data.table)  # provides rbindlist()
directory <- "C:/Users/path/to/my/files"  # folder the files were imported from
for (x in list.files(directory)) {
  # Remove the .txt extension from the filename to get the table name
  x <- sub("\\.txt$", "", x)
  thisTable <- get(x)  # use "get" to look up the object named by the string
  # now append it to the combined table
  if (exists("combined")) {
    combined <- rbindlist(list(combined, thisTable))
  } else {
    combined <- thisTable
  }
}
The following should work well. However, without sample data or a clearer description of what you want, it's hard to know for certain if this is what you are looking to accomplish.
#set working directory
setwd("C:/Users/path/to/my/files")
#read in all .txt files but skip the first 8 rows
Data.in <- lapply(list.files(pattern = "\\.txt$"),read.csv,header=T,skip=8)
#stacks all of the tables into one, matching columns by name
Data.in <- do.call(rbind,Data.in)
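A variation on the same idea that avoids the long default object names: keep the tables in a named list and add a column recording which file each row came from. This is only a sketch. The two sample files, their columns trial and rt, and the new column subject are made up, and tab-separated files are assumed.

```r
# Made-up sample folder: 8 non-tabular lines, then a tab-separated table.
folder <- tempfile("subjects")
dir.create(folder)
writeLines(c(paste("meta line", 1:8), "trial\trt", "1\t350", "2\t410"),
           file.path(folder, "s01.txt"))
writeLines(c(paste("meta line", 1:8), "trial\trt", "1\t290", "2\t305", "3\t330"),
           file.path(folder, "s02.txt"))

files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Read every file into one list, skipping the first 8 rows of each.
tables <- lapply(files, read.delim, skip = 8, stringsAsFactors = FALSE)

# Name each list element after its file, without path or extension.
names(tables) <- tools::file_path_sans_ext(basename(files))

# Record the source file in a column so rows stay identifiable after stacking.
for (n in names(tables)) tables[[n]]$subject <- n

# Stack everything into one long data frame.
combined <- do.call(rbind, tables)
rownames(combined) <- NULL
```

Individual tables remain accessible as tables[["s01"]] and so on, so nothing needs to be renamed in the global environment.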
One of the columns in my dataframe contains semicolon(;) and when I try to download the dataframe to a csv using fwrite, it is splitting that value into different columns.
Example: the input value abcd;#6 becomes two columns after downloading: the 1st column holds abcd and the 2nd column holds #6.
I want both to be in the same column.
Could you please suggest how to write the value within a single column.
I am using below code to read the input file:
InpData <- read.table(File01, header=TRUE, sep="~", stringsAsFactors = FALSE,
fill=TRUE, quote="", dec=",", skipNul=TRUE, comment.char="")
while for writing:
fwrite(InpData, File01, col.names=T, row.names=F, quote = F, sep="~")
You didn't give us an example, but it is possible you need to use a different separator than ";"
fwrite(x, file = "", sep = ",")
sep: The separator between columns. Default is ",".
If this simple solution does not work, we need the data to reproduce your problem.
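One way to check whether the written file itself is intact is a small round trip. The sketch below uses base R write.table as a stand-in for fwrite so it runs without extra packages. The data frame and its columns id and code are made up.

```r
# Made-up data with a semicolon and a "#" inside one value.
InpData <- data.frame(id = 1:2, code = c("abcd;#6", "efgh"),
                      stringsAsFactors = FALSE)

out <- tempfile(fileext = ".txt")
write.table(InpData, out, sep = "~", quote = FALSE,
            row.names = FALSE, col.names = TRUE)

# Inspect the raw text: the semicolon should still sit inside one "~" field.
raw_lines <- readLines(out)

# Read it back; comment.char = "" stops "#" from being treated as a comment.
back <- read.table(out, header = TRUE, sep = "~", quote = "",
                   comment.char = "", stringsAsFactors = FALSE)
```

If the raw lines look right but the value still appears split, the splitting is probably happening in whatever program opens the file, not in the write step.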
I want to import a table (.txt file) in R with read.table().
table1<- read.table("input.txt",sep = "\t")
The file contains data like 0.09165395632583884
After reading the data, data becomes 0.09165396. Last few digits are lost,
but I want to avoid this problem.
If I used
options(digits=22)
then it creates another problem, like maindata = 0.19285969062961023 but when I write the data in file,
write.table(table1,file = "output.txt",col.names = F, row.names = F)
I get data = 0.192859690629610225. Here, an extra digit is appended and the second-to-last digit has changed.
Can someone give me a hint how to solve the problem?
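If the values only need to pass through unchanged (no arithmetic on them), one option is to read the column as text so no decimal-to-binary conversion happens at all. A minimal sketch with a made-up two-line input file:

```r
# Made-up input file holding the two values from the question.
src <- tempfile(fileext = ".txt")
writeLines(c("0.09165395632583884", "0.19285969062961023"), src)

# colClasses = "character" keeps every digit exactly as written in the file.
table1 <- read.table(src, sep = "\t", colClasses = "character")

# quote = FALSE writes the strings back without surrounding quotes.
out <- tempfile(fileext = ".txt")
write.table(table1, out, col.names = FALSE, row.names = FALSE, quote = FALSE)
```

For calculations the column would still have to be converted with as.numeric(). Note that options(digits = ...) only changes how numbers are printed, not how they are stored.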
I want to import csv files into R, with the first non-empty line supplying the names of the data frame columns. I know that you can supply the skip argument to specify how many lines to skip before reading starts. However, the row number of the first non-empty line can change between files.
How do I work out how many lines are empty, and dynamically skip them for each file?
As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:
,,,
w,x,y,z
a,b,5,c
a,b,5,c
a,b,5,c
a,b,4,c
a,b,4,c
a,b,4,c
which means there are rows of commas at the start.
read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
After writing the above, the poster explained that blank lines are not actually blank but have commas in them but nothing between the commas. In that case use fread from the data.table package which will handle that. The skip= argument can be set to any character string found in the header:
library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)
The last line can be omitted if a data.table is ok as the returned value.
Depending on your file size, this may not be the best solution, but it will do the job.
The strategy here is to read the file as lines instead of parsing it with a delimiter,
count the characters in each line, and store the counts in temp.
A while loop then searches for the first line with a non-zero character length,
after which the file is read from that line and stored as data_filename.
flist = list.files()
for (onefile in flist) {
temp = nchar(readLines(onefile))
i = 1
while (temp[i] == 0) {
i = i + 1
}
temp = read.table(onefile, sep = ",", skip = (i-1))
assign(paste0("data_", onefile), temp)
}
If file contains headers, you can start i from 2.
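The same line-based idea can be adapted to files whose leading "blank" lines contain only commas. This is a sketch, not a tested general solution. The helper name read_csv_skip_blank is hypothetical, and it assumes at least one line in the file has real content.

```r
# Hypothetical helper: skip leading lines made up only of commas/whitespace.
read_csv_skip_blank <- function(path, ...) {
  lines <- readLines(path)
  # A line counts as "blank" here if it holds only commas and whitespace.
  first <- which(!grepl("^[,[:space:]]*$", lines))[1]
  # Skip everything before the first real line, which then becomes the header.
  read.csv(path, skip = first - 1, ...)
}

# Made-up sample file with two comma-only lines before the header.
tmp <- tempfile(fileext = ".csv")
writeLines(c(",,,", ",,,", "w,x,y,z", "a,b,5,c", "a,b,4,c"), tmp)

df <- read_csv_skip_blank(tmp, stringsAsFactors = FALSE)
```

Like the loop above, this reads the whole file as text once before parsing it, so it may be slow for very large files.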
If the first couple of empty lines are truly empty, then read.csv should automatically skip to the first line. If they have commas but no values, then you can use:
df = read.csv(file = 'd.csv')
df = read.csv(file = 'd.csv',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
It's not efficient if you have large files (since you have to import twice), but it works.
If you want to import a tab-delimited file with the same problem (variable blank lines) then use:
df = read.table(file = 'd.txt',sep='\t')
df = read.table(file = 'd.txt',sep='\t',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))