Is there a way to use read.table() to read all or part of the file in, use the class function to get the column types, modify the column types, and then re-read the file?
Basically I have columns of zero-padded integers that I'd like to treat as strings. If I let read.table() just do its thing, it of course assumes these are numbers, strips off the leading zeros, and makes the column type integer. The thing is, I have a fair number of columns, so while I could create a character vector specifying each one, I only want to change a couple from R's best guess. What I'd like to do is read the first few lines:
myTable <- read.table("//myFile.txt", sep="\t", quote="\"", header=TRUE, stringsAsFactors=FALSE, nrows = 5)
Then get the column classes:
colTypes <- sapply(myTable, class)
Change a couple of column types i.e.:
colTypes[1] <- "character"
And then re-read the file in using the modified column types:
myTable <- read.table("//myFile.txt", sep="\t", quote="\"", colClasses=colTypes, header=TRUE, stringsAsFactors=FALSE, nrows = 5)
While this seems like a perfectly reasonable thing to do, and colTypes = c("character") works fine, when I actually try it I get:
scan() expected 'an integer', got '"000001"'
class(colTypes) and class(c("character")) both return "character" so what's the problem?
You can use read.table's colClasses argument to specify which columns should be read as character. For example:
txt <-
"var1, var2, var3
0001, 0002, 1
0003, 0004, 2"

df <- read.table(
  text = txt,
  sep = ",",
  header = TRUE,
  colClasses = "character")  ## read all columns as character

df

df2 <- read.table(
  text = txt,
  sep = ",",
  header = TRUE,
  colClasses = c("character", "character", "double"))  ## the third column is numeric

df2
[updated...] Or, you could set and re-set colClasses with a vector:
df <- read.table(
  text = txt,
  sep = ",",
  header = TRUE)

df
## the columns are all currently read as integer

myColClasses <- sapply(df, class)

## a vector of column names for the zero-padded variables
zero_padded <- c("var1", "var2")

## if a name is in zero_padded, return "character", else leave it be
myColClasses <- ifelse(names(myColClasses) %in% zero_padded,
                       "character",
                       myColClasses)

## read in again with colClasses set to myColClasses
df2 <- read.table(
  text = txt,
  sep = ",",
  colClasses = myColClasses,
  header = TRUE)

df2
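Putting the pieces together, the "peek at a few rows, tweak the guessed classes, re-read" workflow from the question can be sketched like this (a minimal sketch: inline text stands in for the tab-separated file, and the column names `id`/`value` are invented):

```r
## A sketch of the read-twice idiom, with inline text standing in for the file
txt2 <- "id\tvalue\n000001\t10\n000002\t20\n"

## First pass: a handful of rows, just to learn the column layout
peek <- read.table(text = txt2, sep = "\t", header = TRUE,
                   stringsAsFactors = FALSE, nrows = 5)
colTypes <- sapply(peek, class)

## Override the guessed type for the zero-padded column
colTypes["id"] <- "character"

## Second pass: the full file, with the corrected classes
full <- read.table(text = txt2, sep = "\t", header = TRUE,
                   stringsAsFactors = FALSE, colClasses = colTypes)
full$id  ## the zero padding is preserved
```

Because colTypes has one entry per column, it lines up with colClasses positionally, so the re-read honours every original guess except the one that was overridden.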
Related
I am trying to download some CSVs from some links. Most of the CSVs are separated by ;, however one or two are separated by ,. Running the following code:
foo <- function(csvURL){
  downloadedCSV = read.csv(csvURL, stringsAsFactors = FALSE,
                           fileEncoding = "latin1", sep = ";")
  return(downloadedCSV)
}
dat <- purrr::map(links, foo)
Gives me a list of 3 data.frames. Two of them have 2 columns (correctly read with the ; separator) and one of them has 1 column (incorrectly read with the ; separator), because that file uses the , separator.
How can I incorporate into the function something like if the number of columns == 1 re-read the data but this time using , instead of ;? I tried passing sep = ";|," to the read.csv function but had no luck.
Links data:
links <- c("https://dadesobertes.gva.es/dataset/686fc564-7f2a-4f22-ab4e-0fa104453d47/resource/bebd28d6-0de6-4536-b522-d013301ffd9d/download/covid-19-total-acumulado-de-casos-confirmados-pcr-altas-epidemiologicas-personas-fallecidas-y-da.csv",
"https://dadesobertes.gva.es/dataset/686fc564-7f2a-4f22-ab4e-0fa104453d47/resource/b4b4d90b-08cf-49e4-bef1-5608311ce78a/download/covid-19-total-acumulado-de-casos-confirmados-pcr-altas-epidemiologicas-personas-fallecidas-y-da.csv",
"https://dadesobertes.gva.es/dataset/686fc564-7f2a-4f22-ab4e-0fa104453d47/resource/62990e05-9530-4f2f-ac41-3fad722b8515/download/covid-19-total-acumulado-de-casos-confirmados-pcr-altas-epidemiologicas-personas-fallecidas-y-da.csv"
)
We can also specify the sep as an argument
foo <- function(csvURL, sep){
  downloadedCSV = read.csv(csvURL, stringsAsFactors = FALSE,
                           fileEncoding = "latin1", sep = sep)
  return(downloadedCSV)
}
lstdat <- purrr::map2(links, c(";", ",", ";"), ~ foo(.x, sep = .y))
Or use fread from data.table, which can pick up the delimiter automatically
foo <- function(csvURL){
  downloadedCSV = data.table::fread(csvURL, encoding = "Latin-1")
  return(downloadedCSV)
}
dat <- purrr::map(links, foo)
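The fallback the question asks about ("if the number of columns == 1, re-read with ,") can also be written directly. This is a sketch under the assumption that every file uses one of the two delimiters:

```r
## Try ";" first; if the result collapses into a single column, retry with ","
foo <- function(csvURL){
  out <- read.csv(csvURL, stringsAsFactors = FALSE,
                  fileEncoding = "latin1", sep = ";")
  if (ncol(out) == 1) {  ## ";" didn't split anything: re-read with ","
    out <- read.csv(csvURL, stringsAsFactors = FALSE,
                    fileEncoding = "latin1", sep = ",")
  }
  out
}
```

You can then map this over links exactly as before.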
I am trying to import a few csv files from a specific folder:
setwd("C://Users//XYZ//Test")
filelist = list.files(pattern = ".*.csv")
datalist = lapply(filelist, FUN = read.delim, sep = ',', header = TRUE,
                  stringsAsFactors = F)

for (i in 1:length(datalist)){
  datalist[[i]] <- cbind(datalist[[i]], filelist[i])
}
Data = do.call("rbind", datalist)
After I use the above code, a few columns are of type character despite containing numbers. If I don't use stringsAsFactors = F, then the fields are read as factors, which turn into missing values when I use as.numeric(as.character()) later on.
Is there any solution so that I can keep some fields as numeric? The fields that I want to be as numeric look like this:
Price.Plan Feature.Charges
$180.00 $6,307.56
$180.00 $5,431.25
Thanks
The $ and , characters are not considered numeric, so even with stringsAsFactors = FALSE in read.delim, the column type is assigned as character. To change that, remove the $ and , with gsub, convert to numeric, and assign the result back to the particular columns:
cols <- c("Price.Plan", "Feature.Charges")  ## the columns to clean
df[cols] <- lapply(df[cols], function(x) as.numeric(gsub("[$,]", "", x)))
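A small reproducible sketch, using the two sample rows from the question (the data.frame here stands in for the combined Data):

```r
## Sample data matching the values shown in the question
df <- data.frame(Price.Plan      = c("$180.00", "$180.00"),
                 Feature.Charges = c("$6,307.56", "$5,431.25"),
                 stringsAsFactors = FALSE)

## Strip "$" and "," from the affected columns, then convert to numeric
cols <- c("Price.Plan", "Feature.Charges")
df[cols] <- lapply(df[cols], function(x) as.numeric(gsub("[$,]", "", x)))
df  ## both columns are now numeric
```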
I have a file in which every row is a string of numbers. Example of a row: 0234
Example of this file:
00020
04921
04622
...
When I use read.table, it deletes the leading zeros of each row (00020 becomes 20, 04921 -> 4921, ...). I use:
example <- read.table(fileName, sep="\t",check.names=FALSE)
After this, to obtain a vector I use as.vector(unlist(example)).
I have tried different options of read.table, but the problem remains.
By default, read.table checks the column values and sets the column types accordingly. If we want a custom type, we can specify it with colClasses:
example <- read.table(fileName, sep = "\t", check.names = FALSE,
                      colClasses = "character", stringsAsFactors = FALSE)
When colClasses is not specified, the function uses type.convert to assign the column types automatically based on the values:
read.table  ## excerpt from the function body
...
data[[i]] <- if (is.na(colClasses[i]))
    type.convert(data[[i]], as.is = as.is[i], dec = dec,
                 numerals = numerals, na.strings = character(0L))
...
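The zero-stripping is easy to reproduce by calling type.convert directly on the padded strings:

```r
## type.convert() decides the padded strings are integers, dropping the zeros;
## as.is = TRUE only prevents the character -> factor step, not this conversion
converted <- type.convert(c("00020", "04921"), as.is = TRUE)
converted  ## 20 4921, as integers
```

This is why colClasses = "character", which skips the conversion entirely, is what preserves the padding.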
If I understand the issue correctly, you read in your data file with read.table but since you want a vector, not a data frame, you then unlist the df. And you want to keep the leading zeros.
There is a simpler way of doing the same, use scan.
example <- scan(file = fileName, what = character(), sep = "\t")
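A quick check, with a throwaway temporary file standing in for fileName:

```r
## Write the sample rows to a temporary file and scan them back as character
tf <- tempfile()
writeLines(c("00020", "04921", "04622"), tf)
example <- scan(file = tf, what = character(), sep = "\t")
example  ## "00020" "04921" "04622" -- the leading zeros survive
```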
I realize this is a total newbie question (as always in my case), but I'm trying to learn R, and I need to import hundreds of csv files that have the same structure, but in some the column names are uppercase and in some they are lowercase.
so I have (for now)
flow0300csv <- Sys.glob("./csvfiles/*0300*.csv")
for (fileName in flow0300csv) {
  flow0300 <- read.csv(fileName, header = T, sep = ";",
                       colClasses = "character")[, c('CODE', 'CLASS', 'NAME')]
}
but I get an error because of the lowercase names. I have tried to apply tolower but I can't make it work. Any tips?
The problem here isn't in reading the CSV files, it's in trying to index using column names that don't actually exist in your "lowercase" data frames.
You can instead use grep() with ignore.case = TRUE to index to the columns you want.
tmp <- read.csv(fileName, header = TRUE, sep = ";",
                colClasses = "character")
ind <- grep(patt = "code|class|name", x = colnames(tmp),
            ignore.case = TRUE)
tmp[, ind]
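As a quick illustration, with an in-memory data frame standing in for one of the lowercase files (column names taken from the question, values invented):

```r
## A stand-in for a file whose headers happen to be lowercase
tmp <- data.frame(code = "A", class = "B", name = "C", other = 1,
                  stringsAsFactors = FALSE)
ind <- grep("code|class|name", colnames(tmp), ignore.case = TRUE)
tmp[, ind]  ## selects code, class, name; CODE/CLASS/NAME would match the same way
```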
You may want to look into readr::read_csv2() or even data.table::fread() for better performance.
After reading the .csv file, you may want to convert the column names to all uppercase with
flow0300 <- read.csv(fileName, header = T, sep = ";", colClasses = "character")
colnames(flow0300) <- toupper(colnames(flow0300))
flow0300 <- flow0300[, c("CODE", "CLASS", "NAME")]
EDIT: Extended solution with the input of @xraynaud.
I'm trying to read a .csv file into R where all the columns are numeric. However, they get converted to factor every time I import them.
Here's a sample of what my CSV looks like:
This is my code:
options(StringsAsFactors=F)
data<-read.csv("in.csv", dec = ",", sep = ";")
As you can see, I set dec to , and sep to ;. Still, all the vectors that should be numeric are factors!
Can someone give me some advice? Thanks!
Your NA strings in the csv file, N/A, are interpreted as character, and then the whole column is converted to character. If you have stringsAsFactors = TRUE in options or in read.csv (the default before R 4.0), the column is further converted to factor. You can use the na.strings argument to tell read.csv which strings should be interpreted as NA.
A small example:
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";")
str(df)
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";", na.strings = "N/A")
str(df)
Update following comment
Although not apparent from the sample data provided, there is also a problem with instances of '$' concatenated to the numbers, e.g. '$3,3'. Such values will be interpreted as character, and then the dec = "," doesn't help us. We need to replace both the '$' and the ',' before the variable is converted to numeric.
df <- read.csv(text = "x;y;z
N/A;1,1;2,2$
$3,3;5,5;4,4", dec = ",", sep = ";", na.strings = "N/A")
df
str(df)
df[] <- lapply(df, function(x){
  x2 <- gsub(pattern = "$", replacement = "", x = x, fixed = TRUE)
  x3 <- gsub(pattern = ",", replacement = ".", x = x2, fixed = TRUE)
  as.numeric(x3)
})
df
str(df)
You could have gotten your original code to work, actually: there's a tiny typo ('stringsAsFactors', not 'StringsAsFactors'). The options command won't complain about the wrong name, but it just won't work. When spelled correctly, the columns are read in as character instead of factor, and you can then convert them to whatever format you want.
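With the spelling corrected, a minimal sketch (inline text stands in for in.csv; note that from R 4.0 onwards stringsAsFactors already defaults to FALSE, so the argument mainly matters on older versions):

```r
## Corrected spelling, passed per call rather than via options()
data <- read.csv(text = "id;val
A;1,5
B;2,5", dec = ",", sep = ";", stringsAsFactors = FALSE)
str(data)  ## id stays character; val is parsed as numeric
```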
I just had this same issue and tried all the fixes on this and other duplicate posts. None really worked all that well. The way I went about fixing it was actually on the Excel side. If you highlight all the columns in your source file (in Excel), right click => Format Cells, then select 'Number', it'll import perfectly fine (as long as you have no non-numeric characters below the header).