I am trying to import a SAS data set to R (I cannot share the data set). SAS sees columns as number or character. However, some of the number columns have coded character values. I've used the sas7bdat package to bring in the data set but those character values in number columns return NaN. I would like the actual character value. I have tried exporting the data set to csv and tab delimited files. However, I end up with observations that take 2 lines (a problem with SAS that I haven't been able to figure out). Since there are over 9000 observations I cannot go back and look for those observations that take 2 lines manually. Any ideas how I can fix this?
SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.
First is the BEST format (which is the default for numeric variables). If the value cannot be represented exactly in the available number of characters, SAS falls back to scientific notation.
Second is special missing values. SAS has 28 missing values. Regular missing is represented by a period. The other 27 are represented by a period followed by a single letter or an underscore (.A to .Z and ._).
Third would be a custom format that displays the numbers using letters.
The first should not cause any trouble when importing into R. The last two can be handled by the haven package. See the "semantics" vignette in its documentation.
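As a rough illustration of the second case, here is a minimal sketch of how haven represents letter-coded missing values once they are in R. It assumes the haven package is installed; the vector x is made-up stand-in data, not the output of read_sas() on a real file.

```r
library(haven)

# Stand-in for a numeric column holding 1, a letter-coded missing value,
# and a regular missing value (hypothetical data)
x <- c(1, tagged_na("a"), NA)

is.na(x)         # both kinds of missing value still behave as NA
na_tag(x)        # recovers the letter attached to a tagged missing value
is_tagged_na(x)  # TRUE only for the tagged missing value
```

The point is that the column stays numeric while the letter code remains recoverable, so nothing has to be converted to character.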
As to your multiple line CSV file there are two possible issues. The first is just that you did not tell SAS to use long enough lines for your data. Just make sure to use a longer LRECL setting on the file you are writing to.
filename csv 'myfile.csv' lrecl=1000000 ;
proc export data=mydata file=csv dbms=csv ; run;
The second possible issue is that some of your character variables include end of line characters in them. It is best to just remove or replace those characters. You could always add them back if they are really wanted. For example these steps will export the same file as above. It will first replace the carriage returns and line feeds in the character variables with pipe characters instead.
data for_export ;
set mydata;
array _c _character_;
do over _c;
_c = translate(_c,'||','0A0D'x);
end;
run;
proc export data=for_export file=csv dbms=csv ; run;
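On the R side, the broken observations do not have to be found by hand. The following is a minimal sketch using base R's count.fields(); the file contents are invented for illustration, and it assumes fields contain no quoted commas (hence quote = "").

```r
# Hypothetical export in which one observation is split over two lines
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Ann,10",
             "2,Bo",
             "b,20",
             "3,Cy,30"), path)

# Number of comma-separated fields on each physical line
n_fields <- count.fields(path, sep = ",", quote = "")

# Lines whose field count differs from the header line are suspects
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

This only locates the problem lines; fixing the export in SAS as shown above is still the cleaner route.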
A partial answer, for dealing with data spread across multiple rows:
library( data.table )
#first, read the whole lines into a single column, for example with
DT <- data.table::fread( myfile, sep = "")
#sample data for this example: a data.table with ten rows containing the numbers 1 to 10
DT <- data.table( 1:10 )
#column-bind two subsets of the data, using a logical vector to select every first
#and every second row. Then paste the columns together and collapse using a
#comma separator (or whatever separator you like)
ans <- as.data.table(
cbind ( DT[ rep( c(TRUE, FALSE), length = .N), 1],
DT[ rep( c(FALSE, TRUE), length = .N), 1] )[, do.call( paste, c(.SD, sep = ","))] )
# V1
# 1: 1,2
# 2: 3,4
# 3: 5,6
# 4: 7,8
# 5: 9,10
I prefer the read_sas function from the 'haven' package for reading SAS data:
library(haven)
data <- read_sas("data.sas7bdat")
Related
I have several hundred delimited text files. In some rows, a newline appears before the end of the row, in random columns. When I try to read a file, the reader looks for the correct number of columns, but fails because the row is split onto the next line.
The argument fill=T does not help because it creates empty, incorrect columns.
If I have:
"Aa|Bb|C\nc\ntwo|three|four"
But really should be two rows by three columns:
"Aa|Bb|Cc\ntwo|three|four"
How can I get there for all rows of the data (the error occurs randomly throughout)?
Note that you have C\nc in the string, which pushes c onto a new line. I guess you need to ensure the format of your input string as the first step, otherwise it is difficult to fix via post-processing.
I am not sure if the code below is what you are after. Do you mean something like using read.csv?
read.csv(text = sub("\n","",s),sep = "|",header = FALSE)
which gives
V1 V2 V3
1 Aa Bb Cc
2 two three four
If you are using data.table, you can try fread (thanks @akrun)
fread(sub("\n", "", s))
Data
s <- "Aa|Bb|C\nc\ntwo|three|four"
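Note that sub() removes only the first newline, which is enough for this one string. For files where the break occurs at unpredictable places, one possible heuristic is to glue lines together until a record has the expected number of fields. The sketch below is an assumption-laden illustration, not a general solution: rejoin_rows() is a hypothetical helper, and it assumes that a short fragment following a complete row continues that row.

```r
# Hypothetical helper: merge physical lines back into logical rows
rejoin_rows <- function(lines, n_cols, sep = "|") {
  # Fields in a string = number of separators + 1
  count_fields <- function(x) {
    lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE))) + 1L
  }
  out <- character(0)
  buf <- NULL
  for (ln in lines) {
    if (is.null(buf)) {
      buf <- ln                      # first line starts the first record
    } else if (count_fields(buf) < n_cols || count_fields(ln) < n_cols) {
      buf <- paste0(buf, ln)         # incomplete record or short fragment: glue
    } else {
      out <- c(out, buf)             # record complete and next line looks complete
      buf <- ln
    }
  }
  c(out, buf)
}

s <- "Aa|Bb|C\nc\ntwo|three|four"
fixed <- rejoin_rows(strsplit(s, "\n", fixed = TRUE)[[1]], n_cols = 3)
fixed
read.table(text = fixed, sep = "|", header = FALSE)
```

The heuristic is ambiguous when a complete row is followed by a row that is itself broken in its first column, so the result is worth spot-checking.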
Is it possible to pass column indices to read_csv?
I am passing many CSV files to read_csv with different header names so rather than specifying names I wish to use column indices.
Is this possible?
df.list <- lapply(myExcelCSV, read_csv, skip = headers2skip[i]-1)
From the description of the col_types argument in ?read_csv:
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or '_'/'-' to skip the column.
If you know the total number of columns in the file you could do it like this:
my_read <- function(..., tot_cols, skip_cols=numeric(0)) {
csr <- rep("?",tot_cols)
csr[skip_cols] <- "_"
csr <- paste(csr,collapse="")
read_csv(...,col_types=csr)
}
If you don't know the total number of columns in advance you could add code to this function to read just the first line of the file and count the number of columns returned ...
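A minimal sketch of that idea follows, using only base R to build the specification string. make_col_spec() is a hypothetical helper name, and the sample file is invented; the sketch assumes the first line of the file has one entry per column.

```r
# Hypothetical helper: build a col_types string without knowing the column count
make_col_spec <- function(path, skip_cols = integer(0), sep = ",") {
  # Read only the first line and split it into fields
  first_row <- scan(path, what = character(), sep = sep, nlines = 1, quiet = TRUE)
  spec <- rep("?", length(first_row))  # "?" = let the reader guess the type
  spec[skip_cols] <- "_"               # "_" = skip the column
  paste(spec, collapse = "")
}

path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d", "1,2,3,4"), path)

spec <- make_col_spec(path, skip_cols = c(2, 4))
spec
# The string can then be passed on, e.g. readr::read_csv(path, col_types = spec)
```

Because the columns are addressed by position, the same call works across files whose header names differ.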
FWIW the skip argument might not do what you think it does (it skips rows rather than selecting/deselecting columns): as I read ?readr::read_csv() there doesn't seem to be any convenient way to skip and/or include particular columns (by name or by index) except by some ad hoc mechanism such as suggested above; this might be worth a feature request/discussion on the readr issues list? (e.g. add cols_include and/or cols_exclude arguments that could be specified by name or position?)
Problem reading data vector - My csv data file (rab.csv) has just one row of > 10,000 numbers read into R with:
bab <- read.table("rab.csv") #which yields:
bab
V1
1 23,29,9,28,16,10,8,24,16,20,14,15,17,31,25,19,24,55,28,55,23, . . . and so on
In using this data vector, I get:
Error: data vector must consist of at least two distinct values!
It seems to only see the number "1" that was somehow added in front of the data.
I'm quite new to this so probably something simple, but I've spent 2 days searching every possibility I can think of without finding a solution.
We can use scan to read the file as a vector.
v1 <- scan("rab.csv", what=numeric(), sep=",")
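read.table splits on whitespace by default (sep=""), so the whole comma-separated line is read as a single value in column V1; the "1" in front of the data is just the row name. With read.csv, header=TRUE is the default, so the first row would be taken as the header and, as it is numeric, an X would be added as a prefix (though that can be avoided with the check.names=FALSE argument).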
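The difference is easy to see on a small stand-in file. This is a minimal sketch with invented numbers; only the file layout (one comma-separated row) mirrors the question.

```r
path <- tempfile(fileext = ".csv")
writeLines("23,29,9,28,16,10,8,24", path)

# read.table() with its default separator puts the whole line into one cell
bab <- read.table(path)
dim(bab)

# scan() with sep = "," returns a plain numeric vector
v1 <- scan(path, what = numeric(), sep = ",", quiet = TRUE)
length(v1)
```

With v1 being an ordinary numeric vector, functions that expect a data vector see all the values rather than a single string.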
I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotation marks), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.
I have searched here ("[r] read eol" and other queries I don't remember) and did not find a solution: the only suggestion was to preprocess the file and change the eol, which is not possible in my case because some fields contain things like \n, \r, \n\r, " and so on (this is the reason for the custom delimiters).
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split it into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # the character the rows divide on (scan needs a single character here)
sep_str = "~" # whatever character(s) the columns divide on
# read and parse the text file
# scan gives you a vector of row strings (one string per row)
# strsplit gives you a list of character vectors (one element per column in each row)
row_list <- strsplit(scan("raw.txt", what=character(), sep=eol_str),
split=sep_str, fixed=TRUE)
df <- data.frame(do.call(rbind,row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
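Because scan() only accepts a single-character sep, a multi-character row delimiter such as "\t&nd" needs a different route. One option, sketched below under the assumption that the file fits in memory and every row ends with the delimiter, is to read the file as one string and split on the literal delimiters. The sample content is invented.

```r
# Hypothetical sample: rows end in "\t&nd", columns are separated by "~",
# and one field contains an embedded newline
path <- tempfile(fileext = ".txt")
raw_text <- "a~b~c\t&nd1~2~me\nh\t&nd4~5~wow\t&nd"
writeChar(raw_text, path, eos = NULL)

eol_str <- "\t&nd"
sep_str <- "~"

# Read the whole file as a single string, then split on the literal delimiters
txt   <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)
rows  <- strsplit(txt, eol_str, fixed = TRUE)[[1]]
cells <- strsplit(rows, sep_str, fixed = TRUE)

# First row holds the column names, the rest is data
df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(df) <- cells[[1]]
df
```

Since the split uses fixed = TRUE, embedded \n or \r characters inside fields are left untouched, and all columns come back as character.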
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility, but you can invoke it from R with system() and do a find/replace to reformat your file for read.table. Your invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you already have \n and \r\n in your text, I recommend that you first find and replace them with temporary placeholders (perhaps quoted versions of themselves), and then convert them back after you have built your data.frame.
I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R, such that for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D, which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically named. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I run the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
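The first line of your code is splitting the variable names by . and extracting the first piece, so Penta.C, Penta.D and Penta.E all collapse to Penta. It sounds like you instead want to split by . and drop only the last piece when it is a numeric suffix (dropping the last piece unconditionally would also turn Penta.D into Penta):
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
function(x) paste(if (length(x) > 1 && grepl("^[0-9]+$", x[length(x)])) x[-length(x)] else x,
collapse="."))))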
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=F,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$",replace="",unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column renaming by using the check.names=FALSE argument to read.table. If you refer explicitly to such columns, you'll need to provide back-quoted names:
df$`Penta C`
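A small self-contained illustration of the difference, using invented inline data in place of the spreadsheet export:

```r
csv_text <- "Penta C,Penta D\n11,12"

# Default import: headers are passed through make.names()
default_names <- names(read.csv(text = csv_text))
default_names

# check.names = FALSE keeps the headers exactly as written
df <- read.csv(text = csv_text, check.names = FALSE)
names(df)

# Names containing a space must be back-quoted when used with $
df$`Penta C`
```

Keeping the original headers means a later split on "." can no longer confuse the Penta loci with one another.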