I'm using the write function right now in R with a matrix, and this is what I have:
write(my_mtx,file='mtx.tsv',sep='\t')
But this gives me a file with only one column. I've also tried adding an 'ncolumns' argument:
write(my_mtx, ncolumns=length(colnames(my_mtx)), file='mtx.tsv', sep='\t')
But this just gives me a repeat of the one column, as opposed to the actual separated columns as they appear in the matrix. A little help?
Try using write.table instead:
write.table(my_mtx, file = 'mtx.tsv', sep = '\t', col.names = FALSE, row.names = FALSE)
Then it will default to the correct number of columns and there is no need to transpose.
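For instance, a quick check on a small toy matrix (names are dropped here to match the call above):
my_mtx <- matrix(1:6, nrow = 2)
write.table(my_mtx, file = 'mtx.tsv', sep = '\t',
            col.names = FALSE, row.names = FALSE)
readLines('mtx.tsv')
# [1] "1\t3\t5" "2\t4\t6"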
The default for write() is one column if the data are character, five columns if the data are numeric, and it fills by rows (see ?write). Try this:
write(t(my_mtx), file='mtx.tsv', sep='\t', ncolumns=ncol(my_mtx))
I am trying to import a SAS data set to R (I cannot share the data set). SAS sees columns as numeric or character; however, some of the numeric columns have coded character values. I've used the sas7bdat package to bring in the data set, but those character values in numeric columns return NaN, and I would like the actual character value. I have tried exporting the data set to CSV and tab-delimited files, but I end up with observations that take two lines (a problem with SAS that I haven't been able to figure out). Since there are over 9000 observations, I cannot go back and look for the observations that take two lines manually. Any ideas how I can fix this?
SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.
First is if you are using the BEST format (which is the default for numeric variables). If the value cannot be represented exactly in the available number of characters then it will use scientific notation.
Second is special missing values. SAS has 28 missing values: regular missing is represented by a period, and the others by a single letter or an underscore.
Third would be a custom format that displays the numbers using letters.
The first should not cause any trouble when importing into R. The last two can be handled by the haven package; see the semantics vignette in its documentation, and the sketch below.
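For the special missing values in particular, here is a minimal sketch of how haven surfaces them (the file and column names are hypothetical):
library(haven)
d <- read_sas("data.sas7bdat")
# haven stores SAS special missings (.A-.Z, ._) as "tagged" NAs:
is.na(d$x)    # TRUE for regular and special missings alike
na_tag(d$x)   # recovers the letter/underscore tag; NA for ordinary values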
As to your multiple-line CSV file, there are two possible issues. The first is just that you did not tell SAS to use lines long enough for your data; make sure to use a longer LRECL setting on the file you are writing to:
filename csv 'myfile.csv' lrecl=1000000 ;
proc export data=mydata file=csv dbms=csv ; run;
The second possible issue is that some of your character variables include end-of-line characters. It is best to just remove or replace those characters; you could always add them back if they are really wanted. For example, these steps will export the same file as above, first replacing the carriage returns and line feeds in the character variables with pipe characters:
data for_export;
  set mydata;
  array _c _character_;
  do over _c;
    _c = translate(_c, '||', '0A0D'x);
  end;
run;
proc export data=for_export file=csv dbms=csv ; run;
A partial answer for dealing with data spread across multiple rows:
library( data.table )
# First, read the whole lines into a single column, for example with:
DT <- data.table::fread( myfile, sep = "")
#sample data for this example: a data.table with ten rows containing the numbers 1 to 10
DT <- data.table( 1:10 )
# Column-bind two subsets of the data, using a logical vector to select every
# first and every second row. Then paste the columns together and collapse
# using a comma separator (or whatever separator you like).
ans <- as.data.table(
  cbind( DT[ rep( c(TRUE, FALSE), length = .N), 1 ],
         DT[ rep( c(FALSE, TRUE), length = .N), 1 ] )[, do.call( paste, c(.SD, sep = ",") )] )
# V1
# 1: 1,2
# 2: 3,4
# 3: 5,6
# 4: 7,8
# 5: 9,10
I prefer the read_sas function from the 'haven' package for reading SAS data:
library(haven)
data <- read_sas("data.sas7bdat")
Is it possible to pass column indices to read_csv?
I am passing many CSV files to read_csv with different header names, so rather than specifying names I wish to use column indices.
Is this possible?
df.list <- lapply(myExcelCSV, read_csv, skip = headers2skip[i]-1)
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or ‘_’/‘-’ to skip the column.
If you know the total number of columns in the file you could do it like this:
my_read <- function(..., tot_cols, skip_cols = numeric(0)) {
  csr <- rep("?", tot_cols)   # "?" lets readr guess each column's type
  csr[skip_cols] <- "_"       # "_" skips the column entirely
  csr <- paste(csr, collapse = "")
  read_csv(..., col_types = csr)
}
If you don't know the total number of columns in advance, you could add code to this function to read just the first line of the file and count the number of columns returned; a rough sketch follows.
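For example (the function name is hypothetical, and this reads the header row twice):
my_read2 <- function(file, skip_cols = numeric(0), ...) {
  # Peek at the first row only, reading everything as character, to count columns
  peek <- read_csv(file, n_max = 1, col_types = cols(.default = "c"))
  csr <- rep("?", ncol(peek))   # guess every column's type...
  csr[skip_cols] <- "_"         # ...except the ones we skip
  read_csv(file, col_types = paste(csr, collapse = ""), ...)
}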
FWIW, the skip argument might not do what you think it does: it skips rows rather than selecting/deselecting columns. As I read ?readr::read_csv(), there doesn't seem to be any convenient way to skip and/or include particular columns (by name or by index) except by some ad hoc mechanism such as the one suggested above. This might be worth a feature request/discussion on the readr issues list (e.g. add cols_include and/or cols_exclude arguments that could be specified by name or position?).
I have about 30 columns within a data frame of over 100 columns. The file I am reading in stores its numbers as characters; in other words, 1300 is stored as 1,300 and R thinks it is a character.
I am trying to fix that issue by replacing the "," with nothing and turning the field into an integer. I do not want to use gsub on each affected column individually; I would rather store the affected columns in a vector and handle them all with one function or loop.
I have tried using lapply, but am not sure what to put as the "x" variable.
Here is my code, with the error below it:
ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
as.integer(gsub(",", "", ItemStats_2014[intColList])) )
Error in `[.data.table`(ItemStats_2014, intColList): When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
The file I am reading in stores its numbers as characters [with commas as decimal separator]
Just read those columns in directly as numeric, not as strings:
data.table::fread() understands decimal separators (see its dec argument, e.g. dec=',').
You might need to play with the fread(..., colClasses = (...)) argument a bit to specify the integer columns:
myColClasses <- rep('character', 100) # for example; 'character' is the class name fread expects
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)
This approach is simpler and faster and uses much less memory than reading as string, then converting later.
Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.
ItemStats_2014 <- ItemStats_2014 %>%
mutate_at(intColList, funs(as.integer(gsub(',', '', .))))
mutate_at selects columns from a list or using a dplyr selector function (see ?select_helpers) then applies one or more functions to each column. The . in gsub refers to each selected column that mutate_at passes to it. You can think of it as the x in function(x) ....
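If the funs() notation feels opaque, the same call can be spelled with an explicit anonymous function (equivalent result):
ItemStats_2014 <- ItemStats_2014 %>%
  mutate_at(intColList, function(x) as.integer(gsub(',', '', x)))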
This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values run together without separators, and each group of characters indicates something. I am unable to figure out how to separate the values into columns and specify a heading for all the new columns that will be created.
I used this code; however, it does not seem to work:
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: in R we need to escape "\", which is why we write \\d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the stringr package to get your data into a character matrix:
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the full matched text, and the following columns the captured fields. So we can get rid of the first column:
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names:
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now, every column in result is of type character; you can transform each of them into other types. For example:
require(dplyr)
result <- result %>%
  mutate(ethnic    = as.factor(ethnic),
         age       = as.integer(age),
         smoke     = as.factor(smoke),
         preweight = as.numeric(preweight),
         delweight = as.numeric(delweight),
         breastfed = as.factor(breastfed),
         brthwght  = as.integer(brthwght),
         brthlngth = as.numeric(brthlngth))
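A quick way to confirm the conversions took effect:
str(result)        # each column should now report its intended type
head(result, 3)    # eyeball a few parsed rows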
I am trying to read a data table into R. The data contains:
two columns with numeric values (continuous),
1556 columns with either 0 or 1, and
one last column with strings, representing two groups (group A and group B).
Some values are missing, and they are replaced with either ? or some spaces and then ?. As a result, when I read the table into R, the numbers were read as characters.
For example, if Data[1,1]=125, when I write is.numeric(Data[1,1]) I get FALSE. I want to turn all the numbers into numeric values, and I want all the ? entries (with or without spaces before them) turned into missing values. I do not know how to do this. Thank you! (I have 3279 rows.)
You can specify the na.strings argument of ?read.table to be na.strings = c("?", "?."). Use that inside the read.table() call when you read the data into R; it should then be recognised correctly. Since you also have some spaces in your data, you could additionally use the strip.white = TRUE argument inside the read.table call, as in the sketch below.
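A minimal sketch of that call (the file name and tab separator are assumptions; note strip.white is only used once sep has been specified):
Data <- read.table("mydata.txt", sep = "\t", header = TRUE,
                   na.strings = c("?", "?."),  # markers to read as NA
                   strip.white = TRUE)         # drop stray spaces around fields
is.numeric(Data[1, 1])  # should now be TRUE for a numeric column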