For our analysis we need to read raw data from csv (exported from xls) and convert it into a SAS dataset before doing the analysis.
The problem is that this raw data generally has two issues:
1. The ordering of columns sometimes changes. So if in an earlier period the columns were in the order A, then B, then C, they might later arrive as B, then C, then A.
2. There are foreign elements like "#", ".", stray letters, etc.
We currently have to clean the raw data before reading it into SAS, and this takes a considerable amount of time. Is there any way to clean the data within SAS itself as part of reading it? If we can rectify the data with SAS code, it will save quite a lot of time.
Here's the example:
Period 1: I got the data in Data1.csv in the format below. Column B, which is numeric, contains "#" and "."; column C, also numeric, contains "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C remain. The question is how to deal with that. I can use an IF statement, but the problem is that there are too many possible foreign elements (instead of "#", ".", "g", I might get others like "$", "h", etc.). Is there any way to write code that detects and removes foreign elements without my having to specify them in an IF statement every time I import raw data into SAS?
A B C
Name1 1 5
Name2 2 6
Name3 3 4
Name4 # g
Name5 5 3
Name6 . 6
Period 2: In this period I got DATA2.csv, which is given below. When I use an INFILE statement, I specify that A should be read first with a specific name, then B, then C. In the second period B comes first, so when SAS reads the data I get B where A should be. I therefore have to check the column ordering against the previous period's data every time and correct it before reading the file with the INFILE statement. Since the number of variables is very large, verifying the column ordering in this fashion is very time consuming (and at times frustrating). Is there SAS code with which SAS will automatically read A, then B, then C, even when they are not in that order in the file?
B A C
1 Name1 5
2 Name2 6
3 Name3 4
# Name4 g
5 Name5 3
. Name6 6
Although I mainly use SAS for my analysis, I can use R to clean the data and then read it into SAS for further analysis, so R code would also be helpful.
Thanks.
In R you can increase the speed of file reading by specifying that a column is a particular class. With the example provided (three columns, with the middle one being "character"), you might use this code:
dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")
The "#" and "." would become NA values when encountered in the numeric columns. The above code removes the default specification of the comment character which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_, you could use this code:
dat <- read.csv( filename,
colClasses=c("numeric", "character", "numeric"),
comment.char="",
na.strings=c("NA", ".", "#") )
By default, header=TRUE is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There is further documentation with worked examples of reading Excel data elsewhere; however, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would also be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can specify a "Date" classed column and skip the process of turning character-classed columns in the default Excel date formats (all of which are bad) into dates.
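For instance, if the first column held such "yyyy-mm-dd" dates instead of names (a hypothetical layout, purely to illustrate the "Date" class), the call might look like this:
dat <- read.csv( filename,
                 colClasses=c("Date", "numeric", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )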
Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.
A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:
proc import datafile='c:\temp\somefile.csv'
out=SASdata
dbms=csv replace;
run;
Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:
data SASdata;
infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
informat A $50.; /* A character variable with max length 50 */
informat B yymmdd10.; /* A date presented like 2012-08-25 */
informat C dollar12.; /* A number containing dollar sign, commas, or decimals */
input A B C; /* The order of the variables in the file */
if B = . then B = today(); /* A possible data cleaning statement */
run;
Note that the INPUT statement lists the variables in the order they exist in the file. The point is that the code you use must match the layout of each file you process.
These are just general comments. If you encounter problems, post back with a more specific question.
UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.
In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):
data temp;
infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
/* Define dataset variables */
informat A $50.;
informat B 12.;
informat C 12.;
/* Add a KEEP statement to keep only the variables you want */
keep A B C;
input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
informat A $50.;
informat B 12.;
informat C 12.;
input B A C;
run;
proc append base=ALLDATA data=temp;
run;
Notice that the "data definition" part of each data step is the same; the only difference is the order of variables listed in the INPUT statement. Notice that because the variables A and B are defined as numeric, when those invalid characters are read (# and g), the values are stored as missing values.
In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
Related
I am trying to read in a tab-separated csv file using read_delim(). For some reason the function seems to change some integer field entries:
# Here some example data
# This should have 3 columns and 1 row
file_string = c("Tage\tID\tVISITS\n19.02.01\t2163994407707046646\t40")
# reading that data using read_delim()
data = read_delim(file_string, delim = "\t")
view(data)
data$ID
2163994407707046656 # This should be 2163994407707046646
I totally do not understand what is happening here. If I change the column type to character, the entry stays the same. Does anyone have an explanation for this?
Happy about any help!
Your number has so many digits that it does not fit into an R double. According to the IEEE 754 specification, the precision of a double is 53 bits, which is approximately 15 decimal digits. You hit that limit with as.double("2163994407707046646").
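If you need the exact value rather than a double approximation, one workaround is to read the ID column as character. This is just a sketch reusing file_string from your example and readr's col_types argument; in newer readr versions you may need to wrap a literal string in I():
library(readr)
# Read ID as character so all 19 digits survive; a double only carries about
# 15-16 significant digits, which is why ...646 came back as ...656.
data <- read_delim(file_string, delim = "\t",
                   col_types = cols(ID = col_character()))
data$ID
# "2163994407707046646"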
Can anyone please tell me how to assign a unique value to a result set every time it is executed? As displayed in the table below, an entry should be added in front of every record, and this entry should be the same for all records obtained during a single execution. The purpose of this is to be able to extract all of those records later with a short statement like (where UniqueID = A_Ground_01). Thanks
User DateTime Latitude Longitude Floor **Unique ID**
1 A 2017-06-15 47.29404 5.010650 Ground A_Ground_01
2 A 2017-06-15 47.29403 5.010634 Ground A_Ground_01
3 A 2017-06-15 47.29403 5.010668 Ground A_Ground_02
4 A 2017-06-15 47.29403 5.010663 Ground A_Ground_02
Without knowing anything about your initial dataframe or the function being executed, I might recommend something similar to the following.
In this example I'll assume you have a main dataframe we'll call df.main and some new data you will be binding to it, which we'll call df.newdata (see the sketch after these steps).
Create a column in your main dataframe called df.main$ExecID that will contain integer values.
Run whatever your function is and assign df.newdata$ExecID <- max(df.main$ExecID) + 1
Generate the unique id using df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor, df.newdata$ExecID, sep = "_")
Then run rbind(df.main, df.newdata)
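Putting those steps together, here is a minimal sketch; the data frames and values are made up, and only the column names (User, Floor, ExecID, UniqueID) follow your example:
# Made-up starting point: one earlier execution already recorded
df.main <- data.frame(User = "A", Floor = "Ground",
                      Latitude = 47.29404, Longitude = 5.010650,
                      ExecID = 1L, UniqueID = "A_Ground_1",
                      stringsAsFactors = FALSE)
# New result set produced by the current execution
df.newdata <- data.frame(User = "A", Floor = "Ground",
                         Latitude = c(47.29403, 47.29403),
                         Longitude = c(5.010668, 5.010663),
                         stringsAsFactors = FALSE)
df.newdata$ExecID   <- max(df.main$ExecID) + 1   # same ID for every row of this run
df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor,
                             df.newdata$ExecID, sep = "_")
df.main <- rbind(df.main, df.newdata)
subset(df.main, UniqueID == "A_Ground_2")        # retrieve everything from that run
If you want the zero-padded style from your table (A_Ground_01, A_Ground_02), replace the ExecID part of the paste() call with sprintf("%02d", df.newdata$ExecID).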
To provide a better solution for your specific situation, we really would need to see example code of how your script is written.
I have a table with columns ranging from foo1 ... foo999. Could you give me code in R to extract a set number of columns as needed by the user? The user should be able to decide what range of columns he needs, i.e. 1 to 60 in a separate table, or maybe 1 to 565. I have tried a couple of methods; this is what I have currently. The solution seems quite basic, but I can't find it anywhere.
number <- readline(prompt="Enter the number of columns: ")
subset(data, select=foo1:foo(number))
The expected output is the contents of the columns in the range the user needs, preferably stored in another variable so that I can use the data for further analysis.
Here's a construct that could do it:
number <- readline(prompt="Enter the number of columns: ")
columns<- eval(parse(text=number))
df_selected <- df[,columns]
This will handle the user entering something like 3:8 or c(1,4,9), etc.
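If you would rather keep your original idea of typing just a single number, a variation along the same lines (assuming the data frame is called df and the columns really are named foo1, foo2, ...) might be:
number <- as.integer(readline(prompt="Enter the number of columns: "))
df_selected <- df[, paste0("foo", 1:number)]   # selects foo1 .. fooN by name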
I want to import a table (.txt file) into R with read.table(). One column in my table is an ID with nine digits; some IDs begin with a 0, others with 1 or 2.
R truncates the leading 0 (012345678 becomes 12345678), which leads to problems when using this ID to merge with another table.
Can someone give me a hint how to solve the problem?
As said in Ben's answer, colClasses is the easier way to do it. Here is an example:
read.table(text = 'col1 col2
0012 0001245',
head=T,
colClasses=c('character','numeric'))
col1 col2
1 0012 1245 ## col1 keeps the leading zeros but col2 does not
A reproducible example would be nice, but: use the colClasses argument to read.table() to specify that you want this column to be read as a character variable, not numeric. Or make them back into character variables after reading them in, using sprintf to pad the numbers with leading zeros. (The former is probably easier.)
Here is a for loop to add leading zeros to rows based on a condition. Although this is a post-hoc solution (adding leading 0's after reading the table), it worked for me, so I thought I'd share:
Let's take the example of a column of zip codes. All values should contain 5 digits (e.g. 01234), but R removes leading zeros (so '01234' becomes '1234'). You can add a leading zero to all cells that contain only 4 characters with this code:
for (i in 1:nrow(df)){
if(nchar(df$zipCode[i])<5){
df$zipCode[i]<- paste0('0',df$zipCode[i])
}
}
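Note that the loop above only adds a single zero per cell, so a value that is two or more digits short would still come out wrong. A vectorised alternative, as a sketch assuming zipCode is stored as character, pads every value out to the full width in one go:
# strrep() needs R >= 3.3; pmax() guards values that are already 5+ characters long
df$zipCode <- paste0(strrep("0", pmax(0, 5 - nchar(df$zipCode))), df$zipCode)
# or, if the column was read as numeric: df$zipCode <- sprintf("%05d", as.integer(df$zipCode))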
I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries out into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations, i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is simply to count the commas ',' in each row, as commas don't appear in any other text in any other column, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data at the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually, what I intended to do was stick to something I know on the command line: grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location into respective files), then cat and sort them. If there is a niftier way for me to do this in R, I would love a pointer.
gregexpr returns -1 when there is no match, and that -1 still has length 1, so length() cannot tell "no comma" apart from "one comma". If you want to find the rows which have a match versus the ones which don't, you need to look at the returned match positions, not their length.
Try foo <- sapply(gregexpr(",", testdat$genome_coordinates), function(x) sum(x > 0)) to count the commas in each row (a failed match gives -1, which sum(x > 0) counts as 0); foo > 0 then flags the rows with a comma.
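For the second part, splitting multi-location rows onto their own rows while carrying the adjacent columns along, here is a base-R sketch. The only column name taken from your question is genome_coordinates; genome, chrom, start, end and strand are just illustrative names for the new columns:
out <- lapply(seq_len(nrow(testdat)), function(i) {
  x      <- gsub("^\\[|\\]$", "", testdat$genome_coordinates[i])  # strip the brackets
  parts  <- strsplit(x, ":")[[1]]                                 # genome, chromosome, range(s), strand
  ranges <- strsplit(parts[3], ",")[[1]]                          # one element per location
  pos    <- do.call(rbind, strsplit(ranges, "-"))                 # start / end columns
  cbind(testdat[rep(i, length(ranges)), , drop = FALSE],          # repeat the original row
        data.frame(genome = parts[1], chrom = parts[2],
                   start  = as.numeric(pos[, 1]),
                   end    = as.numeric(pos[, 2]),
                   strand = parts[4],
                   stringsAsFactors = FALSE))
})
expanded <- do.call(rbind, out)
rownames(expanded) <- NULL
Each element of out has one row per genomic location, so rows like your fourth entry come out twice with the same adjacent data.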