Read data into R deleting or skipping lines containing characters - r

I'm sure this is simple, but I'm not coming across an answer. I would like to import a data frame into R without processing the lines in a text editor first. Essentially, I want R to do it on read in. So all lines containing
FRAME 1 of ***
OR
ATOM-WISE TOTAL CONTACT ENERGY
will be skipped, deleted or ignored.
And all that will be left is:
Chain Resnum Atom number Energy(kcal/mol)
ATOM C 500 1519 -2.1286
ATOM C 500 1520 -1.1334
ATOM C 500 1521 -0.8180
ATOM C 500 1522 -0.7727
Is there a simple solution to this? I'm not sure which scan() or read.table() arguments would work.
EDIT
I was able to use readLines and gsub to read in the file and remove the unnecessary lines. I dropped the empty strings ("") left behind by the deleted words, and now I am trying to convert the character data to a regular (numeric) data frame. When I use data.frame(x) or as.data.frame(x) I am left with a data frame with 100K rows and only one variable. There should be at least 5 variables.

readLines gives you a vector with one character string for each line of the file. So you have to split these strings into the elements you want before you convert to a dataframe. If you have nice space-separated values, try:
m = matrix(unlist(strsplit(data, " +")), ncol=5, byrow=TRUE)
# where 'data' is the name of the vector of strings
df = data.frame(m, stringsAsFactors=FALSE)
Then for each column with numeric data, use as.numeric() on the column to convert.
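Putting the pieces together, here is a minimal sketch of the whole workflow. The filename contacts.txt is a placeholder, and the column names are adapted from the question's header ("Record" is a made-up name for the leading ATOM tag):
lines <- readLines("contacts.txt")
data <- lines[grepl("^ATOM ", lines)]   # keep only data rows; "ATOM-WISE..." has no space after ATOM, so it is skipped
m <- matrix(unlist(strsplit(data, " +")), ncol=5, byrow=TRUE)
df <- data.frame(m, stringsAsFactors=FALSE)
names(df) <- c("Record", "Chain", "Resnum", "AtomNumber", "Energy")
df[3:5] <- lapply(df[3:5], as.numeric)   # Resnum, AtomNumber, Energy are numeric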

Related

How to import CSV data in R separated by tab/space?

The data I have is as below in a .csv file.
id.airwaybill_number.order_number.org_pincode.product_type.inscan_date.pickup_date.actual_weight.original_act_weight.chargeable_weight.collectable_value.declared_value.code.name.active.center_shortcode.center_shortcode.if.sc.center_shortcode...NULL csc.center_shortcode sc.center_shortcode..rts_status.reverse_pickup.ref_airwaybill_number.dest_pincode.pincode.item_description.length.breadth.height.volumetric_weight.city_name.city_name.state_shortcode.state_shortcode.zone_shortcode.zone_shortcode
"61773384 147200492 SLP759809537 110008 ppd 2016-03-02 04:38:56 2016-03-01 0.25 0.25 0.5 0 424 92006 JASPER INFOTECH PRIVATE LIMITED activ 0 NULL 37.5 DLT MPS MPS 0 0 NULL 403516 403516 Vimarsh Rechargeable Tube With Charger Emergency Light 10 10 10 0.2 DELHI MAPUSA DL GA NCR WS"
When I import it into R using -
y <- read.csv("x.csv", sep = "\t")
y <- read.table("x.csv", sep = "\t")
All the data comes into one cell. This is a sample of a very big dataset, and I want to import the data column-wise, not into a single cell.
Please help.
Your file is a little odd, in that it seems to have a mix of delimiters (some \t, some _, and some ,), and as @Sun Bee mentions in the comments, your header doesn't seem to match up with your data. For those reasons, it might be worth working on the file "from scratch" rather than relying on something like read.table or fread.
First, read in the file as text:
con <- file( "x.csv" )
input <- readLines( con )
close( con )
Then perform a few tasks on it. First, split the text in each line by any of \t, ,, and _.
data <- sapply( input, strsplit, "\t|,|_" )
If you take a look at the lengths of each element, you'll see that the first (the header) is an odd one out, meaning the values won't line up with the header names.
sapply( data, length )
My suggestion here is to remove that first row, and go without a header for now.
data <- data[ -1 ]
Then bind the list together row-wise to make a matrix (which you can convert to a data.frame if you prefer). I'm removing the row names here because I assume you don't need them.
data <- do.call( rbind, data )
row.names(data) <- NULL
What results from the above is something that I'd say represents your data well, albeit without column names. You can take the first line of your file and work with it to extract proper column names if you wish, but I'm not seeing exactly how they should go, so I won't attempt it here.
NOTE: rbind here produces a character matrix; if you later convert it to a data.frame and want to keep the columns as character rather than factor, you can specify options( stringsAsFactors = FALSE ) beforehand (or pass stringsAsFactors = FALSE to the conversion call).
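If you do want a data.frame at the end, a minimal conversion sketch (column names are left as the V1, V2, ... defaults, since the header doesn't line up with the data):
df <- as.data.frame(data, stringsAsFactors = FALSE)
str(df)   # inspect, then convert any numeric columns, e.g. df$V8 <- as.numeric(df$V8)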

Combining CSV files and splitting the column into 2 columns using R

I have 40 CSV files with only 1 column each. I want to combine all 40 files' data into 1 CSV file with 2 columns.
Data format is like this:
I want to split this column by space and combine all 40 CSV files into 1 file. I want to preserve the number format as well.
I tried the code below, but the number format is not preserved, and an extra 3rd column is added for negative numbers. Not sure why.
My Code :
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," ",fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
The output which I got is as shown below. If you observe, the negative numbers are put into an extra column. I just want to split by space and put the values in 2 columns in the exact number format as in the input.
R Output:
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD about charsets, unreliable files, etc., so I tend to use splitters such as "[[:space:]]+" instead, since it'll catch whitespace variants beyond just the literal space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
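Putting it together with the rest of the original code, a sketch of the corrected pipeline (this assumes, as in the question, that each input line holds exactly two numbers):
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
# split on one-or-more whitespace characters, then restore the numeric type
parts <- do.call("rbind", strsplit(as.character(trimws(merged$V1)), "[[:space:]]+"))
out <- data.frame(V1 = as.numeric(parts[,1]), V2 = as.numeric(parts[,2]))
write.csv(out, "export1.csv", row.names=FALSE, na="NA")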

read txt files with left-aligned data but inconsistent number of spaces in R

I have a series of txt files formatted in the same way.
The first few rows are all about file information. There are no variable names. As you can see, the spaces between factors are inconsistent, but the columns are left-aligned or right-aligned. I know SAS can read data in this format directly, and I wonder whether R provides any similar function.
I tried the read.csv function to load the data, wanting to save it in a data.frame with 3 columns, but it turns out the option sep = "\s" (multiple spaces) in the function cannot recognize regular expressions.
So I tried to read the data into a variable first and use the substr function to split it, as follows.
Step 1:
Factor <- data.frame(substr(Share$V1, 1, 9), substr(Share$V1, 9, 14), as.numeric(substr(Share$V1, 15, 30)))
Step 2:
But this is quite unintelligent and requires counting the spaces in between.
I wonder if there is any method to directly load the data as three columns.
> Factor
F T S
1 +B2P A 1005757219
2 +BETA A 826083789
We can use read.table to read it as 3 columns
read.table(text=as.character(Share$V1), sep="", header=FALSE,
stringsAsFactors=FALSE, col.names = c("FactorName", "Type", "Share"))
# FactorName Type Share
#1 +B2P A 1005757219
#2 +BETA A 826083789
#3 +E2P A 499237181
#4 +EF2P A 38647147
#5 +EFCHG A 866171133
#6 +IL1QNS A 945726018
#7 +INDMOM A 862690708
Another option would be to read it directly from the file, skipping the header line and changing the column names:
read.table("yourfile.txt", header=FALSE, skip=1, stringsAsFactors=FALSE,
col.names = c("FactorName", "Type", "Share"))
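As a SAS-like alternative, base R's read.fwf reads fixed-width fields directly, which suits left-aligned columns. A sketch, with widths inferred from the substr() positions in the question (c(9, 5, 16) covers characters 1-9, 10-14, and 15-30; adjust to your file):
read.fwf("yourfile.txt", widths=c(9, 5, 16), skip=1,
         col.names=c("FactorName", "Type", "Share"),
         strip.white=TRUE, stringsAsFactors=FALSE)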

Importing csv file into R - numeric values read as characters

I am aware that there are similar questions on this site, however, none of them seem to answer my question sufficiently.
This is what I have done so far:
I have a csv file which I open in Excel. I manipulate the columns algebraically to obtain a new column "A". I import the file into R using read.csv() and the entries in column A are stored as factors - I want them to be stored as numeric. I found this question on the topic:
Imported a csv-dataset to R but the values becomes factors
Following the advice, I include stringsAsFactors = FALSE as an argument in read.csv(); however, as Hong Ooi suggested in the page linked above, this doesn't cause the entries in column A to be stored as numeric values.
A possible solution is to use the advice given in the following page:
How to convert a factor to an integer\numeric without a loss of information?
however, I would like a cleaner solution, i.e. a way to import the file so that the entries of column A are stored as numeric values.
Cheers for any help!
Whatever algebra you are doing in Excel to create the new column could probably be done more effectively in R.
Please try the following: read the raw file (before any Excel manipulation) into R using read.csv(..., stringsAsFactors=FALSE). [If that does not work, please take a look at ?read.table (which read.csv wraps); however, there may be some other underlying issue.]
For example:
delim = "," # or is it "\t" ?
dec = "." # or is it "," ?
myDataFrame <- read.csv("path/to/file.csv", header=TRUE, sep=delim, dec=dec, stringsAsFactors=FALSE)
Then, let's say your numeric column is column 4:
myDataFrame[, 4] <- as.numeric(myDataFrame[, 4]) # you can also refer to the column by "itsName"
Lastly, if you need any help accomplishing in R the same tasks that you've done in Excel, there are plenty of folks here who would be happy to help you out.
In read.table (and its relatives), it is the na.strings argument that specifies which strings are to be interpreted as missing values (NA). The default value is na.strings = "NA".
If missing values in an otherwise numeric variable column are coded as something else than "NA", e.g. "." or "N/A", these rows will be interpreted as character, and then the whole column is converted to character.
Thus, if your missing values are coded as something other than "NA", you need to specify them in na.strings.
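For example, if blanks in a numeric column are coded as "." or "N/A" (a sketch; adjust the strings and filename to your file):
df <- read.csv("file.csv", stringsAsFactors = FALSE,
               na.strings = c("NA", ".", "N/A"))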
If you're dealing with large datasets (i.e. datasets with a high number of columns), the solution noted above can be manually cumbersome, and requires you to know which columns are numeric a priori.
Try this instead.
char_data <- read.csv(input_filename, stringsAsFactors = F)
num_data <- data.frame(data.matrix(char_data))
numeric_columns <- sapply(num_data,function(x){mean(as.numeric(is.na(x)))<0.5})
final_data <- data.frame(num_data[,numeric_columns], char_data[,!numeric_columns])
The code does the following:
Imports your data as character columns.
Creates an instance of your data as numeric columns.
Identifies which columns from your data are numeric (assuming columns with less than 50% NAs upon converting your data to numeric are indeed numeric).
Merges the numeric and character columns into a final dataset.
This essentially automates the import of your .csv file by preserving the data types of the original columns (as character and numeric).
Including this in the read.csv command worked for me: strip.white = TRUE
(I found this solution here.)
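For example (a minimal sketch; the filename is a placeholder):
df <- read.csv("file.csv", strip.white = TRUE, stringsAsFactors = FALSE)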
A version for data.table, based on code from dmanuge:
library(data.table)   # required for data.table() and .SD

convNumValues <- function(ds) {
  ds <- data.table(ds)
  dsnum <- data.table(data.matrix(ds))
  # a column counts as numeric if fewer than 50% of its values turn into NA
  num_cols <- sapply(dsnum, function(x) { mean(as.numeric(is.na(x))) < 0.5 })
  nds <- data.table(dsnum[, .SD, .SDcols = names(num_cols)[which(num_cols)]],
                    ds[, .SD, .SDcols = names(num_cols)[which(!num_cols)]])
  return(nds)
}
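Usage mirrors the read.csv approach above (input_filename as before):
char_data <- read.csv(input_filename, stringsAsFactors = FALSE)
final_data <- convNumValues(char_data)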
I had a similar problem. Based on Joshua's premise that Excel was the problem, I looked at it and found that the numbers were formatted with commas between every third digit. Reformatting without commas fixed the problem.
So, I had a similar situation with my data file when I read it in as a csv. All the numeric values were turned into characters. But in my file there was a value with the word "Filtered" instead of NA. I converted "Filtered" to NA in the vim editor of a Linux terminal with the command %s/Filtered/NA/g, saved the file, and later read it into R; all the values were then numeric and no longer character.
It looks like the character value "Filtered" was forcing every value into character format.
Charu
Hello @Shawn Hemelstrand, here are the steps in detail:
Example: a matrix file, file.csv, containing the word "Filtered".
I opened file.csv in a Linux command terminal:
vi file.csv
Then I pressed Esc followed by Shift+:
and typed the following command at the bottom:
%s/Filtered/NA/g
then pressed Enter.
Then I pressed Esc followed by Shift+: again
and typed wq at the bottom (this saves the file and quits the vim editor).
Then in my R script I read the file:
data <- read.csv("file.csv", sep = ',', header = TRUE)
str(data)
All columns were numeric type, where they had earlier been character type.
In case you need more help, it would be easier to share your txt or csv file.

R - Import & Merge Multiple Excel Files And Add Filesource Variable

I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different Excel files containing data on students. Each Excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import data from Excel. Using the xlsx package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the Excel file names in the folder, and then try to merge them in one statement using a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply - startrow is not being recognized as an option, and FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment as the answer (with bdemerast picking up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR to read.xlsx2.
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
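If the FILENAMEVAR pass-through misbehaves, an alternative sketch adds the file-source column explicitly before row-binding (source_file is a name chosen here, not from the original):
library(xlsx)
dfs <- lapply(filenames, function(x) {
  d <- read.xlsx2(file=x, sheetIndex=1, startRow=2, header=TRUE)
  d$source_file <- x   # hypothetical name for the file-source variable
  d
})
merged <- do.call("rbind", dfs)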
