An SSIS package exporting data to the flat file text. The exporting work fine, except for the blank row at the end of file. Is there a simple way to get rid of the blank row?
Ragged Right in Source and Destination Flat File Connection
9 Columns Defined with Appropriate Input Column Width and Output Column Width.
no header rows and last column delimiter CRLF.
Source
Destination
Thanks in advance.
BV
Use a Conditional Split Transformation in between source and destination to remove empty rows from data flow pipeline. In the Split transformation editor specify boolean SSIS expression in Condition column something like
LEN([Col 1]) > 0 && LEN([Col 2]) > 0 ...
Or
ID != "" && [Col 1] != "" && [Col 2] != ""
Conditional Split
Related
I have a dataframe that is sourced from an excel CSV file. because of how I build the source file, some of the columns will come in with blanks (i.e. I didnt put anything in that cell in excel). I need to set those cells to 0 before I process the data.
I read the data in using fread.
Some of the cells have come in with N/A in the value which is easy enough to replace. However some show "blank" in View(theDataFrame) and are not caught by replace_na. I cant figure out what is in these cells so i can target them and replace with 0.
I tried paste("*", theDataFrame[1, "Current Price"], "*") which printed an asterisk, two white spaces and another asterisk. so, I tried gsub("/s/s", "0", theDataFrame$`Current Price`) but that did not work.
is.empty(theDataFrame[1, "Current Price"]) is FALSE so, it's not empty. How can i figure out what the heck fread put in there so I can replace it?
Thanks!
We could use
theDataFrame[theDataFrame == ''] <- 0
I have got this .txt file outputed by a microscope to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9,header=TRUE, fill = T)
Then I started looking at the structure of the dataframe, everything seems fine except I now found an extra column in the end of the data frame named "x.1" and all rows of it are NA values. I don't see this column when I open the .txt file in excel. I suspect the problem has something to do with the column names generated by microscope, they contain quite some special characters
Below is the dataframe read by Excel(only showing the last 2 columns since I have 132 columns, and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [µm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before but I never took it too serious as I was always interested in a subset of my data frame. Now I'm looking at all columns then this starts to bothering me.
Is there any way I can read them correctly without R attaching that additional "X.1" column in the end? Preferably not manually delete or subset out the last column...
Cheers,
ML
If all other column names are correct, you have probably a trailing \t in the text file. R tries to include it and gives it the generic column name X.1.
You could try and read the file first as 'plain text' and remove the trailing \t and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection )
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs)
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it was a file (using the argument text)
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip = 9,header=TRUE, fill = T)
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS (*csv)" instead of what I originally had as "CSV (Comma delimited)(*csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered similar problem. I found that three extra columns were created (X, X.1, X.2), after I loaded dataset from excel sheet to R studio.
Steps Followed by me:
a) I went to the excel sheet and selected those three extra columns after last column with actual values in excel sheet. Selected extra 3 columns by keeping cursor on top of columns and then right click the mouse and select delete.
b) Again loaded that excel sheet in R. I did not find those 3 columns.
So I'm trying to create a loop which reads input files based on the value present in a column 4 of my data frame.
for (idx in 1:nrow(config_path_test)) {
if (config_path_test[idx,2] == "ens84" & is.na(config_path_test[idx,4]))
{#process Raw data}
else {
if(config_path_test[idx,2] == "ens84" & !is.na(config_path_test[idx,4]))
{#process TPM data}
}
Tested my code with column 4 containing paths to my data in all fields, no paths in any of the fields or a combination of both (some have paths, some will not).
Code above deals with both all and no fields succesfully.
However, for the combination of both, I'm a bit stuck.
As no paths being present is reflected for the entire column as NA, i've been using:
is.na() and !is.na()
However, a combination of both will not show NA for the missing values. The fields are blank.
Any idea how to change my code to process the blank fields? Or any idea on how to process the blank values to NA?
Thanks in advance
If I am understanding the issue correctly, the files you have read in have some ("") fields that you would like to cast to NA? If that is the case, do the following outside your function:
df[df == ""] <- NA
Do check to ensure it is not something similar, such as (" "). This thread might also be of interest to you.
I have a folder with tons of txt files from where I have to extract especific data. The problem is that the format of the file has changed once and the position of the data I need to extract has also changed. So I need to deal with files in different format.
To try to make it more clear, in column 4 I have the name of the variable and in 5 I have the value, but sometimes this is in a different row. Is there a way to find the name of the variable (in which row) and then extract its value?
Thanks in advance
EDITING
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But in some point in life, there was a change in the software to add another variable and the new file iis like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these 2 types of data, extracting the same variables which are in different rows.
If these files can't be read with read.table use readLines and then find those lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
text = readLines(con = file("your filename here"))
curr = text[grepl("^Current", text, ignore.case = T)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
apologies in advanced if this question has been asked before but I couldn't find anything on it.
Right now, I'm attempting to take certain columns from files, and to name the columns the names of those files. I've done it before, and I know it's not too difficult, but I am running into a lot of trouble. MY code is as follows (allfiles is declared earlier in the code as all of the files in that directory)
makelist<-function(list_text){
if (list_text == "squared_median " || list_text == "squared_median_ranked"
|| list_text == "value_median " || list_text == "value_median_ranked")
metric = "median"
else
metric = "avg"
currfiles=allfiles[grepl(list_text,allfiles)]
currfile=currfiles[1]
currtable=read.table(currfile, header=T, sep='\t',stringsAsFactors = F)
a<-cbind(gene=currtable[,1],paste0(currfile)=currtable[,metric])
#col.name(a[,ncol(a)])<-currfile
#names(a)[ncol(a)]<-as.character(currfile)
for(currfile in currfiles[2:length(currfiles)])
{
currtable=read.table(currfile, header=T, sep='\t', stringsAsFactors=F)
if (length(currtable[,metric]) > length(a[,1]))
apply(a,2, function(x) length(x) = length(currtable[,metric]))
a=cbind(a, "gene"=currtable[,1],currfile=currtable[,metric])
#names(a)[ncol(a)]<-paste(currfile)
}
#names(a)=c("gene", currfiles[1], "gene", currfiles[2],"gene", currfiles[3],"gene", currfiles[4])
write.table(a, paste(output_folder, list_text,".txt"),sep='\t',quote=F,row.names=F)
}
Essentially, I'm passing in a string that is used to gather certain files from a directory. From, there the code grabs the median or average column from that file, and names the column the file from which it got that information. I've tried loads of different ways with no success. The commented ways are ways that did not work -- either they left the column name blank, or named it the literal variable name "currfile" as opposed to the file name which it contains. I've gone as far as individually renaming all of the columns with
names(a)=c("gene", currfiles[1], "gene", currfiles[2]...currfiles[n])
And that just names every other column currfiles.
Can you help me identify what's wrong? I've tried setting the name as get(currfile) too and that won't let me run the script.
These lines
#col.name(a[,ncol(a)])<-currfile
#names(a)[ncol(a)]<-as.character(currfile)
Have left me with blank column names.
** as an aside the lines with the if statement concerning length are supposed to extend the length of each column to the latest longest column, but doesn't seem to be working. That could be something else I'll read up about a bit more.
Thanks for your help,
Mike
To set column names of a table, you use colnames(table) (ref).
In your case, I'd expect colnames(a)[-1] <- currfile to do the trick, if I am understanding currently that you want to name the last column of the table a with the string in variable currfile.