I've just implemented TinyMCE on my website (essentially a word processor). Everything works fine except when I try to store the resulting string in a SQL Server database. I want to store the string without the HTML tags making me exceed the 8000-character limit (the tags take up most of that space). My question: is there a solution that lets me store my document with the HTML tags without shortening the document? Thanks
Some ideas I've had, though I'm not sure they'll work:
Create an if statement that checks the length; if it's greater than 8000, then split the string apart and insert the pieces into separate fields.
Maybe there is a compression feature I'm unaware of?
Paul
Can you store it as a BLOB, or possibly even FILESTREAM? I know BLOBs have a size limit of 2 GB and may be less than ideal, depending on the average size of the files you expect, because of the hit to the log file. FILESTREAM was added in SQL Server 2008 to handle large files by writing them directly to the filesystem; you enable it by setting an attribute on the varbinary type.
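As a minimal sketch of the BLOB route from ASP.NET (the Documents table and Html column are hypothetical names; an nvarchar(max) column would work similarly for plain text):

using System.Data.SqlClient;
using System.Text;

public static class DocumentStore
{
    // Hypothetical table: CREATE TABLE Documents (Id INT IDENTITY, Html VARBINARY(MAX))
    public static void SaveHtml(string connectionString, string html)
    {
        byte[] payload = Encoding.UTF8.GetBytes(html); // varbinary(max) has no 8,000-byte cap

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("INSERT INTO Documents (Html) VALUES (@html)", conn))
        {
            cmd.Parameters.AddWithValue("@html", payload);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}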
It seems like I'm hitting a 1 GB upper bound on my U-SQL input file size. Is there such a limit, and if so, how can it be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor that processes XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.
This all works pretty neatly running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3 GB for a 2.5 GB input XML, so this should fit perfectly on a single vertex per file.
This still works great using <1 GB input files on the Data Lake.
However, when trying to scale things up at the Data Lake Store, it seems the job gets terminated by hitting the 1 GB input file size boundary.
I know that streaming the outer XML and then deserializing the inner XML fragments is an alternative, but we don't want to create, and particularly maintain, too much custom code that depends on those externally managed schemas.
Therefore, raising the upper limit would be great.
I see two issues right now. One that we can address, and one for which we have a feature under development for later this year.
By default, U-SQL assumes that you want to scale out processing over your file, and it will split the file into 1 GB "chunks" for extraction. If your extractor needs to see all the data (e.g., to parse XML, JSON, or an image), you need to mark the extractor to process the file atomically (not splitting it) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    { ... }
}
Now, while a vertex has 3 GB of memory, we currently limit the memory size for a UDO like an extractor to 500 MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address it are to either make your data small enough, or, in the case of XML for example, use a streaming parsing strategy that does not allocate too much memory (e.g., use the XmlReader interface).
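As a rough sketch of that streaming strategy (assuming a hypothetical repeating <Record> element and a matching Record class generated by xsd.exe):

using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public class Record { /* hypothetical class generated by xsd.exe */ }

public static class StreamingXmlParser
{
    // Stream the outer document and deserialize one inner fragment at a time,
    // so memory stays proportional to a single record rather than the whole file.
    public static IEnumerable<Record> ReadRecords(Stream input)
    {
        var serializer = new XmlSerializer(typeof(Record));
        using (var reader = XmlReader.Create(input))
        {
            while (reader.ReadToFollowing("Record")) // hypothetical element name
            {
                using (var fragment = reader.ReadSubtree())
                {
                    yield return (Record)serializer.Deserialize(fragment);
                }
            }
        }
    }
}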
Is there any downside to using a GUID as the file name of an uploaded image, to avoid duplication?
Your filenames will be unique, true. But there won't be any meaningful way to sort them.
You could put the Unix timestamp in front of your GUID, to help sort-by-name and perform other such operations, without having to use a look-up table in your database.
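For example, a minimal sketch (the epoch-seconds prefix and .jpg extension are just illustrative choices):

using System;

// Hypothetical naming scheme: Unix timestamp prefix + GUID, so sorting by
// name approximates sorting by upload time without a database look-up.
string fileName = $"{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}_{Guid.NewGuid():N}.jpg";
// e.g. "1700000000_3f2504e04f8911d39a0c0305e82c3301.jpg"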
If you store uploaded files with a name based on the hash (e.g., SHA-1) of the file contents, then you can also store files with identical contents only once (saving space).
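A minimal sketch of that content-hash naming (assuming SHA-1 collisions are not a practical concern for your uploads):

using System;
using System.IO;
using System.Security.Cryptography;

public static class UploadNaming
{
    // Identical file contents hash to the same name, so duplicates
    // naturally collapse to a single stored file.
    public static string HashFileName(Stream content, string extension)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(content);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant() + extension;
        }
    }
}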
I think it's unique enough, but consider using your own generator based on the date and time plus some serial number; I think the names would be more expressive.
There is a downside to using GUIDs in a filename. CLSIDs are GUIDs, and some virus scanners will think that you are trying to use this exploit and will mark your files as potential malware.
See Microsoft Windows CLSID Hidden File Extension Vulnerability for more information.
In the DB I have (this is the existing situation; the DB already holds the data):
a long HTML string (nvarchar(max))
The user has:
an ASP.NET .aspx page
Mission:
extract one specific sentence from the whole HTML string and show it to the user
Options:
1) Do the stripping in SQL and transfer only the result to the client.
Pros: the network is not loaded with the whole HTML, only the result.
Cons: SQL Server works harder.
2) Send the whole HTML over the wire to the web server and do the job there.
Pros: SQL doesn't work too hard; IIS does the job with ready-made tools (HTML Agility Pack).
Cons: the network is loaded with unwanted data.
Which approach is the right one?
P.S. Let's assume the hardware is excellent.
If you need to retrieve this specific piece of data repeatedly, you should store it separately, for example in another column on the same table.
When saving the HTML, extract this piece of information for storage (also on update, to ensure that the information stays in sync).
I would suggest using an HTML parser like the HTML Agility Pack to parse and query the HTML, and doing it in the ASP.NET code that saves to the DB instead of in the DB itself (string manipulation does not have great support in databases).
This has the benefit of only retrieving the data when you need it and processing only when required.
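A minimal sketch of that extract-on-save step (the XPath is hypothetical; adjust it to however the target sentence is marked up):

using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SentenceExtractor
{
    // Parse the HTML once at save time and pull out the one sentence,
    // so reads never have to ship or parse the full document.
    public static string ExtractSentence(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical location: the first <p> with class "summary".
        var node = doc.DocumentNode.SelectSingleNode("//p[@class='summary']");
        return node?.InnerText.Trim();
    }
}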
It's always good to go for lossless data storage, because it could be useful in the future. I think the 2nd option is good: save the HTML in the DB and use the Agility Pack to parse the data. Since you mentioned the hardware is excellent, the Agility Pack should do its work easily; it works on the principles of an XML document, which is quite fast for traversing nodes.
Regards.
I would also go with the HTML Agility Pack option.
That said, if the HTML stored in the DB is well formed and you are using SQL Server 2005+, then try this to see if it helps: http://vadivel.blogspot.com/2011/10/strip-html-using-udf-in-sql-server-2005.html
I'm developing a web application that lets users upload files such as images and documents. These files divide into two categories:
binary files
document files
I want to allow users to search the documents they've uploaded, especially using full-text search. What data types should I use for these two file types?
You can store the data as binary and use full-text search to interpret the binary data and extract the textual information: .doc, .txt, .xls, .ppt, .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full-text search has to be enabled. I'm not sure how adding a full-text index will affect your system, i.e., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.
For more information look at this:
http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx
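As a minimal sketch of such a query from C# (Documents, DocContent, and FileName are hypothetical names; DocContent is the full-text-indexed varbinary(max) column):

using System.Data.SqlClient;

public static class DocumentSearch
{
    // CONTAINS runs against the full-text index rather than scanning the blobs.
    public static SqlCommand BuildSearch(SqlConnection conn, string term)
    {
        var cmd = new SqlCommand(
            "SELECT Id, FileName FROM Documents WHERE CONTAINS(DocContent, @term)", conn);
        cmd.Parameters.AddWithValue("@term", term);
        return cmd;
    }
}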
Pros:
The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.
You can also enable versioning of files, and it makes things easier for load-balanced web farms.
Cons:
You can read about them here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But storing the files in the database is something you have to do in order to search through them efficiently.
Alternatively, the other thing I could suggest is storing keywords in the database and then linking them to the files on the file share.
Here is an article discussing the use of FILESTREAM with a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx
You first need to convert the PDF to text. There are tools for this sort of thing (e.g., PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need full-text searching plus logic such as "on the same line", then you'll need to store one record per line of text. If you just want to search for text in a file, you can change the structure of your SQL schema to match your needs.
For docx files, I would convert them to RTF and search them that way while stored in SQL.
For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.
Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.
A full-text index can be created on columns of any of the following data types: CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY(MAX), IMAGE, and XML.
In addition, to use full-text search you must create a full-text index on the table against which you want to run full-text queries. For a particular SQL Server table or indexed view, you can create at most one full-text index.
These are two articles about it:
SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search
Using Full Text Search in SQL Server 2008
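For reference, a minimal sketch of that one-time setup run from C# (all object names here are hypothetical; the DDL would normally live in a deployment script):

using System.Data.SqlClient;

public static class FullTextSetup
{
    public static void CreateFullTextIndex(SqlConnection conn)
    {
        // A catalog to hold the index.
        using (var cmd = new SqlCommand("CREATE FULLTEXT CATALOG DocumentsCatalog", conn))
            cmd.ExecuteNonQuery();

        // The single allowed full-text index per table; for a varbinary(max) column,
        // TYPE COLUMN names the column holding the file extension ('.doc', '.xls', ...)
        // so SQL Server can pick the right filter when indexing.
        using (var cmd = new SqlCommand(
            "CREATE FULLTEXT INDEX ON dbo.Documents (DocContent TYPE COLUMN DocExtension) " +
            "KEY INDEX PK_Documents ON DocumentsCatalog", conn))
            cmd.ExecuteNonQuery();
    }
}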
I'm programming a site for a rent-a-car company, and I need to save exactly three pictures for every car. Which is better: storing the paths to the images in the database (in the description table) with the images saved in some folder, OR saving the pictures themselves in a table (MS SQL)?
The main question is: how big are those pictures on average?
A thorough examination by Microsoft Research (To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem) has shown:
that it's better to store stuff inside your SQL tables as long as the data is typically smaller than 256 KB
that it's better to store on filesystem if the elements are typically larger than 1 MB
The in-between is a bit of a gray area....
So: are your pictures mostly larger than 1 MB? Then store them on disk and keep a reference. Otherwise, I'd recommend storing them inside your database table.
It's better to save the path to the picture (and usually even better, just the filename). By storing the picture in the table, you are increasing the size of the tables and therefore increasing lookup time, insert time, etc., for no apparent gain.
The other thing is that you say you need to save exactly 3 pictures for each posting, so this makes me think you're using fields in your posts table such as pic1, pic2, pic3. You might want to normalize this so that you have a post_pictures table that links each post with a picture.
For instance:
post_pictures:
post_picture_id | post_id | picture_filename (you could even get away with just two fields).
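A minimal sketch of that table's DDL, kept here as a C# constant (all names are illustrative):

public static class Schema
{
    // Hypothetical DDL for the normalized link table described above:
    // one row per (post, picture) pair instead of pic1/pic2/pic3 columns.
    public const string PostPicturesDdl = @"
        CREATE TABLE post_pictures (
            post_picture_id  INT IDENTITY PRIMARY KEY,
            post_id          INT NOT NULL REFERENCES posts(post_id),
            picture_filename NVARCHAR(260) NOT NULL
        );";
}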
Depending on the horsepower of your server farm, save them in SQL. I have set up the fields as varbinary(max).
What advantage do you have in storing the images in database? Can you think of any use of the image data being stored in a DB?
I suggest you store the images in the file system rather than the database. You can use the database to store metadata for the images (path, keywords, etc.).