T-SQL Function to strip HTML tags

    Date: 02/10/05 (SQL Server)    Keywords: php, html, sql, web

    My boss called me over and asked me to write a T-SQL function on behalf of our web guy, that would arbitrarily strip out all HTML tags from a VARCHAR. In any other language like PHP, Perl, etc., I'd employ Regular Expressions to do this globally. However, T-SQL only really has PATINDEX and CHARINDEX, those aren't all that powerful.

    I did wind up coming up with an iterative solution, which I thought I'd share with everyone to get suggestions and feedback.



    CREATE TABLE #tmpText (
    id INT IDENTITY(1, 1),
    data varchar(4000)
    )

    INSERT INTO #tmpText (data)
    VALUES ('this is some html')

    INSERT #tmpText VALUES (
    'Some Name


    SOME HTML text after the body'
    )

    INSERT #tmpText VALUES (
    'Another Name

    Another HTML text after the body'
    )

    -- WHILE a '<' and '>' pair are present and the former's position is less than the latter's position value
    -- Find a '<' then the next '>' positions, and run a stuff to remove whatever's in between!

    BEGIN TRANSACTION
    WHILE EXISTS (SELECT 1
    FROM #tmpText
    WHERE PATINDEX('%<%>%', data) > 0
    )
    BEGIN
    UPDATE #tmpText
    SET data = STUFF(data, PATINDEX('%<%>%', data), CHARINDEX('>', data) - PATINDEX('%<%>%', data) + 1, '')
    WHERE PATINDEX('%<%>%', data) > 0
    END

    SELECT *
    FROM #tmpText

    ROLLBACK



    I'm going to keep trying to dissect this in order to come up with a cleaner, non-iterative solution. If anyone else already has a function that does this in one fell-swoop, rather than via looping, PLEASE share it! :-)

    UPDATE: I've made further code changes to handle a few odd circumstances and have finished the function. Thought I'd share it with everyone for a "peer review." Please comment away!


    ALTER FUNCTION uf_stripHTML
    (
       @strHTML varchar(8000),
       @flgFormat int = 1
    )
    RETURNS varchar(8000)
    AS
    BEGIN

    --------------------------------------------------------------------------------------------------
    -- Date Written: 02-10-2005
    -- Purpose: Arbitrarily strip all text between all pairs of < > tags. This SHOULD be
    -- HTML but theoretically could be other data?
    ----------------------------------------------------------------------------------------------------
    -- Input Parameters: @strHTML = The string that we are stripping HTML from.
    -- @flgFormat = If set to 1 (DEFAULT), then the function will attempt to
    -- preserve basic formatting by detecting non-breaking spaces, start and end
    -- paragraph tags, and break-return tags, and replacing them with their
    -- ASCII equivalents.
    ----------------------------------------------------------------------------------------------------
    -- Comments: This solution employs an iterative algorithm to repeatedly sweep through
    -- the variable's text, removing HTML tags one at a time.
    ----------------------------------------------------------------------------------------------------

    DECLARE @ltPosition int

    -- If flgFormat is 1, then replace pre-determined list of tags and characters with
    -- corresponding values!
    IF @flgFormat = 1
    BEGIN
       SET @strHTML = REPLACE(@strHTML, '

    ', CHAR(10) + CHAR(13))
       SET @strHTML = REPLACE(@strHTML, '

    ', CHAR(10) + CHAR(13))
       SET @strHTML = REPLACE(@strHTML, '
    ', CHAR(10))
    END


    -- Arbitrarily replace &_nbsp; (intentionally mistyped)
    SET @strHTML = REPLACE(@strHTML, '&_nbsp;', CHAR(32))

    -- STRIP OUT HTML HERE
    WHILE (SELECT PATINDEX('%<[^ ]%>%', @strHTML)) > 0
    BEGIN
       -- Must search for the correct '>' because any unmatched ones will cause errors!
       SET @ltPosition = 0
       WHILE (PATINDEX('%<[^ ]%>%', @strHTML) > @ltPosition)
          SET @ltPosition = CHARINDEX('>', @strHTML, @ltPosition + 1)

       SET @strHTML = STUFF(@strHTML, PATINDEX('%<[^ ]%>%', @strHTML), @ltPosition - PATINDEX('%<[^ ]%>%', @strHTML) + 1, '')
    END

    RETURN @strHTML

    END
    GO

    Source: http://www.livejournal.com/community/sqlserver/17939.html

« sp_cursoropen behavior || This is rather odd... »


antivirus | apache | asp | blogging | browser | bugtracking | cms | crm | css | database | ebay | ecommerce | google | hosting | html | java | jsp | linux | microsoft | mysql | offshore | offshoring | oscommerce | php | postgresql | programming | rss | security | seo | shopping | software | spam | spyware | sql | technology | templates | tracker | virus | web | xml | yahoo | home