Date: 02/10/05 (SQL Server) Keywords: php, html, sql, web My boss called me over and asked me to write a T-SQL function on behalf of our web guy, that would arbitrarily strip out all HTML tags from a VARCHAR. In any other language like PHP, Perl, etc., I'd employ Regular Expressions to do this globally. However, T-SQL only really has PATINDEX and CHARINDEX, those aren't all that powerful. SOME HTML text after the body' ) INSERT #tmpText VALUES ( ' Another HTML text after the body' ) -- WHILE a '<' and '>' pair are present and the former's position is less than the latter's position value -- Find a '<' then the next '>' positions, and run a stuff to remove whatever's in between! BEGIN TRANSACTION WHILE EXISTS (SELECT 1 FROM #tmpText WHERE PATINDEX('%<%>%', data) > 0 ) BEGIN UPDATE #tmpText SET data = STUFF(data, PATINDEX('%<%>%', data), CHARINDEX('>', data) - PATINDEX('%<%>%', data) + 1, '') WHERE PATINDEX('%<%>%', data) > 0 END SELECT * FROM #tmpText ROLLBACK I'm going to keep trying to dissect this in order to come up with a cleaner, non-iterative solution. If anyone else already has a function that does this in one fell-swoop, rather than via looping, PLEASE share it! :-) UPDATE: I've made further code changes to handle a few odd circumstances and have finished the function. Thought I'd share it with everyone for a "peer review." Please comment away! ALTER FUNCTION uf_stripHTML ( @strHTML varchar(8000), @flgFormat int = 1 ) RETURNS varchar(8000) AS BEGIN ---------------------------------------- -- Date Written: 02-10-2005 -- Purpose: Arbitrarily strip all text between all pairs of < > tags. This SHOULD be -- HTML but theoretically could be other data? ---------------------------------------- -- Input Parameters: @strHTML = The string that we are stripping HTML from. -- @flgFormat = If set to 1 (DEFAULT), then the function will attempt to -- preserve basic formatting by detecting non-breaking spaces, start and end -- paragraph tags, and break-return tags, and replacing them with their -- ASCII equivalents. ---------------------------------------- -- Comments: This solution employs an iterative algorithm to repeatedly sweep through -- the variable's text, removing HTML tags one at a time. ---------------------------------------- DECLARE @ltPosition int -- If flgFormat is 1, then replace pre-determined list of tags and characters with -- corresponding values! IF @flgFormat = 1 BEGIN SET @strHTML = REPLACE(@strHTML, ' ', CHAR(10) + CHAR(13)) SET @strHTML = REPLACE(@strHTML, ' ', CHAR(10)) END -- Arbitrarily replace &_nbsp; (intentionally mistyped) SET @strHTML = REPLACE(@strHTML, '&_nbsp;', CHAR(32)) -- STRIP OUT HTML HERE WHILE (SELECT PATINDEX('%<[^ ]%>%', @strHTML)) > 0 BEGIN -- Must search for the correct '>' because any unmatched ones will cause errors! SET @ltPosition = 0 WHILE (PATINDEX('%<[^ ]%>%', @strHTML) > @ltPosition) SET @ltPosition = CHARINDEX('>', @strHTML, @ltPosition + 1) SET @strHTML = STUFF(@strHTML, PATINDEX('%<[^ ]%>%', @strHTML), @ltPosition - PATINDEX('%<[^ ]%>%', @strHTML) + 1, '') END RETURN @strHTML END GO Source: http://www.livejournal.com/community/sqlserver/17939.html
|