Tuesday, February 16, 2010

Using HTML decode to transform/cleanse rich text data

Recently I was tasked with transforming incoming data from a transactional system. The data was generated by a rich text editor and therefore had HTML tags embedded in it. I wanted to strip out the HTML elements and reduce the data to plain text. To get the rich text data I called a web service exposed by the system. In addition, the data was HTML-encoded during the SOAP request to the web service.

For example, the string "test" was stored as: "&lt;div class="user"&gt;test&lt;/div&gt;"

You may notice that the opening and closing angle brackets < and > were converted to &lt; and &gt;. To get the string back in its original form I used the HttpUtility.HtmlDecode() function. Decoding the above string once produced: <div class="user">test</div>. That was expected, but certain records looked like this: &lt;div class="user"&gt;Gareth&#8217;s advice was not followed.&lt;/div&gt;

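Decoding these once was not enough: the &#8217; entity was left as it was instead of becoming an apostrophe, which suggests that those records had been encoded twice. I decoded the string once more and got the expected output:
<div class="user">Gareth’s advice was not followed.</div>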
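
The two decoding passes can be generalized to "decode until nothing changes". Here is a minimal sketch of that idea, assuming VB.NET with a reference to System.Web. DecodeFully and maxPasses are hypothetical names of my own choosing, not part of the original code or of the framework.

```vb
Imports System.Web

Module DecodeSketch
    ' Hypothetical helper: keep decoding until the text stops changing.
    ' maxPasses is an assumed safety cap so odd input cannot loop for long.
    Function DecodeFully(ByVal input As String, Optional ByVal maxPasses As Integer = 5) As String
        Dim current As String = input
        For pass As Integer = 1 To maxPasses
            Dim decoded As String = HttpUtility.HtmlDecode(current)
            ' A pass that changes nothing means there is nothing left to decode
            If decoded = current Then Exit For
            current = decoded
        Next
        Return current
    End Function
End Module
```

One caveat: if the text is supposed to contain a literal entity after decoding, this will decode that too, so it only suits data that is known to be over-encoded.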

Now I had to remove the HTML tags and extract the text between them. Here's the VB code for this:

'Delete text between angled brackets
mStartPos = InStr(strContent, "<")
mEndPos = 0
'Look for the closing bracket only after the opening one
If mStartPos > 0 Then mEndPos = InStr(mStartPos, strContent, ">")
Do While mStartPos > 0 And mEndPos > mStartPos
    'Cut out the tag, brackets included, wherever it occurs
    mString = Mid(strContent, mStartPos, mEndPos - mStartPos + 1)
    strContent = Replace(strContent, mString, "")
    mStartPos = InStr(strContent, "<")
    mEndPos = 0
    If mStartPos > 0 Then mEndPos = InStr(mStartPos, strContent, ">")
Loop

'Remove leading carriage returns and line feeds
Do While Left(strContent, 1) = Chr(13) Or Left(strContent, 1) = Chr(10)
    strContent = Mid(strContent, 2)
Loop

txt = strContent
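
For comparison, the same stripping can be written with a regular expression. This is a different technique from the loop above, not the code I used. It is a rough sketch that assumes VB.NET, and StripTags is a hypothetical name. Like the loop, it treats anything between < and > as a tag, so it is not a real HTML parser.

```vb
Imports System.Text.RegularExpressions

Module StripTagsSketch
    ' Hypothetical helper: removes anything that looks like <...>
    Function StripTags(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty
        Dim stripped As String = Regex.Replace(html, "<[^>]*>", String.Empty)
        ' Drop leading carriage returns and line feeds, as the second loop does
        Return stripped.TrimStart(ControlChars.Cr, ControlChars.Lf)
    End Function

    Sub Main()
        ' Prints: test
        Console.WriteLine(StripTags("<div class=""user"">test</div>"))
    End Sub
End Module
```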
