[nycphp-talk] htmlentities charset bug

Michael B Allen ioplex at gmail.com
Wed Jan 23 12:58:17 EST 2008


On 1/23/08, Cliff Hirsch <cliff at pinestream.com> wrote:
> On 1/23/08 10:10 AM, "csnyder" <chsnyder at gmail.com> wrote:
> > On Jan 22, 2008 4:11 PM, Cliff Hirsch <cliff at pinestream.com> wrote:
> >
> >>  Reason: Invalid multibyte sequence in argument
> >>
> >>  Root cause: cut and pasting text from MS Word in XP.
> >
> > Neat. Any idea what the offending character or sequence was?
> Oh yeah, curly single and double quotes, or whatever the proper name for
> them is.
>
> The sequence was:
> 1. Wrote FAQ in Word on Windows XP.
> 2. Cut and pasted entries into database (can't recall whether it was through
> my app or phpMyAdmin)
> 3. Errors start appearing.
>
> Those curly single and double quotes are killers.

The problem isn't htmlentities, it's the charset your pages are
emitted in. If you emit an HTML form in ISO-8859-1 and then submit
garbage data, the database may store it as garbage and now you have a
simple garbage-in / garbage-out scenario. Feed that to htmlentities
and tell it the data is in a multibyte charset like UTF-8 when the
bytes are really something else, and you'll get an "Invalid multibyte
sequence" error.
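
To make the garbage-in part concrete, here is a rough sketch in Python
(not PHP, just to look at the raw bytes). It assumes the browser
submitted Word's curly double quotes as the Windows-1252 bytes 0x93 and
0x94, which is what typically happens on a page labeled ISO-8859-1. The
sample string and the helper name is_valid_utf8 are made up for
illustration.

```python
# Assumed sample: "Hello" wrapped in Word's curly double quotes,
# stored as Windows-1252 bytes (0x93 and 0x94).
pasted = b"\x93Hello\x94"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Read as Windows-1252, the bytes are the curly quotes U+201C and U+201D.
as_cp1252 = pasted.decode("cp1252")

# Read as strict ISO-8859-1, the same bytes are the invisible C1 control
# characters U+0093 and U+0094, not quotes at all.
as_latin1 = pasted.decode("iso-8859-1")

# Read as UTF-8, the bytes are not a valid sequence.
print(is_valid_utf8(pasted))
print([hex(ord(ch)) for ch in as_cp1252])
print([hex(ord(ch)) for ch in as_latin1])
```

In PHP, mb_check_encoding() does the same kind of validity check on a
string before you hand it to htmlentities.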

Technically I think the browser should see that the page is ISO-8859-1
and squash invalid sequences to some default character like '?' when
you paste it into the form, but accepting the data as-is is more
forgiving I suppose. If the browser were really sophisticated about it,
it could pop up a dialog that warns you and asks whether you would like
to transliterate those characters to ISO-8859-1 equivalent glyphs.
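
The kind of transliteration I mean could look something like this
sketch, again in Python. The mapping table and the function name are
hypothetical and only cover the four curly quotes from this thread. A
real table would need more entries (dashes, ellipsis and so on).

```python
# Hypothetical mapping from Word's curly quotes to plain quotes that
# exist in ISO-8859-1. Extend it for other characters as needed.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def transliterate_quotes(text: str) -> str:
    """Replace curly quotes with their plain equivalents."""
    return text.translate(SMART_TO_PLAIN)


cleaned = transliterate_quotes("\u201cIt\u2019s fine\u201d")

# After the mapping, the text encodes to ISO-8859-1 without errors.
latin1_bytes = cleaned.encode("iso-8859-1")
print(cleaned)
```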

Use UTF-8 and the problem will go away. At least you won't get an
error. The curly quotes will remain curly quotes, though. If you don't
spot them and fix them, and someone swipes that text off the page and
pastes it into an email that's ISO-8859-1, they'll be squashed to '?'
or sometimes ugly rectangles.
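
Here is a small Python sketch of both halves of that: the quotes
survive a UTF-8 round trip, and they get squashed to '?' when the same
text is forced into ISO-8859-1. The sample string is made up, and the
errors="replace" mode stands in for whatever a given mail client
actually does with characters it can't encode.

```python
# Assumed sample text with curly double quotes (U+201C, U+201D).
text = "\u201cquoted\u201d"

# UTF-8 can represent the quotes, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ISO-8859-1 has no curly quotes. With errors="replace", each
# unencodable character becomes a '?'.
squashed = text.encode("iso-8859-1", errors="replace")

print(round_trip == text)
print(squashed)
```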

I always use UTF-8.

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


