Category: PHP


Latin 1 to UTF-8

August 5th, 2010 — 7:35am

Long story short, I was doing some screen scraping of a web application. My web app uses Latin 1 and the target site uses UTF-8, so I had a few issues to resolve before it would work…

1. Display Issue: If the target site contained characters beyond ASCII, they weren’t displaying correctly. This was easy to resolve: I walked through the characters and replaced each one with its HTML numeric entity (such as &#233;) whenever its character code was above the base ASCII range. (Yes, I know… the scripting language that I use doesn’t offer nice libraries.)
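The character walk described above could look something like this in PHP. This is only a sketch: the function names toNumericEntities and utf8CharToCodePoint are made up for illustration, it assumes the scraped text arrives as valid UTF-8, and it decodes the bytes by hand so that it does not depend on the mbstring extension.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character into its code point.
function utf8CharToCodePoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $first = $bytes[0];
    if ($first < 0x80) {
        return $first;                  // plain ASCII, one byte
    }
    if ($first < 0xE0) {
        $codePoint = $first & 0x1F;     // lead byte of a 2-byte sequence
    } elseif ($first < 0xF0) {
        $codePoint = $first & 0x0F;     // lead byte of a 3-byte sequence
    } else {
        $codePoint = $first & 0x07;     // lead byte of a 4-byte sequence
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < count($bytes); $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

// Hypothetical helper: replace every non-ASCII character with &#NNN;.
function toNumericEntities(string $utf8Text): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $match): string {
            return '&#' . utf8CharToCodePoint($match[0]) . ';';
        },
        $utf8Text
    );
    if ($result === null) {
        // preg_replace_callback() returns null on error, e.g. invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo toNumericEntities("caf\xC3\xA9"), "\n"; // prints caf&#233;
```

ASCII characters pass through untouched, so the output is safe to serve from a Latin 1 page.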

2. Posting Issue: This was the painful one… Because my web app uses Latin 1, anything above ASCII is stored as an HTML-encoded entity. So posting to a UTF-8 site involves two steps: (1) convert the HTML-encoded characters to Unicode, (2) URL-encode the Unicode text.
2.1. Convert HTML-encoded characters to Unicode: This is very simple. Use the regex pattern “&#[0-9]{3,5};” to find each HTML-encoded character, extract the numeric part, and pass it to whatever function converts a character code to a character (in PHP, for instance, chr does this, though only for single-byte values up to 255).
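For comparison, here is a rough PHP sketch of that entity lookup. It uses the same pattern, but instead of chr it hands each match to PHP's built-in html_entity_decode and asks for UTF-8 output, so this is a swapped-in technique rather than the exact approach from my scripting language. The function name decodeNumericEntities is hypothetical.

```php
<?php
// Hypothetical helper: turn numeric entities such as &#958; back into
// UTF-8 characters, leaving everything else in the text alone.
function decodeNumericEntities(string $text): string
{
    $result = preg_replace_callback(
        '/&#[0-9]{3,5};/',
        function (array $match): string {
            // Only the matched entity is decoded; UTF-8 is requested explicitly.
            return html_entity_decode($match[0], ENT_QUOTES, 'UTF-8');
        },
        $text
    );
    // Fall back to the input if the regex engine reported an error.
    return $result ?? $text;
}

echo rawurlencode(decodeNumericEntities('&#958;')), "\n"; // prints %CE%BE
```

Because the pattern only matches three to five digits, named entities such as &amp; are not touched.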

2.2. URL-encode the Unicode text: Now that you have the Unicode text prepared, all that is left is URL-encoding! (And no, the scripting language I use doesn’t offer a native solution. And yes, I plan to switch scripting languages. So this is for people like me who are stuck with a language that offers a very limited set of libraries.)
2.2.1. My First Attempt: use curl’s --data-urlencode option. Please note that this option was added in version 7.18.0. If you are a Mac user like me, you can install or upgrade curl using MacPorts, or you can download the source and install it. Unfortunately I could not use this option, because it has one restriction: “content cannot contain = or @ symbols, as that will then make the syntax match”.
2.2.2. My Second Attempt: rely on the shell… This almost always works! I decided to use PHP’s rawurlencode. Calling it like this from the terminal works like a charm:

$ php -r "echo rawurlencode('ξακσλξ');"

Issuing the same command through the plugin that executes shell commands did not work. The plugin does not support Unicode, so it ended up returning the URL encoding of ??????.
I contacted the plugin vendor and he is planning to release Unicode support soon… so until then I needed some other way.

My co-worker found this article: http://akameng.com/web/php_utf_unicode_convertor. Ooooo, it shows the formula for converting a Unicode value (ud) to UTF-8 bytes using integer division and multiplication, without any bitwise operations, and it explains the formula too:
If ud < 128 (up to 7F hex) then UTF-8 is 1 byte long: the value of ud itself.

If ud >=128 and <=2047 (7FF hex) then UTF-8 is 2 bytes long.
byte 1 = 192 + (ud div 64)
byte 2 = 128 + (ud mod 64)
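
The formula translates to PHP roughly as follows. This is a sketch, not the article's code: the names utf8Bytes and utf8String are hypothetical, and the three-byte branch is my own addition following the standard UTF-8 layout, since the rules quoted above stop at two bytes.

```php
<?php
// Hypothetical helper: UTF-8 byte values for a Unicode value ($ud),
// using only integer division and remainder, no bitwise operators.
function utf8Bytes(int $ud): array
{
    if ($ud < 0) {
        throw new InvalidArgumentException('Negative value');
    }
    if ($ud < 128) {
        return [$ud];                               // 1 byte
    }
    if ($ud <= 2047) {
        return [192 + intdiv($ud, 64), 128 + ($ud % 64)]; // 2 bytes
    }
    if ($ud <= 65535) {
        // Assumption: not part of the quoted formula; same pattern, one more byte.
        // Surrogate values are not checked here.
        return [
            224 + intdiv($ud, 4096),
            128 + (intdiv($ud, 64) % 64),
            128 + ($ud % 64),
        ];
    }
    throw new InvalidArgumentException('Values above 65535 are not handled');
}

// Hypothetical helper: the same bytes glued into a string with chr().
function utf8String(int $ud): string
{
    return implode('', array_map('chr', utf8Bytes($ud)));
}

print_r(utf8Bytes(958)); // 958 is the Greek letter xi: bytes 206 and 190
```

For 958 this gives 192 + (958 div 64) = 206 and 128 + (958 mod 64) = 190.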

So as a final product (for now) I have code like this:

$ php -r "echo rawurlencode(chr(206).chr(190).chr(32).chr(32).chr(195).chr(169).chr(110).chr(100));"
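
The point of the chr chain is that the command line itself stays pure ASCII, so a shell plugin without Unicode support cannot mangle it. A small sketch of how such a command could be generated from byte values, with a hypothetical helper named chrExpression:

```php
<?php
// Hypothetical helper: build the chr(...) chain for a list of byte values.
function chrExpression(array $byteValues): string
{
    $calls = array_map(function (int $byte): string {
        return 'chr(' . $byte . ')';
    }, $byteValues);
    return implode('.', $calls);
}

// Byte values copied from the command above.
$byteValues = [206, 190, 32, 32, 195, 169, 110, 100];

// The ASCII-only command line that would be handed to the shell.
$command = 'php -r "echo rawurlencode(' . chrExpression($byteValues) . ');"';
echo $command, "\n";

// The same encoding done in-process, to show what the command prints.
echo rawurlencode(implode('', array_map('chr', $byteValues))), "\n";
// prints %CE%BE%20%20%C3%A9nd
```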

I am done for now.


Install NetBeans 6.9 to OS X 10.5.8

July 5th, 2010 — 11:23pm

NetBeans 6.9 is a great PHP IDE and I have been pretty happy with it. Initially, though, I almost gave up on installing it on my MacBook Pro. After the installation finished and I launched the application, I got an error saying “Cannot run on older version of Java 6 Standard Edition.
Please install Java 6 Standard Edition or newer or use --jdkhome switch to point to its installation directory.”

To solve this problem:

Applications > Utilities > Java Preferences: drag Java SE 6 to the top of the list.

If this doesn’t fix the issue, try:

Terminal: type “env”. Is JAVA_HOME defined? If so, does it point to an older version? If it does, fixing that should resolve the issue (at least it did in my case). Here is how I fixed it:

  1. My JAVA_HOME was set to

    /Library/Java/Home

    which was simply a symlink to

    /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home

  2. Rename the symlink to Home_old.
  3. Then create a new symlink pointing to 1.6 (SE 6) as follows:

    ln -s /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home /Library/Java/Home
