Comparing PHP array_shift to array_pop

February 5th, 2010 by Sjan Evardsson

I noticed a note in the PHP documentation about speed differences between array_shift() (pulling the first element off the array) and array_reverse() followed by array_pop() (which yields the same data, but reaches it by pulling the last element off the reversed array).

Since I was working on some code to convert URL pieces to program arguments (like turning /admin/users/1/edit into section=admin, module=users, id=1, action=edit – stuff we tend to do every day) I thought I would take a look at the speed differences since I have always used array_shift() for this (after turning the string into an array via explode()).
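
A minimal sketch of that URL-to-arguments conversion, assuming a fixed key order (section, module, id, action) taken from the example URL. The function name path_to_args() is hypothetical and not from any real router:

```php
<?php
// Hypothetical helper: the key names and their order are assumptions
// based on the example URL, not on any real framework.
function path_to_args($path)
{
    $keys = array('section', 'module', 'id', 'action');
    // explode() leaves empty strings for leading/trailing slashes; drop them
    // and re-index so array_shift() starts from a clean list.
    $parts = array_values(array_filter(explode('/', $path), 'strlen'));
    $args = array();
    foreach ($keys as $key) {
        if (empty($parts)) {
            break; // fewer URL pieces than keys
        }
        $args[$key] = array_shift($parts);
    }
    return $args;
}

print_r(path_to_args('/admin/users/1/edit'));
```

With only four pieces in the array, the re-indexing cost of array_shift() is too small to matter here.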

My initial tests showed that array_shift was much faster than array_reverse followed by array_pop, and I wondered why someone would say that in the first place. But then I thought about it for a bit. When using array_shift the entire remaining array has to be re-indexed every call. For a very short array (like the one I was using) this is negligible. When you start looking at much larger arrays, however, this overhead adds up quickly.

To find out roughly where the break-even point between these two methods lies I whipped up a quick script to run with arrays sized from 10^1 values up to 10^5 values. What I found is that at fewer than 100 values you are not really gaining much (if anything) by using array_reverse and array_pop versus array_shift. Once you get to the 1000 value array size, however, the differences really add up (as you can see in the logarithmic scaling of the chart below).

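[Chart: shift_vs_pop.jpg, timings for array_shift() versus array_reverse() plus array_pop() by array size, logarithmic scale]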

The code I used to generate the numbers (which are shown in the chart as averages over 3 runs, rounded to the nearest millionth of a second) is:

<?php
// Compare array_shift() against array_reverse() followed by array_pop()
// for arrays of 10^1 through 10^5 items.
$counts = array(10, 100, 1000, 10000, 100000);
foreach ($counts as $len)
{
    // Build two identical arrays holding 1..$len.
    $m2 = $m1 = array();
    $x = 1;
    while ($x <= $len)
    {
        $m2[] = $m1[] = $x;
        $x++;
    }

    echo "Timing with array_shift() for $len items\n";
    echo "000000";
    $s1 = microtime(true);
    while (!empty($m1))
    {
        $tmp = array_shift($m1);
        if ($tmp % 10 == 0)
        {
            // Back up six characters and overwrite the progress counter.
            echo chr(8),chr(8),chr(8),chr(8),chr(8),chr(8);
            echo str_pad((string) $tmp, 6, '0', STR_PAD_LEFT);
        }
    }
    $s2 = microtime(true);
    echo "\nTook ", $s2 - $s1, " seconds\n";

    echo "Timing with array_reverse and array_pop() for $len items\n";
    echo "000000";
    $s1 = microtime(true);
    // The reversal is timed as well, since it is part of this approach's cost.
    $m2 = array_reverse($m2);
    while (!empty($m2))
    {
        $tmp = array_pop($m2);
        if ($tmp % 10 == 0)
        {
            echo chr(8),chr(8),chr(8),chr(8),chr(8),chr(8);
            echo str_pad((string) $tmp, 6, '0', STR_PAD_LEFT);
        }
    }
    $s2 = microtime(true);
    echo "\nTook ", $s2 - $s1, " seconds\n";
    echo "\n";
}
?>

Cisco search patent: my concerns

December 31st, 2009 by Sjan Evardsson

An article yesterday at bnet.com about Cisco’s patent filing for search has me concerned. Instead of relying on crawling links (and obeying robots.txt) like current search engines do (or at least should), Cisco’s idea is to look into packets at the network level and pull apart network traffic to discover HTTP requests. While that may not sound so terrible, I can see a need to change the way I do some business.

I often have development work, intended for collaboration with clients, that is not discoverable at all via web crawling. It is not that there are any great secrets there (unless the client is particular about not letting anyone know what their new site will look like before it goes live) but it is not meant to be permanent, either. This means that unless you know the full URL to the documents in question you are not likely to find them. These URLs are emailed to the client so they can click on the link in their email and let me know which parts of the app work the way they want, what doesn’t work, UI changes they would like to make, etc. With the standard web-crawlers these pages will never show up in a search listing.

If a layer three network device is picking those URLs out of traffic it is passing, however, those pages might be indexed, and once indexed, added to search. Now, a week later, when the directory x79q3_zz_rev2 is trashed, there are indexed searches pointing at what will return nothing but 404. Not good for me, not good for the client and not good for the individual doing the search.

My second concern is one of bandwidth. Yes, I know, there is lots of bandwidth and “everybody is on broadband these days anyway” (I don’t know how many times I hear that). Be that as it may, the “everybody” that is on broadband is not actually everybody, and anything that adds more delay to packet routing only makes the situation worse. And what happens when user A sends a request through their ISP to get an HTTP resource? How many hops does it cross? And how many of those will be running Cisco devices? (Hint: most). How many of those Cisco devices are going to do introspection on that packet to pull out the URL? How long does that take? Now consider how many HTTP requests your browser actually makes when downloading a web page. The page itself, linked CSS files, linked JS and any images (and let’s please not even consider AJAX requests).

While the idea is novel, I don’t think it is a good idea, and I would actually hope that Cisco gets the patent and sits on it and uses it merely to bludgeon anyone who actually tries to do this.

Custom Parallels VM icons

November 24th, 2009 by Sjan Evardsson

I run a lot of VMs in Parallels. (Currently I am running 7, although not all at once, of course.) I end up with a bunch of red generic Parallels VM alias icons on my desktop. Which means that the usual visual quick clues (color, logos, etc) aren’t there and I have to look at the text underneath. Sometimes I am in a rush and start Windows Server 2008 instead of Windows 7 Pro, or Ubuntu Linux instead of Debian Linux (one is set up as a desktop and one as a server with no X).

I really wanted some custom icons for those VMs. My solution, as usual: when it doesn’t exist, make it. So I opened pvs.icns (found in the Parallels Desktop.app bundle at /Applications/Parallels Desktop.app/Contents/Resources/pvs.icns) in Icon Composer.app, selected the 512 x 512 version and copied it to the clipboard. I then pasted that into a new Photoshop document and began editing. I saved each new version as a 512 x 512 pixel png and then dropped them into img2icns.app, which converted them to the icns files I needed to customize my VM launchers.

Behold the glory:

[Image: icon_anim.gif]

They aren’t perfect, especially the Windows Server 2008, but they are different enough that it is easy to select the right VM in a heartbeat.

You can download the icns files from http://www.evardsson.com/files/parallels_icons.zip

Try out Chrome OS in a VM – even Parallels!

November 21st, 2009 by Sjan Evardsson

If you have been curious about trying out Google’s Chrome OS (or Chromium OS – they seem to call it both) there is a VMware image available for download at gdgt.com. You will need to set up an account there if you don’t already have one, but it is painless. The VM image is zipped to around 300MB so downloading is not painful at all.

If you are using VMware Player or VirtualBox or VMware Fusion (on Mac) there is nothing you need to do but open it up and go. If you are using Parallels, however, there are a couple of steps to take.

First you need to convert the vmdk to a raw disk image. To do this you will need to get Qemu (actually qemu-img, a utility that comes with Qemu). If you are on a Mac (as most Parallels users are) you can download and install Q, which is a very nice OS X port of Qemu with a GUI (which we won’t be using for this exercise).

The command to convert the disk image is slightly different if you are using the default Qemu package or the one provided with Q. If you are using the default the following should work (assuming your install of Qemu is in /usr/bin/):

/usr/bin/qemu-img convert chrome-os-0.4.22.8-gdgt.vmdk -O raw chrome.hdd

If you are using Q, the version of qemu-img that is included does not quite handle the command line switches correctly. Luckily, it defaults to raw image output. The command if you have Q installed should look like:

/Applications/Q.app/Contents/MacOS/qemu-img convert chrome-os-0.4.22.8-gdgt.vmdk chrome.hdd

Now, start up Parallels, and add a new VM. For type, set it to Other Linux and when it asks whether to create a new disk image or use an existing one tell it to use the disk image you just created.

Start the VM and enjoy(?) the browser as OS experience. Oh, and the login credentials? Your Google account.

Daylight Saving Time Headaches

November 3rd, 2009 by Sjan Evardsson

I have never been particularly fond of the concept of Daylight Saving Time (cutting one end off of a blanket and sewing it to the other end does not make a longer blanket). This time around, though, I ran into an issue involving the perfect combination of a monthly cron job, a server set to local time and the switch from Daylight Saving to Standard Time on the first of the month.

At precisely 1:14 am on the first day of the month the cron job ran, as it does the first day of every month, and picked a raffle winner for one of our client’s monthly contests. At 2:00 am the time on the server rolled back to 1:00 am in accordance with the switch to Standard Time for the US. Fourteen minutes later the job ran again, and picked another winner.

Whoops. Now our system has awarded two people a single prize. Telling the second one to get the prize that they didn’t really win would not get us any points with the client, as their customer would be upset. Likewise, charging the client for the second prize is a non-starter, as it is, in fact, our fault. When I inherited these systems I looked through all the cron jobs to get a feel for what the system is doing and when. What didn’t occur to me, however, was that jobs scheduled at the wrong time of day could fall victim to Daylight Saving/Standard Time change-overs.

Any daily job scheduled between 1:00 am and 2:00 am will run twice once a year (Daylight Saving -> Standard Time, when clocks fall back from 2:00 am to 1:00 am). Any daily job scheduled in the hour that is skipped when clocks jump ahead (Standard -> Daylight Saving, which under US rules skips 2:00 am to 3:00 am) will fail to run once a year. Weekly jobs that run in those windows on Sundays will likewise misbehave, while monthly jobs, regardless of day of the month, have a small chance of experiencing one of these issues. In this case, the job runs on the 1st, which happened to be the first Sunday in November, and bang: error.

Needless to say, we modified all the cron jobs to ensure that none of them start between 1:00 am and 2:00 am.
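
The repeated hour is easy to see with PHP's own date classes. This is only a sketch: the post says the server ran on US local time but not which zone, so America/Los_Angeles is an assumption here.

```php
<?php
// Assumption: the server's zone is America/Los_Angeles. Any US zone that
// observes Daylight Saving Time shows the same effect.
$local = new DateTimeZone('America/Los_Angeles');
$utc   = new DateTimeZone('UTC');

// Two different instants, exactly one hour apart.
$firstRun  = new DateTime('2009-11-01 08:14:00', $utc);
$secondRun = new DateTime('2009-11-01 09:14:00', $utc);

$firstRun->setTimezone($local);
$secondRun->setTimezone($local);

// Both instants have the same wall-clock time, and wall-clock time is
// what a cron schedule is matched against.
echo $firstRun->format('Y-m-d H:i T'), "\n";   // 2009-11-01 01:14 PDT
echo $secondRun->format('Y-m-d H:i T'), "\n";  // 2009-11-01 01:14 PST
```

Running the server clock on UTC, or having the job record that it has already picked a winner for the month, are other ways to avoid the double run.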

Botnet on port 23130?

September 19th, 2009 by Sjan Evardsson

Yesterday evening my roommate’s machine was botted. I got a text message to my phone from Pingdom that my site was down and I did a bit of digging and found that his machine had somewhere in the range of 80-100 open outbound connections at all times.

I notified him and he immediately went to TrendMicro House Call to clean it up. He said it found “a few things,” but he didn’t note what they were, nor did he try to isolate them so I could attempt to decompile and inspect them. Ah well, such is the world, and he had work he needed to be able to finish with his machine.

The odd thing was, once his machine was cleaned and no longer in contact I began to get a flood of TCP SYN packets and UDP packets to the server on port 23130. The size of the UDP packets (between 75 and 196 bytes) leads me to believe they were some sort of botnet commands, while the TCP SYN packets were bots trying to reconnect to their lost buddy. This definitely did not have the marks of a DDoS of any sort, as once the bot on the Windows machine was stopped (and I once again had outbound bandwidth) the packets were hitting the server in a fairly steady fashion, but not in any kind of flooding behavior. In other words, each host was trying no more than 5 times to connect via TCP and no host sent 2 identical UDP packets in a row. The reason they were hitting the server is that the packets were being sent to a specific IP address, and trying to create a new connection with that IP means you are trying to connect to the server. Without the established connections in NAT on the router, all these packets were going to the server. Unfortunately the server in question is not beefy enough to run tcpdump, even for a few minutes, and trying to alter my network enough to get my laptop in where it could sniff the packets was out of the question.

While I didn’t have tcpdumps or even extensive firewall logs I did have the abbreviated logging that takes place in messages. (I also had dmesg logs to look at, but I never realized until last night that dmesg logs are not timestamped. I wonder if that is a configuration error on my part. Right now I am too exhausted to try to figure that one out.) So, I had the log entries in /var/log/messages and there is plenty good information there – and here is what I saw, from the hours of Sep 18, 19:16:59 to Sep 19 03:06:49. (Note that the packets are still coming in, but now at a rate of somewhere around 2 attempts per hour.)

There were a total of 178,335 TCP SYN packets to port 23130, along with 33,894 UDP packets to the same port. These requests came from 1,994 unique IP addresses. Below are some interesting statistics.
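
Counts like these can be pulled out of the log with a few lines of PHP. The sketch below assumes the entries look like netfilter LOG lines with SRC=, PROTO= and DPT= fields; the actual format of my log lines is not shown here, and both the function name tally_port_hits() and the sample lines are made up for illustration.

```php
<?php
// Assumption: each log line carries SRC=<ip>, PROTO=<TCP|UDP> and DPT=<port>
// fields, in that order, as netfilter LOG output does.
function tally_port_hits(array $lines, $port)
{
    $stats = array('TCP' => 0, 'UDP' => 0, 'hosts' => array());
    foreach ($lines as $line) {
        if (!preg_match('/\bSRC=(\S+).*\bPROTO=(TCP|UDP)\b.*\bDPT=(\d+)\b/', $line, $m)) {
            continue; // not a packet log line
        }
        if ((int) $m[3] !== (int) $port) {
            continue; // some other destination port
        }
        $stats[$m[2]]++;               // per-protocol packet count
        $stats['hosts'][$m[1]] = true; // array keys give us uniqueness
    }
    $stats['hosts'] = count($stats['hosts']);
    return $stats;
}

// Made-up sample lines using documentation-only IP addresses.
$sample = array(
    'Sep 18 19:16:59 gw kernel: IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.5 PROTO=TCP SPT=4321 DPT=23130 SYN',
    'Sep 18 19:17:02 gw kernel: IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.5 PROTO=TCP SPT=4321 DPT=23130 SYN',
    'Sep 18 19:17:05 gw kernel: IN=eth0 OUT= SRC=203.0.113.7 DST=198.51.100.5 PROTO=UDP SPT=5555 DPT=23130',
    'Sep 18 19:17:09 gw kernel: IN=eth0 OUT= SRC=203.0.113.7 DST=198.51.100.5 PROTO=TCP SPT=5556 DPT=80 SYN',
);
print_r(tally_port_hits($sample, 23130));
```

For the real file, something like tally_port_hits(file('/var/log/messages'), 23130) would do, memory permitting.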

Top ISPs by number of unique hosts

ISP                                 Country         Hosts
Comcast Cable Communications        United States   129
Abovenet Communications, Inc        United States   119
Road Runner HoldCo LLC              United States   92
AT&T                                United States   77
Shaw Communications Inc.            Canada          51
Verizon Internet Services Inc.      United States   41
Cox Communications Inc.             Canada          34
Rogers Cable Communications Inc.    Canada          26
Bell Canada                         Canada          19
Charter Communications              United States   19

All countries by number of unique hosts

Country                             Hosts
United States                       643
Canada                              152
United Kingdom                      135
India                               117
China                               67
Philippines                         65
Australia                           62
Malaysia                            39
Japan                               33
Russian Federation                  32
Mauritius                           30
Netherlands                         30
Portugal                            27
Uruguay                             25
Pakistan                            22
United Arab Emirates                22
Spain                               21
Greece                              21
Romania                             19
Thailand                            19
Poland                              19
Saudi Arabia                        18
Germany                             18
France                              18
Bulgaria                            16
Norway                              16
Singapore                           16
Korea, Republic of                  15
Taiwan, Province of China           13
Brazil                              13
Viet Nam                            12
Italy                               11
Turkey                              11
Mexico                              10
Sweden                              9
Croatia                             9
Finland                             9
Israel                              9
Ukraine                             8
Hong Kong                           7
Ireland                             7
Argentina                           7
Switzerland                         6
Denmark                             6
Estonia                             6
Cyprus                              6
Czech Republic                      5
Kazakhstan                          5
Chile                               5
Qatar                               5
Belgium                             5
Sri Lanka                           5
Latvia                              4
Iran, Islamic Republic of           4
Indonesia                           3
New Zealand                         3
Slovakia                            3
Dominican Republic                  3
Serbia                              3
El Salvador                         3
Slovenia                            3
Unknown                             2
Kuwait                              2
Trinidad and Tobago                 2
Brunei Darussalam                   2
Costa Rica                          2
Bangladesh                          2
Venezuela, Bolivarian Republic of   2
Hungary                             2
Moldova, Republic of                2
Barbados                            2
Puerto Rico                         2
Aruba                               1
Malta                               1
Ecuador                             1
Bahamas                             1
Austria                             1
Peru                                1
Montenegro                          1
Angola                              1
Guatemala                           1
Paraguay                            1
Antigua and Barbuda                 1
Lithuania                           1
South Africa                        1
Palestinian Territory, Occupied     1
Aland Islands                       1
Macao                               1
Jamaica                             1
Honduras                            1
Oman                                1
Iceland                             1
Guam                                1
Bahrain                             1
Albania                             1
Nepal                               1
Luxembourg                          1
Iraq                                1
Afghanistan                         1

Edit: Mostly I am curious about the botnet in question. If anyone comes across a bot that is communicating on port 23130 please let me know what you find out about it.

Data Scrubbing: Is there a right way?

September 9th, 2009 by Sjan Evardsson

An article yesterday from ars technica got me wondering. In my former position we often “scrubbed” databases for sample data from which to work. And certainly one can see the value in working with data with personally identifiable information removed for the purposes of business or health-care informatics, service level determinations, quality of service surveys, and so on. Yet, according to a study at Carnegie Mellon University:

87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.

This seems to be the balance point: 3 pieces of non-anonymized data are enough to identify the majority of the population. (Think, “Three Strikes! You’re out!”) So what do we do when we need solid, anonymous data from which to work?

Taking the example of the health records from the article linked above, I would think that the following steps would be enough to scrub the data to the point where “reidentification” would no longer be possible. Since these are medical records we can safely (I feel) assume that randomizing the gender is a non-starter. (“Wow, according to this data 14% of all men went to the Emergency Room with pregnancy-related complications!”)

And since this data is taken from an area where the zip codes are known we are already at two strikes. So why did they not randomize the dates of birth? It would be difficult to do in the case of infants beyond a few days or weeks, since many of the health issues are related to their age in months. But for anyone over the age of 8 it should be simple enough to randomize the month and day of birth, and allow a set of ranges for randomizing the year of birth. If we assume a 20% range up or down we gain a latitude of possible years of birth which increases the older the patient actually is. Another possibility is to give everyone the same date of birth, differing only in the year. (Jan 1, xxxx).
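
As a rough sketch of that date-of-birth idea: the function name scrub_birth_date() and the $asOfYear parameter (the year the data set was extracted) are both hypothetical, and this alone is no guarantee of anonymity.

```php
<?php
// Hypothetical helper: shift the birth year by up to 20% of the patient's
// age and pick a random month and day. Expects dates as YYYY-MM-DD.
function scrub_birth_date($birthDate, $asOfYear)
{
    list($year) = explode('-', $birthDate);
    $year = (int) $year;
    $age  = $asOfYear - $year;
    if ($age <= 8) {
        // Young children need different handling (age in months matters),
        // so this sketch refuses rather than guesses.
        return null;
    }
    $spread  = (int) floor($age * 0.2);         // the range grows with age
    $newYear = $year + mt_rand(-$spread, $spread);
    $month   = mt_rand(1, 12);
    // date('t') gives the number of days in the chosen month and year.
    $daysInMonth = (int) date('t', mktime(0, 0, 0, $month, 1, $newYear));
    $day = mt_rand(1, $daysInMonth);
    return sprintf('%04d-%02d-%02d', $newYear, $month, $day);
}

echo scrub_birth_date('1959-06-15', 2009), "\n";
```

A 50-year-old ends up with a birth year anywhere in a 21-year window, while a 10-year-old only moves within 5 years, which is the increasing latitude described above.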

This of course means that any reporting done on age is meaningless, but it also means that the data can more safely be widely distributed. In cases where exact age and gender are required for study it would be better to merge data from many different areas, covering as many cities, counties, states and regions as possible. In this case we would still need to weigh the risks, as all three pieces of data would still be available, although at a much higher level of trial and error. In the case mentioned by Ars Technica the study covered seven zip codes. Perhaps spreading the information over a few hundred would make it much less worth the effort to sort through them all to try to identify individuals, and even then one would expect multiple possible hits.

The need for real data for statistical analysis and study is not going to go away. When you are considering scrubbing data to release a “sanitized” version it would be good to keep the mantra “Three Strikes! You’re out!” in mind. When it comes to data for testing software operation, however, I still think the better method is complete randomization: totally bogus data that has the look and feel of “real” data. (Which is no doubt why all the bogus users in my test dbs live in different cities, in different states, and at addresses ranging from 123 to 999 on Any Street!)

Massive Fail!

August 17th, 2009 by Sjan Evardsson

While looking for the source of database backup errors I found it, the hard way. I have been running my database on a separate hard disk from everything else, and have been getting occasional errors from the cron script that does a nightly dump. I was under the impression that this was due to latency causing the script to time out. Not the case.

Looking at the dump it seemed like everything was being written ok, at least it looked that way yesterday. (I have not had much time for site maintenance, so this has been in the “put it off until later” pile.) While looking at the script this afternoon, and trying another run, it timed out again (or so I thought). I figured I could put it off until this weekend, until I went to look at the site and got the big “Unable to contact the database” error. I went to the server and fired up mysql on the command line and discovered that there were NO DATABASES! I tried to get a file listing of the /var/lib/mysql directory and got nothing. Nada. Since I don’t seem to be able to get anything off of that disk I did a quick modification of the fstab (to remove the line mounting that drive), rebuilt the dbs from the last (failed) backup, and here I am, missing two months’ worth of data.

Can I cry now?

In case you are wondering, the Margin vs Markup page is still available (as I made it a regular page as well as a post, it was my most popular ever).

EDIT: Ah the joys of using decade-old equipment.

Error Handling and the PHP @ Operator

June 7th, 2009 by Sjan Evardsson

I have been trying to debug a plugin for WordPress (Shorten2Ping – I will keep plugging this because I think it is so nifty!) and I was running into a problem where the plugin would silently fail with nothing in the logs, no error printing to screen, just dead silence.

I turned on display_errors in php.ini for a while to see if anything would show up. Still nothing. So I started to look through the file again. I knew it was getting as far as creating the short url in bit.ly before it died, but nothing was getting entered into the database. So I started through the make_bitly_url() function and what jumped out and slapped me in the face? $json = @json_decode($response,true); That little, innocuous-looking @ was gulping the error message from a fatal error! (Namely, “Call to undefined function json_decode()”.) It turns out that I had PHP compiled with --disable-json, which is default for Gentoo unless you have json in your USE flags.

According to the PHP docs for the Error Control Operator @:

Currently the “@” error-control operator prefix will even disable error reporting for critical errors that will terminate script execution. Among other things, this means that if you use “@” to suppress errors from a certain function and either it isn’t available or has been mistyped, the script will die right there with no indication as to why.

So, if you really must suppress error messages, do so, but do so with care. In the case where a suppressed error may be fatal (as in this case) be sure to add documentation to that effect. As in “If this dies a silent death it may very well be that you do not have function xyz() enabled.”
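
One alternative to suppression is to check for the function up front and fail loudly. This is only a sketch and not the plugin's actual code; the wrapper name decode_response() is made up.

```php
<?php
// Hypothetical wrapper, not taken from Shorten2Ping.
function decode_response($response)
{
    if (!function_exists('json_decode')) {
        // Say what is wrong instead of letting @ swallow a fatal error.
        throw new RuntimeException('json_decode() is unavailable; is the JSON extension enabled?');
    }
    $data = json_decode($response, true);
    // json_decode() returns null for invalid input, and also for the
    // literal JSON value null, so tell the two apart.
    if ($data === null && strtolower(trim($response)) !== 'null') {
        throw new UnexpectedValueException('Response is not valid JSON');
    }
    return $data;
}

try {
    print_r(decode_response('{"status": "ok"}'));
} catch (Exception $e) {
    error_log($e->getMessage());
}
```

Either exception ends up in the logs, which is exactly what the @ version never managed.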

And, note to self, when debugging PHP, the first thing to do is look for and remove the error control operator.

Interesting log activity

June 6th, 2009 by Sjan Evardsson

While trying to debug the Shorten2Ping plugin (a really nifty thing, if I could get it working) I went digging through my Apache error logs looking for any PHP errors. (Well, okay, I didn’t actually dig, I just did a tail on the file.) What I saw was interesting, even though it did not help the debugging at all. In fact it kind of derailed the whole process. What I saw was an obvious attempt to find Horde on my server (which I did run temporarily a few years ago). My first guess was that there was a new exploit out for Horde. I did some digging around and found that, yes, indeedy, there is. I found the details of the exploit at securityvulns.com (which is a mirror of, or mirrored by, www.security.nnov.ru, which is where the first relevant Google link took me). Oddly enough I have not seen this show up on any other security sites yet, even though I see that the report on securityvulns.com is from March.

Anyhow, in case you are curious, here are the relevant lines from the log. (IPs have not been changed to protect the guilty.)

[Sat Jun 06 01:46:53 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/README
[Sat Jun 06 01:46:53 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde
[Sat Jun 06 01:46:54 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde2
[Sat Jun 06 01:46:55 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde3
[Sat Jun 06 01:46:56 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde-3.0.5
[Sat Jun 06 01:46:57 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde-3.0.6
[Sat Jun 06 01:46:58 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde-3.0.7
[Sat Jun 06 01:46:58 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde-3.0.8
[Sat Jun 06 01:46:59 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/horde-3.0.9
[Sat Jun 06 01:47:00 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/mail
[Sat Jun 06 01:47:01 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/email
[Sat Jun 06 01:47:02 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/webmail
[Sat Jun 06 01:47:03 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/newmail
[Sat Jun 06 01:47:03 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/mails
[Sat Jun 06 01:47:04 2009] [error] [client 81.210.76.194] File does not exist: /var/www/localhost/htdocs/evardsson.com/mailz
