Archive for September, 2009

Botnet on port 23130?

September 19th, 2009 by sjan

Yesterday evening my roommate’s machine was botted. I got a text message from Pingdom saying my site was down, did a bit of digging, and found that his machine had somewhere in the range of 80-100 open outbound connections at all times.

I notified him and he immediately went to Trend Micro HouseCall to clean it up. He said it found “a few things,” but he didn’t note what they were, nor did he try to isolate them so I could attempt to decompile and inspect them. Ah well, such is the world, and he had work he needed to be able to finish with his machine.

The odd thing was, once his machine was cleaned and no longer in contact, I began to get a large volume of TCP SYN packets and UDP packets to the server on port 23130. The size of the UDP packets (between 75 and 196 bytes) leads me to believe they were some sort of botnet commands, while the TCP SYN packets were bots trying to reconnect to their lost buddy. This definitely did not have the marks of a DDoS of any sort: once the bot on the Windows machine was stopped (and I once again had outbound bandwidth), the packets were hitting the server at a fairly steady rate, but not in any kind of flooding behavior. In other words, each host was trying no more than 5 times to connect via TCP, and no host sent 2 identical UDP packets in a row.

The reason they were hitting the server is that the packets were being sent to a specific IP address, and trying to create a new connection with that IP means you are trying to connect to the server. Without the established connections in NAT on the router, all these packets were going to the server. Unfortunately the server in question is not beefy enough to run tcpdump, even for a few minutes, and trying to alter my network enough to get my laptop in where it could sniff the packets was out of the question.

While I didn’t have tcpdumps or even extensive firewall logs, I did have the abbreviated logging that takes place in messages. (I also had dmesg logs to look at, but I never realized until last night that dmesg logs are not timestamped. I wonder if that is a configuration error on my part. Right now I am too exhausted to try to figure that one out.) So, I had the log entries in /var/log/messages, and there is plenty of good information there. Here is what I saw between Sep 18 19:16:59 and Sep 19 03:06:49. (Note that the packets are still coming in, but now at a rate of somewhere around 2 attempts per hour.)

There were a total of 178,335 TCP SYN packets to port 23130, along with 33,894 UDP packets to the same port. These requests came from 1,994 unique IP addresses. Below are some interesting statistics.
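Totals like these can be pulled out of the log with a few lines of script. The following is only a sketch: it assumes iptables-style kernel log lines with SRC=, PROTO= and DPT= fields, which may not match what this firewall actually writes, and the sample lines are invented using documentation-only addresses.

```python
import re
from collections import Counter

# ASSUMPTION: iptables-style log lines. The real format in /var/log/messages
# may differ, in which case this pattern needs adjusting.
LINE_RE = re.compile(
    r"SRC=(?P<src>\d+\.\d+\.\d+\.\d+).*?PROTO=(?P<proto>TCP|UDP).*?DPT=(?P<dpt>\d+)"
)

def summarize(lines, port=23130):
    """Count packets per protocol and per source IP for one destination port."""
    by_proto = Counter()
    by_host = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and int(m.group("dpt")) == port:
            by_proto[m.group("proto")] += 1
            by_host[m.group("src")] += 1
    return by_proto, by_host

# Invented sample lines (addresses from the RFC 5737 documentation ranges).
sample = [
    "Sep 18 19:16:59 gw kernel: IN=eth0 OUT= SRC=192.0.2.10 DST=203.0.113.5 PROTO=TCP SPT=4021 DPT=23130 SYN",
    "Sep 18 19:17:02 gw kernel: IN=eth0 OUT= SRC=192.0.2.10 DST=203.0.113.5 PROTO=UDP SPT=4022 DPT=23130 LEN=83",
    "Sep 18 19:17:05 gw kernel: IN=eth0 OUT= SRC=198.51.100.7 DST=203.0.113.5 PROTO=TCP SPT=1187 DPT=23130 SYN",
    "Sep 18 19:17:09 gw kernel: IN=eth0 OUT= SRC=198.51.100.7 DST=203.0.113.5 PROTO=TCP SPT=1188 DPT=80 SYN",
]

by_proto, by_host = summarize(sample)
print("TCP:", by_proto["TCP"], "UDP:", by_proto["UDP"], "unique hosts:", len(by_host))
```

To run it against a real log, pass an open file object instead of the sample list. The per-host counter is also a quick way to check how many times any single host tried to connect.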

Top ISPs by number of unique hosts
ISP                               Country        Hosts
Comcast Cable Communications      United States  129
Abovenet Communications, Inc      United States  119
Road Runner HoldCo LLC            United States  92
AT&T                              United States  77
Shaw Communications Inc.          Canada         51
Verizon Internet Services Inc.    United States  41
Cox Communications Inc.           Canada         34
Rogers Cable Communications Inc.  Canada         26
Bell Canada                       Canada         19
Charter Communications            United States  19

All countries by number of unique hosts
Country Hosts
United States 643
Canada 152
United Kingdom 135
India 117
China 67
Philippines 65
Australia 62
Malaysia 39
Japan 33
Russian Federation 32
Mauritius 30
Netherlands 30
Portugal 27
Uruguay 25
Pakistan 22
United Arab Emirates 22
Spain 21
Greece 21
Romania 19
Thailand 19
Poland 19
Saudi Arabia 18
Germany 18
France 18
Bulgaria 16
Norway 16
Singapore 16
Korea, Republic of 15
Taiwan, Province of China 13
Brazil 13
Viet Nam 12
Italy 11
Turkey 11
Mexico 10
Sweden 9
Croatia 9
Finland 9
Israel 9
Ukraine 8
Hong Kong 7
Ireland 7
Argentina 7
Switzerland 6
Denmark 6
Estonia 6
Cyprus 6
Czech Republic 5
Kazakhstan 5
Chile 5
Qatar 5
Belgium 5
Sri Lanka 5
Latvia 4
Iran, Islamic Republic of 4
Indonesia 3
New Zealand 3
Slovakia 3
Dominican Republic 3
Serbia 3
El Salvador 3
Slovenia 3
Unknown 2
Kuwait 2
Trinidad and Tobago 2
Brunei Darussalam 2
Costa Rica 2
Bangladesh 2
Venezuela, Bolivarian Republic of 2
Hungary 2
Moldova, Republic of 2
Barbados 2
Puerto Rico 2
Aruba 1
Malta 1
Ecuador 1
Bahamas 1
Austria 1
Peru 1
Montenegro 1
Angola 1
Guatemala 1
Paraguay 1
Antigua and Barbuda 1
Lithuania 1
South Africa 1
Palestinian Territory, Occupied 1
Aland Islands 1
Macao 1
Jamaica 1
Honduras 1
Oman 1
Iceland 1
Guam 1
Bahrain 1
Albania 1
Nepal 1
Luxembourg 1
Iraq 1
Afghanistan 1

Edit: Mostly I am curious about the botnet in question. If anyone comes across a bot that is communicating on port 23130 please let me know what you find out about it.

Data Scrubbing: Is there a right way?

September 9th, 2009 by sjan

An article yesterday from Ars Technica got me wondering. In my former position we often “scrubbed” databases for sample data from which to work. And certainly one can see the value in working with data with personally identifiable information removed for the purposes of business or health-care informatics, service level determinations, quality of service surveys, and so on. Yet, according to a study at Carnegie Mellon University:

87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.

This seems to be the balance point: 3 pieces of non-anonymized data are enough to identify the majority of the population. (Think, “Three Strikes! You’re out!”) So what do we do when we need solid, anonymous data from which to work?
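One way to see how exposed a given data set is would be to count how many records are the only ones with their particular combination of those three fields. Here is a minimal sketch; the field names and the sample records are made up for illustration and are not taken from the study.

```python
from collections import Counter

def unique_fraction(records, keys=("zip", "gender", "dob")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Invented records; the field names are hypothetical.
records = [
    {"zip": "00001", "gender": "F", "dob": "1970-03-14"},
    {"zip": "00001", "gender": "F", "dob": "1970-03-14"},
    {"zip": "00001", "gender": "M", "dob": "1970-03-14"},
    {"zip": "00002", "gender": "F", "dob": "1982-11-02"},
]

print(unique_fraction(records))  # 0.5: two of the four records are one of a kind
```

Running the same check before and after scrubbing gives a rough idea of whether the scrubbing actually bought anything.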

Taking the example of the health records from the article linked above, I would think that the following steps would be enough to scrub the data to the point where “reidentification” would no longer be possible. Since these are medical records we can safely (I feel) assume that randomizing the gender is a non-starter. (“Wow, according to this data 14% of all men went to the Emergency Room with pregnancy-related complications!”)

And since this data is taken from an area where the zip codes are known, we are already at two strikes. So why did they not randomize the dates of birth? It would be difficult to do in the case of infants beyond a few days or weeks, since many of their health issues are related to their age in months. But for anyone over the age of 8 it should be simple enough to randomize the month and day of birth, and allow a range for randomizing the year of birth. If we assume a range of 20% of the patient’s age up or down, we gain a latitude of possible years of birth that increases the older the patient actually is. Another possibility is to give everyone the same date of birth, differing only in the year (Jan 1, xxxx).

This of course means that any reporting done on age is meaningless, but it also means that the data can more safely be widely distributed. In cases where exact age and gender are required for study it would be better to merge data from many different areas, covering as many cities, counties, states and regions as possible. In this case we would still need to weigh the risks, as all three pieces of data would still be available, although at a much higher level of trial and error. In the case mentioned by Ars Technica the study covered seven zip codes. Perhaps spreading the information over a few hundred would make it much less worth the effort to sort through them all to try to identify individuals, and even then one would expect multiple possible hits.

The need for real data for statistical analysis and study is not going to go away. When you are considering scrubbing data to release a “sanitized” version, it would be good to keep the mantra “Three Strikes! You’re out!” in mind. When it comes to data for testing software operation, however, I still think the better method is complete randomization: totally bogus data that has the look and feel of “real” data. (Which is no doubt why all the bogus users in my test dbs live in different cities, in different states, and at addresses ranging from 123 to 999 on Any Street!)
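For what it’s worth, generating that kind of bogus test data takes very little code. This is just a sketch of the idea, not the script I actually use; the name lists, city names and field names are all invented for the example.

```python
import random

# Invented placeholder values; none of these refer to real people or places.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Robin"]
LAST_NAMES = ["Example", "Sample", "Tester", "Placeholder"]
CITIES = [("Anytown", "AA"), ("Otherville", "BB"), ("Somewhere", "CC")]

def fake_user(rng):
    """Build one bogus user record that looks real but maps to nobody."""
    city, state = rng.choice(CITIES)
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "address": f"{rng.randint(123, 999)} Any Street",
        "city": city,
        "state": state,
        "zip": f"{rng.randint(0, 99999):05d}",
    }

rng = random.Random()
users = [fake_user(rng) for _ in range(5)]
for user in users:
    print(user)
```

Since nothing here is derived from a real record, there is nothing to reidentify, which is the whole point.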