Author Archives: Marco

Intercepting arbitrary network traffic with socat

Socat can be used to intercept, view, and then forward arbitrary network traffic.

Viewing MySQL protocol

Run this in one terminal:

$ socat -v TCP-LISTEN:3307 TCP:localhost:3306

This will cause socat to listen on port 3307 and forward traffic to localhost port 3306, where you presumably have MySQL running.

Now you can connect to MySQL via:

$ mysql -u root -h 127.0.0.1 -P 3307 -p

This causes the MySQL client to connect to the socat listener, which in turn forwards traffic to your MySQL instance and prints the data as it passes through. Some of the data is unprintable binary bytes, which are represented as periods, but the parts of the MySQL protocol which are text can still be gleaned.

\ tables
< 2018/06/04 23:02:24.011748  length=117 from=463 to=579
TABLE_NAME\f!................users\a......"...
> 2018/06/04 23:02:24.012169  length=11 from=291 to=301
\a....users.
< 2018/06/04 23:02:24.061067  length=61 from=580 to=640
\f!............\a..........
> 2018/06/04 23:02:31.717434  length=24 from=302 to=325
 * from users
< 2018/06/04 23:02:31.767557  length=102 from=641 to=742
\f!................Tom.....Jeff.....Scott\a..\a..."...
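The same intercept-and-forward idea can be sketched in a few lines of Python. This is a toy stand-in for socat, not the tool itself; the listen port and target below are assumptions matching the socat example above.

```python
import socket
import threading

# Rough equivalent of `socat -v TCP-LISTEN:3307 TCP:localhost:3306`:
# accept a connection, dial the real server, and pump bytes in both
# directions while printing everything that passes through.
LISTEN_PORT = 3307
TARGET = ("localhost", 3306)

def pump(src, dst, label):
    # Copy bytes from src to dst, printing each chunk with a direction label.
    while True:
        data = src.recv(4096)
        if not data:
            break
        print(label, repr(data))
        dst.sendall(data)

def serve_once():
    # Handle a single client connection, then return.
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", LISTEN_PORT))
    server.listen(1)
    client, _ = server.accept()
    upstream = socket.create_connection(TARGET)
    threading.Thread(target=pump, args=(client, upstream, ">"), daemon=True).start()
    pump(upstream, client, "<")
```

Unlike socat's `-v` output this prints raw Python byte reprs rather than the period-substituted text dump, but the forwarding mechanics are the same.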

Setup MySQL replication

CHANGE MASTER TO
  MASTER_HOST = 'production-db',
  MASTER_PORT = 3306,
  MASTER_USER = 'root',
  MASTER_PASSWORD = 'password',
  MASTER_LOG_FILE = 'mysql-bin-changelog.00000x',
  MASTER_LOG_POS = 12345;

Show MySQL replication status

MariaDB [production]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: production-db
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin-changelog.000021
          Read_Master_Log_Pos: 106457060
               Relay_Log_File: mysqld-relay-bin.000031
                Relay_Log_Pos: 64715845
        Relay_Master_Log_File: mysql-bin-changelog.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 64715536
              Relay_Log_Space: 1045994443
              Until_Condition: None
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
        Seconds_Behind_Master: 114732
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
               Last_SQL_Errno: 0
             Master_Server_Id: 65224191
                   Using_Gtid: No
                Parallel_Mode: conservative
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Write_rows_log_event::write_row(-1)
1 row in set (0.00 sec)


Show master bin log position

mysql> show master status;
+----------------------------+-----------+--------------+------------------+-------------------+
| File                       | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+-----------+--------------+------------------+-------------------+
| mysql-bin-changelog.000083 | 118809022 |              |                  |                   |
+----------------------------+-----------+--------------+------------------+-------------------+

Comparing EBS and SSD volumes on i3.large (AWS)

As part of a project to explore migrating an AWS Aurora MySQL database to an EC2 instance, I performed several tests of EC2 SSD/EBS disk IOPS and bandwidth, which I’ll share below.

Why migrate from Aurora to something else? Aurora is great, but AWS charges $0.20 per million IO requests, and for some intensive workloads the IO operations required to serve queries over a month really add up; in my experience it’s not uncommon for heavily loaded database instances to run into the tens of billions of IO requests per month. 10 billion IO requests cost $2,000, and that’s not including the cost of the instance itself. Of course this all depends on your application’s pattern and frequency of data access, but a few billion IO operations really isn’t that much IO, especially if you’re running any sort of periodic ETL jobs against the database or one of its replicas.
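To make the cost claim concrete, here is the arithmetic at the quoted $0.20 per million IO requests:

```python
# Aurora IO pricing quoted above: $0.20 per million IO requests.
PRICE_PER_MILLION_IO = 0.20

def monthly_io_cost(io_requests):
    # Cost in dollars for a month's worth of IO requests.
    return io_requests / 1_000_000 * PRICE_PER_MILLION_IO

cost = monthly_io_cost(10_000_000_000)  # 10 billion IOs -> 2000.0 dollars
```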

The tests below compare several EBS volumes against a local instance SSD on an i3.large instance.


| Disk Type                          | Disk Size | Read IOPS       | Write IOPS     | Read MB/s        | Write MB/s | Cost ($/GB/month) |
|------------------------------------|-----------|-----------------|----------------|------------------|------------|-------------------|
| EBS – Cold HDD (sc1)               | 500GB     | 31              | 10             | 45               | 44         | 0.025             |
| EBS – Throughput HDD (st1)         | 500GB     | 101             | 33             | 53               | 53         | 0.045             |
| EBS – Standard SSD (gp2)           | 500GB     | 2074 (bursting) | 692 (bursting) | 22-53 (variable) | 53         | 0.1               |
| NVMe Instance Store SSD (i3.large) | 442GB     | 43281           | 14443          | 238              | 200        | 0.2477            |


During the GP2 SSD test the volume would have been in burst mode, which means it theoretically had access to a 3,000 IOPS limit, up from its base 1,500 IOPS limit. I was also very surprised by the low read throughput of the GP2 SSD: the volume would consistently deliver 22MB/s for a period, and then sometime later it would shoot up to 45MB/s. I can’t really explain why the GP2 volume’s performance is so variable.
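For context on the burst behavior, this is my understanding of the gp2 credit model; the constants come from AWS’s gp2 documentation, not from these tests:

```python
# gp2 as I understand it: baseline of 3 IOPS per GB (minimum 100), with
# bursting to 3,000 IOPS while a 5.4-million-credit bucket lasts.
BUCKET_CREDITS = 5_400_000
BURST_IOPS = 3000

def burst_seconds(size_gb):
    # How long a volume can sustain the full burst rate from a full bucket.
    baseline = max(100, 3 * size_gb)
    if baseline >= BURST_IOPS:
        return float("inf")  # large volumes' baseline exceeds the burst rate
    # Credits drain at (burst - baseline) per second while bursting flat out.
    return BUCKET_CREDITS / (BURST_IOPS - baseline)

seconds = burst_seconds(500)  # 500GB: 1,500 baseline, 3,600 s of full burst
```

So a 500GB volume can hold the 3,000 IOPS burst for about an hour before dropping back to its 1,500 IOPS baseline, which fits the bursting numbers in the table above.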

The i3.large has a stated limit of 53.1MB/s of EBS bandwidth, which is reflected in several of the tests. When testing smaller reads and writes, the effective EBS bandwidth may appear to be greater than 53.1MB/s, which I assume is due to some caching mechanism.

For testing methods, see Test Disk IOPS and Read/Write Bandwidth below.

Test Disk IOPS and Read/Write Bandwidth

Command to test disk IOPS

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Sample Output

test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
bs: 1 (f=1): [m(1)] [78.1% done] [8707KB/2753KB/0KB /s] [2176/688/0 iops] [eta 01m:23s]
fio: terminating on signal 2
test: (groupid=0, jobs=1): err= 0: pid=2864: Sun Apr 22 05:38:14 2018
read : io=2398.4MB, bw=8297.9KB/s, iops=2074, runt=295968msec
write: io=819616KB, bw=2769.3KB/s, iops=692, runt=295968msec
cpu : usr=0.49%, sys=1.70%, ctx=200905, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=613972/w=204904/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
 READ: io=2398.4MB, aggrb=8297KB/s, minb=8297KB/s, maxb=8297KB/s, mint=295968msec, maxt=295968msec
 WRITE: io=819616KB, aggrb=2769KB/s, minb=2769KB/s, maxb=2769KB/s, mint=295968msec, maxt=295968msec

Disk stats (read/write):
 xvde: ios=613645/214996, merge=0/57, ticks=4398044/4579196, in_queue=8977428, util=100.00%
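A quick sanity check on the fio output above: with a 4k block size, reported IOPS should simply equal bandwidth divided by the block size.

```python
# fio ran with bs=4k, so IOPS = bandwidth (KB/s) / 4 KB.
def iops_from_bw(bw_kb_per_s, bs_kb=4):
    return bw_kb_per_s / bs_kb

read_iops = iops_from_bw(8297.9)   # ~2074, matching fio's "iops=2074"
write_iops = iops_from_bw(2769.3)  # ~692, matching fio's "iops=692"
```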

Test Disk Read Bandwidth

hdparm -t /dev/xvdc

Sample Output

$ hdparm -t /dev/xvdc
 Timing buffered disk reads: 168 MB in 3.02 seconds = 55.60 MB/sec

Test Disk Write Bandwidth

Note: You should probably run this several times, or increase the count until the result stabilizes, to eliminate the impact of any disk caches (adding conv=fdatasync to the dd command also helps ensure the result reflects actual disk writes rather than the page cache).

dd if=/dev/zero of=/mnt/throughtput_hdd/output bs=8k count=100k

Sample Output

$ dd if=/dev/zero of=/mnt/throughtput_hdd/output bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 16.476 s, 50.9 MB/s
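The dd figure is easy to double-check: dd reports decimal megabytes, so the rate is just bytes copied divided by elapsed time.

```python
# dd copied 838860800 bytes in 16.476 s and reports decimal MB.
def mb_per_s(byte_count, seconds):
    return byte_count / seconds / 1_000_000

rate = mb_per_s(838_860_800, 16.476)  # ~50.9 MB/s, matching dd's output
```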


Python wrapper for geonames country data.

Geonames Country Data

I recently had the need to map top level domains and languages to country codes, and since I couldn’t find a simple python package to do this I built this wrapper for the geonames country data. I got the idea from the django-countries-plus python package.

Sure, I probably could have cobbled together a few other packages to do what I needed, but this seemed cleaner.
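The underlying idea is simple. Geonames’ countryInfo.txt is a tab-separated file where, by my reading of the format, column 0 is the ISO country code, column 9 the TLD, and column 15 a comma-separated language list; a rough sketch of the kind of lookup involved, using an illustrative sample row rather than the real file:

```python
# Illustrative row in the countryInfo.txt tab-separated layout.
SAMPLE = ("DE\tDEU\t276\tGM\tGermany\tBerlin\t357021\t81802257\tEU\t.de\t"
          "EUR\tEuro\t49\t#####\t^(\\d{5})$\tde\t2921044\tCH,PL,NL,DK,BE,CZ,LU,FR,AT\t")

def build_indexes(lines):
    # Map TLDs and languages to ISO country codes.
    tld_to_iso, lang_to_iso = {}, {}
    for line in lines:
        if line.startswith("#"):  # the real file has comment header lines
            continue
        cols = line.split("\t")
        iso, tld, languages = cols[0], cols[9], cols[15]
        tld_to_iso[tld] = iso
        for lang in languages.split(","):
            lang_to_iso.setdefault(lang, iso)
    return tld_to_iso, lang_to_iso

tlds, langs = build_indexes([SAMPLE])  # tlds[".de"] -> "DE"
```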

Extracting Phone Numbers from the common crawl with EMR/hadoop

Extracting phone numbers from the November 2015 CC

Motivation: A client wanted to find a large set of phone numbers from the public web; all types of numbers, not just US phone numbers.

The common crawl provided a convenient way to efficiently process a large portion of the web and extract phone numbers. A big thanks to Ben C. at Yelp for writing the article Analyzing the Web For the Price of a Sandwich; it provided a great starting place. While Ben’s goal was to search for only US phone numbers, I needed to extract phone numbers from all countries, so I opted to use the phonenumbers python library, despite it being “slower”.

I found that using the common crawl’s WET files (text content only, as opposed to full HTML) was sufficient to extract phone numbers. Processing the WET files meant the total dataset was roughly 8TB, down from the 151TB of the full crawl; despite this it still took ~10 hours to fully process. The phonenumbers library is indeed quite a bit slower and more CPU intensive than a simple regex, but in this case I think it was definitely worth it, because it provided a clean, tested method to extract all forms of phone numbers from text.

Initially I thought I’d have to detect each page’s associated country in order to extract phone numbers, because if a number isn’t fully qualified, e.g. ‘555-845-3567’, then its interpretation depends on the country it’s being called from. This turned out to be unnecessary: the client’s requirement was to extract at minimum 1,000 numbers for every country/zone, and the common crawl contains more than 1,000 fully qualified numbers for every relevant zone.

Optimizing Map Reduce (a little)

As I was using spot instances, I split the crawl into a handful of map reduce jobs, both to avoid losing work if a job failed and to avoid reducing over the entirety of the common crawl at once.

Initially I tried a batch size of 900 WET files, which took 20 minutes. Not too bad, but I also noticed that the cluster often sat idle, due to a combination of a few straggling map operations and the overhead of starting a job with mrjob.

Upping the batch size to 3648 improved things a bit: that job completed in 60 minutes, about 25% faster, mostly due to eliminating the overhead associated with scheduling jobs.

I had tried to size the job’s mappers to fit onto the available cores in stages (19 machines * 32 cores per machine * 6 = 3648), but unfortunately a few of the mappers failed and were rerun, which added an extra final “stage” where only a few mappers were running at a time. Since each mapper processes an entire WET file, this final stage added quite a bit of processing time to the total job run time.
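The slot arithmetic above, spelled out: one WET file per mapper, with the batch sized to fill the cluster’s map slots for a whole number of “waves” of map tasks.

```python
# Batch size chosen so every core runs exactly 6 map tasks back to back.
machines, cores_per_machine, waves = 19, 32, 6
batch_size = machines * cores_per_machine * waves  # 3648 WET files per job
```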

I think the moral of the story is that it’s not easy to exactly tune a map reduce job to fit onto a cluster. There will be intermittent failures, and that’s just something you have to deal with, but larger batch sizes and smaller individual map operations will reduce the overhead associated with each failure.

Actually, if mrjob allowed scheduling multiple jobs onto the same EMR cluster, it wouldn’t have mattered as much: while one job was stuck waiting on a few stragglers to complete, the rest of the cluster would be busy processing the next job.