How to setup log rotation on Flume

Flume can be used to rotate logs file into a regular file system.


In BE mode (Best Effort)

exec config 'tail("/tmp/error.log")'     'agentBESink("",35888)'
exec config 'collectorSource( 35888 )'     'collectorSink("/tmp/nlog/%Y/%m/%d/errorlog", "mylog_%{host}_%H_",60000 )'

in E2E (End to End)

exec config 'tail("/tmp/error.log")'      'agentE2ESink("",35888)'
exec config 'collectorSource( 35888 )'     'collector(600000) {collectorSink("/tmp/nloge/%Y/%m/%d/errorlog", "mylog_%{host}_%H_",600000 )}'


dfsdir – in this example the directory in the collector. If it is not exist, Flume will create it.
path – in this example the logs file prefix
rollmillis – the roll time in milliseconds. Set this to be whatever you would like. When it is not provided, Flume uses flume.collector.roll.millis (which configured at /usr/lib/flume/conf/flume-conf.xml).

Needed for the E2E mode. See How to use the Flume E2E mode.

Testing script for generating logs

A script to generate data for testing the Flume’s log rotation can be:

[bash]$ for i in {1..1000000}; do echo "Flume log rotation $i" >> /tmp/error.log; sleep 4; done


In order to see only the body of the log set this in the collector’s configuration file: flume-site.xml

    <description>The output format for the data written by a Flume

How to use the Flume E2E mode

E2E provides the best reliability mode:

You can tune the reliability level of the agents, simply by specifying a different kind of agent sink. There are three levels available. One of them is the E2E (End To End) mode, which relies on an acknowledgement, and will retry if no acknowledgement is received.

E2E limitations:

E2E mode works only with Hadoop. This means you can’t use:

  • text( ” .. ” )

and can only use:

  • collectorSink( “hdfs://…” , “src” )
  • collectorSink( “file:///../..” , “prefix” )
  • Your own customized Sink

Important: Even if you are using your own customized Sink which doesn’t use any Hadoop facilities, you must have Hadoop running on your collector!

The collector is the key for working with E2E:

In E2E mode, the agent side is sending data in the most reliable mode. The collector must to be setup to calculate and send ACKs.
In order to work with E2E you must add a collector wrapper that calculates ACKs.

For example an E2E configuration should look like:

agent  : source | agentE2ESink (machine,port);
collector : collectorSource(port) | collector(time_in_ms) { collectorSink(...)};

The time_in_ms argument in the collector wrapper means every time_in_ms milliseconds, ACKs
are flushed back on the path back to agents. In BE (best effort) and DFO modes, data should continuously arrive. However, in E2E mode
data tends to arrive in clumps. In all cases, after the collector’s time_in_ms argument expires, the collectorSink gets closed, and then reopened.

Detailed examples (using Flume shell commands):

Writing to files:

exec config agent1 'tail("/tmp/")'  'agentE2ESink("agent2",35888)'
exec config agent2 'collectorSource( 35888 )' 'collector(10000) {collectorSink("file:///tmp/yy","zz-")}'

In this example, the logs from agent1 (file “”) are collected and forward to agent2’s directory: “/tmp/yy”

Customized Sink:

exec config agent1 'tail("/tmp/")'  'agentE2ESink("agent2",35888)'
exec config agent2 'collectorSource( 35888 )' 'collector(10000) { mySimpleSink ()}'

in this example, the logs from agent1 (file “”) are collected and forward to agent2’s mySimpleSink.
The mySimpleSink is a simple sink, that flushes the log Body into a specific file in /tmp directory. Please note that since the collector reopened every time_in_ms, you should write your Sink to open the file in append mode.

Metrics information:

The metrics information at an agent or a collector can be viewed at
port 35862 (e.g. http://agent:35862/)

How to set up a MySQL connection pool in Java

MySQL connection pool

A MySQL connection pool is a pool of connections to MySQL database. Opening and maintaining a database connection for each process (or thread), is time costly (connection creation time) and wastes resources (connections).
Connection pool increase the performance of (Java) applications that needs to connect to the database by reusing the connections. In connection pooling, after a connection is created, it is placed in the pool and it is used over and over again so that a new connection does not have to be established. If all the connections are being used, a new connection is made and is added to the pool.

How Do I Implement MySQL Connection Pooling in Java?

I am using the GenericObjectPool that provides robust pooling functionality for arbitrary objects.



public class MySqlPoolableException extends Exception {
     public MySqlPoolableException(final String msg, Exception e) {
         super(msg, e);

import java.sql.DriverManager;
import org.apache.commons.pool.BasePoolableObjectFactory;
public class MySqlPoolableObjectFactory extends BasePoolableObjectFactory {
     private String host;
     private int port;
     private String schema;
     private String user;
     private String password;
     public MySqlPoolableObjectFactory(String host, int port, String schema,
                                                String user, String password) {
 = host;
          this.port = port;
          this.schema = schema;
          this.user = user;
          this.password = password;
     public Object makeObject() throws Exception {
          String url = "jdbc:mysql://" + host + ":" + port + "/"
               + schema + "?autoReconnectForPools=true";
          return DriverManager.getConnection(url, user, password);

public class ExampleClassUsesMySQLConnectionPool {
     private static final Log LOG = LogFactory.getLog(ExampleClassUsesMySQLConnectionPool.class);
     private static final String SQL_SELECT = "SELECT ... FROM ... ORDER BY ... DESC LIMIT ...";
     private final ObjectPool connPool;
     public ExampleClassUsesMySQLConnectionPool(ObjectPool connPool) {
          this.connPool = connPool;
     public List<SomeRecord> getRecords(String sql) throws SQLException, MySqlPoolableException {
          Builder<SomeRecord> builder = new ImmutableList.Builder<SomeRecord>();
          Connection conn = null;
          Statement st = null;
          ResultSet res = null;
          try {
               conn = (Connection)connPool.borrowObject();
               st = conn.createStatement();
               res = st.executeQuery(sql);
               while ( {
                    SomeRecord someRecord = new SomeRecord(String.valueOf(res.getInt(1)),
                    String.valueOf(res.getInt(2)), res.getString(3));
          } catch (SQLException e) {
               throw e;
          }  catch (Exception e) {
               throw new MySqlPoolableException("Failed to borrow connection from the pool", e);
          } finally {
     private void safeClose(Connection conn) {
          if (conn != null) {
               try {
               catch (Exception e) {
                    LOG.warn("Failed to return the connection to the pool", e);
     private void safeClose(ResultSet res) {
          if (res != null) {
               try {
               } catch (SQLException e) {
                    LOG.warn("Failed to close databse resultset", e);
     private void safeClose(Statement st) {
          if (st != null) {
               try {
               } catch (SQLException e) {
                    LOG.warn("Failed to close databse statment", e);

import org.apache.commons.pool.ObjectPool;
import org.apache.commons.pool.PoolableObjectFactory;
import org.apache.commons.pool.impl.GenericObjectPool;
import org.apache.commons.pool.impl.GenericObjectPoolFactory;
import org.apache.commons.pool.impl.GenericObjectPool.Config;
public class MyApplication {
     private static ObjectPool initMySqlConnectionPool() {
          Properties properties = new Properties();
               If you are using Google's Guava:
          String host = properties.getProperty("host");
          String port = properties.getProperty("port");
          String schema = properties.getProperty("schema");
          String user = properties.getProperty("user");
          String password = properties.getProperty("password");
          PoolableObjectFactory mySqlPoolableObjectFactory = new MySqlPoolableObjectFactory(host,
               Integer.parseInt(port), schema, user, password);
          Config config = new GenericObjectPool.Config();
            config.maxActive = 10;
            config.testOnBorrow = true;
            config.testWhileIdle = true;
            config.timeBetweenEvictionRunsMillis = 10000;
            config.minEvictableIdleTimeMillis = 60000;
          GenericObjectPoolFactory genericObjectPoolFactory = new GenericObjectPoolFactory(mySqlPoolableObjectFactory, config);
          pool = genericObjectPoolFactory.createPool();
          return pool;
     public static void main(String[] args) {
          private ObjectPool pool;
          pool = initMySqlConnectionPool();
          ExampleClassUsesMySQLConnectionPool exampleClassUsesMySQLConnectionPool
               = new ExampleClassUsesMySQLConnectionPool(pool);

Important: Don’t forget to include the property file (“”) in the project CLASS_PATH.

1. Good reference for Java Properties:
2. In Guava, reading lines from file can be easily done by:
List terms = Resources.readLines(Resources.getResource(termsFileName), Charsets.UTF_8);
3. Please see comment bellow by Lucas. He added a fix for ‘network hiccup’ problems.

How to install ANT on LINUX

Apache Ant is a Java library and command-line tool that help building software.
How to install:

  • Create downloads directory if you dont have one
  • Download tar version from
  • Extract
  • Rename the new directory as ant
  • Insert the path into ANT_HOME
  • Update the global PATH variable to include ANT_HOME
  • Run the Ant script fetch.xml to install almost all the dependencies the optional Ant tasks need.

See actual commands:

[bash ~]$ mkdir downloads
[bash ~]$ cd downloads/
[bash downloads]$
[bash downloads]$
[bash downloads]$ wget
[bash downloads]$ tar -zxvf apache-ant-1.8.2-bin.tar.gz
[bash downloads]$ mv apache-ant-1.8.2 ant
[bash downloads]$ ANT_HOME=/home/ATOMANT/ilanh/downloads/ant
[bash downloads]$ PATH=$PATH:${ANT_HOME}/bin
[bash downloads]$ ant -f fetch.xml -Ddest=system

Note: see

How to rename a database in MySQL

In MySQL there is no support for database renaming.
In order to rename a MySQL database you can do one of the following:
1. Create new database and rename all tables in the old database to be in the new database:

CREATE database new_db_name;
RENAME TABLE db_name.table1 TO new_db_name,
                     db_name.table2 TO new_db_name;
DROP database db_name;

2. In Linux shell, use mysqldump to back up the old database, then restore the dumped database under a new name using the MySQL utility. Finally, use the
drop database command to drop the old database. This option can preform badly for large database.

mysqldump -uxxxx -pxxxx -h xxxx db_name > db_name_dump.sql
mysql -uxxxx -pxxxx -h xxxx -e "CREATE DATABASE new_db_name"
mysql -uxxxx -pxxxx -h xxxx new_db_name < db_name_dump.sql
mysql -uxxxx -pxxxx -h xxxx -e "DROP DATABASE db_name"

3. Write a simple Linux script (my favorite solution)

mysqlconn="mysql -u xxxx -pxxxx -S /var/lib/mysql/mysql.sock -h localhost"
#$mysqlconn -e "CREATE DATABASE $newdb"
params=$($mysqlconn -N -e "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE table_schema='$olddb'")
for name in $params; do
      $mysqlconn -e "RENAME TABLE $olddb.$name to $newdb.$name";
#$mysqlconn -e "DROP DATABASE $olddb"

4. If all your tables are MyISAM, you can rename the old database folder name:

  • Shut down the MySQL server,
  • Rename the database folder name to the new name,
  • Start the MySQL server.

Note: This post was improved thanks to the comments I received from:

  • Shlomi Noach
  • VadimTK
  • Cédric PEINTRE

Don’t Alter Table. Do Copy and Rename

In some cases a MySQL MyISAM table structure needs to be alter. This includes adding, removing and changing table columns (or indexes) and even re-ordering the MySQL table.
In these cases, for performance and safety reasons, it is wise to avoid altering the current working MySQL table and adopt the Copy and Rename approach.

The Copy and Rename approach consist from the following steps:

  • Create similar temporary table but with the requested change
  • Disable the temporary table keys,
  • Copy the rows from the original table to the temporary table
  • Enable the temporary table keys,
  • Backup the original table and rename the temporary table to have the original table name

For example, in a case where an existing MySQL table needed new column,
instead of:

Alter tbl add column colx int not null;

do the following:

Create table tbl_tmp like tbl;
Alter table tbl_tmp add column colx int not null;
Alter table tbl_tmp disable keys;
Insert into tble_tmp select *,0 from tbl;
Alter table tbl_tmp enable keys;
Rename table tbl to tbl_old;
Rename table tbl_tmp to tbl;

ALTER TABLE tbl_tmp DISABLE KEYS tells MySQL (for a MyISAM table) to stop updating non-unique indexes. ALTER TABLE tbl_tmp ENABLE KEYS then re-create the missing indexes. MySQL re-create indexes with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup.

Some powerful strengths of Regular Expression

Regular expressions provide a powerful, concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.
I will list here some powerful strengths:

Word Boundaries

The meta character \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
Example: In order to find a name ‘myname‘ in a text you can use:


Word Characters

The characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w in some languages. Notice that Java supports Unicode for \b but not for \w.

Lookahead and Lookbehind

Often called “lookaround” or “zero-width assertions”. They are zero-width.
Lookarounds will actually match characters, but then give up the match and only return the result: match or no match (They do not consume characters in the string, but only assert whether a match is possible or not).
Lookahead: Gives you the ability to match something that follows (positive) or not follows (negative) by something else.

  • Negative lookahead gives you the strength to match something not followed by something else. For example, in order to match a “x” not followed by a “y” you can use:
  • Positive lookahead gives you the strength to match something that followed by something else without consume the something else characters. For example, in order to match a “x” followed by a “y” you can use:

Lookbehind: it has the same effect as Lookahead, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.

  • Negative lookabehind gives you the strength to match something that is not preceded by something else. For example, in order to match a “x” that is not preceded by an “y”, you can use:

    (It will not match zyx, but will match the x (and only the x) in xml or alexander.)

  • Positive Lookbehind gives you the strength to match something that is preceded by something else. For example, in order to match a “x” that is preceded by an “y”, you can use:

    (It will not match axb, but will match the x (and only the x) in zyx )


The question mark tells the engine to attempt match the preceding token zero times or once.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times.

The plus mark tells the engine to attempt to match the preceding token one or more times.

Additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a positive integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
To match a word consist of only Big letters in a size of 3 to 5 characters use:


Please remember that the repetition marks are greedy. It causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack.
For example:
The regex <.+> applied on “This is a <em>first</em> test” will match <em>first</em> and not <em>.
The quick fix to this problem is to make the repetition mark lazy instead of greedy (Lazy quantifiers are sometimes also called “ungreedy” or “reluctant”). You can do that by putting a question mark behind the repetition mark in the regex.
For example:
To solve the previews problem use:


The regex <.+?> applied on “This is a <em>first</em> test” will match <em> and not <em>first</em>.

Unicode Character categories

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to a particular category with \p{} (small letter p). You can match a single character not belonging to a particular category with \P{} (big letter p).
For example:

  • \p{L} or \p{Letter}: any kind of letter from any language
  • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant
  • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant
  • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter
  • \p{Sm} or \p{Math_Symbol}: any mathematical symbol

Recommended resource

A very recommended good resource for Regular Expressions is:

How to select random rows in MySQL

The easiest way to generate random rows in MySQL is to use the ORDER BY RAND() clause.


This can work fine for small tables. However, for big table, it will have a serious performance problem as in order to generate the list of random rows, MySQL need to assign random number to each row and then sort them.
Even if you want only 10 random rows from a set of 100k rows, MySQL need to sort all the 100k rows and then, extract only 10 of them.

My solution for this problem, is to use RAND in the WHERE clause and not in the ORDER BY clause. First, you need to calculate the fragment of your desired result set rows number from the total rows in your table. Second, use this fragment in the WHERE clause and ask only for RAND numbers that smallest (or equal) from this fragment.

For example, suppose you have a table with 200K rows and you need only 100 random rows from the table. The fragment of the result set from the total rows is: 100 / 200k = 0.0005.
The query will look like:

SELECT col1 FROM tbl WHERE RAND()<=0.0005;

In order to get exactly 100 row in the result set, we can increase the fragment number a bit and limit the query:
For example:

SELECT col1 FROM tbl WHERE RAND()<=0.0006 limit 100;

Feel free to leave a comment.

How to dynamically create an object in Java from a class name given as string format

To create an object dynamically from the class name, you need to use a reflection.
If the fully-qualified name of a class is available, it is possible to get the corresponding Class using the static method Class.forName(). However, This cannot be used for primitive types.
Having the Class type, you need to use the newInstance() function. It uses the constructor represented by the class object to create and initialize a new instance of the constructor’s declaring class, with the specified initialization parameters.

For example:

String className = "MyTestClass";
String fullPathOfTheClass = "" + className;
Class cls = Class.forName(fullPathOfTheClass);
MyTestClass myTestObject = (MyTestClass) cls.newInstance();
// now I can use the object methods as usual

In order to dynamically create an object in Java from an inner class name, you should use the $ sign.
For Example:

String className = "MyTestClass";
String innerClassName = "MyInnerTestClass";
String fullPathOfTheClass = "" +
				className +
				"$" +

Inner class names are delimited with $, not with a period. This means that the class name is: MyTestClass$MyInnerTestClass.
The $ syntax is valid for class loading only. Use periods to access instances of the class in source
For example:

private MyTestClass.MyInnerTestClass myInnerObject;

For more information see:
1. Using Java Reflection
2. Retrieving Class Objects

How to determine the leading prefix part of a MySQL index

In MySQL, an index can be limited to use only a leading prefix part of a column values. Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY columns, while for BLOB and TEXT columns the prefix must be given.

For example, the statement shown here creates an index using the first 20 characters of the mytext column:

CREATE INDEX part_of_mytext ON mytable (mytext(10));

The big question before creating new index is: what is the size of the leading prefix that should be use in the index?
The answer is depend on the data. You should choose the size that is approximately give unique values.

Use this query to see how unique parts are compared with the complete uniqueness (COUNT(DISTINCT mytext)):

    COUNT(DISTINCT mytext),
    COUNT(DISTINCT LEFT(mytext, 5)),
    COUNT(DISTINCT LEFT(mytext, 10)),
    COUNT(DISTINCT LEFT(mytext, 20)),
    COUNT(DISTINCT LEFT(mytext, 40)),
    COUNT(DISTINCT LEFT(mytext, 60)),
    COUNT(DISTINCT LEFT(mytext, 80))

1. Prefix lengths are given in characters for nonbinary string types (CHAR, VARCHAR, and TEXT) and in bytes for binary string types (BINARY, VARBINARY, and BLOB).
2. Using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.

How to embed Flume node into your Java application

You may need to embed Flume node into your Java Program in a case you would like your application to control the incoming logs (you can then create new Sinks or Sources in the application).

Embedding Flume into Java Application is possible and very easy to do.


1. Install flume node

2. Configure your dependencies (pom.xml):


3. Change your Class Path:


4. Update FLUME_HOME variable


export FLUME_HOME=/usr/lib/flume/

5. Call to the following function from your main function:

private static void StartFlume(){
  try {
     String[] args = null;
  } catch (Exception e) {
  LOG.error("Aborting: Unexpected problem with environment." + e.getMessage(), e);

And that is it. You have embedded Flume into your Java Application.


MySQL select count

COUNT function counts the number of rows retrieved by a SELECT statement. The return value is of type BIGINT.

The COUNT(expr) function count the number of non-NULL values of expr in the rows retrieved by a SELECT statement.


In contrast, COUNT(*) is different in that it returns a count of the number of rows retrieved, whether or not they contain NULL values.


This means that Count(*) and Count(expr) are not equivalents. For example, if you count a not NULL column it is the same as using COUNT(*), however, if you count a column that can contain Null values, MySQL needs to scan the column values for the non-NULL values.
This fact can affect the query performance. The MYISAM table has cached number of rows in a table. This gives the ability to instantly answer COUNT(*) or COUNT of non-NULL column. However, MySQL can’t instantly answer a COUNT of a column that can contain Null values, and it must scan the column values.

Another interesting thing about the Count function, is that it is meaningless to use it together with the LIMIT clause. The LIMIT clause is limiting the result rows number. Count will always return one value, thus limiting it to a number greater than one is useless. In order to limit the count of the rows up to a certain number, you should use Limited Select Count. It is well described at