I wrote a quick script to count the rows in every table in HBase. It works great on my Dev clusters, which have ever-growing tables filled with clutter. The script launches a MapReduce RowCounter job against each HBase table. I have used it in Prod, but with mixed results: sometimes the tables are so large that the MR jobs cannot finish within 24 hours.
#!/bin/bash
# Filename: rc-start-rowcount.sh
# Description: start a row count for each table
#
# Example:
# /opt/scripts/rc-start-rowcount.sh
#
# 1. check if the row count is already running
# 2. if the row count is NOT running, then run a row count

cd /opt/cloudera/parcels/CDH/bin

ScriptDir="/opt/scripts/";
WorkingDir="/opt/scripts/rc-work";
Test="";
ListOfHBaseTables="rc-tables.txt";
ListOfRunningYarnJobs="rc-yarn-jobs.txt";
ScriptToRun="rc-script.sh";
LogDir="/var/log/scripts";
LogFile="rc-start-rowcount.log";

echo "`date`: Start" >> $LogDir/$LogFile;

# check whether the companion parse script is still running; if so, bail out
StartTest=`ps ax|grep rc-parse-rowcount.sh|grep bash`
echo $StartTest
if [[ ! $StartTest == "" ]]; then
  echo "`date`: WARNING: rc-parse-rowcount.sh is running, exit" >> $LogDir/$LogFile;
  echo $StartTest >> $LogDir/$LogFile;
  exit;
fi

# create the script
echo "#!/bin/bash" > $WorkingDir/$ScriptToRun

# get the list of HBase tables, then strip blank lines and the trailing count line
echo 'list; quit;' | hbase shell > $WorkingDir/$ListOfHBaseTables
sed -i '/^$/d' $WorkingDir/$ListOfHBaseTables
sed -i '$d' $WorkingDir/$ListOfHBaseTables

# get running applications from yarn
yarn application -list > $WorkingDir/$ListOfRunningYarnJobs

while read table; do
  #echo 'table:' $table
  # 1. check if row count is running
  #    if Test is blank=NOT running, anything else=running
  Test=`grep $table $WorkingDir/$ListOfRunningYarnJobs`;
  if [[ $Test == "" ]]; then
    # 2. if the row count is NOT running, then run a row count
    echo "sleep 10;hbase org.apache.hadoop.hbase.mapreduce.RowCounter $table > $WorkingDir/$table.txt 2>&1 &" >> $WorkingDir/$ScriptToRun 2>&1
    #echo 'run this table:' $table
    echo "`date`: Process Table: $table" >> $LogDir/$LogFile;
  fi
done <$WorkingDir/$ListOfHBaseTables

# set the script to be executable
chmod +x $WorkingDir/$ScriptToRun

# show, then run, the script that includes all MapReduce jobs
cat $WorkingDir/$ScriptToRun
$WorkingDir/$ScriptToRun

echo "`date`: End" >> $LogDir/$LogFile;
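Each RowCounter job writes its output to a per-table file in the working directory, and the final row count appears in that output as a MapReduce counter line of the form ROWS=n. A companion parse step (the rc-parse-rowcount.sh the script checks for) could be sketched like this; the function name and output format here are my own assumptions, not part of the original script:

```shell
#!/bin/sh
# Hypothetical sketch of the parse step: read each <table>.txt that
# RowCounter produced and print "<table> <rowcount>" per table.
# Assumes each output file contains a "ROWS=<n>" counter line.
parse_rowcounts() {
  dir="$1"
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue                 # skip if no output files yet
    table=$(basename "$f" .txt)
    # RowCounter reports its result as a MapReduce counter line, e.g. ROWS=12345
    rows=$(grep -o 'ROWS=[0-9]*' "$f" | head -n1 | cut -d= -f2)
    echo "$table ${rows:-unknown}"
  done
}
```

Running `parse_rowcounts /opt/scripts/rc-work` after all jobs finish would give a one-line-per-table summary, which is handy to append to the same log file the main script writes to.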