I wrote a quick script to count the rows in every table in HBase. It works great on my Dev clusters, which are full of ever-growing tables filled with clutter. The script launches a MapReduce RowCounter job against each HBase table. I have also used it in Prod, but with mixed results: sometimes the HBase tables are too large for the MR jobs to finish within 24 hours.
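For reference, the underlying MR job can also be run by hand against a single table. The scanner-caching override below is my own tuning guess for the large Prod tables, not something the script sets, and `my_table` is a placeholder name:

```shell
# Count rows in one table directly (the same job the script fans out per table).
# -Dhbase.client.scanner.caching is an assumed tuning knob; 'my_table' is a placeholder.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter \
  -Dhbase.client.scanner.caching=1000 'my_table'
```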
```bash
#!/bin/bash
# Filename:    rc-start-rowcount.sh
# Description: start a row count for each table
#
# Example:
#   /opt/scripts/rc-start-rowcount.sh
#
# 1. check if the row count is already running
# 2. if the row count is NOT running, then run a row count

cd /opt/cloudera/parcels/CDH/bin

ScriptDir="/opt/scripts/"
WorkingDir="/opt/scripts/rc-work"
Test=""
ListOfHBaseTables="rc-tables.txt"
ListOfRunningYarnJobs="rc-yarn-jobs.txt"
ScriptToRun="rc-script.sh"
LogDir="/var/log/scripts"
LogFile="rc-start-rowcount.log"

echo "`date`: Start" >> $LogDir/$LogFile

# exit if the parser from a previous run is still going
# (the [r] trick keeps grep from matching its own process)
StartTest=`ps ax | grep "[r]c-parse-rowcount.sh" | grep bash`
if [[ ! $StartTest == "" ]]; then
  echo "`date`: WARNING: rc-parse-rowcount.sh is running, exit" >> $LogDir/$LogFile
  echo $StartTest >> $LogDir/$LogFile
  exit
fi

# create the script
echo "#!/bin/bash" > $WorkingDir/$ScriptToRun

# get the list of tables; drop blank lines and the trailing summary line
echo 'list; quit;' | hbase shell > $WorkingDir/$ListOfHBaseTables
sed -i '/^$/d' $WorkingDir/$ListOfHBaseTables
sed -i '$d' $WorkingDir/$ListOfHBaseTables

# get running applications from yarn
yarn application -list > $WorkingDir/$ListOfRunningYarnJobs

while read table; do
  # 1. check if a row count is already running for this table
  #    if Test is blank = NOT running, anything else = running
  Test=`grep $table $WorkingDir/$ListOfRunningYarnJobs`
  if [[ $Test == "" ]]; then
    # 2. if the row count is NOT running, queue a row count
    echo "sleep 10;hbase org.apache.hadoop.hbase.mapreduce.RowCounter $table > $WorkingDir/$table.txt 2>&1 &" >> $WorkingDir/$ScriptToRun 2>&1
    echo "`date`: Process Table: $table" >> $LogDir/$LogFile
  fi
done < $WorkingDir/$ListOfHBaseTables

# set the script to be executable
chmod +x $WorkingDir/$ScriptToRun

# show, then run, the script that launches all the map reduce jobs
cat $WorkingDir/$ScriptToRun
$WorkingDir/$ScriptToRun

echo "`date`: End" >> $LogDir/$LogFile
```
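The per-table output files left in the working directory can then be reduced to plain numbers. RowCounter reports its result as a `ROWS=<n>` job counter in the MR client log, so a small follow-up loop can pull it out. This is only a sketch of what the `rc-parse-rowcount.sh` companion the script checks for might do; the `ROWS=` grep is the assumption it rests on:

```shell
#!/bin/bash
# Sketch: extract the final row count from each RowCounter log file.
# Assumes each log contains a "ROWS=<n>" counter line, as RowCounter
# prints in its job-counter summary.
WorkingDir="/opt/scripts/rc-work"
for f in "$WorkingDir"/*.txt; do
  table=$(basename "$f" .txt)
  rows=$(grep -o 'ROWS=[0-9]*' "$f" | tail -1 | cut -d= -f2)
  echo "$table: ${rows:-unknown}"
done
```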