BIGDATA SOLUTIONS

Disallowed DataNode Exception

If the DataNode process fails to start and you see the error below:
ERROR datanode.DataNode: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode: NameNode:50010
This means that this particular DataNode is not recognized by the NameNode. You are probably using the "dfs.hosts" property (specified in hdfs-site.xml), which points to a file listing the IP addresses of the permitted DataNodes. You need to add this DataNode's IP address to that file.
Example: My hdfs-site.xml defines this property as follows (other properties not shown):
<property>
<name>dfs.hosts</name>
<value>/usr/local/hadoop/conf/include</value>
</property>

The "include" file at the above location has the following entries (for example):
192.168.1.20
192.168.1.29
Add to this file the IP address of the DataNode you want to use as a slave.
After adding the entry, run the following command to tell the NameNode about the newly added nodes:
hadoop dfsadmin -refreshNodes
Now go to the newly added DataNode and start the processes:
hadoop datanode (starts the DataNode process)
hadoop tasktracker (starts the TaskTracker process)
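
Editing the include file by hand makes it easy to add the same IP twice. The sketch below is one possible helper, not part of Hadoop: the function name add_datanode and the IP 192.168.1.35 are made up for illustration, and the file is assumed to hold one IP per line.

```shell
# Append a DataNode IP to the dfs.hosts include file unless it is already listed.
# Usage: add_datanode <include_file> <ip>   (helper name is hypothetical)
add_datanode() {
    # -x matches the whole line, -F treats the IP as a literal string
    if grep -qxF "$2" "$1" 2>/dev/null; then
        echo "already listed: $2"
    else
        echo "$2" >> "$1"
        echo "added: $2"
    fi
}

# Example (path taken from the post, IP is hypothetical):
# add_datanode /usr/local/hadoop/conf/include 192.168.1.35
# hadoop dfsadmin -refreshNodes
```

The NameNode still has to be told about the change with "hadoop dfsadmin -refreshNodes" after the file is updated.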

Deleting files older than a certain date in HDFS

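The following commands delete the files in an HDFS directory that were last modified before a given date. Set cutoff (YYYY-MM-DD) to the date you need; the value below is only an example.
cutoff=2015-07-01
hadoop fs -ls shakespeare | tail -n +2 | awk -v cutoff="$cutoff" '$6 < cutoff {
cmd = "hadoop fs -rm " $8;
system(cmd)
}'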
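
Because the delete is irreversible, it can help to preview which paths would be selected first. The following is a minimal dry-run sketch: it assumes the usual eight-column "hadoop fs -ls" output with the modification date (YYYY-MM-DD) in column 6 and the path in column 8. The function name older_than is made up for illustration.

```shell
# Print the paths in an `hadoop fs -ls` listing whose modification date
# (column 6, YYYY-MM-DD) is earlier than the cutoff date passed as $1.
# Reads the listing on stdin and deletes nothing.
older_than() {
    # NF >= 8 skips the "Found N items" header line;
    # ISO dates compare correctly as plain strings.
    awk -v cutoff="$1" 'NF >= 8 && $6 < cutoff { print $8 }'
}

# Example:
# hadoop fs -ls shakespeare | older_than 2015-07-01
```

Paths that contain spaces are not handled by this sketch, since awk splits fields on whitespace.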

Dumping the Hive table as CSV file

This query is supported in Hive 0.11 and later. I have tested it in Hive 0.13.1.
You can run the query below to dump Hive tables as CSV files onto your local file system:
insert overwrite local directory '/home/training/HiveResult' row format delimited fields terminated by ',' select * from sample;

----------------------------------------
# First, convert the Hive query result to CSV format

hive -e "select a.user_game_id as usergames, a.cur_user_count as users from fresh1.ipro_sports_user_game as a JOIN fresh1.ipro_sports_game_schedule as b ON a.game_id=b.game_id where a.user_game_id >= '270'" | sed 's/[[:space:]]\+/,/g' > /home/cloud-user/counofusers23
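
The sed expression above turns every run of whitespace into a comma, which is fine for the two integer columns here but would also split string values that contain spaces. Since "hive -e" separates columns with tabs by default, a possible alternative is to translate tabs only. The function name tsv_to_csv is made up for illustration.

```shell
# Convert tab-separated rows (the default output of `hive -e`) to CSV.
# Only tabs are replaced, so values that contain spaces stay intact.
# Values that themselves contain commas are NOT quoted by this sketch.
tsv_to_csv() {
    tr '\t' ','
}

# Example:
# hive -e "select ..." | tsv_to_csv > /home/cloud-user/counofusers23
```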

1) create table csvtable(game_id int,count_ofusers int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;

2) load data local inpath '/home/cloud-user/counofusers23' into table csvtable;

3) CREATE TABLE avrotable
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.rishav.avro",
"name": "student_marks",
"type": "record",
"fields": [ { "name":"game_id","type":"int"}, { "name":"count_ofusers","type":"int"}]
}');

4) insert overwrite table avrotable select * from csvtable;

5) hadoop fs -cat /apps/hive/warehouse/fresh1.db/avrotable/* >countofusers.avro

6) Download avro-tools-1.7.5 from the site below:
http://mvnrepository.com/artifact/org.apache.avro/avro-tools/1.7.5

7) java -jar /home/cloud-user/avro-tools-1.7.5.jar tojson countofusers.avro >countfusers.json

8) Now you can open countfusers.json to view the data in JSON format.

Compiling Java programs for Hive UDF and creating jar

Compiling the source code

javac -cp "$(hadoop classpath):$HIVE_HOME/lib/*" com/sym/*.java
Now create the jar:
jar cfe session.jar com.sym.IncrEnvVariable com/sym/*.class
If the entry-point class is in a package, use a '.' (dot) character as the delimiter.
For example, if Main.class is in a package called foo, the entry point can be specified as follows:

jar cfe Main.jar foo.Main foo/Main.class

Hyphen sign in column names in hive

While dealing with structs, you sometimes need to use the same column names as in the source schema. The schema might contain hyphens, which Hive does not support in plain column names. To work around this, put backticks around the column name. For example:
create table x(`my-name` string)

Deleting all the tables in a Hive database

You can run the command below to delete all the tables in a given database:
hive -e 'use sample_db;show tables' | xargs -I '{}' hive -e 'use sample_db;drop table {}'
Here sample_db is the database name. (Use single quotes, not backticks.)
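
The command above starts a new Hive session for every table, which can be slow when the database has many tables. A possible alternative is to generate all the DROP statements into one script and run it in a single session with "hive -f". The sketch below assumes "show tables" prints one table name per line; the function name drop_statements and the file name drop_all.hql are made up for illustration.

```shell
# Read table names (one per line) on stdin and print one DROP statement each.
# Backticks protect names that contain unusual characters such as hyphens.
drop_statements() {
    while IFS= read -r table; do
        if [ -n "$table" ]; then
            printf 'DROP TABLE IF EXISTS `%s`;\n' "$table"
        fi
    done
}

# Example (sample_db is the database name from the post):
# { echo 'use sample_db;'; hive -e 'use sample_db;show tables' | drop_statements; } > drop_all.hql
# hive -f drop_all.hql
```

Writing the statements to a file first also gives you a chance to review them before anything is dropped.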

What is Big Data?

Big Data

The dictionary meaning is: extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.

Big data means a huge volume of data. Here, a "huge volume" means data that cannot be stored and processed within a given time using traditional approaches.

How much huge?

Generally, when we say huge data, we think of sizes in GB, TB, PB or EB (exabytes), but size alone does not define big data completely.

Even a small amount of data can be called big data relative to the application.
Example: email. We cannot send a 100 MB file as an email, which means 100 MB is big (huge) data with respect to an email application.

Monday, 27 July 2015