Sunday, July 17, 2016

Advanced SQL Queries

Select the 10th highest salary


SELECT TOP 1 Salary FROM
(SELECT DISTINCT TOP 10 Salary FROM Employee ORDER BY Salary DESC) AS t
ORDER BY Salary ASC

Alternative (MySQL/PostgreSQL):

SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT 1 OFFSET 9
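The LIMIT/OFFSET form can be sanity-checked with SQLite through Python's sqlite3 module; the table name and sample salaries below are made up for the demo:

```python
# Sketch of the "Nth highest salary" pattern against an in-memory SQLite DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [("emp%d" % i, 1000 * i) for i in range(1, 13)])  # 1000..12000

# Skip the 9 highest distinct salaries, then take the next one
row = conn.execute(
    "SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT 1 OFFSET 9"
).fetchone()
print(row[0])  # 10th highest distinct salary -> 3000
```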


Combine results from more than two tables using UNION ALL while avoiding duplicates

SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
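A quick SQLite sketch (table and values are illustrative) shows why the a!=X guard matters: a row matching both predicates would otherwise appear twice in the UNION ALL result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(1, 2), (1, 9), (5, 2), (7, 7)])  # (1, 2) matches both predicates

rows = conn.execute(
    "SELECT * FROM mytable WHERE a=1 "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b=2 AND a!=1"
).fetchall()
print(rows)  # (1, 2) shows up once, not twice
```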

Find values in one table that are absent from another table, without using NOT IN

Select id from table_a
except
Select id from table_b

Alternative:

Select a.id from table_a a left outer join table_b b on a.id = b.id
where b.id is null
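Both forms can be sketched in SQLite (sample data made up); the anti-join variant filters on b.id IS NULL to keep only the unmatched rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER)")
conn.execute("CREATE TABLE table_b (id INTEGER)")
conn.executemany("INSERT INTO table_a VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO table_b VALUES (?)", [(2,), (4,)])

via_except = conn.execute(
    "SELECT id FROM table_a EXCEPT SELECT id FROM table_b").fetchall()
via_join = conn.execute(
    "SELECT a.id FROM table_a a LEFT OUTER JOIN table_b b ON a.id = b.id "
    "WHERE b.id IS NULL").fetchall()
print(sorted(via_except), sorted(via_join))  # ids 1 and 3 both ways
```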

Oracle MERGE - Use case: when the condition is met, update/delete the record; otherwise insert it. Useful in ETL processes when syncing two different data sources, where updated rows from the source table are updated in the data warehouse table and new rows are inserted.

MERGE INTO dw_employee D
USING (SELECT employee_id, salary, department_id FROM employees) S
ON (D.employee_id = S.employee_id)
WHEN MATCHED THEN UPDATE SET D.salary = S.salary
WHEN NOT MATCHED THEN INSERT (D.employee_id, D.salary, D.department_id)
VALUES (S.employee_id, S.salary, S.department_id);

For better performance, add a parallel hint, e.g. in the above query:

MERGE /*+ PARALLEL(dw_employee, 6) */ INTO dw_employee D ...

Reference : https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel007.htm#i1009257
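SQLite has no MERGE statement (or parallel hints), but the same update-or-insert behavior can be sketched with its upsert syntax (INSERT ... ON CONFLICT DO UPDATE, SQLite 3.24+); table names mirror the Oracle example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_employee (employee_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("INSERT INTO dw_employee VALUES (1, 100)")  # existing row -> updated
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, 150), (2, 200)])                   # id 2 is new -> inserted

src = conn.execute("SELECT employee_id, salary FROM employees").fetchall()
conn.executemany(
    "INSERT INTO dw_employee (employee_id, salary) VALUES (?, ?) "
    "ON CONFLICT(employee_id) DO UPDATE SET salary = excluded.salary", src)

rows = conn.execute("SELECT * FROM dw_employee ORDER BY employee_id").fetchall()
print(rows)  # -> [(1, 150), (2, 200)]
```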

Saturday, July 16, 2016

Hive queries

Combining multiple hive tables into one

INSERT OVERWRITE TABLE FINAL_TBL
SELECT main_union.id, main_union.output_string
FROM (
select id, concat_ws("|",col1,col2,col3) output_string from table1

UNION ALL

select id, concat_ws("|",col1,col2,col3) output_string from table2

UNION ALL

select id, concat_ws("|",col1,col2,col3) output_string from table3

UNION ALL

select id, concat_ws("|",col1,col2,col3) output_string from table4

) main_union

CLUSTER BY main_union.id;

Hive solution in place of an EXISTS clause

For example: an individual (employee) works at an institute, which has many departments.
joe works at institute1 - dept1
joe works at institute2 - dept3
jason works at institute1 - dept1
jason works at institute1 - dept2


Query:

Select ind.individual from individual ind join ind_related ir on ind.id = ir.id
join institute ins on ir.relatedkey = ins.instituteid
left semi join dept d on d.deptid = ir.relatedkey

Monday, December 8, 2014

Hadoop & Big Data - Simple Introduction

When and where you can use Hadoop - Map/Reduce?

Relational databases are best suited for transactional data and are also used widely in data warehousing. Hadoop and map/reduce are useful when you need to process massive amounts of data at higher speed and lower cost: the stack is open source and runs on commodity servers clustered together. Concurrent processing comes out of the box, because the framework subdivides the data into smaller chunks that can be processed in parallel. Performing analytics on a large dataset is a very good example of a use case for hadoop - map/reduce.


How do you design your system to use hadoop - map/reduce?

A typical map-reduce tutorial counts the words in a large text file or other extremely large input. The framework splits the data input and sorts it for you. The map program takes this input and groups the data; in the word-count case it outputs word1 freq1; word2 freq2; etc. The reducer runs in multiple cycles, combining the mapper output and eventually producing the final result.
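The word-count flow described above can be sketched in plain Python: map emits (word, 1) pairs, the framework's sort/shuffle is modeled by sorting and grouping, and reduce sums the counts for each word:

```python
from itertools import groupby

def mapper(text):
    # emit a (word, 1) pair for every word in the input split
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    # sum all counts grouped under the same word
    return (word, sum(counts))

text = "the quick fox jumps over the lazy dog the end"
pairs = sorted(mapper(text))  # the framework sorts mapper output by key
result = dict(reducer(word, (c for _, c in grp))
              for word, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result["the"])  # -> 3
```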

An easy way to think of the two steps: map is essentially a data combiner, grouping the data you want to process by similarity, while the reducer processes the already-grouped data. Assume we have a very large input, e.g. an entire year's worth of Lego store sales, and we are trying to find the best-selling product, i.e. the one that makes the most money.

For example, top-selling product by location; each record is (province, store, product, units sold, unit price):
AB, EDM1, Lego Friends, 100, 50
AB, EDM2, Lego Creator, 70, 100
ON, GTA1, Duplo, 100, 45
ON, GTA2, bricks, 1000, 20
BC, VAN1, Lego Farm, 150, 35
BC, VAN2, bricks, 750, 10
ON, GTA3, Lego Friends, 400, 40
BC, VAN3, Lego Creator, 200, 110
..

Map will group the input by province and pass it on to the reducer. The reducer will find the top-selling product in each province and output the result:

AB, EDM2, Lego Creator, 7000
ON, GTA2, bricks, 20000
BC, VAN3, Lego Creator, 22000
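The Lego example can be sketched the same way: the "map" step keys each record by province and computes revenue as units times price, and the "reduce" step keeps the highest-revenue record per province:

```python
from collections import defaultdict

sales = [
    ("AB", "EDM1", "Lego Friends", 100, 50),
    ("AB", "EDM2", "Lego Creator", 70, 100),
    ("ON", "GTA1", "Duplo", 100, 45),
    ("ON", "GTA2", "bricks", 1000, 20),
    ("BC", "VAN1", "Lego Farm", 150, 35),
    ("BC", "VAN2", "bricks", 750, 10),
    ("ON", "GTA3", "Lego Friends", 400, 40),
    ("BC", "VAN3", "Lego Creator", 200, 110),
]

# "map": group records by province, computing revenue = units * price
grouped = defaultdict(list)
for prov, store, product, units, price in sales:
    grouped[prov].append((store, product, units * price))

# "reduce": pick the highest-revenue record within each province
top = {prov: max(recs, key=lambda r: r[2]) for prov, recs in grouped.items()}
for prov in sorted(top):
    print(prov, *top[prov])  # e.g. AB EDM2 Lego Creator 7000
```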


What is the solution when the reducer runs out of memory? For example, in the word-count problem, if the file is very large and contains one word very frequently (like 'the'), the reducer can run out of memory.
Possible solutions:
Increase the number of reducers
Increase memory per reducer
Implement a combiner.

PySpark code for the above example (WIP):
store = sqlContext.read.parquet("...")
store = store.withColumn("col6", store.col4 * store.col5)
store.groupBy(store.col1, store.col2, store.col3).agg({"col6": "max"}).show()

Wednesday, October 15, 2014

Shell script to generate comma-separated output from a file

#!/bin/bash
counter=1
invoice_clause=""
while read line
do
    if [ $counter -eq 1 ]
    then
        invoice_clause="'$line'"
    else
        invoice_clause="$invoice_clause,'$line'"
    fi
    counter=`expr $counter + 1`
done < invoices.num

echo $invoice_clause


Input file :
one
two
three
four
five

Output:
'one','two','three','four','five'

Simple Python script to convert name-value pairs to tabular format

#!/usr/bin/env python
import csv

# Collect the values for each key, preserving the order in which keys appear
data = dict()
order = []
with open("mapping.csv") as infile:
    for key, prop, val in csv.reader(infile, skipinitialspace=True):
        if key not in data:
            data[key] = [key]
            order.append(key)
        data[key].append(val)

with open("myfile.csv", "w", newline="") as outfile:
    w = csv.writer(outfile, dialect="excel")
    w.writerows(data[key] for key in order)

Example mapping.csv file:

Key1, prop1, value1
Key1, prop2, value2
Key1, prop3, value3
Key2, prop1, value4
Key2, prop2, value5
Key2, prop3, value6

Output file:
Key1, value1, value2, value3
Key2, value4, value5, value6

Technical Interview questions

Q: What happens when you enter URL in the browser?
A:
1.    browser checks cache; if requested object is in cache and is fresh, skip to 9
2.    browser asks OS for server's IP address
3.    OS makes a DNS lookup and replies the IP address to the browser
4.    browser opens a TCP connection to server (this step is much more complex with HTTPS)
5.    browser sends the HTTP request through TCP connection
6.    browser receives HTTP response and may close the TCP connection, or reuse it for another request
7.    browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
8.    if cacheable, response is stored in cache
9.    browser decodes response (e.g. if it's gzipped)
10.  browser determines what to do with response (e.g. is it a HTML page, is it an image, is it a sound clip?)
11.  browser renders response, or offers a download dialog for unrecognized types

Q: How do you implement a queue using stacks?
A: Hint: use two stacks.
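A minimal Python sketch of the two-stack queue: one stack receives new items, and items are reversed into the second stack only when it runs empty, so each element moves at most twice:

```python
class QueueFromStacks:
    """FIFO queue built from two LIFO stacks (Python lists)."""

    def __init__(self):
        self.inbox = []   # receives new items
        self.outbox = []  # holds items reversed into FIFO order

    def enqueue(self, item):
        self.inbox.append(item)

    def dequeue(self):
        # refill outbox only when empty, so each item moves at most twice
        if not self.outbox:
            while self.inbox:
                self.outbox.append(self.inbox.pop())
        return self.outbox.pop()

q = QueueFromStacks()
for i in (1, 2, 3):
    q.enqueue(i)
print(q.dequeue(), q.dequeue(), q.dequeue())  # -> 1 2 3
```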

Q: Design an elevator system for a high-rise tower
A: Hint: use elevation and speed parameters to control stopping the elevator at a specific floor.

Q: How do you implement a HashMap, and which methods need to be overridden?
A: Hint: override the equals and hashCode methods of the Object class.
The main methods to implement are put and get.
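A toy Python sketch of the idea: a hash function picks a bucket, and equality comparison within the bucket decides whether put overwrites an existing key (mirroring hashCode and equals in Java):

```python
class SimpleHashMap:
    """Toy hash map with separate chaining over a fixed number of buckets."""

    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        # hash() selects the bucket, like hashCode in Java
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # equality decides whether it's the same key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

m = SimpleHashMap()
m.put("a", 1)
m.put("a", 2)  # overwrites the first value
print(m.get("a"))  # -> 2
```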

Q: What is dependency injection?
A: Injecting an object's dependencies from the outside (through configuration) rather than constructing them inside the object - constructor injection, setter injection, etc., as used in the Spring framework.

Q: How does Spring AOP work?
A: Hint: it uses wrapper (proxy) classes. An alternative approach is bytecode modification (e.g. AspectJ weaving), which plain Spring AOP does not use.

Q: How do you keep track of multiple backend requests modifying the same object?
A: Through versioning: increment the record's version in the DB to detect race conditions and avoid data corruption.
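A minimal optimistic-locking sketch with SQLite (table and values are made up): the UPDATE only succeeds when the caller still holds the current version, so a second, stale request is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 1)")

def update_balance(conn, account_id, new_balance, expected_version):
    # Only succeeds if nobody bumped the version since we read the row
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version))
    return cur.rowcount == 1  # False means another request won the race

first = update_balance(conn, 1, 150, expected_version=1)   # succeeds
second = update_balance(conn, 1, 175, expected_version=1)  # stale version, rejected
print(first, second)  # -> True False
```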

Q: Describe popular design patterns
A: Singleton, Factory (with the Spring framework, you get these out of the box),
Visitor, Facade, Builder, Strategy


Thursday, March 14, 2013

Jasper Reports tips and tricks

Jasper Reports basic steps:

1. Design report in iReport designer
2. Save the template - jrxml file
3. Compile the jrxml to jasper
4. Run from iReport or integrate it with a java program

Load the jasper report from .jasper file

//reportData is an instance of List
JasperReport jasperReport = (JasperReport) JRLoader.loadObjectFromLocation(jasperReportName);

JasperPrint jasperPrint = JasperFillManager.fillReport(jasperReport,parameters,new JRBeanCollectionDataSource(reportData));


//Export into pdf format
byte[] pdfFile = JasperExportManager.exportReportToPdf(jasperPrint);


//Export into csv format
JRCsvExporter exporter = new JRCsvExporter();
StringBuffer buffer = new StringBuffer();
exporter.setParameter(JRExporterParameter.JASPER_PRINT, jasperPrint);
exporter.setParameter(JRExporterParameter.OUTPUT_STRING_BUFFER,buffer);
exporter.exportReport();

//Export into HTML format
JRHtmlExporter exporter=new JRHtmlExporter();
StringBuffer buffer = new StringBuffer();
  
exporter.setParameter(JRExporterParameter.JASPER_PRINT, jasperPrint);
exporter.setParameter(JRExporterParameter.OUTPUT_STRING_BUFFER, buffer);
exporter.setParameter(JRHtmlExporterParameter.IS_USING_IMAGES_TO_ALIGN, false);
//the following parameters avoid page breaks and empty spaces, and display tabular data better in HTML than in PDF
exporter.setParameter(JRHtmlExporterParameter.IGNORE_PAGE_MARGINS, true);
exporter.setParameter(JRHtmlExporterParameter.ZOOM_RATIO, 1.5F);
exporter.setParameter(JRHtmlExporterParameter.BETWEEN_PAGES_HTML, "");
exporter.setParameter(JRHtmlExporterParameter.FRAMES_AS_NESTED_TABLES,true);
exporter.setParameter(JRHtmlExporterParameter.IS_REMOVE_EMPTY_SPACE_BETWEEN_ROWS, true);
exporter.exportReport();
   
iReport Designer tips and tricks:


Create parameters and fields in the iReport designer, and make sure the names exactly match the names in the JavaBean if you plan to use a JavaBean as the datasource.

The list component is also very useful if you do not want to use a subreport.
To create a list, drag the list component from the palette, right-click and select the datasource:
new net.sf.jasperreports.engine.data.JRBeanCollectionDataSource($F{fieldname})
Then go to dataset1, create all necessary fields, and drag and drop them into the list area.


If you want to display a certain area only when some condition is met, select the band properties for that particular group or section and set the "print when expression", e.g. to new Boolean($F{fieldname} > 0).