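Rocking's Learning Journey

Monday, April 1, 2013

An Introduction to Cloud Computing

I recently entered the 2013 騰雲駕霧 Big Data Creative Programming Contest, which requires some knowledge of cloud computing, so I am writing down what I have learned about it so far.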

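Cloud services fall into the following three types:
1. SaaS (software as a service)
Provides software as a service, such as Gmail and Google Docs.

2. PaaS (platform as a service)
Provides a platform as a service, such as Hadoop or Microsoft's Windows Azure.

3. IaaS (infrastructure as a service)
Provides hardware as a service, i.e. rents the equipment to you. The hardware is not necessarily physical; it may be virtual hosts created with VMs.

What we have to build for this contest is a SaaS application.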

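Here are a few examples of SaaS applications.

Trend Micro 騰雲駕霧 2009 champion: Location Plus
Location Plus mines a large volume of PTT BBS posts to find the hot topics in 17 cities in Taiwan. For each city it provides 30 hot topics with related reference terms, which users can use to search Yahoo Knowledge+, Life+, Wretch (無名小站), and similar content. It also offers a mobile interface, so users can see what the local hot topics are wherever they go.

Trend Micro 騰雲駕霧 2011 champion: http://www.trend.org/event/2011contest/news.html
I don't know what it does yet; I'll look it up when I have time.

Trend Micro: antivirus products will adopt cloud antivirus technology on a large scale before the end of the year
http://www.ithome.com.tw/itadm/article.php?c=60065
I suppose this is also why Trend Micro runs the contest.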

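Based on the tutorials Trend Micro provided, these are the technologies and tools we may need (still being updated).

Technologies

MapReduce: a technique that splits an algorithm into many tasks and runs them in parallel on N machines.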
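
To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the task. It is plain Java and does not use the Hadoop API; the class name MapReduceSketch and its methods are made up for illustration. On a real cluster the map and reduce calls would run on different machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
public class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all counts that share the same key.
    public static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"), then reduce each group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=3, cloud=2, data=2}
        System.out.println(wordCount(List.of("big data big cloud", "cloud data big")));
    }
}
```

Because each map call looks at only one line and each reduce call at only one key, the work can be handed out to separate machines without them needing to talk to each other.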

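HBase: a distributed database. It is not a relational database built on the relational model.

Platforms

ZooKeeper: a coordination service for distributed systems that bundles several facilities together. I can only paste the official site's introduction to ZooKeeper here: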

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

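Learning Apache ZooKeeper
http://ricky906.blogspot.tw/2012/01/zookeeper.html

Chubby: a system developed by Google that serves the same purpose as ZooKeeper. See this article for details.

Hadoop: a platform that lets us run MapReduce; its file system is HDFS. NCHC (the National Center for High-performance Computing) offers a free Hadoop system: http://hadoop.nchc.org.tw/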

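Tools

Flume: unknown; I don't have a feel for it yet...

Sqoop: a tool for transferring data in both directions between relational databases (such as MySQL and Oracle) and Hadoop.

Hive: an SQL-like interface to Hadoop that lets users access data in HDFS with SQL syntax.

Pig: a scripting language that simplifies common Hadoop tasks and makes MapReduce programs easier to write. For an introduction to Pig and Hive and a comparison of the two, see http://book.douban.com/annotation/17153277/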

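Oozie: a task run in Hadoop sometimes needs several Map/Reduce jobs chained together to reach its goal. Hadoop has a workflow scheduler called Oozie, which lets us combine multiple Map/Reduce jobs into one logical unit of work and so complete larger tasks.
PS: In Data Mining HW3 (canopy + k-means), the whole pipeline used many MapReduce jobs, and every k-means iteration was one MapReduce job. Back then I chained them with a batch file, and I think the role that batch file played is exactly what Oozie is for. Some simple Oozie examples can be found here.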
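
The "one k-means iteration equals one MapReduce pass" idea from the PS can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: the points are one-dimensional, the canopy step is left out, everything runs in a single process, and the class and method names (KMeansPasses, onePass, run) are made up. The run method stands in for the batch file or Oozie workflow that launches one pass after another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not Hadoop, Oozie, or Mahout code.
public class KMeansPasses {

    // "Map" step: return the index of the centroid closest to a point.
    public static int nearest(double point, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    // One pass: group points by nearest centroid (map + shuffle), then average each group (reduce).
    public static double[] onePass(double[] points, double[] centroids) {
        Map<Integer, List<Double>> groups = new HashMap<>();
        for (double p : points) {
            groups.computeIfAbsent(nearest(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[] next = centroids.clone(); // a centroid with no points keeps its old position
        for (Map.Entry<Integer, List<Double>> e : groups.entrySet()) {
            double sum = 0;
            for (double p : e.getValue()) {
                sum += p;
            }
            next[e.getKey()] = sum / e.getValue().size();
        }
        return next;
    }

    // Driver: chains passes until the centroids stop moving or maxPasses is reached.
    public static double[] run(double[] points, double[] initial, int maxPasses) {
        double[] current = initial;
        for (int i = 0; i < maxPasses; i++) {
            double[] next = onePass(points, current);
            if (Arrays.equals(next, current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        // prints [2.0, 11.0]
        System.out.println(Arrays.toString(run(points, new double[] {0, 5}, 20)));
    }
}
```

Each call to onePass depends on the centroids produced by the previous one, which is why the jobs have to be chained in order rather than submitted all at once.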

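Mahout: Mahout is a data mining library. It provides several broad families of algorithms, such as recommendation systems (in Mahout this specifically means collaborative filtering), classification, clustering, pattern mining (FP-growth), dimension reduction, and vector similarity, and each family comes with implementations of several classic algorithms.
For classification, for example, it currently offers implementations of common algorithms such as Bayesian, Support Vector Machine, Neural Network, and Hidden Markov Models.

HUE: HUE is a web UI for Hadoop that makes Hadoop more convenient to use, without having to face the command line. To quote the official site: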

Hue’s target is the Hadoop user experience and lets users avoid the command line interface and have them focus on visibility and getting results quickly.

How to paste code into a Google blog

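First open the blog's admin dashboard:
Template → Edit HTML → check "Expand Widget Templates"
Then find this block:

.post-body {
  position: relative;
}

and paste the following in front of it: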


CODE {
  display: block; /* fixes a strange IE margin bug */
  font-family: Courier New;
  font-size: 8pt;
  overflow: auto;
  background: #f0f0f0 url(http://klcintw.images.googlepages.com/Code_BG.gif) left top repeat-y;
  border: 1px solid #ccc;
  padding: 10px 10px 10px 21px;
  max-height: 200px;
  line-height: 1.2em;
}

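After that, whenever you post code on the blog, type the code, switch to HTML mode, and wrap the code in
<code> and </code>; it will then be displayed the way the code in this post is.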

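Sunday, March 24, 2013

Java replaceAll() - a tip about backslashes

To replace backslashes with Java's replaceAll() function,
a single backslash "\" has to be written as four backslashes "\\\\".
For example, suppose the backslash in str1 and the quote in str2 should be escaped, so that

str1="aa\bbb"; str2="aa'bbb";

becomes

str1="aa\\bbb";str2="aa\'bbb";

(both shown as the actual characters in the string, not as Java literals).

Write it like this: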
str1 = str1.replaceAll("\\\\", "\\\\\\\\");
str2 = str2.replaceAll("'", "\\\\'");

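Why? The Java compiler turns "\\\\" into "\\" before handing it to the regular expression engine, and the regular expression engine then reads "\\" as a single "\".
So a single backslash has to be written as four in a regular expression. The replacement string goes through the same two layers (a backslash there escapes the next character), which is why producing two backslashes takes eight.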
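
As a cross-check, here is a small sketch that produces the same result three ways: with the hand-written escapes from above, with Pattern.quote and Matcher.quoteReplacement doing the regex-level escaping, and with String.replace, which treats both arguments as literal text. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class.
public class BackslashDemo {

    // Hand-escaped: the regex "\\\\" matches one backslash,
    // and the replacement "\\\\\\\\" yields two literal backslashes.
    public static String doubleWithReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // Same result, letting the library add the regex-level escaping.
    public static String doubleWithQuoting(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    // String.replace takes literal text, so only Java's own string escaping applies.
    public static String doubleWithReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    // Puts a backslash in front of every single quote.
    public static String escapeQuote(String s) {
        return s.replaceAll("'", "\\\\'");
    }

    public static void main(String[] args) {
        String str1 = "aa\\bbb"; // the six characters aa\bbb
        String str2 = "aa'bbb";
        System.out.println(doubleWithReplaceAll(str1)); // prints aa\\bbb
        System.out.println(doubleWithQuoting(str1));    // prints aa\\bbb
        System.out.println(doubleWithReplace(str1));    // prints aa\\bbb
        System.out.println(escapeQuote(str2));          // prints aa\'bbb
    }
}
```

When no regular expression features are needed, String.replace is usually the easier choice because there is only one layer of escaping to count.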

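Friday, March 22, 2013

Eclipse: copy and paste does not work

Many thanks to this article for saving me: http://otweb.com/phramework/pw/module/blog/index.php?id=935

Window > Preferences > Java > Editor > Typing
Uncheck "Update imports".

After this change, copy and paste works normally again.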

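Java XML parsing example (DOM parser)

According to this article, Java has four ways to parse XML:

1) DOM (the JAXP Crimson parser)
2) SAX
3) JDOM http://www.jdom.org
4) DOM4J http://dom4j.sourceforge.net

When performance is not a concern, the DOM parser is the simplest, so it is used as the example here, following this article.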

staff.xml

<?xml version="1.0"?>
<company>
 <staff id="1001">
  <firstname>yong</firstname>
  <lastname>mook kim</lastname>
  <nickname>mkyong</nickname>
  <salary>100000</salary>
 </staff>
 <staff id="2001">
  <firstname>low</firstname>
  <lastname>yin fong</lastname>
  <nickname>fong fong</nickname>
  <salary>200000</salary>
 </staff>
</company>

Code

ReadXMLFile.java

package com.mkyong.seo;

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReadXMLFile {

  public static void main(String[] args) {
    try {
      // Load the XML file and build an in-memory DOM tree
      File fXmlFile = new File("/Users/mkyong/staff.xml");
      DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
      DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
      Document doc = dBuilder.parse(fXmlFile);

      // optional, but recommended
      // read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
      doc.getDocumentElement().normalize();

      System.out.println("Root element :" + doc.getDocumentElement().getNodeName());

      // Collect every <staff> element in the document
      NodeList nList = doc.getElementsByTagName("staff");

      System.out.println("----------------------------");

      for (int temp = 0; temp < nList.getLength(); temp++) {
        Node nNode = nList.item(temp);

        System.out.println("\nCurrent Element :" + nNode.getNodeName());

        if (nNode.getNodeType() == Node.ELEMENT_NODE) {
          Element eElement = (Element) nNode;

          // Read the id attribute and the text of each child element
          System.out.println("Staff id : " + eElement.getAttribute("id"));
          System.out.println("First Name : " + eElement.getElementsByTagName("firstname").item(0).getTextContent());
          System.out.println("Last Name : " + eElement.getElementsByTagName("lastname").item(0).getTextContent());
          System.out.println("Nick Name : " + eElement.getElementsByTagName("nickname").item(0).getTextContent());
          System.out.println("Salary : " + eElement.getElementsByTagName("salary").item(0).getTextContent());
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Output

Root element :company
----------------------------
 
Current Element :staff
Staff id : 1001
First Name : yong
Last Name : mook kim
Nick Name : mkyong
Salary : 100000
 
Current Element :staff
Staff id : 2001
First Name : low
Last Name : yin fong
Nick Name : fong fong
Salary : 200000
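
The example above reads from a file and assumes every tag is present; item(0) returns null when a tag is missing, so the getTextContent() call would then throw a NullPointerException. Below is a small variation, offered as a sketch rather than part of the referenced article: it parses an in-memory string so it runs without staff.xml, and it returns an empty string for a missing tag. The class StaffNicknames and the helper textOf are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class.
public class StaffNicknames {

    // Returns the text of the first descendant with the given tag, or "" if there is none.
    public static String textOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent();
    }

    // Builds one "id:nickname" string per <staff> element.
    public static List<String> nicknames(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList staff = doc.getElementsByTagName("staff");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < staff.getLength(); i++) {
                // getElementsByTagName only returns elements, so the cast is safe
                Element e = (Element) staff.item(i);
                result.add(e.getAttribute("id") + ":" + textOf(e, "nickname"));
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<company>"
            + "<staff id=\"1001\"><nickname>mkyong</nickname></staff>"
            + "<staff id=\"2001\"><firstname>low</firstname></staff>"
            + "</company>";
        // prints [1001:mkyong, 2001:]
        System.out.println(nicknames(xml));
    }
}
```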
