Showing posts with label CSV data. Show all posts

Monday, March 4, 2013

Scatter Plot Matrix for Incanter

A Scatter Plot Matrix chart can be useful when exploring relationships between the metric variables of a data-set. Today we share a working implementation of a scatter plot matrix function written in Clojure using Incanter, an R-like statistical computing and graphics environment.

Scatter plot matrix of iris experiment data-set using Clojure, Incanter and JFreeChart

What features are implemented?

  • Histogram in the diagonal for each metric
  • Variance calculated for each metric
  • Spline chart added to each histogram
  • Scatter plot for each metric pair
  • Correlation calculated for each metric pair
  • Grouping option for a categorical dimension
  • Metrics are sorted according to their correlation with the other metrics
  • Show only the n most correlating metrics
  • Show only the upper triangle of the plot matrix
Scatter plot matrix of chick-weight experiment data-set using Clojure, Incanter and JFreeChart

Usage of Scatter Plot Matrix

If you do not have Leiningen, install it:

cd ~/bin
wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
chmod a+x lein

To see the Iris demo, do the following:

cd ${YOURWORKINGDIRECTORY}
git clone git://github.com/loganisarn/scatter-plot-matrix.git
cd scatter-plot-matrix
lein run

To generate and run a jar file:

lein uberjar
java -jar target/spm-0.1.0-standalone.jar

For those who use Emacs:

emacs src/spm/core.clj
M-x clojure-jack-in or 
M-x nrepl-jack-in

More details can be read in the project's GitHub repository.

scatter-plot-matrix function options

(scatter-plot-matrix data & options)

   Options:
   :data data (default $data) the data set for the plot.
   :title s (default "Scatter Plot Matrix").
   :bins n (default 10) number of bins (i.e. bars) in the histogram.
   :group-by grp (default nil) name of the column for grouping data.
   :only-first n (default 6) show only the first n most correlating columns of the data set.
   :only-triangle b (default false) show only the upper triangle of the plot matrix.

   Examples:

   (view (scatter-plot-matrix (get-dataset :iris) :bins 20 :group-by :Species ))
   (with-data (get-dataset :iris) (view (scatter-plot-matrix :bins 20 :group-by :Species )))
   (view (scatter-plot-matrix (get-dataset :chick-weight) :group-by :Diet :bins 20))

Detailed usage examples

Defining data source.
;;;Input examples for iris
  ;; Input dataset examples: Incanter data repo, local file, remote file (url)
  (def iris (get-dataset :iris))
  (def iris (read-dataset "data/iris.dat" :delim \space :header true)) ; relative to project home
  (def iris (read-dataset "https://raw.github.com/liebke/incanter/master/data/iris.dat" :delim \space :header true))
Filtering for specific columns.
;; Filter dataset to specific columns only
  (def iris ($ [:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species] (get-dataset :iris)))
  (def iris (sel (get-dataset :iris) :cols [:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species] ))
Defining a chart object with default options.
;;; Scatter plot matrix examples
  ;; Using default options
  (def iris-spm (scatter-plot-matrix iris :group-by :Species))
  ;; filter to metrics only, no categorical dimension for grouping
  (def iris-spm (scatter-plot-matrix :data ($ [:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width] iris)))
Defining a chart object using more options.
(def iris-spm (scatter-plot-matrix iris
                                     :title "Iris Scatter Plot Matrix"
                                     :bins 20 ; number of histogram bars
                                     :group-by :Species
                                     :only-first 4 ; most correlating columns
                                     :only-triangle false))

Viewing and saving scatter plot matrix chart

View on Display. Set chart width and height according to your needs.
(view iris-spm :width 1280 :height 800)
Save as PDF document using save-pdf Incanter function. (Click to see an example PDF output)
(save-pdf  iris-spm "out/iris-spm.pdf" :width 2560 :height 1600)
Save as PNG image using save Incanter function. (Click to see an example PNG output)
 (save iris-spm "out/iris-spm.png" :width 2560 :height 1600)

We received some suggestions that browser-based output would be a nice alternative to JFreeChart; D3 and C2 were mentioned.

Scatter plot matrix of airline data-set using Clojure, Incanter and JFreeChart

As the airline example above shows, the scatter plot matrix function is also useful with a single metric pair and one categorical dimension.

Feedback is Welcome

Thank you for your comments and feedback. We hope you find our scatter plot matrix function implementation useful. Have a nice day using Clojure.

Sunday, December 23, 2012

Part 3 of Exploring Google Analytics data using Clojure, Incanter and MongoDB

Classifying Google Analytics All Traffic Sources helps us make better media-selection decisions. You may not want to spend money on a medium that sends above-average bouncing visitors. As described in my previous post, we use a simple classification method: below average: 0, average: 1, above average: 2.
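The rule used by classify-row below can be sketched as a small pure function. This is a minimal sketch with a hypothetical name (classify-value); the real classify-row works on MongoDB rows, using the metric's average plus/minus half its standard deviation as the class bounds:

```clojure
;; Sketch of the classification rule for a positive metric:
;; values within average +/- half the standard deviation are "average".
(defn classify-value [v average halfsd]
  (cond (> v (+ average halfsd)) 2   ; above average
        (< v (- average halfsd)) 0   ; below average
        :else 1))                    ; average

(map #(classify-value % 10 2) [5 10 15]) ; => (0 1 2)
```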

 
  ; Let's check column names in :m2
  (fdb :m2 :limit 1)
  [:_id :SourceMedium :Visits :PagesVisit :AvgVisitDuration :NewVisits :BounceRate]
  [# 1 5336 3.75 192 0.41 0.44659999999999994]
  ; Let's classify each metric
  (classify-row :m2 :PagesVisit)
  (classify-row :m2 :BounceRate)
  (classify-row :m2 :NewVisits)
  (classify-row :m2 :AvgVisitDuration)
  (classify-row :m2 :Visits)
  ; Let's check column names in :m2 again
  (fdb :m2 :limit 1)
  ; We can see that each classified metric has a new column now
  [:BounceRate :NewVisits :Visits :SourceMedium :AvgVisitDuration :PagesVisitC :PagesVisit :_id :AvgVisitDurationC :VisitsC :NewVisitsC :BounceRateC]
  [0.6214 0.655 832 2 99 1 2.02 # 1 1 1 0]
  ; Let's use Incanter for charting 3 metrics
  (lm-gchart (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration) "Pages/Visit" "AvgVisitDuration by BounceRateC"  (fdr :m2 :BounceRateC))

You may try out lm-gchart on your own.


Pages/Visit and AvgVisitDuration scatter-plot grouped by BounceRateC. A red dot means a high-bouncing traffic source, blue means average, and green means a below-average bouncing traffic source.

Functions of clojure.math.combinatorics may be your friends while exploring GA data.

;; combinations of two metrics
(clojure.math.combinatorics/combinations [:PagesVisit :BounceRate :Visits :AvgVisitDuration :NewVisits] 2)
;; Result:
((:PagesVisit :BounceRate) (:PagesVisit :Visits) (:PagesVisit :AvgVisitDuration) (:PagesVisit :NewVisits) (:BounceRate :Visits) (:BounceRate :AvgVisitDuration) (:BounceRate :NewVisits) (:Visits :AvgVisitDuration) (:Visits :NewVisits) (:AvgVisitDuration :NewVisits))
;; combinations of three metrics
(clojure.math.combinatorics/combinations [:PagesVisit :BounceRate :Visits :AvgVisitDuration :NewVisits] 3)
;; Result:
((:PagesVisit :BounceRate :Visits) (:PagesVisit :BounceRate :AvgVisitDuration) (:PagesVisit :BounceRate :NewVisits) (:PagesVisit :Visits :AvgVisitDuration) (:PagesVisit :Visits :NewVisits) (:PagesVisit :AvgVisitDuration :NewVisits) (:BounceRate :Visits :AvgVisitDuration) (:BounceRate :Visits :NewVisits) (:BounceRate :AvgVisitDuration :NewVisits) (:Visits :AvgVisitDuration :NewVisits))
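Note that clojure.math.combinatorics is a separate dependency. If you only need metric pairs, a small dependency-free sketch does the same (pairs is a hypothetical helper, not part of the library):

```clojure
;; All unordered pairs of a collection, without clojure.math.combinatorics
(defn pairs [xs]
  (for [[i a] (map-indexed vector xs)
        b     (drop (inc i) xs)]
    [a b]))

(pairs [:PagesVisit :BounceRate :Visits])
;; => ([:PagesVisit :BounceRate] [:PagesVisit :Visits] [:BounceRate :Visits])
```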

Saturday, December 22, 2012

Exploring Google Analytics data with Clojure, Incanter and MongoDB Part 2

Now we have some helper functions. Let's use them to import and process a Google Analytics CSV export file. In my last post we defined the initdb function. Please set path_to_dir in util.clj's initdb according to your environment.
 (prn (:out (clojure.java.shell/sh "/path_to_dir/mediana/00import2mongo.sh" "/path_to_dir/Analytics-All-Traffic-20121120-20121220.csv" "c5" "m1")))

desktop.clj

We evaluate functions interactively. Open mediana/src/mediana/core.clj in Emacs. M-x clojure-jack-in starts a SLIME session and a REPL. Pressing C-x C-e right after a closing paren evaluates the expression (M-x slime-eval-last-expression). A whole file can be evaluated by pressing C-c C-k (M-x slime-compile-and-load-file). An active region can be evaluated by C-c C-r (M-x slime-eval-region).
(in-ns 'mediana.core)
;;:SourceMedium :PagesVisit :Visits :AvgVisitDuration :NewVisits :BounceRate
;; Desktop
(defn nopfn []
  ;; Import csv into :m1 then process it into :m2
  (initdb)
  ;; Fetch "google / organic" data from :m1
  (fdb :m1 :where {:SourceMedium "google / organic"})
  ;; Result:
  ;; [:_id :SourceMedium :Visits :PagesVisit :AvgVisitDuration :NewVisits :BounceRate]
  ;; [# "google / organic" "25,104" 5.38 "00:05:32" "39.93%" "33.42%"]
  ;; What is the index of the "google / organic" string in the :SourceMedium row?
  (iof "google / organic" (fdr :m1 :SourceMedium))
  ;; 352
  ;; Get record from :m2
  (fdb :m2 :where {:SourceMedium 352})
  ;; Result:
  ;; [:_id :SourceMedium :Visits :PagesVisit :AvgVisitDuration :NewVisits :BounceRate]
  ;; [# 352 25104 5.38 332 0.3993 0.3342]
  ;; Processing works fine.
  )
Now we have a working dataset in the :m2 collection. Let's see some data.
;; What is the correlation between :PagesVisit and :AvgVisitDuration?
  (correlation (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration))
  ;; 0.7215517848024166
  ;; It is a strong correlation. Let's see it in a chart with a linear model:
  (lm-chart (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration) "Pages/Visit" "Avg Visit duration in sec")

Linear model details

  (def lm (linear-model (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration)))
  (keys lm) ; see what fields are included
  (:design-matrix lm) ; a matrix containing the independent variables and an intercept column
  (:coefs lm) ; regression coefficients
  (:t-tests lm) ; t-test values of coefficients
  (:t-probs lm) ; p-values for t-test values of coefficients
  (:fitted lm) ; the predicted values of y
  (:residuals lm) ; the residuals of each observation
  (:std-errors lm) ; the standard errors of the coefficients
  (:sse lm) ; the sum of squared errors
  (:ssr lm) ; the regression sum of squares
  (:sst lm) ; the total sum of squares
  (:r-square lm) ; coefficient of determination
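The sum-of-squares fields are related: sst = ssr + sse, and r-square = ssr / sst = 1 - sse / sst. A dependency-free sketch verifying that identity on made-up numbers (ols-r-square is a hypothetical helper, not part of Incanter):

```clojure
;; Fit y = a + b*x by least squares and compute R^2 two ways.
(defn ols-r-square [x y]
  (let [mean  (fn [v] (/ (reduce + v) (count v)))
        mx    (mean x)
        my    (mean y)
        slope (/ (reduce + (map #(* (- %1 mx) (- %2 my)) x y))
                 (reduce + (map #(let [d (- % mx)] (* d d)) x)))
        icept (- my (* slope mx))
        yhat  (map #(+ icept (* slope %)) x)
        sse   (reduce + (map #(let [d (- %1 %2)] (* d d)) y yhat))
        ssr   (reduce + (map #(let [d (- % my)] (* d d)) yhat))
        sst   (reduce + (map #(let [d (- % my)] (* d d)) y))]
    ;; ssr/sst and 1 - sse/sst should agree
    [(/ ssr sst) (- 1 (/ sse sst))]))

(ols-r-square [1.0 2.0 3.0 4.0] [2.0 3.0 5.0 6.0]) ; both values close to 0.98
```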

Histogram

  (view (histogram (fdr :m2 :PagesVisit) :nbins 50))
  (view (histogram (fdr :m2 :BounceRate) :nbins 50))
  (view (histogram (fdr :m2 :NewVisits) :nbins 50))
  (view (histogram (fdr :m2 :AvgVisitDuration) :nbins 50))
  (view (histogram (fdr :m2 :Visits) :nbins 50))

Classifying Metrics

We use a simple classification for each metric, as implemented in the classify-row function: below average: 0, average: 1, above average: 2. In my next post we take a look at it.

Friday, December 21, 2012

Exploring Google Analytics data with Clojure, Incanter and MongoDB Part 1.

There are many ways to analyse your Google Analytics data. In this post I use GA Traffic Sources CSV export data to try out Incanter, Clojure and MongoDB. This post is part of the Loganis GAckup tools. All Clojure code can be used under the same license as Clojure, the Eclipse Public License 1.0.

Preparing data

Log into Google Analytics (GA) and go to Traffic Sources - All Traffic. Select a date range and other options, like Show rows:, according to your needs. Then click Export - CSV above the Explorer tab.

Preparing software

You need to install the following software packages:

MongoDB

Go to the MongoDB Downloads page and install it.

Clojure and Incanter

Create a new project using lein. We use the project name mediana in this post.
lein new mediana
Edit your project.clj like this example:
(defproject mediana "0.0.1"
  :description "Mediana Loganis GA tool"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [incanter "1.3.0" ]]
  :main mediana.core
  :aot :all
  )
Download Clojure, Incanter and their dependencies.
lein deps
Create a 00import2mongo.sh file in the main project directory. This script removes extra chars from the exported CSV file and imports it into MongoDB with a given database and collection name:
#!/bin/bash
# Clean up and import GA exported csv
# usage: 00import2mongo.sh ga-export.csv DB-name Collection-name

if [ "$1" != "" ]
then
    # remove comments, empty lines and Day values, then remove slashes, percents, spaces and dots from field names
    sed -e '/^#.*/d;/^$/d;/^Day\,/,$d' $1 | sed -e '1s/ \/ //g;1s/%//g;1s/ //g;1s/\.//g' | mongoimport -d $2 -c $3 --drop --headerline --type csv -
else
    echo "not ok"
fi

core.clj

Edit the src/mediana/core.clj file. We will use Incanter interactively in an Emacs swank Clojure REPL session via M-x clojure-jack-in. If you do not have a configured Emacs environment, please visit Emacs-Live, which is easy to configure. In this post we use the database name c5 for GA data. We use one namespace and three clj files: core.clj, which loads util.clj and desktop.clj; util.clj, which contains helper functions; and desktop.clj, which contains nopfn, a no-operation function where we can evaluate Clojure code.
(ns mediana.core
  (:gen-class)
  (:use (incanter io core stats charts datasets mongodb))
  (:require [clojure.java.io :as io]
            [clojure.string :as string])
  (:require [clj-time.core])
  (:require [incanter.distributions :as dist])
  (:use somnium.congomongo)
  (:require clojure.java.shell)
  (:refer-clojure))
(use 'somnium.congomongo)
(use 'incanter.mongodb)

(mongo! :db "c5")
(load "util")
(load "desktop")
; main                                                                                                                                                 
(defn -main [& args]
  (prn "hello world")
  )

util.clj

Here I prepared some helper functions. iof: index of an item in a collection; sreplace: string replace; parse-int: parse a string into an integer; avg: average of a row; p2n: percent string to number; s2n: string to number; t2s: time string to seconds.
(in-ns 'mediana.core)

(defn iof
  "Index of an item in a coll"
  ([item coll]
     (iof item coll 0))
  ([item coll from-idx]
     (loop [idx from-idx coll (seq (drop from-idx coll))]
       (if coll (if (= item (first coll)) idx
                    (recur (inc idx) (next coll)))
           -1))))

(defn sreplace
  "Remove matches of pattern r from string s"
  [s r]
  (clojure.string/replace s (java.util.regex.Pattern/compile r) ""))

(defn parse-int
  "Parse string into integer"
  [s]
  (Integer. (re-find  #"\d+" s )))

(defn avg
  "Calculate average of a row"
  [row]
  (/ (sum row) (nrow row)))

(defn p2n
  "Percent string to number"
  [s]
  (let [sn (read-string (sreplace s ",|%"))]
    (if (number? sn) (/ sn 100))))

(defn s2n
  "String number to number"
   [s]
  (let [sn (read-string (sreplace s ",|%"))]
    (if (number? sn) sn)))

(defn t2s
  "Time string to seconds"
  [s]
  (let [tv (clojure.string/split s #":")
        sec (if (= 3 (count tv)) ; note: count, not length (clojure.core has no length)
              (+ (* (parse-int (nth tv 0)) 3600)
                 (* (parse-int (nth tv 1)) 60)
                 (parse-int (nth tv 2)))
              0)]
    sec))
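A few quick REPL checks of the helpers above, on made-up GA-style values (assuming util.clj is loaded in mediana.core):

```clojure
(iof "google / organic" ["(direct) / (none)" "google / organic"]) ; => 1
(parse-int "00")        ; => 0
(s2n "25,104")          ; => 25104
(p2n "39.93%")          ; => 0.3993 (approximately)
(t2s "00:05:32")        ; => 332
```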

util.clj MongoDB functions

fdb: fetch database; fdr: fetch data row; fdd: fetch distinct data.
(defn fdb
  "Fetch DataBase, an alias for mongo db query"
  [& args]
  (apply fetch-dataset args))

(defn fdr
  "Fetch data row, query coll for only row"
  [coll row]
  ($ row (fdb coll :only [row])))

(defn fdd
  "Fetch distinct data, query coll for only distinct values of row"
  [coll row]
  (distinct-values coll (name row)))

util.clj chart and weighted sort

lm-chart: scatter-plot chart with a linear model; lm-gchart: lm-chart plus a group-by variable; etv-ws: estimated value for weighted sort.
(defn lm-chart
  "Linear model of x y in a chart"
  [x y xl yl]
  (let [sp (scatter-plot x y
                         :title (str xl " " yl)
                         :x-label (str xl)
                         :y-label (str yl))
        lm (linear-model y x)]
    (add-lines sp x (:fitted lm))
    (view sp)))

(defn lm-gchart
  "Linear model of x y in a chart grouped by g"
  [x y xl yl g]
  (let [sp (scatter-plot x y
                         :group-by g
                         :legend true
                         :title (str xl " " yl)
                         :x-label (str xl)
                         :y-label (str yl))
        lm (linear-model y x)]
    (add-lines sp x (:fitted lm))
    (view sp)))

(defn etv-ws
  "Estimated weighted sort value for y based on x where mx maximum of Xs, ay average of Ys."
  [x y mx ay]
  (let [etv ($= (x / mx * y) + ((1 - (x / mx)) * ay ))]
    etv))
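The formula blends each row's own value y with the overall average ay, weighted by how large the row's x is relative to the maximum mx. The same computation without Incanter's $= macro, with a worked example (etv-ws* is a hypothetical name for this sketch):

```clojure
;; etv = (x/mx) * y + (1 - x/mx) * ay
(defn etv-ws* [x y mx ay]
  (+ (* (/ x mx) y)
     (* (- 1 (/ x mx)) ay)))

;; A row with half the maximum weight lands halfway between its own
;; value and the collection average:
(etv-ws* 50 8 100 4) ; => 6
```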

util.clj init db

Read the comments in the source. First we call the 00import2mongo.sh script, which cleans the GA csv and imports it into the c5 database's m1 collection. Then we process the imported collection in parallel, using pmap, into an integer/float-only m2 collection.
(defn initdb
  "Initialize DB"
  []
  ;; import GA exported csv into mongo :m1 collection using 00import2mongo shell script for csv cleaning before import
  (prn (:out (clojure.java.shell/sh "/path_to_dir/mediana/00import2mongo.sh" "/path_to_dir/Analytics-All-Traffic-20121120-20121220.csv" "c5" "m1")))
  ;; drop :m2 processed collection
  (drop-coll! :m2)
  ;; Distinct values for SourceMedium. A traffic source may occur several times in case of a secondary dimension selected for csv export
  (def SourceMedium (fdd :m1 :SourceMedium))
  ;; process all object ids of :m1 coll
  (let [oids ($ :_id (fdb :m1))]
    ;; dorun forces the lazy pmap so the inserts actually run
    (dorun
     (pmap
      (fn [y]
        (let [rec (fdb :m1 :where {:_id y})
              ;; SourceMedium as integer index
              smi (iof ($ :SourceMedium rec) SourceMedium)
              ;; String to integer Number
              vis (s2n ($ :Visits rec))
              pav ($ :PagesVisit rec)
              avs ($ :AvgVisitDuration rec)
              ;; Time to seconds as integer
              avd (t2s avs)
              ;; Percent to float
              nev (p2n ($ :NewVisits rec))
              bor (p2n ($ :BounceRate rec))]
          ;; store cleaned up values
          (insert! :m2 {:SourceMedium smi
                        :Visits vis
                        :PagesVisit pav
                        :AvgVisitDuration avd
                        :NewVisits nev
                        :BounceRate bor})))
      oids))))
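One caveat worth noting: pmap returns a lazy sequence, so side-effecting work inside it only happens when the sequence is realized (the REPL does this by printing the result; compiled code may not). Wrapping the call in dorun forces realization. A minimal illustration:

```clojure
;; dorun realizes a lazy (p)map purely for its side effects
(let [total (atom 0)]
  (dorun (pmap #(swap! total + %) [1 2 3]))
  @total) ; => 6
```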

util.clj classify-row

We use a simple classification: below average, average, above average, based on the average and the standard deviation. BounceRate is a negative metric: a lower value is better.
(def negmets "Negative Metrics or lower value means better"
  #{:BounceRate})

(defn classify-row
  "Classify row to avg and sd of row in coll"
  [coll row]
  (let [;; all object ids
        oids (fdr coll :_id)
        ;; get all data of row
        rowdata (fdr coll row)
        ;; calc average of row
        average (avg rowdata)
        ;; calc half of the standard deviation of the row data
        halfsd (/ (sd rowdata) 2)
        ;; max and min values: avg +/- halfsd
        maxv (+ average halfsd)
        minv (- average halfsd)]
    ;; process object ids in parallel; dorun forces the lazy pmap
    (dorun
     (pmap
      (fn [y]
        (let [;; get full record of the object id
              rec (fetch-one coll :where {:_id y})
              ;; get row value of record
              yre (row rec)
              ;; classify; for a negative metric (like Bounce Rate) the higher the worse
              cla (if (contains? negmets row)
                    ;; Negative metric: max points when lowest
                    (cond (> yre maxv) 0
                          (< yre minv) 2
                          :else 1)
                    ;; Positive metric: max points when highest
                    (cond (> yre maxv) 2
                          (< yre minv) 0
                          :else 1))]
          ;; Merge classification value back into the record,
          ;; adding an extra "C" char to the name of the metric
          (update! coll rec (merge rec {(keyword (str (name row) "C")) cla}))))
      oids))))

util.clj weighted sort collection

This weighted sort function is similar to Google Analytics weighted sort.
;; weighted sort weight calculation for a given row
(defn weighted-sort-row
  "Calculate and store weighted sort value in coll using wrow as weight.
   Store new weighted values in a new column named rowW"
  [coll wrow row]
  (let [oids (fdr coll :_id)
        ydata (fdr coll row)
        ay (avg ydata)
        xdata (fdr coll wrow)
        mx (reduce max xdata)]
    ;; dorun forces the lazy pmap so the updates actually run
    (dorun
     (pmap (fn [oid]
             (let [rec (fetch-one coll :where {:_id oid})
                   x (wrow rec)
                   y (row rec)
                   w (etv-ws x y mx ay)]
               (update! coll rec (merge rec {(keyword (str (name row) "W")) w}))))
           oids))))
That is all for preparation. In my next post we turn to desktop.clj where we interactively explore real Google Analytics data using the above functions.