data_Analytics

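Big data analytics: the methods, tools, and applications used to collect, process, and derive insights from varied, high-volume, high-velocity data sets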

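Part 1

pnorm() computes cumulative probabilities under the normal distribution

Although we talk about big data, in practice we usually have only partial data

AutoML packages may replace data analysts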

data:

signal: patterns, insight
noise

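Reasons for the rise of AI

Advances in learning algorithms
Exponential growth of data (big data)
Cheaper computation

Why deep learning is losing momentum: it cannot readily infer causal relationships

Smart apps: collect user data and send it back to the server, where the model is adjusted to provide better service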

Part 2

Where does big data come from?

Traditional data (tabular data): similar to the data in a relational database

(1) continuous data, e.g., income (2) categorical data, e.g., male/female

image data: pictures
voice data
text data
network data: large graphs, e.g., a social network; can be represented as an adjacency matrix (this has appeared on graduate school entrance exams)
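
The adjacency matrix mentioned above can be sketched in a few lines of Python. The four names and three friendships are made up for illustration; only the matrix idea comes from the notes.

```python
# Hypothetical 4-person friendship network (undirected edges).
people = ["Ann", "Bob", "Cat", "Dan"]
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Cat", "Dan")]

index = {name: i for i, name in enumerate(people)}
n = len(people)

# adjacency[i][j] is 1 when person i and person j are connected.
adjacency = [[0] * n for _ in range(n)]
for a, b in edges:
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1  # undirected: mirror the entry

# Row sums give each person's degree (number of friends).
degree = {name: sum(adjacency[index[name]]) for name in people}

for row in adjacency:
    print(row)
print(degree)
```

For an undirected network the matrix is symmetric, so half of it is redundant; large social networks are usually stored as sparse structures instead.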

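Data structure: structured; semi-structured (key-value): JSON, XML; unstructured: audio, images, email, etc.

Types of analytics: descriptive: pie charts and the like; diagnostic: hypothesis testing, causal relationships; predictive: regression; prescriptive: optimal decisions

ML model: explores the relationships among variables; simulation model: defines different objects and then runs a simulation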

apply()

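lapply(): the l stands for list; lapply applies a function to each element and returns a list

sapply(): the s stands for simplify; sapply returns the result simplified from a list to a vector where possible

mapply(): the m stands for multivariate; it applies a function over several arguments at the same time

sd(): standard deviation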

Data pipeline: the work of a data scientist

Particle swarm optimization: a kind of algorithm; you can try it on NetLogo Web

Think about it: why don't pigeons crash into each other?

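Quoting: the notes assign '' to vectors and "" to strings; in R, both '...' and "..." delimit character strings, and vectors are built with c()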

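Get the predictions accurate enough first, then move on to interpretation

R dataset: mtcars; R function: lm() fits a linear model

learner: the algorithm that learns; model: the fitted result it produces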

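LIME explainer

Multi-modal learning (learning the way people do): combining different types of data

Shortcut learning: the model learns to decide from cues we did not intend

e.g., "there is a ruler in the image, so it is skin cancer"

Correlation alone doesn't imply causation

Confounder: a confounding factor

Randomized controlled trial = A/B test

Simpson's paradox; survivorship bias

Federated learning: only the trained model is sent to the cloud; the more private data does not have to be uploaded

SQL syntax in R

Multi-label learning: one item can belong to several categories at once, e.g., a movie can be both a horror film and a romance

Multi-task: different domains (e.g., age and education level)
Multi-label: same domain
Binary class: an item falls into one of only two classes, e.g., cat or dog
Multi-class: an item falls into one of several classes, e.g., a, b, or c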

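A model can be used for

prediction
interpretation: explaining the relationships between variables
optimization

Horizontal join: merge using the first column as the key. Vertical join: merge using the first row as the key. Vertical join operations: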

union
diff
intersection

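The smallest unit in R is the vector, not the scalar

In a formula, the dependent variable goes on the left of the ~ symbol and the explanatory variables go on the right. A sample formula showing the use of ~: s <- lhs ~ rhs; printing s gives lhs ~ rhs

I(): identity ("as is"); inside a formula it creates a variable from an expression, e.g., I(a+b) creates a variable equal to a+b

i.i.d.: independent and identically distributed

o.o.d.: out of distribution

Data granularity: continuous -> categorize -> category

Types of data reduction:

Dimensionality reduction: e.g., the subjects Chinese, English, math, and science reduce to Chinese + English -> language and math + science -> math and science
Numerosity reduction: discretizing numeric values

R data structures: scalar, vector, matrix, tensor

Data augmentation: problem: with a linear model it may cause collinearity; solution: methods such as trees are not affected by collinearity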

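Data modeling vs. algorithmic modeling

What building a model is for:

prediction (accuracy has to be high enough)
interpretation (is it plausible, or does it sound like a fortune teller?)
optimization

There is no necessary connection between model accuracy and model interpretability/complexity

Rashomon sets: different algorithms reach the same result with similar accuracy -> choose the one that is easiest to explain (a parsimonious model)

Parsimonious model: few parameters and easy to explain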

rule learning

Explainer: marks out how the model arrives at its decision

OOD-out of distribution

Steps for understanding data:

Univariate analysis
Bivariate analysis
Multivariate analysis

Zero variance / near-zero variance: the variable has (almost) no variation

Univariate Analysis

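How to handle missing values?

Complete cases: drop the rows that have missing values
Data imputation: fit a linear model and predict the missing values

Imputing continuous data: add a missing indicator (0/1) that tells the model whether the value was imputed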
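
A minimal sketch of the missing-indicator idea, assuming made-up income values and a simple mean fill. The notes suggest predicting the fill value from a linear model; the mean is swapped in here only to keep the example short.

```python
from statistics import mean

# Hypothetical incomes; None marks a missing value.
income = [52.0, None, 61.0, 48.0, None, 55.0]

observed = [v for v in income if v is not None]
fill_value = mean(observed)  # mean imputation (an assumption, not the notes' method)

# 1 if the value was imputed, 0 if it was observed.
income_missing = [1 if v is None else 0 for v in income]
income_filled = [fill_value if v is None else v for v in income]

print(income_filled)
print(income_missing)
```

Both columns go into the model, so it can learn whether being missing is itself informative.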

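Imputing categorical data:

merge (fold the missing values into an existing level)
new level (treat missing as a level of its own)

Feature selection: choose suitable variables to put into the model

Bivariate analysis

Two-sample t-test: for two variables of different types (x1, x2), e.g., continuous vs. categorical

One categorical factor: one-way ANOVA. Two categorical factors: two-way ANOVA. Chi-square: tests two categorical variables, e.g., sex and education level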

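A quick summary of when each test applies. t-test: mainly used to compare whether the means of two groups differ (two groups only).

Two-sample t-test: the two groups are independent, e.g., whether milk yield differs between cows fed brand A and cows fed brand B. It handles at most two categories (A and B); with more it cannot be used.

Paired-sample t-test: the two groups are related, e.g., whether the math scores of the same students differ a year apart.

ANOVA: categorical (x) against continuous (y). It does the same job as the t-test but can compare several groups at once.

One-way ANOVA: one independent variable, e.g., whether sex affects intelligence.

Two-way ANOVA: two independent variables, e.g., whether sex and ethnicity affect intelligence.

One-sample t-test: univariate
Two-sample t-test: bivariate
Paired t-test: bivariate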

Statistical learning:

General linear model: ANOVA, ANCOVA, SLR (simple linear regression)
Generalized linear model: logistic regression, Poisson regression
ML: CART, rule learning, DNN, (KNN, SVM), Bayesian network

Regular expression (regex)

score = 20 + 10 * hrs
Local interpretation explains how one record gets its result: because the formula is score = 20 + 10 * hrs
Global interpretation explains the overall pattern: the more hours studied, the higher the score

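Discrete:

Convert counts to proportions -> probability mass function; prob = mass. Example: 3 heads and 7 tails -> head 0.3, tail 0.7

Continuous: prob = density * range (approximately, for a narrow interval; exactly, the area under the density curve)

Density function: e.g., 160 cm to 170 cm. Cumulative function: e.g., below 150 cm. Quantile function (inverse CDF): given a probability, find the corresponding point

R prefixes: d: density function; p: cumulative distribution function; q: quantile function. pdf -> continuous; pmf -> discrete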
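
The density, cumulative, and quantile functions can be tried out with Python's `statistics.NormalDist`. This is a sketch only: the mean of 165 cm and standard deviation of 7 cm are assumed values chosen for illustration, and the comments name the R functions they roughly correspond to.

```python
from statistics import NormalDist

# Assumed height distribution: mean 165 cm, sd 7 cm (illustrative only).
height = NormalDist(mu=165, sigma=7)

density_at_165 = height.pdf(165)                    # like dnorm(165, 165, 7)
p_below_150 = height.cdf(150)                       # like pnorm(150, 165, 7)
p_160_to_170 = height.cdf(170) - height.cdf(160)    # area between 160 and 170
median_height = height.inv_cdf(0.5)                 # like qnorm(0.5, 165, 7)

# For a narrow interval, probability is roughly density * width.
width = 0.1
approx = height.pdf(165) * width
exact = height.cdf(165 + width / 2) - height.cdf(165 - width / 2)

print(density_at_165, p_below_150, p_160_to_170, median_height)
print(approx, exact)
```

The last two values agree closely because the density is nearly flat over a 0.1 cm interval; over a wide interval such as 160 to 170 cm the cumulative function has to be used instead.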

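Chi-square: the goodness-of-fit test compares one categorical variable with an expected distribution; the test of independence examines two categorical variables

Distance is symmetric, e.g., Taipei to Kaohsiung = Kaohsiung to Taipei

Divergence: measures how much one distribution differs from another and is not symmetric, e.g., d(a,b) != d(b,a)

KL divergence: can be used to gauge whether a system is starting to become unstable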
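
To see that a divergence is not symmetric, here is a small sketch of KL divergence for two discrete distributions. It reuses the head/tail proportions 0.3 and 0.7 from the coin example and compares them with a fair coin; the function name is only illustrative.

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions on the same outcomes."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.3, 0.7]   # observed coin: head 0.3, tail 0.7
q = [0.5, 0.5]   # fair coin

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(kl_pq, kl_qp)   # the two directions give different values
```

For monitoring, one common pattern is to compute the divergence between the training distribution and recent data and watch whether it grows over time.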

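Shapiro-Wilk normality test: tests whether a continuous sample follows a normal distribution
Kolmogorov-Smirnov test: tests whether a continuous sample follows a reference distribution such as the normal
K-W (Kruskal-Wallis) test: a rank-based test, commonly read as comparing medians across samples

Fisher's exact test: for crosstabs whose cell counts are small (< 5); tests whether two categorical variables are significantly associated
Chi-squared goodness-of-fit test: tests whether a discrete sample follows a specified distribution
Chi-square test of independence: tests whether two categorical variables are independent

Parametric: assumes the population follows a particular distribution. Nonparametric: assumes no particular distribution for the population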

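How data analysis is done today:

quick and dirty: do only light preparation, then run the data through a model; do not spend a lot of time tidying the data
incrementally
iteratively

accuracy (correctness) vs. precision (exactness)

y proxy: another variable that can stand in for y (similar to a proxy variable)

Model architecture design: elements of a good model:

deep (nested): (1) the output of one model is the input of another (2) moves closer and closer to the data (3) prevents overfitting
diverse: looks at the data from different angles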
interpretable
generative
causal

Mutual information (in bits): how much information x carries about y

Data processing inequality (DPI): after data is transformed, mutual information cannot increase; it stays the same or decreases

Correlation coefficients:

Pearson
Spearman
Kendall

Correlation coefficients are of limited use in practice: for a curved relationship, for example, the coefficient can be close to 0
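
A short sketch of the curve example: y is completely determined by x, yet the Pearson coefficient is zero because the relationship is not linear. The data points and the helper function are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
curve = [x ** 2 for x in xs]      # y is fully determined by x, but not linearly
line = [2 * x + 1 for x in xs]    # a perfectly linear relationship

r_curve = pearson(xs, curve)
r_line = pearson(xs, line)

print(r_curve, r_line)
```

Mutual information, noted earlier, does not have this blind spot, which is one reason it is preferred for detecting general dependence.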

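KNN:

lazy algorithm
belongs to the analogizer school
KNN for regression: average the two points nearest to a in order to estimate a
continuous features need rescaling to a common scale, otherwise the result is biased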
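
KNN regression with k = 2 can be sketched as follows. The hours-studied data are invented, and with a single feature no rescaling is needed; with several features on different scales they would have to be rescaled first.

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_regress(train_x, train_y, query, k=2):
    """Predict by averaging the targets of the k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical data: (hours studied,) -> exam score.
train_x = [(1.0,), (2.0,), (4.0,), (8.0,)]
train_y = [30.0, 40.0, 60.0, 100.0]

# The two nearest neighbours of 3.2 are 4.0 and 2.0, so the estimate is their average.
prediction = knn_regress(train_x, train_y, query=(3.2,), k=2)
print(prediction)
```

There is no training step at all: the work happens only when a query arrives, which is what "lazy" refers to.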

Continuous variables are scale-sensitive, so rescale them. Rescaling methods:

standardization
min-max normalization

Whenever you rescale, record how the transformation was done
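
One way to record the transformation is to keep the fitted parameters and reuse them on new data. The sketch below assumes a made-up feature and uses the population standard deviation (`pstdev`); R's `sd()` uses the sample version, so the numbers would differ slightly.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Record the parameters needed to repeat the transformation later."""
    return {"mean": mean(values), "sd": pstdev(values),
            "min": min(values), "max": max(values)}

def standardize(values, params):
    return [(v - params["mean"]) / params["sd"] for v in values]

def min_max(values, params):
    span = params["max"] - params["min"]
    return [(v - params["min"]) / span for v in values]

train = [10.0, 20.0, 30.0, 40.0]   # hypothetical training feature
test = [25.0, 50.0]                # new data, scaled with the TRAIN parameters

params = fit_scaler(train)
train_z = standardize(train, params)
train_01 = min_max(train, params)
test_01 = min_max(test, params)

print(params)
print(train_01, test_01)
```

Note that the test value 50.0 maps above 1 because it lies outside the training range; that is expected when the recorded training parameters are reused.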

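Training-test splitting principle: there must be at least enough data to train the model and at least enough to test it. Training set: train and select the best-fit model. Testing set: estimate how well the model generalizes

Data preprocessing object: package: caret

Analogizer school:

You are what you resemble -> infer the target from similar cases (birds of a feather flock together)
Euclidean distance
Manhattan distance
Lazy algorithm: the model is built only at inference time

Data drift is a change in the data between training and serving, i.e., the statistical properties of the features change (the distribution of X changes); for example, newly built homes keep getting smaller.

Concept drift is a change in the world that changes the ground truth, i.e., the statistical properties of the labels change (the mapping X → Y changes); for example, property speculation changes the price of a house of the same size.

R-squared: it is computed on the training set only, not on the testing set, so avoid relying on it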

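Categorical rescaling requires encoding. Encoding methods:

one-hot: uses a lot of memory (poor)
dummy coding (poor)
frequency encoding: encode each level by its relative frequency

Numeric encodings: embedding; target encoding: compresses the data into the form you need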
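
The encoders listed above can be compared on a toy column. The city names and the 0/1 target are invented for illustration, and the target encoding shown is the plain per-level mean without any smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical data: a categorical feature and a binary target.
city = ["Taipei", "Kaohsiung", "Taipei", "Tainan", "Taipei", "Kaohsiung"]
bought = [1, 0, 1, 0, 0, 1]

levels = sorted(set(city))   # ['Kaohsiung', 'Tainan', 'Taipei']

# One-hot: one 0/1 column per level.
one_hot = [[1 if c == level else 0 for level in levels] for c in city]

# Dummy coding: drop the first level and use it as the reference category.
dummy = [row[1:] for row in one_hot]

# Frequency encoding: replace each level by its relative frequency.
counts = Counter(city)
frequency = [counts[c] / len(city) for c in city]

# Target encoding (supervised): replace each level by the mean of the target.
totals, sizes = defaultdict(float), defaultdict(int)
for c, y in zip(city, bought):
    totals[c] += y
    sizes[c] += 1
target = [totals[c] / sizes[c] for c in city]

print(one_hot[0], dummy[0], frequency[0], target[0])
```

One-hot and dummy coding add a column per level, which is where the memory cost comes from; frequency and target encoding keep a single numeric column.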

categorical encoder:

supervised
unsupervised (one hot)

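Deep learning does not have to be built from ANNs; it can also be built from trees and similar learners (but not from linear models)

Good data transformation includes both linear and nonlinear steps: a linear model stacked on a linear model is still a linear model

Feature rescaling -> applies to continuous variables; linear models are also scale-sensitive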

Distance-based and coefficient-based models are scale-sensitive; rule-based models (e.g., tree, xgboost) are not

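Indicator variable: 1 means present, 0 means absent (binary)

In the figure above the fitted coefficient is negative because the data are paired (the same person is measured every 5 years); fitting one regression line per person gives the positive, logically sensible coefficient. Solution: run a paired t-test or wilcox.test, or take the difference between the two points and then run the regression

When observations are related to each other: in statistics, correlated data; in ML, a network

Paired data can be modeled with a mixed effect model: mixed effect = fixed + random. Fixed effect: does not change over time (e.g., sex for a given id). Random effect: changes over time (e.g., weight for a given id)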

mixed effect model:

mixed effect linear model
GEE (generalized estimating equations)
mixed effect neural net
mixed effect random forest

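Types of papers:

tutorial-like (introductory): many pages, reads like a textbook
survey (review): 2-column, < 15 pages: presents existing concepts and lets people already doing research in the field quickly catch up on its development
viewpoint: 2-column, < 5 pages, mostly found only in top journals
regular: 2-column, 8-10 pages (IEEE), with emphasis on novelty

How to judge whether a paper is worth reading:

read only what relates to your own research
try not to read papers more than 10 years old
if you cannot find the PDF of a paper, paste its URL into scihub.se

Reading order for a paper: abstract -> conclusion -> intro

The three giants of deep learning: Bengio, LeCun, Geoffrey Hinton

Signal and noise depend on the data, not on the model. Low noise, low signal commonly occurs in: time series. High signal, low noise commonly occurs when:

the data come from a laboratory environment or a factory
the answer is already written in the data, e.g., predicting sepsis when the medical record already states sepsis

If the same person is measured more than twice (not paired, or the number of measurements differs per person, e.g., because of death, or the time points or intervals differ), paired methods cannot be used

Model complexity has no necessary relationship with predictive accuracy. Overfitting means mistaking noise for signal

If the data set does not meet parametric assumptions (e.g., normality) -> use the Wilcoxon rank-sum test (WRS)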

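Causal effects can be studied with

the potential outcome framework
the Rubin causal model

Cross validation can be used to compare training against testing

Extrapolation: can be handled by adding constraints in the DB, e.g., mpg ~ weight with mpg > 0

Multi-task learning: train different types of models on the same data to help improve accuracy, e.g., to predict a person's age, first predict sex

PSO (particle swarm optimization): can be used to find an optimal configuration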
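
A minimal particle swarm optimizer, written as a sketch rather than a reference implementation. The inertia and attraction weights, swarm size, search bounds, and the sphere test function are all illustrative assumptions.

```python
import random

def pso(objective, dim=2, n_particles=20, n_iter=100, seed=0,
        inertia=0.7, c_personal=1.5, c_social=1.5, bound=5.0):
    """Minimal particle swarm optimizer (minimization); defaults are illustrative."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # each particle's own best position
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    global_pos, global_val = best_pos[g][:], best_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Keep some momentum, pull toward own best and toward the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_social * r2 * (global_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            value = objective(pos[i])
            if value < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], value
                if value < global_val:
                    global_pos, global_val = pos[i][:], value
    return global_pos, global_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_position, best_value = pso(sphere)
print(best_position, best_value)
```

Each particle only needs its own memory and the swarm's best-known point, which is the same local-rule idea behind the flocking demos on NetLogo Web.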

k-fold cross validation: randomly split the data into k parts; k = 10 trains 10 models, k = 20 trains 20 models. According to the literature, 10 to 20 folds work well, and there should be at least 5 folds
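
The splitting step of k-fold cross validation can be sketched as below. The row count of 23 and k = 5 are arbitrary, and no model is actually fitted; the loop only shows which rows each of the k models would train and validate on.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_rows=23, k=5)

splits = []
for held_out in range(len(folds)):
    validation = folds[held_out]
    training = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    splits.append((training, validation))   # one model is trained per split

print(len(splits))                   # k splits, hence k models
print([len(v) for _, v in splits])   # validation fold sizes
```

Every row is used for validation exactly once, so the averaged validation error uses all of the data without ever scoring a model on rows it was trained on.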

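Ensemble machine learning is a modeling technique that promotes deep and diverse models by using different learners and aggregating multiple models to predict the outcome.

An ordinary linear model usually does not overfit much unless the signal is fairly complex

Cross validation helps select a good model architecture and reveals overfitting or underfitting; it cannot directly improve accuracy

Using a y proxy to predict y gives low noise and high signal -> meaningless

Time series: usually low signal and low noise

If the validation error and the testing error differ a lot, something is wrong with the model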

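Steps for handling concept drift

detection
understanding
adaptation: (1) model retraining (2) model parameter update (3) model redesign

Confirmatory data analysis: confirms the relationship by which x affects y, e.g., lung cancer = smoke + gender + age. Dropping smoke from this model would be absurd, so a confirmatory analysis is done to establish how x affects y. Here smoke is the VOI (variable of interest)

Exploratory data analysis: confirmatory data analysis without a VOI

Feature selection: packages for choosing models. R: caret. Python: mltax

Number of models = 2^k - 1 (one per non-empty subset of k candidate variables). Human (expert) in the loop ML: experts take part in building the model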
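
The 2^k - 1 count can be checked by listing every non-empty subset of the candidate variables. The three variable names are borrowed from the lung cancer example; the enumeration itself is just an illustration.

```python
from itertools import combinations

features = ["smoke", "gender", "age"]   # k = 3 candidate variables

# Every non-empty subset of the features defines one candidate model.
candidate_models = [
    subset
    for size in range(1, len(features) + 1)
    for subset in combinations(features, size)
]

for model in candidate_models:
    print(" + ".join(model))
print(len(candidate_models))   # 2**3 - 1
```

The count doubles with each added variable, which is why exhaustive search is replaced by forward, backward, or hybrid selection once k grows.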

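Feature selection falls into:

Backward selection: 1. better suited to exploratory analysis 2. remove variables one at a time and stop when the error starts to rise 3.

Forward selection (what the instructor almost always uses): 1. better suited to confirmatory analysis 2. the lecture also labels this RFE; RFE properly stands for recursive feature elimination, which removes features and so runs backward 3. also called forced-in: add one variable at a time until adding a variable no longer lowers the error

Hybrid selection: 1. forward first, then backward 2. commonly used by frontline practitioners 3. combines the strengths of forward and backward

Ranking of variable importance: 1. ranks which variables matter more 2. best discussed with a domain expert

Rankings from a linear model are often unreliable because of collinearity

Lasso regression: a linear model with a penalty term added

The logistic function squeezes the regression output into a probability between 0 and 1

The odds ratio ranges from 0 to infinity. An odds ratio of 1 means no association; greater than 1 means a positive association, less than 1 a negative one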

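Time series types:

stationary
drifting (no fixed direction)
trend
seasonality
exogenous (externally influenced)

Time series features: lag, periodicity, trend

Logistic regression estimates its coefficients by MLE. epiDisplay: produces public-health and business-style reports (adjusted odds ratio comparison tables)

Exponentiating a glm coefficient (exp, the inverse of the natural log) gives the adjusted OR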
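
The link between a logistic regression coefficient and an odds ratio can be verified numerically. The intercept and the smoking coefficient below are hypothetical values, not estimates from any data set.

```python
from math import exp

def logistic(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical fitted coefficients on the log-odds scale.
intercept, beta_smoke = -2.0, 0.9

p_nonsmoker = logistic(intercept)
p_smoker = logistic(intercept + beta_smoke)

odds_ratio_from_probs = odds(p_smoker) / odds(p_nonsmoker)
odds_ratio_from_coef = exp(beta_smoke)   # exponentiate the coefficient

print(p_nonsmoker, p_smoker)
print(odds_ratio_from_probs, odds_ratio_from_coef)
```

The two odds ratios match because the coefficient is the difference in log-odds, so exponentiating it gives the ratio of odds.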

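Class imbalance: the classes are so unbalanced that accuracy looks extremely high yet means nothing, e.g., predicting the majority class for every case. Solutions:

optimal cutoff: adjust the threshold, though this may wrongly "kill the innocent" (discard things we actually want to know about)
resampling: up-sampling and down-sampling; split the data into smaller subsets, fit a model on each, then combine them
model architecture redesign: one-class learning, cost-sensitive learning

AUC ranges from 0 to 1; the closer to 1, the better the classifier. In practice, treat > 0.9 as too good to be true; 0.7 to 0.9 is a decent range

ROC curve: the closer to the top-left corner, the better

To balance sensitivity and specificity on the ROC curve, use Youden's J: J = sensitivity + specificity - 1

glm(formula, data, family): for logistic regression set family = binomial, e.g., glm(default ~ balance, data = Default, family = 'binomial')

The epiDisplay package can turn a logistic regression into a table

As the threshold rises, sensitivity falls and specificity rises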
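
A small sketch of how the threshold trades sensitivity against specificity, and how Youden's J picks a cutoff. The ten scores and labels are invented, and only five candidate thresholds are scanned.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    1,    0,    1,    0,    0,    0,    0]

def rates(threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

results = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    sensitivity, specificity = rates(threshold)
    youden_j = sensitivity + specificity - 1
    results.append((threshold, sensitivity, specificity, youden_j))
    print(threshold, sensitivity, specificity, round(youden_j, 3))

# The threshold with the largest J balances the two rates best.
best_threshold = max(results, key=lambda r: r[3])[0]
print(best_threshold)
```

Reading down the printed rows, sensitivity never increases and specificity never decreases as the threshold goes up.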

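Tree splitting: dividing a tree into many branches. Over-splitting in a tree = overfitting in regression

Model-based tree: first use a tree to partition the data, then predict within each partition with a linear model or a similar method

Variables closer to the root are more important because they were selected first

CART is also called recursive partitioning

Tree pruning limits the growth of a tree. It is divided into

pre-pruning: each resulting region must contain at least _% of the observations before the split is made
post-pruning: (1) cost-complexity (2) cost -> MR (misclassification rate)

Decision stump: a tree with only one split

rpart can be used to work out how much to prune a tree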

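Advantages of trees:

easy to explain
scale-independent, no encoding needed

Disadvantages:

inaccurate, because they rely on binary splits and predict with the mean
weak expressive power, because not every real-world problem is an if-then problem

Teacher-student model: also known as knowledge distillation

Bootstrapping: key feature: row sampling

Bootstrap aggregation (bagging) workflow: draw different data sets -> build different trees -> get different y-hats -> predict with the average of the y-hats (three in the lecture example). In practice:

64 to 128 trees tend to be more accurate
the R randomForest package grows 500 trees by default
overfit first where possible: reduce bias first, then deal with variance
averaging the models makes the variance smaller
OOB (out-of-bag) data can be used as validation data
usually about 1/3 of the rows end up as the validation set
the biggest weakness of bagged trees is that they almost all look alike, because the important variables are nearly the same in every tree. The fix: deliberately hide the important variables while the trees grow -> that is random forest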
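
Two of the points above can be checked by simulation: the share of rows left out of a bootstrap sample, and the drop in variance from averaging. No trees are built here; the second part assumes independent noisy predictions, which real bagged trees are not, so it overstates the benefit.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
n_rows = 1000
n_trees = 100   # illustrative count

oob_fractions = []
for _ in range(n_trees):
    # Row sampling with replacement: one bootstrap sample per tree.
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    oob_fractions.append(1 - len(in_bag) / n_rows)

average_oob = mean(oob_fractions)   # close to (1 - 1/n)^n, roughly 0.368
print(average_oob)

# Averaging noisy predictions shrinks the variance (independence assumed).
noise_sd = 10.0
single = [rng.gauss(0, noise_sd) for _ in range(2000)]
averaged = [mean(rng.gauss(0, noise_sd) for _ in range(25)) for _ in range(2000)]
var_single = pvariance(single)
var_averaged = pvariance(averaged)
print(var_single, var_averaged)
```

The out-of-bag share of roughly 0.37 is where the "about 1/3" rule of thumb comes from. Because bagged trees are correlated, the real variance reduction is smaller than in this simulation, which is the motivation for decorrelating the trees in a random forest.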

Random forest: drawback: it becomes more accurate but also harder to explain

Links

Tags:

Marketing Data Science

Course materials for this lecture

20240221

(1)Service Quality Scale (SERVQUAL) Parasuraman,A.;Berry,Leonard L.;Zeithaml,Valarie A., “SERVQUAL: A Multiple-Item Scale For Measuring Consumer Perceptions of Service Quality”, Journal of Retailing, 1988, 64, 1, 12-40.