Data Analysis and Inference - An Empirical Study of Public Works Bid Prices and Contract Scale: The Case of Bridges Built with the Advanced Shoring Method

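(a) Compute summary statistics for each variable (mean, standard deviation, maximum, minimum):

Because the data set has only 20 observations (15 ≤ n ≤ 40), we draw a kernel density plot of each variable over a normal density to check for symmetry and the absence of outliers. A histogram is too sensitive at this sample size and is harder to judge.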

[Figures: kernel density estimate with normal density overlay for each variable (y, x1, x2, x3, x4); vertical axis: Density.]

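If a variable is not normally distributed (for example, it has outliers), inference based on the t distribution is less accurate, and the statistical results of the multiple regression are correspondingly less accurate.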
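The comparison behind these plots can be sketched in Python as follows. This is a minimal illustration, not the Stata procedure itself: the y values are copied from the fitted-value tables in section (u), and the bandwidth uses a Silverman-style rule of thumb, which is an assumption and not necessarily the default that `kdensity` applies.

```python
import math

# Observed y values, copied from the fitted-value tables in section (u).
y = [8650.7, 7916.8, 9022.5, 9080.4, 10611.8, 10182.8, 12192.8, 9748.2,
     9393.1, 11770.2, 10797.6, 11905.6, 10480.9, 9791.9, 8904.7, 11075.2,
     10636.5, 12099.9, 16708.2, 10586.2]

def mean_sd(values):
    """Sample mean and sample standard deviation (n - 1 divisor)."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(var)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def kde(x, values, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(normal_pdf(x, v, h) for v in values) / len(values)

mu, sigma = mean_sd(y)
# Assumption: rule-of-thumb bandwidth, not Stata's exact default.
h = 1.06 * sigma * len(y) ** (-1 / 5)

# Tabulate both curves; large gaps between them hint at asymmetry or outliers.
for x in range(8000, 18001, 2000):
    print(f"{x:6d}  kde={kde(x, y, h):.3e}  normal={normal_pdf(x, mu, sigma):.3e}")
```

Reading the two columns side by side plays the same role as overlaying the curves: where the kernel estimate sits well above the normal density in one tail only, the variable is skewed.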

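(b) Compute and compare the correlation coefficients between variables:

The correlation coefficients r of y with x1, x2, and x4 are -0.4078, -0.5082, and -0.5216 respectively, which indicates moderate correlation.

Among the explanatory variables, x3 has the highest correlation with y. Among pairs of explanatory variables, x2 and x4 have the highest correlation (they are the least independent), so at this point we cannot yet decide whether to drop x2 or x4 and must examine further.

The scatter plots between variables follow.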

. scatter y x1
. scatter y x2
. scatter y x3
. scatter y x4
. scatter x1 x2
. scatter x1 x3
. scatter x2 x4
. scatter x3 x4

[Figures: scatter plots for the eight variable pairs listed above.]

(c) Regress, predict hat, and the regression line (linear).

t statistic of each coefficient, t = β̂ / SE(β̂):
x1: t = -536.3313 / 196.3344 = -2.73
x2: t = 0.5506769 / 0.713089 = 0.77
x3: t = -335.2841 / 61.94439 = -5.41
x4: t = -0.0331036 / 0.0254107 = -1.30
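The ratios can be recomputed directly. The sketch below only divides each reported coefficient by its reported standard error; the dictionary name `estimates` and the helper `t_statistic` are illustrative names, not part of the Stata output.

```python
# Coefficients and standard errors as reported in the regression output.
estimates = {
    "x1": (-536.3313, 196.3344),
    "x2": (0.5506769, 0.713089),
    "x3": (-335.2841, 61.94439),
    "x4": (-0.0331036, 0.0254107),
}

def t_statistic(coef, se):
    """t = coefficient / standard error (null hypothesis: coefficient = 0)."""
    return coef / se

t_values = {name: t_statistic(b, se) for name, (b, se) in estimates.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

A small |t| such as the one for x2 is what motivates testing whether that variable can be dropped.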

(f) P-value

(g) Re-run the regression, compare with the previous fit, predict hat, and the regression line (linear).

. predict hat

R-squared = 0.7995 = 79.95%, very close to the previous 80.06%, and:

x1 (P-value): 0.012 = 1.2% < previous 1.5%

x3 (P-value): 0.000 = 0.0% = previous 0.0%

x4 (P-value): 0.039 = 3.9% < previous 21.2%

So dropping x2 is acceptable.

(1) Normality check of the residuals

[Figure: kernel density estimate of the residuals with normal density overlay.]

Symmetric => (OK)

(2) Residuals versus normal quantiles

. predict r, resid
. qnorm r

[Figure: normal quantile plot; Residuals against Inverse Normal.]

Points follow the 45-degree line => (OK)

(3) Residuals versus fitted values

. rvfplot, yline(0)

[Figure: Residuals against Fitted values.]

No pattern => (OK)

About 68% of the data fall within one Root MSE of the fitted values.
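The 68% rule of thumb can be checked numerically. The sketch below uses the y and fitted values tabulated for the linear model in section (u)(1); because it works from those tabulated values and not from the Stata output, the Root MSE it computes need not match the reported one exactly. The divisor n - 4 (intercept plus three slopes) is the usual residual degrees of freedom and is stated here as an assumption.

```python
import math

# y and fitted values of the linear model (x1, x3, x4), from section (u)(1).
y = [8650.7, 7916.8, 9022.5, 9080.4, 10611.8, 10182.8, 12192.8, 9748.2,
     9393.1, 11770.2, 10797.6, 11905.6, 10480.9, 9791.9, 8904.7, 11075.2,
     10636.5, 12099.9, 16708.2, 10586.2]
fitted = [9632.954, 7312.915, 9733.938, 9612.512, 11407.1, 10101.52,
          11600.12, 8829.965, 8053.496, 11576.01, 11556.42, 12366.69,
          11294.3, 8857.887, 9994.603, 11334.46, 10740.38, 12010.98,
          14805.34, 10734.41]

n_params = 4  # assumption: intercept + x1, x3, x4
residuals = [obs - fit for obs, fit in zip(y, fitted)]
sse = sum(r * r for r in residuals)
root_mse = math.sqrt(sse / (len(y) - n_params))

# Share of observations whose residual is within one Root MSE of zero.
within = sum(1 for r in residuals if abs(r) <= root_mse)
share = within / len(y)
print(f"Root MSE = {root_mse:.1f}, share within one Root MSE = {share:.0%}")
```

If the residuals are roughly normal, the share should land near 68%; a much smaller or larger share would be a warning sign about the normality assumption.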

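(h) Expand to second-order terms

Power series expansion [4] ⇒ ∑ (a1*x1+a3*x3+a4*x4)^2, which contributes the terms

a1*x1+a3*x3+a4*x4+a5*x1^2+a6*x1*x3+a7*x1*x4+a8*x3^2+a9*x3*x4+a10*x4^2

(i) Introduce the second-order terms and compute summary statistics for each variable (mean, standard deviation, maximum, minimum)

Drop the x2 data and introduce x5(=x1^2), x6(=x1*x3), x7(=x1*x4), x8(=x3^2), x9(=x3*x4), x10(=x4^2).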
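As a small illustration of step (i), the following sketch builds x5 through x10 for a single observation. The function name `second_order_terms` is hypothetical; the example row (x1 = 6, x3 = 35, x4 = 1800) is the first hold-out case listed in section (v).

```python
def second_order_terms(x1, x3, x4):
    """Return the second-order terms x5..x10 as defined in step (i)."""
    return {
        "x5": x1 ** 2,
        "x6": x1 * x3,
        "x7": x1 * x4,
        "x8": x3 ** 2,
        "x9": x3 * x4,
        "x10": x4 ** 2,
    }

# Example row taken from the first hold-out case in section (v).
terms = second_order_terms(6, 35, 1800)
for name, value in terms.items():
    print(name, value)
```

The magnitudes differ by several orders (x5 = 36 against x10 = 3240000), which is worth keeping in mind when reading the coefficient sizes later.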

(j) Compute the correlation coefficients between variables.

(k) Regress.

R-squared is now 0.9591. We next try introducing third-order terms and observe the result.

(1) Residual analysis

. kdensity r, normal
(n() set to 20)

[Figure: kernel density estimate of the residuals with normal density overlay.]

Symmetric, and Root MSE is smaller => (OK)

(2) Residuals versus normal quantiles

[Figure: normal quantile plot; Residuals against Inverse Normal.]

Points follow the 45-degree line, and Root MSE falls from 917 to 523 => (OK)

(3) Residuals versus fitted values

[Figure: Residuals against Fitted values.]

Root MSE is smaller and there is no pattern => (OK)

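(l) Expand to third-order terms

(m) Introduce the third-order terms and compute summary statistics for each variable (mean, standard deviation, maximum, minimum)

Introduce x5(=x1^2), x6(=x1*x3), x7(=x1*x4), x8(=x3^2), x9(=x3*x4), x10(=x4^2), x11(=x1^3), x12(=x1^2*x3), x13(=x1^2*x4), x14(=x1*x3*x4), x15(=x3^3), x16(=x3^2*x1), x17(=x3^2*x4), x18(=x4^3), x19(=x4^2*x1), x20(=x4^2*x3).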

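(n) Compute the correlation coefficients between variables.

(o) Regress.

(p) Examine the results and decide.

Although R-squared = 0.9964, the coefficient of x1 was dropped. A possible reason is that the third-order terms are so large in magnitude that they absorb the coefficients of the smaller-valued terms. We therefore start deleting the terms with the smallest slope estimates β̂: we first try deleting x13(=x1^2*x4), x14(=x1*x3*x4), x17(=x3^2*x4), x18(=x4^3), x19(=x4^2*x1), and x20(=x4^2*x3) and observe the result; the F-test significance (Prob > F) for these terms is also > 0.05.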
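One further point worth checking is the parameter count, which is a reading of the setup and not a claim made in the original analysis. The sketch below enumerates every monomial of degree 1 to 3 in x1, x3, x4 and compares the number of parameters with the 20 observations; the `retained` list is the set of terms that appears in the final regression line in section (u)(5).

```python
from itertools import combinations_with_replacement

def monomials(variables, max_degree):
    """All monomials of degree 1..max_degree in the given variables."""
    terms = []
    for degree in range(1, max_degree + 1):
        for combo in combinations_with_replacement(variables, degree):
            terms.append("*".join(combo))
    return terms

n_obs = 20
full_terms = monomials(["x1", "x3", "x4"], 3)
df_full = n_obs - (len(full_terms) + 1)  # +1 for the intercept

# Terms kept in the final model, per the regression line in section (u)(5).
retained = ["x1", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10",
            "x11", "x12", "x16"]
df_final = n_obs - (len(retained) + 1)

print(f"full cubic model: {len(full_terms)} terms, residual df = {df_full}")
print(f"final model:      {len(retained)} terms, residual df = {df_final}")
```

With 19 terms plus an intercept, the full cubic model has as many parameters as observations, so a very high R-squared and a dropped coefficient are not surprising; pruning terms restores residual degrees of freedom.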

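(q) predict hat and the regression line (nonlinear).

R-square drops slightly to 98.39%, but the adjusted coefficient of determination (Adj R-square) rises to 95.62% and Root MSE falls to 393.46, so we stop deleting terms.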
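The adjusted figure follows from the standard formula Adj R² = 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below applies it with n = 20 and k = 12 retained regressors; the second call, for the second-order model with k = 9, is an illustrative calculation under that assumed count and is not a number reported in this study.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Final model: R-square 98.39%, 20 observations, 12 regressors.
adj_final = adjusted_r2(0.9839, 20, 12)

# Illustration only: second-order model, assuming 9 regressors (x1, x3, x4, x5..x10).
adj_quadratic = adjusted_r2(0.9591, 20, 9)

print(f"final model:        Adj R-square = {adj_final:.4f}")
print(f"second-order model: Adj R-square = {adj_quadratic:.4f}")
```

Because the penalty grows with k, a model can lose a little R-square and still gain adjusted R-square after terms are removed, which is the stopping argument used here.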

. predict hat
(option xb assumed; fitted values)

Regression [1]:

(1) Normality check of the residuals

. kdensity r, normal
(n() set to 20)

[Figure: kernel density estimate of the residuals with normal density overlay.]

Symmetric, and the density is much more concentrated => (OK)

(2) Residuals versus normal quantiles

. predict r, resid
. qnorm r

[Figure: normal quantile plot of the residuals.]

(3) Residuals versus fitted values

. rvfplot, yline(0)

[Figure: Residuals against Fitted values.]

No pattern => (OK)

(s) Plots

The fitted curves of y against each term (x1, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x16) follow.

. twoway (qfit y x1) (scatter y x1)
. twoway (qfit y x3) (scatter y x3)
. twoway (qfit y x4) (scatter y x4)
. twoway (qfit y x5) (scatter y x5)
. twoway (qfit y x8) (scatter y x8)
. twoway (qfit y x9) (scatter y x9)
. twoway (qfit y x10) (scatter y x10)
. twoway (qfit y x11) (scatter y x11)
. twoway (qfit y x12) (scatter y x12)
. twoway (qfit y x16) (scatter y x16)

[Figures: quadratic fit (Fitted values) and scatter of y against each term; surviving panel labels: x1, x3, x6, x7, x8, x9, x16.]

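(t) Discussion

Linear regression alone (x1, x3, x4) gives R-square = 79.95%.

Introducing the second-order terms x5(=x1^2), x6(=x1*x3), x7(=x1*x4), x8(=x3^2), x9(=x3*x4),

Back-transformed ŷ: 847.8611 799.2260719 643.6749481

Root MSE values computed on different transformed scales cannot be compared directly, so this study back-transforms the dependent variable and computes the MSE itself. The resulting Root MSE is clearly smaller than with the square-root and log transformations, and R-square improves noticeably.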
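The back-transformation step can be sketched as follows for the 1/y^2 case. The fitted values are the already back-transformed ones tabulated in section (u)(4). The divisor n - 3 = 17 is an assumption: it is used here because it reproduces the 643.67 figure quoted above, not because the source states it. The helper names are illustrative.

```python
import math

y = [8650.7, 7916.8, 9022.5, 9080.4, 10611.8, 10182.8, 12192.8, 9748.2,
     9393.1, 11770.2, 10797.6, 11905.6, 10480.9, 9791.9, 8904.7, 11075.2,
     10636.5, 12099.9, 16708.2, 10586.2]

# Back-transformed fitted values of the 1/y^2 model, from section (u)(4).
fitted_inv_sq = [9128.709292, 8032.193289, 9365.858116, 9245.00327,
                 10752.06661, 9667.36489, 10963.22524, 9284.766909,
                 8908.708064, 11366.57232, 11403.4649, 12529.40028,
                 11117.97618, 9205.746179, 10030.13568, 10983.04428,
                 10564.42818, 11952.28609, 15579.42382, 10179.73197]

def back_transform_inv_sq(t):
    """Invert t = 1 / y**2 for positive y."""
    return 1 / math.sqrt(t)

def restored_rmse(observed, restored, divisor):
    """Root MSE on the original scale of y."""
    sse = sum((o - r) ** 2 for o, r in zip(observed, restored))
    return math.sqrt(sse / divisor)

# Assumption: divisor n - 3 = 17, chosen because it matches the quoted figure.
rmse = restored_rmse(y, fitted_inv_sq, len(y) - 3)
print(f"restored Root MSE (1/y^2 model) = {rmse:.2f}")
print(f"round trip: {back_transform_inv_sq(1 / 10000 ** 2):.1f}")
```

Computing every model's error on the original scale of y is what makes the Root MSE values comparable across transformations.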

(1) Dependent variable in its original units

[Figure: kernel density estimate of y with normal density overlay.]

(2) Square-root transformation of the dependent variable [8]

[Figure: kernel density estimate of yi (square root of y) with normal density overlay.]

After the sqrt transformation, the kernel density of y is closer to normal than that of the original values.

(3) Log transformation of the dependent variable [8]

[Figure: kernel density estimate of yi (log of y) with normal density overlay.]

After the log transformation, the kernel density of y is clearly closer still to normal.

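(4) 1/(y^2) transformation of the dependent variable (this study)

In addition to the log and sqrt transformations of the dependent variable suggested in the literature [8], this study uses the chi-square statistics reported by STATA to find the transformation that makes y closest to normal. The most suitable one turns out to be 1/(y^2).

We use the ladder command to see which transformation yields the smallest chi-square value.

The reciprocal square transform has the smallest chi-square value. We then use the gladder command to inspect the transformed distributions graphically.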
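The idea of walking a ladder of powers can be sketched in Python. Note the substitution: this sketch ranks the transformations by absolute sample skewness, a simpler stand-in for the chi-square normality statistic that `ladder` reports, so its ranking is only indicative and may differ from the Stata result. The y values are the 20 observations tabulated in section (u).

```python
import math

y = [8650.7, 7916.8, 9022.5, 9080.4, 10611.8, 10182.8, 12192.8, 9748.2,
     9393.1, 11770.2, 10797.6, 11905.6, 10480.9, 9791.9, 8904.7, 11075.2,
     10636.5, 12099.9, 16708.2, 10586.2]

def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric sample)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Same family of transformations as the ladder of powers.
ladder = {
    "cubic": lambda v: v ** 3,
    "square": lambda v: v ** 2,
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "log": math.log,
    "1/sqrt": lambda v: 1 / math.sqrt(v),
    "inverse": lambda v: 1 / v,
    "1/square": lambda v: 1 / v ** 2,
    "1/cubic": lambda v: 1 / v ** 3,
}

skew_by_name = {name: skewness([f(v) for v in y]) for name, f in ladder.items()}

# Smallest |skewness| first (stand-in criterion, not Stata's chi-square).
for name, s in sorted(skew_by_name.items(), key=lambda kv: abs(kv[1])):
    print(f"{name:9s} skewness = {s:+.3f}")
```

The raw y is right-skewed because of one large observation, so transformations lower on the ladder pull that tail in; the chi-square test used in the study additionally accounts for kurtosis.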

. gladder y

[Figure: histograms by transformation (gladder output); vertical axis: Density; surviving panel titles: cubic, square, 1/square, 1/cubic.]

(a) Kernel density plot of y before transformation

[Figure: kernel density estimate of y with normal density overlay.]

(b) Kernel density plot of y after transformation

. generate yi = 1/(y^2)
. kdensity yi, normal
(n() set to 20)

[Figure: kernel density estimate of yi with normal density overlay.]

After the transformation, y is visibly closer to a normal distribution.

With 1/y^2 as the dependent variable, R-square is 85.49%, higher than both the 82% obtained with sqrt and the 83.72% obtained with log.

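(c) Confidence interval of ŷ with the 1/(y^2) transformation of the dependent variable

. twoway (lfitci hat y) (scatter hat y, mlabel(y))

[Figure: 95% CI and fitted values of hat against y, points labelled with y.]

Because R-square is 85.49%, many observations still lie outside the 95% confidence band.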

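(5) Power series expansion of the independent variables [4] (this study)

(a) Confidence interval of ŷ with the power series expansion (this study)

. twoway (lfitci hat y) (scatter hat y, mlabel(y))

[Figure: 95% CI and fitted values of hat against y, points labelled with y.]

After the power series expansion, R-square reaches 98.39%. The figure shows the 95% confidence interval of ŷ: of the 20 observations, only case 11 (y = 10797.6) lies outside the confidence band.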

(u) Fitted value ŷ

This study uses the fitted values ŷ to examine the mean relative error of each model on the data used to build it.

(1) Dependent variable in its original units

ŷ = 27112.17 - 544.13*x1 - 327.31*x3 - 0.02*x4

No.   y         Fitted value ŷ   Relative error (%)
1     8650.7    9632.954         11.3546187
2     7916.8    7312.915         7.62789258
3     9022.5    9733.938         7.88515378
4     9080.4    9612.512         5.86000617
5     10611.8   11407.1          7.49448727
6     10182.8   10101.52         0.79820874
7     12192.8   11600.12         4.86090152
8     9748.2    8829.965         9.41953386
9     9393.1    8053.496         14.261575
10    11770.2   11576.01         1.64984452
11    10797.6   11556.42         7.02767282
12    11905.6   12366.69         3.87288335
13    10480.9   11294.3          7.7607839
14    9791.9    8857.887         9.53862887
15    8904.7    9994.603         12.2396375
16    11075.2   11334.46         2.34090581
17    10636.5   10740.38         0.97663705
18    12099.9   12010.98         0.73488211
19    16708.2   14805.34         11.3887792
20    10586.2   10734.41         1.40003023

Mean relative error = 6.42465315

(2) Square-root transformation of the dependent variable, then back-transformation

ŷ = 178.92 - 2.31*x1 - 1.54*x3 - 0.00008*x4 (on the square-root scale)

No.   y         Fitted value ŷ   Relative error (%)
1     8650.7    9567.829001      10.60178789
2     7916.8    7221.219695      8.786126765
3     9022.5    9569.268897      6.060062479
4     9080.4    9410.458937      3.63485639
5     10611.8   11385.04365      7.286621812
6     10182.8   10015.27623      1.645240583
7     12192.8   11484.42296      5.809776623
8     9748.2    9012.339803      7.548671651
9     9393.1    8377.330851      10.813999
10    11770.2   11427.03189      2.915652209
11    10797.6   11452.49275      6.065171189
12    11905.6   12325.67132      3.52832355
13    10480.9   11251.99897      7.357113121
14    9791.9    9177.256804      6.277058037
15    8904.7    10030.77564      12.64585234
16    11075.2   11080.03812      0.043638857
17    10636.5   10641.9856       0.051590329
18    12099.9   11917.11076      1.510585992
19    16708.2   14683.02195      12.12085945
20    10586.2   10616.38432      0.285052682

Mean relative error = 5.749402048

(3) Log transformation of the dependent variable, then back-transformation

ŷ = 4.64 - 0.017*x1 - 0.013*x3 - 0.0000007*x4 (on the log10 scale)

No.   y         Fitted value ŷ   Relative error (%)
1     8650.7    9476.849855      9.550092601
2     7916.8    7496.680464      5.306683799
3     9022.5    9546.210355      5.804492645
4     9080.4    9391.068319      3.421306542
5     10611.8   11266.49945      6.16954203
6     10182.8   9963.868793      2.150009912
7     12192.8   11393.49589      6.555541862
8     9748.2    9128.722387      6.354789762
9     9393.1    8572.687644      8.734202313
10    11770.2   11431.09937      2.881009834
11    10797.6   11458.47971      6.120616666
12    11905.6   12361.03966      3.825423917
13    10480.9   11243.82299      7.279174497
14    9791.9    9214.016805      5.901645278
15    8904.7    10066.41892      13.04613198
16    11075.2   11082.12387      0.062516759
17    10636.5   10653.54145      0.16021682
18    12099.9   11930.81194      1.397433551
19    16708.2   14768.58885      11.60873791
20    10586.2   10538.04382      0.454895923

Mean relative error = 5.33922323

(4) 1/y^2 transformation of the dependent variable (this study)

ŷ = -1.39e-8 + 4.16e-10*x1 + 5.05e-10*x3 + 3.36e-14*x4 (on the 1/y^2 scale)

No.   y         Fitted value ŷ   Relative error (%)
1     8650.7    9128.709292      5.525671816
2     7916.8    8032.193289      1.457574892
3     9022.5    9365.858116      3.805576235
4     9080.4    9245.00327       1.812731492
5     10611.8   10752.06661      1.321798483
6     10182.8   9667.36489       5.061821007
7     12192.8   10963.22524      10.0844331
8     9748.2    9284.766909      4.754037578
9     9393.1    8908.708064      5.156891082
10    11770.2   11366.57232      3.42923379
11    10797.6   11403.4649       5.611107089
12    11905.6   12529.40028      5.239553452
13    10480.9   11117.97618      6.078449168
14    9791.9    9205.746179      5.986109141
15    8904.7    10030.13568      12.63867035
16    11075.2   10983.04428      0.832090754
17    10636.5   10564.42818      0.677589582
18    12099.9   11952.28609      1.219959724
19    16708.2   15579.42382      6.755821565
20    10586.2   10179.73197      3.839602774

Mean relative error = 4.564436154

With this transformation, R-square is clearly larger and Root MSE clearly smaller than with log or sqrt; the transformation suggested by the chi-square statistic gives a clear improvement.

(5) Power series expansion of the independent variables (this study)

Regression line (nonlinear) [3]:

ŷ = 444277.6 + 78305.7*x1 - 37061.25*x3 - 0.5329362*x4 - 31780.41*x5 + 4139.27*x6 + 0.0253283*x7 + 576.1512*x8 + 0.0076003*x9 + 1.02e-06*x10 + 728.6194*x11 + 473.2038*x12 - 107.3874*x16

No.   y         Fitted value ŷ   Relative error (%)
1     8650.7    8617.832         0.379946
2     7916.8    7941.909         0.317161
3     9022.5    8873.628         1.650008
4     9080.4    9190.648         1.214132
5     10611.8   10733.34         1.145329
6     10182.8   10207.25         0.240111
7     12192.8   12105.42         0.716652
8     9748.2    9764.294         0.165097
9     9393.1    9370.569         0.239868
10    11770.2   11426.14         2.923145
11    10797.6   11555.55         7.019615
12    11905.6   12029.65         1.041947
13    10480.9   10655.24         1.663407
14    9791.9    9750.694         0.420817
15    8904.7    8982.274         0.871158
16    11075.2   11002.33         0.657957
17    10636.5   10435.51         1.889625
18    12099.9   11618.56         3.978049
19    16708.2   16701.76         0.038544
20    10586.2   10593.4          0.068013

Mean relative error = 1.332029044

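With the log transformation of the dependent variable the mean relative error is 5.33922323%; with the power series expansion of the independent variables it is 1.332029044%.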
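The mean relative error used throughout this section is the average of |ŷ - y| / y × 100 over the 20 cases. As a minimal sketch, the code below recomputes it for two of the tabulated models, the linear model in (1) and the power series model in (5); the function name `mean_relative_error` is illustrative.

```python
y = [8650.7, 7916.8, 9022.5, 9080.4, 10611.8, 10182.8, 12192.8, 9748.2,
     9393.1, 11770.2, 10797.6, 11905.6, 10480.9, 9791.9, 8904.7, 11075.2,
     10636.5, 12099.9, 16708.2, 10586.2]

# Fitted values copied from tables (1) and (5) of this section.
fitted_linear = [9632.954, 7312.915, 9733.938, 9612.512, 11407.1, 10101.52,
                 11600.12, 8829.965, 8053.496, 11576.01, 11556.42, 12366.69,
                 11294.3, 8857.887, 9994.603, 11334.46, 10740.38, 12010.98,
                 14805.34, 10734.41]
fitted_power = [8617.832, 7941.909, 8873.628, 9190.648, 10733.34, 10207.25,
                12105.42, 9764.294, 9370.569, 11426.14, 11555.55, 12029.65,
                10655.24, 9750.694, 8982.274, 11002.33, 10435.51, 11618.56,
                16701.76, 10593.4]

def mean_relative_error(observed, predicted):
    """Mean of |predicted - observed| / observed, in percent."""
    errors = [abs(p - o) / o * 100 for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)

mre_linear = mean_relative_error(y, fitted_linear)
mre_power = mean_relative_error(y, fitted_power)
print(f"linear model:       {mre_linear:.4f} %")
print(f"power series model: {mre_power:.4f} %")
```

Because the error is relative to each y, cheap and expensive cases contribute on the same footing, which is why it is used here alongside Root MSE.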

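(v) Prediction

This study has only 25 cases available, of which the 20 cases falling within the 95% confidence interval were used to build the model. We therefore take just two cases for prediction and compare the log transformation of the dependent variable with the power series expansion of the independent variables.

(1) Log transformation of the dependent variable

Columns: No.; Section; Date; y = unit cost (NT$/m²); x1 = number of bidders (firms); x2 = bridge length (m); x3 = average span (m); x4 = total area (m²); predicted ŷ; relative error (%)

No.   Section   Date   y         x1   x2    x3   x4     Predicted ŷ   Relative error (%)
1     C326      8809   10848.2   6    240   35   1800   12070.91      11.271083
2     C326      8809   10124.7   6    208   33   1598   12819.804     26.619104

Mean relative error = 18.945094

(2) Power series expansion of the independent variables

No.   Section   Date   y         x1   x2    x3   x4     Predicted ŷ   Relative error (%)
1     C326      8809   10848.2   6    240   35   1800   12022.63      10.826036
2     C326      8809   10124.7   6    208   33   1598   11673.02      15.292502

Mean relative error = 13.059269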

These two cases show that the mean relative prediction error falls from 18.95% to 13.06%, a substantial reduction of about 6 percentage points.
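The two hold-out predictions can be reproduced from the regression lines in section (u). The sketch below makes two assumptions that the source does not state outright: the log model is taken to be base 10, because that is the base that reproduces the tabulated predictions, and the rounded coefficients printed in section (u) are used, so results agree with the tables only to within rounding. Function names are illustrative.

```python
def predict_log10(x1, x3, x4):
    """Log model from section (u)(3); assumption: base-10 log, inverted with 10**."""
    return 10 ** (4.64 - 0.017 * x1 - 0.013 * x3 - 0.0000007 * x4)

def predict_power_series(x1, x3, x4):
    """Power series model from section (u)(5), using the printed coefficients."""
    x5, x6, x7 = x1 ** 2, x1 * x3, x1 * x4
    x8, x9, x10 = x3 ** 2, x3 * x4, x4 ** 2
    x11, x12, x16 = x1 ** 3, x1 ** 2 * x3, x3 ** 2 * x1
    return (444277.6 + 78305.7 * x1 - 37061.25 * x3 - 0.5329362 * x4
            - 31780.41 * x5 + 4139.27 * x6 + 0.0253283 * x7
            + 576.1512 * x8 + 0.0076003 * x9 + 1.02e-06 * x10
            + 728.6194 * x11 + 473.2038 * x12 - 107.3874 * x16)

# Hold-out cases as (y, x1, x3, x4); x2 is not used by either model.
holdout = [(10848.2, 6, 35, 1800), (10124.7, 6, 33, 1598)]

def mean_relative_error(model):
    """Mean of |prediction - y| / y over the hold-out cases, in percent."""
    errors = [abs(model(x1, x3, x4) - y) / y * 100 for y, x1, x3, x4 in holdout]
    return sum(errors) / len(errors)

mre_log = mean_relative_error(predict_log10)
mre_ps = mean_relative_error(predict_power_series)
print(f"log model:          {mre_log:.2f} %")
print(f"power series model: {mre_ps:.2f} %")
print(f"reduction:          {mre_log - mre_ps:.2f} percentage points")
```

With only two hold-out cases the comparison is indicative at best; both models overpredict each case, and the gap between them comes mostly from the second one.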
