Allocation Determination for Outcome-Dependent Sampling Design with Continuous Outcome


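Master's Thesis, Master's Program, Department of Mathematics, National Taiwan Normal University. Advisor: Dr. 呂翠珊. Allocation Determination for Outcome-Dependent Sampling Design with Continuous Outcome. Graduate student: 張雪愉. July 2012 (Republic of China year 101).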

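Acknowledgments

Being able to write these acknowledgments happily means that I have passed my oral defense. For the completion of this thesis, I first wish to sincerely thank my advisor, Professor 呂翠珊, for being willing to take on a student with a weak foundation, and for patient guidance and research direction over these three years. Whenever my research hit a bottleneck, my advisor offered ideas and suggestions; before every seminar presentation and throughout the writing of the thesis, my advisor showed me how to present my ideas clearly; and I also received training in programming. It is thanks to this training that I was able to complete the thesis. Next, I thank Professor 蔡蓉青 and Professor 馬可容 for taking the time to attend my oral defense and for their suggestions, which improved the thesis. I thank all the professors of the statistics group in the department for the knowledge they imparted in class and for their comments on my seminar presentations, and I thank Professor 曹博盛 for guidance while I was taking the teacher education program. I also thank 益儒, 炯伊, 森任, 瑋翔, 韶寬, and 淑貞 for their help with coursework, as well as my classmates in the teacher education program. Thanks to all of you, the path of learning was smooth and enjoyable. Finally, I thank my family, who have always unconditionally supported every decision I made, so that I could pursue my studies without worry.

July 2012 (Republic of China year 101).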

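Chinese Abstract

In epidemiological research, the outcome-dependent sampling (ODS) design has been shown to improve study efficiency and reduce research cost. Although Zhou et al. (2002) offered some suggestions on sample selection for the ODS design with continuous data, we wish to examine in greater depth whether more general and more precise criteria exist for the allocation choices in this design, and whether the design is subject to limitations in practical use. We carry out extensive simulation studies to examine the performance of parameter estimation under the ODS design across different parameter settings, using the simple random sampling design as the benchmark, and we try different sampling strategies in order to improve efficiency. At the end of the thesis, we offer some suggestions and research directions on allocation choices for this design, as a reference for those who consider using it in future studies.

Keywords: outcome-dependent sampling design, simple random sampling design.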

Allocation Determination for Outcome-Dependent Sampling Design with Continuous Outcome. July 23, 2012.

Abstract

In epidemiological studies, the outcome-dependent sampling (ODS) design has been shown to enhance study efficiency and reduce cost. Although Zhou et al. (2002) gave some criteria for sample selection under the ODS scheme, we further examine whether there are more general and accurate criteria for sample selection and explore the limitations on the use of the ODS scheme. Through simulation studies, we investigate the performance of the ODS design, compared with the simple random sampling (SRS) design, under different settings and with various sampling strategies aimed at improving efficiency. Finally, we give researchers some suggestions and directions for using the ODS design in future observational studies.

Keywords: outcome-dependent sampling design, simple random sampling design.

Contents

1 INTRODUCTION
2 A SEMIPARAMETRIC EMPIRICAL LIKELIHOOD APPROACH FOR THE ODS DESIGN
  2.1 The ODS design with a continuous outcome
  2.2 Maximum semiparametric empirical likelihood estimator (MSELE)
3 SIMULATION RESULTS FOR THE ODS WITH A CONTINUOUS OUTCOME
  3.1 Introduction
  3.2 Simulation results for the ODS design with a normal distribution
    3.2.1 Data generation
    3.2.2 The estimator performance of ODS designs
    3.2.3 Unequal supplemental sampling strategies
    3.2.4 Linear regression model with an interaction term
  3.3 Simulation results for the ODS design with an exponential distribution
    3.3.1 Data generation
    3.3.2 The performance of the MSELE for ODS designs
    3.3.3 Unequal supplemental sampling strategies
  3.4 Sample size
4 APPLICATION TO THE COLLABORATIVE PERINATAL PROJECT DATA
  4.1 The CPP data
  4.2 The Conditional Model
  4.3 Results
5 ADDITIONAL SIMULATION RESULTS AND DISCUSSION
  5.1 Additional coefficient setting to the Section 3.2.2
  5.2 Other distributions of covariates
    5.2.1 X1 ∼ N(0, 4), X2 ∼ Bernoulli(0.5)
    5.2.2 X1 ∼ N(0, 4), X2 ∼ N(0, 1)
    5.2.3 Conclusions
  5.3 Size of SRS samples
6 CONCLUSION AND DIRECTION FOR FUTURE RESEARCH
Reference

List of Tables

3.1 Simulation results: Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 200.
3.2 Simulation results: Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400.
3.3 Simulation results: Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 800.
3.4 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 200.
3.5 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400.
3.6 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 800.
3.7 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 200.
3.8 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400.
3.9 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 800.
3.10 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 200.
3.11 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400.
3.12 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 800.
3.13 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 200.
3.14 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400.
3.15 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 800.
3.16 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 200.
3.17 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400.
3.18 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 800.
3.19 Simulation results: Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400 using unequal supplemental sampling I.
3.20 Simulation results: Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400 using unequal supplemental sampling II.
3.21 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400 using unequal supplemental sampling I.
3.22 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400 using unequal supplemental sampling II.
3.23 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400 using unequal supplemental sampling I.
3.24 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400 using unequal supplemental sampling II.
3.25 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400 using unequal supplemental sampling I.
3.26 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400 using unequal supplemental sampling II.
3.27 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400 using unequal supplemental sampling I.
3.28 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400 using unequal supplemental sampling II.
3.29 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400 using unequal supplemental sampling I.
3.30 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400 using unequal supplemental sampling II.
3.31 Simulation results: Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.05) and n = 400.
3.32 Simulation results: Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.1) and n = 400.
3.33 Simulation results: Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.5) and n = 400.
3.34 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.05) and n = 400.
3.35 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.1) and n = 400.
3.36 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.5) and n = 400.
3.37 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 200.
3.38 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400.
3.39 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 800.
3.40 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 200.
3.41 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400.
3.42 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 800.
3.43 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 200.
3.44 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400.
3.45 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 800.
3.46 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 200.
3.47 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400.
3.48 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 800.
3.49 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 200.
3.50 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400.
3.51 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 800.
3.52 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 200.
3.53 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400.
3.54 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 800.
3.55 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling I.
3.56 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling II.
3.57 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling I.
3.58 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling II.
3.59 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling I.
3.60 Simulation results: Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling II.
3.61 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling I.
3.62 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling II.
3.63 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling I.
3.64 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling II.
3.65 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling I.
3.66 Simulation results of relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling II.
3.67 Sample size needed for testing H0: β1 = 0 for a given power for models with Normal outcome.
3.68 Sample size needed for testing H0: β1 = 0 for a given power for models with Exponential outcome.
4.1 Results of model fitting for the CPP data with all complete samples (n = 1806).
4.2 Results of model fitting for the CPP data with limited sample size n = 800.
4.3 Results of model fitting for the CPP data with limited sample size n = 400 but ODS designs with different cut-points.
4.4 Results of model fitting for the CPP data with limited sample size n = 400 but ODS designs with different ρ.
4.5 Results of model fitting for the CPP data with limited sample size n = 200 but ODS designs with different cut-points.
5.1 Simulation results: Normal model with (β0, β1, β2) = (1, 1, −0.5), N = 400, X1 ∼ N(0, 1), and X2 ∼ Bernoulli(0.5).
5.2 Simulation results: Normal model with (β0, β1, β2) = (1, −0.1, −0.5), N = 400, X1 ∼ N(0, 1), and X2 ∼ Bernoulli(0.5).
5.3 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5), n = 400, X1 ∼ N(0, 4), and X2 ∼ Bernoulli(0.5).
5.4 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5), n = 400, X1 ∼ N(0, 4), and X2 ∼ Bernoulli(0.5).
5.5 Simulation results: Normal model with (β0, β1, β2) = (1, 1.0, −0.5), n = 400, X1 ∼ N(0, 4), and X2 ∼ Bernoulli(0.5).
5.6 Simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5), n = 400, X1 ∼ N(0, 4), and X2 ∼ N(0, 1).
5.7 Simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5), n = 400, X1 ∼ N(0, 4), and X2 ∼ N(0, 1).
5.8 Additional simulation results: Normal model with (β0, β1, β2) = (1, 0.1, −0.5).
5.9 Additional simulation results: Normal model with (β0, β1, β2) = (1, 0.5, −0.5).

List of Figures

3.1 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 200.
3.2 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 400.
3.3 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) and n = 800.
3.4 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 200.
3.5 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 400.
3.6 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) and n = 800.
3.7 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 200.
3.8 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 400.
3.9 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) and n = 800.
3.10 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) using unequal supplemental sampling I.
3.11 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) using unequal supplemental sampling II.
3.12 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.05, −0.5) using equal supplemental sampling.
3.13 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) using unequal supplemental sampling I.
3.14 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) using unequal supplemental sampling II.
3.15 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.1, −0.5) using equal supplemental sampling.
3.16 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) using unequal supplemental sampling I.
3.17 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) using unequal supplemental sampling II.
3.18 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2) = (1, 0.5, −0.5) using equal supplemental sampling.
3.19 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.05).
3.20 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.1).
3.21 Relative efficiencies (Var β̂_S / Var β̂_Z): Normal model with (β0, β1, β2, β3) = (1, 0.1, −0.5, 0.5).
3.22 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 200.
3.23 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400.
3.24 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 800.
3.25 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 200.
3.26 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400.
3.27 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 800.
3.28 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 200.
3.29 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400.
3.30 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 800.
3.31 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling I.
3.32 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using unequal supplemental sampling II.
3.33 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.05, −0.4) and n = 400 using equal supplemental sampling.
3.34 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling I.
3.35 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using unequal supplemental sampling II.
3.36 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.1, −0.4) and n = 400 using equal supplemental sampling.
3.37 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling I.
3.38 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using unequal supplemental sampling II.
3.39 Relative efficiencies (Var β̂_S / Var β̂_Z): Exponential model with (β0, β1, β2) = (1, 0.5, −0.4) and n = 400 using equal supplemental sampling.
5.1 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.1.
5.2 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.1.
5.3 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.3.
5.4 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.4.
5.5 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.5.
5.6 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.6.
5.7 Relative efficiency of β̂_Z to β̂_S for β̂1 and β̂2 under the model in Table 5.7.

1. INTRODUCTION

In epidemiology, observational studies are widely used to investigate the relationship between an outcome and environmental or individual exposures because of their advantages in gathering data. Cohort and case-control designs are two commonly used designs in observational studies. A cohort design follows individuals with and without the exposure over a long period in order to compare the occurrence of disease in exposed individuals with that in nonexposed individuals. A case-control design is retrospective: it studies a group with the disease (the cases) and a group without the disease (the controls) and compares the frequency of exposure in the two groups to judge whether the exposure is related to the disease. Although the cohort design clearly demonstrates an appropriate temporal sequence between the exposure and the outcome, some large cohort studies cost hundreds of millions of dollars. Hence, when the budget is limited, case-control designs are often preferred.

However, the case-control design applies only to binary outcomes. In practice, many outcomes are continuous, such as IQ scores and audiometry results. Some recent studies have extended the idea behind the case-control design, namely concentrating resources where the greatest amount of information lies, to continuous or multicategorical outcomes, and have shown that appropriate estimators exist for analyzing such data. Such designs for continuous outcomes fall under outcome-dependent sampling (ODS) schemes. The samples in an ODS design are retrospective, and the selection of some (or even all) of them depends on the outcome. Zhou et al. (2002) proposed a general ODS scheme that enhances study efficiency and reduces cost compared with simple random sampling when the outcome is continuous. Moreover, they showed that the semiparametric empirical likelihood method provides a better estimator for analyzing data from such ODS schemes.

Weaver & Zhou (2005) considered ODS schemes in which exposure measurements are observed only in a small subsample of the population, and developed the maximum estimated likelihood estimator (MELE) to analyze such designs. Wang & Zhou (2006) proposed a semiparametric empirical likelihood estimation procedure that can be applied to both binary and multicategorical outcomes under ODS schemes. In addition, ODS designs with a continuous outcome have been extended to the two-stage case-control design: a new two-stage outcome-dependent sampling scheme was proposed, and a more efficient estimation procedure for inference about the regression parameters in such designs was developed (Song, Zhou, and Kosorok, 2009; Zhou et al., 2010).

The studies above have shown that an ODS design can enhance efficiency and reduce cost compared with a simple random sampling design. Indeed, in many practical studies, investigators chose an ODS design because of budget limitations. For example, Longnecker et al. (2004) studied the association in humans between in utero exposure to polychlorinated biphenyls (PCBs), measured from maternal third-trimester serum, and the audiometric evaluation of offspring at approximately 8 years of age, using the Collaborative Perinatal Project (CPP) data. Because the blood serum assay is expensive, they analyzed only subsamples of the CPP data: 726 subjects with an audiometry result, out of 1200 subjects selected at random from the population, and an additional 200 subjects from among the children whose 8-year audiometry results showed sensorineural hearing loss (SNHL).

Another example is Liao et al. (1997), who examined the relationship between familial history of stroke and the prevalence of stroke in the Family Heart Study (FHS). The FHS is a population-based, multicenter study designed to identify the genetic and nongenetic determinants of coronary heart disease (CHD), and its participants were recruited from three ongoing parent cohort studies: the Forsyth and Minneapolis cohorts of the ARIC Study, the Framingham Offspring Study, and the Utah Family Tree Study. Because the subject of the Liao et al. (1997) study is the probands' personal and familial histories of stroke, selecting more probands into the study would rapidly increase its cost. To avoid a huge cost, the sampling process for the FHS consisted of two phases: (1) in phase I, a simple random sample of approximately 500 probands from each study site and another 500 probands with a high family risk score for CHD were sampled to collect personal histories of CHD and related conditions; (2) in phase II, 150 randomly selected families and 150 higher-risk families were invited to a detailed clinical examination.

However, the literature reviewed above focuses on developing new efficient statistical methods for the ODS design. How to select samples from the population according to the ODS design under a limited budget therefore becomes an important and practical issue. Although Zhou et al. (2002) gave some criteria for sample selection under the ODS scheme with a continuous outcome, we ask whether there are more general and accurate criteria for sample selection and whether there are any restrictions on the use of the ODS scheme. The purpose of this thesis is to examine these two questions through large, systematic simulation studies. In Chapter 2, we derive the maximum semiparametric empirical likelihood estimator (MSELE) for the ODS design with a continuous outcome, following the semiparametric empirical likelihood approach proposed by Zhou et al. (2002). In Chapter 3, we conduct simulation studies to examine various sampling designs when the outcome follows a normal distribution or an exponential distribution. We also discuss the sample sizes required to achieve a given power for various coefficient settings and sampling designs. In Chapter 4, we apply the ODS design to analyze the CPP data.
In Chapter 5, we conduct additional simulation studies under the same model as in Chapter 4 but with different covariate distributions, in order to discuss the robustness of the MSELE. In Chapter 6, we summarize our research and offer investigators who choose the ODS design some criteria for sample selection under the ODS scheme with a continuous outcome.

2. A SEMIPARAMETRIC EMPIRICAL LIKELIHOOD APPROACH FOR THE ODS DESIGN

In this chapter, we define the notation used to describe the ODS design and derive the estimators following the semiparametric empirical likelihood approach proposed by Zhou et al. (2002).

2.1. The ODS design with a continuous outcome

Notation and Data Structure

We denote by $Y$ a continuous outcome and by $X$ a $p \times 1$ vector of covariates, whose components can be continuous or discrete. Assume that the relationship between $X$ and $Y$ follows a parametric model $f_{Y|X}(y|x; \beta)$, where $\beta$ is the regression parameter of interest. We assume that $X$ has a probability density function $g_X(\cdot)$ and a cumulative distribution function $G_X(\cdot)$, both of which are left unspecified. Assume that the domain of $Y$ is partitioned into $K$ mutually exclusive intervals, $C_k = (a_{k-1}, a_k]$, $k = 1, \ldots, K$, by known constants $a_k$ satisfying $-\infty = a_0 < a_1 < \cdots < a_K = \infty$.

The ODS design proposed by Zhou et al. (2002) includes two components: an overall simple random sample (the SRS sample) and additional samples from each of the $K$ intervals of $Y$ (the supplemental samples). Specifically, $n_0$ observations are selected at random from the underlying population, forming the SRS sample. In addition, $n_k$ observations are randomly chosen from the $k$th interval, $k = 1, \ldots, K$, forming the supplemental samples. Hence, the observed data of the ODS design can be summarized as:

The SRS component: $\{y_{0i}, x_{0i}\}$ for $i = 1, \ldots, n_0$.

The supplemental components: $\{y_{kj}, x_{kj} \mid y_{kj} \in C_k\}$ for $k = 1, \ldots, K$ and $j = 1, \ldots, n_k$.

Likelihood Function

Using Bayes' rule, the likelihood function for the SRS component can be written as

$$L_{SRS} = \prod_{i=1}^{n_0} f_\beta(y_{0i}|x_{0i})\, g_X(x_{0i}),$$

and the likelihood function for the supplemental components is

$$
\begin{aligned}
L_{supple} &= \prod_{k=1}^{K} \prod_{j=1}^{n_k} f_\beta(y_{kj}, x_{kj} \mid y_{kj} \in C_k) \\
&= \prod_{k=1}^{K} \prod_{j=1}^{n_k} \frac{f_\beta(y_{kj}|x_{kj})\, g_X(x_{kj}) \cdot I_{(a_{k-1}, a_k)}(y_{kj})}{F(a_k) - F(a_{k-1})} \\
&= \prod_{k=1}^{K} \prod_{j=1}^{n_k} \frac{F(a_k|x_{kj}) - F(a_{k-1}|x_{kj})}{F(a_k) - F(a_{k-1})} \cdot \frac{f_\beta(y_{kj}|x_{kj})\, g_X(x_{kj})}{F(a_k|x_{kj}) - F(a_{k-1}|x_{kj})},
\end{aligned}
$$

where $F(u) = \Pr(Y \le u)$, $F(u|x) = \Pr(Y \le u \mid x)$, and $I_A(y)$ equals 1 if $y \in A$ and 0 otherwise. Hence, the likelihood function of the ODS design can be combined as

$$
\begin{aligned}
L &= L_{SRS} \times L_{supple} \\
&= \left\{ \prod_{i=1}^{n_0} f_\beta(y_{0i}|x_{0i}) \times \prod_{k=1}^{K} \prod_{j=1}^{n_k} \frac{f_\beta(y_{kj}|x_{kj})}{F(a_k|x_{kj}) - F(a_{k-1}|x_{kj})} \right\} \\
&\quad \times \left\{ \prod_{i=1}^{n_0} g_X(x_{0i}) \times \prod_{k=1}^{K} \prod_{j=1}^{n_k} \frac{F(a_k|x_{kj}) - F(a_{k-1}|x_{kj})}{F(a_k) - F(a_{k-1})}\, g_X(x_{kj}) \right\} \\
&= L_1(\beta) \times L_2(\beta, G_X),
\end{aligned} \tag{2.1}
$$

where $L_1(\beta)$ and $L_2(\beta, G_X)$ denote the quantities in the first and second brackets, respectively. Because $G_X$, which is unspecified, cannot be separated from $\beta$ in $L_2$, maximizing $L_2$ with respect to $\beta$ is difficult. Zhou et al. (2002) proposed a semiparametric empirical likelihood approach to solve this problem, and we describe the procedure in the next section.

2.2. Maximum semiparametric empirical likelihood estimator (MSELE)

Without loss of generality, assume that $K = 3$ and $n_2 = 0$, $n_1 > 0$, $n_3 > 0$. To estimate $\beta$, Zhou et al. (2002) proposed first profiling the likelihood function $L_2(\beta, G_X)$ by fixing $\beta$ and obtaining the empirical likelihood function of $G_X$ over all distributions whose support contains the observed $X$ values. We can then estimate $\beta$ by maximizing the resulting overall likelihood function $L$. Specifically, to maximize over $G_X$ among all distributions whose support contains the observed $X$ values, we only need to consider discrete distributions with jumps at each of the observed points (Owen, 1988, 1990; Qin and Lawless, 1994). Denote $\pi_1 = F(a_1)$, $\pi_3 = \bar F(a_2) = 1 - F(a_2)$, and $p_i = g_X(x_i)$, where $x_i$ ranges over $x_{01}, \ldots, x_{0n_0}, x_{11}, \ldots, x_{1n_1}, x_{31}, \ldots, x_{3n_3}$, and $n = n_0 + n_1 + n_3$. Then, for a fixed $\beta$, $L(\beta, G_X)$ can be written as

$$
L(\beta, \{p_i\}) = \left[ \prod_{i=1}^{n_0} f_\beta(y_{0i}|x_{0i})\, g_X(x_{0i}) \right] \times \left[ \prod_{j=1}^{n_1} \frac{f_\beta(y_{1j}|x_{1j})\, g_X(x_{1j})}{F(a_1)} \right] \left[ \prod_{k=1}^{n_3} \frac{f_\beta(y_{3k}|x_{3k})\, g_X(x_{3k})}{1 - F(a_2)} \right] \tag{2.2}
$$

$$
\propto \left( \prod_{i=1}^{n} p_i \right) \pi_1^{-n_1}\, \pi_3^{-n_3}. \tag{2.3}
$$

We search for values $\{\hat p_i, \forall i\}$ maximizing (2.3) subject to the following constraints:

$$
\left\{ p_i \ge 0, \quad \sum_{i=1}^{n} p_i = 1, \quad \sum_{i=1}^{n} p_i \{F(a_1|x_i) - \pi_1\} = 0, \quad \sum_{i=1}^{n} p_i \{\bar F(a_2|x_i) - \pi_3\} = 0 \right\}. \tag{2.4}
$$

An explicit expression can be derived using the Lagrange multiplier argument:

$$
H = \log L(\beta, \{p_i\}) + \rho \left( 1 - \sum_{i=1}^{n} p_i \right) - n\lambda_1 \sum_{i=1}^{n} p_i \{F(a_1|x_i) - \pi_1\} - n\lambda_3 \sum_{i=1}^{n} p_i \{\bar F(a_2|x_i) - \pi_3\},
$$

where $\rho$, $\lambda_1$, and $\lambda_3$ are Lagrange multipliers. Taking derivatives of $H$ with respect to $p_i$ and applying the constraints in (2.4), we obtain $\hat\rho = n$ and

$$
\hat p_i = \frac{1}{n} \cdot \frac{1}{1 + \lambda_1 \{F(a_1|x_i) - \pi_1\} + \lambda_3 \{\bar F(a_2|x_i) - \pi_3\}}.
$$

Similarly, taking derivatives of $H$ with respect to $\pi_1$ and $\pi_3$, we obtain

$$
\hat\pi_1 = \frac{n_1}{n \lambda_1} \quad \text{and} \quad \hat\pi_3 = \frac{n_3}{n \lambda_3}.
$$

Plugging $\hat\rho$, $\hat p_i$, $\hat\pi_1$, and $\hat\pi_3$ back into the overall likelihood function (2.2), we have the log-likelihood function

$$
\begin{aligned}
\ell(\beta, \{\hat p_i\}) &= \sum_{i=1}^{n} \log f_\beta(y_i|x_i) - \sum_{i=1}^{n} \log n - \sum_{i=1}^{n} \log\left[ 1 + \lambda_1 \left\{ F(a_1|x_i) - \frac{n_1}{n \lambda_1} \right\} + \lambda_3 \left\{ \bar F(a_2|x_i) - \frac{n_3}{n \lambda_3} \right\} \right] \\
&\quad - n_1 \log\left( \frac{n_1}{n \lambda_1} \right) - n_3 \log\left( \frac{n_3}{n \lambda_3} \right).
\end{aligned} \tag{2.5}
$$

We then obtain the estimator of $(\beta, \lambda_1, \lambda_3)$ by maximizing the log-likelihood function (2.5) using the Newton-Raphson iterative procedure.
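To make (2.5) concrete, the following is a minimal Python sketch that only evaluates the profile log-likelihood for a normal linear model $Y \mid X \sim N(x^T\beta, \sigma^2)$. It is not the thesis's R program and does not perform the Newton-Raphson maximization; the function names `normal_cdf` and `profile_loglik`, the argument layout, and the guard that returns minus infinity for non-positive multipliers are all assumptions made for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def profile_loglik(beta, sigma, lam1, lam3, data, a1, a2, n1, n3):
    """Evaluate (2.5) for Y | X ~ N(x'beta, sigma^2) with K = 3 and n2 = 0.

    data   : list of (y, x) pairs pooling the SRS and both supplemental
             samples; x is a tuple that already contains the intercept 1.0
    n1, n3 : supplemental sample sizes from the lower and upper tails
    """
    # Assumption: pi1-hat and pi3-hat are probabilities, so both
    # multipliers must be positive; other values are treated as infeasible.
    if lam1 <= 0.0 or lam3 <= 0.0:
        return float("-inf")
    n = len(data)
    pi1 = n1 / (n * lam1)  # profiled estimate of F(a1)
    pi3 = n3 / (n * lam3)  # profiled estimate of 1 - F(a2)
    total = 0.0
    for y, x in data:
        mu = sum(b * xj for b, xj in zip(beta, x))
        # log f_beta(y | x) under the normal model
        total += -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        total -= (y - mu) ** 2 / (2.0 * sigma ** 2)
        f1 = normal_cdf((a1 - mu) / sigma)            # F(a1 | x)
        f3_bar = 1.0 - normal_cdf((a2 - mu) / sigma)  # 1 - F(a2 | x)
        denom = 1.0 + lam1 * (f1 - pi1) + lam3 * (f3_bar - pi3)
        total -= math.log(n) + math.log(denom)
    total -= n1 * math.log(pi1) + n3 * math.log(pi3)
    return total
```

In practice this function would be handed to a numerical optimizer over $(\beta, \lambda_1, \lambda_3)$, with $\sigma$ either fixed or included among the parameters; that step is left out here.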

3. SIMULATION RESULTS FOR THE ODS WITH A CONTINUOUS OUTCOME

3.1. Introduction

In this chapter, we consider two outcome distributions, the normal distribution and the exponential distribution, and conduct simulation studies to evaluate the efficiency of the ODS design. In Section 3.2, the outcome follows the normal distribution; in Section 3.3, we examine whether the results of Section 3.2 still hold when the outcome follows the exponential distribution. In Sections 3.2 and 3.3, we investigate the performance of the MSELE for the ODS design, denoted by $\hat\beta_Z$, compared with the maximum likelihood estimator for the simple random sampling design, denoted by $\hat\beta_S$, under different parameter settings. We also investigate how the efficiency of the ODS design, relative to the simple random sampling design with the same sample size, changes with the cut-points and with the proportion of SRS samples, in order to explore the optimal sample selection when the total sample size of the ODS design is fixed. Then, in Section 3.4, we study the sample sizes needed to achieve a given power. All simulations are conducted using programs written in R 2.4.1.

3.2. Simulation results for the ODS design with a normal distribution

3.2.1. Data generation

Generating population data

In this section, we describe the data generation procedure. First, we assume that the conditional distribution of $Y$ given the covariates $X$ is normal:

$$Y \mid X \sim N(\mu, \sigma^2).$$

Without loss of generality, we include two covariates, $X = (X_1, X_2)^T$, in our model; thus $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. We then generate the population data under the model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon, \tag{3.1}$$

where $\epsilon \sim N(0, \sigma^2)$. Our goal is to estimate $(\beta_0, \beta_1, \beta_2)$. We set $X_1 \sim N(0, 1)$, $X_2 \sim \mathrm{Bernoulli}(0.5)$, and $\epsilon \sim N(0, 1)$. We consider the following coefficient settings:

1. $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$
2. $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$
3. $(\beta_0, \beta_1, \beta_2) = (1, 0.5, -0.5)$

The algorithm is as follows:

Step 1: Generate 10,000 values of $X_1$ and $X_2$ independently from $N(0, 1)$ and $\mathrm{Bernoulli}(0.5)$, respectively.

Step 2: Generate 10,000 independent error terms $\epsilon$ from $N(0, 1)$.

Step 3: Plug the generated $X_1$, $X_2$, and $\epsilon$ into model (3.1) to obtain the corresponding $Y$.
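Steps 1 to 3 above can be sketched in a few lines of Python. The thesis's simulations were written in R, so this is only an illustrative translation under the stated distributions; the function name `generate_population`, the tuple layout `(y, x1, x2)`, and the seed are assumptions.

```python
import random

def generate_population(beta, size=10_000, seed=0):
    """Generate (y, x1, x2) rows from model (3.1) with N(0, 1) errors."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    b0, b1, b2 = beta
    population = []
    for _ in range(size):
        x1 = rng.gauss(0.0, 1.0)              # Step 1: X1 ~ N(0, 1)
        x2 = 1 if rng.random() < 0.5 else 0   # Step 1: X2 ~ Bernoulli(0.5)
        eps = rng.gauss(0.0, 1.0)             # Step 2: error term ~ N(0, 1)
        y = b0 + b1 * x1 + b2 * x2 + eps      # Step 3: plug into model (3.1)
        population.append((y, x1, x2))
    return population

# Coefficient setting 1: (beta0, beta1, beta2) = (1, 0.05, -0.5)
population = generate_population((1.0, 0.05, -0.5))
```

The other two coefficient settings only change the `beta` argument.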

Sample collection from the ODS designs

Let the ODS sample sizes under investigation be $n = 200$, $400$, and $800$, and let the domain of $Y$ be partitioned into three mutually exclusive intervals by two cut-points, $a_1$ and $a_2$. We collect the ODS samples by the following steps:

Step 1: Randomly draw an overall random sample of size $n_0$ from the generated population.

Step 2: Randomly draw a supplemental random sample of size $n_1$ from the lower tail of $Y$ ($\{Y < a_1\}$) and a supplemental random sample of size $n_3$ from the upper tail of $Y$ ($\{Y > a_2\}$).

For the ODS design, we simultaneously investigate the impact of two factors for a given total sample size: the cut-points $a_1$ and $a_2$, and the proportion of the SRS sample in the ODS design ($\rho = n_0/n$). We place the cut-points at the (10th, 90th), (20th, 80th), and (30th, 70th) percentiles of $Y$ and vary $\rho$ from 0.1 to 0.9. The case in which only an overall random sample is drawn, i.e., $\rho = 1$, is the simple random sampling (SRS) design. Under each setting, we use the proposed estimator, denoted by $\hat\beta_Z$, to estimate the regression coefficients for the ODS design, and the maximum likelihood estimator, denoted by $\hat\beta_S$, for the SRS design. We run 2000 simulation replicates for each setting and use the estimated means (Mean) and estimated standard errors (SE) to evaluate the performance of the estimators, where "Mean" and "SE" are the mean and the standard deviation of the 2000 estimates, respectively. The results are summarized in Tables 3.1 - 3.9.

3.2.2. The estimator performance of ODS designs

Tables description

Tables 3.1, 3.2, and 3.3 give the simulation results for the true coefficients $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$ with total sample sizes $n = 200$, $400$, and $800$, respectively; Tables 3.4, 3.5, and 3.6 give the results for true $\beta_1 = 0.1$ with $n = 200$, $400$, and $800$, respectively; and Tables 3.7, 3.8, and 3.9 give the results for true $\beta_1 = 0.5$ with $n = 200$, $400$, and $800$, respectively.

Unbiasedness and Efficiency of the estimator for the ODS design

In Table 3.1, for the ODS designs with every combination of $(a_1, a_2)$ and $\rho$ (that is, all rows except $\rho = 1$), the estimated means of $\hat\beta_1$ and $\hat\beta_2$ are close to the true values 0.05 and -0.5, respectively. That is, the proposed estimator for the ODS design appears unbiased. We then compare the standard errors of the ODS design with those of the SRS design. All the standard errors of the ODS designs are smaller than, or to three decimal places equal to, the standard errors for the SRS design, 0.070 for $\hat\beta_1$ and 0.143 for $\hat\beta_2$. In addition, under the same cut-points, a smaller $\rho$ generally gives a smaller standard error; under the same $\rho$, the case with cut-points at the 10th and 90th percentiles, written $(a_1, a_2) = (0.1, 0.9)$, has the smallest standard error. That is, the estimator of the ODS design is more efficient than the estimator of the SRS design, as expected, and the ODS design with smaller $\rho$ and $(a_1, a_2) = (0.1, 0.9)$ gains more efficiency. In other words, an ODS design with fewer SRS samples and with supplemental samples drawn from the two extreme tails gains more efficiency than other ODS designs with the same total sample size. Tables 3.2 through 3.9 show similar results.

The performance of relative efficiencies

We further investigate the amount of information gained by using the ODS design instead of a simple random sampling design of the same size.
We define the relative efficiency (RE) as the ratio of the variance of the SRS estimator ($\hat\beta_S$) to the variance of the ODS estimator ($\hat\beta_Z$); i.e., $\mathrm{RE} = \mathrm{Var}(\hat\beta_S)/\mathrm{Var}(\hat\beta_Z)$. We use REs to evaluate the efficiency gained, and the results are summarized in Tables 3.10 through 3.18 for the different model settings. We also plot the relative efficiencies in Figures 3.1 through 3.9, based on Tables 3.10 through 3.18, to observe the trend of the relative efficiency gain under various allocations of the ODS design with the same total sample size.

Table 3.10 presents the relative efficiencies of the ODS design compared with a simple random sampling design with the same total sample size $n = 200$. All REs for $\rho < 1$ are greater than 1, indicating that the ODS designs are more efficient. When the cut-points are the 10th and 90th percentiles of the outcome distribution, the RE for $\hat\beta_1$ increases from 1.000 to 2.330 as $\rho$ decreases from 1.0 to 0.1. That is, adding more supplemental samples (and thus fewer SRS samples) gains more efficiency. In addition, under the same $\rho$, the ODS design with cut-points at the 10th and 90th percentiles of $Y$ obtains the largest RE. These findings can be clearly observed in Figure 3.1. Tables 3.11 and 3.12 present the results with the same coefficient setting as Table 3.10 but with different sample sizes. Similar findings can be observed, and the patterns of Figures 3.2 and 3.3 are the same as that of Figure 3.1. From Tables 3.10 through 3.12, the largest RE occurs at $\rho = 0.1$ with cut-points at the 10th and 90th percentiles of $Y$, with values close to 2.6 (2.330/2.659/2.804) for $\hat\beta_1$ and 2.4 (2.401/2.669/2.611) for $\hat\beta_2$. That is, $\hat\beta_Z$ is more efficient than $\hat\beta_S$, with gains as large as 160% for $\hat\beta_1$ and 140% for $\hat\beta_2$.

Tables 3.13 through 3.15 present another coefficient setting, $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$, with total sample sizes $n = 200$, $400$, and $800$, respectively.
We observe similar results: the REs are larger than 1, and the largest RE occurs at $\rho = 0.1$ with cut-points at the 10th and 90th percentiles of $Y$. In addition, the largest REs in the three tables are close to 2.5 (2.597/2.516/2.691) for $\hat\beta_1$ and 2.4 (2.492/2.580/2.648) for $\hat\beta_2$. That is, $\hat\beta_Z$ is more efficient than $\hat\beta_S$, with gains as large as 150% for $\hat\beta_1$ and 140% for $\hat\beta_2$.

Tables 3.16 through 3.18 present another coefficient setting, $(\beta_0, \beta_1, \beta_2) = (1, 0.5, -0.5)$, with total sample sizes $n = 200$, $400$, and $800$, respectively. We again observe that the REs are larger than 1 and that the ODS design with fewer SRS samples and cut-points at the 10th and 90th percentiles of $Y$ gains more efficiency. However, the largest REs in the three tables are only close to 1.3 (1.418/1.425/1.341) for $\hat\beta_1$ and 1.8 (1.812/1.866/1.967) for $\hat\beta_2$. That is, $\hat\beta_Z$ is more efficient than $\hat\beta_S$, with gains of about 30% for $\hat\beta_1$ and 80% for $\hat\beta_2$. We also observe that the true value of $\beta$ affects the relative efficiency gain; we discuss this issue in Chapter 5.

Conclusions

In this section, we verify that the estimator $\hat\beta_Z$ is unbiased and that its estimated standard error is smaller than that of the estimator for the simple random sampling design. Moreover, larger efficiency gains are observed in ODS designs with more supplemental samples and with supplemental samples drawn from the two extreme intervals of the distribution of $Y$. In practice, we suggest that the proportion of the SRS sample in the ODS design be less than 0.5 and that the cut-points be set at the 10th and 90th percentiles of $Y$. Because this section considers only equal supplemental samples from the two tails, in the next section we examine whether unequal supplemental sampling strategies can further improve efficiency.

3.2.3. Unequal supplemental sampling strategies

In this section, we investigate whether unequal supplemental sampling strategies gain more efficiency than the equal supplemental sampling in Section 3.2.2. Since the simulation results for $n = 200$ and $n = 800$ are similar to those for $n = 400$, we only present the simulation results for $n = 400$ here. The coefficient settings of the regression model are the same as in Section 3.2.2. We consider two unequal supplemental sampling strategies:

Unequal supplemental sampling I: Add more supplemental samples from the lower tail of $Y$ for a given total supplemental sample size: let $n_1 = \frac{3}{4}(n - n_0)$ and $n_3 = \frac{1}{4}(n - n_0)$.

Unequal supplemental sampling II: Add more supplemental samples from the upper tail of $Y$ for a given total supplemental sample size: let $n_1 = \frac{1}{4}(n - n_0)$ and $n_3 = \frac{3}{4}(n - n_0)$.

For each strategy, we fix the total sample size at $n = 400$. The simulation results are summarized in Tables 3.19 through 3.24: Tables 3.19 and 3.20 are under coefficient setting (1), $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$, for strategies I and II, respectively; Tables 3.21 and 3.22 are under coefficient setting (2), $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$, respectively; and Tables 3.23 and 3.24 are under coefficient setting (3), $(\beta_0, \beta_1, \beta_2) = (1, 0.5, -0.5)$, respectively. The results in Tables 3.19 through 3.24 are similar to those described in Section 3.2.2. Here we focus on the relative efficiency tables and figures, since our goal is to explore whether these two unequal supplemental sampling strategies are more efficient. Tables 3.25 through 3.30 present the relative efficiencies calculated from Tables 3.19 through 3.24, and the corresponding figures are shown in Figures 3.10 - 3.18. Below is the summary of our findings:

1. Comparing the curve labeled "cut(0.1, 0.9)" in the left subfigure of each of Figures 3.10, 3.11, and 3.12, we observe that the relative efficiency in Figures 3.10 and 3.11 increases from 1 to 2.2 as $\rho$ decreases from 1 to 0.1, whereas the relative efficiency in Figure 3.12 increases from 1 to 2.6. This means that the ODS design using equal supplemental sampling gains more efficiency.

2. Comparing Figures 3.13 through 3.15, the above conclusion also applies to the model with coefficient setting (2).

3. Comparing Figures 3.16 through 3.18, we obtain a conclusion similar to 1 for the model with coefficient setting (3).

3.2.4. Linear regression model with an interaction term

In this section, we investigate whether the conclusions obtained in Section 3.2.2 apply to a model with an interaction term.

Generating population data

We conduct the same simulation studies as in Section 3.2.2, and the data generation is similar to that in Section 3.2.1. First, we generate a population data set with 10,000 individuals from the model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon, \tag{3.2}$$

where $X_1 \sim N(0, 1)$, $X_2 \sim \mathrm{Bernoulli}(0.5)$, and $\epsilon \sim N(0, 1)$. We then set the coefficient parameters as follows:

1. $(\beta_0, \beta_1, \beta_2, \beta_3) = (1, 0.1, -0.5, 0.05)$
2. $(\beta_0, \beta_1, \beta_2, \beta_3) = (1, 0.1, -0.5, 0.1)$
3. $(\beta_0, \beta_1, \beta_2, \beta_3) = (1, 0.1, -0.5, 0.5)$

The algorithm is as follows:

Step 1: Generate 10,000 independent values of $X_1$ and $X_2$ from $N(0, 1)$ and $\mathrm{Bernoulli}(0.5)$, respectively.

Step 2: Generate 10,000 independent error terms $\epsilon$ from $N(0, 1)$.

Step 3: Plug the generated $X_1$, $X_2$, and $\epsilon$ into model (3.2) to obtain the corresponding $Y$.

In addition, the procedure for collecting the ODS samples is the same as in Section 3.2.1. The simulation results with $n = 400$ are presented in Table 3.31 for the coefficient setting $(\beta_0, \beta_1, \beta_2, \beta_3) = (1, 0.1, -0.5, 0.05)$, in Table 3.32 for $(1, 0.1, -0.5, 0.1)$, and in Table 3.33 for $(1, 0.1, -0.5, 0.5)$. Each table reports the performance of $\hat\beta_Z$ under ODS designs with different combinations of the cut-points $(a_1, a_2)$ and the proportion of SRS samples in the total sample ($\rho$), and the performance of $\hat\beta_S$ under a simple random sampling design with the same sample size ($\rho = 1$). We are interested in the coefficient estimators involving $X_1$, so we present the results for $\hat\beta_1$ and $\hat\beta_3$. Tables 3.34 through 3.36 present the relative efficiencies of $\hat\beta_S$ to $\hat\beta_Z$ calculated from Tables 3.31 through 3.33, and the corresponding figures are Figures 3.19 through 3.21.

Summary of results

Below is the summary of our findings:

1. In Table 3.31, all estimated means of $\hat\beta_3$ for the ODS design are close to the true parameter value, 0.05, and smaller standard errors are observed when $\rho$ is close to zero or when the cut-points are in the two extreme tails (10th, 90th). Similar findings hold for $\hat\beta_1$.

2. The same results as in 1 are observed in Tables 3.32 and 3.33.

3. In Figure 3.19, the relative efficiency of $\hat\beta_3$ on the curve labeled "cut(0.1, 0.9)" increases from 1 to 2.4 as $\rho$ decreases from 1 to 0.1, whereas the relative efficiency on the other two curves increases from 1 to 2.2 and 1.3, respectively. This means that, for a limited sample size, the best allocation of the ODS design among those considered places the cut-points at the 10th and 90th percentiles of the distribution of $Y$ with $\rho = 0.1$.

4. In Figures 3.20 and 3.21, results similar to those in Figure 3.19 are observed. However, it is worth noting that the largest RE of $\hat\beta_3$ in Figure 3.20 is 2.3, whereas in Figure 3.21 it is only 1.7. That is, using the optimal allocation of the ODS design instead of the SRS design gains more efficiency especially for models with smaller coefficient values ($\beta_3 = 0.05$ and $0.1$ for Figures 3.19 and 3.20, respectively; $\beta_3 = 0.5$ for Figure 3.21). This phenomenon also occurs in Section 3.2.2, and we discuss it in Chapter 5.

The findings in this section are similar to the conclusions in Section 3.2.2. This demonstrates that the conclusions obtained in Section 3.2.2 apply to more general situations. In the next section, we consider the case where the outcome distribution is asymmetric and conduct similar simulation studies to verify whether the conclusions remain valid.

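Table 3.1: Simulation results: Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$ and $n = 200$.

| Estimator | ρ | Mean, cut at (10th, 90th) | SE, cut at (10th, 90th) | Mean, cut at (20th, 80th) | SE, cut at (20th, 80th) | Mean, cut at (30th, 70th) | SE, cut at (30th, 70th) |
|---|---|---|---|---|---|---|---|
| $\hat\beta_1$ | 1.0 | 0.052 | 0.070 | 0.052 | 0.070 | 0.052 | 0.070 |
| $\hat\beta_1$ | 0.9 | 0.049 | 0.066 | 0.049 | 0.070 | 0.052 | 0.070 |
| $\hat\beta_1$ | 0.8 | 0.050 | 0.061 | 0.051 | 0.064 | 0.051 | 0.067 |
| $\hat\beta_1$ | 0.7 | 0.049 | 0.057 | 0.050 | 0.064 | 0.050 | 0.066 |
| $\hat\beta_1$ | 0.6 | 0.050 | 0.053 | 0.053 | 0.060 | 0.049 | 0.064 |
| $\hat\beta_1$ | 0.5 | 0.051 | 0.051 | 0.049 | 0.058 | 0.048 | 0.062 |
| $\hat\beta_1$ | 0.4 | 0.050 | 0.050 | 0.049 | 0.057 | 0.049 | 0.062 |
| $\hat\beta_1$ | 0.3 | 0.051 | 0.046 | 0.051 | 0.055 | 0.048 | 0.061 |
| $\hat\beta_1$ | 0.2 | 0.050 | 0.047 | 0.050 | 0.053 | 0.051 | 0.060 |
| $\hat\beta_1$ | 0.1 | 0.050 | 0.046 | 0.050 | 0.054 | 0.051 | 0.059 |
| $\hat\beta_2$ | 1.0 | -0.495 | 0.143 | -0.495 | 0.143 | -0.495 | 0.143 |
| $\hat\beta_2$ | 0.9 | -0.502 | 0.128 | -0.503 | 0.135 | -0.499 | 0.137 |
| $\hat\beta_2$ | 0.8 | -0.505 | 0.122 | -0.501 | 0.130 | -0.502 | 0.139 |
| $\hat\beta_2$ | 0.7 | -0.505 | 0.116 | -0.501 | 0.127 | -0.496 | 0.133 |
| $\hat\beta_2$ | 0.6 | -0.505 | 0.110 | -0.495 | 0.120 | -0.497 | 0.126 |
| $\hat\beta_2$ | 0.5 | -0.503 | 0.104 | -0.503 | 0.117 | -0.495 | 0.127 |
| $\hat\beta_2$ | 0.4 | -0.503 | 0.101 | -0.506 | 0.111 | -0.501 | 0.121 |
| $\hat\beta_2$ | 0.3 | -0.504 | 0.099 | -0.501 | 0.111 | -0.501 | 0.118 |
| $\hat\beta_2$ | 0.2 | -0.505 | 0.094 | -0.502 | 0.108 | -0.502 | 0.119 |
| $\hat\beta_2$ | 0.1 | -0.504 | 0.092 | -0.501 | 0.103 | -0.506 | 0.115 |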
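Each ODS row of a table such as Table 3.1 rests on drawing an SRS component of size $n_0 = \rho n$ plus supplemental samples from the two tails of $Y$, as described in Section 3.2.1. The following Python sketch shows one such draw. It is an illustration, not the thesis's R code: the function name `draw_ods_sample`, the `lower_share` parameter (0.5 for the equal allocation, 0.75 or 0.25 for the unequal strategies I and II), the percentile convention, and the choice to draw the supplemental samples independently of the SRS component (so a unit may appear in both) are all assumptions.

```python
import random

def draw_ods_sample(population, n, rho, lower_pct, upper_pct,
                    lower_share=0.5, seed=0):
    """Draw one ODS sample of total size n from rows of the form (y, x1, x2)."""
    rng = random.Random(seed)
    ys = sorted(row[0] for row in population)
    # Cut-points a1, a2 as empirical percentiles of Y (one simple convention).
    a1 = ys[round(lower_pct * len(ys)) - 1]
    a2 = ys[round(upper_pct * len(ys)) - 1]
    n0 = round(rho * n)                 # SRS component, rho = n0 / n
    n1 = round(lower_share * (n - n0))  # supplemental draws from {Y < a1}
    n3 = n - n0 - n1                    # supplemental draws from {Y > a2}
    srs = rng.sample(population, n0)
    lower = rng.sample([r for r in population if r[0] < a1], n1)
    upper = rng.sample([r for r in population if r[0] > a2], n3)
    return srs, lower, upper, (a1, a2)

# Demonstration population from model (3.1) with (1, 0.05, -0.5).
rng = random.Random(2012)
population = []
for _ in range(10_000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = 1 if rng.random() < 0.5 else 0
    y = 1.0 + 0.05 * x1 - 0.5 * x2 + rng.gauss(0.0, 1.0)
    population.append((y, x1, x2))

srs, lower, upper, (a1, a2) = draw_ods_sample(
    population, n=200, rho=0.1, lower_pct=0.10, upper_pct=0.90)
print(len(srs), len(lower), len(upper))  # prints: 20 90 90
```

Calling the function with `lower_share=0.75` reproduces the sample sizes of unequal supplemental sampling I, and `lower_share=0.25` those of strategy II.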




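Table 3.5: Simulation results: Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$ and $n = 400$.

| Estimator | ρ | Mean, cut at (10th, 90th) | SE, cut at (10th, 90th) | Mean, cut at (20th, 80th) | SE, cut at (20th, 80th) | Mean, cut at (30th, 70th) | SE, cut at (30th, 70th) |
|---|---|---|---|---|---|---|---|
| $\hat\beta_1$ | 1.0 | 0.100 | 0.051 | 0.100 | 0.051 | 0.100 | 0.051 |
| $\hat\beta_1$ | 0.9 | 0.099 | 0.046 | 0.099 | 0.047 | 0.100 | 0.048 |
| $\hat\beta_1$ | 0.8 | 0.102 | 0.044 | 0.102 | 0.045 | 0.101 | 0.048 |
| $\hat\beta_1$ | 0.7 | 0.101 | 0.040 | 0.100 | 0.044 | 0.101 | 0.046 |
| $\hat\beta_1$ | 0.6 | 0.100 | 0.039 | 0.099 | 0.042 | 0.100 | 0.046 |
| $\hat\beta_1$ | 0.5 | 0.100 | 0.037 | 0.099 | 0.042 | 0.101 | 0.046 |
| $\hat\beta_1$ | 0.4 | 0.101 | 0.034 | 0.101 | 0.040 | 0.100 | 0.043 |
| $\hat\beta_1$ | 0.3 | 0.100 | 0.033 | 0.101 | 0.039 | 0.101 | 0.043 |
| $\hat\beta_1$ | 0.2 | 0.100 | 0.032 | 0.099 | 0.036 | 0.100 | 0.042 |
| $\hat\beta_1$ | 0.1 | 0.100 | 0.032 | 0.099 | 0.036 | 0.101 | 0.042 |
| $\hat\beta_2$ | 1.0 | -0.503 | 0.100 | -0.503 | 0.100 | -0.503 | 0.100 |
| $\hat\beta_2$ | 0.9 | -0.500 | 0.093 | -0.496 | 0.097 | -0.502 | 0.098 |
| $\hat\beta_2$ | 0.8 | -0.507 | 0.085 | -0.499 | 0.092 | -0.499 | 0.095 |
| $\hat\beta_2$ | 0.7 | -0.498 | 0.084 | -0.501 | 0.086 | -0.499 | 0.092 |
| $\hat\beta_2$ | 0.6 | -0.503 | 0.078 | -0.501 | 0.085 | -0.498 | 0.090 |
| $\hat\beta_2$ | 0.5 | -0.501 | 0.074 | -0.503 | 0.083 | -0.498 | 0.088 |
| $\hat\beta_2$ | 0.4 | -0.503 | 0.069 | -0.501 | 0.080 | -0.499 | 0.088 |
| $\hat\beta_2$ | 0.3 | -0.500 | 0.067 | -0.502 | 0.079 | -0.498 | 0.087 |
| $\hat\beta_2$ | 0.2 | -0.502 | 0.066 | -0.499 | 0.077 | -0.502 | 0.086 |
| $\hat\beta_2$ | 0.1 | -0.502 | 0.063 | -0.502 | 0.073 | -0.501 | 0.083 |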
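The SRS baseline row ($\rho = 1.0$) of Table 3.5 can be imitated with a short Monte Carlo sketch. Under the normal model, the maximum likelihood estimator of $\beta$ coincides with least squares, so the sketch fits ordinary least squares to repeated simple random samples of size 400 and reports the mean and standard deviation of the estimates, mirroring the "Mean" and "SE" columns. The helper name `ols3`, the seed, and the use of 500 rather than 2000 replicates are assumptions; the MSELE for the ODS rows is not implemented here.

```python
import random
import statistics

def ols3(rows):
    """Least-squares fit of y on (1, x1, x2) for rows of the form (y, x1, x2)."""
    # Accumulate the normal equations (X'X) b = X'y.
    A = [[0.0] * 3 for _ in range(3)]
    c = [0.0] * 3
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)
        for i in range(3):
            c[i] += x[i] * y
            for j in range(3):
                A[i][j] += x[i] * x[j]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    M = [A[i] + [c[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    b = [0.0] * 3
    for i in (2, 1, 0):
        b[i] = (M[i][3] - sum(M[i][k] * b[k] for k in range(i + 1, 3))) / M[i][i]
    return b

rng = random.Random(1)
beta = (1.0, 0.1, -0.5)  # coefficient setting 2
population = []
for _ in range(10_000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = 1 if rng.random() < 0.5 else 0
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + rng.gauss(0.0, 1.0)
    population.append((y, x1, x2))

# 500 replicates of an SRS of size n = 400 (the thesis uses 2000 replicates).
estimates = [ols3(rng.sample(population, 400)) for _ in range(500)]
mean_b1 = statistics.mean(e[1] for e in estimates)
se_b1 = statistics.stdev(e[1] for e in estimates)
mean_b2 = statistics.mean(e[2] for e in estimates)
se_b2 = statistics.stdev(e[2] for e in estimates)
print(round(mean_b1, 3), round(se_b1, 3), round(mean_b2, 3), round(se_b2, 3))
```

As a rough check, large-sample theory for this design gives standard errors near $1/\sqrt{400 \cdot 1} = 0.05$ for $\hat\beta_1$ and $1/\sqrt{400 \cdot 0.25} = 0.10$ for $\hat\beta_2$, which agrees with the SRS values 0.051 and 0.100 in Table 3.5.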





Table 3.10: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$ and $n = 200$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.133 | 1.010 | 1.006 | 1.231 | 1.122 | 1.082 |
| 0.8 | 1.339 | 1.199 | 1.115 | 1.366 | 1.199 | 1.056 |
| 0.7 | 1.497 | 1.224 | 1.145 | 1.501 | 1.267 | 1.150 |
| 0.6 | 1.741 | 1.353 | 1.216 | 1.667 | 1.416 | 1.270 |
| 0.5 | 1.868 | 1.455 | 1.272 | 1.893 | 1.487 | 1.263 |
| 0.4 | 1.972 | 1.535 | 1.282 | 1.986 | 1.652 | 1.382 |
| 0.3 | 2.298 | 1.628 | 1.339 | 2.069 | 1.648 | 1.449 |
| 0.2 | 2.241 | 1.735 | 1.384 | 2.306 | 1.735 | 1.440 |
| 0.1 | 2.330 | 1.715 | 1.400 | 2.401 | 1.911 | 1.541 |

Table 3.11: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$ and $n = 400$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.242 | 1.122 | 1.113 | 1.191 | 1.136 | 1.086 |
| 0.8 | 1.465 | 1.287 | 1.178 | 1.407 | 1.234 | 1.067 |
| 0.7 | 1.584 | 1.313 | 1.193 | 1.504 | 1.359 | 1.168 |
| 0.6 | 1.816 | 1.446 | 1.235 | 1.702 | 1.468 | 1.301 |
| 0.5 | 1.940 | 1.512 | 1.344 | 1.850 | 1.597 | 1.312 |
| 0.4 | 2.169 | 1.662 | 1.353 | 2.082 | 1.565 | 1.337 |
| 0.3 | 2.215 | 1.719 | 1.455 | 2.196 | 1.682 | 1.367 |
| 0.2 | 2.553 | 1.742 | 1.473 | 2.337 | 1.795 | 1.517 |
| 0.1 | 2.659 | 1.862 | 1.512 | 2.669 | 1.911 | 1.578 |

Table 3.12: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.05, -0.5)$ and $n = 800$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.282 | 1.137 | 1.043 | 1.146 | 1.052 | 1.037 |
| 0.8 | 1.470 | 1.327 | 1.121 | 1.342 | 1.187 | 1.090 |
| 0.7 | 1.670 | 1.427 | 1.220 | 1.529 | 1.236 | 1.186 |
| 0.6 | 1.774 | 1.480 | 1.224 | 1.674 | 1.449 | 1.229 |
| 0.5 | 2.009 | 1.632 | 1.376 | 1.842 | 1.545 | 1.269 |
| 0.4 | 2.215 | 1.798 | 1.501 | 1.901 | 1.662 | 1.354 |
| 0.3 | 2.538 | 1.814 | 1.425 | 2.232 | 1.621 | 1.340 |
| 0.2 | 2.623 | 1.864 | 1.549 | 2.362 | 1.762 | 1.494 |
| 0.1 | 2.804 | 2.022 | 1.541 | 2.511 | 1.870 | 1.505 |
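The relative efficiencies in Tables 3.10 through 3.12 are variance ratios, so they can be approximated from the SE columns of the corresponding simulation tables as $(\mathrm{SE}_S / \mathrm{SE}_Z)^2$. The short Python sketch below does this for two $\hat\beta_1$ entries of Table 3.1 with cut-points at the (10th, 90th) percentiles; the function names are assumptions. Because the published SEs are rounded to three decimals, the recomputed values differ slightly from the tabulated ones (for example, about 2.32 versus 2.330 at $\rho = 0.1$).

```python
def relative_efficiency(se_srs, se_ods):
    """RE = Var(beta_S-hat) / Var(beta_Z-hat), computed from standard errors."""
    return (se_srs / se_ods) ** 2

def percent_gain(re):
    """Efficiency gain of the ODS design over the SRS design, in percent."""
    return (re - 1.0) * 100.0

# SEs of beta1-hat read from Table 3.1, cut at (10th, 90th), n = 200.
se_srs = 0.070                          # rho = 1.0 (SRS design)
se_ods_by_rho = {0.5: 0.051, 0.1: 0.046}

for rho, se_ods in se_ods_by_rho.items():
    re = relative_efficiency(se_srs, se_ods)
    print(rho, round(re, 3), round(percent_gain(re), 1))
```

The same conversion explains the percentage statements in the text: an RE close to 2.6 corresponds to a gain of about 160%.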

Table 3.13: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$ and $n = 200$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.161 | 1.100 | 1.008 | 1.245 | 1.162 | 1.106 |
| 0.8 | 1.431 | 1.225 | 1.126 | 1.441 | 1.218 | 1.158 |
| 0.7 | 1.589 | 1.287 | 1.124 | 1.607 | 1.330 | 1.175 |
| 0.6 | 1.663 | 1.361 | 1.182 | 1.718 | 1.458 | 1.310 |
| 0.5 | 1.892 | 1.428 | 1.258 | 1.907 | 1.549 | 1.307 |
| 0.4 | 1.914 | 1.628 | 1.282 | 2.076 | 1.664 | 1.420 |
| 0.3 | 2.082 | 1.630 | 1.415 | 2.265 | 1.703 | 1.434 |
| 0.2 | 2.476 | 1.693 | 1.517 | 2.331 | 1.837 | 1.423 |
| 0.1 | 2.597 | 1.875 | 1.517 | 2.492 | 1.891 | 1.547 |

Table 3.14: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$ and $n = 400$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.218 | 1.165 | 1.117 | 1.176 | 1.066 | 1.047 |
| 0.8 | 1.352 | 1.308 | 1.145 | 1.399 | 1.204 | 1.109 |
| 0.7 | 1.662 | 1.345 | 1.206 | 1.444 | 1.375 | 1.186 |
| 0.6 | 1.691 | 1.466 | 1.212 | 1.667 | 1.386 | 1.247 |
| 0.5 | 1.860 | 1.507 | 1.228 | 1.821 | 1.473 | 1.316 |
| 0.4 | 2.200 | 1.646 | 1.414 | 2.116 | 1.588 | 1.311 |
| 0.3 | 2.321 | 1.665 | 1.412 | 2.251 | 1.617 | 1.329 |
| 0.2 | 2.497 | 1.967 | 1.462 | 2.330 | 1.682 | 1.362 |
| 0.1 | 2.516 | 2.005 | 1.489 | 2.580 | 1.906 | 1.450 |

Table 3.15: Simulation results of relative efficiencies ($\mathrm{Var}\,\hat\beta_S / \mathrm{Var}\,\hat\beta_Z$): Normal model with $(\beta_0, \beta_1, \beta_2) = (1, 0.1, -0.5)$ and $n = 800$.

| ρ | $\hat\beta_1$ (10th, 90th) | $\hat\beta_1$ (20th, 80th) | $\hat\beta_1$ (30th, 70th) | $\hat\beta_2$ (10th, 90th) | $\hat\beta_2$ (20th, 80th) | $\hat\beta_2$ (30th, 70th) |
|---|---|---|---|---|---|---|
| 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.9 | 1.303 | 1.105 | 1.129 | 1.291 | 1.178 | 1.118 |
| 0.8 | 1.480 | 1.303 | 1.141 | 1.460 | 1.247 | 1.188 |
| 0.7 | 1.636 | 1.297 | 1.144 | 1.651 | 1.450 | 1.286 |
| 0.6 | 1.786 | 1.432 | 1.269 | 1.806 | 1.456 | 1.320 |
| 0.5 | 1.828 | 1.570 | 1.343 | 2.032 | 1.564 | 1.330 |
| 0.4 | 2.088 | 1.678 | 1.342 | 2.248 | 1.785 | 1.369 |
| 0.3 | 2.343 | 1.671 | 1.411 | 2.416 | 1.796 | 1.463 |
| 0.2 | 2.561 | 1.753 | 1.485 | 2.386 | 1.818 | 1.585 |
| 0.1 | 2.691 | 1.877 | 1.487 | 2.648 | 1.990 | 1.625 |