© 2012 MathWorks, Inc.

(1)

MATLAB for Statistic Application

Pony Lai

[email protected]


(2)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(3)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(4)

Outline

• Course example: biomedical data

• Importing data

• Data types

• Dataset arrays

• Merging data

• Categorical data

• Working with missing data

[Figure: sample of the biomedical data set, with variables ArmLength, ArmCirc, Waist, Triceps, Subscapular, ReportHeight, ReportWeight, Overweight, LikeToWeigh]

(5)

Chapter 2 Learning Outcomes

The attendee will be able to:

• Store data in an appropriate data type.

• Access and manipulate data stored in a statistical data type.

• Ignore, replace, or delete observations with missing data values.

(6)

Course Example: Biomedical Data

NHANES (National Health and Nutrition Examination Survey)

[Figure: normal probability plot of the data, grouped by sex (M, F); axes: Data vs. Probability]

(7)

Importing Data

File type and import functions:

.csv: csvread, dlmread

.txt: dlmread, textscan

.xpt: xptread

.xls: xlsread
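A minimal sketch of two of the numeric import functions named on this slide. The matrix and the temporary file name are made up for illustration and are not part of the course data.

```matlab
% Hypothetical example data (not from the course files).
M = [37.6 45.2 156.3; 38.2 34.1 109.5];

% Write the matrix to a temporary comma-separated file.
fname = [tempname '.csv'];
csvwrite(fname, M);

% csvread reads comma-separated numeric data.
A = csvread(fname);

% dlmread reads numeric data with a delimiter of your choice.
B = dlmread(fname, ',');

delete(fname);   % clean up the temporary file
```

Both calls return the same numeric matrix here; textscan and xlsread are the usual choices once the file mixes text and numbers.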

(8)

Data Types

Variable:

• logical

• char

• numeric: integer, single, double

• cell

• structure

• function handle

• user class (Statistics Toolbox: ordinal, nominal, dataset)

(9)

Numeric and Cell Arrays

Conversion: num2cell and mat2cell (numeric to cell), cell2mat (cell to numeric)

Indexing: x(3) vs. x{3}

Indexing a range: x{3:6} vs. [x{3:6}]
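The indexing questions on this slide can be tried out with a short sketch. The cell array x below is a made-up example.

```matlab
% Hypothetical cell array with mixed contents.
x = {10, 'abc', 30, 40, 50, 60};

a = x(3);        % parentheses: a 1-by-1 cell that contains 30
b = x{3};        % braces: the number 30 itself
c = [x{3:6}];    % x{3:6} yields four separate values; brackets concatenate them

n = num2cell([1 2 3]);   % numeric array to 1-by-3 cell array
m = cell2mat(n);         % and back to the numeric array [1 2 3]
```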

(10)

Dataset Arrays

Columns are variables, rows are observations, and the variables can have mixed data types. Rows with fewer entries than variables contain missing values.

ArmLength ArmCirc Waist Triceps Subscapular ReportHeight ReportWeight Overweight LikeToWeigh
37.6 45.2 156.3 62 308 Over Less
38.2 34.1 109.5 15.6 25.8 68 190 Over Same
34.1 33.2 95.4 13.3 25.9 67 142 Right Same
43 31 79.5 8.4 9.2 73 175 Right Same
40 32.8 117 18 20.1 68 228 Over Less
38.9 40.5 145.6 17.6 71 289 Over Less
32.6 30.7 89.7 21.1 30.5 63 138 Right Same
36 35.3 97 24.6 24.8 64 187 Over Less
37 29.9 82.9 13.8 20 68 150 Right Same
34 37.8 109.2 34.5 27.4 63 200 Over Less
37 31.2 98.4 34.8 25.5 63 140 Over Less
39 34 106.2 24.6 24 67 180 Right Same
36.6 21.9 78 8 4.6 66 120 Right Same
36.4 32 95.6 12 21.6 66 165 Right Less
41.7 32.6 106.9 12.4 17.4 73 202 Over Less
35.2 33.3 100.9 24.2 11 9999 9999 Under More
39 31.1 95.5 11.6 64 180 Right Same
63 141 Right Less
39.2 37.2 132 18 69 250 Over Less
38.5 36.6 95.4 11.2 15.1 67 185 Over Less
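As a rough sketch, a dataset array with mixed data types can be built from workspace variables. The three observations below reuse values from the table on this slide; the choice of only three variables is an illustration.

```matlab
% Three observations, two numeric variables and one text variable.
ArmLength  = [37.6; 38.2; 34.1];
Waist      = [156.3; 109.5; 95.4];
Overweight = {'Over'; 'Over'; 'Right'};

% dataset takes the variable names from the workspace variable names.
ds = dataset(ArmLength, Waist, Overweight);

[nObs, nVars] = size(ds);          % observations by variables
names = ds.Properties.VarNames;    % cell array of variable names
```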

(11)

Importing Data into Dataset Arrays

.csv, .txt, .xpt, and .xls files can all be imported with dataset.
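A minimal sketch of importing a delimited text file straight into a dataset array. The file contents are invented for the example; the same constructor also accepts 'XLSFile' and 'XPTFile' name-value pairs for the other file types.

```matlab
% Create a small, hypothetical CSV file with a header line.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Height,Weight,Overweight\n');
fprintf(fid, '68,190,Over\n');
fprintf(fid, '67,142,Right\n');
fclose(fid);

% Read it into a dataset array; the header line supplies the variable names.
ds = dataset('File', fname, 'Delimiter', ',');

delete(fname);   % clean up the temporary file
```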

(12)

Merging Data

data = join(data1,data2,'key','SEQN')

Input data sets, matched on SEQN (rows with fewer entries than variables contain missing values):

SEQN BMDSTATS BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC
41475 3 138.9 154.7 58.04 34.2 37.6 45.2
41476 1 22 120.4 15.18 25.1 17.1
41477 1 83.9 167.1 30.05 32.4 38.2 34.1
41478 1 11.5 14.2 17.3
41479 1 65.7 154.4 27.56 33.3 34.1 33.2
41480 1 27 122.7 17.93 25.8 20.9
41481 1 77.9 182.7 23.34 43.6 43 31
41482 1 101.6 173.8 33.64 43.5 40 32.8
41483 3 133.1 173.8 44.06 36.5 38.9 40.5
41484 1 9.3 14.5 15.2
41485 1 64.8 157.9 25.99 34.3 32.6 30.7
41486 1 86.2 166.2 31.21 35 36 35.3
41487 1 67.1 169.2 23.44 40 37 29.9
41488 1 16.1 103.1 15.15 22 17
41489 1 91.8 158.4 36.59 37 34 37.8
41490 1 70.7 159.8 27.69 40.5 37 31.2
41491 1 33.5 146.1 15.69 33.7 29.5 20.8
41492 1 82.6 168.7 29.02 34 39 34
41493 1 53.3 163.4 19.96 37.1 36.6 21.9
41494 1 70.5 166.4 25.46 39 36.4 32
41495 1 87.2 184.8 25.53 41 41.7 32.6
41496 1 67.8 154.6 28.37 33.5 35.2 33.3
41497 3 12.5 90.6 15.23 17.6 13.8
41498 3 77.4 165.8 28.16 39.4 39 31.1
41499 4
41500 1 91.6 156.5 37.4 38.7 33 35.4
41501 3 114.8 177.9 36.27 41 39.2 37.2
41502 1 82.4 168.2 29.13 40.4 38.5 36.6
41503 1 73.1 173.9 24.17 40.1 36.9 30.6
41504 1 74.3 178.5 23.32 41.7 40 29
41505 1 10.2 16 14.5
41506 3 176.9
41507 1 18.7 116.3 13.83 23.4 16
41508 1 9.8 16.5 14.2
41509 1 101.3 171.9 34.28 36.5 38.9 42
41510 1 66.8 171.8 22.63 37.6 35.2 28.2
41511 1 78.8 173.6 26.15 38.5 36.9 30.6

SEQN WHD010 WHD020 WHQ030 WHQ040 WHD045 WHD050
41475 62 308 1 2 150 308
41477 68 190 1 3 190
41479 67 142 3 3 148
41481 73 175 3 3 165
41482 68 228 1 2 185 205
41483 71 289 1 2 180 289
41485 63 138 3 3 138
41486 64 187 1 2 160 197
41487 68 150 3 3 141
41489 63 200 1 2 175 208
41490 63 140 1 2 130 135
41492 67 180 3 3 185
41493 66 120 3 3 150
41494 66 165 3 2 150 165
41495 73 202 1 2 180 202
41496 9999 9999 2 1 99999 9999
41498 64 180 3 3 180
41499 63 141 3 2 125 149
41501 69 250 1 2 220 250
41502 67 185 1 2 165 185
41503 66 165 3 3 165
41504 71 166 3 1 170 165
41506 70 160 3 2 155 160
41509 69 200 1 2 170 220
41510 68 142 1 2 135 147
41511 69 178 3 3 185
41512 63 95 2 1 110 90
41513 59 96 3 3 96
41514 68 198 1 3 198
41516 64 215 1 2 135 230
41518 68 210 1 2 185 210
41519 74 210 3 2 200 210
41522 70 189 3 3 189
41523 73 235 1 3 211
41524 61 118 3 3 118
41525 67 165 3 3 175
41526 69 178 1 2 160 178
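A small sketch of the merge, using three rows from the first table and two from the second. The 'Type' and 'MergeKeys' options are an addition for this sketch (they keep only the SEQN values present in both inputs and a single key column); the slide itself shows only the key argument.

```matlab
% Two small dataset arrays that share the key variable SEQN.
data1 = dataset({[41475; 41476; 41477], 'SEQN'}, ...
                {[138.9; 22; 83.9],     'BMXWT'});
data2 = dataset({[41475; 41477], 'SEQN'}, ...
                {[308; 190],     'WHD020'});

% Inner join: keep only observations whose SEQN appears in both inputs.
data = join(data1, data2, 'Keys', 'SEQN', ...
            'Type', 'inner', 'MergeKeys', true);
```

SEQN 41476 has no match in data2, so it is absent from the result.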

(13)

Indexing into Dataset Arrays

>> x = data(5:20,[9,12,18])

>> x = data(5:20,{'BMXHT','BMXLEG','BMXWAIST'})

SEQN BMDSTATS BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT BMXBMI BMXLEG BMILEG BMXARML BMIARML BMXARMC BMIARMC BMXWAIST

41475 3 138.9 154.7 58.04 34.2 37.6 45.2 156.3

41477 1 83.9 167.1 30.05 32.4 38.2 34.1 109.5

41479 1 65.7 154.4 27.56 33.3 34.1 33.2 95.4

41481 1 77.9 182.7 23.34 43.6 43 31 79.5

41482 1 101.6 173.8 33.64 43.5 40 32.8 117

41483 3 133.1 173.8 44.06 36.5 38.9 40.5 145.6

41485 1 64.8 157.9 25.99 34.3 32.6 30.7 89.7

41486 1 86.2 166.2 31.21 35 36 35.3 97

41487 1 67.1 169.2 23.44 40 37 29.9 82.9

41489 1 91.8 158.4 36.59 37 34 37.8 109.2

41490 1 70.7 159.8 27.69 40.5 37 31.2 98.4

41492 1 82.6 168.7 29.02 34 39 34 106.2

41493 1 53.3 163.4 19.96 37.1 36.6 21.9 78

41494 1 70.5 166.4 25.46 39 36.4 32 95.6

41495 1 87.2 184.8 25.53 41 41.7 32.6 106.9

41496 1 67.8 154.6 28.37 33.5 35.2 33.3 100.9

41498 3 77.4 165.8 28.16 39.4 39 31.1 95.5

41499 4

41501 3 114.8 177.9 3 36.27 41 39.2 37.2 132

41502 1 82.4 168.2 29.13 40.4 38.5 36.6 95.4

41503 1 73.1 173.9 24.17 40.1 36.9 30.6 85.9

41504 1 74.3 178.5 3 23.32 41.7 40 29 95

41506 3 176.9

41509 1 101.3 171.9 34.28 36.5 38.9 42 114.5

41510 1 66.8 171.8 22.63 37.6 35.2 28.2 81.9

41511 1 78.8 173.6 26.15 38.5 36.9 30.6 93.5

41512 2 44.2 154.7 18.47 1 1 1

41513 1 42.6 137.9 22.4 29.2 30 24.4 78.2

41514 1 87.4 171.7 29.65 35.7 41 32.1 111.9

41516 3 95.3 3 160.6 3 36.95 1 40.4 35.5 126.4

1 2 3 4 5 6 7 8 9 10 ...

41518 1 93 178.7 29.12 38.4 37.2 32.6 101.9

41519 1 96.5 184.5 28.35 44.2 42.2 34.2 113.7

41522 1 80.4 174.9 26.28 44.2 41 30.9 107
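The two indexing styles shown on this slide, plus dot indexing into a single variable, can be sketched on a small dataset array. The six rows reuse SEQN, BMXHT, and BMXLEG values from the course tables; the reduced variable set is an illustration.

```matlab
% Small dataset array with three variables.
data = dataset({[41475; 41477; 41479; 41481; 41482; 41483], 'SEQN'}, ...
               {[154.7; 167.1; 154.4; 182.7; 173.8; 173.8], 'BMXHT'}, ...
               {[34.2; 32.4; 33.3; 43.6; 43.5; 36.5],       'BMXLEG'});

% Parentheses return a dataset array: index columns by number or by name.
x1 = data(2:4, [2 3]);
x2 = data(2:4, {'BMXHT', 'BMXLEG'});

% Dot indexing returns the variable itself (here a numeric vector).
h = data.BMXHT(2:4);
```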

(14)

Indexing into Dataset Array Variables

>> x = data.BPXPLS(5:20)

SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPQ150A BPQ150B BPQ150C BPQ150D BPAARM BPACSZ BPXPLS BPXPULS BPXPTY BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2

41475 1 612 2 2 2 2 1 5 66 1 1 150 128 64 2 122 60

41476 1 9 72 1

41477 1 625 2 2 2 2 1 4 78 1 1 180 144 60 2 148 60

41478 1 278 120 1

41479 1 620 1 2 2 2 1 4 62 1 1 140 112 70 2 108 66

41480 1 79 86 1

41481 1 533 2 2 2 2 1 4 52 1 1 140 112 62 2 104 64

41482 1 561 2 2 2 2 1 4 76 1 1 150 116 78 2 114 80

41483 1 748 2 2 2 2 1 5 60 1 1 140 110 62 2 112 56

41484 1 285 134 1

41485 1 547 2 2 2 2 1 4 68 1 1 140 108 44 2 110 42

41486 1 556 2 2 2 2 1 4 62 1 1 150 126 64 2 120 70

41487 1 539 2 2 2 2 1 4 74 1 1 150 120 84 2 116 80

41488 1 59 110 1

41489 1 570 2 2 2 2 1 4 58 1 1 130 106 66 2 114 72

41490 1 564 1 2 2 2 1 4 72 1 1 150 128 64 2 134 62

41491 1 555 2 2 2 2 1 2 84 1 1 130 120 76 2 122 78

41492 1 704 2 2 2 2 1 4 62 1 1 180 156 94 2 152 86

41493 1 567 2 2 2 2 1 3 68 1 1 150 106 42 2 118 54

41494 1 670 2 2 2 2 1 4 82 1 1 140 120 78 2 108 64

41495 1 676 2 2 2 2 1 4 68 1 1 150 102 64 2 98 62

41496 1 676 2 2 1 2 1 4 78 1 1 210 192 86 2 192 84

41497 1 20 96 1

41498 1 698 1 2 2 2 1 4 60 1 1 240 226 96 2 204 88

41499 1 467 2 2 2 2 1 3 80 1 1 130 116 52 2 114 68

41500 1 531 2 2 2 2 1 4 104 1 1 140 114 58 2 116 56

41501 1 574 2 2 2 2 1 4 60 1 1 170 150 70 2 144 66

41502 1 516 2 2 2 2 1 4 70 1 1 150 124 54 2 116 68

41503 1 625 1 2 2 2 1 4 66 1 1 130 100 52 2 108 46

41504 1 743 2 2 2 2 1 4 72 1 1 180 148 88 2 146 96

41505 1 296 130 1

41506 1 510 2 2 2 2 2 3 60 1 1 160 128 84 2 124 84

(15)

Dataset Array Properties

>> data.Properties

>> data.Properties.VarNames = NewNames;

Before renaming:

SEQN RIAGENDR RIDAGEEX RIDRETH1 BPXPLS BPXSY1 BPXDI1 BMXHT BMXWT BMXLEG BMXARML BMXARMC BMXWAIST
41475 2 752 5 66 128 64 154.7 138.9 34.2 37.6 45.2 156.3
41477 1 860 3 78 144 60 167.1 83.9 32.4 38.2 34.1 109.5
41479 1 630 1 62 112 70 154.4 65.7 33.3 34.1 33.2 95.4
41481 1 254 4 52 112 62 182.7 77.9 43.6 43 31 79.5
41482 1 779 1 76 116 78 173.8 101.6 43.5 40 32.8 117
41483 1 804 4 60 110 62 173.8 133.1 36.5 38.9 40.5 145.6
41485 2 361 2 68 108 44 157.9 64.8 34.3 32.6 30.7 89.7
41486 2 735 1 62 126 64 166.2 86.2 35 36 35.3 97
41487 1 332 5 74 120 84 169.2 67.1 40 37 29.9 82.9
41489 2 483 1 58 106 66 158.4 91.8 37 34 37.8 109.2

After renaming:

IDNum Sex Age Ethnicity Pulse BPSyst1 BPDias1 Height Weight LegLength ArmLength ArmCirc Waist
41475 2 752 5 66 128 64 154.7 138.9 34.2 37.6 45.2 156.3
41477 1 860 3 78 144 60 167.1 83.9 32.4 38.2 34.1 109.5
41479 1 630 1 62 112 70 154.4 65.7 33.3 34.1 33.2 95.4
41481 1 254 4 52 112 62 182.7 77.9 43.6 43 31 79.5
41482 1 779 1 76 116 78 173.8 101.6 43.5 40 32.8 117
41483 1 804 4 60 110 62 173.8 133.1 36.5 38.9 40.5 145.6
41485 2 361 2 68 108 44 157.9 64.8 34.3 32.6 30.7 89.7
41486 2 735 1 62 126 64 166.2 86.2 35 36 35.3 97
41487 1 332 5 74 120 84 169.2 67.1 40 37 29.9 82.9
41489 2 483 1 58 106 66 158.4 91.8 37 34 37.8 109.2
41490 2 803 4 72 128 64 159.8 70.7 40.5 37 31.2 98.4
41492 1 866 3 62 156 94 168.7 82.6 34 39 34 106.2
41493 2 935 3 68 106 42 163.4 53.3 37.1 36.6 21.9 78
41494 1 481 1 82 120 78 166.4 70.5 39 36.4 32 95.6
41495 1 740 3 68 102 64 184.8 87.2 41 41.7 32.6 106.9
41496 2 777 1 78 192 86 154.6 67.8 33.5 35.2 33.3 100.9
41498 1 826 4 60 226 96 165.8 77.4 39.4 39 31.1 95.5
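A brief sketch of renaming variables through the Properties field, using three of the variables and two of the rows shown on this slide.

```matlab
% Dataset array with the original NHANES variable names.
data = dataset({[41475; 41477], 'SEQN'}, ...
               {[2; 1],         'RIAGENDR'}, ...
               {[66; 78],       'BPXPLS'});

% Replace all variable names at once with more readable ones.
NewNames = {'IDNum', 'Sex', 'Pulse'};
data.Properties.VarNames = NewNames;

pulse = data.Pulse;   % the data are unchanged; only the names differ
```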

(16)

Categorical Data

>> data.Sex = ...
nominal(data.Sex,{'M','F'});

Before conversion:

IDNum Sex Age Ethnicity Pulse BPSyst1
41475 2 752 5 66 128
41477 1 860 3 78 144
41479 1 630 1 62 112
41481 1 254 4 52 112
41482 1 779 1 76 116
41483 1 804 4 60 110
41485 2 361 2 68 108
41486 2 735 1 62 126
41487 1 332 5 74 120
41489 2 483 1 58 106
41490 2 803 4 72 128
41492 1 866 3 62 156
41493 2 935 3 68 106
41494 1 481 1 82 120
41495 1 740 3 68 102
41496 2 777 1 78 192
41498 1 826 4 60 226

After conversion (1 = M, 2 = F):

IDNum Sex Age Ethnicity Pulse BPSyst1
41475 F 752 Other 66 128
41477 M 860 White 78 144
41479 M 630 MexAm 62 112
41481 M 254 Black 52 112
41482 M 779 MexAm 76 116
41483 M 804 Black 60 110
41485 F 361 Hispanic 68 108
41486 F 735 MexAm 62 126
41487 M 332 Other 74 120
41489 F 483 MexAm 58 106
41490 F 803 Black 72 128
41492 M 866 White 62 156
41493 F 935 White 68 106
41494 M 481 MexAm 82 120
41495 M 740 White 68 102
41496 F 777 MexAm 78 192
41498 M 826 Black 60 226
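A minimal sketch of the conversion on a short vector of numeric codes. The four values are illustrative; labels are assigned to the sorted unique codes, so 1 becomes M and 2 becomes F.

```matlab
% Numeric codes as stored in the raw data (1 and 2).
Sex = [2; 1; 1; 2];

% Convert to a nominal array; labels follow the sorted unique values.
SexCat = nominal(Sex, {'M', 'F'});

labels = getlabels(SexCat);   % the category labels
nF = sum(SexCat == 'F');      % categorical values can be compared to labels
```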

(17)

Dealing with Missing Data

LegLength ArmLength ArmCirc Waist Triceps Subscapular ReportHeight ReportWeight HighBP Overweight LikeToWeigh

34.2 37.6 45.2 156.3 62 308 Y Over Less

32.4 38.2 34.1 109.5 15.6 25.8 68 190 Y Over Same

33.3 34.1 33.2 95.4 13.3 25.9 67 142 N Right Same

43.6 43 31 79.5 8.4 9.2 73 175 N Right Same

43.5 40 32.8 117 18 20.1 68 228 Y Over Less

36.5 38.9 40.5 145.6 17.6 71 289 Y Over Less

34.3 32.6 30.7 89.7 21.1 30.5 63 138 N Right Same

35 36 35.3 97 24.6 24.8 64 187 Y Over Less

40 37 29.9 82.9 13.8 20 68 150 Y Right Same

37 34 37.8 109.2 34.5 27.4 63 200 N Over Less

40.5 37 31.2 98.4 34.8 25.5 63 140 N Over Less

34 39 34 106.2 24.6 24 67 180 Y Right Same

37.1 36.6 21.9 78 8 4.6 66 120 N Right Same

39 36.4 32 95.6 12 21.6 66 165 N Right Less

41 41.7 32.6 106.9 12.4 17.4 73 202 N Over Less

33.5 35.2 33.3 100.9 24.2 11 9999 9999 N Under More

39.4 39 31.1 95.5 11.6 64 180 Y Right Same

63 141 N Right Less

41 39.2 37.2 132 18 69 250 Y Over Less

40.4 38.5 36.6 95.4 11.2 15.1 67 185 N Over Less

40.1 36.9 30.6 85.9 13.2 10.7 66 165 N Right Same

41.7 40 29 95 10.4 18 71 166 Y Right More
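One way to replace, ignore, and delete missing values, sketched on a few ReportWeight entries from the table. Treating 9999 as a missing-value code is an assumption of this sketch.

```matlab
% Reported weights; 9999 is assumed here to be a missing-value code.
ReportWeight = [308; 190; 9999; 180];

% Replace: turn the sentinel value into NaN.
w = ReportWeight;
w(w == 9999) = NaN;

% Ignore: nanmean skips NaN values when averaging.
m = nanmean(w);

% Delete: keep only the observations that are not missing.
clean = w(~isnan(w));
```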

(18)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(19)

Outline

• Plotting

• Measures of center, spread, and shape

• Correlations

• Grouped data

[Figure: scatter plot of Weight [kg] vs. Height [cm], grouped by sex (M, F)]

(20)

Chapter 3 Learning Outcomes

The attendee will be able to:

• Calculate measures of central tendency, spread, and shape.

• Visualize data using an appropriate statistical plot.

• Calculate linear correlation between variables.

(21)

Basic Plotting

[Figure: scatter plot of Weight [kg] vs. Height [cm], with selected points labeled 44184, 46871, 49462, 51524, 48689]

[Figure: scatter-plot matrix of several variables]

[Figure: histogram of Weight [kg]; vertical axis: Number]

(22)

Histograms

>> hist(data,bins)

>> n = hist(data,bins)

>> n = histc(data,bins)

>> bar(n)

[Figure: histogram]
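A short sketch of the difference between the two counting functions: hist interprets its second argument as bin centers, while histc interprets it as bin edges. The sample values are made up.

```matlab
% Hypothetical sample.
x = [1 2 2 3 3 3 4 4 5];

% hist: the second argument gives bin centers.
centers = 1:5;
nc = hist(x, centers);

% histc: the second argument gives bin edges; bin k counts
% edges(k) <= x < edges(k+1), and the last bin counts x == edges(end).
edges = 0:2:6;
ne = histc(x, edges);

% bar(ne) would draw the edge-based counts as a bar chart.
```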

(23)

Measures of Centrality

mean, median, mode (hist + max)

geomean, harmmean, trimmean

[Figure: histogram with mean, median, and mode marked]
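The functions listed here can be compared on a small, skewed sample. The values are invented for the sketch.

```matlab
% Hypothetical right-skewed sample.
x = [1 2 2 4 8];

m  = mean(x);          % arithmetic mean
md = median(x);        % middle value
mo = mode(x);          % most frequent value
gm = geomean(x);       % geometric mean
hm = harmmean(x);      % harmonic mean
tm = trimmean(x, 40);  % mean after trimming 40% of the data (20% per tail)
```

The single large value pulls the arithmetic mean above the median, while the geometric and harmonic means are less affected.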

(24)

Measures of Spread

min, max, range

std, var, iqr, mad

moment, quantile

[Figure: histogram annotated with the measures of spread]
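A brief sketch of the spread functions on a made-up sample. Note that var and std use the n-1 normalization by default, and mad returns the mean absolute deviation by default.

```matlab
% Hypothetical sample with mean 5.
x = [2 4 4 4 5 5 7 9];

r  = range(x);           % max(x) - min(x)
v  = var(x);             % sample variance (normalized by n-1)
s  = std(x);             % sample standard deviation
d  = mad(x);             % mean absolute deviation from the mean
q  = iqr(x);             % interquartile range
qs = quantile(x, [0.25 0.5 0.75]);   % quartiles
```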

(25)

Measures of Shape

skewness, kurtosis

moment, quantile, prctile

[Figure: two histograms, with skewness 1.1575 and -0.0278]

(26)

Correlations

scatter, plotmatrix, gname

[Figure: scatter plot of Weight [kg] vs. Height [cm], with individual points labeled]

[Figure: scatter-plot matrix of several variables]

>> corr(X)

1.0000 0.4480 0.7442 0.7969 0.2393 0.1714

0.4480 1.0000 0.2456 0.5932 0.8930 0.8844

0.7442 0.2456 1.0000 0.6071 0.0995 -0.0401

0.7969 0.5932 0.6071 1.0000 0.4558 0.4066

0.2393 0.8930 0.0995 0.4558 1.0000 0.8176

0.1714 0.8844 -0.0401 0.4066 0.8176 1.0000
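To make the correlation matrix easier to read, here is a small sketch with columns whose relationships are known in advance. The data are invented.

```matlab
% Three hypothetical variables: the second rises linearly with the first,
% the third falls linearly with it.
x = (1:5)';
X = [x, 2*x + 1, -x];

% Pairwise linear (Pearson) correlation between the columns of X.
R = corr(X);
```

R is symmetric with ones on the diagonal; perfectly linear increasing pairs give 1 and decreasing pairs give -1.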

(27)

Grouped Data

>> gscatter(data.Height,data.Weight,data.Sex)

[Figure: scatter plot of Weight [kg] vs. Height [cm], grouped by sex (M, F)]

>> boxplot(data.Height,data.Ethnicity)

[Figure: box plot of Height [cm] by ethnicity (MexAm, Hispanic, White, Black, Other)]

(28)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(29)

Outline

• Density functions

• Probability distributions

• Distribution parameters

• Comparing and fitting distributions

[Figure: probability density of Weight [kg] for parameter values 3.5, 4.0, 4.5, 5.0]

(30)

Chapter 4 Learning Outcomes

The attendee will be able to:

• Visually compare the distribution of a data set to standard probability distributions.

• Determine distribution parameters for a data set, assuming a given parametric probability distribution.

(31)

Density Functions

PDF: hist

CDF: ecdf, cdfplot

ICDF: the CDF with x and y exchanged

[Figure: PDF of Weight [kg] as a histogram; vertical axis: Percentage]

[Figure: empirical CDF of Weight [kg]; vertical axis: Percentage]

[Figure: inverse CDF; Weight [kg] vs. Percentage]
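A minimal sketch of the empirical CDF on a made-up sample. ecdf returns the cumulative fraction f at each data value x.

```matlab
% Hypothetical sample of four distinct values.
y = [1; 2; 3; 4];

% f(k) is the fraction of the sample that is <= x(k).
[f, x] = ecdf(y);

% cdfplot(y) would draw the same step function.
```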

(32)

Probability Distributions

Beta, Binomial, Chi-Square, Noncentral Chi-Square, Exponential, Extreme Value, Generalized Extreme Value, F, Noncentral F, Gamma, Generalized Pareto, Geometric, Hypergeometric, Lognormal, Negative Binomial, Normal, Poisson, Rayleigh, Student's t, Noncentral t, Uniform (Continuous), Uniform (Discrete), Weibull

lognpdf

[Figure: probability density of Weight [kg]]

logncdf

[Figure: empirical CDF; Probability vs. Weight [kg]]

logninv

[Figure: inverse CDF; Weight [kg] vs. Probability]
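The three lognormal functions are related as density, cumulative distribution, and its inverse. The sketch below uses made-up parameter values (mu and sigma are not the fitted course values).

```matlab
% Hypothetical lognormal parameters (mean and std of log(weight)).
mu = 4.35;
sigma = 0.25;

d = lognpdf(80, mu, sigma);   % density at 80 kg
p = logncdf(80, mu, sigma);   % probability that weight <= 80 kg
w = logninv(p, mu, sigma);    % inverse CDF recovers 80 kg

med = logninv(0.5, mu, sigma);   % the median of a lognormal is exp(mu)
```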

(33)

Distribution Parameters

>> disttool

P(W) = \frac{1}{W \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln W - \mu)^2}{2\sigma^2} \right)

[Figure: probability density of Weight [kg] for \mu = 3.5, 4.0, 4.5, 5.0]

[Figure: probability density of Weight [kg] for \sigma = 0.20, 0.25, 0.30, 0.35]

(34)

Comparing Distributions

probplot, normplot, qqplot

jbtest, lillietest

[Figure: probability plots of the data for the Normal, Lognormal, Rayleigh, and Exponential distributions; axes: Data vs. Probability]
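A hedged sketch of the two normality tests. Instead of random numbers it uses exact normal quantiles (and their exponentials) as stand-in data, which is an assumption made so that the outcome does not depend on a random seed.

```matlab
% Deterministic stand-in data: 200 evenly spaced normal quantiles.
q = ((1:200)' - 0.5) / 200;
x = norminv(q);     % looks normal
y = exp(x);         % looks lognormal, strongly right-skewed

% Each test returns h = 0 if normality is not rejected, h = 1 if it is.
hx_jb = jbtest(x);
hy_jb = jbtest(y);
hx_l  = lillietest(x);

% probplot('lognormal', y) would show y falling on a straight line.
```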

(35)

Fitting Distributions

mle, lognfit

>> dfittool

P(W) = \frac{1}{W \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln W - \mu)^2}{2\sigma^2} \right)

\hat{\mu} = 4.3536

\hat{\sigma} = 0.2489

[Figure: probability density of Weight [kg] with the fitted distribution]
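A small sketch of fitting lognormal parameters with lognfit. The six sample values are invented; for uncensored data the estimates equal the mean and standard deviation of log(x).

```matlab
% Hypothetical weights, generated as exponentials of known log-values.
x = exp([4.0; 4.2; 4.3; 4.35; 4.4; 4.6]);

% parmhat(1) estimates mu, parmhat(2) estimates sigma.
parmhat = lognfit(x);
muHat = parmhat(1);
sigmaHat = parmhat(2);
```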

(36)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(37)

Outline

• Linear regression models

• Fitting linear models to data

• Evaluating the fit

• Adjusting the model

[Figure: Pulse Pressure [mmHg] vs. Age [yr]]

[Figure: plot of residuals vs. fitted values]

(38)

Chapter 7 Learning Outcomes

The attendee will be able to:

• Perform a multiple linear regression.

• Increase robustness of a regression.

• Perform a nonlinear or generalized linear regression.

(39)

Linear Regression Models

y = \sum_k c_k f_k(x)

y: response; x: predictor; c_k: model parameters; f_k: design functions

W = mA + c

[Figure: Waist Circumference [cm] vs. Arm Circumference [cm]]

P = c_0 + c_1 A + c_2 A^2

[Figure: Pulse Pressure [mmHg] vs. Age [yr]]

(40)

LinearModel Objects

circfit = LinearModel.fit(x,y)

[Figure: Waist Circumference [cm] vs. Arm Circumference [cm]]

plot(circfit)

[Figure: y vs. x1, showing data, fit, and confidence bounds]

circfit.Coefficients

               Estimate    SE         tStat     pValue
(Intercept)    12.294      0.80999    15.178    5.1775e-51
x1             2.6065      0.02445    106.61    0
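A minimal sketch of fitting a straight line with LinearModel.fit. The data are synthetic (a known line plus a small alternating disturbance), not the course measurements.

```matlab
% Synthetic data: y = 12 + 2.6*x with a small deterministic disturbance.
x = (15:5:60)';
noise = 0.5 * (-1).^(1:numel(x))';
y = 12 + 2.6 * x + noise;

% Fit y = c0 + c1*x by least squares.
circfit = LinearModel.fit(x, y);

% Estimated coefficients: intercept first, then slope.
est = circfit.Coefficients.Estimate;
```

plot(circfit) would then draw the data, the fitted line, and confidence bounds, as on the slide.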

(41)

Evaluating Goodness of Fit

Diagnostic properties

circfit.Rsquared

circfit.Diagnostics.CooksDistance

[h,p] = jbtest(circfit.Residuals.Raw)

Diagnostic plot methods

plotResiduals(circfit,'fitted')

plotDiagnostics(circfit,'cookd')

[Figure: plot of residuals vs. fitted values]

[Figure: case order plot of Cook's distance; Cook's distance vs. row number]
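The diagnostic properties can be read directly from a fitted model. This sketch refits a synthetic line (the data are an assumption, not the course data) and extracts the same quantities.

```matlab
% Synthetic data: a known line plus a small deterministic disturbance.
x = (15:5:60)';
y = 12 + 2.6 * x + 0.5 * (-1).^(1:numel(x))';
circfit = LinearModel.fit(x, y);

R2 = circfit.Rsquared.Ordinary;             % coefficient of determination
cd = circfit.Diagnostics.CooksDistance;     % influence of each observation
r  = circfit.Residuals.Raw;                 % observed minus fitted values

% plotResiduals(circfit,'fitted') and plotDiagnostics(circfit,'cookd')
% draw the corresponding diagnostic plots.
```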

(42)

Higher-Order Models

P = c_0 + c_1 A + c_2 A^2

P ~ A^2

'y ~ x1^2'

[Figure: y vs. x1, showing data, fit, and confidence bounds]

W = c_0 + c_1 H + c_2 C + c_3 H^2 + c_4 HC + c_5 C^2

W ~ H^2 + C^2 + H:C

'y ~ x1^2 + x2^2 + x1:x2'
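A short sketch of a quadratic fit specified with a formula string. In this notation x1^2 includes the lower-order term x1 as well. The data are synthetic.

```matlab
% Synthetic data: y = 1 + 2*x + 3*x^2 with a small deterministic disturbance.
x = (1:10)';
y = 1 + 2*x + 3*x.^2 + 0.1 * (-1).^(1:10)';

% For matrix input, the default names are x1 (predictor) and y (response).
mdl = LinearModel.fit(x, y, 'y ~ x1^2');

names = mdl.CoefficientNames;       % intercept, x1, x1^2
est   = mdl.Coefficients.Estimate;  % estimated c0, c1, c2
```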

(43)

Ignoring Outliers

LinearModel.fit(...,'RobustOpts','on')

[Figure: Pulse Pressure [mmHg] vs. Age [yr], showing data, fit, and confidence bounds]

[Figure: histogram of residuals]

[Figure: Cook's distance vs. row number]

[Figure: Pulse Pressure [mmHg] vs. Age [yr], with robust weighting of the observations]
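A sketch of how robust fitting down-weights an outlier. The data are synthetic: a known line with one deliberately corrupted observation.

```matlab
% Synthetic data: y = 1 + 2*x with a small disturbance and one outlier.
x = (1:20)';
y = 1 + 2*x + 0.1 * sin(1:20)';
y(10) = 100;                        % corrupt one observation

% Ordinary least squares vs. robust (iteratively reweighted) fit.
ols = LinearModel.fit(x, y);
rob = LinearModel.fit(x, y, 'RobustOpts', 'on');

% Distance of each coefficient estimate from the true values [1; 2].
errOls = norm(ols.Coefficients.Estimate - [1; 2]);
errRob = norm(rob.Coefficients.Estimate - [1; 2]);
```

The robust estimate stays close to the underlying line, while the ordinary fit is pulled toward the outlier.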

(44)

Adjusting the Model

W = c_0 + c_1 H + c_2 C + c_3 H^2 + c_4 HC + c_5 C^2

Analysis of Variance

LinearModel.stepwise

(45)

Course Outline

● Importing and Organizing Data

● Exploring Data

● Distributions

● Regression

● Analysis of Variance

(46)

Outline

• Multiple comparisons

• 1-Way ANOVA

• N-Way ANOVA

• MANOVA

• Nonnormal ANOVA

• Categorical correlations

[Figure: distributions of a weight variable [kg] for the groups Over, Under, Right]

[Figure: box plot of Height [cm] by group (Over, Under, Right)]

(47)

Chapter 6 Learning Outcomes

The attendee will be able to:

• Determine if groups within a data set have significantly different sample means.

• Perform multiple pairwise tests between groups.

(48)

Multiple Comparisons

[Figure: distributions of Height [cm] for the six groups M Over, M Under, M Right, F Over, F Under, F Right]

[Figure: box plot of Height [cm] for the same six groups]

(49)

1-Way ANOVA

>> p = anova1(data.Height,data.Overweight);

H₀: groups all have the same mean

H₁: no, they don't

[Figure: box plot of Height [cm] by group (Over, Under, Right)]
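A minimal sketch of a one-way ANOVA on invented heights for three groups. The 'off' argument suppresses the table and box plot.

```matlab
% Hypothetical heights [cm] in three groups with clearly different means.
height = [150 152 154 160 162 164 170 172 174]';
group  = {'Under'; 'Under'; 'Under'; ...
          'Right'; 'Right'; 'Right'; ...
          'Over';  'Over';  'Over'};

% p is the p-value for H0: all group means are equal.
p = anova1(height, group, 'off');
```

A small p-value is evidence against H₀; it does not say which group differs.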

(50)

Who's the Odd One Out?

>> [p,tbl,stats] = anova1(data.Height,data.Overweight);

>> cmat = multcompare(stats)
cmat =
    1.0000    2.0000   -2.8592   -1.5090   -0.1588
    1.0000    3.0000   -1.9249   -1.2744   -0.6239
    2.0000    3.0000   -1.1294    0.2346    1.5987

\mu_1 - \mu_3 = -1.2744, 95% CI: [-1.92, -0.62]

[Figure: box plot of Height [cm] by group (Over, Under, Right)]
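A sketch of the pairwise follow-up on invented data. Each row of the result names two groups, followed by the lower bound, the estimate, and the upper bound of the difference in their means (newer releases append a p-value column).

```matlab
% Hypothetical heights [cm] in three groups.
height = [150 152 154 160 162 164 170 172 174]';
group  = {'Under'; 'Under'; 'Under'; ...
          'Right'; 'Right'; 'Right'; ...
          'Over';  'Over';  'Over'};

% The stats output of anova1 feeds the multiple comparison.
[p, tbl, stats] = anova1(height, group, 'off');
cmat = multcompare(stats, 'Display', 'off');
```

A confidence interval that excludes zero marks a pair of groups whose means differ significantly.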

(51)

N-Way ANOVA

multiple factors

>> p = anovan(data.Height,{data.Overweight,data.LikeToWeigh},...
'model','interaction')

[Figure: box plot of Height [cm] for each combination of Overweight (Over, Under, Right) and LikeToWeigh (Less, More, Same)]
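A small sketch of a two-factor ANOVA with an interaction term, on an invented balanced design with two replicates per cell.

```matlab
% Hypothetical heights [cm] for a 2-by-2 design, two replicates per cell.
height = [170; 172; 160; 163; 175; 176; 165; 169];
overweight  = {'Over'; 'Over'; 'Over'; 'Over'; ...
               'Right'; 'Right'; 'Right'; 'Right'};
likeToWeigh = {'Less'; 'Less'; 'Same'; 'Same'; ...
               'Less'; 'Less'; 'Same'; 'Same'};

% One p-value per term: the two main effects and their interaction.
p = anovan(height, {overweight, likeToWeigh}, ...
           'model', 'interaction', 'display', 'off');
```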

(52)

Multiple Responses

Factor: size (self-view). Responses: "heart function" (pulse, BP).

>> [d,p] = manova1([data.Pulse,data.BPSyst1],data.Overweight)

[Figure: Systolic BP [mmHg] vs. Pulse [bpm] for the groups Less, More, Same; title: Would like to weigh more/less/the same]

[Figure: Systolic BP [mmHg] vs. Pulse [bpm] for the groups Under, Over, Right; title: Self-view of weight -- wish to weigh more]

[Figure: Systolic BP [mmHg] vs. Pulse [bpm] for the groups Under, Right; title: Self-view of weight -- excluding overweight]

Values shown for the three panels: d = 2, d = 1, d = 0

(53)