MATLAB for Statistic Application
Pony Lai
support@terasoft.com.tw
© 2012 MathWorks, Inc.
Course Outline
● Importing and Organizing Data
● Exploring Data
● Distributions
● Regression
● Analysis of Variance
Outline
• Course example: biomedical data
• Importing data
• Data types
• Dataset arrays
• Merging data
• Categorical data
• Working with missing data
Chapter 2 Learning Outcomes
The attendee will be able to:
• Store data in an appropriate data type.
• Access and manipulate data stored in a statistical data type.
• Ignore, replace, or delete observations with missing data values.
Course Example: Biomedical Data
NHANES
[Figure: Normal Probability Plot; axes: Data, Probability; groups: M, F]
Importing Data
.csv: csvread
.txt: dlmread, textscan
.xpt: xptread
.xls: xlsread
Data Types
Variable: logical, char, numeric (integer, single, double), cell, structure, function handle, user class
Statistics Toolbox: ordinal, nominal, dataset
Numeric and Cell Arrays
Conversion: num2cell, mat2cell, cell2mat
Indexing: x(3) vs. x{3}
Indexing a range: x{3:6} vs. [x{3:6}]
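The difference between the two indexing forms can be sketched as follows. The cell array contents are made up for illustration.

```matlab
% Illustrative cell array (values are invented for this sketch)
x = {10, 'abc', 30, 40, 50, 60};

a = x(3);        % parentheses return a 1-by-1 cell that contains 30
b = x{3};        % braces return the contents: the double 30
v = [x{3:6}];    % x{3:6} is a comma-separated list; brackets concatenate it

c = num2cell([1 2 3]);   % numeric array -> cell array, one element per cell
m = cell2mat(c);         % cell array -> numeric array
```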
Dataset Arrays
Columns are variables, rows are observations, and variables can have mixed data types.
Rows with fewer than nine entries contain missing values.
ArmLength ArmCirc Waist Triceps Subscapular ReportHeight ReportWeight Overweight LikeToWeigh
37.6 45.2 156.3 62 308 Over Less
38.2 34.1 109.5 15.6 25.8 68 190 Over Same
34.1 33.2 95.4 13.3 25.9 67 142 Right Same
43 31 79.5 8.4 9.2 73 175 Right Same
40 32.8 117 18 20.1 68 228 Over Less
38.9 40.5 145.6 17.6 71 289 Over Less
32.6 30.7 89.7 21.1 30.5 63 138 Right Same
36 35.3 97 24.6 24.8 64 187 Over Less
37 29.9 82.9 13.8 20 68 150 Right Same
34 37.8 109.2 34.5 27.4 63 200 Over Less
37 31.2 98.4 34.8 25.5 63 140 Over Less
39 34 106.2 24.6 24 67 180 Right Same
36.6 21.9 78 8 4.6 66 120 Right Same
36.4 32 95.6 12 21.6 66 165 Right Less
41.7 32.6 106.9 12.4 17.4 73 202 Over Less
35.2 33.3 100.9 24.2 11 9999 9999 Under More
39 31.1 95.5 11.6 64 180 Right Same
63 141 Right Less
39.2 37.2 132 18 69 250 Over Less
38.5 36.6 95.4 11.2 15.1 67 185 Over Less
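A minimal sketch of building such an array by hand, using three complete observations from the table. It assumes the Statistics Toolbox `dataset` class; the variable selection is only for illustration.

```matlab
% Three complete observations copied from the table above
ArmLength  = [38.2; 34.1; 43];
ArmCirc    = [34.1; 33.2; 31];
Waist      = [109.5; 95.4; 79.5];
Overweight = {'Over'; 'Right'; 'Right'};   % text stored next to numbers

% dataset names each variable after the workspace variable passed in
ds = dataset(ArmLength, ArmCirc, Waist, Overweight);

[nObs, nVars] = size(ds);   % number of observations and variables
summary(ds)                 % per-variable overview printed to the command window
```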
Importing Data into Dataset Arrays
.csv, .txt, .xpt, .xls: dataset
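As a sketch of the file-based constructor: the block below writes a small comma-separated file first so that it is self-contained. The file name and its contents are illustrative and are not the course data files.

```matlab
% Write a small text file (illustrative contents, temporary file name)
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'ArmLength,ArmCirc,Waist\n');
fprintf(fid, '38.2,34.1,109.5\n');
fprintf(fid, '34.1,33.2,95.4\n');
fclose(fid);

% Read the file into a dataset array; the header row supplies the variable names
ds = dataset('File', fname, 'Delimiter', ',');
delete(fname);

% Other sources use different options (file names here are hypothetical):
%   ds = dataset('XLSFile', 'file.xls');   % Excel spreadsheet
%   ds = dataset('XPTFile', 'file.xpt');   % SAS transport file
```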
Merging Data
data = join(data1,data2,'key','SEQN')

data1:
SEQN BMDSTATS BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC
41475 3 138.9 154.7 58.04 34.2 37.6 45.2
41476 1 22 120.4 15.18 25.1 17.1
41477 1 83.9 167.1 30.05 32.4 38.2 34.1
41478 1 11.5 14.2 17.3
41479 1 65.7 154.4 27.56 33.3 34.1 33.2
41480 1 27 122.7 17.93 25.8 20.9
41481 1 77.9 182.7 23.34 43.6 43 31
41482 1 101.6 173.8 33.64 43.5 40 32.8
41483 3 133.1 173.8 44.06 36.5 38.9 40.5
41484 1 9.3 14.5 15.2
41485 1 64.8 157.9 25.99 34.3 32.6 30.7
41486 1 86.2 166.2 31.21 35 36 35.3
41487 1 67.1 169.2 23.44 40 37 29.9
41488 1 16.1 103.1 15.15 22 17
41489 1 91.8 158.4 36.59 37 34 37.8
41490 1 70.7 159.8 27.69 40.5 37 31.2
41491 1 33.5 146.1 15.69 33.7 29.5 20.8
41492 1 82.6 168.7 29.02 34 39 34
41493 1 53.3 163.4 19.96 37.1 36.6 21.9
41494 1 70.5 166.4 25.46 39 36.4 32
41495 1 87.2 184.8 25.53 41 41.7 32.6
41496 1 67.8 154.6 28.37 33.5 35.2 33.3
41497 3 12.5 90.6 15.23 17.6 13.8
41498 3 77.4 165.8 28.16 39.4 39 31.1
41499 4
41500 1 91.6 156.5 37.4 38.7 33 35.4
41501 3 114.8 177.9 36.27 41 39.2 37.2
41502 1 82.4 168.2 29.13 40.4 38.5 36.6
41503 1 73.1 173.9 24.17 40.1 36.9 30.6
41504 1 74.3 178.5 23.32 41.7 40 29
41505 1 10.2 16 14.5
41506 3 176.9
41507 1 18.7 116.3 13.83 23.4 16
41508 1 9.8 16.5 14.2
41509 1 101.3 171.9 34.28 36.5 38.9 42
41510 1 66.8 171.8 22.63 37.6 35.2 28.2
41511 1 78.8 173.6 26.15 38.5 36.9 30.6

data2:
SEQN WHD010 WHD020 WHQ030 WHQ040 WHD045 WHD050
41475 62 308 1 2 150 308
41477 68 190 1 3 190
41479 67 142 3 3 148
41481 73 175 3 3 165
41482 68 228 1 2 185 205
41483 71 289 1 2 180 289
41485 63 138 3 3 138
41486 64 187 1 2 160 197
41487 68 150 3 3 141
41489 63 200 1 2 175 208
41490 63 140 1 2 130 135
41492 67 180 3 3 185
41493 66 120 3 3 150
41494 66 165 3 2 150 165
41495 73 202 1 2 180 202
41496 9999 9999 2 1 99999 9999
41498 64 180 3 3 180
41499 63 141 3 2 125 149
41501 69 250 1 2 220 250
41502 67 185 1 2 165 185
41503 66 165 3 3 165
41504 71 166 3 1 170 165
41506 70 160 3 2 155 160
41509 69 200 1 2 170 220
41510 68 142 1 2 135 147
41511 69 178 3 3 185
41512 63 95 2 1 110 90
41513 59 96 3 3 96
41514 68 198 1 3 198
41516 64 215 1 2 135 230
41518 68 210 1 2 185 210
41519 74 210 3 2 200 210
41522 70 189 3 3 189
41523 73 235 1 3 211
41524 61 118 3 3 118
41525 67 165 3 3 175
41526 69 178 1 2 160 178
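A minimal sketch of the merge on two cut-down arrays, using a few SEQN, BMXWT, and WHD010 values from the tables. The inner join and the `MergeKeys` option are one possible way to handle keys without a match and are an assumption about how the course data would be treated.

```matlab
% Two small dataset arrays that share the key variable SEQN
SEQN  = [41475; 41477; 41479];
BMXWT = [138.9; 83.9; 65.7];
data1 = dataset(SEQN, BMXWT);

SEQN   = [41477; 41479; 41475];     % key order need not match data1
WHD010 = [68; 67; 62];
data2  = dataset(SEQN, WHD010);

% Every key in data1 has a match in data2, so the default join applies
data = join(data1, data2, 'Keys', 'SEQN');

% SEQN 41476 appears only in data1; an inner join keeps just the
% observations whose key is present in both arrays
data1b = [data1; dataset({[41476 22], 'SEQN', 'BMXWT'})];
inner  = join(data1b, data2, 'Keys', 'SEQN', 'Type', 'inner', 'MergeKeys', true);
```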
Indexing into Dataset Arrays
>> x = data(5:20,[9,12,18])                        % variables selected by column number
>> x = data(5:20,{'BMXHT','BMXLEG','BMXWAIST'})    % the same variables selected by name
SEQN BMDSTATS BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT BMXBMI BMXLEG BMILEG BMXARML BMIARML BMXARMC BMIARMC BMXWAIST
41475 3 138.9 154.7 58.04 34.2 37.6 45.2 156.3
41477 1 83.9 167.1 30.05 32.4 38.2 34.1 109.5
41479 1 65.7 154.4 27.56 33.3 34.1 33.2 95.4
41481 1 77.9 182.7 23.34 43.6 43 31 79.5
41482 1 101.6 173.8 33.64 43.5 40 32.8 117
41483 3 133.1 173.8 44.06 36.5 38.9 40.5 145.6
41485 1 64.8 157.9 25.99 34.3 32.6 30.7 89.7
41486 1 86.2 166.2 31.21 35 36 35.3 97
41487 1 67.1 169.2 23.44 40 37 29.9 82.9
41489 1 91.8 158.4 36.59 37 34 37.8 109.2
41490 1 70.7 159.8 27.69 40.5 37 31.2 98.4
41492 1 82.6 168.7 29.02 34 39 34 106.2
41493 1 53.3 163.4 19.96 37.1 36.6 21.9 78
41494 1 70.5 166.4 25.46 39 36.4 32 95.6
41495 1 87.2 184.8 25.53 41 41.7 32.6 106.9
41496 1 67.8 154.6 28.37 33.5 35.2 33.3 100.9
41498 3 77.4 165.8 28.16 39.4 39 31.1 95.5
41499 4
41501 3 114.8 177.9 3 36.27 41 39.2 37.2 132
41502 1 82.4 168.2 29.13 40.4 38.5 36.6 95.4
41503 1 73.1 173.9 24.17 40.1 36.9 30.6 85.9
41504 1 74.3 178.5 3 23.32 41.7 40 29 95
41506 3 176.9
41509 1 101.3 171.9 34.28 36.5 38.9 42 114.5
41510 1 66.8 171.8 22.63 37.6 35.2 28.2 81.9
41511 1 78.8 173.6 26.15 38.5 36.9 30.6 93.5
41512 2 44.2 154.7 18.47 1 1 1
41513 1 42.6 137.9 22.4 29.2 30 24.4 78.2
41514 1 87.4 171.7 29.65 35.7 41 32.1 111.9
41516 3 95.3 3 160.6 3 36.95 1 40.4 35.5 126.4
41518 1 93 178.7 29.12 38.4 37.2 32.6 101.9
41519 1 96.5 184.5 28.35 44.2 42.2 34.2 113.7
41522 1 80.4 174.9 26.28 44.2 41 30.9 107
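The two indexing styles can be tried on a small stand-in array. The values are taken from the first rows of the table; the logical row selection at the end is an extra illustration and is not on the slide.

```matlab
% Small stand-in for the course data
SEQN     = [41475; 41477; 41479; 41481];
BMXHT    = [154.7; 167.1; 154.4; 182.7];
BMXWAIST = [156.3; 109.5; 95.4; 79.5];
data = dataset(SEQN, BMXHT, BMXWAIST);

% Parentheses return a smaller dataset array
x1 = data(2:3, [2 3]);                   % variables by position
x2 = data(2:3, {'BMXHT','BMXWAIST'});    % the same variables by name
tall = data(data.BMXHT > 160, :);        % rows selected by a logical condition
```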
Indexing into Dataset Array Variables
>> x = data.BPXPLS(5:20)
SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPQ150A BPQ150B BPQ150C BPQ150D BPAARM BPACSZ BPXPLS BPXPULS BPXPTY BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2
41475 1 612 2 2 2 2 1 5 66 1 1 150 128 64 2 122 60
41476 1 9 72 1
41477 1 625 2 2 2 2 1 4 78 1 1 180 144 60 2 148 60
41478 1 278 120 1
41479 1 620 1 2 2 2 1 4 62 1 1 140 112 70 2 108 66
41480 1 79 86 1
41481 1 533 2 2 2 2 1 4 52 1 1 140 112 62 2 104 64
41482 1 561 2 2 2 2 1 4 76 1 1 150 116 78 2 114 80
41483 1 748 2 2 2 2 1 5 60 1 1 140 110 62 2 112 56
41484 1 285 134 1
41485 1 547 2 2 2 2 1 4 68 1 1 140 108 44 2 110 42
41486 1 556 2 2 2 2 1 4 62 1 1 150 126 64 2 120 70
41487 1 539 2 2 2 2 1 4 74 1 1 150 120 84 2 116 80
41488 1 59 110 1
41489 1 570 2 2 2 2 1 4 58 1 1 130 106 66 2 114 72
41490 1 564 1 2 2 2 1 4 72 1 1 150 128 64 2 134 62
41491 1 555 2 2 2 2 1 2 84 1 1 130 120 76 2 122 78
41492 1 704 2 2 2 2 1 4 62 1 1 180 156 94 2 152 86
41493 1 567 2 2 2 2 1 3 68 1 1 150 106 42 2 118 54
41494 1 670 2 2 2 2 1 4 82 1 1 140 120 78 2 108 64
41495 1 676 2 2 2 2 1 4 68 1 1 150 102 64 2 98 62
41496 1 676 2 2 1 2 1 4 78 1 1 210 192 86 2 192 84
41497 1 20 96 1
41498 1 698 1 2 2 2 1 4 60 1 1 240 226 96 2 204 88
41499 1 467 2 2 2 2 1 3 80 1 1 130 116 52 2 114 68
41500 1 531 2 2 2 2 1 4 104 1 1 140 114 58 2 116 56
41501 1 574 2 2 2 2 1 4 60 1 1 170 150 70 2 144 66
41502 1 516 2 2 2 2 1 4 70 1 1 150 124 54 2 116 68
41503 1 625 1 2 2 2 1 4 66 1 1 130 100 52 2 108 46
41504 1 743 2 2 2 2 1 4 72 1 1 180 148 88 2 146 96
41505 1 296 130 1
41506 1 510 2 2 2 2 2 3 60 1 1 160 128 84 2 124 84
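A short sketch of dot indexing on a one-variable stand-in array. The pulse values are the BPXPLS entries of the first complete rows above.

```matlab
% Pulse values (BPXPLS) for SEQN 41475, 41477, 41479, 41481, 41482
BPXPLS = [66; 78; 62; 52; 76];
data = dataset(BPXPLS);

% Dot notation returns the variable itself (here a double column vector),
% which can then be indexed like any other array
x = data.BPXPLS(2:4);

% The same syntax on the left-hand side changes values in place
data.BPXPLS(1) = NaN;
```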
Dataset Array Properties
>> data.Properties
>> data.Properties.VarNames = NewNames;

Before:
SEQN RIAGENDR RIDAGEEX RIDRETH1 BPXPLS BPXSY1 BPXDI1 BMXHT BMXWT BMXLEG BMXARML BMXARMC BMXWAIST
41475 2 752 5 66 128 64 154.7 138.9 34.2 37.6 45.2 156.3
41477 1 860 3 78 144 60 167.1 83.9 32.4 38.2 34.1 109.5
41479 1 630 1 62 112 70 154.4 65.7 33.3 34.1 33.2 95.4
41481 1 254 4 52 112 62 182.7 77.9 43.6 43 31 79.5
41482 1 779 1 76 116 78 173.8 101.6 43.5 40 32.8 117
41483 1 804 4 60 110 62 173.8 133.1 36.5 38.9 40.5 145.6
41485 2 361 2 68 108 44 157.9 64.8 34.3 32.6 30.7 89.7
41486 2 735 1 62 126 64 166.2 86.2 35 36 35.3 97
41487 1 332 5 74 120 84 169.2 67.1 40 37 29.9 82.9
41489 2 483 1 58 106 66 158.4 91.8 37 34 37.8 109.2
41490 2 803 4 72 128 64 159.8 70.7 40.5 37 31.2 98.4
41492 1 866 3 62 156 94 168.7 82.6 34 39 34 106.2
41493 2 935 3 68 106 42 163.4 53.3 37.1 36.6 21.9 78
41494 1 481 1 82 120 78 166.4 70.5 39 36.4 32 95.6
41495 1 740 3 68 102 64 184.8 87.2 41 41.7 32.6 106.9
41496 2 777 1 78 192 86 154.6 67.8 33.5 35.2 33.3 100.9
41498 1 826 4 60 226 96 165.8 77.4 39.4 39 31.1 95.5

After:
IDNum Sex Age Ethnicity Pulse BPSyst1 BPDias1 Height Weight LegLength ArmLength ArmCirc Waist
41475 2 752 5 66 128 64 154.7 138.9 34.2 37.6 45.2 156.3
41477 1 860 3 78 144 60 167.1 83.9 32.4 38.2 34.1 109.5
41479 1 630 1 62 112 70 154.4 65.7 33.3 34.1 33.2 95.4
41481 1 254 4 52 112 62 182.7 77.9 43.6 43 31 79.5
41482 1 779 1 76 116 78 173.8 101.6 43.5 40 32.8 117
41483 1 804 4 60 110 62 173.8 133.1 36.5 38.9 40.5 145.6
41485 2 361 2 68 108 44 157.9 64.8 34.3 32.6 30.7 89.7
41486 2 735 1 62 126 64 166.2 86.2 35 36 35.3 97
41487 1 332 5 74 120 84 169.2 67.1 40 37 29.9 82.9
41489 2 483 1 58 106 66 158.4 91.8 37 34 37.8 109.2
41490 2 803 4 72 128 64 159.8 70.7 40.5 37 31.2 98.4
41492 1 866 3 62 156 94 168.7 82.6 34 39 34 106.2
41493 2 935 3 68 106 42 163.4 53.3 37.1 36.6 21.9 78
41494 1 481 1 82 120 78 166.4 70.5 39 36.4 32 95.6
41495 1 740 3 68 102 64 184.8 87.2 41 41.7 32.6 106.9
41496 2 777 1 78 192 86 154.6 67.8 33.5 35.2 33.3 100.9
41498 1 826 4 60 226 96 165.8 77.4 39.4 39 31.1 95.5
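Renaming can be sketched on a three-variable stand-in. The `Description` line is an extra illustration of another property and is not shown on the slide.

```matlab
% Stand-in data with three of the original variable names
SEQN = [41475; 41477]; RIAGENDR = [2; 1]; BPXPLS = [66; 78];
data = dataset(SEQN, RIAGENDR, BPXPLS);

oldNames = data.Properties.VarNames;     % cell array of the current names
NewNames = {'IDNum', 'Sex', 'Pulse'};    % one new name per variable, in order
data.Properties.VarNames = NewNames;

% Other metadata lives in the same structure, for example a description
data.Properties.Description = 'Subset of NHANES measurements';
```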
Categorical Data
>> data.Sex = ...
nominal(data.Sex,{'M','F'});

Before:
IDNum Sex Age Ethnicity Pulse BPSyst1
41475 2 752 5 66 128
41477 1 860 3 78 144
41479 1 630 1 62 112
41481 1 254 4 52 112
41482 1 779 1 76 116
41483 1 804 4 60 110
41485 2 361 2 68 108
41486 2 735 1 62 126
41487 1 332 5 74 120
41489 2 483 1 58 106
41490 2 803 4 72 128
41492 1 866 3 62 156
41493 2 935 3 68 106
41494 1 481 1 82 120
41495 1 740 3 68 102
41496 2 777 1 78 192
41498 1 826 4 60 226

After:
IDNum Sex Age Ethnicity Pulse BPSyst1
41475 F 752 Other 66 128
41477 M 860 White 78 144
41479 M 630 MexAm 62 112
41481 M 254 Black 52 112
41482 M 779 MexAm 76 116
41483 M 804 Black 60 110
41485 F 361 Hispanic 68 108
41486 F 735 MexAm 62 126
41487 M 332 Other 74 120
41489 F 483 MexAm 58 106
41490 F 803 Black 72 128
41492 M 866 White 62 156
41493 F 935 White 68 106
41494 M 481 MexAm 82 120
41495 M 740 White 68 102
41496 F 777 MexAm 78 192
41498 M 826 Black 60 226
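A minimal sketch of the conversion on five observations from the tables (SEQN 41475, 41477, 41479, 41481, 41485). The ethnicity label order is inferred from the before and after tables.

```matlab
% Numeric codes for Sex and Ethnicity
Sex       = [2; 1; 1; 1; 2];
Ethnicity = [5; 3; 1; 4; 2];

% Labels are assigned to the sorted unique codes: 1 -> 'M', 2 -> 'F'
SexCat = nominal(Sex, {'M','F'});
EthCat = nominal(Ethnicity, {'MexAm','Hispanic','White','Black','Other'});

labels = getlabels(SexCat);      % level labels
counts = levelcounts(SexCat);    % number of observations in each level
isF    = (SexCat == 'F');        % logical comparison against a label
```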
Dealing with Missing Data
LegLength ArmLength ArmCirc Waist Triceps Subscapular ReportHeight ReportWeight HighBP Overweight LikeToWeigh
34.2 37.6 45.2 156.3 62 308 Y Over Less
32.4 38.2 34.1 109.5 15.6 25.8 68 190 Y Over Same
33.3 34.1 33.2 95.4 13.3 25.9 67 142 N Right Same
43.6 43 31 79.5 8.4 9.2 73 175 N Right Same
43.5 40 32.8 117 18 20.1 68 228 Y Over Less
36.5 38.9 40.5 145.6 17.6 71 289 Y Over Less
34.3 32.6 30.7 89.7 21.1 30.5 63 138 N Right Same
35 36 35.3 97 24.6 24.8 64 187 Y Over Less
40 37 29.9 82.9 13.8 20 68 150 Y Right Same
37 34 37.8 109.2 34.5 27.4 63 200 N Over Less
40.5 37 31.2 98.4 34.8 25.5 63 140 N Over Less
34 39 34 106.2 24.6 24 67 180 Y Right Same
37.1 36.6 21.9 78 8 4.6 66 120 N Right Same
39 36.4 32 95.6 12 21.6 66 165 N Right Less
41 41.7 32.6 106.9 12.4 17.4 73 202 N Over Less
33.5 35.2 33.3 100.9 24.2 11 9999 9999 N Under More
39.4 39 31.1 95.5 11.6 64 180 Y Right Same
63 141 N Right Less
41 39.2 37.2 132 18 69 250 Y Over Less
40.4 38.5 36.6 95.4 11.2 15.1 67 185 N Over Less
40.1 36.9 30.6 85.9 13.2 10.7 66 165 N Right Same
41.7 40 29 95 10.4 18 71 166 Y Right More
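The three strategies (ignore, replace, delete) can be sketched on a few ReportHeight values from the table. Treating 9999 as a placeholder code for a missing response is an assumption here; the slide does not state it.

```matlab
% Reported heights; 9999 is assumed to be a placeholder, not a real height
ReportHeight = [62; 68; 67; 73; 9999];

% Replace: turn the placeholder into NaN so it is treated as missing
ReportHeight(ReportHeight == 9999) = NaN;

% Ignore: NaN-aware statistics skip the missing values
mIgnore = nanmean(ReportHeight);

% Delete: keep only the observations that are present
complete = ReportHeight(~isnan(ReportHeight));
mDelete  = mean(complete);
```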
Outline
• Plotting
• Measures of center, spread, and shape
• Correlations
• Grouped data
[Figure: scatter plot of Weight [kg] vs. Height [cm], grouped by sex (M, F)]
Chapter 3 Learning Outcomes
The attendee will be able to:
• Calculate measures of central tendency, spread, and shape.
• Visualize data using an appropriate statistical plot.
• Calculate linear correlation between variables.
Basic Plotting
[Figure: scatter plot of Weight [kg] vs. Height [cm] with labeled points 44184, 46871, 49462, 51524, 48689]
[Figure: scatter-plot matrix of several variables]
[Figure: histogram of Weight [kg]; vertical axis: Number]
Histograms
>> hist(data,bins)
>> n = hist(data,bins)
>> n = histc(data,bins)
>> bar(n)
[Figure: histogram with bins from 120 to 220]
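The difference between the two counting functions is how `bins` is interpreted. A small sketch with invented data:

```matlab
x = [1 2 2 3 3 3];            % illustrative data

centers = [1 2 3];
nc = hist(x, centers);        % bins are given by their centers

edges = [1 2 3 4];
ne = histc(x, edges);         % bins are given by their edges; the last
                              % entry counts values equal to edges(end)

figure('Visible', 'off');     % hidden figure so the sketch can run unattended
bar(ne);                      % draw the counts returned by histc
```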
Measures of Centrality
mean, median, mode (or hist + max), geomean, harmmean, trimmean
[Figure: histogram with mean, median, mode, and geomean marked]
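A quick sketch of these functions on a small invented sample with one large value, which shows how differently the measures react to it:

```matlab
x = [1 2 2 4 16];          % illustrative data with one large value

m  = mean(x);              % arithmetic mean, pulled up by the large value
md = median(x);            % middle value
mo = mode(x);              % most frequent value
g  = geomean(x);           % geometric mean: prod(x)^(1/n)
h  = harmmean(x);          % harmonic mean: n / sum(1./x)
t  = trimmean(x, 40);      % mean after dropping the lowest and highest 20%
```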
Measures of Spread
range, min, max, iqr, mad, std, var, moment, quantile
[Figure: histogram with range, iqr, and std marked]
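The same kind of sketch for the spread functions, again on invented data:

```matlab
x = [2 4 4 4 5 5 7 9];     % illustrative data, mean 5

r  = range(x);             % max(x) - min(x)
s  = std(x);               % sample standard deviation (divides by n-1)
s1 = std(x, 1);            % population form (divides by n)
v  = var(x);               % sample variance
d  = mad(x);               % mean absolute deviation from the mean
q  = iqr(x);               % interquartile range
q2 = quantile(x, 0.5);     % the 0.5 quantile is the median
```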
Measures of Shape
skewness, kurtosis, moment, quantile, prctile
[Figure: two histograms with skewness values 1.1575 and -0.0278]
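A small sketch of the shape functions. The two samples are invented: one symmetric, one with a long right tail.

```matlab
xs = [1 2 3 4 5];          % symmetric sample
xr = [1 1 1 2 10];         % sample with a long right tail

sk0 = skewness(xs);        % zero for a symmetric sample
sk1 = skewness(xr);        % positive for a right-skewed sample
k   = kurtosis(xs);        % fourth central moment divided by variance squared
m3  = moment(xs, 3);       % third central moment
p50 = prctile(xs, 50);     % 50th percentile, equal to the median
```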
Correlations
scatter, plotmatrix, gname
[Figure: scatter plot of Weight [kg] vs. Height [cm] with points labeled 44184, 46871, 49462, 48689, 51524]
[Figure: scatter-plot matrix of the variables]
>> corr(X)
1.0000 0.4480 0.7442 0.7969 0.2393 0.1714
0.4480 1.0000 0.2456 0.5932 0.8930 0.8844
0.7442 0.2456 1.0000 0.6071 0.0995 -0.0401
0.7969 0.5932 0.6071 1.0000 0.4558 0.4066
0.2393 0.8930 0.0995 0.4558 1.0000 0.8176
0.1714 0.8844 -0.0401 0.4066 0.8176 1.0000
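As a sketch of how to read such a matrix: entry (i,j) is the linear correlation between columns i and j of X. The matrix below is invented so that the expected values are obvious.

```matlab
% Column 2 is exactly twice column 1; column 3 decreases as column 1 increases
X = [1 2 4;
     2 4 3;
     3 6 2;
     4 8 1];

R = corr(X);               % pairwise linear (Pearson) correlation of the columns

r12 = R(1,2);              % perfectly correlated columns
r13 = R(1,3);              % perfectly anti-correlated columns

figure('Visible', 'off');  % hidden figure so the sketch can run unattended
plotmatrix(X);             % scatter plot of every pair of columns
```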
Grouped Data
>> gscatter(data.Height,data.Weight,data.Sex)
[Figure: scatter plot of Weight [kg] vs. Height [cm], grouped by sex (M, F)]
>> boxplot(data.Height,data.Ethnicity)
[Figure: box plots of Height [cm] by ethnicity (MexAm, Hispanic, White, Black, Other)]
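A minimal sketch of grouped plotting on invented values. The `grpstats` call is an addition that is not on the slide; it gives the per-group means that the plots show visually.

```matlab
% Illustrative values, not the course data
Height = [150; 160; 170; 180];
Weight = [50; 60; 70; 80];
Sex    = nominal({'F'; 'F'; 'M'; 'M'});

means = grpstats(Height, Sex);        % mean height for each level of Sex

figure('Visible', 'off');             % hidden figures so the sketch runs unattended
gscatter(Height, Weight, Sex);        % scatter plot, one color per group
figure('Visible', 'off');
boxplot(Height, Sex);                 % one box per group
```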
Outline
• Density functions
• Probability distributions
• Distribution parameters
• Comparing and fitting distributions
[Figure: probability density of Weight [kg] for parameter values 3.5, 4.0, 4.5, 5.0]
Chapter 4 Learning Outcomes
The attendee will be able to:
• Visually compare the distribution of a data set to standard probability distributions.
• Determine distribution parameters for a data set, assuming a given parametric probability distribution.
Density Functions
PDF: hist
CDF: ecdf, cdfplot
ICDF: the CDF with x and y exchanged
Bins: 0 <= W < 100, 100 <= W < 120, 120 <= W < ...
[Figure: PDF, Percentage vs. Weight [kg]]
[Figure: CDF, Percentage vs. Weight [kg]]
[Figure: ICDF, Weight [kg] vs. Percentage]
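A sketch of the empirical CDF and its inverse on a few invented weights. The lookup in the last line is one simple way to read the inverse off the step function and is an assumption of this sketch.

```matlab
W = [55 60 60 72 80 95];        % illustrative weights in kg

[f, x] = ecdf(W);               % f(i) is the fraction of values <= x(i)

figure('Visible', 'off');       % hidden figure so the sketch runs unattended
cdfplot(W);                     % stairs plot of the same function

% Inverse lookup: the smallest weight whose cumulative fraction reaches 0.4
w40 = x(find(f >= 0.4, 1));
```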
Probability Distributions
Beta, Binomial, Chi-Square, Noncentral Chi-Square, Exponential, Extreme Value, Generalized Extreme Value, F, Noncentral F, Gamma, Generalized Pareto, Geometric, Hypergeometric, Lognormal, Negative Binomial, Normal, Poisson, Rayleigh, Student's t, Noncentral t, Uniform (Continuous), Uniform (Discrete), Weibull
PDF: lognpdf
CDF: logncdf
Inverse CDF: logninv
[Figure: lognormal PDF, Probability vs. Weight [kg]]
[Figure: Empirical CDF with logncdf, Probability vs. Weight [kg]]
[Figure: inverse CDF with logninv, Weight [kg] vs. Probability]
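The three lognormal functions fit together as sketched below. The parameter values are picked from the ranges shown on the slides and are otherwise arbitrary.

```matlab
mu = 4; sigma = 0.25;               % illustrative lognormal parameters

w = 1:0.5:400;                      % weights in kg
p = lognpdf(w, mu, sigma);          % density at each weight
area = trapz(w, p);                 % numerical integral of the density

c  = logncdf(exp(mu), mu, sigma);   % exp(mu) is the median of a lognormal
w0 = logninv(0.5, mu, sigma);       % inverse CDF at probability 0.5

% logninv undoes logncdf
back = logninv(logncdf(60, mu, sigma), mu, sigma);
```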
Distribution Parameters
>> disttool
Lognormal density:
$P(W) = \frac{1}{W \sigma \sqrt{2\pi}} \, e^{-\frac{(\ln W - \mu)^2}{2\sigma^2}}$
[Figure: Probability vs. Weight [kg] for $\mu$ = 3.5, 4.0, 4.5, 5.0]
[Figure: Probability vs. Weight [kg] for $\sigma$ = 0.20, 0.25, 0.30, 0.35]
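As a check on how the two parameters enter, the lognormal density formula can be evaluated by hand and compared with `lognpdf`. The parameter values are two of those shown in the figures.

```matlab
mu = 4.5; sigma = 0.25;             % one of the plotted parameter combinations
W = 20:20:200;                      % weights in kg

% Lognormal density written out term by term
Pmanual = 1 ./ (W * sigma * sqrt(2*pi)) .* exp(-(log(W) - mu).^2 / (2*sigma^2));

% The same density from the toolbox function
Ptoolbox = lognpdf(W, mu, sigma);

maxDiff = max(abs(Pmanual - Ptoolbox));
```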
Comparing Distributions
Plots: probplot, normplot, qqplot
Tests: jbtest, lillietest
[Figure: Probability plot for Normal distribution; axes: Data, Probability]
[Figure: Probability plot for Lognormal distribution; axes: Data, Probability]
[Figure: Probability plot for Rayleigh distribution; axes: Data, Probability]
[Figure: Probability plot for Exponential distribution; axes: Data, Probability]
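A sketch of the comparison on simulated data. The sample is drawn from a lognormal distribution with made-up parameters, so the raw values should fail a normality test while their logarithms should look normal.

```matlab
rng(0);                                  % fixed seed for repeatability
x = lognrnd(4.35, 0.25, 500, 1);         % simulated "weights", not course data

figure('Visible', 'off');                % hidden figures so the sketch runs unattended
probplot('normal', x);                   % points bend away from the line
figure('Visible', 'off');
probplot('lognormal', x);                % points follow the line

[hRaw, pRaw] = jbtest(x);                % Jarque-Bera test on the raw data
[hLog, pLog] = jbtest(log(x));           % same test on the logarithms
```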
Fitting Distributions
mle, lognfit
>> dfittool
Fitted lognormal density:
$P(W) = \frac{1}{0.2489 \, W \sqrt{2\pi}} \, e^{-\frac{(\ln W - 4.3536)^2}{2 \cdot 0.2489^2}}$
[Figure: histogram of Weight [kg] with the fitted density; vertical axis: Probability]
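A minimal sketch of the fit on simulated data. The sample is generated from the fitted values on the slide (4.3536 and 0.2489), so the estimates should land close to them; the course data themselves are not used here.

```matlab
rng(1);                                      % fixed seed for repeatability
W = lognrnd(4.3536, 0.2489, 2000, 1);        % simulated weights

parmhat = lognfit(W);                        % [mu sigma] estimates
phat    = mle(W, 'distribution', 'lognormal');   % generic maximum likelihood fit

% lognfit works on log(W): mu is its mean, sigma its sample standard deviation
muCheck    = mean(log(W));
sigmaCheck = std(log(W));
```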
Outline
• Linear regression models
• Fitting linear models to data
• Evaluating the fit
• Adjusting the model
[Figure: Pulse Pressure [mmHg] vs. Age [yr]]
[Figure: Plot of residuals vs. fitted values; axes: Fitted values, Residuals]
Chapter 7 Learning Outcomes
The attendee will be able to:
• Perform a multiple linear regression.
• Increase robustness of a regression.
• Perform a nonlinear or generalized linear regression.
Linear Regression Models
$y = \sum_k c_k f_k(x)$
y: response; x: predictor; $c_k$: model parameters; $f_k$: design functions
Examples:
$W = mA + c$
[Figure: Waist Circumference [cm] vs. Arm Circumference [cm]]
$P = c_0 + c_1 A + c_2 A^2$
[Figure: Pulse Pressure [mmHg] vs. Age [yr]]
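"Linear" here means linear in the parameters, not in the predictor. As a sketch, a quadratic model can be fitted by putting the design functions 1, A, and A^2 into the columns of a matrix and solving a linear least-squares problem. The coefficient values below are invented.

```matlab
A = (10:10:80)';                        % predictor (illustrative ages)
cTrue = [20; 0.5; 0.01];                % made-up parameters c0, c1, c2
P = cTrue(1) + cTrue(2)*A + cTrue(3)*A.^2;   % noise-free response

% Each column of X is one design function evaluated at every observation
X = [ones(size(A)), A, A.^2];

% Least-squares solution of X*c = P
cHat = X \ P;
```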
LinearModel Objects
circfit = LinearModel.fit(x,y)
plot(circfit)
[Figure: y vs. x1 with Data, Fit, and Confidence bounds; Waist Circumference [cm] vs. Arm Circumference [cm]]
circfit.Coefficients
              Estimate   SE        tStat    pValue
(Intercept)   12.294     0.80999   15.178   5.1775e-51
x1            2.6065     0.02445   106.61   0
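A minimal sketch of the same workflow on simulated data. The intercept and slope are set near the fitted values in the table, and the noise level is invented.

```matlab
rng(0);                                   % fixed seed for repeatability
x = linspace(15, 60, 100)';               % simulated arm circumference [cm]
y = 12 + 2.6*x + 5*randn(100, 1);         % simulated waist circumference [cm]

circfit = LinearModel.fit(x, y);          % straight-line model with intercept

est = circfit.Coefficients.Estimate;      % [intercept; slope]
se  = circfit.Coefficients.SE;            % standard errors of the estimates

figure('Visible', 'off');                 % hidden figure so the sketch runs unattended
plot(circfit);                            % data, fit, and confidence bounds
```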
Evaluating Goodness of Fit
Diagnostic properties
circfit.Rsquared
circfit.Diagnostics.CooksDistance
[h,p] = jbtest(circfit.Residuals.Raw)
Diagnostic plot methods
plotResiduals(circfit,'fitted')
plotDiagnostics(circfit,'cookd')
[Figure: Plot of residuals vs. fitted values; axes: Fitted values, Residuals]
[Figure: Case order plot of Cook's distance; axes: Row number, Cook's distance]
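The diagnostic properties and plots can be sketched on a simulated fit. The data below are invented and stand in for the course data.

```matlab
rng(0);                                   % fixed seed for repeatability
x = linspace(15, 60, 100)';
y = 12 + 2.6*x + 5*randn(100, 1);         % simulated data
circfit = LinearModel.fit(x, y);

r2 = circfit.Rsquared.Ordinary;           % fraction of variance explained
cd = circfit.Diagnostics.CooksDistance;   % influence of each observation
[h, p] = jbtest(circfit.Residuals.Raw);   % normality test on the residuals

figure('Visible', 'off');                 % hidden figures so the sketch runs unattended
plotResiduals(circfit, 'fitted');         % residuals vs. fitted values
figure('Visible', 'off');
plotDiagnostics(circfit, 'cookd');        % Cook's distance by row number
```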
Higher-Order Models
$P = c_0 + c_1 A + c_2 A^2$
P ~ A^2
'y ~ x1^2'
[Figure: y vs. x1 with Data, Fit, and Confidence bounds]
$W = c_0 + c_1 H + c_2 C + c_3 H^2 + c_4 HC + c_5 C^2$
W ~ H^2 + C^2 + H:C
'y ~ x1^2 + x2^2 + x1:x2'
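In the formula syntax a squared term brings its lower-order terms with it, which the sketch below makes visible by counting coefficients. The data and coefficient values are simulated.

```matlab
rng(0);                                        % fixed seed for repeatability
A = linspace(10, 80, 60)';
P = 40 - 0.5*A + 0.012*A.^2 + randn(60, 1);    % simulated quadratic response

quadfit = LinearModel.fit(A, P, 'y ~ x1^2');   % intercept, x1, and x1^2
names1  = quadfit.CoefficientNames;

% Two predictors: squares plus the interaction term x1:x2
X = [randn(60, 1), randn(60, 1)];
W = 1 + X(:,1) + X(:,2) + X(:,1).*X(:,2) + 0.1*randn(60, 1);
fullfit = LinearModel.fit(X, W, 'y ~ x1^2 + x2^2 + x1:x2');
nCoef   = fullfit.NumCoefficients;
```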
Ignoring Outliers
[Figure: Pulse Pressure [mmHg] vs. Age [yr] with Data, Fit, and Confidence bounds]
[Figure: Histogram of residuals]
[Figure: Cook's distance vs. Row number]
weighting
LinearModel.fit(...,'RobustOpts','on')
[Figure: Pulse Pressure [mmHg] vs. Age [yr], points shaded by robust weight]
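A sketch of the effect of robust fitting on simulated data. Five observations are shifted upward by hand to act as outliers; all numbers are invented.

```matlab
rng(0);                                    % fixed seed for repeatability
x = (1:50)';
y = 3 + 2*x + 0.5*randn(50, 1);            % true intercept 3, true slope 2
y([5 15 25 35 45]) = y([5 15 25 35 45]) + 80;   % five artificial outliers

ols = LinearModel.fit(x, y);                        % ordinary least squares
rob = LinearModel.fit(x, y, 'RobustOpts', 'on');    % outliers get small weights

bOls = ols.Coefficients.Estimate;          % intercept pulled up by the outliers
bRob = rob.Coefficients.Estimate;          % intercept stays near the true value
```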
Adjusting the Model
$W = c_0 + c_1 H + c_2 C + c_3 H^2 + c_4 HC + c_5 C^2$
Analysis of Variance
$W = c_0 + c_1 H + c_2 C + c_3 H^2 + c_4 HC + c_5 C^2$
LinearModel.stepwise
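A minimal sketch of stepwise term selection followed by an analysis-of-variance table for the resulting model. The data are simulated so that only the first predictor matters; the starting model and the `'Upper'` limit are choices made for this sketch.

```matlab
rng(0);                                    % fixed seed for repeatability
X = randn(100, 2);                         % two candidate predictors
y = 1 + 2*X(:,1) + 0.5*randn(100, 1);      % only x1 affects the response

% Start from a constant model and allow terms up to a full linear model
mdl = LinearModel.stepwise(X, y, 'constant', 'Upper', 'linear', 'Verbose', 0);

terms = mdl.CoefficientNames;              % terms kept by the procedure
tbl   = anova(mdl);                        % ANOVA table for the final model
```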
Outline
• Multiple comparisons
• 1-Way ANOVA
• N-Way ANOVA
• MANOVA
• Nonnormal ANOVA
• Categorical correlations
[Figure: distributions of weight [kg] for the groups Over, Under, Right]
[Figure: box plots of Height [cm] for the groups Over, Under, Right]
Chapter 6 Learning Outcomes
The attendee will be able to:
• Determine if groups within a data set have significantly different sample means.
• Perform multiple pairwise tests between groups.
Multiple Comparisons
[Figure: distributions and box plots of Height [cm] for six groups: M Over, M Under, M Right, F Over, F Under, F Right]
1-Way ANOVA
>> p = anova1(data.Height,data.Overweight);
H0: groups all have same mean
H1: no, they don’t
[Figure: box plots of Height [cm] for the groups Over, Under, Right]
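A sketch of the test on a handful of invented heights. The `'off'` argument suppresses the table and box plot that `anova1` would otherwise display.

```matlab
% Illustrative heights in cm, four per group
Height = [160 162 164 166, 150 152 154 156, 161 163 165 167]';
Overweight = [repmat({'Over'}, 4, 1); repmat({'Under'}, 4, 1); repmat({'Right'}, 4, 1)];

% p is the probability of group means this different if H0 were true
p = anova1(Height, Overweight, 'off');

rejectH0 = p < 0.05;        % small p: at least one group mean differs
```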
Who’s the Odd One Out?
>> [p,tbl,stats] = anova1(data.Height,data.Overweight);
>> cmat = multcompare(stats)
cmat =
1.0000 2.0000 -2.8592 -1.5090 -0.1588
1.0000 3.0000 -1.9249 -1.2744 -0.6239
2.0000 3.0000 -1.1294 0.2346 1.5987
$\mu_1 - \mu_3 = -1.2744$, 95% CI: [-1.92, -0.62]
[Figure: box plots of Height [cm] for the groups Over, Under, Right]
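Each row of the comparison matrix names two groups, followed by the lower bound, estimate, and upper bound of the difference in their means. A pair differs significantly when the interval excludes zero. A sketch on invented heights:

```matlab
% Illustrative heights in cm; groups appear in the order Over, Under, Right
Height = [160 162 164 166, 150 152 154 156, 161 163 165 167]';
Overweight = [repmat({'Over'}, 4, 1); repmat({'Under'}, 4, 1); repmat({'Right'}, 4, 1)];

[p, tbl, stats] = anova1(Height, Overweight, 'off');
cmat = multcompare(stats, 'display', 'off');

% Columns 3 and 5 hold the confidence bounds for each pairwise difference
excludesZero = (cmat(:,3) > 0) | (cmat(:,5) < 0);
```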
N-Way ANOVA
multiple factors
>> p = anovan(data.Height,{data.Overweight,data.LikeToWeigh},...
'model','interaction')
[Figure: box plots of Height [cm] for each combination of Overweight (Over, Under, Right) and LikeToWeigh (Less, More, Same)]
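With two factors and an interaction model, `anovan` returns one p-value per term: factor 1, factor 2, and their interaction. A sketch on simulated heights in which only the first factor has an effect:

```matlab
rng(0);                                               % fixed seed for repeatability
g1 = repmat({'Over'; 'Right'}, 10, 1);                % first factor, 20 observations
g2 = repmat({'Less'; 'Less'; 'Same'; 'Same'}, 5, 1);  % second factor

% Simulated heights: 5 cm higher in the 'Over' group, no other effects
Height = 165 + 5*strcmp(g1, 'Over') + randn(20, 1);

p = anovan(Height, {g1, g2}, 'model', 'interaction', 'display', 'off');
```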
Multiple Responses
Grouping: size (self-view)
Responses (“heart function”): pulse, BP
>> [d,p] = manova1([data.Pulse,data.BPSyst1],data.Overweight)
[Figure: Would like to weigh more/less/the same; Systolic BP [mmHg] vs. Pulse [bpm]; groups: Less, More, Same]
[Figure: Self-view of weight -- wish to weigh more; Systolic BP [mmHg] vs. Pulse [bpm]; groups: Under, Over, Right]
[Figure: Self-view of weight -- excluding overweight; Systolic BP [mmHg] vs. Pulse [bpm]; groups: Under, Right]
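A sketch of `manova1` on simulated responses. The first output estimates the dimension of the space in which the group means differ, and the second holds the p-values for testing each candidate dimension. Here the groups differ only in the first response, and all values are invented.

```matlab
rng(0);                                        % fixed seed for repeatability
g = [ones(20,1); 2*ones(20,1); 3*ones(20,1)];  % three groups of 20

Pulse  = 70 + 5*g + randn(60, 1);              % group means 75, 80, 85
BPSyst = 120 + randn(60, 1);                   % same mean in every group

% d = 0: means equal; d = 1: means on a line; d = 2: means span a plane
[d, p] = manova1([Pulse, BPSyst], g);
```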