3.2 Auto-Poisson Method
3.2.1 Auto-Poisson model
國
立 政 治 大 學
‧
N a tio na
l C h engchi U ni ve rs it y
3.2. AUTO-POISSON METHOD 39
in Subsection 3.3). (3) Randomly permute r to obtain new residuals r∗. (4) Let Z(s) = Pvr∗+ ˆθEM be the new data set, and then execute the EM-Scan method procedure to obtain the value Fs.
Suppose the simulations are executed for G times (99 or 999). The actual F scores are compared with these simulated maximum values. Thus, the simulated p-values are obtained as
P -value = #{Fs >= Fobs}Gs=1+ 1
G + 1 .
The elective clusters are significant if the p-value ≤ 0.05. It should be noted that the significant clusters are possibly overlapped. We will only report the clusters with the largest F value while they are overlapped.
3.2 Frequentist Method for Cluster Detec-tion: Auto-Poisson Model
In previous discussions, we focused on data with Gaussian distributions. In order to generalize the methodologies, the auto-Poisson model is applied to be an illustration in this section.
3.2.1 Auto-Poisson model
Based on the Hammersley–Clifford theorem (Hammersley & Clifford, 1971) and the factorization theorem (Besag, 1974), the distributions of exponential family defined in Markov random field can be generalized to follow the form of the CAR models. In this section, the Poisson distribution is illustrated
‧
how we adopt the similar testing procedures like what we did in the Gaussian CAR case to estimate global dependence and to figure out local clusters.
A generalized CAR model can be expressed as P r(z) = exp(Q(z))/X
y∈ζ
exp(Q(y)), z ∈ ζ, (3.17)
where Q(z), which satisfies the factorization theorem, is expressed as Q(z) = log{P r(z)/P r(0)},
and ζ means all possible values of Z. For the auto-Poisson case, when the conditional distribution is defined as
P r(z(si)|{z(sj) : j 6= i}) = exp(−λi)(λi)z(si)/z(si)!, (3.18)
δij = φwij, wij is the neighborhood information matrix, and φ is the spatial dependence parameter. It should be noted that we use φ to be a different notation as the spatial dependence in the auto-Poisson model compared with using ρ in the Gaussian model. If we only consider the pair-wise dependence structure, the Q(z) and P r(z) can be respectively expressed as
Q(z) =
‧
Although the joint likelihood can be expressed in the above form. The normalized constant C(Θ) = P
y∈ζexp(Q(y)), in which Θ = (θ, φ) means the set of parameters, is intractable and the closed form of the distribution is not available. For this reason, we resort this problem to the pseudo-likelihood instead of the original likelihood, like what we have done for the Gaussian data. The negative log pseudo-likelihood can be expressed as
P L(Θ) = population size ni and overall disease rate θ. Since the parameters of the pseudo-likelihood have a nonlinear form, the numerical estimating methods are applied to solve such a problem, such as the Newton–Raphson method.
3.2.2 EM algorithm
We have found in the previous sections that the local clusters/outliers are devastatingly affected the estimate of global dependence. The EM algorithm is adopted according to the same consideration.
Suppose the data Z can be separated in two sets; one is the set of known observations X and the other one is the set of missing values Y . Conditional on X, the λi for each Y can be expressed as
‧
j∈Xδijz(sj), the known part of observations.
To obtain the conditional density function, the Q(z) function described before is adopted again. The conditional density function of Y given X can be constructed as
Since it is difficult to compute the intractable constant term of the auto-Poisson likelihood, we resort to estimate the parameters by the pseudo-likelihood (equation (3.21)). Based on the EM algorithm, the expectation of conditional pseudo-likelihood is needed to know, that is,
EY |X;Θ(t)(P L(Z; Θ)) = EY |X;Θ(t)( where Θ(t) is the set of currently estimated parameters. However, this ex-pectation could not be directly computed because the conditional density function does not have a closed form. The Markov Chain Monte Carlo (MCMC) method is applied to approximate this expectation. In this thesis, we use the Metropolis–Hastings algorithm (Appendix D) (Hastings, 1970) to generate missing/unknown variables. To apply the Metropolis–Hastings algorithm, we only have to know the joint likelihood of interested variables.
The method does not need the whole likelihood, and only the structure of
‧
variables without considering the exact form of constant term is needed in-stead. the “metrop” function of “mcmc” package in R is applied to execute the Metropolis–Hastings algorithm. Here, 2000 data sets are generated and only the last 1000 data sets are retained to estimate the expectation, that is, the first 1000 data are burned.
Suppose the generated missing data are denoted as Y(g) with each gen-eration. The conditional expectation of pseudo-likelihood can be estimated by param-eter set. Thus, an estimation of E-step is complete.
M-step:
When the expectation is obtained, the M-step is proceeded, that is, to find the estimates by maximizing the expectation (i.e., minimizing the negative log pseudo-likelihood), which can be expressed as
Θ = arg minˆ Θ{EY |X;Θ(t)\(P L(Z; Θ))}
Since the parameters in the expectation are not linear, we solve the mini-mization problem by the numerical method, such as “Nelder–Mead” method or “Newton–Raphson” method. The “optim” function in R is applied to solve this problem in this study. By few iterative steps, convergent estimates will be obtained.
‧
Unlike what we did in the Gaussian model, the generalized least square method is not applicable for the auto-Poisson model. Instead, the notion of likelihood ratio which is analog to that of the SaTScan is applied to the auto-Poisson model. Given the estimated global dependence, ˆφEM, the test statistic is
Tobs = supz∈Z;θc>θ0P L(Z, θc, θ0| ˆφEM)
supθc=θ0P L(Z, θc, θ0| ˆφEM) = P L( ˆZ) P L0
, (3.24)
where the ˆZ is the selected region which maximizes L, and L0 is the likelihood function under p = q, in which p is the incidence inside the selected region Z and q is the one outside the selected region Z; that is, there is no cluster.
Given that Zc is the selected testing region and Z0 is the complement of Zc, the θc and θ0 are estimated as Based on these estimates, the testing statistic Tobs is easily computed.
3.2.4 Monte Carlo testing procedure
Similar to Section 3.4, the Monte Carlo procedure is applied. First, the stan-dardized residuals have to be computed based on the conditional expectation values, that is,
ri = zi − E(Zi|Zj∀j 6= i; ˆΘEM) q
E(Zi|Zj∀j 6= i; ˆΘEM)
. (3.27)