Note: Higher values represent higher tensions (i.e., more conflictive events); lower values represent lower tensions (i.e., more cooperative events). Training data include the first 80% of months; test data include the most recent 20% of months.
Figure 25: Training and test datasets showing monthly average South China Sea tensions from March 2015 to November 2017 (based on GDELT 2.0 GKG)
Note: Higher values represent higher tensions (i.e., more negative tone); lower values represent lower tensions (i.e., more positive tone). Training data include the first 80% of months; test data include the most recent 20% of months.
3.4.3 Benchmark Models
For the purposes of comparison with the four forecast models introduced in {3.4.4 Forecast Models}, four benchmark models are used in the analyses: a random
benchmark model, a fixed benchmark model, a linear benchmark model, and an average benchmark model. The following four subsections explain the approach, historical knowledge, assumptions, fundamental logic, and mathematical notation for each of these benchmark models. They also visualize the predictions that each model generates for past time periods so that readers can more readily understand the models, and they refer to the relevant code contained in the appendices.
Random Benchmark Model
The random benchmark model represents predictions based on zero knowledge of the historical levels of tensions except for the possible range of values for the prediction.[141] The assumption is made that tensions in the following month will not fall outside the bounds of the historical minimum and maximum levels of tensions. For analyses using the GDELT 1.0 Event Database, the historical range of South China Sea tensions (Tensions) is from -3.019 (lower tensions; more cooperative) to +5.722 (higher tensions; more conflictive). For analyses using the GDELT 2.0 GKG, the historical range of South China Sea tensions (Tensions) is from 0.723 (lower tensions; more positive tone) to 2.137 (higher tensions; more negative tone).
Although the random benchmark model is simplistic, its logic is easy to understand. Random predictions made with knowledge of the range of possible values are clearly more likely to be accurate than random predictions made without it. For example, if one were asked to predict the number of attendees at an upcoming regular event and were given the historical range (i.e., the minimum and maximum numbers of attendees at all previous events), forecasting within this range would be a logical approach.
[141] The random benchmark model has zero knowledge of data for specific months, but it is provided with the historical range.
In mathematical notation, the statement $R \sim U([y_{min}, y_{max}])$ means that $R$ is a random number from a uniform distribution between the historical minimum level of tensions $y_{min}$ and maximum level of tensions $y_{max}$. Thus, a random prediction of the expected tensions $\hat{y}$ at time $t+1$ can be made using the following equation:
$\hat{y}_{t+1} = R \sim U([y_{min}, y_{max}])$
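The uniform draw described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code referenced in the appendices; the function name `random_benchmark`, the seed, and the month count are assumptions made for this example, while the range is the GDELT 1.0 range quoted earlier.

```python
import random

def random_benchmark(y_min, y_max, n_months, seed=None):
    """Draw one uniform random prediction per month from [y_min, y_max]."""
    rng = random.Random(seed)  # seeded generator so the sketch is reproducible
    return [rng.uniform(y_min, y_max) for _ in range(n_months)]

# Historical range reported for the GDELT 1.0 Event Database analyses.
Y_MIN, Y_MAX = -3.019, 5.722

# February 2011 to November 2017 inclusive spans 82 months.
predictions = random_benchmark(Y_MIN, Y_MAX, n_months=82, seed=1)
```

Because every draw is bounded by the historical minimum and maximum, no prediction can fall outside the historical range, which is the only knowledge this model is given.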
For analyses using the random benchmark model and based on the GDELT 1.0 Event Database, a series of predictions for each month from February 2011 to November 2017 is made, converted into a time series object, and then plotted. The resulting predictions are shown in Figure 26. All predicted values fall within the historical range of tensions.
Figure 26: Predicted South China Sea tensions by month using random benchmark model (for analyses based on GDELT 1.0 Event Database)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more conflictive events); lower values represent lower tensions (i.e., more cooperative events).
For analyses based on the GDELT 2.0 GKG, a series of predictions for each month from March 2015 to November 2017 is made, converted into a time series
object, and then plotted in the same fashion as the predictions produced above.
Complete code for the data analyses can be found in {Appendix II: GDELT Data Analysis Code for R}. The random benchmark model’s predictions for the level of tensions in each month are shown in Figure 27. It can be seen that the expected values all fall within the historical range of tensions.
Figure 27: Predicted South China Sea tensions by month using random benchmark model (for analyses based on GDELT 2.0 GKG)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more negative tone); lower values represent lower tensions (i.e., more positive tone).
Fixed Benchmark Model
The fixed benchmark model represents a prediction based only on knowledge of the level of tensions from the previous month. It predicts that tensions in a given month will be equal to tensions in the previous month. Although the model is simplistic, its rationale is clear: knowledge of one relevant value enables a more educated forecast than zero knowledge of previous data. For example, if one were asked to predict the number of trees in a given forest or the price of a certain stock and were given the figure from
the previous time period, it would be logical to guess that the value would remain unchanged in the following time period.
In mathematical notation, the fixed model predicts the expected tensions $\hat{y}$ at time $t+1$ based on the observed level of tensions $y$ at time $t$ using the following equation:
$\hat{y}_{t+1} = y_t$
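The fixed model amounts to shifting the observed series forward by one month. The following Python sketch illustrates this under stated assumptions; it is not the R code from the appendices, and the function name `fixed_benchmark` and the sample values are hypothetical.

```python
def fixed_benchmark(observed):
    """Predict each month as the value observed in the previous month.

    The result is aligned with `observed`; the first entry is None because
    no earlier observation exists for the first month.
    """
    if not observed:
        return []
    return [None] + list(observed[:-1])

# Hypothetical monthly tension values, for illustration only.
observed = [1.2, 0.8, 1.5, 1.1]
predictions = fixed_benchmark(observed)
```

The leading `None` mirrors the absence of a prediction for the first month of each series.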
For analyses using the fixed benchmark model and based on the GDELT 1.0 Event Database, a series of predictions for each month from March 2011 to
November 2017 is made, converted into a time series object, and then plotted.
Because there is no data for the preceding month, no prediction for February 2011 is included in the time series.
Figure 28: Predicted South China Sea tensions by month using fixed benchmark model (for analyses based on GDELT 1.0 Event Database)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more conflictive events); lower values represent lower tensions (i.e., more cooperative events).
For analyses based on the GDELT 2.0 GKG, a series of predictions for each month from April 2015 to November 2017 is made, converted into a time series object, and then plotted. Because there is no data for the preceding month, no prediction for March 2015 is included in the time series.
Figure 29: Predicted South China Sea tensions by month using fixed benchmark model (for analyses based on GDELT 2.0 GKG)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more negative tone); lower values represent lower tensions (i.e., more positive tone).
Linear Benchmark Model
The linear benchmark model represents a prediction based on knowledge of the level of tensions from the two previous months. It predicts that tensions in the month to be predicted will follow a linear trend based on the two most recent data points. Because many processes in the real world follow roughly linear trends, informed predictions can often be made based on limited knowledge of historical data. For example, one could make a more educated forecast of the price of a
commodity next month if one knew that commodity’s price this month and last month by inferring that its price would continue to increase or decrease at the same rate as before.
In mathematical notation, the linear model predicts the expected tensions $\hat{y}$ at time $t+1$ based on the observed level of tensions $y$ at times $t$ and $t-1$ using the following equation:
$\hat{y}_{t+1} = (y_t - y_{t-1}) + y_t$
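The linear extrapolation can be illustrated with a short Python sketch. It is offered only as an illustration and is not the R code from the appendices; the function name `linear_benchmark` and the sample values are hypothetical.

```python
def linear_benchmark(observed):
    """Extend the change between the two most recent months by one month.

    The result is aligned with `observed`; the first two entries are None
    because two earlier observations are needed for each prediction.
    """
    preds = [None] * min(2, len(observed))
    for i in range(2, len(observed)):
        y_t, y_t_minus_1 = observed[i - 1], observed[i - 2]
        preds.append((y_t - y_t_minus_1) + y_t)
    return preds

# Hypothetical monthly tension values, for illustration only.
observed = [1.0, 2.0, 4.0, 3.0]
predictions = linear_benchmark(observed)
```

With these sample values, the rise from 1.0 to 2.0 is extended to predict 3.0, and the rise from 2.0 to 4.0 is extended to predict 6.0, so the model overshoots when the observed series reverses direction.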
For analyses using the linear benchmark model and based on the GDELT 1.0 Event Database, a series of predictions for each month from April 2011 to November 2017 is made, converted into a time series object, and then plotted. Because data for the preceding two months is required for forecasting with the model, no predictions for February 2011 or March 2011 are included in the time series.
Figure 30: Predicted South China Sea tensions by month using linear benchmark model (for analyses based on GDELT 1.0 Event Database)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more conflictive events); lower values represent lower tensions (i.e., more cooperative events).
For analyses based on the GDELT 2.0 GKG, a series of predictions for each month from May 2015 to November 2017 is made, converted into a time series object,
and then plotted. Because data for the preceding two months is required for
forecasting with the model, no predictions for March 2015 or April 2015 are included in the time series.
Figure 31: Predicted South China Sea tensions by month using linear benchmark model (for analyses based on GDELT 2.0 GKG)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more negative tone); lower values represent lower tensions (i.e., more positive tone).
Average Benchmark Model
The average benchmark model represents a prediction based on knowledge of all historical levels of tensions from earlier time periods within the given timeframe. It predicts that tensions in a given month will be equal to the average tensions of all previous months. As with the models above, the average benchmark model is simplistic but has a clear rationale. Knowing the historical data and assuming that the value for a given time period is more likely to be near the historical average than to be an unexpected outlier enables a more educated forecast than zero knowledge of previous data. For example, if one were asked to predict the number of students in a given classroom and were given the average number of students from all previous time periods, it would be logical to guess that the number would be near that average in the following time period, assuming that
there was no knowledge of trends or other additional information to take into account.
In mathematical notation, the average benchmark model predicts the expected tensions $\hat{y}$ at time $t+1$ based on the average level of tensions $y$ at times $t$, $t-1$, $t-2$, and so on, back to the earliest known time period, using the following equation:
$\hat{y}_{t+1} = \mathrm{mean}(y_1, y_2, y_3, \ldots, y_{t-2}, y_{t-1}, y_t)$ or $\hat{y}_{t+1} = \mathrm{mean}(y_{t-n}, \ldots, y_{t-2}, y_{t-1}, y_t)$, where $t-n$ denotes the earliest known time period
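The average model is an expanding mean over all months observed so far. The Python sketch below illustrates the idea under stated assumptions; it is not the R code from the appendices, and the function name `average_benchmark` and the sample values are hypothetical.

```python
def average_benchmark(observed):
    """Predict each month as the mean of all earlier observed months.

    The result is aligned with `observed`; the first entry is None because
    no earlier observations exist to average.
    """
    if not observed:
        return []
    preds = [None]
    running_total = 0.0
    for i in range(1, len(observed)):
        running_total += observed[i - 1]  # add the newest known month
        preds.append(running_total / i)   # mean of months 1..i
    return preds

# Hypothetical monthly tension values, for illustration only.
observed = [1.0, 2.0, 3.0, 6.0]
predictions = average_benchmark(observed)
```

Because each prediction averages every earlier month, single outliers move the forecast less and less as the series grows longer.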
For analyses using the average benchmark model and based on the GDELT 1.0 Event Database, a series of predictions for each month from March 2011 to
November 2017 is made, converted into a time series object, and then plotted.
Because there is no data for the preceding month, no prediction for February 2011 is included in the time series.
Figure 32: Predicted South China Sea tensions by month using average benchmark model (for analyses based on GDELT 1.0 Event Database)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more conflictive events); lower values represent lower tensions (i.e., more cooperative events).
For analyses based on the GDELT 2.0 GKG, a series of predictions for each month from April 2015 to November 2017 is made, converted into a time series object, and then plotted. Because there is no data for the preceding month, no prediction for March 2015 is included in the time series.
Figure 33: Predicted South China Sea tensions by month using average benchmark model (for analyses based on GDELT 2.0 GKG)
Note: The solid black line represents predictions based on the model. The dotted black line represents observed tensions. Higher values represent higher tensions (i.e., more negative tone); lower values represent lower tensions (i.e., more positive tone).