Data Understanding - 以資料探勘評估哥倫比亞開放式合約之績效

1. Show the results and its analysis

The data of this research thesis was collected from SECOP reports (Colombian efficient Purchase Web site), which is the e-procurement platform for public contracting, thus, it is open access. The Data Base used in this study contained 5.611.700 recorded of public contracts making through the e-procurement platform in Colombia since 2013 to 2018 and the data of 4658 government entities and 378.784 suppliers registered on the platform. The data bases are downloaded in Microsoft Excel files.

SECOP is a transactional online contracting platform with accounts for Government Entities and Suppliers. Each account has some users associated. From their accounts the Government Entities could create, evaluate, award hiring processes and enter into electronic contracts. Suppliers may submit their interest declaration, observations, offers and sign contracts. The whole process of contractual management is contained in the platform, which means that Entity and Supplier must include evidences about guarantees, invoices, supervision reports, modifications, etc.

The government entities registered consist of municipalities, city halls, hospitals, pensions fund administrators, army, prisons, foundations, public educational institutions, national agencies, etc.

The information about the suppliers contain name, Nit, company type (large, medium-sized or small), industrial sector, location (address), country, city, UNSPSC code of the supplier among others (See figure 10 and 11)

Figure 10: view of the SECOP Government entities database Source: https://www.colombiacompra.gov.co/compradores/

Figure 11: view of the SECOP suppliers database

Source: https://www.datos.gov.co/SECOP-II-Proveedores-Registrados/qmzu-gj57

The public contract data base content the contract identifier ID, government entity, procurement method, contract object, location, budget, publication date, award date, among others (See Fig.12)

Figure 12: view of the SECOP contracts database Source: https://www.contratos.gov.co/SECOP-II

35 3.3 Data Preparation

The results of the mining process are directly proportional to the quality of the data. The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool) from the initial raw data. In the Data Preparation is decided the data that is going to be used for analysis. The criteria implemented to make this decision, the data cleaning criteria and also the transformation of the data.

3.3.1 Data Selection

To select the data is important to assess its relevance to meet the data mining goals. In the case of this study the relevant information was selected according with the main objectives of the open contracting 1) Effectiveness, 2) Efficiency and 3) Transparency.

These objectives and its indicators are described in the table 3.

Table 5: Measures for Open contracting Open

contracting Goal to evaluate in terms of Indicators

Effectiveness No. of bidders or participants for

contract Type of company Adoption of new suppliers

Efficiency Time Time of the selection process

Transparency findings, corrections and savings

Contracts perfected

Modifications request, complaints or claims regarding to a contract

Citizens consultations Source: own elaboration

Thus, the kept fields for each database are the fields related with the indicators and evaluated goals mentioned in the Table 3.:

1) Kept fields e.g. Contract budget, contract scope, publication date, award date, etc.

2) Deleted fields e.g. UNSPSC suppliers code, entities and suppliers name, entities contact information, entities NIT, Number of users per government entity, descriptions fields, contract object, etc.

As a result of this selection the number of columns (fields) are 78% reduced.

3.3.2 Data Cleaning

There are a number of fields that do not contain data and are stored in the database as null or as an empty string, also there is some datasets to have fields that contain unknown or incorrectly entered information. There are a number of methods (Weiss and Indurkhya,1998) for treating records that contain missing values:

 Omit the incorrect field(s)

 Omit the entire record that contains the incorrect field(s)

 Automatically enter/correct the data with default values: for example, select the mean from the range

 Derive a model to enter/correct the data

 Replace all values with a global constant

Within this work, that fields were omitted from the spatial analysis.

3.3.3 Data Format

In this study the data format includes:

 Construct Data: Derived Attributes. Derived attributes are new attributes that are constructed from one or more existing attributes in the same record.

E.g. Time of the selection process = award date - publication date (days)

 Integrated Data: Merged Data.Merging tables refers to joining together two or more tables that have different information about the same objects.

 Format. In this case all databases are in excel files and has to be transformed into CSV files.

For the data Cleaning and Format the KNIME tool is used. KNIME is a powerful tool to manipulate data including its cleaning, extraction, loading from a database, reading, as well

as the possibility to exchanging and moving fields, rows and columns. It is an open source tool.

3.4 Modeling (Data Analysis and Techniques)

As the first step in modelling, it is necessary to select the data mining technique that will be used, then choose the adequate software, lastly, build de model.

3.4.1 Select the data mining technique

The purpose of this study is to assess the impact of the public contracting platforms implemented in Colombia in terms of efficiency, effectiveness and transparency (See Table 3) through data mining tools, based on a Data Base formed by 5.611.700 recorded of public contracts making through the e-procurement platform in Colombia since 2013 to 2018.

To meet the objective of this study it is necessary to search patterns that explain or summarize the data set being useful for exploring the properties of the examined data, follow a type of unsupervised learning, which consists in acquiring knowledge from the available data, without requiring external influence that indicates a specific behavior, so the model of the data mining to develop this study correspond to a descriptive model.

The methods to use for each model are showed and explained shortly in the figure 13.

Figure 13: model and methods for data mining Source: Adapted from Larose D., 2005

According with this information the chosen method is Association Rules because is useful to discover the interest relations between different variables in the Data base. The association Method detects ways of match between items that appear together at the same time. For each set of elements, the algorithm creates scores representing the support and the trust. These scores can be used to classify and derive interesting rules of element sets.

Next, the algorithm counts the number of times that each set of elements appears and calculate the relative importance of each of them in all transactions. The algorithm uses this information about element sets, to generate rules that can be used to predict associations or make recommendations. To each recommendation a probability is assigned based on the strength of associations (Sumathi and Sivanandam, 2006).

The algorithm chosen is Apriori, which allows generating the association rules and describing facts that occur in common within a given set of data and algorithms. Basically, the criteria to select the algorithm to be used was its compatibility with WEKA software and the available information in the Data Base.

Association rule of data mining involves picking out the unknown inter-dependence of the data and finding out the rules between those items.

A rule is defined as an implication of the form A=>B, where A∩ B≠0. The left-hand side of the rule is called as antecedent. The right-hand side of the rule is called as consequent.

In simple words rules could be:

If A and B, then C If A and not B, then C If A and B and C, then D, etc.

The Apriori algorithm begins obtaining the Frequent item sets, which are those sets formed by the items that get a support from the database higher than the minimum support other request by the user. This algorithm gets first, the sets of frequent items of size 1 and then those of size 2 and so on until there are no more sets whose items have no major support to the minimum support.

An example of how the Apriori algorithm works is as follows, suppose a set of transactions where each transaction can be formed for one or more of the following items;

{a,b,c,d}. The sets of items that the algorithm search will be shown in the Table 4

Table 6: Apriori algorithm search space Item set No. Transactions

The single-item sets in the Figure are those obtained from the first running of the algorithm in the database, those are the items whose support is higher than the minimum support proposed by the user. With those sets of a single item the sets of two items are generated, combining the item {a} with the item {b} to form the set of two items {a, b}, if in this set the support calculated is greater than the minimum support defined, then this 2-item is part of the 2-item set, that process is also performed for items {a} and {b} and so on.

This process follows with the sets of three items that are obtained by combining the sets of 1 item with the sets of 2 items, for example combines the item {a} of the set of 1 item with a set of 2 items {b, c} to form the set {a, b, c}, this set is searched in the transaction database, and if its support is greater than the minimum support then it is part of the set of three items. In this way the association rules are generated

The parameters of the association rules serve to measure how valid and representative the association rules are in reference to the set of data being analyzed. The main quality parameters of the association rules are:

a) Support: The support of an item is the frequency which this item is founded in transactions divided by the number of transactions.

𝑆𝑢𝑝𝑝𝑜𝑟𝑡 (𝐴) =𝑁𝑜. 𝑇𝑟𝑎𝑛𝑠𝑎𝑐𝑡𝑖𝑜𝑛𝑠 𝑡ℎ𝑎𝑡 𝑐𝑜𝑛𝑡𝑎𝑖𝑛 𝑡ℎ𝑒 𝑖𝑡𝑒𝑚 𝐴 𝑁𝑜. 𝑇𝑟𝑎𝑛𝑠𝑎𝑐𝑡𝑖𝑜𝑛𝑠 𝑜𝑓 𝑡ℎ𝑒 𝑑𝑎𝑡𝑎𝑏𝑎𝑠𝑒

b) Confidence: The confidence measure of a decision rule is the division between the support of the decision rule between the support of the antecedent of the decision rule, this is represented by the following equation:

𝐶𝑜𝑛𝑓(𝐴 → 𝐵) =𝑆𝑢𝑝𝑝𝑜𝑟𝑡 (𝐴, 𝐵) 𝑆𝑢𝑝𝑝𝑜𝑟𝑡 (𝐵)

41 3.4.2 Select the software

The condition to select the technological tool is use an open source software due to the resources involving on this study. Therefore, the main options and its characteristics are summarized in the next table:

Table 7: Open source Data mining software comparison

Characteristics RapidMiner Weka Orange

Developer RapidMiner,

Main Purpose General Datamining General Datamining General Datamining

Community support Large Large Moderate

Source: own elaboration

Thus, the selected software is WEKA due to:

 its ease to use

 allows the integration of different modules

 its flexible

 allows cleaning and data transformation

 Delivery graphically results

In short, in this studyKNIME will be used to clean the data, send it to the database and structure them to apply the algorithm, and WEKA will be used to apply the data mining algorithms.

3.4.3 Build the model

After selecting the tools and the data mining technique to be used, proceed the extraction process.

The extraction process consists of the files selection and their subsequent loading to WEKA. To perform the data extraction from the original source and to get the final fields, the Knime tool was used, this tool has options to be able to clean the records, and modify them.

For the data preparation:

- The empty and wrong fields were deleted for the entire record as was explained in the data cleaning Criteria.

- The data was integrated emerging the original databases tables (Contract, supplier, and publisher databases) e.g. the contract database included the government entity ID which publish the process in the e –procurement as well the bidder(s) ID, and the supplier ID which gets that contract, in this way is possible to cross the data to get in the same Contract database the information of interest regarding to the suppliers and government entity as type of company, if the government entity is territorial or national, etc.

- Some new fields are generated as Number of bidders (in the original field appear the Bidders IDs), Time of the selection process = award date - publication date (days), Number of citizens Consultations, No. of modifications, requests and complaints (in the original database was recorded like a description of the observations, modifications, etc.)

- The independent Variable for this study is the Transaction (contract process) depend of the specific contract process the transaction scope, Investment sector, budge… and the rest of attributes are changing

- The selected attributes are Transaction, transaction scope, Investment sector, budget, Number of bidders, supplier’s company type, time of the selection process, year of the contract process, year of the supplier register to the

platform, perfected contract, modifications, requests and complaints, citizen consultations.

- To get a binary information (1,0) in the data, the attributes are switched to items of the database file being the columns, in this manner if the item exist or is yes is marked with 1, otherwise is 0. Some examples are showed in the Figure 14.

Figure 14: Binary Transformation Source: own elaboration

Thus, the transaction 1 for example was awarded with a small company, was a territorial contract and the contract was perfected, etc. Each column is and item A, B, C, D…. this process was conducted for all attributes.

- For some wide range attributes intervals are calculated, this facilitates the handling of the data and reduces the final size of columns and Items in the transactional matrix e.g. for the time of selection (days) the maximum value obtained is 65 days and the minimum is 38 days in this way, so is range = 65 -38 = 27, generating 5 intervals of 6.

- In the case of the Contract budget, the ranges are selected according to the legislation, however to estimate the intervals is taking into account that the minimum wage is different per year, so is necessary to confront this

information with the maximum and minimum budget value in the database to get the final intervals. The final intervals for the budget are presented in the next table how is detailed in the Table 6.

Table 8. Budget Intervals

Source: Own elaboration

The following table shows a view of the final items and item sets:

Table 9. Final Items

Attribute Description Item Attribute Description Item

Investment

For entities with Amount for

public tender Budget Intervals

45 transaction just is possible get only one item per attribute, according with this, the levels for the Apriori algorithm in this case is nine, which corresponds to the number of attributes.

The above work is developed using Knime tool, following the process represented in the next figure:

Figure 15: Knime modules for flat file conversion Source: own elaboration

The first module reads the file that will be transformed by the second module called

“Pivoting”. The bitvector, rules and filters, are using to switch information, set grouping rules and ranges, delete data, among others. The following figure shows the transactional matrix representing each item of the transaction as 0 or 1, to apply the Apriori algorithm:

Figure 16: View of the Transactional Matrix resulted Source: Knime Tool

Once the transactional matrix is obtained, it is loaded into WEKA to proceed to apply the data mining algorithm. A confidence of 0.9 and a minimum support of 0.1 is established for the Apriori algorithm to be implemented, as well as an output of 30 rules. The minimum support and trust value will support that the association rules founded are relevant to the study. Figure 17 shows the parameters used to configure the algorithm. Figure 17 shows the parameters used to configure the algorithm.

Figure 17: parameters used to configure the algorithm Source: Weka

Chapter 4 Results

4.1 Data Result

The data set generates a large number of association rules that are repeated in a certain number of iteration in this case 18, the conceptual filters to validate each rule are confidence and support.

In this particular case, confidence is 0.9% and support 10%. Then, the Figures 18 and 19 shows the 30 best rules generated by WEKA software.

Figure 18: Best Rules found by Weka Source: Weka

Figure 19: Best Rules found by Weka Source: Weka

All the associate rules obtained exceed a confidence level of 0.93 and have a support greater than 50%. To analyze and evaluate the new knowledge in the application domain, the variables are associated for the data interpretation.

Figure 20: Association Rules Source: Weka

Thus, the reading of the previous rules is:

 [budget=331.972.650;400.400.000 bidder=8,14 mod=5,10] = = > [per=yes] conf: (0.93) The procurement with a budget between 331.972.650 and 400.400.000 COP, where involved between 8 and 14 bidders and receive from 5 to 10 inquires of modification, complaints or other request is perfected with a confidence level of 0.93

 [invsec=health scope=nal ctype=large bidder=14,20] = = > [cit=5,10] conf: (0.93) The procurement regarding to the health investment sector with a national scope, where involved between 14 and 20 bidders and most of them are large companies is consulted by citizens between 5 and 10 times with a confidence level of 0.93

 [scope=nal ctype=large time=56,62 per=yes] = = > [mod=5,10] conf: (0.95)

The procurement with a national scope where involved large companies as bidders, are perfected and has a selection time between 56 and 62 days, receive from 5 to 10 inquires of modification, complaints or other request with a confidence level of 0.95

 [invsec=infr scope=nal per=yes bidder=1] = = > [mod=5,10] conf: (0.98)

The procurement regarding to the infrastructure investment sector with a national scope, where involved just one bidder and the contract is perfected, receive from 5 to 10 inquires of modification, complaints or other request with a confidence level of 0.98

 [mod=10,15 per=yes] = = > [cit=1,5] conf: (0.93); lift (1); lev (0.0) [0]; conv (0.78) The procurement which receive from 5 to 10 inquires of modification, complaints or other request and is perfected is consulted by citizens at least 5 times with a confidence level of 0.93

This is how the Association Rules are interpreted to get the conclusions. The major findings of the rules obtained from the algorithm are described below.

4.2 Major findings and discussion

The association rules show several patterns and relationships between different variables with levels of confidence and support which validate them within the data set, for this case the got confidence is between 0.93-0.98 and the support since 52 to 61%, as well as the data processing through data preparation and the elaboration of the transactional matrix gives valuable information such as:

I. The majority of citizen consultations carried out are in the range of 1 and 5, with a more probable participation in contracts of national scope in the fields of health and education.

Only 21% of citizen inquiries generate some kind of modification, request or complaint.

The highest probability of zero participation is shown in the public housing contracts where there is only 1 bidder. Contracts with a budget between $ 331,972,650- $ 400,400,000 are most consulted. In 2017 and 2018, a greater number of citizen consultations are presented.

Although participation has increased in recent years the foregoing evidences shows that the consultation of citizens in SECOP is lack, in the same way it can be inferred that

citizens consult more processes on education and health, perhaps because their relevance as fundamental rights

II. The number of contracts awarded and published has progressively increased in the platform as well as the number of registered suppliers. The number of bidders has increased especially in the education sector at territorial scope in a range 2-8 and in the national health sector in a range 8-15. The lowest number of bidders is found in infrastructure contracts with national scope

In the years 2017 and 2018 there is a greater participation of bidders in the hiring processes, especially in contracts with a budget between 172,480,000 and 331,972,650

在文檔中以資料探勘評估哥倫比亞開放式合約之績效 (頁 41-0)