Statistical analysis of the DPL 2005 election
Rafael Laboissière, DD

Introduction

The DPL 2005 election was one of the most interesting elections Debian has had in recent years. Several things contributed to this. First, the election had a strong set of candidates, who presented interesting platforms. Second, the campaign featured lively discussions on the debian-vote mailing list and a well-organized IRC debate. Finally, the election took place in a rather agitated context: a badly delayed release schedule for sarge, the semi-secret organization of the Vancouver meeting, and the creation of Project Scud.

Beyond the obvious "who won" analysis, one may ask which factors dominated the voters' preferences. Answering this question is possible, in part, thanks to the Condorcet voting system used in Debian elections, in which the voting options are numerically ranked by the voters. In this paper, a multivariate statistical technique is applied to the tally sheet of votes cast. The data were pre-processed to replace non-ranked options with numeric values, and a Factor Analysis (FA) was applied. FA is typically used to unveil the latent structure of a set of variables. It does so by grouping variables (in our case, the voting options) together such that a limited number of dimensions can explain a large amount of the variance in the data set.

Note that FA is closely related to Principal Component Analysis (PCA), but FA results are often more interpretable than those of PCA. One drawback of FA is that the number of factors that can be extracted is limited to roughly half of the number of variables. We show below that the three dominating factors in the DPL 2005 election were a rejection factor, an Anthony Towns factor and a Project Scud factor (see the Discussion section).

Methods

Hereafter, the options appearing in the ballot will be referred to by the initials of the candidates: JW = Jonathan Walther, MG = Matthew Garrett, BR = Branden Robinson, AT = Anthony Towns, AL = Angus Lees, and AS = Andreas Schuldei. The None of the Above option will be referred to as NA. In the R reports below, the variables (V1 to V7) are ordered in the way they appeared in the ballot.

The tally sheet of votes cast was pre-processed with a Python script to transform the non-ranked options (appearing as - in the ballots) into numeric values. The non-ranked options were replaced by the integer immediately greater than the greatest rank used in the ballot. For instance, a ballot like --76--1 is translated into 8876881. Although this particular ballot could also be translated into 4432441, which would have the same effect in the Condorcet system, I preferred not to renumber the ranked options, because this better reflects the voter's intention.
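The replacement rule described above can be sketched in a few lines of Python. This is not the original script, only a minimal illustration; the function name `ballot_to_ranks` is hypothetical, and the sketch assumes single-digit ranks (true for a seven-option ballot) and, as a further assumption, maps a ballot with no ranked option at all to all ones.

```python
def ballot_to_ranks(ballot):
    """Convert a tally-sheet ballot such as '--76--1' into a list of integers.

    Non-ranked options ('-') receive the integer immediately greater than
    the greatest rank that appears in the ballot.
    """
    ranked = [int(ch) for ch in ballot if ch != "-"]
    # Assumption: a ballot without any ranked option gets fill value 1.
    fill = max(ranked, default=0) + 1
    return [fill if ch == "-" else int(ch) for ch in ballot]


if __name__ == "__main__":
    print(ballot_to_ranks("--76--1"))
```

Keeping the original ranks and only filling the gaps leaves the distances between the ranked options untouched, which matters for FA even though it is irrelevant for the Condorcet tally.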

The numeric values were fed to an R script, which generated the text output and figures shown in this paper. Each voting option is treated as a separate variable in the analysis. The FA was performed with three factors because this is the maximum number of factors that can be computed from seven variables. Factor rotation was chosen to be promax, because the resulting non-orthogonal (oblique) rotation allows more of the variance to be explained.
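The limit of three factors can be checked with the usual degrees-of-freedom count for the maximum-likelihood factor model: with p variables and k factors, the model has ((p - k)^2 - (p + k)) / 2 degrees of freedom, and this number must not be negative. The sketch below is only an illustration of that count; the function names are hypothetical.

```python
def fa_degrees_of_freedom(p, k):
    """Degrees of freedom of a factor model with p variables and k factors."""
    # (p - k)^2 and (p + k) always have the same parity, so the
    # difference is even and integer division is exact.
    return ((p - k) ** 2 - (p + k)) // 2


def max_factors(p):
    """Largest number of factors with non-negative degrees of freedom."""
    k = 0
    while fa_degrees_of_freedom(p, k + 1) >= 0:
        k += 1
    return k


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, fa_degrees_of_freedom(7, k))
    print("maximum for 7 variables:", max_factors(7))
```

For seven variables, three factors leave 3 degrees of freedom, while four factors would leave a negative number, so three is the maximum.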

Results

A preliminary PCA was performed on the data with the following results:

Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
Standard deviation     2.1277626 1.9229316 1.7356354 1.4124399 1.3530305
Proportion of Variance 0.2595623 0.2119937 0.1727079 0.1143761 0.1049568
Cumulative Proportion  0.2595623 0.4715559 0.6442638 0.7586399 0.8635966
                          Comp.6     Comp.7
Standard deviation     1.1049850 1.07619816
Proportion of Variance 0.0700016 0.06640177
Cumulative Proportion  0.9335982 1.00000000

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
V1  0.483  0.487  0.290  0.105  0.227  0.614       
V2  0.197 -0.123 -0.469  0.701  0.295 -0.108  0.370
V3 -0.323  0.627        -0.110  0.549 -0.434       
V4  0.255         0.707  0.156 -0.121 -0.487  0.387
V5  0.405  0.157         0.294 -0.116 -0.348 -0.767
V6 -0.261  0.542 -0.111  0.294 -0.709         0.190
V7  0.571  0.162 -0.427 -0.537 -0.171 -0.255  0.287

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

The contribution of each component to the total variance can be visualized in the following figure:

[Figure: Variances of PCA]
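The proportions in the PCA summary follow directly from the standard deviations: each component's variance is the square of its standard deviation, divided by the total. The following sketch recomputes them from the printed values; the helper names are hypothetical and the numbers are copied from the R output, so small rounding differences are to be expected.

```python
from itertools import accumulate

# Standard deviations of Comp.1 to Comp.7, copied from the PCA summary.
STD_DEVS = [2.1277626, 1.9229316, 1.7356354, 1.4124399,
            1.3530305, 1.1049850, 1.07619816]


def variance_proportions(std_devs):
    """Share of the total variance carried by each component."""
    variances = [s * s for s in std_devs]
    total = sum(variances)
    return [v / total for v in variances]


def components_needed(std_devs, level):
    """Smallest number of components whose cumulative share reaches level."""
    cumulative = accumulate(variance_proportions(std_devs))
    for count, share in enumerate(cumulative, start=1):
        if share >= level:
            return count
    return len(std_devs)


if __name__ == "__main__":
    for share in variance_proportions(STD_DEVS):
        print(round(share, 4))
    print("components for 90%:", components_needed(STD_DEVS, 0.90))
```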

The FA with three factors and promax rotation yields the following results:

Call:
factanal(x = vote.df, factors = 3, rotation = "promax")

Uniquenesses:
   V1    V2    V3    V4    V5    V6    V7 
0.527 0.863 0.437 0.005 0.611 0.812 0.665 

Loadings:
   Factor1 Factor2 Factor3
V1  0.659   0.182   0.250 
V2  0.172  -0.212  -0.221 
V3                  0.757 
V4          0.980  -0.121 
V5  0.603                 
V6         -0.119   0.427 
V7  0.520  -0.166  -0.130 

               Factor1 Factor2 Factor3
SS loadings      1.115   1.092   0.903
Proportion Var   0.159   0.156   0.129
Cumulative Var   0.159   0.315   0.444

Factor Correlations:
         Factor1 Factor2  Factor3
Factor1  1.00000  0.0476 -0.00915
Factor2  0.04759  1.0000  0.20974
Factor3 -0.00915  0.2097  1.00000

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 11.19 on 3 degrees of freedom.
The p-value is 0.0108 
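The reported p-value can be cross-checked from the statistic and the degrees of freedom alone. For 3 degrees of freedom, the upper tail of the chi-square distribution has a closed form, so a short standard-library sketch suffices. The function name is hypothetical, and because the statistic is only printed to two decimals, the result agrees with the R output only up to rounding.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Statistic copied from the factanal output above.
p_value = chi2_sf_3df(11.19)

if __name__ == "__main__":
    print(p_value)
```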

Graphical representations of the three factors are shown in the following figures, where the factor loadings are plotted as the heights of the bars:

[Figure: loadings of factor #1]

[Figure: loadings of factor #2]

[Figure: loadings of factor #3]

Each ballot can be projected onto the space formed by the three factors above. The loadings of the factors constitute the coordinates of the vectors, which form a non-orthogonal coordinate system. The R function qr.solve() was used to back-solve the projections of each ballot onto the three-factor space. The results are shown in a separate file. The quartiles for these projections are depicted in the following boxplot graph:
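For an overdetermined system such as this one (seven equations, three unknowns), qr.solve() returns the least-squares solution. The sketch below obtains the same kind of solution in plain Python through the normal equations instead of a QR decomposition. It is an illustration under stated assumptions, not the original R code: blank cells in the printed loadings table are taken to be small suppressed values and approximated as 0.0, whether and how the ballots were centered beforehand is not stated in the text, and the function names are hypothetical. The numbers it produces will therefore not reproduce the published projections exactly.

```python
# Factor loadings copied from the factanal output (rows V1 to V7).
# Assumption: blank cells in the printed table are approximated as 0.0.
LOADINGS = [
    [0.659, 0.182, 0.250],
    [0.172, -0.212, -0.221],
    [0.0, 0.0, 0.757],
    [0.0, 0.980, -0.121],
    [0.603, 0.0, 0.0],
    [0.0, -0.119, 0.427],
    [0.520, -0.166, -0.130],
]


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def project(ballot, loadings=LOADINGS):
    """Least-squares coordinates of a ballot in the (oblique) factor basis."""
    k = len(loadings[0])
    # Normal equations: (L^T L) c = L^T b
    gram = [[sum(row[i] * row[j] for row in loadings) for j in range(k)]
            for i in range(k)]
    rhs = [sum(row[i] * value for row, value in zip(loadings, ballot))
           for i in range(k)]
    return solve_linear(gram, rhs)


if __name__ == "__main__":
    print(project([8, 8, 7, 6, 8, 8, 1]))
```

Because the factor vectors are not orthogonal, the coordinates cannot be obtained by simple dot products with each loading vector; solving the linear system accounts for the overlap between the factors.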

[Figure: FA intervals]

Discussion

From the PCA results one can see that quite a high number of components is needed to explain the ballot data. Indeed, the 90% level of variance explanation is only reached at the sixth component. The PCA loadings give us a first indication of how the voting options were grouped together. However, each option tends to have significant loadings in several components and no clear pattern emerges.

The FA results show interesting combinations of the voting options in each factor. Before interpreting the factor loadings, we must note that the FA with three factors is still not statistically sufficient to account for the variation in the data. With a p-value of 0.0108 for the chi-square statistic, the null hypothesis that three factors are sufficient to describe the data is rejected at the usual 0.05 level. However, this p-value is not too far from that threshold, and we assume that the factors found did play an important role in the voters' decisions.

The three factors could be interpreted as follows:

Factor#1 – the rejection factor:
This factor is the only one that shows a high loading for option NA. Options JW and AL (and, to a lesser extent, option MG) correlate very well with NA in factor#1. The other options have only marginal participation in factor#1. The factor#1 loadings correspond roughly to the performance of each candidate against the NA option (see the beat matrix in the election results). A possible interpretation of this factor is that voters tended to rank candidates JW and AL (and, to a lesser extent, also MG) close to NA. This does not mean that most voters rejected these candidates (many ballots have a negative projection along factor#1). One could say that the rejection of some candidates was the primary concern of the majority of the voters.
Factor#2 – the Anthony Towns factor:
This factor has a single high loading for option AT and relatively small loadings for all other options. It may be interpreted as a tendency to differentiate candidate AT by ranking him either much higher or much lower than the other candidates. What made candidate AT so distinct from the others? We can only speculate here. Anthony Towns is, by far, the candidate who has been most involved in the technical infrastructure of Debian (release management of potato and woody, ftpmaster, britney, package pools, crypto in main, among others). It seems that the preference for AT polarized the voters' choices. The underlying question could be whether Debian developers want a highly technically skilled person as DPL. See the Empowering leadership section in Andreas Schuldei's comments about social groups for some discussion along this line.
In a personal communication, Steve Greenland suggested that AT's technical skills were irrelevant and that the reaction for or against AT was largely based on these two components:
  1. AT had proposed temporarily limiting access to the mailing lists for people who violated standards of conduct. Steve suspects that this produced a strong reaction of either "it's about time" or "completely out of the question".
  2. As noted above, AT is active at the infrastructure level of Debian, which gives him a lot of de-facto power over the project. Steve would guess that some people did not think that combining this with the office of DPL was a good idea.
Factor#3 – the Project Scud factor:
This factor clearly sets the option MG against the options BR and AS. Several aspects of the campaign could explain this opposition, but the most obvious one is Project Scud, of which Branden Robinson and Andreas Schuldei are members. Matthew Garrett was the candidate who most clearly expressed disagreement with the need for a DPL team. If this interpretation of factor#3 is correct, we need to explain why option JW has a relatively high loading in factor#3. One could argue that Jonathan Walther did not publicly disagree with Project Scud or that he expressed clearly in his platform that he would work with teams if elected.

As a final analysis, each ballot was classified according to how much it scores along each of the three factors (the results are in a separate file). To do this, the interval of variation of each factor was subdivided according to the following quantiles:

Quantiles for factor #1 projections:
         0%         25%         45%         55%         75%        100% 
-5.15098284 -1.38429892  0.09465953  0.57435224  1.66821424  6.42683790 

Quantiles for factor #2 projections:
         0%         25%         45%         55%         75%        100% 
-4.13405491 -1.24956196 -0.41324767 -0.07960007  1.02671858  4.84305117 

Quantiles for factor #3 projections:
        0%        25%        45%        55%        75%       100% 
-4.2679077 -1.6282570 -0.7557563 -0.2470494  1.2062412  6.4854074 
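The four interior limits (25%, 45%, 55% and 75%) split each factor's range into five intervals. One plausible reading, assumed here and not confirmed by the text, is that these intervals map in increasing order to the symbols --, -, o, + and ++. The sketch below implements that reading; the names are hypothetical, and the treatment of a projection that falls exactly on a limit (assigned to the upper interval) is also an assumption.

```python
from bisect import bisect_right

SYMBOLS = ["--", "-", "o", "+", "++"]

# Interior quantile limits (25%, 45%, 55%, 75%) copied from the tables above,
# keyed by factor number.
LIMITS = {
    1: [-1.38429892, 0.09465953, 0.57435224, 1.66821424],
    2: [-1.24956196, -0.41324767, -0.07960007, 1.02671858],
    3: [-1.6282570, -0.7557563, -0.2470494, 1.2062412],
}


def classify(projection, limits):
    """Map a projection value to one of the five symbols."""
    # bisect_right counts how many limits are <= projection, which is
    # the index of the interval the value falls into.
    return SYMBOLS[bisect_right(limits, projection)]


if __name__ == "__main__":
    for factor, value in [(1, -2.0), (1, 0.3), (2, 0.0), (3, 2.0)]:
        print(factor, value, classify(value, LIMITS[factor]))
```

The narrow 45% to 55% band makes the neutral class o deliberately small, so most ballots are assigned a sign along each factor.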

Using the limits above and the projection data, each ballot was classified along the factors using one of the symbols --, -, o, +, and ++. For instance, the two ballots below:

ballot     REJ     AT     PS

7314526     ++     --     ++
1-23---     --      o      +

could be interpreted as:

One might also question whether it is legitimate to use the rank order in the ballots as numerical values for the FA. In a private communication, Chris Lawrence argued that it may be better to use a scaling technique, like Unfolding, which would convert each ballot to a set of distances between the voter's ideal political position and those of the candidates. Using the distance matrix, it would then be possible to find the position of each voter and each candidate in a low-dimensional policy space. Open questions with this approach are how to treat non-ranked options and ties, and whether it is better to use a metric or non-metric Unfolding technique.

Author

Rafael Laboissière ⟨rafael AT debian DOT org⟩

DISCLAIMER: Although its format may suggest it, this article should not be considered a fully scientific work. I have written it mostly for the fun of doing it. The interpretations are obviously subjective and I apologize for any offense that the candidates may take from this text. Comments and suggestions for improvements are welcome.

Source Code

The results and the figures in this article were obtained using scripts written in Python and R. The source code, including the HTML source for this web page, is available as a tar.gz file (24K).