The DPL 2005
election has been one of the most interesting elections we have
had in Debian in recent years. Several things contributed to
this. First, the election had a strong set of candidates, who
presented interesting platforms. Second, the campaign featured
lively discussions on the debian-vote mailing
list and a well-organized IRC
debate. Finally, the election took place in a rather agitated
context: a badly delayed release schedule for sarge, the
semi-secret
organization of the Vancouver
meeting and the creation of Project Scud.
Beyond the obvious "who won"
analysis, one may ask which
factors dominated the voters' preferences. Answering this question is
possible, in part, thanks to the Condorcet voting
system used in Debian elections, in which the voting options are
numerically ranked by the voters. In this paper, a multivariate
statistical technique is applied to the tally
sheet of votes cast. The data was pre-processed to replace
non-ranked options with numeric values and a Factor
Analysis (FA) was applied. FA is
typically used to unveil the latent structure of a set of variables,
accomplishing it by grouping variables (in our case, the voting
options) together such that a limited number of dimensions can
explain a large amount of the variance in the data set.
Note that FA is closely related to Principal
Component Analysis (PCA), but FA
results are often more interpretable than those of PCA.
One drawback of FA is that the number of components
that can be extracted is limited to roughly half of the number of
variables. We show below that the three dominating factors in the
DPL 2005 election were a "rejection" factor,
an "Anthony Towns" factor
and a "Project Scud" factor
(see the Discussion section).
Hereafter, the options appearing in the ballot will be referred to by
the initials of the candidates: JW = Jonathan Walther,
MG = Matthew Garrett, BR = Branden
Robinson, AT = Anthony Towns, AL = Angus
Lees, and AS = Andreas Schuldei. The "None of the
Above" option will be referred to as NA. In the
R reports below, the variables are ordered in the way
they appeared in the ballot.
The tally sheet of votes cast was pre-processed with a Python script to transform the non-ranked options (appearing as - in the ballots) into numeric values. Each non-ranked option was replaced by the integer immediately greater than the greatest rank in the ballot. For instance, a ballot like --76--1 is translated into 8876881. Although this particular ballot could also be translated into 4432441, which would have the same effect in the Condorcet system, I preferred not to renumber the ranked options, because this better reflects the voter's intention.
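The replacement rule can be illustrated with a minimal Python sketch. This is not the original script: the function name `fill_unranked` is hypothetical, single-digit ranks are assumed (there are seven options), and the handling of a ballot with no ranked option at all is an assumption.

```python
def fill_unranked(ballot):
    """Replace each unranked option ('-') by the greatest rank plus one."""
    ranks = [int(ch) for ch in ballot if ch != "-"]
    # Assumption: a ballot without any ranked option gets 1 everywhere.
    filler = max(ranks, default=0) + 1
    return [filler if ch == "-" else int(ch) for ch in ballot]


print(fill_unranked("--76--1"))  # [8, 8, 7, 6, 8, 8, 1]
```

Fully ranked ballots pass through unchanged, so the relative order of the ranked options is preserved in every case.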
The numeric values were fed to an R script which
generated the text output and figures shown in this paper. Each
voting option is considered as an independent variable in the
analysis. The FA was performed with three factors
because this is the maximum number of factors that can be computed
from seven variables. Factor rotation was chosen to be promax,
because the non-orthogonal rotation matrix which is obtained allows
for a greater amount of variance explanation.
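The limit of three factors can be checked with the usual degrees-of-freedom count for a maximum-likelihood factor model, ((p - k)^2 - (p + k)) / 2 for p variables and k factors, which must not be negative. The sketch below is only an illustration of that count; the function names are hypothetical.

```python
def factor_model_dof(n_variables, n_factors):
    """Degrees of freedom left after fitting n_factors to n_variables."""
    return ((n_variables - n_factors) ** 2 - (n_variables + n_factors)) / 2


def max_factors(n_variables):
    """Largest number of factors with non-negative degrees of freedom."""
    k = 0
    while factor_model_dof(n_variables, k + 1) >= 0:
        k += 1
    return k


print(max_factors(7))          # 3
print(factor_model_dof(7, 3))  # 3.0
```

With seven variables, a fourth factor would leave a negative number of degrees of freedom, while three factors leave three.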
A preliminary PCA was performed on the data with the following results:
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
Standard deviation     2.1277626 1.9229316 1.7356354 1.4124399 1.3530305
Proportion of Variance 0.2595623 0.2119937 0.1727079 0.1143761 0.1049568
Cumulative Proportion  0.2595623 0.4715559 0.6442638 0.7586399 0.8635966
                          Comp.6     Comp.7
Standard deviation     1.1049850 1.07619816
Proportion of Variance 0.0700016 0.06640177
Cumulative Proportion  0.9335982 1.00000000

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
V1 0.483 0.487 0.290 0.105 0.227 0.614
V2 0.197 -0.123 -0.469 0.701 0.295 -0.108 0.370
V3 -0.323 0.627 -0.110 0.549 -0.434
V4 0.255 0.707 0.156 -0.121 -0.487 0.387
V5 0.405 0.157 0.294 -0.116 -0.348 -0.767
V6 -0.261 0.542 -0.111 0.294 -0.709 0.190
V7 0.571 0.162 -0.427 -0.537 -0.171 -0.255 0.287

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
The contribution of each component to the total variance can be visualized in the following figure:
The FA with three factors and promax rotation yields the following results:
Call:
factanal(x = vote.df, factors = 3, rotation = "promax")

Uniquenesses:
   V1    V2    V3    V4    V5    V6    V7
0.527 0.863 0.437 0.005 0.611 0.812 0.665

Loadings:
Factor1 Factor2 Factor3
V1 0.659 0.182 0.250
V2 0.172 -0.212 -0.221
V3 0.757
V4 0.980 -0.121
V5 0.603
V6 -0.119 0.427
V7 0.520 -0.166 -0.130

               Factor1 Factor2 Factor3
SS loadings      1.115   1.092   0.903
Proportion Var   0.159   0.156   0.129
Cumulative Var   0.159   0.315   0.444

Factor Correlations:
         Factor1 Factor2  Factor3
Factor1  1.00000  0.0476 -0.00915
Factor2  0.04759  1.0000  0.20974
Factor3 -0.00915  0.2097  1.00000

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 11.19 on 3 degrees of freedom.
The p-value is 0.0108
Graphical representations of the three factors are shown in the following figures, where the factor loadings are plotted as the heights of the bars:
Each ballot can be projected onto the space formed by the three factors above. The loadings of the factors constitute the coordinates of the vectors, which form a non-orthogonal coordinate system. The R function qr.solve() was used to back-solve the projections of each ballot onto the three-factor space. The results are shown in a separate file. The quartiles for these projections are depicted in the following boxplot graph:
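The back-solving step amounts to a least-squares problem: find the three coordinates whose combination of the factor vectors best reproduces the seven values of a ballot. The Python sketch below illustrates the idea using the normal equations rather than the QR decomposition that qr.solve() relies on. The loading matrix is a hypothetical placeholder, not the fitted one, and whether the ballots were centred or scaled beforehand is left open here.

```python
def solve_least_squares(loadings, ballot):
    """Return the coordinates c that minimise ||loadings * c - ballot||^2."""
    n_rows, n_cols = len(loadings), len(loadings[0])
    # Normal equations: (L^T L) c = L^T b
    a = [[sum(loadings[r][i] * loadings[r][j] for r in range(n_rows))
          for j in range(n_cols)] for i in range(n_cols)]
    rhs = [sum(loadings[r][i] * ballot[r] for r in range(n_rows))
           for i in range(n_cols)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(n_cols):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
                rhs[row] -= factor * rhs[col]
    return [rhs[i] / a[i][i] for i in range(n_cols)]


# Hypothetical 7x3 loading matrix (one row per voting option), for illustration only.
LOADINGS = [
    [0.7, 0.2, 0.2],
    [0.2, -0.2, -0.2],
    [0.0, 0.0, 0.8],
    [0.0, 1.0, -0.1],
    [0.6, 0.0, 0.0],
    [-0.1, 0.0, 0.4],
    [0.5, -0.2, -0.1],
]

print(solve_least_squares(LOADINGS, [7, 3, 1, 4, 5, 2, 6]))
```

Because the factor vectors are not orthogonal, the coordinates cannot be obtained by simple dot products with each vector, which is why a linear solve is needed.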
From the PCA results one can see that quite a high number of components is needed to explain the ballot data. Indeed, the 90% level of variance explanation is only reached at the sixth component. The PCA loadings give us a first indication of how the voting options were grouped together. However, each option tends to have significant loadings in several components and no clear pattern emerges.
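The 90% figure can be verified directly from the standard deviations reported in the PCA output, since the proportion of variance of a component is its squared standard deviation divided by the sum of all squared standard deviations. A short Python check (variable names are illustrative):

```python
from itertools import accumulate

# Standard deviations of the seven components, copied from the PCA output.
SDEV = [2.1277626, 1.9229316, 1.7356354, 1.4124399,
        1.3530305, 1.1049850, 1.07619816]

variances = [s ** 2 for s in SDEV]
total = sum(variances)
proportions = [v / total for v in variances]
cumulative = list(accumulate(proportions))

# Index (1-based) of the first component whose cumulative proportion reaches 90%.
first_over_90 = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.90)
print(first_over_90)  # 6
```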
The FA results show interesting combinations of the voting options in each factor. Before going into the interpretation of the factor loadings, we must note that the FA with three factors is still not statistically sufficient to account for the variation in the data. With a p-value of 0.0108 for the chi-square statistic, the null hypothesis that three factors are sufficient to describe the data is rejected at the usual 0.05 level. However, this p-value is not too far from that threshold, and we assume that the factors found did play an important role in the voters' decisions.
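The p-value can be cross-checked from the chi-square statistic in the R output above (11.19 on 3 degrees of freedom). For three degrees of freedom the upper tail probability of the chi-square distribution has a closed form, which the following Python sketch evaluates; the function name is hypothetical and the formula applies to 3 degrees of freedom only.

```python
import math


def chi2_sf_3dof(x):
    """Upper tail probability P(X > x) for a chi-square variable with 3 dof."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


p_value = chi2_sf_3dof(11.19)
print(round(p_value, 3))  # 0.011
```

Since the statistic itself is rounded to two decimals, the result can only be expected to agree with the value printed by R to about three decimal places.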
The three factors could be interpreted as follows:
rejection factor:
Anthony Towns factor:
britney, package pools, crypto in main, among others). It seems that the preference for AT polarized the voters' choices. The underlying question could be whether the Debian developers think about having a highly technical-skilled person as the DPL. Look at the
"Empowering leadership" section in Andreas Schuldei's comments about social groups for some discussion along this line.
"it's about time" or
"completely out of the question".
Project Scud factor:
As a final analysis, each ballot was classed according to how much it scores along each of the three factors (the results are in a separate file). To do this, the interval of variation of each factor was subdivided according to the following quantiles:
Quantiles for factor #1 projections:
         0%         25%         45%         55%         75%        100%
-5.15098284 -1.38429892  0.09465953  0.57435224  1.66821424  6.42683790

Quantiles for factor #2 projections:
         0%         25%         45%         55%         75%        100%
-4.13405491 -1.24956196 -0.41324767 -0.07960007  1.02671858  4.84305117

Quantiles for factor #3 projections:
        0%        25%        45%        55%        75%       100%
-4.2679077 -1.6282570 -0.7557563 -0.2470494  1.2062412  6.4854074
Using the limits above and the projection data, each ballot was classed along the factors using one of the symbols: --, -, o, +, and ++. For instance, the two ballots below:
ballot   REJ  AT  PS
7314526  ++   --  ++
1-23---  --   o   +
could be interpreted as:
"most rejected" candidates; has a neutral position as regards AT; moderately supports Project Scud.
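The classing step can be sketched in Python as a lookup of each projection among the inner quantile limits. The mapping of the five intervals to the five symbols (below 25%, 25% to 45%, 45% to 55%, 55% to 75%, above 75%) and the treatment of values falling exactly on a limit are assumptions; the limits used are those listed above for factor #1.

```python
from bisect import bisect_right

SYMBOLS = ["--", "-", "o", "+", "++"]

# Inner quantile limits (25%, 45%, 55%, 75%) for the factor #1 projections.
FACTOR1_LIMITS = [-1.38429892, 0.09465953, 0.57435224, 1.66821424]


def classify(projection, limits):
    """Map a projection to a symbol according to the interval it falls in."""
    # bisect_right counts how many limits are <= projection (0 to 4).
    return SYMBOLS[bisect_right(limits, projection)]


print(classify(-2.0, FACTOR1_LIMITS))  # --
print(classify(0.3, FACTOR1_LIMITS))   # o
print(classify(5.0, FACTOR1_LIMITS))   # ++
```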
One might also question whether it is legitimate to use the rank order in the ballots as numerical values for the FA. In a private communication, Chris Lawrence argued that it may be better to use a scaling technique, like Unfolding, which would convert each ballot to a set of distances between the voter's ideal political position and those of the candidates. Using the distance matrix, it would then be possible to find the position of each voter and each candidate in a low-dimensional policy space. Open questions with this approach are how to treat non-ranked options and ties, and whether it is better to use a metric or non-metric Unfolding technique.
Rafael Laboissière 〈rafael AT debian DOT org〉
DISCLAIMER: Although its format may suggest it, this article should not be considered as a fully scientific work. I have written it mostly for the fun of doing it. The interpretations are obviously subjective and I apologize for offenses that the candidates may take from this text. Comments and suggestions for improvements are welcome.
The results and the figures in this article were obtained using scripts written in Python and R. The source code, including the HTML source for this web page, is available as a tar.gz file (24K).