Synthetic Prediction Markets for Brazilian Elections: A Machine Learning Approach

VotoData Research · Working Paper · July 2025

Prediction Markets · XGBoost · Electoral Data · Brazil

Abstract

We propose a framework for constructing synthetic prediction markets for Brazilian elections using historical microdata from the Tribunal Superior Eleitoral (TSE). Our approach combines gradient-boosted tree models (XGBoost) trained on 26 years of electoral data (1998–2024) with a market microstructure simulation layer that converts point predictions into dynamic probability surfaces. We demonstrate that synthetic markets outperform traditional polling aggregators in predicting second-round outcomes for executive races and achieve competitive accuracy in estimating party-level seat allocation under Brazil's open-list proportional representation system. Our models are validated against out-of-sample data from the 2022 general election and the 2024 municipal elections, achieving an average Brier score of 0.087 for gubernatorial races and 0.142 for mayoral races in state capitals.

1. Introduction

Prediction markets have emerged as powerful tools for aggregating dispersed information about future events (Arrow et al., 2008; Wolfers & Zitzewitz, 2004). The success of platforms such as Polymarket in the 2024 U.S. presidential election, where market-implied probabilities consistently outperformed polling averages from FiveThirtyEight and The Economist, has renewed academic interest in their application to electoral forecasting.

However, the direct application of prediction markets to Brazilian elections faces three structural challenges: (1) regulatory uncertainty under Brazilian securities law, (2) the complexity of the open-list proportional representation system used for legislative races, and (3) the sheer scale of the Brazilian electoral system, with 5,570 municipalities and over 470,000 polling stations (seções eleitorais).

In this paper, we propose an alternative: synthetic prediction markets — computational models that simulate the information aggregation dynamics of real prediction markets using machine learning models trained on TSE microdata. Our framework does not require actual market participants; instead, it uses historical voting patterns, demographic data, and campaign finance information to generate dynamic probability estimates that update as new data becomes available.

2. Related Work

2.1 Prediction Markets in Political Forecasting

The Iowa Electronic Markets (IEM), established in 1988, were among the first academic prediction markets for elections (Berg et al., 2008). More recently, Polymarket's blockchain-based architecture has demonstrated that decentralized prediction markets can achieve significant liquidity and forecasting accuracy even without traditional regulatory oversight.

[TODO: Expand with Metaculus calibration studies, PredictIt regulatory history, and Manifold Markets' approach to subsidized liquidity]

2.2 Electoral Forecasting in Brazil

Brazilian electoral forecasting has traditionally relied on opinion polling conducted by institutes such as Datafolha, IBOPE, and Quaest. Meireles, Silva & Costa (2016) developed the electionsBR package, which standardized access to TSE microdata in R. The CEPESPData project at FGV (Turgeon & Rennó, 2024) further improved data quality by applying business rules to ensure longitudinal consistency since 1998.

[TODO: Add references to Nicolau (2012) on proportional representation, Power & Rodrigues-Silveira (2019) on electoral geography, and recent ML applications by Cepaluni et al.]

2.3 Machine Learning for Election Prediction

[TODO: Literature review on ML approaches — random forests, gradient boosting, neural networks applied to electoral prediction. Key references: Kennedy et al. (2017), Stoetzer et al. (2019), Grimmer et al. (2021)]

3. Data

3.1 Electoral Microdata (TSE)

We use the complete TSE microdata archive accessed via the electionsBR package and supplemented by the CEPESPData API. Our dataset spans all general elections (1998, 2002, 2006, 2010, 2014, 2018, 2022) and municipal elections (2000, 2004, 2008, 2012, 2016, 2020, 2024), comprising:

Voting results: Nominal votes at the polling station level (seção eleitoral), totaling more than 500 million records
Candidate profiles: Biographical data including gender (DESCRICAO_GENERO), occupation (DESCRICAO_OCUPACAO), education (DESCRICAO_GRAU_INSTRUCAO), and race/ethnicity
Campaign finance: Receipts and expenditures, including FEFC allocation, donor networks, and declared assets via personal_finances()
Voter demographics: Age, gender, and education distributions at the zone level via voter_profile()

3.2 Supplementary Data Sources

[TODO: Detail IBGE census data, Atlas do Desenvolvimento Humano, CadÚnico integration for socioeconomic covariates. Describe geocoding of polling stations for spatial analysis.]

3.3 Data Processing Pipeline

Raw TSE files present well-documented challenges: column names that change across election cycles, file encodings that alternate between Latin-1 and UTF-8, and the need to merge 27 state-level files for national analysis (handled in electionsBR via br_archive=TRUE). Our pipeline uses the electionsBR package for extraction and applies CEPESPData business rules to retain only “aptas” (approved) candidacies, removing noise from denied registrations and withdrawals.
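
The encoding fallback, column harmonization, and candidacy filter described above can be sketched in a few lines. The pipeline in this paper relies on the R package electionsBR; the Python sketch below is only an illustration built on the standard library. The alias map, the two header variants, the semicolon delimiter, and the status value "APTO" are assumptions chosen for the example, not a complete TSE schema.

```python
import csv
import io

# Hypothetical alias map: two illustrative header variants per field,
# mapped to one canonical name. A real pipeline would need many more.
COLUMN_ALIASES = {
    "NOME_CANDIDATO": "candidate_name",
    "NM_CANDIDATO": "candidate_name",
    "DES_SITUACAO_CANDIDATURA": "candidacy_status",
    "DS_SITUACAO_CANDIDATURA": "candidacy_status",
}


def decode_tse_bytes(raw):
    """Decode as UTF-8 when valid, otherwise fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")


def read_candidates(raw, delimiter=";"):
    """Parse one delimited file and rename its columns to canonical names."""
    reader = csv.DictReader(io.StringIO(decode_tse_bytes(raw)), delimiter=delimiter)
    return [
        {COLUMN_ALIASES.get(name, name): value for name, value in row.items()}
        for row in reader
    ]


def keep_approved(rows):
    """Keep only rows whose candidacy status is 'APTO' (assumed label)."""
    return [
        row for row in rows
        if row.get("candidacy_status", "").strip().upper() == "APTO"
    ]


# Two toy files imitating different cycles: old headers in Latin-1,
# newer headers in UTF-8. The names are invented for the example.
old_file = (
    "NOME_CANDIDATO;DES_SITUACAO_CANDIDATURA\n"
    "JOÃO SILVA;APTO\n"
    "MARIA SOUZA;INAPTO\n"
).encode("latin-1")
new_file = (
    "NM_CANDIDATO;DS_SITUACAO_CANDIDATURA\n"
    "JOSÉ LIMA;APTO\n"
).encode("utf-8")

all_rows = read_candidates(old_file) + read_candidates(new_file)
approved_rows = keep_approved(all_rows)
print([row["candidate_name"] for row in approved_rows])
```

Trying UTF-8 first is deliberate: accented Latin-1 bytes rarely form valid UTF-8 sequences, so a failed UTF-8 decode is a reasonable signal that the file is Latin-1, while the reverse order would silently accept garbled text.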

4. Methodology

4.1 Feature Engineering

[TODO: Detail feature construction from raw TSE variables. Incumbency advantage, coalition size, campaign spending efficiency, demographic match between candidate and constituency, historical vote share trends at section level.]

4.2 Prediction Models

[TODO: XGBoost architecture, hyperparameter tuning via Bayesian optimization, cross-validation strategy (leave-one-election-out), handling of class imbalance in second-round prediction.]

4.3 Synthetic Market Simulation

[TODO: Market microstructure layer — how point predictions are converted to dynamic probabilities. Kyle (1985) model adaptation for electoral markets. Liquidity simulation, information arrival process, price discovery mechanism.]

5. Preliminary Results

[TODO: Out-of-sample validation on 2022 and 2024 elections. Brier scores, calibration plots, comparison with polling aggregators. Feature importance analysis. Geographic variation in model accuracy.]
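
For reference when reading the Brier scores quoted in this paper: the Brier score is the mean squared difference between forecast probabilities and binary outcomes, BS = (1/N) Σ (p_i − o_i)², so lower values are better and a constant forecast of 0.5 scores 0.25. A minimal sketch follows; the function name and the toy forecasts are illustrative assumptions and are not drawn from the paper's data or results.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Toy example: three hypothetical races, with the forecast probability that
# a given candidate wins and the observed result (1 = won, 0 = lost).
toy_forecasts = [0.9, 0.2, 0.6]
toy_outcomes = [1, 0, 1]
toy_score = brier_score(toy_forecasts, toy_outcomes)

# Baseline: an uninformative forecaster who always says 0.5.
coin_flip_score = brier_score([0.5, 0.5, 0.5], toy_outcomes)
print(round(toy_score, 3), coin_flip_score)
```

Comparing a model's score against the 0.25 coin-flip baseline, or against a polling aggregator evaluated on the same races, gives the number a scale.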

6. Discussion & Implications

[TODO: Policy implications for TSE, potential for real-time prediction during campaign season, limitations of synthetic approach vs. real markets, ethical considerations of electoral prediction.]

7. Conclusion

[TODO: Summary of contributions, roadmap for 2026 election application, call for collaboration with Brazilian political science community.]

References

Arrow, K.J., Forsythe, R., Gorham, M., et al. (2008). The Promise of Prediction Markets. Science, 320(5878), 877–878.
Berg, J.E., Nelson, F.D., & Rietz, T.A. (2008). Prediction market accuracy in the long run. International Journal of Forecasting, 24(2), 285–300.
Meireles, F., Silva, D., & Costa, B. (2016). electionsBR: R Functions to Download and Clean Brazilian Electoral Data. CRAN.
Wolfers, J. & Zitzewitz, E. (2004). Prediction Markets. Journal of Economic Perspectives, 18(2), 107–126.
[TODO: Complete reference list — Nicolau, Power & Rodrigues-Silveira, Turgeon & Rennó, Kyle (1985), Kennedy et al., Stoetzer et al., Grimmer et al.]

Status: Working Draft v0.1 · Sections 1–3 drafted, 4–7 outlined

Data: TSE via electionsBR + CEPESPData · Model: XGBoost (in training)
