Young Researchers Seminar (Emma Kopp, Thursday 7 November 2024)

7 November 2024

The next session of the CEREMADE Young Researchers Seminar (Séminaire Jeunes chercheurs) will take place on Thursday 7 November 2024 at 5 pm in room A707. We will have the pleasure of hearing Emma Kopp (CEREMADE), who will speak about

How far can we trust a phylogeny?

Abstract
Computational methods have been used to reconstruct the history of languages over several millennia, based on data from modern languages. Using stochastic models of evolution along a phylogenetic tree, these methods infer language relationships (the topology of the tree) along with the ages of ancestral languages, usually in a Bayesian setting. Language phylogenies in the literature rarely reconstruct ages beyond 8 to 10 thousand years; moreover, all proposed language groupings older than this are debated within the scientific community. We investigate the threshold beyond which phylolinguistic tree reconstruction is unreliable. We apply theoretical results from the literature on the mathematics of phylogenies, which give upper bounds on the probability of correctly reconstructing the tree topology and the values at the root. In particular, we show that for languages evolving at the rates typically reported in the literature, it is impossible to reconstruct the topology of a tree whose root is older than 12,000 years. For trees older than this threshold, the inferred topology is no more reliable than a random guess. To arrive at this result, we reproduce three previous analyses of cognatized lexical data from 50 Sino-Tibetan languages, 422 Bantu languages, and 161 Indo-European languages. We use Markov chain Monte Carlo to produce samples from the posterior distribution of the model parameters. We then apply results from percolation theory and information theory to bound the probability of correct reconstruction. In both cases, we find that the bound decreases rapidly from 1 to 0, with a threshold between 9 and 12 thousand years. To our knowledge, this is the first theoretical quantitative bound on phylolinguistic methods. It demonstrates that reconstructing the deep topology of more ancient language families based on cognatized lexical data is a hopeless enterprise.
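
As a rough intuition for why the signal fades with age, and not as the bound presented in the talk, the following Python sketch assumes a symmetric two-state Markov model for a single binary trait (for example, presence or absence of a cognate). Under that assumption, the probability that the trait has the same state at both ends of a branch of duration t with substitution rate r is (1 + exp(-2rt))/2, which tends to 1/2, the level of a coin flip. The function name and the rate value are hypothetical and chosen only for illustration; they are not taken from the analyses described above.

```python
import math


def prob_same_state(rate: float, years: float) -> float:
    """Probability that a binary trait is in the same state at both ends
    of a branch, under a symmetric two-state Markov model (an assumption
    made here for illustration)."""
    return 0.5 * (1.0 + math.exp(-2.0 * rate * years))


# Hypothetical per-year substitution rate, for illustration only.
RATE_PER_YEAR = 1e-4

for age in (2_000, 6_000, 12_000, 24_000):
    p = prob_same_state(RATE_PER_YEAR, age)
    print(f"{age:>6} years: P(same state) = {p:.3f}")
```

The printed probabilities decrease monotonically towards 0.5 as the branch gets older, which is the single-trait analogue of the loss of information that the bounds in the abstract quantify for whole trees.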