The AIM workshop on exponential random network models was an experiment, bringing together applied social scientists, biologists, statisticians, and mathematicians who are interested in the emerging field of graph limit theory. All of us think about networks in some form or other, but the language, examples and aims are often very different. There has been spectacular progress in the mathematics of large networks, mostly for dense graphs (if there are n vertices there are order n² edges). A good deal of this is captured in Lovász’s recent book. There has also been spectacular growth in the availability and use of real network data. Some of this is huge (Facebook, Twitter, the Web) but some networks are smallish (I think of 500 tuberculosis patients in Canada with an edge between them if they have had contact in the past year). Social scientists and others have developed a suite of techniques for working with such network data. One central focus is exponential random graph models and their wonderful implementation in the STATNET package.
The experimental types have discovered some really strange things. For simple, natural models (incorporating, say, the number of edges and the number of triangles), standard fitting methods (e.g., maximum likelihood) go crazy, giving very different-looking graphs on repeated simulation from the same model. The theorists had parallel results, but in a limiting sense. There was a lot of common ground possible BUT the usual “turf” and applied/theoretical barriers could well get in the way. THEY DIDN’T, and real progress was made. Here are some of the things I learned:
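To make the instability concrete, here is a toy sketch (mine, not the statnet machinery): a minimal Glauber-dynamics sampler for the edge-triangle model in Python, with parameter values chosen purely for illustration. Near the degenerate regime, chains started from different graphs can settle at wildly different edge densities.

```python
import numpy as np

def sample_edge_triangle(n, beta_edge, beta_tri, steps=20000, start="empty", seed=0):
    """Toy Glauber-dynamics sampler for p(G) ~ exp(beta_edge*#edges + beta_tri*#triangles).
    Illustrative only; serious simulation and fitting is done with statnet's MCMC."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int) if start == "empty" else 1 - np.eye(n, dtype=int)
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)
        common = int(A[i] @ A[j])                # neighbors shared by i and j
        delta = beta_edge + beta_tri * common    # log-odds of edge (i, j) being present
        A[i, j] = A[j, i] = int(rng.random() < 1.0 / (1.0 + np.exp(-delta)))
    return A

# With the same (illustrative) parameters, the empty and complete starting points
# can end up at very different densities: the instability seen in practice.
for start in ("empty", "complete"):
    A = sample_edge_triangle(30, beta_edge=-2.0, beta_tri=0.35, start=start)
    print(start, "final edge density:", round(A.sum() / (30 * 29), 3))
```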
1) For theorists, one question is “does anyone really use these models for anything real?” I found lots of examples. Perhaps most striking, co-organizer Martina Morris and her colleagues fit these models to sexual contact data from Africa. The incidence of HIV in these countries varies by factors of 10 (with something like 40% infected in Botswana). What causes the disparity? The data is extensive but of poor quality. Using exponential models, the different data sets can be cleaned up and compared. She found that a simple factor, concurrent partnerships, could explain a lot of the variability. Different countries have about the same number of lifetime sexual partners (the same as in Western countries), BUT there are big differences in concurrency. After discovering this, Martina got governments and tribal leaders and anyone else who would listen to stigmatize the behavior, and it really seems to make a difference. There is a lot of substance hiding behind this paragraph, and their work is a thrilling success. Go take a look.
2) The theorists had no idea how well developed the computational resources are. One had only to suggest a project and someone could sit down in real time and try it out. STATNET and R are amazing.
3) I don’t think that the applied people had internalized some of the theoretical progress. Graph limit theory is full of infinite-dimensional analysis, and its main applications have been to extremal graph theory. After some of its potential applications (see below) came into believable focus, there was a lot of explaining and discussing. This was useful for me too.
4) Here is a success story from the conference: An exponential random graph model has an “unknowable” normalizing constant, which is a sum over all graphs on n nodes. Even for little graphs (n=30) this sum is too big to handle with brute force. Chatterjee and Diaconis proved a large-sample approximation for the normalizing constant. This was in terms of an infinite-dimensional calculus-of-variations problem, but sometimes it reduces to a one-dimensional optimization. Their approximation is based on large deviations bounds (due to Chatterjee-Varadhan). Its relevance to finite n could and should be questioned. Mark Handcock and David Hunter programmed the tractable approximations and compared them to Monte Carlo approximations that are well developed in the applied world. To everyone’s amazement, the approximations were spot on, even for n=20. These approximations have parameters in them and are used to compute maximum likelihood and Bayes estimates. IF things work out, there are really new tools to use and develop.
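In rough form (and glossing over normalization conventions, which may differ from the published statement), the limit for the edge-triangle model looks like this:

```latex
% Edge--triangle model: p_\beta(G) \propto \exp\!\big(n^2[\beta_1 t(K_2,G)+\beta_2 t(K_3,G)]\big),
% where t(H,G) is the homomorphism density of H in G.
% With \psi_n(\beta) = n^{-2}\log Z_n(\beta), the limit is a variational problem over
% symmetric measurable graphons h : [0,1]^2 \to [0,1]:
\psi_n(\beta) \;\to\; \psi(\beta)
  \;=\; \sup_{h}\Big\{\beta_1\, t(K_2,h) + \beta_2\, t(K_3,h)
        \;-\; \tfrac12 \int_0^1\!\!\int_0^1 I\big(h(x,y)\big)\,dx\,dy \Big\},
\qquad I(u) = u\log u + (1-u)\log(1-u).
% When \beta_2 \ge 0, a constant graphon h \equiv u attains the supremum, leaving the
% one-dimensional optimization mentioned above:
\psi(\beta) \;=\; \sup_{0\le u\le 1}\Big\{\beta_1 u + \beta_2 u^3 - \tfrac12 I(u)\Big\}.
```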
Two questions had all of us interested. First, once one realizes that this route is interesting, one can try to find approximations of the not-so-nice infinite-dimensional problems by discretizing. This is quite close to what physicists do in “replica symmetry breaking,” so there are ideas to borrow and specific projects to try out. Second, some of the statistics that the applied community finds natural are not continuous in graph limit space. What does this mean in practice AND can the theorists come up with continuous functionals that are similarly useful?
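On the first question, here is a toy sketch of the discretization idea (the function names and parameter values are invented for illustration; scipy does the optimization): restrict the graphon to be constant on a k by k grid of blocks and maximize the resulting finite-dimensional objective numerically. Taking k = 1 recovers the one-dimensional reduction from item 4.

```python
import numpy as np
from scipy.optimize import minimize

def objective(x, k, beta_edge, beta_tri):
    """Negative of beta1*t(edge,h) + beta2*t(triangle,h) - (1/2)*mean of I(h),
    for a block-constant graphon given by a symmetric k x k matrix of block values."""
    H = x.reshape(k, k)
    H = (H + H.T) / 2                              # enforce symmetry
    Hc = np.clip(H, 1e-12, 1 - 1e-12)              # keep the logs finite
    t_edge = np.mean(H)                            # edge density of the graphon
    t_tri = np.trace(H @ H @ H) / k**3             # triangle homomorphism density
    ent = -0.5 * np.mean(Hc * np.log(Hc) + (1 - Hc) * np.log(1 - Hc))
    return -(beta_edge * t_edge + beta_tri * t_tri + ent)

def approx_psi(beta_edge, beta_tri, k=4, restarts=5, seed=0):
    """Approximate the limiting normalizing constant psi by maximizing over
    k x k block-constant graphons; k = 1 is the one-dimensional problem."""
    rng = np.random.default_rng(seed)
    best = -np.inf
    for _ in range(restarts):                      # random restarts: the objective is nonconvex
        x0 = rng.uniform(0.05, 0.95, size=k * k)
        res = minimize(objective, x0, args=(k, beta_edge, beta_tri),
                       bounds=[(1e-6, 1 - 1e-6)] * (k * k), method="L-BFGS-B")
        best = max(best, -res.fun)
    return best

print(approx_psi(-2.0, 0.5, k=1))                  # one-dimensional reduction
print(approx_psi(-2.0, 0.5, k=4))                  # finer block discretization
```

Finer grids (larger k) can only increase the approximate supremum, which gives a cheap check on whether the constant-graphon answer is plausible for a given choice of parameters.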
There are a dozen other successes. Some small, some big. I think that all of us enlarged our worldview and made some useful new scientific friends. BRAVO AIM!
Persi Diaconis
Stanford University