ezyang's research log

What I did today. High volume, low context. Sometimes nonsense. (Archive)

Selective generativity


Generative module semantics acts quite a bit like newtyping. And everyone knows that if you need multiple instances for the same logical “type”, you should newtype.

The implication here is that selective generativity of dependencies might make it possible to arrange for multiple instances of the same data type to coexist. Suppose you have

import Data.Bool
import Data.SomeTypeClass
instance SomeTypeClass Bool where

and some other user now needs to integrate this library with some of their code which also defines the type class for Bool. Intuitively, what we’d like the elaborated version of this code to look like is:

import Data.MyBool
import Data.SomeTypeClass
instance SomeTypeClass MyBool where

with all of the relevant instances of Bool replaced with MyBool (bonus points if you can leave alone the bits that don’t rely on the type class). Now the two Bools don’t unify, there’s no overlap, and YOU GET A TYPE ERROR WHEN YOU TRY TO USE ONE IN THE OTHER CONTEXT. You can coerce if it’s used in a representational context but not a nominal context. Everything is great! The trick though is that GENERATIVE MODULE INSTANTIATION SHOULD REUSE REPRESENTATION. That might be tricky.
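A minimal sketch of the newtype end-state the elaboration is aiming for (the class and its `describe` method are hypothetical stand-ins for Data.SomeTypeClass; in real code the library's instance would live in another package):

```haskell
import Data.Coerce (coerce)

-- MyBool shares Bool's runtime representation.
newtype MyBool = MyBool Bool

-- Hypothetical class standing in for Data.SomeTypeClass.
class SomeTypeClass a where
  describe :: a -> String

-- The library's instance.
instance SomeTypeClass Bool where
  describe _ = "library instance"

-- Our instance: no overlap, because MyBool and Bool don't unify.
instance SomeTypeClass MyBool where
  describe _ = "our instance"

-- In representational contexts, coerce is free at runtime,
-- which is the "reuse representation" property we'd want from
-- generative instantiation.
toMy :: Bool -> MyBool
toMy = coerce
```

Using Bool where MyBool is expected (or vice versa) is a type error, which is exactly the non-unification behavior described above.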



The key insight of the apartness check, in Conor McBride’s words, is not to test against a minimal model, but the maximal model. (That’s why it rejects things even if they only unify due to an infinite unifier.)
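One place this shows up concretely is GHC's closed type families (my illustration, not McBride's): a later equation fires only when the arguments are apart from the earlier ones, i.e. they fail to unify even in the maximal model, which admits infinite types.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Same reduces by its second equation only when a and b are *apart*.
type family Same a b where
  Same a a = Bool
  Same a b = Int

-- Same Char Char reduces to Bool, and Same Char Int reduces to Int,
-- since Char and Int are apart. But a type like Same a [a] is stuck:
-- a and [a] unify under the infinite unifier a := [[[...]]], so the
-- apartness check (testing against the maximal model) refuses to
-- commit to the second equation.
```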


Andy Gordon - Strategic Thinking for Researchers

[Alan Perlis quote, 1966]

It’s really about what you do beyond writing single papers.

At Simon’s talk yesterday, he gave a lot of good tips about how to write a paper, come up with ideas, etc. This talk is about longer term strategies: PhD, postdocs, your whole career. It’s worth thinking about this, because people who are passionate about research really make it their lives. It can be long hours in your early and mid career. If you’re putting that much effort into it, think strategically about how to make the most of it.

There’s no original contribution in this talk, and I feel well off the leash, pontificating about ideas that have impacted me, which I think I’ve tried at one time or another with success, but not rigorously evaluated. Of course, there’s no one correct strategy, and I don’t think people follow all of these.

A few years ago, with a colleague Thore, we had fun here organizing a workshop for researchers at Microsoft about impact. We called it, “Making a Difference, by Research.” The goal was, “How to have impact?” How can our work change the world? Let’s move away from the paper you’re thinking about right now, your final report, your dissertation… This workshop was a lot of fun, so the next year, we did the MSR Speed Dating Society (there were some misunderstandings about what this should be; one researcher said, “Andy, I can’t come, I’m a happily married man.”) The actual idea was that social links would lay foundations for transfer of expertise; have some fun, 90 minutes of meeting researchers you don’t see that often now that we’re on separate floors. It would be good to break down barriers. Some of the work I’ve done that was most influential was taking an idea from one area and putting it in another.

Most important thing: Know what you’re trying to do. [Two quotes, which seem a bit contradictory: Bjarne Stroustrup exhorting that it should be clear what you’re trying to build, and Wernher von Braun (rocket scientist): research is what I’m doing when I don’t know what I’m doing.] The reconciliation: you should have some conception of what you’re trying to do! What’s the most important problem in your field? (Responses: P=NP? Trans [indistinct] computer architecture? Artificial general intelligence?) Follow-up question: What are you working on?

I’ve had this happen to me in job interviews, and I’ve said… “Well, I’m not actually working on that.” Maybe you can’t switch all at once to working on it, but you can move towards it.

Serendipity (chance connections): this happens, but we shouldn’t rely on it. “Chance favors the prepared mind.” Pasteur was ready for the luck, ready to take advantage of it.

Richard Hamming: you have to work on your problems. His essay “You and Your Research” is about working in corporate labs, and what you really need to do. He set aside Friday afternoons for asking what the big questions are. But don’t follow a “peanut butter strategy”, where you spread yourself thinly over multiple different problems. Don’t do lots of little things.

Steve Johnson (TED talk) has analyzed where famous inventions in history came from, and we can take comfort: a lot of them came down to collective effort. Example: double entry bookkeeping (from Florence), but the idea of counting outgoings/ingoings had arisen in lots of places, although it hadn’t been written down. Another kind of invention: combinatorial (Gutenberg: ink, paper, press, movable type; all of these things existed previously, but Gutenberg brought them together). Third kind: “sheer individual insight”. Willis Carrier, in 1905, invented the air conditioning machine. Some customers said it’s hot and humid, and he figured out how to solve that. His eureka moment: it was misty out, maybe I can create artificial mist, and he figured out how to do that.

When I was a PhD student, I thought, “how am I going to come up with big ideas?” This eureka moment from nowhere. In Johnson’s analysis, individual insight very rarely comes up. Instead, collective effort and combinatorial ideas are far more common. Most inventions are not original; it’s just that nobody had thought of how to combine the pieces in a particular way. I find that quite comforting, because I feel good at combining ideas, but not necessarily at coming up with things from scratch.

As for where good ideas come from: “exploring the adjacent possible”, the set of ideas that are about to be found. “Liquid networks” facilitating combinations of ideas: great advice for scientists, live in a city; if you live in London, you’re more likely to go to parties with a broad range of people, and come into contact with different ideas. And the slow hunch: Darwin discovered evolution, and if you read his memoirs, he believed that Sept 28, 1838 was when he came up with the theory of evolution. But what’s interesting is that one of his biographers read his diaries for the year before, and it turns out he had essentially gotten the idea in some form before the eureka moment, and even afterwards, he hadn’t really completely got it. The point is: ideas come up, you have a hunch for a while that something is possible, and eventually things fall into place. It’s quite rare for it to be a single eureka moment, and even when people have them, they sort of didn’t! As for serendipity: LSD was meant to be medicinal, Teflon was for putting man on the moon, and Viagra was for hypertension, before they figured out it had, well, other applications. (laughter) Finally: error. People hate making mistakes, but you should make them; you’ll learn from them. Lee de Forest, a colorful businessman, invented electronics: the Audion (prelude to the triode and the vacuum tube). He made an amplifier out of it. He thought it worked on rarefied gas (true vacuums weren’t possible at the time), so he thought the amplification was due to ions, as opposed to electrons flying through. So he was wrong about how it worked, but he still managed to make it work, and became rich doing business in it (and he defrauded people, people defrauded him, four wives). I recommend this book.

Changing tack… Life in research labs. Don Syme invented F#. This lecture is about all stages of your career, and the nice thing about a research lab is you can take some time off to do something really substantial in the company. Don Syme is the posterchild for a great tech transfer success. Joining in the late 90s, he did .NET generics (in the runtime and the original C#), going “all in” 93-03 where he wasn’t doing anything else besides hacking C# code; and then he built a new programming language to show off what you could do with generics. (A bunch of things on the slide.) He was here speaking to us researchers, encouraging colleagues to get involved; it’s quite a challenge as researchers to reach out to people in Redmond where they’re building products with short timelines, whereas here we have a lot of freedom, and we’re in a different management chain. For them to trust us is a big deal: we don’t report to them. Don’s advice: you need to go in, be fully dedicated, get respect, find actual problems they care about, and apply your ideas to enterprises the company cares about. Also, often the company has a shared vision from the top (Gates was behind .NET, when Microsoft needed an alternative to the Java ecosystem: a managed code runtime for webservers using C# and F#). He’s got a lot of stories about this; sometimes it’s difficult due to different values, but it can make a really big difference. There’s a lot of upside.

Related to that: George Heilmeier, also a character. He was at DARPA (the American funding body) in the 70s, a great engineer who pioneered the liquid crystal display. There are 100s of LCDs in this room: talk about impact! But when he was at DARPA, he had a checklist for projects:

  • What are you trying to do? (GOAL)
  • How is it done today, and what are the limits?
  • What is the new approach, and why will it be successful?
  • Who cares?
  • If you’re successful, what difference will it make? (IMPACT)
  • What are the risks and payoffs?
  • How much will it cost?
  • How long will it take?
  • What are the midterm and final exams to check? (REVIEW)

It’s the “Heilmeier catechism”. But this is a bit abstract. So let’s imagine I was Don Syme. What would I say?

What are you trying to do? Allow the benefits of typed functional programming on the .NET platform. This is a little jargony, but for people in MS this is jargon-free.

How is it done today, and what are the limits? C# had no generics, no first-class functions, no type inference. No benefits of code reuse.

What is the new approach, and why will it be successful? Simple syntax compared to C#. He had a good start: the CLR was going to support multiple languages, so there were in fact enough instructions that compiling F# would be feasible.

Who cares? Well, people making websites maybe don’t really care. But there was a potential market in the finance industry: quants who wanted automated trading were super interested in functional programming and F#. When it turned into an actual product, this argument made people decide to go with it.

If you’re successful, what difference will it make? Here there was a business angle. Microsoft tries to lock in customers to their technology. This would help bring financial institutions onto .NET.

What are the risks and payoffs? Risks: little support from product groups (and this happened, but Don stuck with it). The payoff: transferring the ideas.

How much will it cost? Don went all in, but this was maybe just about what a single person could do, before getting other people involved.

How long? A year. But it took 8-9 years before it became a real product. There were a number of midterms: compiling itself, free download, customers…

If you’re writing a grant proposal, think about these questions and argue for what you’re going to do.

Seek criticism. (beat) This definitely applies as a PhD student. You know what you’re trying to do, you’ve got some ideas on how to get there, but you’re inexperienced and don’t know the literature; you should put yourself out there and get some feedback. Write proposals: they’re great, despite the complaints. It’s really good to write down what you’re trying to do, and get feedback from your formal committee. Also get people together and force them to listen to your talk. If you don’t get feedback, maybe they’re critical about it but loath to say so in public. Take them aside at some point and get their feedback later. John Wheeler: make mistakes as fast as possible.

Reviews and planning of projects. It’s easy to fool yourself, hard to fool peers. Get the ideas out there. Get the feedback. Don’t worry about failures. I love this picture: the two people who gave me my big break: Needham and Gates. Bill went to Roger and said, “Hire the best people you can, let them do what they want, and if all the research projects have succeeded, you have failed!” The idea is, if people propose that they’re going to do something, and then on review it’s all happened, that means they’re not pushing themselves hard enough. It’s as if people said, “This year, we’re going to go to Sainsbury’s.” What you really want is, “This year, we’re going to go to the moon.” And then at the end, they say, “We didn’t make it to the moon… but we made it to the space station.” In terms of numbers, maybe we’ve had 100-200 projects, big enough that a few people are working on them, and most of them successfully produced papers. But we’ve had 2-3 successful moonshots: F#, Kinect; that’s why this lab was set up. He wanted people to feel empowered to go for big things. He’d already got things in development; he wanted researchers to dream big and occasionally have big impact. That should apply to research in all situations: universities and corporate labs. The reason the govt wants unis to do research is big innovations for new businesses, etc.

Don’t be seduced by proxies. As time goes on, you’ll find you’re invited onto PCs, people cite you, you’re asked to give invited talks, you have software. Maybe this is why you came into research (e.g. you want to get software out there), but you really wanted to change the world, not to sit on PCs. Pay lip service to those things: they’re a necessary evil, but don’t confuse the proxies for what you’re trying to do.

Work in Collaboration. Collaboration is great! When you do your PhD, it’s individual, but when you become more senior you get to work with people. Edsger Dijkstra’s slogan: “Only do what only you can do.” Figure out what your unique contribution is, and do it. If you’re on a project where anyone else could have done it, you’re wasting your talents. Your team needs pigs, not chickens: it’s about the degree of commitment people have to a project, where some people are driving, and some people are helping a little bit. The analogy is the breakfast plate: the chicken is involved, but the pig is committed. Across disciplines? Walter Scott: “One half the world thinks the other daft.” We always divide into different groups and think the other side is daft. That’s a thing to be wary of; it means the paper you write for POPL is not the kind of paper you’d send to an OS conference, even if it applied PL to OSes. Communities have different values. Subjects are conveniences for administrators: it’s just science at the end of the day (e.g. machine learning, which might be called statistics, ML, statistical chemistry, or … lots of different names).

Do interdisciplinary work… but AFTER YOUR PHD. It’s pretty risky to do it during your PhD; you need to master one discipline first. And only have one specialist per discipline, or the two PL experts will argue about irrelevant nonsense in their discipline rather than collaborating.

More slides: two about theoreticians, one about practitioners. Theoreticians: Robin Milner, a prof when I was an undergrad in Edinburgh; he established a lab for the foundations of CS. His emphasis, which was unusual in theory: he wanted interplay between theory and practice. The design of computing systems can only succeed if it’s well grounded in theory, and important concepts in theory can only emerge through protracted exposure to practice. Test theory in practice, like physical scientists with controlled experiments. Take theories and try to do practical things with them. LCF was a theorem proving system, which needed a typed programming language, because he needed to formulate unproved formulas as goals to be proved, and then have an abstract type of proved theorems, which could only be populated through the inference rules of the logic. He built up a grand apparatus of functional programming, and in the course of figuring that all out, he invented ML. ML has gone on to be hugely influential. If you’re a theoretician, it’s great to explore math, but you need to figure out how it applies to practical things.
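Milner's abstract type of theorems can be sketched in a few lines of Haskell (a toy kernel for a fragment of implicational logic; the names and rules here are my illustration, not LCF's actual kernel):

```haskell
import Data.List (nub)

-- A toy formula language: variables and implication.
data Form = Var String | Imp Form Form deriving (Eq, Show)

-- A theorem is a sequent: hypotheses |- conclusion.
-- In a real LCF-style kernel the Thm constructor is NOT exported,
-- so the inference rules below are the only way to build one.
data Thm = Thm [Form] Form deriving Show

hyps :: Thm -> [Form]
hyps (Thm h _) = h

concl :: Thm -> Form
concl (Thm _ c) = c

-- Rule: A |- A
assume :: Form -> Thm
assume a = Thm [a] a

-- Rule (modus ponens): from H1 |- A -> B and H2 |- A, get H1,H2 |- B
mp :: Thm -> Thm -> Maybe Thm
mp (Thm h1 (Imp a b)) (Thm h2 a') | a == a' = Just (Thm (nub (h1 ++ h2)) b)
mp _ _ = Nothing

-- Rule (discharge): from H |- B, get H \ {A} |- A -> B
discharge :: Form -> Thm -> Thm
discharge a (Thm h b) = Thm (filter (/= a) h) (Imp a b)
```

The point is the same one Milner needed ML's module system for: soundness lives entirely in the small trusted kernel, and everything built on top can only combine theorems, never forge them.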

Moving along: Eric Ries has a book, “The Lean Startup.” It’s not too much of a stretch to take his ideas for startups and apply them to your research. This doesn’t apply to everything: maybe don’t do this for theory, but for actual things, creating new products/services under conditions of extreme uncertainty. Learning what your customers want, and what will work, is what a startup is trying to do. It’s really easy to kid yourself about what customers want (inventing something that only works for you). Validated learning: the “minimum viable product”; you shouldn’t wait until the software is completely ready before giving it to actual people. As soon as possible, get feedback from people to guide what’s worth investing in. He wants a version that lets you iterate: build-measure-learn, build again. His example was a video. Dropbox was a hugely successful startup; they didn’t know what features they needed, but they cooked up a video of the experience that you could have. A lot of people wanted it. The company Aardvark wanted a website which would shop locally, fulfilling menus. The idea was that the website would suggest a menu, and figure out which local suppliers would give you the ingredients. They weren’t sure if this would work, so what the CEO did instead: he put out some adverts for people who wanted the service, found one person prepared to pay $10/week, went down, sat beside her, asked what would you like for your supper this Saturday, had a conversation, figured out what things she would like, and figured out how to source them from local supermarkets. No software, no investment, it was just him; it was key that the person was paying him. Validated learning: someone interested in what you’re going to give. It was a success, even though it took a while to get the website going.

If you’re trying to build something practical, put a minimal product in front of someone, or possibly not a product at all. Another example: a website to answer questions (AI), a Wizard of Oz type thing. Behind the scenes, there were people who actually answered the questions. I don’t know if that worked out, but they got some information about what questions people asked. Think out of the box.

Work with the system. Sometimes people don’t. There are huge arrays of resources at your disposal. Clay Shirky has a book, “Cognitive Surplus”; he’s a thinker about the internet. His thesis: in the developed world of white collar workers, people have hours free in the evening where they might contribute to an enterprise, such as open source coding. These people you can exploit… well, not exploit, but they’re there. Don Syme did that: his project, in the last few years, has exploited the fact that people play with F# in their hobby time and contribute code. Lots of stories about open source. It applies to science too: there’s a lot of citizen science going on, which you might exploit. Galaxy Zoo shows users distant images of galaxies, and humans are trained to classify them in different ways, sourcing data that way. Maybe you can come up with some crowdsourcing idea… Also, if you fight the system, be very careful. Do you want to change the system, or do top class science? Some guy wanted blackboards with chalk, made a huge fuss, fought the system, and prevailed, but he spent days and hours complaining, and it was kind of pointless. He told me afterwards that, in the end, he’d have been much better off teaching students and doing research. Another example: someone joined MSR some time ago, and he wanted f.bloggs in his email address. Richard made a big mistake… no one had a dot in their email address, so none of the systems (personnel, HR) were tested on it, and sure enough, some crucial HR system failed, and he didn’t get a bonus. Maybe I exaggerate, but he wasted a lot of time. Go with the system!

Finally, invite yourself places. This handsome young man: in 1992, when I ceased to be a student and got a PhD, I got a paper into a conference in Boston. “Sure Andy, we’ll pay for the airfare,” and I thought, it’s a shame to go for just four days, so I thought, “Why don’t I invite myself to a few universities?” And sure enough, it was great: the unis said yes. I flew to Boston, then Yale, DC (the Watergate building), in the middle of the Andy Gordon North American Lecture Tour: you gotta put yourself forward (no one else will). Went all the way up to Calgary (chased by a bear), eventually back to NJ… put yourself forward. Invite yourself. That’s the specific suggestion; the larger point: if you get money to go somewhere, don’t just go to one uni, ask around. People love to have you visit and give talks. They often pay for accommodation… some unis paid a fee! $100 checks, it was great. Do that! It works! I talk to a lot of grad students, and no one has ever heard of this… you should do it.

Now I’m going to get spiritual. These are great jobs, and a lot of fun, but they’re demanding, stressful, and anxiety-inducing. No one will tell you what to do, but if things don’t work out, it was your ideas. Strategies: get some exercise (picture of Turing as a marathon runner), make sure it’s fun (picture of Perlis), and (picture of Kathleen Fisher) she is philosophical about how to pursue a career as a professional, in the sense of dividing up priorities: work, family, community. She has a lot of advice; check the PDF.

Five Regrets of the Dying. Of the top five regrets, one is “I wish I hadn’t worked so hard.” Take it easy.

12 resolutions for grad students. Maybe Matt Might works too hard, but he’s got some great resolutions. Check it out. January: map out the year. February: improve productivity. Embrace the uncomfortable (prove a theorem?). Upgrade your tools. Stay healthy. Update CV/web site. Network. (Put yourself forward, because no one else will.) Say thanks. (If there’s something you loved, drop them an email. You’ll be glad on your deathbed.) (Simon said the same!) Volunteer for a talk. Practice writing. (If you’re happier coding than writing, do some writing, maybe not at the paper deadline.) Check in with your committee. Think about the job market. Good time to think about internships (December). Email someone. Don’t imagine the system will automatically figure out that we will pull you in.

First part of homework: Time management: manage email time better. Delete/delegate/do/defer. Process it really quickly. Switch off notifications that email has arrived. No email on weekends, no sending email then. If you’re senior, DEFINITELY don’t send email on weekends. Heads of depts are bad at this. YOU choose a boundary.

Second part of homework: organize a speed dating society. “Andy, I’m just a grad student, no one would pay attention to me.” But they would pay attention to you. Above a certain size, institutions become siloed. Tell your head of dept or advisor; they’ll say go for it: lead a meeting, show professional initiative. YOU cross a boundary. If you do this, email Andy and say it was a huge amount of fun.

(summary slide)

Q: What advice do you find people disagree with?

A: Email on weekends. Also, criticism will often not be imposed on you, you will just be met by deathly silence. So seek people out and ask them for their opinions.

Q: In the Heilmeier catechism, you bring up “Who cares.” But in research, you don’t necessarily have that foresight.

A: There must be someone who cares. Ask your advisor why they care, and they should have an answer.

Q: What if it’s just pushing to the next level?

A: Then the people who care are other people in the community. We’ve got customers at different levels. The immediate customers in theory are often people in the community.

Q: What are some activities you did in the speed dating society?

A: It was very simple. 24 people, a 1-minute surprise on a slide, and then we paired people up and had 5-8 speed dates, 5 minutes of conversation each.

(Heard afterwards: “Everyone likes to give an opinion, so ask them for their opinion.”)


Sketch: Concept for an intergalactic restaurant

The trope of an intergalactic restaurant (or bar, or whatever) is a motley mix of species from many different places. I.e. it’s an opportunity for the special effects crew and the alien species designers to show off.

What if you didn’t have a budget for that, and had to cast it all as human actors? That would be a very boring intergalactic restaurant, wouldn’t it?

Suppose that the year is 2401, intergalactic travel is common, but the majority of alien species (humans included) have not gotten over a deep-seated weirded-out-ness about interacting with aliens. This poses an interesting problem for proprietors of major travel nexus points: one simply can’t arrange for all foreign species to never be in sight. So, in typically overengineered fashion, the designers of these crossroads have decided to use virtual reality technology to provide every traveler the illusion… that every single one of their fellow travelers is their own species!

Imagine a backwater tourist who has just touched down at their local spaceport. The illusion is in full effect: the scene he sees is by no means dissimilar from the hustle and bustle of the airports he is familiar with, men and women in suits walking by. And in the crowded anonymity of a space like that, all of these other characters might as well be philosophical zombies. But the illusion is only skin deep: they may look like humans, but they certainly won’t act like it. And that could lead to some deeply disorienting interactions…


Bridge 2014-06-26


Board 1: When declarer leads low out of hand in D, DUCK IT. Either declarer has Kx, in which case you’re not losing anything by ducking, or partner has the K and will win it for you.

Board 2: Interesting defensive problem for S, given the auction 2D 3N. Should they switch to a heart? If N has AQx, that beats the contract. But clubs could also lead to victory.

Board 8: Even if partner is not doing what you want, play your best game. In an ending of JT52 versus declarer AKQ3, DON’T LEAD THE 2!!

Board 10: ALWAYS ANALYZE THE LEAD. In this case, a 6D lead by E; the rule of 11 tells you that you should let it RIDE.

Board 21: You can make inferences about lengths from opponents’ leads. Prefer to lead low to an honor on dicey combinations; you can pick up the doubleton T that way. It would be better to get the count, although in this case you can’t. (NB: low to the J unblocks, so it might be better)

Board 23: This is a pretty weird hand, but on the play, I went for a ruff-sluff in spades without cashing our side club winner. This slipped us a trick. Also, the club position is difficult: Qxx looking at dummy’s xxxx. Small is right if declarer has KJ, but the Q is right if partner has AJ. The trouble with a small club is that it can induce a misguess in the latter case.

Board 27: Dodgy question about the diamond raise after auction proceeds 1D (1S); P (2C); 2D (2S) all pass. With worse spade spots, raising diamonds is clear, but KT942 is fairly chunky. With a diamond stiff, might be worth doubling.

Board 28: Never mind partner’s lack of a double. Count in clubs is important, because holding up once is important to prevent declarer from enjoying the good clubs. Partner needs to exit a spade after cashing hearts.


Cod Poached in Court Bouillon

Court Bouillon sounds complicated but it’s actually very simple. I didn’t have any saffron (oops) and subbed new potatoes. Cod was pleasantly flaky. Very easy, would cook again!! Paired with white wine and sauteed shallots and asparagus.


The relationship of GHC and Cabal

Yesterday, while chatting with Simon Peyton Jones, I got a better picture of Simon’s mental model of how GHC and Cabal (the library) fit together. Essentially, Simon has an imaginary firewall between GHC and Cabal, where if there is any package-related complexity that the core GHC doesn’t need to know about, it can be pushed into Cabal. GHC has a low-level interface that can be implemented simply, and Cabal is responsible for “pushing the buttons” on this interface so that GHC does the right thing.

Thus, one road to understanding how Cabal works (and perhaps why it’s failing to build some package of yours) is to understand what knobs it has available from GHC.


Quentin Carbonneaux - End-to-End Verification of Stack-Space Bounds for C Programs

Toyota unintended acceleration; stack overflow. Stack usage of embedded systems important.

Quantitative logic for C (for stating stack usage) as well as a (certified) automatic bound derivation program. Extends CompCert

CompCert models behavior as traces; now talk about call/ret in trace

Trace weight = maximum stack usage (supremum over prefixes)

Correctness: source does not go wrong, and all traces have weight smaller than stack size

Assertions in the quantitative logic are natural numbers (saying how much stack space is available). 0 is True, infinity is False. The logic works the way you would expect.
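A tiny executable reading of the two definitions above (my sketch, not the paper's formalization): trace weight as a max over prefixes, and entailment between numeric assertions.

```haskell
-- Sketch: a trace is a list of calls and returns annotated with
-- frame sizes; its weight is the supremum of stack usage over all
-- prefixes of the trace.
data Event = Call Int | Ret Int

weight :: [Event] -> Int
weight = maximum . scanl step 0
  where
    step depth (Call n) = depth + n
    step depth (Ret n)  = depth - n

-- One plausible reading of the quantitative logic: assertion n says
-- "at least n bytes of stack are available", so 0 always holds
-- (True), infinity never does (False), and entailment is just >=.
entails :: Int -> Int -> Bool
entails n m = n >= m
```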

"Multipel function arguments": varargs?

Q: Do you support embedded code pointers? Function pointers in memory.

A: The logic is basic, so we don’t support it. But if we used XCAP, then we should be able to use it.

Q: How did you prove soundness? Augment the semantics with calling depths?

A: Continuation based definition

Q: Do you have recursion? (Yes) Does simple induction work? (That’s not simple)

Q: (Alastair Reid) This is really important. But embedded systems have another problem: interrupt driven. So it’s really important to know how much stack the interrupt handlers are using.

A: First we need a semantics of C with interrupts. Assuming we have that, we can bound it with interrupt handlers.

Q: They are often nested. So it’s important to reason about priority levels.


PLDI Artifact Evaluation Session

(Shriram Krishnamurthi) We wanted to explain this process, since it’s new to PLDI and it has started to happen at other conferences. (Insert description of process here: only after paper acceptance, AE does not change the status of the paper)

What are the criteria? Is it good? Does it build? Actually: evaluate against the expectations set by the paper. About the reviewers: “They’re young and naive and actually wrote long reviews.” One paper had 29 comments. “People on the PC actually responded and stuff, which is amazing.”

52 papers accepted, 20 artifacts submitted. It’s a bit questionable that only 12 passed. That’s low; maybe a little disturbing.

What are some quotes about good artifacts? (Sometimes reviewers thanked the submitter, etc.) Some people were positive. High praise for one artifact: “I’d be happy to use it in a production web site today.”

Distinguished in artifact evaluation: Doppio: Breaking the Browser Language Barrier (Vilk, Berger). Java bytecodes to JavaScript, and virtualize filesystem and everything else. “Worst of both worlds!” (chuckles)

Questionable reviews: “slightly crashy, undocumented format, but it works”. “The virtual image didn’t work, but it was easy to build the source archive.” (AEC did their job!) “Once it runs, it runs, but getting to that point wasn’t easy.” That is shady. “Artifact is easy to use but insufficient to reproduce the experiment.” “I couldn’t reproduce quantitative results because there were no parsing for output files.” “Couldn’t reproduce what happened in the screencast.” (There is an innocent explanation: the paper was based on original data, but the artifact was after the paper was accepted.) “The configure script exited and just printed ‘Sorry’.” “Too many dependencies!”

There was one paper where the AE might have affected the actual acceptance: “Comparing to related work without faithfully implementing the other optimization.” We wrote to the authors, but we weren’t set up to have a detailed discussion; it was very questionable.

(this ends the slides)

(Emma Tosch) This is for my own benefit, since the OOPSLA AE is due today. I’ve been making lots of changes to things: I have data, and I was wondering: for the system and the analysis I do, these are going to be different, so I need to give an explanation about what’s different. How much documentation should I give and how much explanation? Nothing in my paper depends on it; I’m just reporting. What do you recommend w.r.t. the goodness of science and reproducibility?

(Jan Vitek) I think you should think about the most important criterion: expectations set by the paper. It would be nice if they can double-check the things in the paper. If some small things change, that’s fair. You’re constructing a closure: here is everything from the point of the submission, to reproduce the submission. As long as you keep the data sets around, if you change the format, well, you should have also changed your scripts.

(SK) Explaining what happened is fine too. But if something mysteriously disappeared, you don’t have a replayable artifact. It doesn’t invalidate the research, but it’s an asterisk. You might be able to explain why we couldn’t get the stamp, e.g. “here’s the tool that disappeared.”

(ET) One problem is there are many different types of artifacts. The trouble with a dataset is that the analyses are often very sensitive to the implementation. So if you’re using one library to do a simulation, your RNG changes. Reproducibility and understanding bounds can be difficult.

(SK) Yes, so some papers mention version numbers.

(JV) And some people do VMs so that you can have true reproducibility.
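As an aside: recording the versions a result depends on, short of shipping a full VM, is cheap to automate. A minimal sketch (the helper name and package list are illustrative, not from any real artifact) of a manifest an artifact could ship alongside its results:

```python
import importlib.metadata
import json
import sys

def version_manifest(packages):
    """Record interpreter and package versions so reviewers can check
    they are re-running against the environment the paper used."""
    manifest = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            manifest[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            manifest[pkg] = None  # flag missing dependencies, don't crash
    return manifest

if __name__ == "__main__":
    # The package list here is illustrative; a real artifact
    # would enumerate its actual dependencies.
    print(json.dumps(version_manifest(["pip"]), indent=2))
```

Checking such a manifest into the artifact makes “we used library X version Y” a verifiable claim rather than a footnote.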

(SK) But if it’s that sensitive, why do you trust the result?

(JV) If there’s a range, and you’re in the tolerance, that’s OK. If you get varying numbers, which are not discussed in the paper, well, that’s questionable.

(SK) One quote from a reviewer, which I didn’t put on the slides: “I got something roughly similar, but I got so many different answers, it was patently clear the paper should have error bars.” Well, of course you should have error bars, but it’s possible that the authors were picking desirable data.
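For what it’s worth, the error bars that reviewer asked for are cheap to produce. A minimal sketch (the timings are made up for illustration) that reports repeated runs as mean ± standard error instead of a single cherry-picked number:

```python
import statistics

def summarize_runs(samples):
    """Collapse repeated benchmark runs into mean +/- standard error,
    the minimum a run-to-run-variable evaluation should report."""
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample stdev / sqrt(n).
    stderr = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, stderr

# Illustrative timings in seconds; a real artifact would emit these.
runs = [1.02, 0.98, 1.05, 0.97, 1.01]
mean, stderr = summarize_runs(runs)
print(f"{mean:.3f} ± {stderr:.3f} s over {len(runs)} runs")
```

A reviewer who lands inside the reported interval can call the result reproduced; one who lands far outside it has something concrete to flag.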

(Adrian Sampson) This came up when I was doing reviews: some people placed more weight on reproducibility and some on general expectations. I think it’s worth teasing this out. The paper expectation is different. Do we want to reproduce the evaluation? The paper is not just the evaluation. I was willing to be more lenient; the evaluation might be logistically complicated, while the overall description of the tool seemed more important. That’s not necessarily the argument to make, but I think it’s an argument for being clear about the role of reproducibility.

(Eric Eide) This was pretty tricky. I had to explain multiple times that the criterion was not perfect repeatability. There are lots of reasons why perfect repeatability is not attainable. There are a lot of performance numbers! General trends? Whatever. That said, I think there is an expectation that the tool should plausibly support the conclusions made in the paper. If there are questions about variability, maybe the paper should note that. In general, part of the idea is to improve repeatability and move toward an understanding that we should have a scientific process.

(JV) Another benefit is we have to be much more precise about what we did. In the paper, we might say, “I’ve implemented a C++ compiler.” Then a reviewer might throw any C++ program at it. But more likely, they’ve implemented a compiler for a subset of C++ for parallel loops. (SK: you actually implemented a compiler for Pascal). Does the description in the paper match what was actually implemented?

(EE) The point is to put the D back in PLDI. Someone made a dialect of Python; is it easy to use? They have a bunch of language constructs, a semantics part, and then a big user study, whatever. Then they give you the compiler. What would you expect to be able to do with the compiler, to evaluate it? The example programs should go through, maybe the semantics is formal, but the user study is not reproducible.

(SK) Well, be careful: this is the question of repeatability versus reproducibility. The user study is not repeatable, but we might still evaluate the raw data. To get back to the high-level point, reviewers often have an obsession with exactly reproducing what the paper said. For some papers, it is obvious that you can’t. Embedded hardware you don’t have? Well! But for things that look similar enough, you might think you can. I think we shouldn’t be too obsessed, but I’ve published papers where we generated some number of states (e.g. two) and got some speedup, and there was a lot of writing about it. If someone else gets five, did I really understand what was going on? Maybe there’s a problem.

(AS) Maybe, in addition to this, as an overall goal, say: “Give us all the tools to reproduce the evaluation, or tell us what the gaps are that are infeasible.”

(SK) It wasn’t good enough?

(AS) It felt like people weren’t totally sure… often people said, here are the things that are reasonable to reproduce, but didn’t explain the other things. This left reviewers at a loss. It would be good if things said: “Here’s why there’s variability.”

(ET) I like the idea about responses: maybe having a conversation for making the paper better. For AE, that seems necessary, in that, for error bars, maybe people weren’t thinking about it, or were making typos, especially in last-minute submissions.

(SK) Part of the problem is this: every time we’ve run this process, there was not enough time. There are four weeks to do it: it has to be before the camera ready. There’s not enough time, but we did make up an ad hoc response period this year. We sat on the reviewers so they did the reviews ahead of time. We told the authors this was ad hoc, gave them the full reviews, and told them they were welcome to respond. We got some responses; some were not prepared. It’s just really constrained.

(Carl Bolz) Did it make a difference?

(SK) I can’t think of a case.

(Paolo Giarrusso) Was there a response phase for all papers?

(EE) We only solicited ones where there was a question.

(George Kastrinis) Given the code that cast doubt, do you think the process should change in the future to take artifacts into consideration?

(PG) Yeah, didn’t we just lose this data (from the AEC)?

(SK) All reviews are like that. At ESOP, people were very concerned: how is this going to affect the process? The community was fairly conservative. The idea was, if we do AEC after decisions, then nobody will be hurt by this.

(Ben Liblit) Even the bad papers!

(JV) Overall the number of bad papers is not overwhelming, and the whole point is to encourage people to do the right thing. If it’s punitive, then it will change the tenor. People know that the paper is accepted, so they are trying to be encouraging and helpful. So it was an artifact of the genesis of the paper, but it’s a feature we should keep.

(SK) Here are the things people were concerned about: http://www.artifact-eval.org/motivation.html (Design Criteria). (He singled out the concern that exposing artifacts to the general public is unfair to the group.)

(ET) About exposing artifacts and people poaching your research: do we really want to be a community that has this fear?

(JV) Well, in Biology, you explain how you did it, but you will not share the data. It cost you millions, it’s precious, you will not share it.

(EE) Well, in computer science, things tend to be very open.

(SK) What kind of community do we want to be?

(ET) Think about informal publications: I have a research blog where I write lots of details. Maybe two people read it.

(SK) I didn’t know you had a blog!

(ET) My understanding is that, today, that’s considered a formal document—if there were bad-faith behavior, it functions as a lab notebook. But working in public seems to foster a sense of scientific community and shared goals, rather than the economic system of biology, which is set up to divide people.

(SK) My view is if someone builds on it, I get citations without doing any work. That’s great!

(JV) The reward system is not quite right…

(SK) Can people write papers based on your work? Then you will get your next NSF grant for free. “This is something people will use!”

(ET) Do people ever contact original authors?

(SK) Never.

(Jean Yang) How much is work evaluated on how easy it was to set up the environment? Should we build environments to improve that?

(SK) That was not the goal. Even if painful and undocumented, if we could push it through, we accepted it. Here is the social problem. If we reject most things, then the AEC process would die: 20 grad students rejecting papers, and we’d have 50 professors with pitchforks saying what a terrible process this is. We want to encourage people: they did a lot more work than they needed to! The primary thing is not the artifact, it’s the paper: you are NOT EVEN REQUIRED to publish the artifact. The paper is key.

(JV) It’s a process. People don’t know how to package these things. Every year we’ll know better.

(Gregor Richards) I have a response. The dependencies comment was my comment. I wanted to say how excruciating it was, but it wasn’t part of my conclusions about the quality of the artifact. It was “In the future…”; to actually judge it, here are the other flaws.

(SK) And in the future you might get rejected; you won’t get an easy pass.

(JY) If we make it easier to submit artifacts, will more people do it?

(SK) Well, we provided links to many tools for doing this. See the section “Code Artifacts Packaging.” This year we saw a lot more Docker instances.

(Ross Tate) I have some relevant experience, in the context of whether this should be required or not. This year I had a bunch of students working on established GitHub projects, and the biggest headache is setting things up. These things are just part of the process. It can be intimidating, training people on these things. As an untrained user, it’s very intimidating. But if it’s not required, then maybe it doesn’t work out.

(GR) That’s why one suggestion is to provide a VM image.

(RT) To have someone compile my code, that is not easy!

(SK) That’s why I really like the example where the VM did not work, but building the source archive did.

(Swarnendu Biswas) I submitted an artifact and benefited from it; one reviewer told me about documentation issues. Do you plan on making the response period a mandatory part, or keeping it ad hoc? It’s very possible to misunderstand the documentation.

(SK) We would like to make it a standard part, but there’s just not enough time.

(JV) We tried a “kick the tires” thing: in the first three days, try to get proof of life (build it). If you get stuck…

(SK) Well, you don’t notice subtle things… the VM runs, that’s nice, but that’s often not the tricky thing. It’s all the other stuff: the weird output format, numbers that don’t match up. We’re squeezed on time. Here’s why I think we had time this year: because we only got 20 submissions. If 40 had submitted (we expected 30), there’s no way we would have gotten it done.
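The “kick the tires” proof-of-life phase can itself be partly scripted. A minimal, hypothetical sketch (the tool list is invented; a real artifact would name its own dependencies) of a check an artifact might ship so reviewers fail fast instead of three days in:

```python
import shutil

def kick_the_tires(required_tools):
    """Proof-of-life check: before attempting the full build, confirm
    that the external tools the artifact depends on resolve on PATH.
    Returns the list of missing tools (empty means all present)."""
    return [tool for tool in required_tools if shutil.which(tool) is None]

if __name__ == "__main__":
    # Illustrative tool list, not from any real artifact.
    missing = kick_the_tires(["tar", "make"])
    if missing:
        print("missing tools:", ", ".join(missing))
    else:
        print("basic toolchain present; proceed to full build")
```

This doesn’t catch the subtle stuff (weird output formats, numbers that don’t match), but it front-loads the cheap failures.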

(GK) Are there valid reasons for not submitting an artifact?

(SK) Absolutely. We should not turn this into a litmus test for every paper. I write theory papers sometimes. These should not be pounded out of the system. There’s enough … going on from SIGPLAN. That’s why expectations are important. There’s a running joke since POPL has done AEC: is there anything besides Coq that should be submitted? Maybe you have a 30-page technical report, and that’s the artifact. It’s all a function of claims, and justification of claims.

(JV) Don’t knock proofs as artifacts. I have seen papers which have claimed to have proven something, but authors haven’t typed it up.

(SK) It turns out dogs eat everything: code, proofs, manuscripts…

(JV) I don’t think this is about fraud.

(SK) But I’ve been there.

(AS) What about super-complicated artifacts? I know of someone, anecdotally, who has a big cluster system; you need Infiniband between 102 nodes. I was trying to argue: take your analysis part, and do that, but he couldn’t find guidelines about how to extract the useful part.

(JV) Do they have any data? How did they compute that data? There is some output of a simulation, and a complicated set of scripts which made the graph; that is very valuable.

(AS) I agree, but I wish the guidelines for the AECs had ideas for how to deal with this. You don’t need the whole thing.

(SK) We also have to accept that at some point we’re not going to be able to do it. Let’s just continue to keep this after the paper is accepted, because to reject it would be crazy. That’s too bad; maybe you depend on very specialized hardware, that was the point… what are you going to do. There’s two copies… etc. One author offered to ship us hardware. We were like, well, we have three reviewers, we can’t clone it, it just isn’t going to work. Also, I think this is a case where the common case is not problematic, so I don’t want the uncommon case to hold up the common case. One of the things we do is give the decision back to the authors. They can choose to print it however they want: “This paper was not submitted AEC because could ship Amazon things.” What I want: I read a paper two years later, I want to know if anyone evaluated this. If I see AEC, someone evaluated it!

(CB) Here’s a thought: there’s a tension between making artifacts easy to evaluate for AEC, and making it actually useful for consumers of research. Shipping VMs is sensible for AEC, but not useful for sticking it in a pipeline. This is something to figure out over time.

(SK) We got some feedback: it was helpful to have someone outside the group run the damn thing. Multiple people! This automatically helps.

(CB) It also helps you find holes in arguments.

(SK) Some people had hardcoded usernames… but if the only thing I have is my machines, well, maybe you forgot. It’s hard to think of everything.

(RT) If the topic is to verify claims, then could another way to do it be to have the AEC work with the authors, rather than independently?

(SK) That’s painful. I’ve not known many cases where the authors vigorously disagreed, but some authors were mad. Imagine your paper is rejected… you write back and refuse to accept the rejection…

(BL) If evaluators are grad students, you don’t want them getting into a fight with a tenured professor.

(EE) The chairs acted as intermediaries between committee members and authors.

(JV) Signed NDAs…

(SK) Commercial software is tricky. We want people to be anonymous, there was commercial software freely available for people who sign the document. So Jan signed the document, and told the reviewers, “Don’t fuck up!” It’s not clear what to do in that situation.

(BL) Many people have contacted authors of papers…

(SK) People started to nod…

(BL) …asking for a copy of the tool. What happens: the email addresses for the first two or three authors bounce; finally you get a response from the professor, and they don’t even remember the project existed, because they never ran it. I’m cautiously optimistic to see what effect AECs will have in the long run, when five years from now I want to cite or build on an AEC paper. I hope it’s better.

(EE) There are two papers that have attached artifacts going into digital library. I think this is the very first time you can do that. Positive effect!

(BL) Trying to get and use artifacts has failed so often. I just hope… maybe it will be better.

(SK) I don’t want to give oxygen to a terrible piece of work, but I’m going to do it. Here is Christian Collberg’s buildability-of-artifacts study. They did a dreadful job when they found things… they did not make Arizona graduate students look very smart. One bottom line: Collberg is going to be at the TRUST workshop, and I’m trying to figure out the most measured way to tell him what I did. A bunch of us set up a crowdsourced effort to reproduce their reproducibility study. But they were unresponsive; finally, they talked to a coauthor, got responses, sure, and now, here’s what we have. Here’s the classification: disputed, etc… http://cs.brown.edu/~sk/Memos/Examining-Reproducibility We did this, there was a bunch of response, then interest died out (as usual), but if you read the emails, the conversations, it is truly scary.

(GR) I have a thought: we’re thinking so much about how authors communicate to the AEC which claims their artifact corresponds to, and in the end the stamp, but there’s a disconnect: communicating to the reader which claims it corresponds to.

(SK) Yeah… in computer science, more complicated mechanisms don’t work. If the evaluators have decided the paper is overall credible, it gets a stamp; otherwise it’s just like… when a paper is accepted, it doesn’t mean the PC agrees with everything. All it means is they signed onto something in the original paper. So there is some amount of trust.

(JV) In the long run, maybe they will submit the code for external review…

(PG) It’s good style… to update the paper to say what changed. I want to see how many papers were altered as a result of the feedback. Did you add error bars? The optimization paper? They would have had to withdraw the paper…

(EE) It would be nice to figure out how to work it into the process… here is feedback. Getting that to authors for them to modify things… I don’t know how to do that in the current schedule unless… if you participate in AEC, you get an extra week? Two weeks? That’s one way to get artifacts.

(SK) We have this problem camera ready three months before.

(JV) That could change.

(PG) What about loop stuff?

(SK) First stage?

(JV) Second stage. But we could do it at the end of first stage, because most papers.

(CB) It’s pipelining.

(ET) This goes back to Adrian’s comment, for things which are difficult to reproduce. Simulators: it’s nice to have one, but you don’t want to talk about the simulator in the paper. In terms of reproducibility, it’s nice to have alternatives. Clearly… it’s very selfish, but my work requires Mechanical Turk, and I also have a local host to walk through. This is feasible, but in terms of encouraging understanding that the pipeline works, do we emphasize simulators? It’s nice to be able to model a system.

(SK) In any case, you had one for testing.

(EE) I think having a version of your system which works in a simulator, to validate the parts you can validate, is an excellent idea. I have one foot in PL and one in systems, and many of my colleagues are in SDN and datacenters… unless I have a datacenter, I can’t reproduce, but there are tons of SDN simulators.

(ET) So should we ask people to submit this?

(JV) It’s not appropriate for all papers. Requesting something… extra work? We don’t want it. It’s good to have independent evaluation…

(SK) How are you going to make your case? They have to believe it. You take your best shot. There were all sorts of questions about how to write a paper, because a whacko reviewer didn’t… the advisor now can tell you what to do. I think with a little practice, we’ll figure it out.

(ET) Well, the thing about the simulator is the conflict between users and the AEC. Users have no use for a simulator. For cases where people didn’t need it… a proper system. I think we should try to encourage people to have these kinds of tools, but in terms of what we’re asking of people, maybe give clear guidelines, so you don’t have people saying it’s not appropriate at all.

(SK) That’s the really troubling thing. “Why didn’t you ask?”

(JV) The reason you have to say it often is because people don’t ask.

(RT) There are two things you’re trying to do, and they need to be separate: verifying claims, and verifying there is a useful tool (helping others). They are different things.

(JV) Our hope is that pushing on one pushes the other.

(SK) I think that’s true. It makes it all a little better.

(CB) The virtual machine makes it possible.

(SK) As an example, I’m trying to build on something ….

(JV) We’re looking for the least work. You did something, you did experiments to convince people.

(EE) The artifact is a message to research peers, not users. If you can address both, that’s great. We like pretty webpages, but users won’t care about some things research colleagues do care about. Users don’t care about source code; researchers do.

(RT) Well, it’s tough…

(SK) We hope it’s similar enough, so having a third processor is unlikely to be helpful.

(RT) Well, making my tool useful… if I’m not going to use it…

(SK) The tool paper on the tool is different from the result of using it.

(AS) Here’s a funny idea: have a list of stories of successful artifacts. Because everyone’s case is so unique. Once there was a person who submitted [X], or made a cluster thing and didn’t give us anything… they don’t have to be real stories.

(?) If I can’t reproduce your story…

(ET) We all hope we’re doing work we think is valid, I don’t want to shame my advisor by not earning a sticker.

(SK) Now we require a webpage. I think this is really important.

(JY) Is there a public announcement?

(EE) You have to look at the frontmatter.
