03 12 / 2013
Patrick Walton made an offhand comment about Rust and pauseless GC yesterday: “Well, you still need your memory allocator to be pauseless.”
25 11 / 2013
The discussion began with me stating that I wanted to build an incremental type checker. You know, because when you edit code in Agda and ask it to recheck the file, it takes half a second, and while that’s not that slow, it’s not that fast either, and it’s quite annoying. So I was going to do it for some sort of simple language, and see how feasible it was. And there’s been some previous research on incremental type-checking (cf. Matthias Puech’s thesis).
But then I realized, “Hm, how am I going to get the diffs which I am actually going to apply incrementally?” In SMT solvers, where one similarly builds up and removes constraints in the online theory solver, the constraints one needs to add are generated by the SMT solver and presented in a direct form. However, in the case of source code editing, this information is locked up in the editor. And when the editor writes to disk, it blasts out the entire file (note, also, why you do not want to edit very large files with most text editors). This is a limitation of the POSIX API, since there’s no way to insert data in the middle of a file.
So shit. Either you change the editor, or you don’t. If you change the editor, you go down the path of IDEs, and I sure as hell don’t have the manpower to pull something off there. But what can you do if you don’t change the editor? Well, you have to recompute the diff. But maybe this is not too bad: diff -u runs quite quickly (practically instantaneously), and you can go from there. (Maybe you don’t want a line-based diff.) So one should go from there, if one is serious about an incremental type-checker.
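To make this concrete, here is a minimal sketch of the diff-recomputation approach in Haskell, assuming the Diff package’s Data.Algorithm.Diff; the idea of feeding only the changed hunks to the checker is my own gloss, not an existing tool:

```haskell
import Data.Algorithm.Diff  -- from the Diff package on Hackage

-- Line-level hunks between the last-checked version of a file and the
-- freshly saved one.
hunks :: String -> String -> [Diff [String]]
hunks old new = getGroupedDiff (lines old) (lines new)

-- Only the changed regions need rechecking; unchanged regions (Both)
-- keep their previously computed typing information.
changedLines :: [Diff [String]] -> [[String]]
changedLines ds = [ ls | d <- ds, Just ls <- [pick d] ]
  where
    pick (First ls)  = Just ls  -- deleted lines: invalidate old results
    pick (Second ls) = Just ls  -- inserted lines: check from scratch
    pick (Both _ _)  = Nothing  -- untouched lines: reuse old results
```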
But then, Geoffrey and I got thinking about what would happen if you actually went about and changed the editor. The reasoning went like this: the reason why we can’t get diffs out of the editor is because we want diffs at the sub-file granularity; at the file granularity, we actually can get very accurate information at the POSIX level. So one possibility is to just explode our source files into very small files, and then somehow… recombine them in the editor. But we thought: why does editing lots of small files in a text editor suck?
The current model goes something like this. You have a number of buffers corresponding to files, organized by tabs, split buffers, etc. When you want to switch from editing one file to another, you have to take some action to switch buffers. When the number of buffers is small, the action cycles to the next/previous buffer; when you have a lot of buffers, you usually tab-complete the name of the buffer you want to switch to (CTRL+E in Eclipse). So it stands to reason that when there are a lot of small buffers, you do a lot of context switching, which adds up.
Then Geoffrey had an insight: “Well, this is just BarnOwl.” BarnOwl is an IM client with an interesting model where messages from different channels are interleaved, and one can narrow to a channel in order to view only the messages relevant to it. This is in contrast to the current ecosystem of IM clients, which mostly dedicate a separate window (e.g. GChat) or tab (e.g. irssi) to each logical discussion. So what if our text editor, rather than giving a separate window per file, presented them all together in one unified view?
One important implication of this is that many operations which are normally file-local suddenly and automatically become file-global. Visual commands such as page up/page down automatically jump to later files. Search and replace automatically extends beyond the boundaries of files. Saving is a global operation on all files. So a large swath of “inter-file” mechanisms becomes unnecessary. For example, instead of having a special command for jumping to a file, you could jump to a location by doing a search (assuming the files had headers of some sort). The result is a simplification of the UI, and the ability to edit a directory of small files as if they were one file.
One important detail to deal with is how file boundaries are treated. Clearly some sort of special marker needs to be inserted in these places, and some commands should respect these markers. (For example, should ‘gg’ jump to the end of all documents, like ‘>’ in BarnOwl, or jump to the end of the current document?) How do you create new file boundaries? What happens if you “cut” a file boundary? Should you be allowed to even select over boundaries? (Yes.)

To get even fancier, Geoffrey suggests that you should be able to stitch together interleaved bits of files in a single view. For example, if I am adding a new function, I need to define the function as well as its prototype in a header; in an ideal world these would just be together, so if I can say “here’s some code for the header” and then “here’s some code for the source file,” that would be better. (Geoffrey also thinks these views should not be saved, in analogy to Git staging versus commit.) One challenge, compared to BarnOwl, is that there is no immediate logical ordering for files, the way chat messages are ordered by time. The ability to edit the past also challenges the interaction paradigm.
Figuring out exactly what interaction paradigms make sense is an open question, and it will probably be hard to answer without actually building a minimal prototype. One possibility is to modify nvi (the new vi implementation) to support this interaction paradigm; it is far simpler than vim, and might be good enough to demonstrate the utility of the idea for other text editors to pick up. Geoffrey suggests that figuring out how to compile nvi (along with BusyBox) with Emscripten would be a good way of making it easy to let people try the editor out on the web (even better, build support for editing GitHub projects, since you really need some real data to tell whether it feels good or not).
Unfortunately, at the end of the day, it comes down to the editor. This is unfortunate, because I am not very keen on the future where IDEs have taken over the world. IDEs are hard to understand, and they cause projects to be completely dependent on them. But many would say, IDEs have already taken over the world. Do I really want to become the world’s expert in text editors? Well, it probably would pay well.
Thanks Gregory Price and Geoffrey Thomas for contributing to the discussion.
31 10 / 2013
Keith came by Stanford to give a guest talk about Remy (http://web.mit.edu/remy/). Here is the Q&A session:
You say end-to-end can be better, but you compared against humans. Is that fair? What about a computer-designed end-to-end network?
We don’t know.
You have a program that gets some inputs and designs an output. But there are still a lot of machines using TCP. So is it friendly? How will it play with TCP?
The respectful answer is that RemyCC will get quashed by TCP, because TCP is designed to fill up buffers. The same thing happens to TCP Vegas: if you run Cubic against Vegas, Vegas gets creamed. But it’s up to you, the network designer: if there might be another TCP, then say “get as much as you can”; in that case we do make it. So the important thing is what you care about.
How do you encode the presence of TCP in the optimization? TCP was designed to work well with existing flows. Can you include that as a constraint? What if the computer has to live under the same constraint?
Yeah, you say something like: 10% of the time they’re running some TCP. It’s part of the simulated model. It compromises performance when the node isn’t there, and improves it when it is there. So what is the cost of TCP friendliness?
Levis: The world has changed; we have bigger buffers, so it seems delay is a better signal. So what about looking at what networks looked like when TCP was designed? Was using ACKs to detect losses a bad idea, or did it just stop being a good idea?
I think it’s the latter. In 1985, it was hard to build up a large delay, since buffers were so small. If the delay was detectable, you were dropping packets already. But now you can build up a 10-second delay.
It’s like game theory, where if I open more connections, I do better. So why hasn’t there just been a huge collapse already?
This is outside my competence, but Layer 9 has imposed fairness. There was an arms race; 1986, etc. Netscape did 2 connections, then IE did 4, etc. Finally, Netscape and MS reached a truce and said they would not go beyond 4. Or BitTorrent opened as many connections as possible, and then Comcast decided to start injecting TCP resets. This was pretty unpleasant, so BitTorrent decided to stop using TCP. It’s a gentleman’s agreement.
So, about the question at the beginning: if the Internet just gave everyone a separate queue, have you compared against that?
Isn’t this Cubic/sfqCoDel?
It’s different… it tells you something, it’s a benchmark of performance.
If you did fair queuing with large buffers, you would get great isolation, but there would be large delay. You don’t need congestion control, but with Skype… we do live in a world like this; e.g. with a cell phone provider, you have your own queue, so you don’t contend with other phones.
25 10 / 2013
Ran into an interesting problem while working on my implementation of resource limits. The problem is, “When a resource limit is triggered, what is the appropriate response by the system?”
If we look to ulimits, we see that a resource limit is applied to a process/cgroup, and that when the limit is exceeded, the relevant syscalls requesting that resource start returning error codes; or, if it is not a resource like file descriptors, a signal is sent to the process (or all the processes in the cgroup, as the case may be).
In a more fine-grained resource limits system, the situation is a bit trickier. Let’s suppose that a resource limit is hit. There are two mechanisms we can use to notify users of the container (kept abstract, because this is a notion that is not easy to define), directly analogous to the situation in Unix-land. First, when a resource is requested, we can return an error (throwing an exception). Second, we can send an asynchronous exception to the “process” (the specifics of “which process” are somewhat problematic).
The second idea is unsuitable for a fine-grained resource limits system in which threads of control can change their current container (either through a privileged withRC call, or by attempting to evaluate a thunk owned by another resource container). The big problem is this: the set of threads to which an exception should be thrown is not well-defined. You might say, “Well, the set of threads whose current resource container is the dead one should be killed,” but any thread not in this set could later switch to the dead resource container, and it should be correspondingly killed. So this is only acceptable when the resource container associated with a thread is fixed upon its creation, so that the only way to switch into a new resource container is by creating a new thread. Thunk evaluation would change the current RC, yes, but a thread off evaluating someone else’s thunk should still die when its home container dies; and another thread with access to a dead thunk should receive an exception, much in the same way a thunk that throws an exception is replaced with an exception-throwing stub.
The first idea is more directly amenable to implementation. When a resource is requested from a container that it cannot furnish, an exception is raised. In some respects, this is quite similar to the second scheme; evaluating a thunk that requires allocation that cannot be furnished would result in a raised exception. A thread running in a dead resource container would not notice until it attempted to allocate memory. So this is a lazy scheme. Note that the idea here is to override how garbage collection is handled; in some situations running out of the nursery will not induce a garbage collection, but rather an exception.
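To make the lazy scheme concrete, here is a toy model in Haskell; this is my own sketch, not the actual RTS implementation (which would hook the nursery-exhaustion path in the GC):

```haskell
import Control.Exception
import Data.IORef

-- A container reduced to its remaining allocation budget, in "words".
data RCExhausted = RCExhausted deriving Show
instance Exception RCExhausted

newtype RC = RC (IORef Int)

newRC :: Int -> IO RC
newRC n = fmap RC (newIORef n)

-- The lazy scheme: no thread is proactively killed; a thread only finds
-- out the container is dead at the moment it asks for memory it cannot
-- have. (Atomicity is ignored here; this is a single-threaded toy.)
allocate :: RC -> Int -> IO ()
allocate (RC ref) n = do
  remaining <- readIORef ref
  if n > remaining
    then throwIO RCExhausted
    else writeIORef ref (remaining - n)
```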
To summarize the differences: one scheme is eager, the other is lazy. In the eager scheme, a thread may receive an exception even when it would have been able to continue evaluating in someone else’s container; moreover, the scheme has to be careful to ensure that stragglers who try to use the container’s resources later get appropriately killed. The lazy scheme only kills threads when it notices they are asking for something they can’t have.
Now, the lazy scheme seems like a clear win, in terms of implementability as well as its ability to enforce the proper limits (it is easy to ensure a resource is never used by stopping it when it is allocated). Furthermore, it smoothly interoperates with exception handlers: a non-allocating exception handler will evaluate successfully, while an allocating one will re-induce an exception, as is proper. But there are some problems.
The first problem is soft resource limits. The idea behind a soft resource limit is that when the limit is initially hit, you want to give the process some indication that it is on thin ice, but give it a little bit of extra memory so that it can successfully clean up after itself before aborting (or maybe even try again). Key to this scheme is letting the process know that something is wrong. Now, if we consider the Unix model, this would correspond to first sending a catchable signal, and then sending an uncatchable one. That is to say, the idea of a soft resource limit is entwined with the “eager” model of error delivery: we want to proactively let the “user” (once again, problematic) know that he should start cleaning up.
Here the lazy model breaks down. Once the soft limit is hit, you want to let the user know once, and then stop yapping about it. But you can’t do this with an exception, because you might be sending the exception to the wrong thread (one that was just unluckily evaluating a relevant thunk). So you might think, “Well, designate some master thread for the container” (notice the similarity to the eager scheme?) and let only it know; but you don’t want this either: whatever memory-hungry computation induced the soft resource limit will surely not be the same as the master thread, and will quickly blaze through the extra memory you so carefully allocated. By the way, this is a problem for the eager model too: you can deal with your own threads by inducing an exception on all of them, but someone else with their grubby hands on your container could use up the extra memory anyway. Perhaps the correct thing to do is to blackhole upon entry to a soft-killed container unless you are the master thread; the blackholes will either turn into proper exceptions or resume once the recovery phase is done. So this is quite subtle.
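Continuing the toy model from above (again, my sketch, not the RTS): a soft limit fires a one-shot warning the first time it is crossed, while only the hard limit raises. Note that the hard part just described, delivering the warning to the right thread, is precisely what the onSoft callback glosses over:

```haskell
import Control.Monad (when)

-- Reuses RCExhausted from the previous sketch. onSoft stands in for
-- however the "master thread" is to be notified.
data SoftRC = SoftRC
  { budget :: IORef Int   -- words remaining before the hard limit
  , softAt :: Int         -- warn when the budget first drops below this
  , warned :: IORef Bool  -- one-shot flag
  , onSoft :: IO ()       -- the warning action
  }

allocateSoft :: SoftRC -> Int -> IO ()
allocateSoft rc n = do
  remaining <- readIORef (budget rc)
  let remaining' = remaining - n
  when (remaining' < 0) $ throwIO RCExhausted  -- hard limit: raise
  writeIORef (budget rc) remaining'
  w <- readIORef (warned rc)
  when (not w && remaining' <= softAt rc) $ do
    writeIORef (warned rc) True                -- soft limit: warn once
    onSoft rc
```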
The second problem with the lazy model is that exceptions may not get delivered to all the threads that you care about. Note that when a resource limit is exceeded, and you’ve decided that the container is irreparably damaged, you want to deallocate all of its memory and kill all of its threads. But there are still these zombie threads which aren’t allocating but “belong” to the container, and that is bad news. But actually this is a non-issue; this is what the user-level IFC support is supposed to handle.
14 10 / 2013
Two years ago, I wrote this email to Chris Okasaki:
I was curious to know if you had done any thinking about what I would like to call “composable queries and updates.” The basic problem is this: when we define new data structures, we usually start off by designing the type and building some of the basic operations you might expect (insert, update, etc.), in which the main algorithmic ideas are presented (e.g. a novel rebalancing algorithm). Then, when we go off and try to implement the data structure in a library that people are actually going to use, we end up having to implement a combinatorial explosion of operations which have essentially zero extra algorithmic content, but are handy and/or impossible to implement without reaching inside the internal representation of the structure (one particularly poignant example is value-strict and value-lazy versions of update functions).
A few years ago, some people suggested that we might actually want to decompose most of these functions into two parts: a query component and an update component (http://hackage.haskell.org/trac/ghc/ticket/4887). One way of thinking about this is that a query takes the data structure to a (value option, zipper) pair, and an update takes a zipper back to a data structure. We reify the recursion, and then can mix and match queries and updates. Unfortunately, actually reifying the stack turns out to be pretty expensive in practice, so the proposal was abandoned. But at the very least, the mental model seemed promising, and I’ve been thinking about whether or not some extra compiler elbow grease could make this scheme work. One difficulty is that any such decomposition seems highly sensitive to the internal workings of the data structure.
Chris replied saying that he didn’t know a good, general solution to the problem.
Two years later, I now wonder: are lenses what I was looking for?
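For what it’s worth, here is a minimal, self-contained sketch (van Laarhoven style, as in the lens library) of why lenses fit the bill: a single definition of where to go in the structure yields both the query and the update.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype Const a b = Const { getConst :: a }
instance Functor (Const a) where fmap _ (Const x) = Const x

newtype Identity a = Identity { runIdentity :: a }
instance Functor Identity where fmap f (Identity x) = Identity (f x)

-- One type, two interpretations: pick the functor and a lens becomes
-- either a query or an update.
type Lens s a = forall f. Functor f => (a -> f a) -> s -> f s

view :: Lens s a -> s -> a
view l = getConst . l Const

over :: Lens s a -> (a -> a) -> s -> s
over l f = runIdentity . l (Identity . f)

-- A single definition of "where to go" in the structure:
_fst :: Lens (a, b) a
_fst f (x, y) = fmap (\x' -> (x', y)) f x

-- view _fst (1, "x")      == 1
-- over _fst (+1) (1, "x") == (2, "x")
```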
04 10 / 2013
Here is a problem that I have been dealing with recently…
In the hot path of a copying garbage collector, namely the code responsible for evacuating objects from one region to another, one endeavours to reduce the number of memory dereferences as much as possible, as each one will tack 1-2% onto the runtime of your GC.
This is a bit nontrivial, because the evacuation code needs to compute where an object is supposed to go, and this is indirected twice: where the object goes depends on the destination generation as well as the GC thread (since the GC is parallel). Fortunately, the current GC thread doesn’t change while we are performing evacuation, so we can calculate the pointer to the destination generation workspace as follows:
Arrange for a thread-local variable to point to the current GC thread structure, which contains workspaces for the various generations. Calculate the block descriptor of the closure (pure address arithmetic), read out the destination generation number (DEREF), and use that to index into the thread-local structure, yielding the address of the appropriate workspace. Overall expression: &gct->workspaces[Bdescr(p)->dest_no]
Notice that we’ve avoided a memory dereference that might show up in a naive implementation. We might have been tempted to have a pointer in the block descriptor to the destination generation (indeed, blocks do have this pointer); this struct would then have to be further indirected per GC thread, since many GC threads may be evacuating objects out of the same block. That’s two dereferences; by instead reading out the generation number and indexing into a thread-local variable, we only need one.
Now, let’s consider what happens when we add resource containers to the mix. Resource containers add an extra level of indirection to the destination: we now must dispatch on thread, container and generation. Containers are a particularly curious form of dispatch, because unlike generations, (1) there are a lot of them and (2) they are allocated and deallocated dynamically. So here is the naive scheme I implemented:
Calculate the block descriptor of the closure, and read out the resource container and the destination generation (DEREF; the two are in the same cache line). Index and read out the pointer to the thread-specific workspaces structure (DEREF, plus a DEREF to find the thread number, which probably can be eliminated). Index to the address of the workspace in this structure. Overall expression: bd->rc->threads[gct->thread_index].workspaces[gen_no]
So that’s an extra dereference. Can we get rid of it? Suppose that we had merely made generations two-dimensional. Then we could have simply read out the two values (in the same cache line) and computed the index in one go. We can sort of simulate this if we fix the number of resource containers, but this requires churning through a bit of memory to set up all the workspaces.
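For concreteness, the flattening trick amounts to this (my gloss): with a fixed number of generations per container, the (container, generation) pair collapses into a single index into one flat array of workspaces, so reaching the workspace costs one dereference rather than two.

```haskell
-- Hypothetical layout: one flat array of workspaces per GC thread,
-- indexed by container and generation in a single step.
workspaceIndex :: Int  -- generations per container
               -> Int  -- container number
               -> Int  -- generation number
               -> Int
workspaceIndex nGens rcNo genNo = rcNo * nGens + genNo
```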
Maybe an extra dereference is not so bad; it’s about what you’d expect in a scheme like this.
27 9 / 2013
Question: given a conjecture, how can you prove it? Try it? Well, many times conjectures are false. Idea: do random testing first, before trying to prove it.
Our property is: “our abstract machine’s operational semantics are noninterfering”. Kind of hard to randomly test. Does it work?
Here’s how we find out: make a simple machine and then try to break it.
Simple example of a stack machine with arithmetic operation.
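For a sense of what this looks like, here is a toy of the setup (my drastic simplification, not the paper’s machine): state is a list of labeled values, low-equivalence means agreeing on the public values, and the property says that scrambling the secrets cannot change the public result:

```haskell
import Test.QuickCheck

data Label = L | H deriving (Eq, Show)
instance Arbitrary Label where arbitrary = elements [L, H]

type Val = (Int, Label)

-- A trivially noninterfering "machine": sum the public values.
run :: [Val] -> Int
run = sum . map fst . filter ((== L) . snd)

-- Replace only the secret payloads, yielding a low-equivalent state.
scramble :: [Int] -> [Val] -> [Val]
scramble secrets = zipWith swap (secrets ++ repeat 0)
  where swap _ (v, L) = (v, L)
        swap s (_, H) = (s, H)

-- Noninterference: secrets cannot influence the public result.
prop_noninterference :: [Int] -> [Val] -> Bool
prop_noninterference secrets st = run st == run (scramble secrets st)

main :: IO ()
main = quickCheck prop_noninterference
```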
Example: End-to-end noninterference. Problem: generation of test cases:
- Default (100%)
- Pairs, building at the same time (79%)
- Weighted, something about avoiding crashes (62%)
- Snippets, building from meaningful code (65%, 70ms)
- Snippets, generate addresses not random integers (13.33, 59%)
Big idea: the big problem is crashes, so symbolically evaluate the machines while we’re generating them and avoid crashes.
Next, add control flow. The PC has a label. End-to-end is too hard. So, next property: lock-step noninterference; as soon as the machines hit public code, they must be the same. Note we no longer require that the two machines run successfully.
Next, strengthen property more. Think about the inductive proof that you would have, and you have “unwinding conditions”, which are single steps. This means that we can test them. This is good enough to make naïve generation work.
So, to do this random testing: machine generation, counterexample shrinking, and PICK THE RIGHT PROPERTY.
Lockstep always works and is a good place to look, and then move to single-step inductive invariants. This ASSISTS in the development of the proof, by helping you define invariants.
EZY: TESTING -> meet in the middle <- PROOF
Matthias Felleisen: I want to point out a piece of work: Pete Manolios has combined random testing with the ACL2 theorem prover, and they already have what you’re looking for. You start proving theorems (ACL2 tries to do it automatically), you add lemmas, and when you run out of steam, you say “quickcheck,” and it applies what you already have to the QuickCheck step. And this came out of an observation: let’s QuickCheck things before we give them to students. It really works. ACL2 may not be a good example for this setting, but this technique really works, so I think you will have a lot of fun with Coq.
Can you use this to test properties about side channel leaks?
We haven’t tried that because we’re keeping the context simple. The property strengthening notion: you could add side channels and it would be a natural next thing to test.
27 9 / 2013
Previous paper: “Abstracting Abstract Machines.” It was slow; we sped it up 1000x.
What does it do? Put in semantics, get out analysis. We want to run the analysis; we also want soundness, maintainability, precision (only half achieved in the previous paper) and performance (not achieved).
Lots of folklore, but the general idea is “decrease state space” and “explore faster”
CESK (control, environment, store, continuation). Old idea: put continuations in store, and have finite address space.
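My gloss on that old idea, not the paper’s code: allocate from a finite set of addresses (here, just a name) and let the store map each address to a set of values, joining on collision; a finite address space forces a finite state space, and continuations get the same treatment:

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Addr    = String                 -- finitely many abstract addresses
type Store v = M.Map Addr (S.Set v)   -- each address holds a *set* of values

-- Allocation joins the new value with whatever is already there, rather
-- than overwriting; this is where the analysis's nondeterminism lives.
alloc :: Ord v => Addr -> v -> Store v -> Store v
alloc a v = M.insertWith S.union a (S.singleton v)

-- A lookup yields every value the address may hold.
fetch :: Ord v => Addr -> Store v -> S.Set v
fetch = M.findWithDefault S.empty
```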
Idea: lazy non-determinism; don’t make “diamond subgraphs” if you don’t need them.
Idea: abstract compilation (cf. SICP); there is a lot of dispatch, so we want to turn (e,ρ,σ,κ) into [e]. This “eliminates the red nodes.”
Idea: store widening
What’s the point? 100000% improvement, and these techniques are quite easy to implement in an afternoon.
Theoretical precision loss appears to have translated into no change in precision in evaluation. Since publication, sped it up by another order of magnitude.
Very specific question: if I have an abstract machine that uses one stack, but models a two-stack machine, is there a systematic method to derive a two-stack machine from that?
Abstracting abstract machines will route all abstraction to the store. As a regular abstraction, the number of stacks is not a problem. Talk to me later about how we do it for pushdown.
Thank you. But still, AAM is a whole-program forwards static analyzer, which does not scale to libraries. How do you make it modular?
Apply it to a modular semantics? If you have a program that is a module, and want to say “I want to analyze this program under all interactions,” then you need to look at the imports/exports and model them. My coauthors have a paper about this. If you want more search-directed stuff, or information flow through the graph, it’s another semantics problem.
Wadler: This is quite neat; simple, and previously not so performant; now performant. What’s the factor difference with hand-optimized?
This is difficult to evaluate fairly without an enormous benchmark suite and agreement on which consumers of the analyses you are going to use. We use a singleton analysis to evaluate. We don’t have a formal comparison, but we played with Suresh Jagannathan and Andrew Wright’s polymorphic splitting implementation; the stuff in the paper is 5-25x faster, and we’ve since sped up by another 10x. So we’re really in the same ballpark.
So you should put this in the talk. Because it seemed like you were still a million away. But you’re much closer. Also, I want to recommend the technique applied here, which is do something really simple, easy to understand, and has abysmal performance, and then there are plenty of papers to write. Perfect technique for getting a PhD (laughter).
27 9 / 2013
Damas-Milner inference is great for prenex polymorphism, though it gives confusing error messages and is full inference, making it hard to extend to interesting features.
Bidirectional typing: checking (Γ ⊢ e ≤ A) and synthesis (Γ ⊢ e ⇒ A). Unification is not inherent to bidirectional. This works for many type system features, though it’s kind of ad hoc (annotations: where? how many?)
Previously, the proof-theoretic recipe for bidirectional type systems: introduction rules check and elimination rules synthesize.
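The recipe in miniature, for the simply-typed fragment (my illustration; the talk’s system layers predicative polymorphism on top of something like this):

```haskell
data Ty = TInt | Ty :-> Ty deriving (Eq, Show)
data Tm = Var String | Lam String Tm | App Tm Tm | Lit Int | Ann Tm Ty

type Ctx = [(String, Ty)]

-- Synthesis (Γ ⊢ e ⇒ A): elimination forms and annotated terms.
synth :: Ctx -> Tm -> Maybe Ty
synth g (Var x)   = lookup x g
synth _ (Lit _)   = Just TInt
synth g (Ann e a) = if check g e a then Just a else Nothing
synth g (App f e) = case synth g f of
  Just (a :-> b) | check g e a -> Just b
  _                            -> Nothing
synth _ (Lam _ _) = Nothing  -- introduction forms do not synthesize

-- Checking (Γ ⊢ e ⇐ A): introduction forms, falling back on synthesis.
check :: Ctx -> Tm -> Ty -> Bool
check g (Lam x e) (a :-> b) = check ((x, a) : g) e b
check g e a                 = synth g e == Just a
```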
Problem: polymorphism. The problem is in the elimination rule, where you have to guess the instance τ for the polymorphic type, which is not known. Davies: Also do D-M inference, but this is unsatisfying. We solved this.
Contribution: a predicative System F which is bidirectional, with no guessing of ∀ quantifiers; it only requires annotations on redexes at polymorphic type. Furthermore, it shows how to instantiate ∀.
New judgment for application. New rule for subtyping, and how to instantiate a lambda term as long as it’s monomorphic.
Subtyping? When A is at least as polymorphic as B. No instantiation of quantifiers.
Idea for the algorithm: add existential type variables, and have solved and unsolved existentials. (Also need a scope marker; see paper.) Ordered contexts (avoid circularity). You get a context out of synthesis; it should have at least as much information.
EZY: Ugh, this talk is not very clear.
Interesting things about instantiation rules: sometimes we might try to instantiate to a variable which is not in scope (the order is in the wrong direction); so just instantiate it the other way.
If nothing gets instantiated, you have a polytype. So you have to lose it.
Not too many details, and seems to be easy to implement.
Good properties: decidable, sound and complete. Preserves types; keeps the standard types of System F. Gives up impredicative instantiation of ∀. (Undecidability prevents having all three.) MLF, HPF, Haskell live at different points of this space.
Related work: tried to extend greedy bidirectional polymorphism, but there were embarrassing problems (not even transitive). (We have 70 pages of manically detailed proofs for this paper.)
KEY IDEA: ordered contexts. (See also Gundry 2010 and Miller 1992, mixed-prefix unification, where variables are a bunch of quantifiers)
Andreas: I like this. So what are the challenges for extending this to dependent types, LF, etc.
I could speculate, but when designing this system, there were many more twists and turns than either of us expected. We hope extending to GADTs will give us useful insights for full dependent types. Neel thinks this; I’m more dubious.
Didier: You mentioned that in future work you could do manual substitution for impredicative polymorphism, I think you want this in practice. There are simple ways to do this, do you have a way in mind?
I don’t think we do. It’s clearly good to have some way of impredicative polymorphism. But we are hesitant to try to do it automatically, because we have a very simple system which we are trying to extend.
Wadler: I like the simple system, and then soundness/completeness to hide the complexity. But you don’t have impredicative polymorphism? I can write terms in System F, and NO MATTER WHAT I DO, I can’t write them here?
Why wouldn’t Frank approve of some of the rules you have?
Well, Frank would need to explain that (he’s not here), but if you stop reading after this rule, you have four rules. But with this one you have five. (laughter and applause)
27 9 / 2013
Context: in gradual typing, casts mediate between types. To have blame, you want to point to the cast that caused the failure.
Problem: higher-order types. A bad cast to a function doesn’t fail immediately. This can cause space leaks, since the casts accumulate.
EZY: Casts are like dynamic contracts!
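A toy of the problem in Haskell (mine, not the talk’s calculus): a cast at function type has to wrap the function, so a bad cast only surfaces at a call site, and repeated casting piles up wrappers; that pile-up is the space leak that threesomes collapse into a single middle type:

```haskell
data Dyn = DInt Int | DFun (Dyn -> Dyn)

-- First-order casts can fail immediately, blaming their label.
castInt :: String -> Dyn -> Int
castInt _ (DInt n) = n
castInt l _        = error ("blame " ++ l)

castFun :: String -> Dyn -> (Dyn -> Dyn)
castFun _ (DFun f) = f
castFun l _        = \_ -> error ("blame " ++ l)

-- A higher-order cast (here, Dyn to Int -> Int) wraps the function:
-- nothing happens until the result is applied, and each cast adds a layer.
castIntFun :: String -> Dyn -> (Int -> Int)
castIntFun l d = \n -> castInt l (castFun l d (DInt n))
```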
Previous solution: coercions and threesomes. Coercions work well with blame strategy, but threesomes can be implemented directly. Note threesomes w/ blame ≈ coercions; but adding blame is TRICKY
Idea: make lots of little correctness preserving transformations to get threesomes w/ blame
"We need more threesomes. No, not those."
Something about different models of blame, where you can do upcasts or downcasts. This means there are lots of blame calculi for which no one had come up with a threesome calculus.
Contribution: generated threesomes for all of the design space
Threesomes? A cast from a higher-order type is a cast to a middle type and then back out again (there can only be three types involved), i.e. one cast with a middle type. If we consider types as a lattice, the middle type is a meet.
(something about redex pairs)
Reduce simple coercions into supercoercions, and then remove redundancy from the supercoercions. Primitives can be ditched; the type context can differentiate between some coercions.
The supercoercions turn into the threesomes
There is a calculus for having complete blame. Do you have a story for whether your calculus has a notion of blame correctness in the setting of complete blame? Secondly, you have all these different reductions for more eager (earlier) errors; can you predict how early these errors are detected?
I’m not sure what complete blame is. For the second question, that was prior work, but I would say the eager systems catch everything the lazy systems will catch, and more, but when you go beyond lazy it’s less predictable.
Wadler: Comment: in your notation, when you cast from A to B, you wrote it with a backwards arrow. But we have since realized that it’s much easier to read when you do it as a forwards arrow. So I urge you to do what ought to be done, and write them forwards. It’s really worth paying attention to notation.