Today, I was describing a problem I was having to Adam, and C— came up. I described it as, “Like assembly, without the register allocation”, and Adam said, “I would totally use that!” Of course, the key is that whatever it is it would have to work with the GCC compiler toolchain as seemlessly as inline assembly, and C— doesn’t exactly have that kind of support. But still, it’s food for thought: it’s easy to say that C— and LLVM were in direct competition of the space of “lower than C intermediate representations”, but LLVM doesn’t really seem to fill this niche very well.
Centaur chess (also known as Advanced chess and Cyborg chess) is a form of chess where a human uses a computer to help explore possible moves. This got me thinking a bit about what “Centaur bridge” that is to say, computer-assisted bridge, would look like. The computer could effortlessly run complex squeezes and know the optimal method for playing all suit combinations; the human could bring to bear bidding judgment and figure out when the computer is about to make a bozo move based on their simulations…
Promise of deep learning! (realtime Chinese to English translation)
Core AI (visual object recognition, speech recognition) versus data mining (movie recommendation). Difference: core AI, specifying features is DIFFICULT. Deep learning assists in SPECIFYING REPRESENTATION. Pretty neat!
"Why am I talking about at OSDI, why aren’t I at NIPS or something?"
What’s holding us back? Size of model needed for task grows linearly with complexity of task, but with a larger model, you need a linear amount of data. This means that amount of data is QUADRATIC in the complexity of task. Lots of computation: petaFLOPS of computing.
Accuracy improves with data and larger models!!
"I don’t have to come up with a new machine learning algorithm, if I can come up with a larger system, I can just get better accuracy."
Goal: model parallelism and data paralellism
Observation: weight updates are associative and commutative (Edward: what about over shoot?)
Generally: exploit asynchrony to remove locks, and partition, partition, partition!
Asynchrony actually helped a lot, pushed them past the 2013 state of the art. Idea: asynchrony helped you jump out of the local minima. (Even helps without increasing model size)
TWICE as accurate than prior best on the ImageNet task! “We think it’s better than humans on this task”
Q: Ousterhout. Question about size of model, and the shared data set. How many bytes are those?
A: Our parameter server are shared among multiple models: they’re a shared resource for multiple models. An individual model, it could be in the order a terabyte or so. (How many models?) 20-30 models. Yeah, we were very interested in Ramcloud.
Q: We’ve been hearing about accelerators like Qualcomm NPU, they have 250 million neurons. If you had that hardware, would it change the landscape? Less machine and get the same accuracy?
A: The trick is, it’s great to have large models, but with larger and larger, it’s harder to train. Just because you have a large model, doesn’t mean you can get better. Small models are easy; it took us months to train large models. It’s not a question of the hardware. Andrew Ng had a paper where, he had a cluster of GPUs to train large models, but it was worse than the accuracy for small models.
Q: NYU. According to my understanding, you partitioned workers and used async. The workers in the same have the same model. So how many machines?
A: We partitioned it among four machines, since it was 256x256, but it’s a function of input size.
Q: If you have four machines, and fit it in L3 cache, so???
A: I said, the working set of the model. So you do it layer by layer, tile it, so working set fits. It’s bigger than L3 cache.
(BTW, the end of talk image, there’s a rabbit, and Adam identified it)
Q: (HP Labs) Usually you use durable storage, replicating the logs and checkpoints. Do you do that?
A: No. We want to explore this.
Q: (HP Labs) This is a nice continuation of PDOS work. You vary the rate you can do recovery and you can do transactions. I did some math on the slide; the rate per core for transactions seems less than MemSilo. What’s the difference?
A: Do you mean SiloR? (Yes, MemSilo versus SiloR) I think that makes sense, since MemSilo has no persistence: it’s committed as soon as it finishes, whereas SiloR has to flush to disk. MemSilo has 32 workers, whereas it was per-core… there’s extra work since workers are transferring data to loggers.
Q: You use three FusionIO cards and a Ray array, but yo udon’t report the bandwidth you’re gettin gout. So it’s dificult to predict what would happen if I had a much faster or slower IO subsystem. What was the bottleneck? Recovery threads? IO subsystem? How much bandwidth?
A: We have some numbers in the paper, but for the recovery system, we didn’t want to be IO bound, so we used multiple disks. We are CPU bound, but we are using all cores to recover as fast as possible. I believe there is room for improvement in terms of recovering faster, but I think we showed we achieved our goals: failures are rare.
Q: Why can’t you just get logs and have consistent checkpoint?
A: We want checkpointing so we can truncate logs, since recovering only from logs would take a very long time. But there is some memory overhead from consistent checkpoint.
I liked this session quite a bit, because a lot of the research seemed very amenable to practical application. This summary post is a bit perfunctory since I was lazy, but you should check out the papers.
SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud System. The basic idea is that, when you model check, you need to reorder messages to explore the state space. But a lot of reordering is stupid. So do clever things to only reorder when you think it will change the behavior of the system. They used this to find bugs in a variety of real world distributed systems, including two novel bugs.
SKI: Exposing Kernel Concurrency Bugs through Systematic Schedule Exploration. The basic idea here is that we have race detection tools for userspace, by interposing at the syscall level, but not for kernelspace, where no analogous interposition location exists. (Actually, it doesn’t really feel like race detection tools are ubiquitous in the same way Valgrind and strace are.) So, instead, write an interposition as a hypervisor which the OS runs on. Of course, at this level, all you get are a stream of hypercalls and assembly instructions. Use THREAD AFFINITY to bind instructions to virtual CPUs make it possible to figure out what any given thread is doing. Were able to do much better than stress tests for a variety of Linux kernel bugs. Looks like it would be useful for the Linux kernel.
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. A nice survey paper about what kinds of durability guarantees file systems give you, and what guarantees applications seem to rely on. I wasn’t really sure on the research methodology for the latter, but I was appreciative of the empirical description of various file systems and their different modes of operation. The example of atomically replacing a portion of a file with another shows exactly how difficult it is to add enough fsyncs/msyncs to guarantee durability on all file systems. Fabulous quote: “Most real-world file systems would fail if submitted as an assignment to an OS class”. The file system API is a clusterfuck, it really needs to be fixed.
Torturing Databases for Fun and Profit. These guys figured out how to simulate power failure on database, and use this to zoom in on the 10-20% of cases where the crash lead to some inconsistency. They had a very admissive failure model (all blocks before crash are guaranteed to have successfully written to persistent store) but all databases they tested failed at some place. There appear to be some common patterns which lead to these sorts of bugs, for example accidentally writing to an mmap’ed region.
Accuracy is related to shape of curve: sharp edges introduce more error
Q: This was very good, better than the poster session. You said it’s sublinear in memory, but the algo requires knowing you have a repeated reference, so how do you avoid keeping track of each repeated symbol?
A: When we’ve computed delta-y, that tells us when the repeated references happen. (What about the original matrix?) That’s what HyperLogLog does. It’s a cardinality estimator.
Q: Scott Kaplan. One thing I see, is that you didn’t actually try to run it online?
A: It’s coming out in our next release, we’re running it right now.
Q: Gathering these curves online has been done, ASPLOS’04, they did it 2-3%. So how do you expect yours to compare?
A: I don’t believe they’re maintaining full workload histories, months in the past. They were computing MRCs over short intervals, which dramatically reduces unique hits. But we have storage events over weeks and months, so that’s not applicable.
Q: Tim (Washington). You mentioned error is affected by shape of the curve. Can you use that to feedback into your system and detect workload?
A: We can do simple elbow detection, and if we see jagged curves, we can dynamically increase fidelity. (Because it’s the elbows you really care about.)
Problem: infrastructure you depend on may have hidden dependencies, such that if that dependency goes down, multiple services will become available. These dependencies can thwart replication, if failure of replicas is correlated.
Answer: Learn information about how failure of components are correlated
Problem: This information is often proprietary, companies don’t want to give it up
Answer: Use secure multiparty computation techniques (p-SOP, in this case) to determine this information
Q: Mark (HP Labs). Who do the cloud providers want people to not know about their stuff? Because it seems like they know what’s going on. I.e., if Amazon wants to know what Microsoft does, they can check the cardinality. So can’t I just giving you subsets of my stuff that I have and don’t have, and eventually figure it out.
Q: Yang (Columbia). So how do you factor in the probability of failure of components? I.e. lightning strike is not likely.
A: This is hard to collect in practice. There was a paper that successfully got this, so we take advantage of this information.
Q: Kozyrakis. If I care about probability, e.g. four nines, can I use your tool to tell me how to deploy my service with this failure probability?
A: This is a tricky question. Our target goal is to generate a ranking list. So if you’re an admin, the target is to show the runtime list to you. You can select the most independent configuration from the ranking list. This score ranks them. That’s the target.
Q: While it seems like computing independence scores, you don’t make use of the fault graph: you just compute on sets.
Q: Re commutative encryption, when you say, “publish”, are you publishing it to the independent agent or everybody?
A: Everybody. Everybody can see the published encrypted data set. It’s been encrypted, so we don’t have to worry about them being able to extract a specific thing.
Q: But you’re saying the ciphertext is equivalent, if it’s commutative.
Q: So if I’m Kd, and I see Kc published the same thing I have, then I know there’s a dep.
Q: Sampling only works if it’s random. Your sampling doesn’t look random.
A: Good question. Notice in the stragglers example, it only works if we pick some tasks based on all of them. So we wait for it to finish, and then pick the best. So the tradeoff only occurs when some stragglers are really long. But what we also found, enabling or disabling this, we still knew what the limits were due to theorems, and we found no discernible difference. This is primarily because things are not deterministic; skew is usually the only deterministic factor. The nondeterminism gets us away from…
Q: Well, as long as it’s system, and uncorrelated, it’s fine.
Q: (VMWare) Great to see a database talk at OSDI. For randomness, re the theory of sampling based on randomly distributed data, you’re dealing with cross data skew, but depending on partition, you might get very skewed results, e.g. range partitioning.
A: The skewed results: the range partitioning affects the size of blocks. For cross-site skew, we’re only counting the number of blocks. We don’t distinguish between larger and smaller partitions.
Q: (Martin Schufska, Cambridge). A lot of this is trading off locality. you mentioned Quincy, I think it’s very closely related: instead of heuristics, it says it’s a optimization problem. Quincy can’t do combinatorial, but in your case, m > k, it’s OK: Quincy would work, you get increasing costs, something near k, optimization wise. Have you given thought to how well Quincy-style works? I think it would be better.
A: We haven’t tried that, the primary approaches, we had the combinatorial thing, and then we had a few extra tasks on top. So we didn’t start from optimization. But it’d be interesting.
Q: Researchers have identified interference is a big problem with scheduling. My question: does it affect wait time, or run time? If it’s wait time, that’s fine, but what if it runs 4-5 slower?
A: Apollo depends on robust local research management on servers. Some relevant work includes CPI2 from EuroSys. Also Linux containers, Java objects, these can be used to isolate performance, but I’m not sure the technology is up to this. (Did you measure the problem?) No.
Q: Henry (Stanford). For maximizing resource utilization, do you also consider maximizing disk IO and memory?
A: In our system, the bottleneck is CPU, so it’s the primary resource we optimize for. We’re concerned about memory, though we didn’t publish any numbers on it. For Disk IO, it is also very important, but again, Apollo relies on the IO scheduler to do its job correctly.
Q: (Cambridge) Question about target workload, since the formulation of wait time/run time to work out optimal schedule depends on batch workload. Some are service tasks, long running web service. As for what you did right now, is it not suitable for long running services?
A: Long running web services are essentially infinite length: those resources are never available.
Q: But then you don’t get a good schedule. There are no tradeoffs to make, if they run forever. Also there are better/worse placements
A: Yes. So Apollo is for Cloud-level big data processing. We support sharing with other schedulers, so you could imagine putting a service scheduler in.
Q: Vijay (Twitter). How can you extend your system to deal with eliminating discrepancy in load across clusters? This is something we spend a lot of time trying to solve at Google. Some clusters are loaded, some are not… Your scheduler runs on a cluster.
A: No, it doesn’t really look at deciding which cluster to put the task in.
Problem: you want to run your applications on a third-party cloud, but you don’t trust your cloud provider. TPM? Secure microkernel?
Idea: try to emulate the old-fashioned model of a secure colo, where everything inside your locked rack is secure, and everything outside is adversarial.
Technique: Intel SGX, which lets you create an “enclave” which is protected
Operating System has to live inside the enclave, because the OS interface is too rich. It’s on top of a “shield module”, which is responsible for interacting with the untrusted runtime and determine if the results it is getting are sensible.
They need more stuff from Intel SGX: dynamic memory allocation/protection, exceptions (page faults/GPFs), RDTSC/RDTSCP, thread-local storage (good news: Intel SGXv2 coming down the pipe!)
Trouble: no actual SGX hardware available yet! But model suggests 35-60%
The cloud should be a UTILITY!
Q: John Sawworth. You use public key to attest?
A: Yes, there’s details in a workshop. It’s a group.
Q: So who owns the private key?
A: Ultimately, the root of trust is in the processor: the processor implements SGX, and the initial state of the Enclave is signed. The cloud provider is outside the TCB.
Q: I’m interested in the programming model. You said full legacy applications. Can you contrast with smaller pieces where you don’t have to deal with Iago attacks?
A: Intel published a number of workshop papers of smaller apps which solve this.
Q: So why legacy apps?
A: There are a lot of them! We want to run them!
Q: Are the limitations on memory size? Memory swapping, for encryption?
A: The size of encrypted memory is fixed, it’s unknown what the eventual size will be. But it will support paging, so you can page to untrusted memory.
Q: What happens if you disable SGX and emulate it?
A: The attestation wouldn’t work. You launch the app: the VHD is encrypted, to get the key, it needs to attest.
Q: John Griswell. You say you modify unmodified apps. Source?
Q: So what if they are using system calls using int80, syscall?
A: It’s very rare, but it will generate an exception, and we get it to the enclave, and emulate it.
Q: For other system calls, you just replace libc
Q: Is SGX simulator available?
A: No, have to ask Intel for that
Q: (Edward Bugnion) Intel introduced VTX without support for EPTs a while back. So you had to jump hoops to do this, and they decided nested page tables (e.g. multiple address spaces) was useful. Now they’re introducing SGX without EPTs. So why is it only user-level code?
A: Well, including multiple privilege levels, with the processor implementing the boundary, it seems like it would be very complicated.