The Burroughs 205 and 220 Blog: 2017

Sunday, November 26, 2017

Emulator Timing Improvements

Version 1.00 of the retro-205 emulator was released at the end of March this year. The blog post accompanying that release described most of the technical changes in the release, then promised a second post on the subject of emulator timing.

This is Paul Kimpel with that long overdue second post for Tom's 205 and 220 blog.

The Significance of Timing

One of the design goals of the retro-205 emulator has been to run programs as close as possible to the real speed of a 205. Modern browsers can run the emulator much faster than that, however, so the speed must be slowed, or throttled, to meet that goal. This post describes how that is accomplished in the emulator, along with the improvements to timing accuracy that were introduced in version 1.00.

The effort to improve timing accuracy started early this year when Tom was experimenting with some prime-number programs and noticed they were running 20-30% faster in the emulator than he suspected they should. That set off a series of studies and experiments on both our parts to determine (a) how fast his programs theoretically should run, and (b) why the emulator was running them significantly faster.

Before getting into that, though, I first need to describe how the emulator regulates the speed with which it runs 205 programs.

Controlling Speed of Execution in a Web Browser

Most people by now know that a web page is constructed using a textual notation known as Hypertext Mark-up Language (HTML). That notation defines the elements we see on the page -- blocks of text, images, form fields, etc. You can think of a "web application" as a set of web pages with a dynamic component. Inside the browser is a programming language, Javascript, and an API, the Document Object Model (DOM), that allows Javascript code to access and manipulate the elements on a page.

The Single-Threaded Nature of Browser Applications

Javascript and the DOM were originally designed to enhance user interfaces by reacting to user-generated events -- mouse clicks, key presses, and the like. Based on its interpretation of those events, Javascript code can use the DOM to access and alter the elements on the page. People quickly realized that this capability could do a lot more than just enhance a static web page -- it could be used to build robust and sophisticated browser-based applications that do significant internal processing in their own right.

As a consequence, the facilities and techniques for building such applications have improved dramatically over the past 20 years. What started simply as a way to create an interface to an application running in a web server has in many cases become the application itself. The fact that we can now use a standard web browser to, say, model and emulate a computer system like the 205, let alone implement something as polished as Google Docs, is impressive, if not downright astonishing.

Browser-based applications still retain their event-driven nature, however. They are generally not structured the way that most off-line applications are. All processing for the set of windows making up an application (the emulator home page and all its sub-windows in our case) is strictly single-threaded. Only one thing runs at a time in that application -- there is no multi-threading or parallel execution. This is an intentional feature of Javascript, to simplify the programming model and avoid the need to coordinate multiple parallel threads of execution.

Thus, most browser applications are structured as a set of functions that are called one at a time from the browser's Javascript run-time controller. There is generally one global block of code that runs when the main page of an application first loads. That code links functions to handle events that will occur when the user interacts with the elements of the page. The code can also schedule functions to execute after a specified period of time. The global code then exits back to the Javascript run-time, leaving the functions it has linked to run later as events occur.

To maintain its single-threaded behavior, the Javascript run-time maintains a queue both of events that have occurred but not yet been serviced and of scheduled activities that have not yet come due. Each of these events and scheduled activities has a handler function associated with it, and as it rises to the top of the queue, the run-time calls that function.. These handler functions will do appropriate processing for their events, and may of course link or schedule still other functions for future execution.

Once a handler function exits back to the run-time, the run-time then figures out what to do next. That may be to pluck another event out of the queue or do some of the browser's background processing. It may be there's nothing to do at the moment, in which case the run-time will wait for the next scheduled function to come due or for some external event to occur, such as a click in the user interface.

Another thing about Javascript functions in a browser is that you can't run them continuously. Because there is only one thread of execution, a function that runs continuously will block other functions from running, which would prevent any user events from being serviced or any scheduled activities from taking place. To prevent that, most browsers will allow a function called from the Javascript run-time to execute for 15-30 seconds, and if the function hasn't given up control (i.e., exited back into the run-time) in that amount of time, the run-time will assume that the function is caught in an endless loop. The browser will pop up an alert and ask the user whether they want to continue executing, abort application, or bring up the debugger. Thus, a web application needs to be designed so that it returns to the run-time controller on a regular basis in order to avoid the run-time declaring it scriptum non gratum.

You can think of the emulator as a program that runs in loop, iteratively simulating the execution of 205 instructions. Internally, though, the emulator is not structured around a loop. Instead, it's constructed as a set of functions that are called from the run-time, execute for a bit to accomplish some piece of work, perhaps schedule themselves or another function to run later, and then finally exit back into the run-time. The iterative behavior comes mostly from the run-time repetitively servicing its event queue, and from the handler functions continuing to schedule other functions that will run at some point in the future.

Regulating the Speed of Emulation

Throttling the performance of an application in a web browser is a bit tricky. The way that you do that is to introduce delays into a script's execution. The way that you introduce a delay is by scheduling some function to run after the desired delay, and then exiting back into the run-time. The application will then sleep until the next item in the run-time's queue is selected.

Functions are scheduled to run in the future by calling the Javascript setTimeout() intrinsic function, passing it a number of milliseconds to delay and the function to call after that delay has elapsed. The actual delay will be at least the amount of time you specify, because the run-time only runs one thing at a time, and it may be that some other function or background task is already running when the delay expires. The scheduled function must wait for any currently-executing and queued activities to be completed first. Thus, the run-time will call the function you passed to setTimeout() after the specified delay and after nothing queued ahead of it is still pending.

Therefore, the trick to throttling emulator performance down to an appropriate speed is to break up the design into functions that can be scheduled after appropriate delays, and to keep rescheduling those functions in order to keep the emulation running.

Fortunately in the case of the 205, we have delays happening all the time due to the relatively slow drum memory, so there are plenty of opportunities to give up control and exit back into the run time. This also nicely prevents the emulator from hogging the single Javascript thread and having the run-time declare it non gratum.

Significance of the Memory Drum to Instruction Timing

To understand how drum timing interacts with performance throttling, let us first review the drum memory and its timing:

The drum rotates at 3570 RPM. It contains 24 tracks of 200 words each. An 11-digit word on the drum is the smallest unit of data that can be addressed, so it is also referred to as a sector.
At 3570 RPM, the drum completes a revolution in 16.8 milliseconds. Since there are 200 words on a track, a word passes under the read/write head once every 84 microseconds. This unit is referred to as a word-time.
20 of the tracks comprise the main memory, with word/sector addresses 0 to 3999. There is a combined read/write head for each track.
The other four tracks comprise the high-speed loops, or just the loops, and effectively store 20 words each. These tracks are addressed differently than main memory, however -- each loop is addressed modulo 20 within a 1000-word address range. Thus, the tracks are commonly known as the 4000 loop, the 5000 loop, the 6000 loop, and the 7000 loop. Addresses 6018, 6278, and 6538 are all in the 6000 loop, but all of them identify the same sector on the drum because all those addresses are congruent modulo 20.
Instead of having a combined read/write head, the four loop tracks have separate read and write heads, placed 36 degrees apart, so that words written by the write head can be read by the read head 0.1 revolution (1.68ms) later. Normally the words read by the read head are fed back (“looped”) immediately to the write head, with the result that the words are duplicated 10 times around the drum. Hence, while each high-speed loop can store only one-tenth the words of a main-memory track, the drum can access those words 10 times faster than main memory.
When the processor requests a memory access, it must wait for the drum to rotate until the desired word is under the head. This delay is termed the latency. On average, a main memory request must wait for one-half a revolution, or 8.4ms. Accesses to the loops, however, must wait on average only one-half of 0.1 revolution, or 0.84ms. Well-written 205 programs tried as much as possible to execute instructions from one of the loops, and to keep frequently-referenced variables in a loop as well.

The timing of instruction execution for the 205 is complicated because of the drum. Each instruction requires at least one memory access -- to fetch the instruction for execution. For instructions that use an operand, a second memory access is needed to read or write the operand. For all but a few instructions, the drum latency dominates the overall execution time -- most instructions will finish before the drum rotates to the location of the next instruction to be executed, even when running from one of the high-speed loops.

So the big question when attempting to determine instruction timing and throttling delays is, "where's the drum at this point in time?"

Design of the Performance Throttling Mechanism

When designing the emulator in late 2014, I decided to implement the drum internally as arrays of words in memory. Then I looked at the drum timing and decided that the latency delays were generally so large, I could combine the delays due to drum latency with the delays needed to throttle performance, and do them together within the memory access mechanism. The way I chose to do this is by maintaining two clocks:

The first clock maintains internal, emulated processor time. It counts word-times as a result of instruction fetch and execution. We have reasonably complete information on how the instructions were mechanized inside the processor, so the emulator keeps track of the word-times each instruction requires as it executes. Under the assumption that the emulator will run much faster internally than a real 205 would, this clock should run fast with respect to real-world time.
The second clock is the drum -- or the drum is the clock -- whichever way you want to think about it. This clock is real-world time. The drum rotates at a constant rate, so if we convert milliseconds of real time to word-times by dividing by 0.084 milliseconds/word-time and call that drum-time, then "(drum-time) mod (tracksize)" can be taken as the track-relative sector number under the head at that point in real time. Tracksize will be 200 or 20, depending upon whether the emulator is accessing main memory or one of the loops.

Knowing the track-relative sector where the head is now, and knowing the track-relative sector of the address being sought, we can compute the drum latency for an individual memory access. We know the number of words we are going to read or write (either 1 for normal accesses or 20 for block transfers), so we also know the track-relative sector where the memory access will finish, and therefore how many word-times (latency + word count) should be added to the internal processor clock to account for the memory access.

Since we hope the emulation is running faster than real time, we can compute the total real-world time delay for a memory access as:

delay = (internal time) - (drum-time) + (latency) + (word-count)

where "(internal-time) - (drum-time)" is the difference between emulated time and real-world time -- the amount that the emulation is running too fast. We need delay by that amount to allow real-world time to catch up. "(latency) + (word count)" is the amount of real-world time the memory access takes.

Thus, the emulator operates by running unthrottled between memory accesses, then sleeping for a time that combines the delays for drum rotation and performance throttling. It provides a call-back function to the Javascript run-time to continue operation after the delay completes.

You can see how this works in the readMemory() and startMemoryTiming() routines of the emulator’s Processor object. First, readMemory() determines the type of memory access that will take place:

D205Processor.prototype.readMemory = function readMemory(successor) {
    /* Initiates a read of the memory drum at the BCD address 
    specified by C3-C6. The memory word is placed in the D register, 
    and an appropriate delay for drum latency, the successor function 
    is called. Sets the memory access toggles as necessary.
    Note: the D register SHOULD be loaded after the latency delay, 
    but we do it here so that the word will show on the panels during 
    the drum latency */
    var addr;               // binary target address, mod 8000
    var cbCategory;         // setCallback delay-averaging category
    var trackSize;          // words/track in target band of drum

    addr = D205Processor.bcdBinary(this.CADDR % 0x8000);
    this.memACCESS = 1;
    if (addr < 4000) {
        trackSize = D205Processor.trackSize;
        cbCategory = "MEMM";
        this.memMAIN = this.memLM = 1;
        this.D = this.MM[addr];
    } else {
        trackSize = D205Processor.loopSize;
        cbCategory = "MEML";
        if (addr < 5000) {
            this.memL4 = 1;
            this.D = this.L4[addr%trackSize];
        } else if (addr < 6000) {
            this.memL5 = 1;
            this.D = this.L5[addr%trackSize];
        } else if (addr < 7000) {
            this.memL6 = 1;
            this.D = this.L6[addr%trackSize];
        } else {
            this.memL7 = 1;
            this.D = this.L7[addr%trackSize];
        }
    }

    this.startMemoryTiming(addr, trackSize, 1, cbCategory, successor, 
                           this.readMemoryFinish);
};

The property this.CADDR is the address field of the C register, which specifies the memory address to be accessed. The this.memXXX properties represent the memory control toggles shown on the right side of the Supervisory Panel. These are set only for display purposes. cbCategory is a short string that identifies the category of delay we require and is used with the setCallback() function discussed below. The successor parameter is a function that is to be called once the memory access finishes -- it's often the part of the Fetch mechanism that loads an instruction into C or the part of the Execute mechanism that actually executes an instruction after any operand word has been loaded from memory.

Based on the value of the address, the routine selects either the array of words for main memory (this.MM) or the array for one of the loops (this.L4, this.L5, etc.) and sets the appropriate memory control toggles so that they will display on the Supervisory Panel. It also chooses the appropriate track size -- D205Processor.trackSize=200 or D205Processor.loopSize=20 -- and loads the desired word into the Processor's D register. "%" is the Javascript modulo (remainder division) operator.

The this.readMemoryFinish parameter is a small function that will be called after the delay for memory access completes. It does some housekeeping and calls the actual successor function.

The startMemoryTiming() function is common to all of the memory access routines. It handles the chore of computing the necessary latency and delay, updating the internal processor clock, calling this.updateLampGlow() to compute average lamp intensities, and finally initiating the delay by calling setCallback():

D205Processor.prototype.startMemoryTiming = 
        function startMemoryTiming(addr, trackSize, words, 
                   cbCategory, successor, finish, finishParam) {
    /* Starts the necessary timers for the memory toggles to 
    aid in their display on the panels. Then delays the amount 
    of time required for drum latency, data transfer, and amount 
    necessary to bring the processor internal time back in sync 
    with real-world time */
    var latency = (addr%trackSize - 
                  (this.procTime + 
                      D205Processor.memSetupTime)%trackSize +
                  trackSize)%trackSize;

    this.memACTION = 1;
    this.updateLampGlow(this.procTime);
    this.memoryStartTime = this.procTime;

    this.procTime += latency + words + D205Processor.memSetupTime;
                     // emulated time at end of drum access
    this.successor = successor;
    this.scheduler = setCallback(cbCategory, this,
            this.procTime/D205Processor.wordsPerMilli - 
                performance.now(),
            finish, finishParam);
};

addr%trackSize is the track-relative sector number of the desired memory address. this.procTime is the internal processor clock, maintained in units of word-times. Taking the value of the clock modulo trackSize gives the track-relative sector number where the emulated processor thinks the drum is now. B205Processor.memSetupTime is the number of word-times necessary to set up and initiate the memory access internally, so adding that to the internal clock indicates where the drum will be when the processor actually begins looking for the desired sector.

Taking the difference in sector numbers between where the drum is now and the desired address gives us the latency. Depending on the relative positions of the two sectors, however, that difference could be negative, so I add the value of trackSize to make sure the difference is positive, then mod that sum by trackSize in case it was positive to begin with, thus keeping the result between 0 and trackSize-1.

this.updateLampGlow() updates the calculation of lamp intensities for the display panels so that they exhibit a somewhat realistic partial brightness. Based on current register settings, it computes a time-weighted exponential running average of the 0/1 state for each bit in each displayed register. That calculation will be completed by this.readMemoryFinish at the end of the memory access.

The routine updates this.procTime with the latency, memory access setup time, and word count, which gives us the internal clock time (in word-times) when the access should finish. It also stashes a reference to the successor function in a global property (this.successor) for use by the finish function (which is this.readMemoryFinish when called by readMemory) after the delay is complete. Finally it calls setCallback(), discussed in more detail below, which will call setTimeout() to initiate the necessary delay.

That delay should be the difference between the value of the internal clock when the emulated memory access starts, and the value of real-world time when the physical memory access finishes. Since the emulator models the drum internally as a set of memory arrays, the physical time to load the desired word into D is negligible, and in fact was already done in readMemory(). Thus the delay is simply the difference between the internal clock and real-world time. Dividing by the value D205Processor.wordsPerMilli converts word-times to milliseconds, the unit required by setCallback() and setTimeout().

Javascript has a Date object from which you can get real-world time, but instantiating that object every time you need to have the current time results in a lot of overhead. There is a special timer designed for performance measurement, though, performance.now(), that's both efficient and more granular than the Date object, so I use that for all timing within the emulator. It returns a floating-point number in milliseconds.

Managing the Javascript Delay Mechanism

I have gone through this rather long-winded explanation of memory access and timing partly to give you an idea of what's involved in regulating performance in the emulator, but also to talk about the role of setCallback().

The setCallback() function is something I originally developed for the retro-B5500 emulator. Its main purpose is to compensate for two nasty characteristics of the standard Javascript setTimeout() function:

As discussed above, the delay you get using setTimeout() is always at least as long as the one you requested, and may be much longer, depending what else is happening in the browser at the time.
The minimum amount of time you can delay in modern web browsers is about 4ms. If you specify a smaller delay than that, it is treated as if you had specified 4ms.

The reason for the 4ms limit is that there are a number of older, poorly-behaving web scripts out there (many of them games) that are designed to run as fast as they can with no delays at all if they can get away with it. These scripts still need to exit back to the Javascript run-time periodically, however, and some years ago you could specify a delay of zero for setTimeout() and actually give up control for a minimal amount of time. In operating systems, this brief release of control is termed a yield.

The problem with this yield technique is that it tends to suck all the juice out of laptop and mobile device batteries in short order. So the browser developers instituted minimum delays (originally on the order of 10ms) to keep users from blaming the browsers for their dead batteries. The web standards now require the minimum delay a browser must support to be no greater than 4ms, and as a result, all the major browsers now use 4ms as a minimum.

That standard 4ms minimum on delay time is not an issue for timing access to main memory, which averages 8.4ms. For loop access, though, which averages 0.84ms, the 4ms limit is a serious problem. The emulator often needs to delay for fractions of a millisecond, but there is no way setTimeout() will do that.

I had a similar problem, although for a different reason, when working on the B5500 emulator, so in 2013 went looking for a solution. It turns out that there is no finer-grained substitute for setTimeout(), but there are a couple of ways that you can give up control and have the run-time call a function you provide as soon as there is nothing else ahead of you in the queue, which is exactly what people were trying to do by passing a zero delay to setTimeout().

Browsers have a mechanism that can be used to pass messages asynchronously between windows, implemented using the postMessage() function for sending and the onMessage event for receiving. It also turns out that a window can send a message to itself, which is something I would never have thought of doing, but this is why we have Google Search -- to discover things we can't think up ourselves. Passing a message to yourself effectively produces a yield, and the event handler you supply for onMessage is simply the call-back function for that yield.

So having the setTimeout() mechanism that will delay for at least 4ms, and the postMessage/onMessage one that will delay for a hopefully very small, but unpredictable amount -- how do we generate a delay of 0.84ms?

The answer is that in individual cases we don't. Instead, we play the averages.

My setCallback() function has the same overall purpose as setTimeout() -- you call it passing a requested delay in milliseconds and a function to be called after that delay expires. If the requested delay is more than 4ms, setCallback() calls setTimeout(), which will always result in a delay that's at least a little longer than we asked for. If the delay is less than 4ms, setCallback() will call postMessage(), which should usually result in a delay that is shorter than we asked for. setCallback() passes its own call-back function to both of these mechanisms as a proxy, and when either the delay expires or the message event occurs, that proxy function measures the amount of time that actually elapsed. The proxy then executes the original call-back function passed to setCallback().

Subtracting actual elapsed time from the delay originally requested gives a deviation -- how too much ahead or behind schedule the delay completed. setCallback() keeps a running total of those positive and negative deviations, and on the next call to it, attempts to apply a portion of any accumulated deviation to the new requested delay. If the accumulated deviation is positive (actual delays are running longer than requested), it attempts to reduce the new requested delay. If the accumulated deviation is negative (actual delays are running shorter than requested), it attempts to increase the new requested delay. Then that adjusted delay is compared to 4ms and used to decide whether to call setTimeout() or postMessage(). In this way, the mechanism compensates for deviations in the individual delays, and the deviations tend to average out. Or that's the hope, anyway.

There is one more twist to this mechanism. I have been discussing delays in the context of timing memory access, but the emulator has other areas where it needs to introduce a delay, primarily in I/O. Because the different areas where the throttling mechanism is used have different delay lengths and patterns, I didn't want to lump them all together for the purpose of adjusting deviations. Thus the delays are divided into categories, and a separate cumulative deviation is maintained for each one. The cbCategory variable mentioned back in the discussion of readMemory() is a short string that identifies the category on a given setCallback() call. "MEMM" refers to the category for main memory accesses and "MEML" refers to the category for loop memory accesses. Each I/O device has its own category as well.

With this background on emulator timing and performance throttling, we can now discuss the problems the retro-205 emulator was having with its timing and what we did about it.

Analyzing and Correcting the Timing

When Tom pointed out that his programs were running faster than expected, there were three areas I thought could be prime suspects for the timing disparity. One was the way the emulator was accumulating internal word-time counts for instructions as they were executed. Another was the accuracy of the delays being computed for memory access and performance throttling. In the back of my mind, I was also wondering if the setCallback() mechanism, and especially the deviation adjustment portion of that mechanism, was working the way it needed to. It turned out all three were contributing to the problem.

I completely reviewed the way the emulator was accumulating word-times for each instruction, comparing that to some timing notes that Prof. Donald Knuth had kindly sent me from his files a couple of years ago. I found lots of discrepancies, so adjusted the word-times for many instructions, often upward by three or more. This slows the emulation of those individual instructions.

Next, the drum latency calculations appeared to be correct, but I realized in researching the timing of memory accesses in TM 4001, the 205 computer training manual, that I had not been accounting for the necessary time the processor needs to set up its internal logic for the access. This amounts to 2-3 word-times per access, and after some experiments I settled on the value 2. This is reflected in the B205Processor.memSetupTime field mentioned above, and also slows the speed of emulation somewhat.

Finally, I turned to the setCallback() mechanism. You might think that an algorithm for adjusting requested delays based on deviations accumulated from prior delays would be straightforward, but it hasn't been for me. I've completely rewritten that adjustment algorithm at least four or five times over the years, and still had my doubts about it.

I was finding it difficult to analyze the algorithm's behavior from inspection of the code alone, so I modeled it in a spreadsheet, and found that the adjustment mechanism was not as stable and well-behaved as I had thought. In fact, it was embarrassingly easy to come up with patterns of delay that would cause the adjustment mechanism to go completely unstable. So I sat down and really thought through the problem this time, came up with a completely new approach to calculating adjustments for requested delays, and modeled that until I had it working smoothly. This change did not actually correct the emulator timing per se, but it did eliminate the runaway instabilities and make the emulator timing more consistent over longer periods.

After making these changes, Tom and I independently reran the programs we had analyzed earlier to obtain theoretical run times. We found that the emulator was now generally operating within 2-3% of those theoretical times. For an emulator that operates internally at the instruction level rather than the clock level, I am quite gratified that we got that close. Flushed with this success, we declared victory over the problem of emulator timing accuracy.

Then Tom reminded me that there was this other program...

The Remaining Timing Problem

In the back of Prof. Knuth's copy of the 1956 Burroughs 205 Central Computer Handbook he donated to the Computer History Museum, there is a strip of paper that has been pasted onto the page. The strip has four lines that appear to have been printed by an IBM 407 tabulator, which read:

777 FAST DRUM ZERO. USES 4000 LOOP. ZEROES 6000-6019 AND 0000-3999 IN 1.1 SEC.
666608104D4024000002040260000064403500000224017100000260100000072402740810444016
6666081044403010000B601001000026006010000260020000007240350000012403500000754034
66660000204016000000039000000000010000810441000000002240241000026018010000260140

This is a 205 machine-language program in card-load format that, as the first line indicates, clears the 205 main memory in 1.1 seconds. It turns out that Prof. Knuth himself wrote this program ca. 1960.

I had discovered this piece of code early in my research into the 205 before beginning work on the emulator. Largely to better understand how the 205 worked, I manually disassembled the code and did a detailed analysis of it. Since this was a program for which we had a reported execution time, Tom suggested we should see how long it actually takes to run in the emulator with its corrected timing.

The program runs so quickly that it's difficult to time it manually, but the number I came up with using the Firefox browser was 3.2 seconds, almost three times longer than its reported run time. So much for the emulator timing being accurate to within 2-3%. That surprisingly slow performance launched an effort that eventually took me more time and thought than the original timing analysis and corrections had.

The first question to resolve was whether the reported 1.1 second run time was realistic. As a basis for comparison, I constructed a test that cleared the 4000 words of 205 main memory in a straightforward way -- by storing individual words in a loop. The following code illustrates the technique:

0050  0 0000 02 3999      STC  3999    clear A
0051  0 0000 72 0050      SB   0050    set B to 3999
0052  0 0000 30 0053      CUB  0052    block and execute in 7000 loop
0053  1 0000 02 0000      STC  0000    zero word a B+0
0054  0 0000 22 7053      DB   7053    decrement B and loop if B >= 0
0055  0 9999 08 9999      STOP 9999    halt

This code takes about 69 seconds in the current retro-205 emulator under Firefox. That is because accessing successive locations in main memory requires a full drum rotation of 16.8ms between accesses. Just multiplying 4000 times 0.0168 is 67.2 seconds, so this test generates a reasonable result. Reducing the time of this straightforward approach by a factor of more than 60 would be quite an accomplishment.

Prof. Knuth’s program works by taking advantage of the block-transfer instructions in the 205. These instructions transfer 20 words to or from main memory and one of the loops in one drum revolution plus the latency necessary to reach the starting address in main memory. The program initializes the 6000 loop to zeroes, then does five successive BF6 (Block From 6000 Loop to Main Memory) instructions to clear 100 words, iterating and decrementing the B register to step through all 4000 words in 40 increments. The really clever thing the program does is to modify the successive BF6 addresses by 40 instead of 20, allowing the fetch and setup of the next BF6 to occur during the intervening rotation of the drum, a technique known as interleaving.

To confirm the reported time, I took the disassembled code for the program, and once again referring to Prof. Knuth's timing notes, attempted to determine not only the amount of time each instruction should be expected to take, but also how the drum position changed with each instruction, and therefore the amount of drum latency and transfer time to expect for each memory access the program did. That analysis showed that one iteration clearing 100 words should take a total of 298 word-times, or 25.04ms. 40 such iterations should therefore take about 1002ms. Adding the time to initialize the program and then the 6000 loop to zero, the total time should be very close to 1.1 seconds. Thus, I had to accept that the reported time for the program was correct.

The next question was then, why did the emulator take three times longer than that? I added a temporary trace to the emulator to record real-world times taken by the individual instructions as they executed in that program. This showed that the revised setCallback() mechanism was doing a superb job of regulating the adjusted delays, but every couple of hundred milliseconds there is a large deviation averaging about 100ms. I suspect that this was caused by Firefox inserting its background tasks into the run-time queue and taking precedence over the emulator's execution. To compensate for this, the setCallback() mechanism was adjusting delays for emulated instructions downward, so that they were executing in about half their theoretical execution time. That is probably about as fast as the emulator could go. Performance under Google Chrome is better than under Firefox, but only by about 10%.

I finally concluded that this was a case where the emulator simply could not adjust the delays low enough, and consequently could not keep up with real-world time. The design of the performance throttling mechanism was predicated not only on introducing throttling delays during drum latency, but also on using those delays to absorb the overhead of browser background activities. Alas, Prof. Knuth's coding of the program so efficiently minimized drum latency that there isn't enough latency left to absorb all of that overhead. Thus the emulation of the program ran too slowly.

The final question was what to do about this situation. One possibility would be to use an approach to throttling performance similar to the one I use with the retro-B5500 and retro-220 emulators. Those systems had core memories, with memory access latencies less than one-thousandth that of the 205. Throttling in those emulators works by allowing the processor to run unthrottled for a time slice equivalent to several milliseconds of emulated time, then comparing the internal clock to real-world time and pausing execution for the difference to obtain the appropriate average performance.

This approach is eminently feasible for retro-205, and would probably require at most several days of work to implement. If employed in the 205 emulator, it would serve to aggregate multiple drum latencies and per-access throttling delays into one larger delay. That would result in considerably lower overhead within the emulator for memory accesses, and may free up sufficient time for the browser overhead functions to execute without impacting the overall accuracy of emulator performance. Then again, it may not.

In the end, Tom and I decided to do nothing, at least for now. The current throttling mechanism appears to provide quite accurate timing for all of the other processor-intensive programs we have tested, and was defeated only by one very small, very carefully-crafted program that took maximum advantage of drum latencies overlapping instruction execution to minimize total run time.

Besides, clearing memory is something you normally do only once before loading a program for execution. Having to wait an addition two seconds for that is not exactly a crisis.

Wednesday, August 23, 2017

BAMCO Lives!

When W. S. Burroughs and his three moneyed associates formed a company in St. Louis in 1886 to market Burroughs' invention, they named it the American Arithmometer Company. This turned out to be a poor choice of name since few people could successfully spell (or even pronounce) "Arithmometer."

In 1905, after relocating to Detroit, the company was renamed as Burroughs Adding Machine Company a name that lasted until 1953 when the name became The Burroughs Corporation. Burroughs Adding Machine Company was easily and frequently abbreviated to BAMCO internally. So imagine my surprise upon seeing this T-shirt on a server at McDonalds in southern Minnesota.

Apparently the name is now used by BAMCO, Inc., the local McDonalds franchisee in Mankato. I don't know if they have battled for the legal rights with other folks like the Bon Appetite Management Company.

Similar name confusion exists with the name Datatron which now is usually associated with a game in the form of a card-based game, Morphtronic Datatron.

Likewise, ElectroData searches will usually lead to the Electrodata company a manufacturer of communications testing equipment.

Tuesday, August 1, 2017

Research Is Hard Work - and it Takes a Lot of Luck

When I put together my Datatron Maintenance Page on the Burroughs 205/Datatron site, I mentioned Bill Oller, the Pacific Power & Light maintenance engineer. Bill gave me my first "computer job" in the fall of 1963 - testing a couple of hundred tubes that he had ordered for his preventive maintenance routine on the University of Portland 205.

I wished I had a picture of Bill to include on that page and spent a good deal of time trying to locate Bill or his family all to no avail. (It turns out that Bill passed away in 2002 in the Coeur dAlene Idaho area). I wasn't even sure of the correct spelling of his last name. But as luck would have it, today I came across a picture of Bill and the PP&L tape subsystem.

I was browsing The Oregonian website for some family history information. The Oregonian does not participate in the major newspaper archive sites that are so easily searched and show up in response to Google searches; they have their own archive with a rather clumsy interface and sell access by the day for low volume users such as I. After finding my relative's obituary, I did some computer history searches looking for the words "computer" and "bank." This turned up some useful history of a pioneering U.S. Bank project to automate account posting.

From January of 1957 in The Oregonian:

Announcement that the U. S. National was making use of the electronic computer on an experimental basis in connection with the operations of its Sheridan branch was made in The Oregonian last November. At that time it was believed to he the first bank in the world to start posting its checking accounts with an electronic brain.

John N. Raleigh of the bank's "methods" department had developed a branch posting system on an IBM 650 - he called it SONIA. (System of Numeric Integrated Accounting, likely inspired by BofA's ERMA acronym) SONIA was good for the bank and very good for Raleigh - he was later to become president of the San Fernando Valley Bank.

As I later read a 1960 article that also mentioned the bank system, my eye was drawn to two accompanying photographs. My eye naturally went first to the five young ladies in bathing suits who were marveling at the the IBM RAMAC disk drive. Have you noticed that the 1950s computers always featured an attractive damsel at the console? Jantzen, the Portland swimsuit manufacturer, went them one - no, five better! Jantzen's was most likely the most popular booth at the Business Machine Show.

But there, just below the ladies and the RAMAC, stands Bill Oller with his hands on a Datatron pluggable unit in the Tape Controller. I now have my picture for the maintenance page.

The Jantzen picture will doubtless reappear on a disc storage page on the site in due time.

Saturday, April 1, 2017

Emulator Version 1.00 and the DataFile

Development of the retro-205 emulator for the ElectroData/Burroughs Datatron 205 computer system is now functionally complete. That is not to say that it is finished. As a friend used to say, programs are never finished -- they merely become obsolete.

This is Paul Kimpel with another guest post on Tom's 205 and 220 blog. It is time to drag the emulator out of its 0.xx development era, so I am pleased to announce that the version just released has been labeled 1.00. Rather than have one monster post that describes all of the changes in this release, this post will discuss the implementation of the final piece of the 205 hardware, a magnetic tape device known as the DataFile. It will also describe a few bugs and enhancements that have been addressed. A second post will discuss the other major improvement in this release -- emulator timing.

The DataFile

The Datatron 205 supported a relatively typical, reel-to-reel magnetic tape drive, known as the Model 544 DataReader. To understand the DataFile, let us start with a description of the DataReader.

Overview of the DataReader

The DataReader was connected to the 205 processor through the Model 543 Tape Control Unit. It used 2500-foot reels of 3/4-inch magnetic tape. The tape was recorded in two parallel lanes of six tracks each. Four tracks were used to record BCD numeric data, the fifth track was even parity, and the sixth track signaled the start of a block. Blocks were a fixed 20 words each, matching the size of the high-speed loops on the 205's memory drum. The tape moved at 60 inches/second, with data recorded at 100 bits/inch. Total length of a block, including inter-block gap and a block header, was 2.78 inches. Total capacity of a reel of tape was 10,000 blocks per lane.

Blocks in a lane were numbered from 0 through 9999. Individual blocks could be read and written, and since they were of fixed length, and even reliably overwritten -- something that could not be done with the IBM recording format that later became the de facto standard. The 205 processor had a tape search instruction that specified the lane and block number sought. Once initiated, the Tape Control Unit conducted the search either forwards or backwards, asynchronously and independently of the processor.

This was in the pre-disk era of computing, and much data processing was done using sequential file techniques. The asynchronous search, dual lane, and overwrite features gave the 205 some interesting advantages in applications that had to maintain large data sets on magnetic tape. A number of clever indexing and lane-switching techniques were developed to take advantage of them, some of which are described in Bulletin 3024, The Magnetic Tape System Handbook.

Even with these advantages, tape operations were still very slow. Average time for the DataReader to locate a random block on a full reel of tape was almost four minutes. Disk technology was on the horizon, but very expensive -- IBM was to introduce its 305 RAMAC system in September 1956, shortly after Burroughs acquired ElectroData. The RAMAC disk boasted an access time of 600ms, but a total capacity of only five million digits, hardly more than the 4.4 million digits on a single reel of 205 tape.

The DataFile Compromise

Into this situation, ca. 1954, stepped Duncan MacDonald, one of the lead designers at ElectroData. Realizing that the search would be faster on a shorter tape, and that data capacity was proportional to the total length of tape, he came up with a mechanism that used shorter tapes, but lots of them. The result was the Model 560 DataFile, for which MacDonald was awarded U.S. Patent 2,914,752 in 1959.

The DataFile was housed in a long, waist-high cabinet with a hinged cover, leading to the enclosure being nicknamed "the coffin." This image from Tom Sawyer's web site (and originally from the Charles Babbage Institute) shows the unit with its cover open:

Model 560 DataFile Unit

Inside the enclosure were a set of horizontal rods, over which were draped 50 tapes. Underneath the rod assembly was a moving carriage with pinch rollers and two tape heads. The carriage could traverse under the tapes and stop at any one of them, latching into position and allowing the pinch rollers to engage with a rotating capstan to drive that tape forward or backward. The time to traverse and latch into position varied linearly from about 0.5 to 2.0 seconds, depending on the distance traversed. Thus the carriage could relatively quickly select any one of the 50 tapes and read or write it.

Below the rods the cabinet was partitioned into 50 parallel bins, each wide enough for a tape, with perforated walls separating the tapes. The tapes fell loosely into these bins -- they were not wound on reels or hubs. The perforated bin walls allowed air to enter and exit the bins as the carriage drove a tape back and forth, causing the tape strip to rise or fall within the bin.

The tapes, tape heads, and recording format used with the DataFile were exactly the same as the reel-to-reel DataReader, except that the individual tapes were 250 feet long instead of 2500 feet. Each tape still had two lanes, but only 1000 blocks per lane. Tape speed, recording density, block size, and the overwrite and asynchronous search features were all the same as for the DataReader.

The DataFile used the same Tape Control Unit, which could support any combination of up to ten DataReaders and DataFiles per 205 system. One DataFile could store 50×2×1000 = 100,000 blocks or 22,000,000 digits, and a fully-configured system up to 220,000,000 digits. This was mass storage, ca. 1955.

Unlike the DataReader, the individual tapes in the DataFile were not removable, except for servicing and replacement when they became worn. Thus, the DataFile acted much more like a disk unit than a tape drive, albeit a disk unit with a rather severe latency problem. Burroughs quoted the average time to access a random block within a tape as 15.3 seconds, and an average of 16.3 seconds if the carriage had to move. This was a lot better than four minutes for the DataReader, but a lot worse than 600ms for the RAMAC, once it came along. The DataFile was thus a compromise design, filling a niche in the time between tape- and disk-based eras of data processing.

As you might expect of a device with a heavy carriage that (a) moved quickly when traversing, and (b) was surrounded by flimsy tapes, the DataFile could be high-maintenance, to the point where its less-charitable nickname was "the DataFail." In addition to its mechanical complexity, the tapes were subject to wear, and if left motionless in their bins too long, could settle and crease, causing data errors. Standard operating procedure called for all tapes to be moved from end-to-end at least weekly to avoid this, a process that could be done off-line from the unit's control panel.

Programming for the DataFile was similar to that for a DataReader. The only significant differences were:

A DataFile unit had 100 lanes instead of two.
Individual lanes held one-tenth the number of blocks compared to the DataReader.
The DataFile did not respond to tape rewind commands. To position a tape to its beginning, you did a tape search for block 0000 on an appropriate lane.

The tape instruction formats for the DataFile were the same as for the DataReader, with one exception in the way lanes were selected by a Magnetic Tape Search (MTS, 42) operation. The format of that instruction was:

s hhup 42 aaaa

where s is the sign digit, hh specifies the lane number, u is the tape drive unit number, p is the standard breakpoint digit, and aaaa is the address field, which for this instruction specified the block number to which the tape should be moved. If the low-order bit of the sign digit was set, normal B-register modification would occur, and the effective block number would be the sum of the contents of the B register and the address field.

When this instruction was used with a DataReader, only the low-order bit of the hh field was significant, as there were only two lanes to select between. When used with the DataFile, the full two-digit lane number (00-99) was significant.

The retro-205 DataFile Implementation

Even though the recording and instruction formats for the DataFile were the same as the DataReader, implementing the DataFile in the emulator posed some issues. The major one was persistence of the data. Since the tapes are not removable, their data cannot be stored outside the emulator in ordinary files, the way that tape image files used with the DataReader are. The emulator needed to store the tape data internally in a way that would survive across emulator sessions and workstation restarts.

The solution was to accept that the DataFile is more like a disk unit, and to apply the same approach I used to implement Head-per-Track disk units for the retro-B5500 emulator. Modern web browsers support a lightweight data base API known as IndexedDB. This is not a relational data base, but instead stores Javascript objects. The objects are stored in one or more structures termed object stores, which are similar in concept to tables in a relational data base. The objects within a store may be indexed by properties of the objects themselves, or by external values supplied by the application.

To support the DataFile, the retro-205 emulator uses a single IndexedDB data base, named "D205DataFileDB". This data base is created automatically the first time that a DataFile is added to the system configuration. Each "physical" DataFile unit, lettered A through J in the configuration, has a separate object store in this data base. That store is permanently associated with that physical device letter. If you reconfigure that device letter to be a DataReader, the store will continue to exist, but be hidden. If you later configure the device letter back to a DataFile, the data in the store will again be available. Note that the unit designation numbers (1 to 0 [for unit 10]), selected on the drive panels and used by the software to address tape units, can be changed at any time without affecting the object stores.

Within each object store for a DataFile unit, the data is stored as objects that are 20-word arrays. Each object corresponds to one block on one of the tape strips for the unit. These objects are indexed by a number computed as (lane number)×1000 + (block number), where the lane number ranges from 0-99 and the block number from 0-999 within a lane. While internally the block objects can be accessed quickly and randomly, the emulator executes the necessary delays to model the timing of the DataFile device.

Once the IndexedDB data base is created, it resides permanently on your workstation's local disk. The emulator does not provide a way to remove it, although most browsers allow you to inspect and remove these data bases through their "developer tools" interface. If you remove the data base for the DataFile, it will simply be recreated again the next time you run the emulator with a DataFile present in its configuration. Any former data for the data base will have been lost, of course.

Different browsers implement IndexedDB data bases in different ways. For example, Firefox uses SQLite as the underlying storage engine, while Google Chrome uses LevelDB, a derivative of its BigTable technology. Therefore, you will not be able to share the data base for a DataFile between browsers.

In addition, these data bases are subject to the browser's "same-origin" policy, meaning that resources associated with one web site generally cannot be accessed by resources from some other web site. The combination of scheme (http/https), host name, and port in a site's URL is used in determining same-origin access. For example, you can access the emulator's hosting site as http://www.phkimpel.us/ElectroData-205/ or as http://phkimpel.us/ElectroData-205/ -- both URLs refer to the same physical files on the same physical web server, but to your browser they are different origins, and consequently will use different IndexedDB instances.

Using the retro-205 DataFile

As with all peripheral devices configured in the emulator, each DataFile unit will display a separate window on your screen, similar to this:

retro-205 DataFile Device Window

The controls on this window are similar to those for the DataReader. The area below the controls shows a representation of the 50 tape bins. The brown bars show the relative position of the each tape, with a zero-length bar representing the tape in a rewound (block number 0000) position. Below the bins is a representation of the carriage and tape heads, which move in response to lane selection by tape search instructions. The currently-selected bin is highlighted in yellow.

When the emulator initializes, each DataFile in the configuration is opened with its tapes in random positions. The carriage is initially positioned at tape 25, lane 50.

The DESIGNATE pull-down list selects the unit number that will be recognized by magnetic tape instructions. The REMOTE/LOCAL and NOT WRITE/WRITE switches function the same as for a DataReader. The unit must be in Remote (ready) status in order to be addressed by software. Setting the NOT WRITE switch prevents writing to any of the tapes, but the tapes will still move in response to write instructions. A real DataFile had a set of individual write lock-out tabs for each tape, but this has not yet been implemented in the emulator.

The REWIND and STOP REWIND buttons are always active, regardless of the setting of the REMOTE/LOCAL switch. Clicking REWIND will initiate a rewind of all tape strips, in the manner it worked on a real DataFile. The carriage will move to the left-most tape (for lanes 0/1) and rewind it. Then it will move to the right-most tape (for lanes 98/99), rewinding each tape in reverse sequence until it again reaches the left-most tape. Clicking STOP REWIND will interrupt the process and return the carriage to the left-most tape.

A full rewind on a real DataFile would take an average of more than 20 minutes, depending on the positions of the tape strips. In implementing the rewind function in the emulator, reproducing that particular timing behavior of a real system did not seem necessary, so the emulator presently rewinds DataFile tapes at ten times the speed of a real unit -- averaging a little over two minutes to complete.

Applications for the DataFile

The DataFile is useful for storing large programs and any data that you want to persist across emulator sessions. The Shell Assembler was specifically designed to work with the DataFile, and Donald Knuth's Algol-58 compiler works well from it, too. One issue is that there is no built-in way to load data into a DataFile -- you must write a program to do that.

Here is a small card-load program that will load the Shell Assembler to lane 89 on a DataFile designated as unit 0 (i.e., 10) from lane 1 of a DataReader designated as unit 1:

https://github.com/pkimpel/retro-205/blob/master/software/Shell-Assembler/Shell-Utilities/Shell-Load-DataFile.card

The tape image used by this program can be downloaded from:

https://github.com/pkimpel/retro-205/blob/master/software/Shell-Assembler/Shell-Assembler-Load.tape

Once loaded to the DataFile, the assembler can be executed using this small card-load program:

https://github.com/pkimpel/retro-205/blob/master/software/Shell-Assembler/Shell-Assembler-Bootstrap.card

Bug Fixes and Minor Enhancements

The 205 floating-point implementation has a nice trick -- you can discard the fractional part of a number by adding the floating-point value 5800000000 (i.e., 0.0 × 10⁸) to it. Algebraically this is meaningless, but the way that floating-point addition was mechanized caused the augend value in the A register to be scaled (shifted right) during the alignment of exponents that took place before the addition. Any fractional digits in the value would shift off the low-order end of the number and be lost. The normalization (shifting left) that occurs after the addition would then put the remaining digits back in their original positions. Alas, this was not working in the emulator. The problem was that I was normalizing the addend value in the D register before scaling the A register. By reversing the order of scaling and normalization, that trick now works properly.

Tom Sawyer discovered that the tape drives were not properly transferring data when the operand address was not aligned on a mod-20 boundary. The problem had to do with the way that tape reads and writes are buffered in the 4000 and 5000 high-speed loops on the memory drum. The word at the mod-20 memory address was being mapped to the first word in the tape block, rather than the word at the operand address being first in the tape block. This has been corrected.

Tom also discovered that the way the emulator was interpreting the breakpoint digit in instruction words was not correct.

The low-order three bits of that digit are a mask. If any of the 1-bits in that mask match the setting of the breakpoint switch on the Control Console, the processor will halt with a BKPT alarm after the execution of that instruction.
The high-order bit in the breakpoint digit is used with the SKIP switch. If that switch is on and the high-order bit of the breakpoint digit is a 1, the processor will skip that instruction. It does so by remaining in the fetch state rather than switching to the execute state, allowing the next instruction to be fetched immediately.

The problem occurred when both the breakpoint and skip conditions were true for the same instruction. In this case, the instruction should not be executed and the processor should stop at the end of the fetch cycle for that instruction. Instead, it was doing this only if the low-order bit of the breakpoint digit was a 1, not if the breakpoint switch matched one of the 1-bits in the breakpoint mask. This has also been corrected.

One of the items in the material Prof. Knuth sent me last year was ElectroData Technical Newsletter #5, dated February 14, 1958. This document described how all 100 two-digit character codes mapped to glyphs on the Flexowriter and Cardatron output devices. Character translation in the emulator for these devices has been changed to match the information in that document.

Input translation from ASCII to internal 205 code for card devices has been altered slightly in order to improve support for Cardatron reload lockout. In the Cardatron, reload lockout could be imposed by a multi-punch combination in the column that selected the format band. A numeric punch (1-7) selected the format band as usual; an 8-punch signaled reload lockout. Trying to simulate this with ASCII character codes is a little awkward, but here is the current arrangement:

the grave accent (`), a 1-8 punch, now maps to the digit 1;
the colon (:), a 2-8 punch, now maps to the digit 2;
the "#" is a 3-8 punch;
the "@" is a 4-8 punch;
the characters "|", "}", "~" now arbitrarily map to the digits 5, 6, 7, representing 5-8, 6-8, and 7-8 punches, respectively ;
the double-quote ("), a 7-8 punch, now maps to the digit 7;
the apostrophe (') now maps to the "@" but is treated as a 5-8 punch for reload lockout.

On the Flexowriter, clicking the RESET button on the Typewriter Control Unit panel will now output a new-line sequence if the carriage is not at the left margin. Previously, there was no way to reset the carriage if the computer stopped in the middle of a line.

Please stay tuned for the second post on improvements to emulator instruction timing.

Tuesday, February 28, 2017

Gloria Bullock

This is Tom Sawyer again, noting a new page added to my Burroughs 205 history website. There are many interesting people and personalities associated with the Datatron and one that has long interested me is Gloria Bullock. Gloria shows up early in the organization - even before there was an ElectroData. She is found in the Mathematics section of the fledgling organization, but there are references to her being involved with training too.

When I came across a 1959 Los Angeles Times story noting an Urban League award presented to Bullock by baseball star Roy Campenella, I knew this was someone I wanted to learn more about. I was a big "Campy" fan. The California Eagle characterized the award as being "for her part in perfecting computing machinery."

Last summer, I finally came across an internal Burroughs publication from 1957 that designated Bullock as the first Datatron programmer. With the recent publicity surrounding the movie, "Hidden Figures," I thought it might be time to put together Gloria Bullock's chapter in the Datatron story. (The book is much better than the movie, in my opinion.) The result is this page on the 205 history website.

My one regret is that I have not been able to close out the story beyond about 1977 when Gloria was living in Palo Alto near her younger sister in Moss Beach. Perhaps, as is sometimes the case, a mention on the Internet might turn up someone who knows the answer.

Revisions to the Blog Format

This is Tom Sawyer, posting to the Datatron blog after a long absence.

This blog has been very difficult to read from the beginning but the problem became terribly apparent after Paul Kimpel's lengthy posts about the 205emulator. I finally bit the bullet and made revisions to the layout a few days ago and believe we now have a much more readable blog.

I will attempt to make more frequent contributions of a historic nature as I continue to add to my ElectroData / Burroughs history website and I am certain that Paul will be keeping us up to date as he moves forward with his next project, the Burroughs 220 emulator.

The Burroughs 205 and 220 Blog

Welcome to the Datatron 205 and 220 Blog