Welcome to the Datatron 205 and 220 Blog

This blog is a companion to T J Sawyer's Web Page that outlines the history of the Burroughs Datatron 205 and Paul Kimpel's incredible 205 and 220 emulators. Please visit those sites and return here to post any comments.

Sunday, March 11, 2018

retro-205 Emulator Version 1.01 Released

Most of the development work that has been going on behind this blog over the past year has been devoted to the Burroughs 220 and its retro-220 emulator. Nonetheless, I have continued to enhance and correct the retro-205 emulator for the ElectroData/Burroughs Datatron 205 computer system. We have now accumulated sufficient changes to this emulator to justify releasing an update, version 1.01.

This is Paul Kimpel with another guest post on Tom's 205 and 220 blog. Version 1.01 consists of bug fixes, internal enhancements, and a few new features that are externally visible. This post will describe those changes.

New Features

Power-Off Behavior

The thing that everyone who uses the retro-205 emulator needs to know about this release is that the way you "power off" and shut down the emulator has changed. Previously, a single click of the red OFF button on the Supervisory Panel would terminate the emulator and close all of its windows, leaving only the original browser window with the home page open on your screen.

Now, the OFF button requires a double-click to shut down the emulator. If you hover over the button with your cursor, a tool-tip will pop up to remind you. This behavior has been adopted from the retro-220 emulator, where the power-off and system-clear buttons are close together. I was frequently clicking power-off when I intended to click the clear button, so added the double-click as a safety measure. It seems reasonable to make this behavior consistent between the two emulators.

Memory Dump Generation

The emulator now supports a memory-dump capability. On the Supervisory Panel, below the D register, is a section labeled TEST SYSTEM. The controls in this section have been non-functional in the emulator, as they affected the 205 electronics at a level below which the emulator operates.

Supervisory Panel TEST SYSTEM section

Now, however, clicking the SINGLE PULSE button in the upper-right corner of this section will cause the emulator to open a sub-window and format the current processor state and memory contents to that window. You can generate this dump at any time, even while the emulator is running a program. It will show an instantaneous snapshot of the system state at that point in time.

SINGLE-PULSE Button Memory Dump

From this window, you can copy, save, or print the dump using the standard facilities in your browser for doing so. You can click the button multiple times to generate multiple dumps, each in its own window. When you are finished with a dump, simply close its window.

Apple Safari Pop-up Restrictions

This next item could be qualified as a bug fix, but it's really a new feature. The emulator had been operating quite well in the three major modern workstation browsers -- Google Chrome, Mozilla Firefox, and Apple Safari. I recently updated my MacBook Air to macOS 10.13 (High Sierra), and with that update came Safari 11.0.

Trying to run the retro-220 emulator in Safari after this update resulted in the emulator crashing after opening its first pop-up window. I quickly discovered that the retro-b5500 and retro-205 emulators suffered the same problem. I had Safari configured to allow pop-ups, so what gives?

Some research revealed that this new version of Safari places additional restrictions on pop-up windows. Specifically, it would always allow the first pop-up, but if subsequent pop-ups were opened before some amount of time had passed, Safari would inhibit the pop-up. Reports varied on the amount of time that had to elapse between opens, ranging upward from 500ms.

The emulator opens pop-ups for the Supervisory Panel, Control Console, and each of its peripheral devices. All of these elements are implemented as Javascript objects, and as part of their instantiation, each one opens, sizes, and positions its own pop-up window. Trying to insert appropriate delays in the initialization process that instantiates these objects was going to be difficult, so I decided instead to centralize the mechanism for opening pop-ups.

Now, each object calls a common "open pop-up" function in the D205Util.js module, passing parameters necessary to open and configure the window. In addition, each object passes a reference to its "onload" event handler that should be called once the browser opens the window and loads its initial content. These parameters are bundled into a data structure, and that structure is inserted into a queue of pop-up open requests. If the queue was previously empty, the open function calls another function to dequeue and process the first (and in this case, only) entry in the queue. If the queue was not previously empty, then processing of the queue is known to be on-going, and the open function simply exits.

The dequeue function plucks the first entry from the queue and uses the parameter values from the entry to perform a Javascript window open call. If that call fails (i.e., returns a null result), the dequeue function reinserts the failed entry back into the head of the queue, increments a timeout value, and using that updated timeout value, schedules itself for later execution using the Javascript setTimeout() function. When the timeout expires and calls the dequeue function, that function again tries to process the first entry in the queue. This process repeats, incrementing the timeout value at each iteration, until the window open is successful.

Once the dequeue function obtains a successful window-open result, it attaches the original caller's onload event handler to the window. After the browser finishes loading the window, it will call this handler, which will complete the initialization of the console or peripheral device object. Finally, and unless the queue is now empty, the dequeue function reschedules itself using the current timeout period to process the next entry in the queue.

This new mechanism works extremely well. It automatically adjusts the delay between opens based on the individual browser's behavior. The initial timeout value is 500ms and increments by 250ms on each iteration of a failed open. My tests indicate that Safari 11.0 starts making nice when the timeout reaches 1000ms. Chrome and Firefox both cruise through pop-up opens, because their open attempts do not fail; thus nothing ever gets queued, so the opens proceed without any delay.

Bug Fixes

 The following problems have been corrected in the emulator:
  1. Floating-point addition/subtraction was producing the wrong result if the mantissa of either operand was zero. Thanks to Tom Sawyer for finding and reporting this problem.
  2. During floating-point divide, the quotient mantissa was not being normalized properly.
  3. If a magnetic tape operation was initiated against a not-ready unit, the Processor's internal performance throttling mechanism could become corrupted, causing the emulator to run erratically.
  4. The Designated Unit lamps on Cardatron devices were not always being illuminated properly.
  5. Changes to the Format Select control on Cardatron input devices (card readers), an emulator feature that allows the band-select column on cards to be overridden, were not being detected properly. This usually rendered that control non-functional.
  6. Format-Lockout for Cardatron input devices was not working when there was a simulated 8-punch in the band selection column of the card. See the Cardatron wiki page for details on how simulated 8-punch coding is implemented in this emulator.
  7. Skip-to-channel carriage control on Cardatron printer devices did not always work properly, depending on the carriage control being applied on the next write to the device.
  8. Several 205 devices display output from the system in windows or frames, and allow you to save or print the data using standard browser controls. The common document used for these windows and frames contained extraneous white-space, which resulted in blank lines before and after the actual output. This extraneous white-space has been eliminated.

Internal Enhancements

Every release contains a number of minor tweaks and optimizations to improve the performance, reliability, or maintainability of the emulator. Sometimes there are new Javascript or DOM features that are mature enough to replace custom coding in the emulator that was doing the same or a similar thing. Many of these changes are too minor or obscure to mention, but all are visible in the source file "deltas" available on the project's repository site.

The following summarizes the most important of these internal enhancements for version 1.01:
  1. Binding an object context to a function is a common feature of Javascript object-oriented programming. The traditional way of doing this has been to wrap the subject function with another function that captures the context in a closure and then calls the subject function with the Function.apply() method. Recent versions of Javascript can now do this internally using the Function.bind() method. The emulator's custom bindMethod() utility functions have been replaced with this intrinsic capability.
  2. A common operation in web programming is to manipulate the list of CSS class names in a DOM element's className property. Traditionally this has been done using simple regular expressions, but the modern DOM treats the className property as a classList object, which has intrinsic methods to add, remove, and test for the presence of specific class names. These classList methods are now used in the emulator.
  3. Data transfers between memory and card devices are controlled by format band programming in the Cardatron. Because of this, the Cardatron controls how much data is transferred. Thus it must signal the Processor when the transfer is complete so that the Processor can terminate that I/O instruction and continue execution. The way this signal was being passed to the Processor during card-read operations was not particularly reliable, and has been improved.
  4. The setCallback() function used to manage asynchronous functions in the emulator seems to be forever undergoing refinement. Some additional tuning changes were made in this release. 
Work on the retro-205 emulator will continue in the background, and as accumulation of new enhancements and bug fixes warrants, I will be making new releases of the emulator available.

Wednesday, February 28, 2018

Introducing the Burroughs 220

The Burroughs 220 was a late-1950s decimal, vacuum-tube, core-memory computer system. It was the successor to the ElectroData/Burroughs Datatron 205 system. It shared many of the 205's characteristics while improving upon several others. It was also the first new system to be released under the Burroughs name after Burroughs acquired ElectroData in mid-1956.

The 220 was perhaps the last major vacuum-tube computer design to reach the market. It did so as IBM, Philco, and Univac were announcing, and in some cases delivering, transistor designs. Because of this, the 220 did not sell particularly well, but it was popular with universities and research establishments, and it laid groundwork for the much more successful Burroughs systems that followed it, particularly the B5500.

This blog was re-titled for the Burroughs 205 and 220 a year ago, and we haven't said a peep about the 220 -- until now. This is Paul Kimpel with our initial post on that computer system. It will introduce the architecture and components of the 220, compare them to the earlier 205, and give a brief overview of our project to build an emulator and recover software for the machine.

Genesis of the Burroughs 220

By the mid-1950s, the ElectroData 205 had become a successful product. It was a relatively low-cost computer system for its day, but due to its drum memory and digit-sequential internal operation, it was also relatively slow. It had robust magnetic tape and punched-card facilities, however, and even though it had originally been designed with scientific computation in mind, these input-output facilities gave it an entry into businesses with a need to process large repositories of data -- the field of "commercial" data processing.

ElectroData needed to pursue these commercial customers and workloads if it was to stay in the computer business, and for that they would need a new, larger, faster system. They saw an opportunity to produce a larger system at a lower price point than competing systems of comparable size. The problem was that they were extremely strapped for capital just to continue their current operations, let alone to design and build a new system. The situation was so acute that, by early 1956, ElectroData was seriously considering terminating operations and closing the business. It was precisely at that point that Burroughs, having significantly more capital, and looking for an acquisition to boost their attempts at entry into the commercial computer business, came calling.

Prior to this, ElectroData had studied the possibility of modifying the 205 to use magnetic core memory instead of a drum. That idea proved not to be viable, for both technical and economic reasons, but it did initiate some preliminary work on a design, known as Project 737, for a new system that would use core memory.

As ElectroData's acquisition by Burroughs was closing, this new design was chosen over the BEAM IV, a system designed under the direction of Issac Auerbach at the Burroughs Research Center in Paoli, Pennsylvania. Project 737 became the Burroughs 220. While Paoli had their BEAM IV canceled, they did get to build the core memory for the 220.

The 205 and 220 are sometimes referred to as the B205 and B220. This is incorrect. The Burroughs B200-series systems, which eventually expanded to the B100/200/300/500-series, came immediately after the 220, ca. 1960. They were small, transistorized, variable field-length, character-addressable computers. Architecturally, they were very different from the 205 and 220. They were, however, somewhat similar to -- and competed against -- the IBM 1401.

Since the 220 is an outgrowth of the 205, and retained a number of the 205's characteristics, let us first review the architecture of the 205 and then see how the 220 differed.

A Review of the ElectroData 205

The 205 was initially conceived at Consolidated Electrodynamics Corporation (CEC) as a digital calculator. CEC made mass spectrometers, and analysis of the data coming from these instruments requires solving systems of simultaneous linear equations. By the mid-1940s, CEC had built analog devices that could solve  systems with up to 12 unknowns, but to solve larger problems they needed to go digital.

CEC recognized early that the calculator would have additional potential if developed as a fully-programmable computer, so they expanded their scope and designed a system initially known as the CEC 30-201. The development proceeded through models -202 and -203, at which point CEC spun off the computer business as ElectoData, and the 30-203 became the Datatron 203. The 203 had only paper-tape input/output, plus a 10-character per second Flexowriter printer. Enhancements to support magnetic tape resulted in the Datatron 204, and finally adding the Cardatron punched-card interface resulted in the Datatron 205.

ElectroData/Burroughs 205 System

205 Drum Memory

The design of the 201 through 205 systems was oriented around their drum memory. Data was stored on the drum as 44-bit words, with each word organized as 11 four-bit BCD digits. The high order digit was used as the sign, with the low-order bit of that digit indicating positive (0) or negative (1). The other three bits of the sign digit were not generally defined for internal processor operations, but did have significance in some I/O operations. The remaining 10 digits of a word represented either a signed-magnitude decimal number or five alphanumeric characters. As an option, the later systems supported floating-point arithmetic, which used the high-order two digits of a word as an exponent biased by 50 and the low-order eight digits as a fractional mantissa. The sign digit retained its meaning in floating point as for integer values.

The memory drum was organized in 24 tracks of 200 words each. It rotated at 3750 RPM, so a full revolution took 16.8 milliseconds. Switching between tracks took negligible time, but the average latency until a desired word came under the read/write head was half a revolution, or 8.4ms. This was slow -- compare it to the IBM 650's drum, which rotated at 12,500 RPM. The choice of a slower drum was apparently made for reasons of cost and reliability.

205 Memory Drum

20 of the tracks on the drum formed the computer's main memory of 4000 words. To help compensate for the slow access time, the other four tracks were organized as recirculating delay lines, termed the high-speed loops, with the read and write heads separated by one-tenth of a revolution. This allowed storage of only one-tenth the number of words (20) per track, but with one-tenth the access time (0.84ms average). Special instructions could block 20 words in the main memory to or from one of the loops in a single drum rotation. Executing instructions and accessing data in the loops was almost ten times as fast as main memory. Doing so was highly desirable, but programming for it was complicated by the need to block and unblock words explicitly between the loops and main memory.

The drum was so central to the design of the system that even the clock signals for the electronics came from the it. Prerecorded tracks provided pulses for digit and word timing, plus a once-per-revolution signal to assure that the processor's address coincidence logic stayed in sync with the drum. These timing tracks gave the 205 an effective clock rate of 142.8 KHz.

205 Registers and Data Flow

The 205 had the following major registers:
  • A -- the accumulator.
  • R -- the "remainder" or accumulator extension of 10 digits. Digits could be shifted between this register and A, but R could not be loaded or stored directly. The Multiply instruction multiplied A by a value in memory, producing a 20-digit product in A and R. The Divide instruction divided the 20-digit value in A and R by a value in memory, leaving the quotient in A and the remainder in R. R was also used to store the return address in subroutine-branch instructions.
  • B -- the index register. The ElectroData 203/204/205 were the first commercial systems in the U.S. to offer this form of address abstraction, although the idea appears to have come from the UK Manchester Mark 1 by way of Harry Huskey, who consulted with the original CEC design team.
  • C -- the "command" register holding the current instruction. This 10-digit register held the two-digit operation code, the four-digit effective operand address, and the four-digit address of the next instruction to be executed. By default, the instruction address incremented automatically during fetch of an instruction into C. Branch instructions simply replaced this default address with the effective address from the branch instruction.
  • D -- the "data" register. It buffered a word coming from memory or a peripheral device.
Each track on the drum was recorded bit-parallel but digit-sequential. That is, each digit-wide logical track consisted of four bit-wide physical tracks. Data entered the processor from memory or the peripheral devices one digit at a time into the D register. Data exited the processor for storage in memory or writing to peripheral devices one digit at a time through the A register. Because words entered the processor low-order digit first, buffering an incoming word in this register also allowed the processor to examine the sign -- which arrived from the drum last -- before commencing an arithmetic operation.

Since the data came into and left the processor in a digit-sequential manner, all internal processing was done digit-sequential as well. The adder was only one digit wide, and all inter-register transfers occurred one digit at a time. In fact, all inter-register transfers went through the adder, which would actually add or not, depending on the operation. The digit-sequential design also reduced the complexity of the electronics, helping keep the size and cost of the system down.

Burroughs 205 Processor Data Flow

205 Instruction Format

Words holding executable instructions had the following format:


  • ± is the sign digit. If the low-order bit of this digit is 1 (negative), the contents of the B register are added to the AAAA field as it is transferred into the C register, thus indexing the operand address prior to instruction execution.
  • CCC are the so-called "control" digits. These had significance primarily for I/O instructions.
  • B is the breakpoint-skip digit. Bits in this digit would cause the computer to halt or skip the instruction based on switches set on the Control Console. It was used for program debugging.
  • PP is the operation code, e.g., 74 for Add or 20 for Change [Branch] Unconditionally.
  • AAAA is the operand address. In most instructions it referenced an address on the drum, but in some instructions, such as shifts, it signified a count, and in others it had no significance. B-register modification of this field would occur regardless of its significance.
With this background on the 205, we can now examine how the design of the 220 differed.

The Burroughs 220 Architecture

Both architecturally and electronically, the 220 was quite similar to the 205. It was still a decimal, vacuum-tube design. The 220 retained the 205's 11-digit word format, instruction format, and numeric representation. Core memory replaced the drum. Memory sizes could range from 2,000 to 10,000 words in 1,000-word increments. A typical memory size was 5,000 words. There being no drum, the clock was supplied by an oscillator circuit, and at 200 KHz, was only slightly faster than the 205's 142.8 KHz rate.

The Operational Characteristics manual describes the architecture, instruction set, and input/output devices in detail.

While the 220 retained the instruction word format of the 205 and many instructions worked similarly, the operation codes and control (now termed "variant") digits were entirely different. Programs for the 205 had to be rewritten to run on the 220. That said, the 220 was a much easier system to program than the 205, first because the design headache of blocking 20 words at a time into the high-speed loops was no longer an issue, and second because the 220 had a more powerful instruction set.

Burroughs 220 System

220 Registers and Data Flow

The 220 retained the same major registers as the 205, but with a few changes and additions:
  • The A and D registers were the same, except that now data both entered and left the processor through D.
  • The R register had the same function in arithmetic and shift operations, but was expanded to 11 digits to accommodate a sign digit. It no longer stored the return address for a subroutine call. R could also be loaded and stored directly from and to memory, and could be compared directly to a word in memory.
  • The C register was still 10 digits and held the op code and effective address, but not the address of the next instruction to be executed (see P below). It was a copy of the instruction word from memory, without the sign, and after any B-register indexing of the address field had been applied. Therefore, it also held the variant digits from the instruction word.
  • The B register was the same, although there were additional instructions for loading, incrementing, decrementing, and branching based on its value.
  • A new register, P, was implemented to hold the next-instruction address.
  • A new register, E, was implemented to hold the address of a word to be loaded from or stored to the core memory.
  • A new register, IB (input buffer), was implemented to hold the word loaded from or stored to memory. Words were generally transferred to and from this register and the D register, but not always.
  • A new register, S, was implemented to replace the function of the breakpoint digit in instruction words. You could set an address in this register and enable it with switches on the console. When execution reached that address, the processor would halt.
 All of these registers except for IB were shown on the 220's rather fancy console:

220 Control Console

The 44 bits of a word were transferred in parallel between the core memory and the IB register in a single five-microsecond clock cycle. Then the data moved between IB and D, and through the other registers, in the same digit-sequential fashion as in the 205. The adder was still only one digit wide, and most inter-register movement continued to be gated through it.

220 Processor Data Flow

This sequential operation and the only modestly-faster clock meant that the internal performance of the 220 was little better than that of the 205. Most of the system's speed increase came from eliminating the latency of the drum memory.

Retention of digit-sequential operation appears to have been a conscious decision, probably to reduce the cost of the system and allow it to come to market more quickly. An internal ElectroData memo from 5 June 1956, after the Burroughs acquisition was announced but before it closed, stated:
The basic configuration of the 220 will be the same as the 204, with the exception of the Core Memory. New circuitry in the 220 will be held to a minimum. Generally the plug-in [vacuum tube] packages will remain the same.

We do know that the concept of the 220 was not fully realized in the design that was finally built, at least in the mind of its chief logician, Edward (Ted) Glaser. In his 1973 oral biography, on file at the Smithsonian National Museum of American History, he had this to say (page 24):
It turns out that what the 220 was supposed to be was the 205, with the loops removed, the recirculating loops, and a small core memory put in. That was my assignment. And it became quite clear that with the speed of core that was then available, this couldn't be done and it was a ridiculous balance. There was no way I could prove that to anyone. So I did one of the lowest-down, dirtiest tricks I've ever done. I designed five machines before I told anybody about any of them. And each one had glaring faults. And I was able to escalate Burroughs from one machine to the next. Well, it turned out we got stuck on four. We didn't build five. So four -- 220 was the fourth in the series. It was not the machine I really wanted to build, but it was close to it.
At that point, Glaser appeared to start to talk about the fifth design, but the interview was interrupted, and it never came back to this topic. So we'll never know what he had in mind for his fifth design, or the ways in which he thought the 220 as built was still deficient.

220 Partial-Word Operations

Aside from the advantages that retaining the digit-sequential operation had for cost and time to market, the most beneficial was that it enabled a very useful new feature -- operation on partial words.

The digits in a 220 word are numbered as follows:

±   1 2 3 4   5 6   7 8 9 0

where ± is the sign digit and "0" is interpreted as 10. A "partial word" is a contiguous field of digits denoted by a two-digit pair, referred to as sL (start-Length). The first digit of the pair specifies the starting digit in the field; the second digit specifies the field length. The field therefore consists of the digit numbered s and the L-1 digits to its left. A length of zero is also interpreted as 10.

Thus, the operation code in a word can be referred to as the 62-field -- digit #6 and the digit to its left. The low-order four digits comprising the operand address are the 04-field. The entire numeric portion of the word (all but the sign) is the 00-field. The sign could not be the starting digit of a field, but in most contexts could be included in a partial-word field. Field overflow was not permitted, however -- the sign digit was as far to the left as a field could go. If an sL digit pair had L > s+1, the processor would halt with a Program Check alarm. An sL = 12 field is valid; a 13 field is not.

Partial-word operations were implemented as variations on shifting. Like the 205, the 220 could physically shift or rotate its registers only to the right. Left shifts were actually rotates to the right for the complementary number of digits. In a partial-word operation, the word is simply rotated to the right until the s digit is in the low-order or "0" position, from which it can be transferred to and from the adder. Then it and the next L digits are operated upon (add, transfer, etc.), and finally any remaining digits are again simply rotated until the word is back in its original position. This behavior was mechanized using a special register, DC, the Digit Counter.

Partial-word fields were supported in the following operations:
  • Branch on equality of a two-digit value compared circularly to a field in register A or R.
  • Compare a field in A or R to a corresponding field in a memory word and set the Compare Indicator, from which branches could be made based on a low, equal, or high conditions.
  • Increase or decrease a field in a memory word by a four-digit value, leaving the other digits in the word unaffected.
  • Decrease a field in a memory word by a four-digit value (possibly zero), then load the high-order digits of the resulting value in the field into the high-order digits of B.
  • Store a field from A, R, or B to a corresponding field in a memory word, leaving the other digits in the memory word unaffected.
Being able to operate intrinsically on partial words often eliminated additional instructions that would otherwise be needed to shift and extract digits before manipulating or storing them. This increased the efficiency of both execution and memory usage, partially offsetting the overhead of digit-sequential processing.

The partial-word facilities also provided convenient and efficient ways to modify operand addresses, variant digits, and even op codes in instruction words. A common trick was to implement a switch by adding or subtracting 29 to the 62-field in a word -- toggling the op code between 01 (No-Op) and 30 (Branch Unconditional). Partial-word facilities could also be used to pack multiple values into a single word, allowing programs to maintain larger numbers of values -- a not-insignificant advantage in the days of severely-constrained memory sizes. This was particularly important in the implementation of the Burroughs Algebraic Compiler (BALGOL), which will be discussed in future blog posts.

Additional 220 Enhancements

There were several other notable enhancements to the 220 instruction set over the 205:
  • The A, R, and B registers could be cleared to zero individually or in any combination with one instruction. The 205 could clear only A and R, and A only as a side effect of a store.
  • R could be loaded and stored directly without affecting A. The 205 could load and store R only by shifting values between A and R. This improved the setup for integer multiply/divide operations and allowed R to be used more conveniently as temporary storage. Stores could be of a partial word.
  • Similarly B could be loaded and stored directly, either as a whole or partial word.
  • A word in memory could be cleared to zero with a single instruction.
  • Instructions to compare A and R to a whole or partial word in memory replaced the 205's awkward branch-on-sign-difference mechanism. 
  • The sign digit of the A register could be tested against or replaced by a literal digit value.
  • Loop control was considerably enhanced. The 220 supported the 205's ability to decrement B and branch if it had not underflowed below 0000, but the decrement value could be 0000-9999 instead of just 1. Similarly, there was an instruction to increment B by 0000-9999 and branch if the result had not overflowed past 9999.
  • Looping could also be controlled using a new element, the Repeat Toggle. This was set by instructions that could increment or decrement a partial-word field in a memory word by 00-99. The toggle was set if the operation did not overflow or underflow, respectively, the designated field. The Branch Repeat instruction would branch or not based on the state of this toggle.
  • Subroutine call was supported on the 220 by the Store P instruction, which would store the value of the instruction counter, plus one, into the low-order four digits of the word at the instruction's operand address. This effectively stored a return address, and would typically be followed by a branch instruction to enter the subroutine. Compare this to the 205's Change-and-Record instructions, which stored the return address in the upper four digits of the R register.
  • The 220 supported a full complement of shift operations. Right shifts discarded digits off the low end of the register; left shifts acted as rotates (and as mentioned above, were implemented internally as right rotates). Register A could be shifted left or right, with or without participation by the sign digits. A and R could be shifted together left or right for up to 20 digits. The sign digit did not participate in the twin-register shifts, but it did get copied from A to R or R to A, depending on the direction of the shift.
  • The 220 had a Record Transfer instruction that could move from one to 100 words between locations in memory. The instruction's operand address specified the starting source location; the B register specified the destination address. At the end, B had been incremented to the address after the last word transferred, allowing multiple consecutive transfers to move larger blocks of data.
  • On the 205, overflow in an arithmetic instruction caused the processor to halt if the next instruction was not a conditional branch. On the 220, this behavior could be turned on or off programmatically. There was also an explicit Branch on Overflow instruction to test for overflow and reset the indicator if the branch was taken.
  • The 220 had an explicit No-Op (no operation) instruction that did nothing. On the 205, any unassigned op-code was treated as a No-Op. On the 220, executing an unassigned op-code caused the processor to halt with a Program Check Alarm.

220 Input-Output Facilities

 The 220 supported the same types of input-output devices as the 205, although the I/O instructions and the capabilities of the devices in some cases were quite different.

Console Devices

The so-called console devices included teletype printers, paper-tape readers, and paper-tape punches. A system could support up to 10 readers, up to ten teletypes or punches in any combination, and one special teletype termed the Supervisory Printer, or SPO (pronounced "spoh"). Teletype units could be substituted for punches and vice versa.

The 205's Frieden Flexowriter was replaced by a Teletype Model 28 RO (Receive Only) teleprinter, which operated at 10 characters per second. The Model 28 had the advantage that it used a moving type box to position characters on a line rather than the Flexowriter's moving typewriter carriage. The SPO was identical to the other teletype devices and was interchangeable with them. The SPO was addressed by a different instruction in the processor, however, and that instruction had an automatic decimal-point insertion feature that instructions for the other teletypes did not.

ElectroData designed and built their own high-speed paper-tape equipment, using 7/8-inch (22mm) tape and a seven-level code compatible with the IBM Eight-Channel Tape Code for the Type 046 Tape to Card Converter (a variant of the 026 keypunch). The reader was photoelectric and operated at a switch-selectable 500 or 1000 characters per second. It could stop in the space of a single character. The punch operated at 60 characters per second.

As an option, each teletype device could be configured with a paper tape reader, but this could be used only for off-line printing of paper tapes.

The paper-tape punches, teletypes, and SPO were connected to the processor in a daisy-chain. Switches in the device cabinets designated their unit number, 1-10, or SPO. The first device in the chain with a designation that matched that of an output instruction received and acted upon the data. The teletypes and SPO could each respond to multiple unit numbers, so it was possible for the software to address multiple devices but have the output come out on a smaller number of them.

On the 205, words were output to console devices one at a time from the A register. They were formatted as numeric or alphanumeric based on the values of address digits in the instruction words. On the 220, a single instruction could output up to 100 words directly from memory. The 2-bit in the sign digit of a word indicated whether it should be formatted as numeric (0) or alphanumeric (1). Thus, words with signs of 2 were considered to hold five alphanumeric characters, and words of mixed numeric and alphanumeric data could be output by the same instruction. This alpha vs. numeric behavior was effective only for the console devices, however.

Punched Card Devices

The Cardatron punched-card subsystem originally developed for the 205 was adapted for the 220 with very little change. The Cardatron was an interface between the 220 and IBM punched-card equipment. It translated between card or printer columns on the IBM machines and words in the 220 memory, It provided low-level editing of the data during transfer by means of digit strings termed "format bands" that were loaded to the interface from the 220's memory. It could also buffer one card or line image per IBM machine, allowing the processor to continue running while the machine operated on that card or line.

A 220 could have one Cardatron subsystem, which could support up to seven IBM machines. The Type 087 collator was used as a 240-card/minute reader, the Type 523 could be used as a 100-card/minute reader or punch (but not both at the same time), and the Type 407 was used as a 150-line/minute printer.

 The only significant differences between the Cardatron for the 205 and the 220 were (a) the ability to specify a sign of 2 for a word during input translation from a card reader, and (b) data transfer between the 220 memory and Cardatron units went from high to low addresses. That is, Cardatron instructions specified the ending address of the buffer area in memory. The Cardatron processed card and print columns backwards, so the descending addressing allowed the data in memory to be in ascending order. On the 205, data transfer was in ascending sequence, which meant that the words in memory were reversed with respect to their order on the card or print line.

For more information on the Cardatron and its features, see these two posts on this blog for the 205 version:

Magnetic Tape Devices

The 220 took the basic idea of the 205's magnetic tape subsystem and significantly extended it. The 205 used 3/4-inch (19mm) plastic tape on 2500-foot (762m) reels. The tape had two parallel lanes that each stored digit-wide frames at 100 bits/inch and moved at 60 inches/second. Only one lane could be read or written at a time, but the lane was programmatically selectable. Data was recorded in fixed-length blocks of 20 words each. Individual blocks could be overwritten in place. A full reel of tape could hold 10,000 blocks. Each block was assigned a four-digit number. The tape subsystem could search bidirectionally for a specified block number independently and asynchronously from the processor.

The 220 enhanced the 205 tape implementation in the following ways:
  • The tape was physically identical, but came on 3500-foot (1067m) reels.
  • The tape still had two parallel lanes, but data was recorded at 208.33 bits/inch and the tape moved at 120 inches/second -- twice as dense and twice as fast.
  • Blocks could be 10-100 words in length, and a tape could have mixed block sizes. Blocks could still be overwritten in place, but once a block of a certain size was written on tape, that size could not be changed. The entire tape had to be erased (a process known as "editing") and rewritten in order to change block sizes.
  • Up to ten blocks could be read or written in a single instruction. Variants of the instructions allowed reading and writing sequences of mixed block sizes. Tapes could also be spaced forward or backward up to ten blocks at a time.
  • Instead of block numbers, the first word in each block was considered to be its identifier or "keyword". The tape subsystem could search a tape bidirectionally and asynchronously for the block with the highest keyword that was less than or equal to a specified value. Optionally, the match could be made on a partial-word field. The search stopped with the tape positioned to read the matching block. In order for the search to work as intended, the keywords had to be unique and written in ascending order, but it was not a requirement that tapes be created in that particular fashion.
  • In addition to the bidirectional keyword search, the tape subsystem could search asynchronously in the forward direction for a block where a specified word in the first ten of the block matched a value. This was termed a "scan." This match could also be made on a partial-word field. As with search, a scan stopped with the tape positioned to read the matching block.
  • Instead of a simple end-of-file mark like the IBM tapes had, the 220 could write and sense two types of special blocks on the tape:
    1. An End-of-Tape block was typically written at the end of information on a tape, but could be placed anywhere. It contained a single control word (although the block physically occupied ten words on tape) that held two addresses:
      • The 64-field of that control word was the address of a word in memory where the values of the P register and 04-field of the C register would be stored.
      • The 04-field of the control word was the address to which the processor would branch when the block was read.
      An EOT block would terminate a search or scan operation, but the control word would not be acted upon until the block was actually read.
    2. A Control Block was one where the sign digit of its keyword was 7. When this type of block was read, all words in the block except the last were transferred to memory. The last word was not transferred to memory, but was acted on in the same way as the control word in an EOT block, storing fields from the P and C registers and branching to a specified address.
Like the 205, the 220 had two types of tape drives. The first was a traditional free-standing, reel-to-reel unit. It had an unusual design -- waist-high, with a slanted tape deck that held the reels, heads, and vacuum columns, as shown in this image from the Georgia Tech installation:

220 Free-standing Tape Drives
The second type of tape drive was the DataFile, a pre-disk design intended to provide faster random access to large data sets. It was another waist-high cabinet, but with 50 parallel bins instead of reels, each bin holding a 270-foot tape strip. The strips looped over a set of rods that supported a head assembly on a moving carriage.

220 DataFile Tape Unit
The carriage could position to any one of the tape strips within 1.5 seconds and drive that strip forward or backward. Except for their shorter length, the tapes, recording characteristics, and tape operations were identical to those for the free-standing units. While tape in the free-standing drives had two lanes, the DataFile acted as if it had a tape with 100 lanes. The tapes were not removable except for maintenance, so the DataFile was used more like a disk drive than a tape unit, albeit one with a large access latency.

This image shows the interior of a 205 DataFile, but the internal mechanism for the 220 was similar:

205 DataFile Showing Tape Strips

High-Speed Printers

Burroughs developed an early electrostatic printer named the Whippet that could output about 250 characters per second. It used coated paper and heat to fix a dry ink to the surface of the paper. It was originally developed for the U.S. Army Signal Corps, but at least a few units were installed at non-government sites, including one at CalTech. It could substitute for one of the teletypes.

Burroughs also developed its own high-speed drum printer for the 220, a design that continued to be refined and used into the 1970s with the B-series computer systems. The printer for the 220 could be used in two ways. It could be attached directly to the system and driven programmatically. It could also be used off-line, with a separate control unit that read data from a free-standing tape drive. The control unit was configured through a plugboard and was designed with the intent that data could be formatted and printed directly from master files, avoiding tying up the rest of the system for long print runs. The idea of off-line printing did not survive beyond the 220 era, however.

The retro-220 Project

I started a project in late 2016, having as its main goal the development of a web-browser based emulator for the 220, along the same lines as the retro-205 project for the 205. The emulator is written in Javascript and uses standard HTML and CSS to create a user interface within the browser. Due to the similarities between elements of the 205 and 220, I was able to port quite a bit of the code for the 205 emulator over to the 220.

Like my earlier emulation projects, the retro-220 emulator is open source and available for anyone to use. The emulator itself and all but the magnetic tape implementation were substantially complete by late Spring 2017. The 220 tape subsystem proved to be a significant challenge, and was not completed until late 2017. At this writing, the emulator is at version 0.06, and is complete except for the DataFile and high-speed drum printer devices.

This screen shot shows the emulator in action:

retro-220 Emulator Running from the Firefox Browser

The second goal for this project has been reconstruction of the Burroughs Algebraic Compiler for the 220, better known as BALGOL. This was an early implementation of the Algol-58 language, the predecessor to the better-known Algol-60. The compiler was first released in March 1960. Donald Knuth, who participated in its development, has donated a scan of the compiler and library listing to the Computer History Museum in Mountain View, California. An earlier, less complete listing is also available on bitsavers.org.

I have used Professor Knuth's listing to transcribe the assembly language source code for the compiler and its library, and have written two cross-assemblers to generate 220 object code from those transcriptions. The reconstructed compiler is now starting to work and is presently undergoing further testing and validation. The current state of its development can be found in the retro-220 project repository.

At this point, I am starting to work on a wiki for the emulator and compiler that will be available on the repository web site. I am also preparing additional blog posts on the project. The next one will be an introduction to the emulator and its user interface, so stay tuned.

Sunday, November 26, 2017

Emulator Timing Improvements

Version 1.00 of the retro-205 emulator was released at the end of March this year. The blog post accompanying that release described most of the technical changes in the release, then promised a second post on the subject of emulator timing.

This is Paul Kimpel with that long overdue second post for Tom's 205 and 220 blog.

The Significance of Timing

One of the design goals of the retro-205 emulator has been to run programs as close as possible to the real speed of a 205. Modern browsers can run the emulator much faster than that, however, so the speed must be slowed, or throttled, to meet that goal. This post describes how that is accomplished in the emulator, along with the improvements to timing accuracy that were introduced in version 1.00.

The effort to improve timing accuracy started early this year when Tom was experimenting with some prime-number programs and noticed they were running 20-30% faster in the emulator than he suspected they should. That set off a series of studies and experiments on both our parts to determine (a) how fast his programs theoretically should run, and (b) why the emulator was running them significantly faster.

Before getting into that, though, I first need to describe how the emulator regulates the speed with which it runs 205 programs.

Controlling Speed of Execution in a Web Browser

Most people by now know that a web page is constructed using a textual notation known as Hypertext Mark-up Language (HTML). That notation defines the elements we see on the page -- blocks of text, images, form fields, etc. You can think of a "web application" as a set of web pages with a dynamic component. Inside the browser is a programming language, Javascript, and an API, the Document Object Model (DOM), that allows Javascript code to access and manipulate the elements on a page.

The Single-Threaded Nature of Browser Applications

Javascript and the DOM were originally designed to enhance user interfaces by reacting to user-generated events -- mouse clicks, key presses, and the like. Based on its interpretation of those events, Javascript code can use the DOM to access and alter the elements on the page. People quickly realized that this capability could do a lot more than just enhance a static web page -- it could be used to build robust and sophisticated browser-based applications that do significant internal processing in their own right.

As a consequence, the facilities and techniques for building such applications have improved dramatically over the past 20 years. What started simply as a way to create an interface to an application running in a web server has in many cases become the application itself. The fact that we can now use a standard web browser to, say, model and emulate a computer system like the 205, let alone implement something as polished as Google Docs, is impressive, if not downright astonishing.

Browser-based applications still retain their event-driven nature, however. They are generally not structured the way that most off-line applications are. All processing for the set of windows making up an application (the emulator home page and all its sub-windows in our case) is strictly single-threaded. Only one thing runs at a time in that application -- there is no multi-threading or parallel execution. This is an intentional feature of Javascript, to simplify the programming model and avoid the need to coordinate multiple parallel threads of execution.

Thus, most browser applications are structured as a set of functions that are called one at a time from the browser's Javascript run-time controller. There is generally one global block of code that runs when the main page of an application first loads. That code links functions to handle events that will occur when the user interacts with the elements of the page. The code can also schedule functions to execute after a specified period of time. The global code then exits back to the Javascript run-time, leaving the functions it has linked to run later as events occur.

To maintain its single-threaded behavior, the Javascript run-time maintains a queue both of events that have occurred but not yet been serviced and of scheduled activities that have not yet come due. Each of these events and scheduled activities has a handler function associated with it, and as it rises to the top of the queue, the run-time calls that function.. These handler functions will do appropriate processing for their events, and may of course link or schedule still other functions for future execution.

Once a handler function exits back to the run-time, the run-time then figures out what to do next. That may be to pluck another event out of the queue or do some of the browser's background processing. It may be there's nothing to do at the moment, in which case the run-time will wait for the next scheduled function to come due or for some external event to occur, such as a click in the user interface.

Another thing about Javascript functions in a browser is that you can't run them continuously. Because there is only one thread of execution, a function that runs continuously will block other functions from running, which would prevent any user events from being serviced or any scheduled activities from taking place. To prevent that, most browsers will allow a function called from the Javascript run-time to execute for 15-30 seconds, and if the function hasn't given up control (i.e., exited back into the run-time) in that amount of time, the run-time will assume that the function is caught in an endless loop. The browser will pop up an alert and ask the user whether they want to continue executing, abort application, or bring up the debugger. Thus, a web application needs to be designed so that it returns to the run-time controller on a regular basis in order to avoid the run-time declaring it scriptum non gratum.

You can think of the emulator as a program that runs in loop, iteratively simulating the execution of 205 instructions. Internally, though, the emulator is not structured around a loop. Instead, it's constructed as a set of functions that are called from the run-time, execute for a bit to accomplish some piece of work, perhaps schedule themselves or another function to run later, and then finally exit back into the run-time. The iterative behavior comes mostly from the run-time repetitively servicing its event queue, and from the handler functions continuing to schedule other functions that will run at some point in the future.

Regulating the Speed of Emulation

Throttling the performance of an application in a web browser is a bit tricky. The way that you do that is to introduce delays into a script's execution. The way that you introduce a delay is by scheduling some function to run after the desired delay, and then exiting back into the run-time. The application will then sleep until the next item in the run-time's queue is selected.

Functions are scheduled to run in the future by calling the Javascript setTimeout() intrinsic function, passing it a number of milliseconds to delay and the function to call after that delay has elapsed. The actual delay will be at least the amount of time you specify, because the run-time only runs one thing at a time, and it may be that some other function or background task is already running when the delay expires. The scheduled function must wait for any currently-executing and queued activities to be completed first. Thus, the run-time will call the function you passed to setTimeout() after the specified delay and after nothing queued ahead of it is still pending.

Therefore, the trick to throttling emulator performance down to an appropriate speed is to break up the design into functions that can be scheduled after appropriate delays, and to keep rescheduling those functions in order to keep the emulation running.

Fortunately in the case of the 205, we have delays happening all the time due to the relatively slow drum memory, so there are plenty of opportunities to give up control and exit back into the run time. This also nicely prevents the emulator from hogging the single Javascript thread and having the run-time declare it non gratum.

Significance of the Memory Drum to Instruction Timing

To understand how drum timing interacts with performance throttling, let us first review the drum memory and its timing:
  • The drum rotates at 3570 RPM. It contains 24 tracks of 200 words each. An 11-digit word on the drum is the smallest unit of data that can be addressed, so it is also referred to as a sector.
  • At 3570 RPM, the drum completes a revolution in 16.8 milliseconds. Since there are 200 words on a track, a word passes under the read/write head once every 84 microseconds. This unit is referred to as a word-time.
  • 20 of the tracks comprise the main memory, with word/sector addresses 0 to 3999. There is a combined read/write head for each track.
  • The other four tracks comprise the high-speed loops, or just the loops, and effectively store 20 words each. These tracks are addressed differently than main memory, however -- each loop is addressed modulo 20 within a 1000-word address range. Thus, the tracks are commonly known as the 4000 loop, the 5000 loop, the 6000 loop, and the 7000 loop. Addresses 6018, 6278, and 6538 are all in the 6000 loop, but all of them identify the same sector on the drum because all those addresses are congruent modulo 20.
  • Instead of having a combined read/write head, the four loop tracks have separate read and write heads, placed 36 degrees apart, so that words written by the write head can be read by the read head 0.1 revolution (1.68ms) later. Normally the words read by the read head are fed back (“looped”) immediately to the write head, with the result that the words are duplicated 10 times around the drum. Hence, while each high-speed loop can store only one-tenth the words of a main-memory track, the drum can access those words 10 times faster than main memory.
  • When the processor requests a memory access, it must wait for the drum to rotate until the desired word is under the head. This delay is termed the latency. On average, a main memory request must wait for one-half a revolution, or 8.4ms. Accesses to the loops, however, must wait on average only one-half of 0.1 revolution, or 0.84ms. Well-written 205 programs tried as much as possible to execute instructions from one of the loops, and to keep frequently-referenced variables in a loop as well.
The timing of instruction execution for the 205 is complicated because of the drum. Each instruction requires at least one memory access -- to fetch the instruction for execution. For instructions that use an operand, a second memory access is needed to read or write the operand. For all but a few instructions, the drum latency dominates the overall execution time -- most instructions will finish before the drum rotates to the location of the next instruction to be executed, even when running from one of the high-speed loops.

So the big question when attempting to determine instruction timing and throttling delays is, "where's the drum at this point in time?"

Design of the Performance Throttling Mechanism

When designing the emulator in late 2014, I decided to implement the drum internally as arrays of words in memory. Then I looked at the drum timing and decided that the latency delays were generally so large, I could combine the delays due to drum latency with the delays needed to throttle performance, and do them together within the memory access mechanism. The way I chose to do this is by maintaining two clocks:
  • The first clock maintains internal, emulated processor time. It counts word-times as a result of instruction fetch and execution. We have reasonably complete information on how the instructions were mechanized inside the processor, so the emulator keeps track of the word-times each instruction requires as it executes. Under the assumption that the emulator will run much faster internally than a real 205 would, this clock should run fast with respect to real-world time.
  • The second clock is the drum -- or the drum is the clock -- whichever way you want to think about it. This clock is real-world time. The drum rotates at a constant rate, so if we convert milliseconds of real time to word-times by dividing by 0.084 milliseconds/word-time and call that drum-time, then "(drum-time) mod (tracksize)" can be taken as the track-relative sector number under the head at that point in real time. Tracksize will be 200 or 20, depending upon whether the emulator is accessing main memory or one of the loops.
Knowing the track-relative sector where the head is now, and knowing the track-relative sector of the address being sought, we can compute the drum latency for an individual memory access. We know the number of words we are going to read or write (either 1 for normal accesses or 20 for block transfers), so we also know the track-relative sector where the memory access will finish, and therefore how many word-times (latency + word count) should be added to the internal processor clock to account for the memory access.

Since we hope the emulation is running faster than real time, we can compute the total real-world time delay for a memory access as:

delay = (internal time) - (drum-time) + (latency) + (word-count)

where "(internal-time) - (drum-time)" is the difference between emulated time and real-world time -- the amount that the emulation is running too fast. We need delay by that amount to allow real-world time to catch up. "(latency) + (word count)" is the amount of real-world time the memory access takes.

Thus, the emulator operates by running unthrottled between memory accesses, then sleeping for a time that combines the delays for drum rotation and performance throttling. It provides a call-back function to the Javascript run-time to continue operation after the delay completes.

You can see how this works in the readMemory() and startMemoryTiming() routines of the emulator’s Processor object. First, readMemory() determines the type of memory access that will take place:
D205Processor.prototype.readMemory = function readMemory(successor) {
    /* Initiates a read of the memory drum at the BCD address 
    specified by C3-C6. The memory word is placed in the D register, 
    and an appropriate delay for drum latency, the successor function 
    is called. Sets the memory access toggles as necessary.
    Note: the D register SHOULD be loaded after the latency delay, 
    but we do it here so that the word will show on the panels during 
    the drum latency */
    var addr;               // binary target address, mod 8000
    var cbCategory;         // setCallback delay-averaging category
    var trackSize;          // words/track in target band of drum

    addr = D205Processor.bcdBinary(this.CADDR % 0x8000);
    this.memACCESS = 1;
    if (addr < 4000) {
        trackSize = D205Processor.trackSize;
        cbCategory = "MEMM";
        this.memMAIN = this.memLM = 1;
        this.D = this.MM[addr];
    } else {
        trackSize = D205Processor.loopSize;
        cbCategory = "MEML";
        if (addr < 5000) {
            this.memL4 = 1;
            this.D = this.L4[addr%trackSize];
        } else if (addr < 6000) {
            this.memL5 = 1;
            this.D = this.L5[addr%trackSize];
        } else if (addr < 7000) {
            this.memL6 = 1;
            this.D = this.L6[addr%trackSize];
        } else {
            this.memL7 = 1;
            this.D = this.L7[addr%trackSize];

    this.startMemoryTiming(addr, trackSize, 1, cbCategory, successor, 
The property this.CADDR is the address field of the C register, which specifies the memory address to be accessed. The this.memXXX properties represent the memory control toggles shown on the right side of the Supervisory Panel. These are set only for display purposes. cbCategory is a short string that identifies the category of delay we require and is used with the setCallback() function discussed below. The successor parameter is a function that is to be called once the memory access finishes -- it's often the part of the Fetch mechanism that loads an instruction into C or the part of the Execute mechanism that actually executes an instruction after any operand word has been loaded from memory.

Based on the value of the address, the routine selects either the array of words for main memory (this.MM) or the array for one of the loops (this.L4, this.L5, etc.) and sets the appropriate memory control toggles so that they will display on the Supervisory Panel. It also chooses the appropriate track size -- D205Processor.trackSize=200 or D205Processor.loopSize=20 -- and loads the desired word into the Processor's D register. "%" is the Javascript modulo (remainder division) operator.

The this.readMemoryFinish parameter is a small function that will be called after the delay for memory access completes. It does some housekeeping and calls the actual successor function.

The startMemoryTiming() function is common to all of the memory access routines. It handles the chore of computing the necessary latency and delay, updating the internal processor clock, calling this.updateLampGlow() to compute average lamp intensities, and finally initiating the delay by calling setCallback():
D205Processor.prototype.startMemoryTiming = 
        function startMemoryTiming(addr, trackSize, words, 
                   cbCategory, successor, finish, finishParam) {
    /* Starts the necessary timers for the memory toggles to 
    aid in their display on the panels. Then delays the amount 
    of time required for drum latency, data transfer, and amount 
    necessary to bring the processor internal time back in sync 
    with real-world time */
    var latency = (addr%trackSize - 
                  (this.procTime + 
                      D205Processor.memSetupTime)%trackSize +

    this.memACTION = 1;
    this.memoryStartTime = this.procTime;

    this.procTime += latency + words + D205Processor.memSetupTime;
                     // emulated time at end of drum access
    this.successor = successor;
    this.scheduler = setCallback(cbCategory, this,
            this.procTime/D205Processor.wordsPerMilli - 
            finish, finishParam);
addr%trackSize is the track-relative sector number of the desired memory address. this.procTime is the internal processor clock, maintained in units of word-times. Taking the value of the clock modulo trackSize gives the track-relative sector number where the emulated processor thinks the drum is now. B205Processor.memSetupTime is the number of word-times necessary to set up and initiate the memory access internally, so adding that to the internal clock indicates where the drum will be when the processor actually begins looking for the desired sector.

Taking the difference in sector numbers between where the drum is now and the desired address gives us the latency. Depending on the relative positions of the two sectors, however, that difference could be negative, so I add the value of trackSize to make sure the difference is positive, then mod that sum by trackSize in case it was positive to begin with, thus keeping the result between 0 and trackSize-1.

this.updateLampGlow() updates the calculation of lamp intensities for the display panels so that they exhibit a somewhat realistic partial brightness. Based on current register settings, it computes a time-weighted exponential running average of the 0/1 state for each bit in each displayed register. That calculation will be completed by this.readMemoryFinish at the end of the memory access.

The routine updates this.procTime with the latency, memory access setup time, and word count, which gives us the internal clock time (in word-times) when the access should finish. It also stashes a reference to the successor function in a global property (this.successor) for use by the finish function (which is this.readMemoryFinish when called by readMemory) after the delay is complete. Finally it calls setCallback(), discussed in more detail below, which will call setTimeout() to initiate the necessary delay.

That delay should be the difference between the value of the internal clock when the emulated memory access starts, and the value of real-world time when the physical memory access finishes. Since the emulator models the drum internally as a set of memory arrays, the physical time to load the desired word into D is negligible, and in fact was already done in readMemory(). Thus the delay is simply the difference between the internal clock and real-world time. Dividing by the value D205Processor.wordsPerMilli converts word-times to milliseconds, the unit required by setCallback() and setTimeout().

Javascript has a Date object from which you can get real-world time, but instantiating that object every time you need to have the current time results in a lot of overhead. There is a special timer designed for performance measurement, though, performance.now(), that's both efficient and more granular than the Date object, so I use that for all timing within the emulator. It returns a floating-point number in milliseconds.

Managing the Javascript Delay Mechanism

I have gone through this rather long-winded explanation of memory access and timing partly to give you an idea of what's involved in regulating performance in the emulator, but also to talk about the role of setCallback().

The setCallback() function is something I originally developed for the retro-B5500 emulator. Its main purpose is to compensate for two nasty characteristics of the standard Javascript setTimeout() function:
  1. As discussed above, the delay you get using setTimeout() is always at least as long as the one you requested, and may be much longer, depending what else is happening in the browser at the time.
  2. The minimum amount of time you can delay in modern web browsers is about 4ms. If you specify a smaller delay than that, it is treated as if you had specified 4ms.
The reason for the 4ms limit is that there are a number of older, poorly-behaving web scripts out there (many of them games) that are designed to run as fast as they can with no delays at all if they can get away with it. These scripts still need to exit back to the Javascript run-time periodically, however, and some years ago you could specify a delay of zero for setTimeout() and actually give up control for a minimal amount of time. In operating systems, this brief release of control is termed a yield.

The problem with this yield technique is that it tends to suck all the juice out of laptop and mobile device batteries in short order. So the browser developers instituted minimum delays (originally on the order of 10ms) to keep users from blaming the browsers for their dead batteries. The web standards now require the minimum delay a browser must support to be no greater than 4ms, and as a result, all the major browsers now use 4ms as a minimum.
That standard 4ms minimum on delay time is not an issue for timing access to main memory, which averages 8.4ms. For loop access, though, which averages 0.84ms, the 4ms limit is a serious problem. The emulator often needs to delay for fractions of a millisecond, but there is no way setTimeout() will do that.

I had a similar problem, although for a different reason, when working on the B5500 emulator, so in 2013 went looking for a solution. It turns out that there is no finer-grained substitute for setTimeout(), but there are a couple of ways that you can give up control and have the run-time call a function you provide as soon as there is nothing else ahead of you in the queue, which is exactly what people were trying to do by passing a zero delay to setTimeout().

Browsers have a mechanism that can be used to pass messages asynchronously between windows, implemented using the postMessage() function for sending and the onMessage event for receiving. It also turns out that a window can send a message to itself, which is something I would never have thought of doing, but this is why we have Google Search -- to discover things we can't think up ourselves. Passing a message to yourself effectively produces a yield, and the event handler you supply for onMessage is simply the call-back function for that yield.

So having the setTimeout() mechanism that will delay for at least 4ms, and the postMessage/onMessage one that will delay for a hopefully very small, but unpredictable amount -- how do we generate a delay of 0.84ms?

The answer is that in individual cases we don't. Instead, we play the averages.

My setCallback() function has the same overall purpose as setTimeout() -- you call it passing a requested delay in milliseconds and a function to be called after that delay expires. If the requested delay is more than 4ms, setCallback() calls setTimeout(), which will always result in a delay that's at least a little longer than we asked for. If the delay is less than 4ms, setCallback() will call postMessage(), which should usually result in a delay that is shorter than we asked for. setCallback() passes its own call-back function to both of these mechanisms as a proxy, and when either the delay expires or the message event occurs, that proxy function measures the amount of time that actually elapsed. The proxy then executes the original call-back function passed to setCallback().

Subtracting actual elapsed time from the delay originally requested gives a deviation -- how too much ahead or behind schedule the delay completed. setCallback() keeps a running total of those positive and negative deviations, and on the next call to it, attempts to apply a portion of any accumulated deviation to the new requested delay. If the accumulated deviation is positive (actual delays are running longer than requested), it attempts to reduce the new requested delay. If the accumulated deviation is negative (actual delays are running shorter than requested), it attempts to increase the new requested delay. Then that adjusted delay is compared to 4ms and used to decide whether to call setTimeout() or postMessage(). In this way, the mechanism compensates for deviations in the individual delays, and the deviations tend to average out. Or that's the hope, anyway.

There is one more twist to this mechanism. I have been discussing delays in the context of timing memory access, but the emulator has other areas where it needs to introduce a delay, primarily in I/O. Because the different areas where the throttling mechanism is used have different delay lengths and patterns, I didn't want to lump them all together for the purpose of adjusting deviations. Thus the delays are divided into categories, and a separate cumulative deviation is maintained for each one. The cbCategory variable mentioned back in the discussion of readMemory() is a short string that identifies the category on a given setCallback() call. "MEMM" refers to the category for main memory accesses and "MEML" refers to the category for loop memory accesses. Each I/O device has its own category as well.

With this background on emulator timing and performance throttling, we can now discuss the problems the retro-205 emulator was having with its timing and what we did about it.

Analyzing and Correcting the Timing

When Tom pointed out that his programs were running faster than expected, there were three areas I thought could be prime suspects for the timing disparity. One was the way the emulator was accumulating internal word-time counts for instructions as they were executed. Another was the accuracy of the delays being computed for memory access and performance throttling. In the back of my mind, I was also wondering if the setCallback() mechanism, and especially the deviation adjustment portion of that mechanism, was working the way it needed to. It turned out all three were contributing to the problem.

I completely reviewed the way the emulator was accumulating word-times for each instruction, comparing that to some timing notes that Prof. Donald Knuth had kindly sent me from his files a couple of years ago. I found lots of discrepancies, so adjusted the word-times for many instructions, often upward by three or more. This slows the emulation of those individual instructions.

Next, the drum latency calculations appeared to be correct, but I realized in researching the timing of memory accesses in TM 4001, the 205 computer training manual, that I had not been accounting for the necessary time the processor needs to set up its internal logic for the access. This amounts to 2-3 word-times per access, and after some experiments I settled on the value 2. This is reflected in the B205Processor.memSetupTime field mentioned above, and also slows the speed of emulation somewhat.

Finally, I turned to the setCallback() mechanism. You might think that an algorithm for adjusting requested delays based on deviations accumulated from prior delays would be straightforward, but it hasn't been for me. I've completely rewritten that adjustment algorithm at least four or five times over the years, and still had my doubts about it.

I was finding it difficult to analyze the algorithm's behavior from inspection of the code alone, so I modeled it in a spreadsheet, and found that the adjustment mechanism was not as stable and well-behaved as I had thought. In fact, it was embarrassingly easy to come up with patterns of delay that would cause the adjustment mechanism to go completely unstable. So I sat down and really thought through the problem this time, came up with a completely new approach to calculating adjustments for requested delays, and modeled that until I had it working smoothly. This change did not actually correct the emulator timing per se, but it did eliminate the runaway instabilities and make the emulator timing more consistent over longer periods.

After making these changes, Tom and I independently reran the programs we had analyzed earlier to obtain theoretical run times. We found that the emulator was now generally operating within 2-3% of those theoretical times. For an emulator that operates internally at the instruction level rather than the clock level, I am quite gratified that we got that close. Flushed with this success, we declared victory over the problem of emulator timing accuracy.

Then Tom reminded me that there was this other program...

The Remaining Timing Problem

In the back of Prof. Knuth's copy of the 1956 Burroughs 205 Central Computer Handbook he donated to the Computer History Museum, there is a strip of paper that has been pasted onto the page. The strip has four lines that appear to have been printed by an IBM 407 tabulator, which read:
777 FAST DRUM ZERO. USES 4000 LOOP. ZEROES 6000-6019 AND 0000-3999 IN 1.1 SEC.
This is a 205 machine-language program in card-load format that, as the first line indicates, clears the 205 main memory in 1.1 seconds. It turns out that Prof. Knuth himself wrote this program ca. 1960.

I had discovered this piece of code early in my research into the 205 before beginning work on the emulator. Largely to better understand how the 205 worked, I manually disassembled the code and did a detailed analysis of it. Since this was a program for which we had a reported execution time, Tom suggested we should see how long it actually takes to run in the emulator with its corrected timing.

The program runs so quickly that it's difficult to time it manually, but the number I came up with using the Firefox browser was 3.2 seconds, almost three times longer than its reported run time. So much for the emulator timing being accurate to within 2-3%. That surprisingly slow performance launched an effort that eventually took me more time and thought than the original timing analysis and corrections had.

The first question to resolve was whether the reported 1.1 second run time was realistic. As a basis for comparison, I constructed a test that cleared the 4000 words of 205 main memory in a straightforward way -- by storing individual words in a loop. The following code illustrates the technique:
0050  0 0000 02 3999      STC  3999    clear A
0051  0 0000 72 0050      SB   0050    set B to 3999
0052  0 0000 30 0053      CUB  0052    block and execute in 7000 loop
0053  1 0000 02 0000      STC  0000    zero word a B+0
0054  0 0000 22 7053      DB   7053    decrement B and loop if B >= 0
0055  0 9999 08 9999      STOP 9999    halt
This code takes about 69 seconds in the current retro-205 emulator under Firefox. That is because accessing successive locations in main memory requires a full drum rotation of 16.8ms between accesses. Just multiplying 4000 times 0.0168 is 67.2 seconds, so this test generates a reasonable result. Reducing the time of this straightforward approach by a factor of more than 60 would be quite an accomplishment.

Prof. Knuth’s program works by taking advantage of the block-transfer instructions in the 205. These instructions transfer 20 words to or from main memory and one of the loops in one drum revolution plus the latency necessary to reach the starting address in main memory. The program initializes the 6000 loop to zeroes, then does five successive BF6 (Block From 6000 Loop to Main Memory) instructions to clear 100 words, iterating and decrementing the B register to step through all 4000 words in 40 increments. The really clever thing the program does is to modify the successive BF6 addresses by 40 instead of 20, allowing the fetch and setup of the next BF6 to occur during the intervening rotation of the drum, a technique known as interleaving.

To confirm the reported time, I took the disassembled code for the program, and once again referring to Prof. Knuth's timing notes, attempted to determine not only the amount of time each instruction should be expected to take, but also how the drum position changed with each instruction, and therefore the amount of drum latency and transfer time to expect for each memory access the program did. That analysis showed that one iteration clearing 100 words should take a total of 298 word-times, or 25.04ms. 40 such iterations should therefore take about 1002ms. Adding the time to initialize the program and then the 6000 loop to zero, the total time should be very close to 1.1 seconds. Thus, I had to accept that the reported time for the program was correct.

The next question was then, why did the emulator take three times longer than that? I added a temporary trace to the emulator to record real-world times taken by the individual instructions as they executed in that program. This showed that the revised setCallback() mechanism was doing a superb job of regulating the adjusted delays, but every couple of hundred milliseconds there is a large deviation averaging about 100ms. I suspect that this was caused by Firefox inserting its background tasks into the run-time queue and taking precedence over the emulator's execution. To compensate for this, the setCallback() mechanism was adjusting delays for emulated instructions downward, so that they were executing in about half their theoretical execution time. That is probably about as fast as the emulator could go. Performance under Google Chrome is better than under Firefox, but only by about 10%.

I finally concluded that this was a case where the emulator simply could not adjust the delays low enough, and consequently could not keep up with real-world time. The design of the performance throttling mechanism was predicated not only on introducing throttling delays during drum latency, but also on using those delays to absorb the overhead of browser background activities. Alas, Prof. Knuth's coding of the program so efficiently minimized drum latency that there isn't enough latency left to absorb all of that overhead. Thus the emulation of the program ran too slowly.

The final question was what to do about this situation. One possibility would be to use an approach to throttling performance similar to the one I use with the retro-B5500 and retro-220 emulators. Those systems had core memories, with memory access latencies less than one-thousandth that of the 205. Throttling in those emulators works by allowing the processor to run unthrottled for a time slice equivalent to several milliseconds of emulated time, then comparing the internal clock to real-world time and pausing execution for the difference to obtain the appropriate average performance.

This approach is eminently feasible for retro-205, and would probably require at most several days of work to implement. If employed in the 205 emulator, it would serve to aggregate multiple drum latencies and per-access throttling delays into one larger delay. That would result in considerably lower overhead within the emulator for memory accesses, and may free up sufficient time for the browser overhead functions to execute without impacting the overall accuracy of emulator performance. Then again, it may not.

In the end, Tom and I decided to do nothing, at least for now. The current throttling mechanism appears to provide quite accurate timing for all of the other processor-intensive programs we have tested, and was defeated only by one very small, very carefully-crafted program that took maximum advantage of drum latencies overlapping instruction execution to minimize total run time.

Besides, clearing memory is something you normally do only once before loading a program for execution. Having to wait an addition two seconds for that is not exactly a crisis.