
Programming with Sets

(Content adapted from Critchlow & Eck)

On a computer, all data are represented, ultimately, as strings of zeros and ones. At times, computers need to work with sets. How can sets be represented as strings of zeros and ones?

A set is determined by its elements. Given a set A and an entity x, the fundamental question is, does x belong to A or not? If we know the answer to this question for each possible x, then we know the set. For a given x, the answer to the question, "Is x a member of A?" is either yes or no. The answer can be encoded by letting 1 stand for yes and 0 stand for no. The answer, then, is a single bit, that is, a value that can be either zero or one. To represent the set A as a string of zeros and ones, we could use one bit for each possible member of A. If a possible member x is in the set, then the corresponding bit has the value one. If x is not in the set, then the corresponding bit has the value zero.

Now, in cases where the number of possible elements of the set is very large or infinite, it is not practical to represent the set in this way. It would require too many bits, perhaps an infinite number. In such cases, some other representation for the set can be used. However, suppose we are only interested in subsets of some specified small set. Since this set plays the role of a universal set, let's call it U. To represent a subset of U, we just need one bit for each member of U. If the number of members of U is n, then a subset of U is represented by a string of n zeros and ones. Furthermore, every string of n zeros and ones determines a subset of U, namely that subset that contains exactly the elements of U that correspond to ones in the string. A string of n zeros and ones is called an n-bit binary number. So, we see that if U is a set with n elements, then the subsets of U correspond to n-bit binary numbers.

To make things more definite, let U be the set {0,1,2,...,31}. This set consists of the 32 integers between 0 and 31, inclusive. Then each subset of U can be represented by a 32-bit binary number. We use 32 bits because most computer languages can work directly with 32-bit numbers. For example, the programming languages Java, C, and C++ have a data type named int. A value of type int is a 32-bit binary number.[1] Before we get a definite correspondence between subsets of U and 32-bit numbers, we have to decide which bit in the number will correspond to each member of U. Following tradition, we assume that the bits are numbered from right to left. That is, the rightmost bit corresponds to the element 0 in U, the second bit from the right corresponds to 1, the third bit from the right to 2, and so on. For example, the 32-bit number

10000000000000000000001001110110

corresponds to the subset {1,2,4,5,6,9,31}. Since the leftmost bit of the number is 1, the number 31 is in the set; since the next bit is 0, the number 30 is not in the set; and so on.

From now on, I will write binary numbers with a subscript of 2 to avoid confusion with ordinary numbers. Furthermore, I will often leave out leading zeros. For example, 1101_2 is the binary number that would be written out in full as

00000000000000000000000000001101

and which corresponds to the set {0,2,3}. On the other hand, 1101 represents the ordinary number one thousand one hundred and one.

Even with this notation, it can be very annoying to write out long binary numbers, and almost impossible to read them. So binary numbers are almost never written out as sequences of zeros and ones in computer programs. An alternative is to use hexadecimal numbers. Hexadecimal numbers are written using the sixteen symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F. These symbols are known as the hexadecimal digits. Each hexadecimal digit corresponds to a 4-bit binary number, as shown in this table:

Hex.  Binary    Hex.  Binary
0     0000_2    8     1000_2
1     0001_2    9     1001_2
2     0010_2    A     1010_2
3     0011_2    B     1011_2
4     0100_2    C     1100_2
5     0101_2    D     1101_2
6     0110_2    E     1110_2
7     0111_2    F     1111_2

To represent a longer binary number, several hexadecimal digits can be strung together. For example, the hexadecimal number C7 represents the binary number 11000111_2. In Java and many related languages, a hexadecimal number is written with the prefix "0x". Thus, the hexadecimal number C7 would appear in the program as 0xC7. I will follow the same convention here. Any 32-bit binary number can be written using eight hexadecimal digits (or fewer if leading zeros are omitted). Thus, subsets of {0,1,2,...,31} correspond to 8-digit hexadecimal numbers. For example, the subset {1,2,4,5,6,9,31} corresponds to 0x80000276, which represents the binary number 10000000000000000000001001110110_2. Similarly, 0xFF corresponds to {0,1,2,3,4,5,6,7} and 0x1101 corresponds to the binary number 0001000100000001_2 and to the set {0,8,12}.

Now, if you have worked with binary numbers or with hexadecimal numbers, you know that they have another, more common interpretation. They represent ordinary integers. Just as 342 represents the integer 3·10^2 + 4·10^1 + 2·10^0, the binary number 1101_2 represents the integer 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0, or 13. When used in this way, binary numbers are known as base-2 numbers, just as ordinary numbers are base-10 numbers. Hexadecimal numbers can be interpreted as base-16 numbers. For example, 0x3C7 represents the integer 3·16^2 + 12·16^1 + 7·16^0, or 967. So, does 1101_2 really represent the integer 13, or does it represent the set {0,2,3}? The answer is that to a person, 1101_2 can represent either. Both are valid interpretations, and the only real question is which interpretation is useful in a given circumstance. On the other hand, to the computer, 1101_2 doesn't represent anything. It's just a string of bits, and the computer manipulates the bits according to its program, without regard to their interpretation.

Of course, we still have to answer the question of whether it is ever useful to interpret strings of bits in a computer as representing sets.

Operations on Binary Numbers

If all we could do with sets were to "represent" them, it wouldn't be very useful. We need to be able to compute with sets. That is, we need to be able to perform set operations such as union and complement.

Many programming languages provide operators that perform set operations. In Java and related languages, the operators that perform union, intersection, and complement are written as |, &, and ~. For example, if x and y are 32-bit integers representing two subsets, X and Y, of {0,1,2,...,31}, then x | y is a 32-bit integer that represents the set X ∪ Y. Similarly, x & y represents the set X ∩ Y, and ~x represents the complement of X.

The operators |, &, and ~ are called bitwise logical operators because of the way they operate on the individual bits of the numbers to which they are applied. The languages descended from C have one more bitwise logical operator: a ^ b takes the exclusive-OR at each corresponding bit position of a and b. Recall that the exclusive-OR, also written p ⊕ q, is the variation of OR that is true when either p or q is true, but not both. If 0 and 1 are interpreted as the logical values false and true, then the bitwise logical operators perform the logical operations ∨, ∧, ¬, and ⊕ on individual bits. To see why this is true, let's look at the computations that these operators have to perform.

Let k be one of the members of {0,1,2,...,31}. In the binary numbers x, y, x | y, x & y, and ~x, the number k corresponds to the bit in position k. That is, k is in the set represented by a binary number if and only if the bit in position k in that binary number is 1. Considered as sets, x & y is the intersection of x and y, so k is a member of the set represented by x & y if and only if k is a member of both of the sets represented by x and y. That is, bit k is 1 in the binary number x & y if and only if bit k is 1 in x and bit k is 1 in y. When we interpret 1 as true and 0 as false, we see that bit k of x & y is computed by applying the logical "and" operator, ∧, to bit k of x and bit k of y. Similarly, bit k of x | y is computed by applying the logical "or" operator, ∨, to bit k of x and bit k of y. And bit k of ~x is computed by applying the logical "not" operator, ¬, to bit k of x. In each case, the logical operator is applied to each bit position separately. (Of course, this discussion is just a translation to the language of bits of the definitions of the set operations in terms of logical operators: A ∩ B = {x ∣ x ∈ A ∧ x ∈ B}, A ∪ B = {x ∣ x ∈ A ∨ x ∈ B}, and the complement of A is {x ∈ U ∣ ¬(x ∈ A)}.)

For example, consider the binary numbers 1011010_2 and 10111_2, which represent the sets {1,3,4,6} and {0,1,2,4}. Then 1011010_2 & 10111_2 is 10010_2. This binary number represents the set {1,4}, which is the intersection {1,3,4,6} ∩ {0,1,2,4}. It's easier to see what's going on if we write out the computation in columns, the way you probably first learned to do addition:

  1 0 1 1 0 1 0    {6,4,3,1}
&     1 0 1 1 1    {4,2,1,0}
=     1 0 0 1 0    {4,1}

Note that in each column in the binary numbers, the bit in the bottom row is computed as the logical "and" of the two bits that lie above it in the column. I've written out the sets that correspond to the binary numbers to show how the bits in the numbers correspond to the presence or absence of elements in the sets. Similarly, we can see how the union of two sets is computed as a bitwise "or" of the corresponding binary numbers.

  1 0 1 1 0 1 0    {6,4,3,1}
|     1 0 1 1 1    {4,2,1,0}
= 1 0 1 1 1 1 1    {6,4,3,2,1,0}

The complement of a set is computed using a bitwise "not" operation. Since we are working with 32-bit binary numbers, the complement is taken with respect to the universal set {0,1,2,...,31}. So, for example,

~1011010_2 = 11111111111111111111111110100101_2

Of course, we can apply the operators &, |, and ~ to numbers written in hexadecimal form, or even in ordinary, base-10 form. When doing such calculations by hand, it is probably best to first translate the numbers into binary form. For example,

0xAB7 & 0x168E = 1010 1011 0111_2 & 1 0110 1000 1110_2
               = 0 0010 1000 0110_2
               = 0x286

Bit masks

When computing with sets, it is sometimes necessary to work with individual elements. Typical operations include adding an element to a set, removing an element from a set, and testing whether an element is in a set. However, instead of working with an element itself, it's convenient to work with the set that contains that element as its only member. For example, testing whether 5 ∈ A is the same as testing whether {5} ∩ A ≠ ∅. The set {5} is represented by the binary number 100000_2 or by the hexadecimal number 0x20. Suppose that the set A is represented by the number x. Then, testing whether 5 ∈ A is equivalent to testing whether 0x20 & x ≠ 0. Similarly, the set A ∪ {5}, which is obtained by adding 5 to A, can be computed as x | 0x20. The set A ∖ {5}, which is the set obtained by removing 5 from A if it occurs in A, is represented by x & ~0x20.

The sets {0}, {1}, {2}, {3}, {4}, {5}, {6}, ..., {31} are represented by the hexadecimal numbers 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, ..., 0x80000000. In typical computer applications, some of these numbers are given names, and these names are thought of as names for the possible elements of a set (although, properly speaking, they are names for sets containing those elements). Suppose, for example, that a, b, c, and d are names for four of the numbers from the above list. Then a | c is the set that contains the two elements corresponding to the numbers a and c. If x is a set, then x & ~d is the set obtained by removing d from x. And we can test whether b is in x by testing if x & b ≠ 0.

Here is an actual example, which is used in the Macintosh operating system. Characters can be printed or displayed on the screen in various sizes and styles. A font is a collection of pictures of characters in a particular size and style. On the Macintosh, a basic font can be modified by specifying any of the following style attributes: bold, italic, underline, outline, shadow, condense, and extend. The style of a font is a subset of this set of attributes. A style set can be specified by or-ing together individual attributes. For example, an underlined, bold, italic font has style set underline | bold | italic. For a plain font, with none of the style attributes set, the style set is the empty set, which is represented by the number zero.

The Java programming language uses a similar scheme to specify style attributes for fonts, but currently there are only two basic attributes, Font.BOLD and Font.ITALIC. A more interesting example in Java is provided by event types. An event in Java represents some kind of user action, such as pressing a key on the keyboard. Events are associated with "components" such as windows, push buttons, and scroll bars. Components can be set to ignore a given type of event. We then say that that event type is disabled for that component. If a component is set to process events of a given type, then that event type is said to be enabled. Each component keeps track of the set of event types that are currently enabled. It will ignore any event whose type is not in that set. Each event type has an associated constant with a name such as AWTEvent.MOUSE_EVENT_MASK. These constants represent the possible elements of a set of event types. A set of event types can be specified by or-ing together a number of such constants. If c is a component and x is a number representing a set of event types, then the command "c.enableEvents(x)" enables the events in the set x for the component c. If y represents the set of event types that were already enabled for c, then the effect of this command is to replace y with the union, y | x. Another command, "c.disableEvents(x)", will disable the event types in x for the component c. It does this by replacing the current set, y, with y & ~x.

The bitwise operators are also useful when performing low-level bit manipulation, such as when writing code that interfaces with hardware devices. A binary number may be interpreted as a bit mask, where each 1 bit selects its bit position. For example, the mask 01010000_2 selects the 64's place and the 16's place. One way to produce such a mask is to use another bit operator: the shift (<<). The expression x << k results in x being shifted k bits to the left, with bits of 0 pushed in from the right (effectively multiplying by 2^k). Since 64 = 2^6 and 16 = 2^4, the above mask may be produced by 1 << 6 | 1 << 4.

Using a mask m, the following simple operations are possible on a binary number x:

  • x | m turns on (sets to 1) all of the selected bits of x, leaving the rest unchanged;
  • x & ~m turns off all of the selected bits of x;
  • x & m turns off all of the non-selected bits of x;
  • x ^ m "toggles" (flips between 0 and 1) all of the selected bits of x;
  • ~x toggles all of the bits of x.

The following table shows samples of these operations:

     x:  01101011_2
     m:  01010000_2
 x | m:  01111011_2
x & ~m:  00101011_2
 x & m:  01000000_2
 x ^ m:  00111011_2
    ~x:  10010100_2

Binary numbers in ReasonML

You can try out various operations on binary numbers in ReasonML. Here is a table of the corresponding number syntax and operations:

C++ and Java    ReasonML
0x2A            0x2A
0b101010        0b101010
x & y           x land y
x | y           x lor y
x ^ y           x lxor y
~x              lnot(x)
x << n          x lsl n
x >> n          x lsr n

There is no simple way to print out a number in binary in ReasonML, but you can use Printf.sprintf("0x%x", n) to convert n to a hexadecimal string.


Exercises

  1. Suppose that the numbers x and y represent the sets A and B. Show that the set A ∖ B is represented by x & (~y).

    Answer

    For each bit in the output, the result will be 1 if the corresponding bit in x was 1 while the bit in y was 0. That is, the element represented by that bit must be in A and not in B, which is exactly the condition needed for an element to be in A ∖ B.

  2. Write each of the following binary numbers in hexadecimal:

    • 10110110_2
      Answer

      0xB6

    • 10_2
      Answer

      0x2

    • 111100001111_2
      Answer

      0xF0F

    • 101001_2
      Answer

      0x29

  3. Write each of the following hexadecimal numbers in binary:

    • 0x123
      Answer

      100100011_2

    • 0xFADE
      Answer

      1111101011011110_2

    • 0x137F
      Answer

      1001101111111_2

    • 0xFF11
      Answer

      1111111100010001_2

  4. Give the value of each of the following expressions as a hexadecimal number:

    • 0x73 | 0x56A
      Answer

      0x57B

    • ~0x3FF0A2FF
      Answer

      0xC00F5D00

    • (0x44 | 0x95) & 0xE7
      Answer

      0xC5

    • 0x5C35A7 & 0xFF00
      Answer

      0x3500

    • 0x5C35A7 & ~0xFF00
      Answer

      0x5C00A7

    • ~(0x1234 & 0x4321)
      Answer

      0xFFFFFDDF

  5. Find a calculator (or a calculator program on a computer) that can work with hexadecimal numbers. Write a short report explaining how to work with hexadecimal numbers on that calculator. You should explain, in particular, how the calculator can be used to do the previous problem.

  6. This question assumes that you know how to add binary numbers. Suppose x and y are binary numbers. Under what circumstances will the binary numbers x + y and x | y be the same?

    Answer

    They will be the same if there is no carry from any column to the next. This will be true if there is no bit position where both x and y are 1. Another way of saying this is that x & y = 0.
  7. In addition to hexadecimal numbers, the programming languages Java, C, and C++ support octal numbers. Look up and report on octal numbers in Java, C, or C++. Explain what octal numbers are, how they are written, and how they are used.

    Answer

    Octal numbers are in base 8, so each digit of octal (0 through 7) corresponds to three bits of binary. Octal literals are written with a leading zero in many C-like languages, which is why 010 - 1 is 7 instead of 9 (which can be an obscure source of bugs for people who are not aware of this convention).

  8. In the UNIX (or Linux) operating system, every file has an associated set of permissions, which determine who can use the file and how it can be used. The set of permissions for a given file is represented by a nine-bit binary number. This number is sometimes written as an octal number. Research and report on the UNIX system of permissions. What set of permissions is represented by the octal number 752? by the octal number 622? Explain what is done by the UNIX commands "chmod g+rw filename" and "chmod o-w filename" in terms of sets. (Hint: Look at the man page for the chmod command. To see the page, use the UNIX command "man chmod". If you don't know what this means, you probably don't know enough about UNIX to do this exercise.)

  9. Java, C, and C++ each have a boolean data type that has the values true and false. The usual logical and, or, and not operators on boolean values are represented by the operators &&, ||, and !. C and C++ allow integer values to be used in places where boolean values are expected. In this case, the integer zero represents the boolean value false while any non-zero integer represents the boolean value true. This means that if x and y are integers, then both x & y and x && y are valid expressions, and both can be considered to represent boolean values. Do the expressions x & y and x && y always represent the same boolean value, for any integers x and y? Do the expressions x | y and x || y always represent the same boolean values? Explain your answers.

    Answer

    No, they are not always the same. For example, 1 & 2 is 0 (false), while 1 && 2 is 1 (true). However, the only way to get 0 (false) from either bitwise OR (|) or logical OR (||) is if both operands are 0, so the expressions x | y and x || y will always represent the same boolean values (although their integer values may well differ).

  10. Suppose that you, as a programmer, want to write a subroutine that will open a window on the computer's screen. The window can have any of the following options: a close box, a zoom box, a resize box, a minimize box, a vertical scroll bar, a horizontal scroll bar. Design a scheme whereby the options for the window can be specified by a single parameter to the subroutine. The parameter should represent a set of options. How would you use your subroutine to open a window that has a close box and both scroll bars and no other options? Inside your subroutine, how would you determine which options have been specified for the window?

    Answer

    Assign each option a different power of two; for example CLOSE = 1, ZOOM = 2, RESIZE = 4, MINIMIZE = 8, VSB = 16, and HSB = 32. A particular combination of options may be passed as the bitwise OR of these constants: CLOSE | VSB | HSB. In the subroutine, masking that parameter with each option will give a zero/non-zero value depending on whether the option was specified. For example, you should draw a close box if options & CLOSE is true (non-zero).

  11. Consider the following sequence of operations on two integer variables, x and y (again, this should work the same in all C-like languages):

x = x ^ y;
y = y ^ x;
x = x ^ y;

What is the net effect of this sequence on the values stored in x and y?

Answer

It swaps the values in x and y. After the second assignment, x is x ^ y and y is y ^ (x ^ y), which equals the original x. The third assignment then stores (x ^ y) ^ x (in terms of the original x and y) into x, which equals the original y.

  12. Bitwise operations are also useful when working with character data. In the ASCII character encoding (which is also the first 128 characters of Unicode), the digits '0' through '9' have codes 48 through 57; the uppercase Latin alphabet, 'A' through 'Z', has codes 65 through 90; and the corresponding lowercase letters have codes 97 through 122.
  • Convert the endpoints of each of these code ranges to binary.
    Answer

    '0' is 0110000_2 and '9' is 0111001_2. In hexadecimal, these are 0x30 and 0x39.

    'A' is 1000001_2 and 'Z' is 1011010_2. In hexadecimal, these are 0x41 and 0x5A.

    'a' is 1100001_2 and 'z' is 1111010_2. In hexadecimal, these are 0x61 and 0x7A.

  • Give an expression using only integer constants and bitwise operations that will convert the character code c for an ASCII digit into its corresponding integer value. Do not worry about what it will do to non-digits.
    Answer

    We just need to select the lowest four bits: c & 0xF

  • Give a similar expression that will convert an integer n in the range 0 to 9 into the corresponding ASCII digit code. Do not worry about error cases.
    Answer

    Given a number 0 through 9, we just need to set the 16's and 32's bits to 1: n | 0x30 (we can also do n + 48, or n + '0').

  • Give expressions that will take a letter whose ASCII code is c and
    1. convert it to uppercase,
    2. convert it to lowercase, and
    3. toggle it between upper- and lowercase. Do not worry about error cases.
      Answer

      Converting to upper-case corresponds to turning off (resetting) the 32's bit: c & ~0x20, or c & 0x5F (since we only care about the bottom seven bits for ASCII). Converting to lower-case corresponds to turning on (setting) that bit: c | 0x20. Toggling the case corresponds to toggling that bit: c ^ 0x20.


  [1] Actually, in some versions of C and C++, a value of type int is a 16-bit number, while in others it is a 64-bit number. A 16-bit number can be used to represent a subset of the set {0,1,2,...,15}. The principle, of course, is the same.