Exploring some Data Structures

Data structure is a way of organizing and storing data so that it can be efficiently accessed and updated. Thus, it is important to choose suitable data structures for different kinds of data. There are many pre-developed data structures that programmers can use directly, and it is mostly helpful if one knows how they work.

1. Array and Linked List

Array and Linked List are both used for storing data with similar type of elements. The main differences are that arrays store elements in sequential memory locations, which makes the elements easy and fast to access by indices. Whereas elements stored in linked lists are mostly not in contiguous memory locations, thus they rely on additional reference (along with the data itself) in each node to link to the next node.

Due to their design differences discussed above, their advantages and disadvantages are:

Array:

  • Advantages:
    • Each element can be accessed directly by its index (elements are stored consecutively)
  • Disadvantages:
    • Pre-defined fixed size (memory allocated at compile time), cannot store data dynamically
    • Slow to add or delete elements since each operation requires moving many elements

Linked List:

  • Advantages:
    • Dynamic and flexible size which can change at runtime
    • Can efficiently insert or delete elements simply by re-pointing the references in each node
  • Disadvantages:
    • If getting access to one element, all previous elements have to be traversed. So it takes linear time, which can be slower

2. Hash Table

2.1 Hash Table

Hash Table is a data structure which stores key-value pairs / entries. It uses hashing method to transform the original key to a hash code and stores the hash code as the index of the value.

For example, a key-value pair or entry will be stored in hash table first by transforming the key to an index by a pre-defined hashing method:

  • key –> hashing method –> index

Then the index and value will be stored:

  • index,value

This way, it speeds up the lookup performance (with an average time complexity of O(1)), which doesn’t depend on the number of entries. The main principle is to distribute all the entries into a fixed amount of “buckets” based on the hash codes of the keys. For instance, if there are 2000 key-value pairs to store and the hash table has 1000 buckets, then each bucket can have only 2 entries on average, which is very efficient to find. However, the condition for this system to work is that the entries have to be distributed evenly (NOT all 2000 entries in the SAME bucket). That is to say, the hash codes need to be distributed evenly. In C#, one can override the GetHashCode() method for an more optimal hashing method implementation.

2.2 Hash Collision

If more than one keys return the same hash code from the hash method, it is called hash collision. Although the ideal scenario of a hash method would be not generating the same hash code for different keys, it is not possible to avoid especially for large datasets. The better we can do is to minimize the occurrences of hash collisions as well as considering the performance.

Since hash collision is inevitable, there are some common strategies one can use to deal with the collisions:

  • Separate chaining / Open hashing

    In this method, one can save pointers in the cells in a hash table. The pointer points to the next value with the same hash code in a linked list. This way, the values with the same hash code can be stored in a same cell. For example:

    • When (hash code/index, value) = (0, A) comes in => Entry[0] = A
    • When (hash code/index, value) = (0, B) comes in => B.next = A, Entry[0] = B
    • When (hash code/index, value) = (0, C) comes in => C.next = B, Entry[0] = C

However, the disadvantages is also very obvious, which is the difficulty of tracking the lists and identify values.

  • Open addressing / Closed hashing:

    The main idea of this method is: when the collision happens, probe another empty cell and put the value there. There are different ways of probing the next empty cell. The most straightforward one is linear probing which probes the empty cell sequentially, and inserts the value when one empty cell occurs.

    When finding the values that are stored through linear probing, it follows the same procedure (find the cell of the initial hash code first, and iterate the following cells linearly until finding the target value). Then it becomes clear that one cannot delete a cell which is inserted by linear probing, otherwise the cells after the deleted cell would be lost tracking. So when deleting, a keyword which represents ‘deleted’ is usually marked on the cell instead of really deleting.

    There are also other methods for finding the empty cells, such as quadratic probing or double hashing. The disadvantage of open address is that the more collisions, the less empty space, the more time it requires to probe.

2.3 GetHashCode() in C#

In C#, GetHashCode() is a method under Object class. It uses to return a hash code for the instance of the object, which is used for identify and insert an object in a hash-based collection. Two objects that return different hash codes means that they are not equal, but NOT vice versa (collision). The hash codes of instances are not unique values and may be differ from platform to platform, so they should not used as ID for objects.

3. Dictionary

Dictionary is also a data structure to store key-value pairs. The keys in a dictionary need to be simple types, such as integers or strings, and the keys also have to be unique, otherwise it will overwrite the existing value of that key. In C#, dictionaries are implemented as hash tables. While in C++, dictionary is implemented as map which is internally a red-black tree or AVL tree.