手写简易 HashMap

桩白墨
Java
2022-07-30
117热度
0评论

导航

这是我去年碰到的一道面试题，当时直接整懵逼了。源码大致看过，原理也知道，突然要写还真写不出来。于是这两天又把这个问题翻出来，尝试写了一版。

1 明确需求

实现一个简易的 HashMap，包含 hashmap 的核心功能（put、get）
实现拉链法即可，不必实现红黑树
最好实现扩容逻辑
不支持 key 为 null 的情况（要支持也可以，需要单独处理，比较简单）

为了让需求更加明确，写了单元测试：

@Test
public void should_work() {
  MyMap<Integer, Integer> m = new MyHashMap<>();
  m.put(1, 1001);
  m.put(2, 1002);
  m.put(1, 1003);
  m.put(3, 1004);
  m.remove(3);
  assertEquals(2, m.size());
  assertEquals(1002, (int)m.get(2));
  assertEquals(1003, (int)m.get(1));
  assertTrue(m.containsKey(1));
  assertTrue(m.containsValue(1003));
  assertFalse(m.containsKey(3));
  assertFalse(m.containsValue(1004));
}

@Test
public void should_resize() {
  MyHashMap<Integer, Integer> m = new MyHashMap<>();
  for (int i = 0; i < 50; i++) {
    m.put(i, i+1000);
  }
  assertEquals(128, m.getCapacity());

  for (int i = 50; i < 100; i++) {
    m.put(i, i+1000);
  }
  assertEquals(256, m.getCapacity());
}

一共写了两个 case。第一个 case 测试 map 的基本功能，第二个 case 测试扩容。

2 大概思路

先把接口定义出来，然后开始实现。

接口有 Map 和 Entry 两个。

关键实现：

拉链法。定义一个 table 数组，作为“桶”，桶中放入 Entry 的链表。插入元素时，使用“头插法”。
直接用 Objects.hash() 方法计算 hash 值，然后取余数找到对应的桶，简化操作。
根据负载因子计算阈值，发生扩容时，对元素进行遍历，逐个根据 hash 值重新插入新的桶中。
只做扩容，不做缩容（我看到 JDK1.8 的源码中没有做缩容）。

3 实现步骤

3.1 定义接口

定义 MyMap 接口，以及内部的 Entry 接口。

public interface MyMap<K, V> {
  int size();
  void put(K key, V val);
  V get(K key);
  void remove(K key);
  boolean containsKey(K key);
  boolean containsValue(V val);

  interface Entry<K, V> {
    K getKey();
    V getValue();
    void setValue(V val);
    int hashCode();
  }
}

3.2 写单元测试

针对接口写测试。代码已经在前面写了，此处略。

（在实际过程中，我先把单元测试的代码都注释掉，然后按照细分功能点逐步打开注释，打开一个，实现一个。）

3.3 MyHashMap 初始化

定义一个实现类 MyHashMap，实现 MyMap 接口。做一些基本的初始化操作：

定义两个常量：INIT_CAPACITY，LOAD_FACTOR。在原版 HashMap 中，负载因子是可以自定义的，我们这里写死即可。
定义字段：size（元素的个数），table（桶的数组，用于存放所有元素），capacity（table数组的容量），threshold（阈值，等于 capacity * LOAD_FACTOR）。

3.4 定义内部的 Node 类，实现 Entry

Node 可以实现链表操作。

static class Node<K, V> implements MyMap.Entry<K, V> {
    private K key;
    private V val;
    private int hashCode;
    Node<K, V> next;

    Node(K key, V val, Node<K, V> next) {
      this.key = key;
      this.val = val;
      this.next = next;
      hashCode = hash(key);
    }

    @Override
    public K getKey() {
      return key;
    }

    @Override
    public V getValue() {
      return val;
    }

    @Override
    public void setValue(V val) {
      this.val = val;
    }

    @Override
    public int hashCode() {
      return hashCode;
    }
  }

3.5 实现 put 操作

先尝试寻找相同 key 的元素，如果找到了，进行更新操作。

如果没找到，说明是一个新的键值对，用头插法进行插入。如果达到扩容条件，进行扩容。

@Override
public void put(K key, V val) {
  Node<K, V> node = findNode(key);
  if (node != null) {
    node.setValue(val);
    return;
  }
  int idx = getIndex(key);
  table[idx] = new Node<>(key, val, table[idx]);
  size++;
  if (size > threshold) {
    resize();
  }
}

private Node<K, V> findNode(K key) {
  int idx = getIndex(key);
  Node<K, V> node = table[idx];
  while (node != null) {
    if (Objects.equals(node.getKey(), key)) {
      return node;
    }
    node = node.next;
  }
  return null;
}

private int getIndex(K key) {
  int hashCode = hash(key);
  return hashCode % table.length;
}

3.6 实现 get 操作

get 操作非常简单。主要就是上一步的 findNode 方法。先找到桶，然后再去链表里找。

@Override
public V get(K key) {
  Node<K, V> node = findNode(key);
  return node == null ? null : node.getValue();
}

3.7 实现 remove 操作

先找到桶（链表），然后删除链表中的目标元素。

@Override
public void remove(K key) {
  int idx = getIndex(key);
  table[idx] = removeFromList(table[idx], key);
}

private Node<K, V> removeFromList(Node<K, V> head, K key) {
  Node<K, V> dummyHead = new Node<>(null, null, head);
  Node<K, V> pre = dummyHead;
  Node<K, V> cur = head;
  while (cur != null) {
    if (Objects.equals(cur.getKey(), key)) {
      pre.next = cur.next;
      size--;
      return dummyHead.next;
    }
    pre = cur;
    cur = cur.next;
  }
  return dummyHead.next;
}

3.8 实现 containsKey 和 containsValue

这两个比较简单。containsKey 就是 findNode 查一下。containsValue 需要做一次遍历。

@Override
public boolean containsKey(K key) {
  return findNode(key) != null;
}

@Override
public boolean containsValue(V val) {
  for (Node<K, V> head : table) {
    while (head != null) {
      if (Objects.equals(head.getValue(), val)) {
        return true;
      }
      head = head.next;
    }
  }
  return false;
}

至此，基本 api 都实现了。

3.9 实现 resize 扩容

这是最后一步，做完大功告成。

将容量扩大为原来的 2 倍
创建一组新的桶，对原来的元素进行迁移
更新容量、阈值等字段

@SuppressWarnings({"unchecked"})
private void resize() {
  if (table == null) {
    capacity = INIT_CAPACITY;
    table = new Node[capacity];
    threshold = (int) (capacity * LOAD_FACTOR);
    return;
  }
  int newCap = capacity * 2;
  Node<K, V>[] newTable = new Node[newCap];
  for (Node<K, V> head : table) {
    while (head != null) {
      Node<K, V> next = head.next;
      int newIdx = head.hashCode() % newCap;
      head.next = newTable[newIdx];
      newTable[newIdx] = head;
      head = next;
    }
  }

  table = newTable;
  capacity = newCap;
  threshold = (int) (capacity * LOAD_FACTOR);
}

4 完整源码

已上传 github。地址：

https://github.com/plough/sort-algorithms/tree/master/src/main/java/com/plough/datastructure/map

5 HashMap 附加问题

5.1 为什么容量一定要是 2 的整数倍？

在我们简易版本的实现中，其实是没有这个要求的。对于原版的 HashMap，保持容量为 2 的整数倍，可以提高散列均匀性，减少 hash 碰撞概率。

为什么？

原版中，getIndex 的实现是 hash(key) & (n-1)。当 n 为偶数时，最后一位始终为 1，完成 & 运算后，最后一位可能是 1 或 0。如果 n 为奇数，& 运算后，最后一位只可能是 0，也就是说所有奇数位置都无法使用。

5.2 为什么负载因子是 0.75？

首先，一定要有一个负载因子，而且它必然有一个相对合适的取值。如果负载因子过大，hashmap 会变得紧凑，增加了碰撞概率；如果负载因子过小，hashmap 中会有很多空闲的桶，浪费空间。

那么，为什么是 0.75 呢？这是一个数学问题。这个值是根据“泊松分布”算出来的，最佳取值在 0.7-0.8 之间，于是便取了 0.75。

至于具体怎么算的，我数学不好，就没深究了。感兴趣可以其他地方再查查资料。