“Be wary of time complexity pitfalls"

mengqinyuan

mengqinyuan

Posted on August 1, 2024

“Be wary of time complexity pitfalls"

Be wary of time complexity pitfalls

Write Here

A bilibili vedio also shows this : https://www.bilibili.com/video/BV16u4m1c7cU/?spm_id_from=333.999.0.0 and I think it's a good vedio, but its language is Chinese.

Image description

time complexity

  • What does time complexity mean?
  • Time complexity is a measure of how much time an algorithm takes to run as a function of the size of its input. It is a way to describe the efficiency of an algorithm, and it is used to compare different algorithms and determine which one is the most efficient.

  • How to calculate time complexity?

  • To calculate time complexity, we need to consider the number of operations performed by the algorithm as a function of the size of the input. The most common way to measure the number of operations is by counting the number of times a particular operation is performed.

  • What are some common pitfalls when calculating time complexity?

    1. Not considering the input size: If we only consider the number of operations performed by the algorithm, we may underestimate the time complexity. For example, if we count the number of times a loop runs, but we don't take into account the size of the input, then the time complexity may be too high.
    1. Not considering the algorithm's efficiency: Some algorithms may perform many operations even when the input size is small, which can lead to high time complexity. For example, sorting algorithms like bubble sort and selection sort have quadratic time complexity, even though they may only need to swap two elements in a small array.
    1. Not considering the algorithm's worst-case scenario: Some algorithms have a best-case scenario where they perform fewer operations than the worst-case scenario. For example, searching algorithms like binary search have a best-case scenario where they find the target element in logarithmic time, but they have a worst-case scenario where they need to examine all elements in the array.

Let's see some examples of time complexity

Here is a question:
Find the max 10 integers in a list.

import random
ls = [random.randint(1, 100) for _ in range(n)]
Enter fullscreen mode Exit fullscreen mode

Here is the test function:

import time
def benchmark(func, *args) -> float:
    # bench mark for testing
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    return end - start
Enter fullscreen mode Exit fullscreen mode

Solution1: Use heap

Here is the solution which uses the heapq module:

def find_max_n(ls, n):
    import heapq
    return heapq.nlargest(n, ls)
Enter fullscreen mode Exit fullscreen mode

Or we write it in the python way:

def find_largest_n(nums, n):
    if n <= 0:
        return []

    max_heap = []

    for num in nums:
        if len(max_heap) < n:
            max_heap.append(num)
            # call sift_down
            for i in range((len(max_heap) - 2) // 2, -1, -1):
                _sift_down(max_heap, i)
        elif num > max_heap[0]:
            max_heap[0] = num
            _sift_down(max_heap, 0)

    return max_heap

def _sift_down(heap, index):
    left = 2 * index + 1
    right = 2 * index + 2
    largest = index

    if left < len(heap) and heap[left] > heap[largest]:
        largest = left

    if right < len(heap) and heap[right] > heap[largest]:
        largest = right

    if largest != index:
        heap[index], heap[largest] = heap[largest], heap[index]
        _sift_down(heap, largest)

Enter fullscreen mode Exit fullscreen mode

Solution2: Use sort

Here is the solution which uses the sort function:

def find_max_n(ls, n):
    return sorted(ls, reverse=True)[:n]
Enter fullscreen mode Exit fullscreen mode

Testing

Almost everyone know that The time complexity of the heap, is O(n log k), and the time complexity of the sort function is O(n log n).

It seems that the First solution is better than the second one. But it is not true in python.

Look at the final code:

import time
def benchmark(func, *args) -> float:
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    return end - start

def find_max_n_heapq(ls, n):
    import heapq
    return heapq.nlargest(n, ls)

def find_max_n_python_heap(nums, n):
    if n <= 0:
        return []

    max_heap = []

    for num in nums:
        if len(max_heap) < n:
            max_heap.append(num)
            # call sift_down
            for i in range((len(max_heap) - 2) // 2, -1, -1):
                _sift_down(max_heap, i)
        elif num > max_heap[0]:
            max_heap[0] = num
            _sift_down(max_heap, 0)

    return max_heap

def _sift_down(heap, index):
    left = 2 * index + 1
    right = 2 * index + 2
    largest = index

    if left < len(heap) and heap[left] > heap[largest]:
        largest = left

    if right < len(heap) and heap[right] > heap[largest]:
        largest = right

    if largest != index:
        heap[index], heap[largest] = heap[largest], heap[index]
        _sift_down(heap, largest)


def find_max_n_sorted(ls, n):
    return sorted(ls, reverse=True)[:n]

# test
import random
for n in range(10, 10000, 100):
    ls = [random.randint(1, 100) for _ in range(n)]
    print(f"n = {n}")
    print(f"Use    Heapq: {benchmark(find_max_n_heapq, ls, n)}")
    print(f"Python Heapq: {benchmark(find_max_n_python_heap, ls, n)}")
    print(f"Sorted      : {benchmark(find_max_n_sorted, ls, n)}")
Enter fullscreen mode Exit fullscreen mode

I run it in Python3.11 VScode, Here is the result:

n is not big

n = 900
Use Heapq: 0.002430099993944168
Python Heapq: 0.06343129999004304
Sorted : 9.280000813305378e-02
n = 910
Use Heapq: 9.220000356435776e-05
Python Heapq: 0.07715500006452203
Sorted : 9.360001422464848e-05
n = 920
Use Heapq: 9.440002031624317e-05
Python Heapq: 0.06573969998862594
Sorted : 0.00012450001668185
n = 930
Use Heapq: 9.689992293715477e-05
Python Heapq: 0.06760239996947348
Sorted : 9.66999214142561e-05
n = 940
Use Heapq: 9.600003249943256e-05
Python Heapq: 0.07372559991199523

Sorted : 9.680003859102726e-05

n = 950
Use Heapq: 9.770004544407129e-05
Python Heapq: 0.07306570000946522
Sorted : 0.000119799980893731
n = 960
Use Heapq: 9.980006143450737e-05
Python Heapq: 0.0771204000338912
Sorted : 0.00022829999215900898
n = 970
Use Heapq: 0.0001601999392732978
Python Heapq: 0.07493270002305508
Sorted : 0.00010840001050382853
n = 980
Use Heapq: 9.949994273483753e-05
Python Heapq: 0.07698320003692061
Sorted : 0.00010300008580088615
n = 990
Use Heapq: 9.979994501918554e-05
Python Heapq: 0.0848745999392122
Sorted : 0.00012620002962648869

if n is big ?

n = 10000
Use Heapq: 0.003642000025138259
Python Heapq: 9.698883199947886
Sorted : 0.0010799999581649
n = 11000
Use Heapq: 0.0014836000045761466
Python Heapq: 10.537632800056599
Sorted : 0.0012236000038683414
n = 12000
Use Heapq: 0.001384599949233234
Python Heapq: 12.328411899972707
Sorted : 0.001322699943557381
n = 13000
Use Heapq: 0.0020017001079395413
Python Heapq: 15.637207800056785
Sorted : 0.0015075999544933438
n = 14000
Use Heapq: 0.0017026999266818166
Python Heapq: 17.298848500009626
Sorted : 0.0016967999981716275
n = 15000
Use Heapq: 0.0017773000290617347
Python Heapq: 20.780625900020823
Sorted : 0.0017105999868363142

What I find & how to improve it

When n is big, Sorted costs a little time (sometimes even better than use heapq), but Python Heapq costs a lot of time.

  • Why Sorted costs a little time, but Python Heap costs a lot of time?
  • Because sorted() is a bulit-in function in Python, you can find python office document about it.

Bulit-in function is faster than heapq because it is written in C, which is a compiled language.

  • How to improve it?
  • You can use the built-in function sorted() instead of heapq.nlargest() to improve the performance of your code. The sorted() function is a built-in function in Python, which is implemented in C and is therefore much faster than heapq.sort()

Maybe heapq.nlargest is faster in some situatios, but I cannot make sure that.

What about use other language?

  • Use other language like C++, C#, Java, etc.
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <random>

// C++ benchmark function
double benchmark(std::function<void(std::vector<int>, int)> func, std::vector<int> ls, int n) {
    auto start = std::chrono::high_resolution_clock::now();
    func(ls, n);
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Using std::partial_sort to find the n largest elements
void find_max_n_sorted(std::vector<int>& ls, int n) {
    std::partial_sort(ls.begin(), ls.begin() + n, ls.end(), std::greater<int>());
    ls.resize(n);
}

// Manual heap implementation
void _sift_down(std::vector<int>& heap, int index) {
    int left = 2 * index + 1;
    int right = 2 * index + 2;
    int largest = index;

    if (left < heap.size() && heap[left] > heap[largest])
        largest = left;

    if (right < heap.size() && heap[right] > heap[largest])
        largest = right;

    if (largest != index) {
        std::swap(heap[index], heap[largest]);
        _sift_down(heap, largest);
    }
}

std::vector<int> find_max_n_python_heap(std::vector<int> nums, int n) {
    if (n <= 0)
        return {};

    std::vector<int> max_heap;

    for (int num : nums) {
        if (max_heap.size() < n) {
            max_heap.push_back(num);
            for (int i = (max_heap.size() - 2) / 2; i >= 0; --i)
                _sift_down(max_heap, i);
        } else if (num > max_heap[0]) {
            max_heap[0] = num;
            _sift_down(max_heap, 0);
        }
    }

    return max_heap;
}

int main() {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<> dis(1, 100);

    for (int n = 10; n <= 10000; n += 100) {
        std::vector<int> ls(n);
        std::generate(ls.begin(), ls.end(), [&]() { return dis(gen); });

        std::vector<int> ls_copy1 = ls;
        std::vector<int> ls_copy2 = ls;

        std::cout << "n = " << n << std::endl;
        std::cout << "Use    Heapq: " << benchmark(find_max_n_sorted, ls_copy1, n) << std::endl;
        std::cout << "Python Heapq: " << benchmark(find_max_n_python_heap, ls_copy2, n) << std::endl;

        // Note: C++ std::nth_element or std::partial_sort could be used here for comparison
    }

    return 0;
}

Enter fullscreen mode Exit fullscreen mode
import java.util.*;
import java.time.Duration;
import java.time.Instant;

public class MaxElementsBenchmark {

    public static double benchmark(Runnable func) {
        Instant start = Instant.now();
        func.run();
        Instant end = Instant.now();
        return Duration.between(start, end).toNanos() / 1e9;
    }

    public static List<Integer> findMaxNHeap(ArrayList<Integer> ls, int n) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        for (int num : ls) {
            if (minHeap.size() < n) {
                minHeap.offer(-num);
            } else if (-num > minHeap.peek()) {
                minHeap.poll();
                minHeap.offer(-num);
            }
        }
        List<Integer> result = new ArrayList<>(minHeap);
        Collections.sort(result);
        Collections.reverse(result);
        return result;
    }

    public static List<Integer> findMaxNManualHeap(ArrayList<Integer> nums, int n) {
        if (n <= 0) return new ArrayList<>();

        ArrayList<Integer> maxHeap = new ArrayList<>();
        for (int num : nums) {
            if (maxHeap.size() < n) {
                maxHeap.add(num);
                siftDown(maxHeap, maxHeap.size() - 1);
            } else if (num > maxHeap.get(0)) {
                maxHeap.set(0, num);
                siftDown(maxHeap, 0);
            }
        }
        return maxHeap;
    }

    private static void siftDown(ArrayList<Integer> heap, int index) {
        int left = 2 * index + 1;
        int right = 2 * index + 2;
        int largest = index;

        if (left < heap.size() && heap.get(left) > heap.get(largest))
            largest = left;

        if (right < heap.size() && heap.get(right) > heap.get(largest))
            largest = right;

        if (largest != index) {
            Collections.swap(heap, index, largest);
            siftDown(heap, largest);
        }
    }

    public static List<Integer> findMaxNSorted(ArrayList<Integer> ls, int n) {
        Collections.sort(ls, Collections.reverseOrder());
        return ls.subList(0, Math.min(n, ls.size()));
    }

    public static void main(String[] args) {
        Random rand = new Random();
        for (int n = 10; n <= 10000; n += 100) {
            ArrayList<Integer> ls = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                ls.add(rand.nextInt(100) + 1);
            }

            System.out.println("n = " + n);
            System.out.println("Use    Heapq: " + benchmark(() -> findMaxNHeap(ls, n)));
            System.out.println("Python Heapq: " + benchmark(() -> findMaxNManualHeap(ls, n)));
            System.out.println("Sorted      : " + benchmark(() -> findMaxNSorted(ls, n)));
        }
    }
}

Enter fullscreen mode Exit fullscreen mode
  • The result is different from the C++ version. The C++ version is faster than the Java version. This is because the C++ version uses a built-in function, while the Java version uses a manual heap implementation.

Conclusion

When we are dealing with large data, we should use built-in function instead of heapq.sort() to improve the performance of our code. We must be wary of time complexity pitfalls when dealing with large data. Sometimes time complexity pitfalls are unavoidable, but we should try to avoid them as much as possible.

About Me

Hello, I'm mengqinyuan. I'm a student. I love to learn new things.
You can see my github: [MengQinYuan's Github][https://github.com/mengqinyuan]

Other things

Eight Queens Problem: [https://www.geeksforgeeks.org/8-queen-problem/]
Time & Space Complexity: [https://www.geeksforgeeks.org/time-complexity-and-space-complexity/]
Heap: [https://www.geeksforgeeks.org/heap-data-structure/]

💖 💪 🙅 🚩
mengqinyuan
mengqinyuan

Posted on August 1, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related