Performance testing of an exact algorithm for the Subset Sum Problem on different personal computers
Mykhailo Lenskyi
Oles Honchar Dnipro National University
Hanna Mykhalchuk
Oles Honchar Dnipro National University
This paper investigates the performance of an exact algorithm for the NP-complete Subset Sum Problem on different personal computers. The Subset Sum Problem asks whether there exists a subset of a given set of integers that sums to a specified target value. Finding exact solutions for large instances is computationally challenging because the number of possible subsets grows exponentially with the input size. The proposed algorithm employs a parallel backtracking approach consisting of a breadth-first search phase executed on the CPU and a depth-first search phase executed on the GPU. The breadth-first phase starts with an empty subset and iteratively adds elements, storing intermediate subsets and their sums in a queue; branch pruning rules discard unproductive subsets. When a depth limit is reached, the queue is transferred to GPU memory, and multiple threads perform depth-first searches in parallel on different subtrees, each using a stack for intermediate subsets and sums. The depth-first phase follows the same pruning logic and terminates when all threads complete their work or a solution is found. The algorithm was implemented in C# using the ILGPU library, which provides high-level abstractions for GPU programming. Performance tests were conducted on three personal computers with different CPU and GPU configurations: an Intel Core i5-8250U with an Nvidia GeForce MX150, an AMD Ryzen 7 5800H with an Nvidia GeForce GTX 1650, and an AMD Ryzen 5 7600 with an Nvidia GeForce RTX 4070 Ti SUPER. The results demonstrate significant speedups from GPU parallelization compared to sequential CPU execution. For an input set of 1000 elements, the fastest time of 567 ms was obtained on the RTX 4070 Ti SUPER with 16384 threads, almost 12 times faster than sequential execution on the same computer's CPU and 43 times faster than the weakest tested system. The optimal thread count varies with the input size and the GPU. The massive parallelism, specialized architecture, and high memory bandwidth of GPUs make them well suited to this highly parallelizable problem. The proposed algorithm can be used as a building block for solving other related problems.
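To make the two-phase search concrete, the C# sketch below reconstructs the CPU breadth-first phase and the per-subtree depth-first continuation described above. It is an illustrative sketch under stated assumptions, not the authors' implementation: the partial-state representation (index of the next element, running sum), the particular pruning rules (discard a branch whose sum already exceeds the target, assuming non-negative inputs, or whose sum cannot reach the target even with all remaining elements), and the depth limit of 3 are choices made for illustration, and the depth-first part is shown as plain CPU code rather than the ILGPU kernel used in the actual implementation.

using System;
using System.Collections.Generic;
using System.Linq;

static class SubsetSumSketch
{
    // Illustrative reconstruction of the CPU breadth-first phase: expand the search
    // tree level by level up to depthLimit, keeping a queue of partial states
    // (index of the next element to consider, running sum). The two pruning rules
    // are assumptions, since the abstract does not spell them out:
    //   (a) drop a branch whose sum already exceeds the target (assumes non-negative input);
    //   (b) drop a branch that cannot reach the target even if it takes every remaining element.
    static List<(int Next, long Sum)> BreadthFirstFrontier(int[] set, long target,
                                                           long[] suffixSums, int depthLimit)
    {
        depthLimit = Math.Min(depthLimit, set.Length);
        var frontier = new Queue<(int Next, long Sum)>();
        frontier.Enqueue((0, 0L));

        for (int depth = 0; depth < depthLimit; depth++)
        {
            int levelSize = frontier.Count;
            for (int k = 0; k < levelSize; k++)
            {
                var (next, sum) = frontier.Dequeue();
                TryEnqueue(frontier, target, suffixSums, (next + 1, sum));             // skip element
                TryEnqueue(frontier, target, suffixSums, (next + 1, sum + set[next])); // take element
            }
        }
        return frontier.ToList();
    }

    static void TryEnqueue(Queue<(int Next, long Sum)> frontier, long target,
                           long[] suffixSums, (int Next, long Sum) node)
    {
        if (node.Sum > target) return;                          // pruning rule (a)
        if (node.Sum + suffixSums[node.Next] < target) return;  // pruning rule (b)
        frontier.Enqueue(node);
    }

    // Depth-first continuation for one subtree, using an explicit stack and the same
    // pruning rules. In the paper, each GPU thread runs this per-subtree search as an
    // ILGPU kernel over the transferred frontier; here it is shown as plain CPU code.
    static bool DepthFirst(int[] set, long target, long[] suffixSums, (int Next, long Sum) start)
    {
        var stack = new Stack<(int Next, long Sum)>();
        stack.Push(start);
        while (stack.Count > 0)
        {
            var (next, sum) = stack.Pop();
            if (sum == target) return true;
            if (next == set.Length || sum > target || sum + suffixSums[next] < target) continue;
            stack.Push((next + 1, sum));              // skip element 'next'
            stack.Push((next + 1, sum + set[next]));  // take element 'next'
        }
        return false;
    }

    static void Main()
    {
        int[] set = { 3, 34, 4, 12, 5, 2 };
        long target = 9; // e.g. 4 + 5

        // suffixSums[i] = set[i] + set[i + 1] + ... ; used by pruning rule (b).
        long[] suffixSums = new long[set.Length + 1];
        for (int i = set.Length - 1; i >= 0; i--)
            suffixSums[i] = suffixSums[i + 1] + set[i];

        var frontier = BreadthFirstFrontier(set, target, suffixSums, depthLimit: 3);
        bool found = frontier.Any(node => DepthFirst(set, target, suffixSums, node));
        Console.WriteLine(found ? "A subset summing to the target exists." : "No such subset.");
    }
}

Encoding a partial state as a (next index, running sum) pair keeps the per-state memory footprint constant, which is one reason this kind of backtracking maps well onto thousands of GPU threads that each maintain a private stack.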