
Tuning hashcat for real GPU throughput

Benchmarks lie if you read them wrong. Workload profiles, optimized kernels, thermal throttling, multi-GPU and segmenting big attacks, with honest cloud rental math.


The first thing people do with a new GPU is run a benchmark and quote the headline number. That number is real for about thirty seconds. Sustained throughput on a real attack, on a warm card, with the kernel your hash actually uses, is the figure that decides whether a job finishes tonight or next week. Tuning is the gap between those two.

Benchmark the mode you are running

hashcat -b runs a default selection of common modes and prints a hashrate for each; add --benchmark-all to walk every mode. Useful for comparing cards, useless for planning a specific job, because the algorithm dominates everything.

hashcat -b              # default benchmark, common modes
hashcat -b -m 0         # MD5 only
hashcat -b -m 3200      # bcrypt only

Run the last two and the gap is the whole story. MD5 on a current GPU benchmarks in the tens of billions of hashes per second (read 21000.0 MH/s as 21 billion H/s). bcrypt at cost 5 on the same card lands in the low tens of thousands. That is not a typo, it is roughly six orders of magnitude. On fast hashes the GPU is the whole game and tuning matters enormously. On slow hashes the algorithm has already won; a faster card scales your rate but barely dents the keyspace, and your effort belongs in the wordlist instead.
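To make the gap concrete, here is a minimal sketch that converts benchmark figures to plain hashes per second and measures the distance in orders of magnitude. The helper names are hypothetical, and the bcrypt figure is an assumed stand-in for "low tens of thousands", not a measured result.

```python
import math

# Unit suffixes as they appear in hashcat speed output.
UNITS = {"H/s": 1, "kH/s": 1e3, "MH/s": 1e6, "GH/s": 1e9, "TH/s": 1e12}


def to_hashes_per_second(figure: str) -> float:
    """Convert a string such as '21000.0 MH/s' to plain hashes per second."""
    value, unit = figure.split()
    return float(value) * UNITS[unit]


def orders_of_magnitude(fast: str, slow: str) -> float:
    """Base-10 log of the ratio between two benchmark figures."""
    return math.log10(to_hashes_per_second(fast) / to_hashes_per_second(slow))


md5 = "21000.0 MH/s"   # figure quoted in the text
bcrypt = "20.0 kH/s"   # assumed value, for illustration only

print(to_hashes_per_second(md5))                    # 21 billion H/s
print(round(orders_of_magnitude(md5, bcrypt), 1))   # about 6
```

Swap in the two numbers from your own card; the ratio, not either headline figure, tells you whether GPU tuning is worth your time.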

Workload profiles

-w sets how aggressively hashcat feeds the GPU, from 1 to 4.

hashcat -m 0 -w 3 hashes.txt rockyou.txt

Profile 1 is for a machine you are actively using. Profile 2 is the default. Profile 3 is right for most dedicated runs. Profile 4 squeezes out the last few percent by handing the card very long batches, and on a desktop where the same GPU drives your display that backfires: the screen stutters, input lags, and on some drivers the watchdog kills the kernel for not yielding. Headless box, -w 4. Daily driver, cap it at 3 and do not be surprised when 4 makes the machine unusable for a single-digit speed gain.

Optimized kernels and the length trap

-O enables optimized kernels. They are faster, sometimes by a wide margin. The catch is a hard cap on candidate length, often 31 characters or fewer depending on the mode. hashcat does not stop the run when a candidate exceeds the cap; it drops the candidate quietly, and the only trace is the Rejected counter in the status output.

hashcat -m 0 -O hashes.txt rockyou.txt

This is the gotcha covered in finding the right hashcat mode: a clean run with zero hits does not prove the password is uncrackable, it might prove -O never tried the long ones. Use -O for fast hashes and short masks where you know the length ceiling is irrelevant. Drop it the moment you suspect long passphrases, and rerun without it before you call a hash uncracked.
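Before trusting a zero-hit run with -O, it helps to know how much of the wordlist sits above the cap. The sketch below counts candidates longer than a given byte length. The function name is hypothetical, and the default of 31 is only the "often" figure from above; the real cap depends on the mode.

```python
def count_over_cap(candidates, cap=31):
    """Return (total, too_long) for an iterable of byte lines.

    too_long counts candidates whose length in bytes exceeds cap.
    """
    total = 0
    too_long = 0
    for line in candidates:
        word = line.rstrip(b"\r\n")  # drop the line ending only
        total += 1
        if len(word) > cap:
            too_long += 1
    return total, too_long


# Small in-memory sample; the second entry is 34 bytes long.
sample = [
    b"password\n",
    b"correct horse battery staple again\n",
    b"a" * 31 + b"\n",
]
print(count_over_cap(sample))
```

For a real list, pass a file opened in binary mode, for example open("rockyou.txt", "rb"), since wordlists often contain bytes that are not valid UTF-8.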

Thermal throttling is why your numbers fall

Benchmarks run cold. Ten minutes into a real job the die is hot, the card hits its temperature limit, and the firmware lowers clocks to protect itself. Sustained hashrate settles below the burst figure, sometimes far below in a cramped case with poor airflow.

hashcat -m 0 -w 3 --hwmon-temp-abort=90 hashes.txt rockyou.txt

Watch temperature and fan with hardware monitoring (it is on by default; do not pass --hwmon-disable). Judge a card by its steady-state number after it has warmed up. If clocks are dropping under load, the fix is airflow and a fan curve, not a hashcat flag.
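A steady-state number is just the average after the warm-up window. Here is a minimal sketch, assuming you have logged (seconds since start, hashes per second) pairs yourself; the readings below are invented for illustration and the ten-minute window is an assumption taken from the paragraph above.

```python
from statistics import mean


def steady_state(samples, warmup_s=600):
    """Average hashrate over samples taken at or after warmup_s seconds."""
    warm = [rate for t, rate in samples if t >= warmup_s]
    if not warm:
        raise ValueError("no samples after the warm-up window")
    return mean(warm)


def throttle_loss(samples, warmup_s=600):
    """Fraction of the peak rate lost once the card has warmed up."""
    peak = max(rate for _, rate in samples)
    return 1 - steady_state(samples, warmup_s) / peak


# Hypothetical readings: (seconds since start, H/s).
readings = [
    (0, 21.0e9),
    (300, 20.0e9),
    (600, 18.0e9),
    (900, 18.0e9),
    (1200, 18.0e9),
]
print(steady_state(readings))
print(f"{throttle_loss(readings):.1%} below the cold peak")
```

Plan jobs with the steady-state value; the cold peak is only useful for spotting how much the cooling is costing you.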

Multi-GPU and slow candidates

-d selects devices by index, so you can pin a job to specific cards or split work across them.

hashcat -I                                    # list device indices
hashcat -m 3200 -d 1,2 hashes.txt rockyou.txt # run on cards 1 and 2

-S switches to the slow-candidate path. Counterintuitively, for slow hashes like bcrypt this can be faster, because the bottleneck is the hash itself rather than candidate generation, and the slow path keeps the pipeline fed more efficiently. Benchmark both ways on your hash; do not assume.

Segmenting big attacks

A keyspace that takes a week on one box should be split. -s (skip) and -l (limit), also spelled --skip and --limit, carve a contiguous slice of the keyspace so you can hand each segment to a different machine.

hashcat -m 0 -a 3 --keyspace ?a?a?a?a?a?a?a?a                # total size; no hash file here
hashcat -m 0 -a 3 hashes.txt ?a?a?a?a?a?a?a?a -s 0           -l 5000000000
hashcat -m 0 -a 3 hashes.txt ?a?a?a?a?a?a?a?a -s 5000000000  -l 5000000000

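Query --keyspace, divide by the number of workers, and dispatch the slices. The number --keyspace prints is in hashcat's own units, usually far smaller than the raw candidate count, and -s and -l take values in those same units, so always slice the reported figure rather than one you computed by hand. This is hand-rolled distribution and it works for a handful of nodes. Past that, the overhead of coordinating by hand outweighs the cost of setting up a real distributed system.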
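The division is simple, but off-by-one gaps are easy to introduce when the keyspace does not divide evenly. Here is a minimal sketch that yields contiguous skip and limit pairs with no gaps or overlap. The function name and the keyspace value are hypothetical; substitute whatever --keyspace printed for your mask and mode.

```python
def slices(keyspace: int, workers: int):
    """Yield (skip, limit) pairs that cover [0, keyspace) exactly once."""
    if workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(keyspace, workers)
    skip = 0
    for i in range(workers):
        # The first `extra` workers take one additional unit each.
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            continue  # more workers than keyspace units
        yield skip, limit
        skip += limit


keyspace = 1_000_000  # hypothetical; use the value hashcat --keyspace reports
for skip, limit in slices(keyspace, 3):
    print(f"hashcat -m 0 -a 3 hashes.txt ?a?a?a?a?a?a?a?a -s {skip} -l {limit}")
```

Giving the remainder to the first few workers keeps slice sizes within one unit of each other, so no machine finishes long before the rest.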

The cloud rental math, honestly

Renting GPUs by the hour is genuinely worth it for a short burst on a fast hash. A few hours on rented hardware can replace days on your own card, and you pay only for the run. Where it stops making sense is sustained cracking of slow hashes. bcrypt does not care that you rented eight high-end cards; the cost factor caps your guesses per second per device, so you pay premium hourly rates to grind through a keyspace that the algorithm was specifically designed to make grinding through expensive. Rent for fast-hash bursts and short mask exhaustion. Do not rent a fleet to brute a well-configured KDF, because the math that protects the defender protects them against your rented fleet too.
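The rental decision comes down to one line of arithmetic. The sketch below assumes throughput scales linearly with the number of cards and that you are billed per card-hour. The function name, the hourly price, and the bcrypt rate are all placeholders, not quotes or measurements.

```python
def rental_cost(candidates, rate_per_card, cards, price_per_card_hour):
    """Return (wall_clock_hours, total_cost) to exhaust `candidates` guesses.

    Assumes linear scaling across cards and per-card-hour billing.
    """
    hours = candidates / (rate_per_card * cards) / 3600
    return hours, hours * cards * price_per_card_hour


full_mask = 95 ** 8  # every 8-character ?a candidate
price = 1.00         # hypothetical price per card-hour

# Fast hash: the 21 billion H/s MD5 figure quoted earlier, on 8 cards.
hours, cost = rental_cost(full_mask, 21e9, 8, price)
print(f"MD5:    {hours:,.1f} h, {cost:,.2f} total")

# Slow hash: an assumed 20,000 H/s per card for bcrypt, on 8 cards.
hours, cost = rental_cost(full_mask, 20e3, 8, price)
print(f"bcrypt: {hours:,.0f} h, {cost:,.2f} total")
```

Under these assumptions the total bill does not depend on the card count: more cards shorten the wall clock, but the card-hours stay the same. That is why a fleet helps a fast-hash burst finish tonight and does nothing for the economics of a slow hash.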
