Tuning hashcat for real GPU throughput
Benchmarks lie if you read them wrong. Workload profiles, optimized kernels, thermal throttling, multi-GPU and segmenting big attacks, with honest cloud rental math.
The first thing people do with a new GPU is run a benchmark and quote the headline number. That number is real for about thirty seconds. Sustained throughput on a real attack, on a warm card, with the kernel your hash actually uses, is the figure that decides whether a job finishes tonight or next week. Tuning is the gap between those two.
Benchmark the mode you are running
hashcat -b runs the default benchmark set, a selection of common modes, and prints a hashrate for each; add --benchmark-all to walk every mode. Useful for comparing cards, useless for planning a specific job, because the algorithm dominates everything.
hashcat -b # default benchmark, common modes
hashcat -b -m 0 # MD5 only
hashcat -b -m 3200 # bcrypt only
Run the MD5 and bcrypt lines and the gap is the whole story. MD5 on a current GPU benchmarks in the tens of billions of hashes per second (read 21000.0 MH/s as 21 billion H/s). bcrypt at cost 5 on the same card lands in the low tens of thousands. That is not a typo, it is about six orders of magnitude, and every extra point of bcrypt cost halves the rate again. On fast hashes the GPU is the whole game and tuning matters enormously. On slow hashes the algorithm has already won; a faster card buys you a constant factor against a search that is out of reach either way, and your effort belongs in the wordlist instead.
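To see what that gap does to a schedule, here is a rough sketch that divides a candidate count by a sustained rate. The 21 GH/s figure is the example reading above; the 20 kH/s bcrypt rate, the 8-character printable-ASCII keyspace, and the helper name eta_seconds are assumptions for illustration, not measurements.

```sh
# Estimate whole seconds to exhaust a keyspace at a sustained hashrate.
# $1 = number of candidates, $2 = hashes per second (both plain integers).
eta_seconds() {
    echo $(( $1 / $2 ))
}

# Assumed keyspace: 8 printable-ASCII characters, 95^8 candidates.
KEYSPACE=6634204312890625

eta_seconds "$KEYSPACE" 21000000000   # fast hash at an illustrative 21 GH/s
eta_seconds "$KEYSPACE" 20000         # slow hash at an illustrative 20 kH/s
```

The first call prints 315914 seconds, about 3.7 days. The second prints 331710215644 seconds, on the order of ten thousand years. Same mask, same card.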
Workload profiles
-w sets how aggressively hashcat feeds the GPU, from 1 to 4.
hashcat -m 0 -w 3 hashes.txt rockyou.txt
Profile 1 is for a machine you are actively using. Profile 2 is the default. Profile 3 is right for most dedicated runs, at the cost of a sluggish desktop. Profile 4 squeezes out the last few percent by handing the card very long batches, and on a desktop where the same GPU drives your display that backfires: the screen stutters, input lags, and on some drivers the watchdog kills the kernel for not yielding. Headless box, -w 4. Daily driver, cap it at 3 and do not be surprised when 4 makes the machine unusable for a single-digit speed gain.
Optimized kernels and the length trap
-O enables optimized kernels. They are faster, sometimes by a wide margin. The catch is a hard cap on candidate length, often 31 characters or less depending on the mode. hashcat does not stop to tell you it skipped everything longer: the only signs are the supported length range it prints at startup and the Rejected counter in the status display.
hashcat -m 0 -O hashes.txt rockyou.txt
This is the gotcha covered in finding the right hashcat mode: a clean run with zero hits does not prove the password is uncrackable, it might prove -O never tried the long ones. Use -O for fast hashes and short masks where you know the length ceiling is irrelevant. Drop it the moment you suspect long passphrases, and rerun without it before you call a hash uncracked.
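Before trusting a zero-hit -O run, it helps to know how much of the wordlist was over the cap. A minimal sketch, assuming the 31-byte ceiling mentioned above (the real limit depends on the mode, so check what hashcat prints at startup); count_over_length is a hypothetical helper name.

```sh
# Count wordlist lines longer than MAX bytes, i.e. candidates a
# length-capped kernel would reject.
# $1 = wordlist path, $2 = MAX length in bytes.
# LC_ALL=C makes awk's length() count bytes rather than multibyte characters.
count_over_length() {
    LC_ALL=C awk -v max="$2" 'length($0) > max { n++ } END { print n + 0 }' "$1"
}
```

Run it as count_over_length rockyou.txt 31. A non-zero result is the number of candidates the optimized run never tried.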
Thermal throttling is why your numbers fall
Benchmarks run cold. Ten minutes into a real job the die is hot, the card hits its temperature limit, and the firmware lowers clocks to protect itself. Sustained hashrate settles below the burst figure, sometimes far below in a cramped case with poor airflow.
hashcat -m 0 -w 3 --hwmon-temp-abort=90 hashes.txt rockyou.txt
Watch temperature and fan speed with hardware monitoring (it is on by default; do not pass --hwmon-disable). --hwmon-temp-abort is a safety net, not a tuning knob: it stops the run when a device reaches the given temperature. Judge a card by its steady-state number after it has warmed up. If clocks are dropping under load, the fix is airflow and a fan curve, not a hashcat flag.
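One way to put a number on "steady state" is to log the hashrate at intervals and average only the samples taken after warm-up. The sketch below assumes a plain text file with one integer H/s sample per line, which is a format you would produce yourself, not something hashcat writes; steady_state_avg is a hypothetical helper.

```sh
# Average hashrate samples after discarding the first WARMUP lines.
# $1 = file with one integer H/s sample per line (assumed format),
# $2 = number of leading warm-up samples to skip.
steady_state_avg() {
    awk -v skip="$2" '
        NR > skip { sum += $1; n++ }
        END { if (n) printf "%.0f\n", sum / n; else print 0 }
    ' "$1"
}
```

Compare that average with the benchmark figure for the same mode. The difference is what heat is costing you.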
Multi-GPU and slow candidates
-d selects devices by index. hashcat already splits work across every GPU it finds, so -d is for pinning a job to specific cards and leaving the rest free.
hashcat -I # list device indices
hashcat -m 3200 -d 1,2 hashes.txt rockyou.txt # run on cards 1 and 2
-S (--slow-candidates) switches to the slow-candidate path, which generates candidates on the host instead of on the device. Counterintuitively, for slow hashes like bcrypt this can be faster, because the bottleneck is the hash itself rather than candidate generation, and host-side generation can keep the GPU fully loaded where a small wordlist would otherwise leave it underfed. Benchmark both ways on your hash; do not assume.
Segmenting big attacks
A keyspace that takes a week on one box should be split. -s (skip) and -l (limit), also spelled --skip and --limit, carve a contiguous slice of the keyspace so you can hand each segment to a different machine.
hashcat -m 0 -a 3 --keyspace ?a?a?a?a?a?a?a?a # keyspace in -s/-l units; no hash file
hashcat -m 0 -a 3 hashes.txt ?a?a?a?a?a?a?a?a -s 0 -l 5000000000
hashcat -m 0 -a 3 hashes.txt ?a?a?a?a?a?a?a?a -s 5000000000 -l 5000000000
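Query --keyspace, divide by the number of workers, and dispatch the slices. The value --keyspace reports is in the units that -s and -l consume, which is usually smaller than the raw candidate count, so size the slices from that number and not from the mask arithmetic. This is hand-rolled distribution and it works for a handful of nodes. Past that, coordinating by hand costs more than setting up a real distributed system.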
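The division is simple enough to script. A minimal sketch, assuming you have already captured the --keyspace value; make_slices is a hypothetical helper, and the last worker absorbs whatever the integer division leaves over.

```sh
# Print one "-s SKIP -l LIMIT" pair per worker.
# $1 = total keyspace as reported by --keyspace, $2 = number of workers.
make_slices() {
    total=$1
    workers=$2
    size=$(( total / workers ))
    i=0
    while [ "$i" -lt "$workers" ]; do
        skip=$(( i * size ))
        if [ "$i" -eq $(( workers - 1 )) ]; then
            limit=$(( total - skip ))   # last slice takes the remainder
        else
            limit=$size
        fi
        echo "-s $skip -l $limit"
        i=$(( i + 1 ))
    done
}

make_slices 10 3   # toy numbers to show the remainder handling
```

With the toy numbers this prints -s 0 -l 3, -s 3 -l 3 and -s 6 -l 4. Append each pair to the same hashcat command line on a different machine.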
The cloud rental math, honestly
Renting GPUs by the hour is genuinely worth it for a short burst on a fast hash. A few hours on rented hardware can replace days on your own card, and you pay only for the run. Where it stops making sense is sustained cracking of slow hashes. bcrypt does not care that you rented eight high-end cards; the cost factor caps your guesses per second per device, so you pay premium hourly rates to grind through a keyspace the algorithm was designed to make expensive to grind through. Rent for fast-hash bursts and short mask exhaustion. Do not rent a fleet to brute-force a well-configured KDF, because the math that protects the defender protects them against your rented fleet too.
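The rental math fits in a few lines of integer arithmetic. This is a sketch under loud assumptions: the per-GPU rates, the eight-card fleet, and the 50 cents per GPU-hour price are made-up inputs, not quotes from any provider, and rental_cost_cents is a hypothetical helper. It also assumes perfect scaling across cards, which flatters the rental.

```sh
# Estimate the cost in cents of exhausting a keyspace on rented GPUs.
# $1 = candidates, $2 = sustained H/s per GPU, $3 = GPU count,
# $4 = price in cents per GPU-hour (assumed, not a real quote).
rental_cost_cents() {
    seconds=$(( $1 / ($2 * $3) ))          # wall-clock seconds for the fleet
    echo $(( seconds * $3 * $4 / 3600 ))   # GPU-seconds converted to billed cents
}

# Assumed keyspace: 8 printable-ASCII characters, 95^8 candidates.
KEYSPACE=6634204312890625

rental_cost_cents "$KEYSPACE" 21000000000 8 50   # fast hash, illustrative rate
rental_cost_cents "$KEYSPACE" 20000 8 50         # slow hash, illustrative rate
```

The fast-hash line prints 4387, under 44 dollars for roughly eleven hours. The slow-hash line prints 4607086328, about 46 million dollars for the same mask.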