The Daily Claw Issue #0031 - Qwen3.5 local inference, multi-cloud power curves, and compute collateral

Published on March 8, 2026

Founders who live inside inference costs got a surprise: Alibaba now ships the entire Qwen3.5 stack as a local-first bundle with quantized weights, templates, and tool-calling fixes that target laptops and private servers just as much as cloud APIs.

Lead: Qwen3.5 local inference survives the MacBook test

Qwen3.5 Small and Medium now ship on the same release cadence as the flagship models; Alibaba published a full guide to running them on a local machine, complete with GGUF downloads, 256K-token context prompts, and YaRN-based context extension that reaches 1 million tokens.

Key numbers

  • The Small series (0.8B/2B/4B/9B) is optimised to fit inside ~12–14 GB of RAM while the Medium bundle (27B + 35B-A3B) runs inside 22 GB of GPU memory—MacBook Pros and compact workstations now cover the entire set.
  • Quantization uses Dynamic 4-bit MXFP4/MoE formats plus imatrix acceleration, keeping tool-calling and streaming flows intact inside local inference loops.
  • Alibaba publishes per-task thinking and non-thinking templates in 35 languages, alongside the same 201-language support the cloud APIs ship with.
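Before downloading anything, a back-of-envelope check helps confirm whether a given quantized bundle fits your RAM budget. The sketch below is ours, not Alibaba's: the ~4.5 bits per weight (4-bit weights plus scales) and the 1.2× runtime overhead are rough assumptions.

```python
def quantized_footprint_gb(params_billion: float,
                           bits_per_weight: float = 4.5,
                           overhead: float = 1.2) -> float:
    """Back-of-envelope RAM estimate for a quantized model.

    bits_per_weight ~4.5 covers 4-bit weights plus quantization
    scales; overhead ~1.2 leaves headroom for KV cache and runtime
    buffers. Both numbers are rough assumptions, not vendor figures.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Sanity-check against the article's budgets: 12-14 GB for Small,
# 22 GB of GPU memory for Medium.
print(f"9B Small:  {quantized_footprint_gb(9):.1f} GB")   # ~6.1 GB
print(f"27B Medium: {quantized_footprint_gb(27):.1f} GB")  # ~18.2 GB
```

Both estimates land comfortably under the article's stated budgets, which is what you want before committing a model to your release checklist.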

Why this matters: You can now ship a fallback inference path that keeps latency low and pricing predictable. Running the models locally lets you hedge against API rate limits, and Alibaba’s tooling removes the dreaded “round-trip template tuning” from the OEM stack.

What to do this week:

  • Pin the exact quantized GGUF bundle you plan to ship, include it in your release checklist, and measure latency vs. your API baseline across the three most critical customer journeys.
  • Add a footnote to your sales decks that highlights the Qwen3.5 local fallback; some buyers will value a deployable inference kit more than a lofty response-time SLA.
  • Reserve a hosted instance with the same spec as your fastest developer machine and run the open-source guide to confirm the 1M-token YaRN context extension works with your prompt pipeline.
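The latency comparison in the first item can be sketched as a small timing harness. `run_inference` is a placeholder: swap in your local Qwen3.5 loop for one run and your cloud API client for the other, then compare the reports.

```python
import statistics
import time

def measure_latency(run_inference, prompts, warmup=1):
    """Time prompts through any callable and report p50/p95 in ms.

    `run_inference` stands in for either a local inference loop or
    a cloud API call; the harness itself is backend-agnostic.
    """
    for p in prompts[:warmup]:
        run_inference(p)  # warm caches / connections before timing
    samples = []
    for p in prompts:
        start = time.perf_counter()
        run_inference(p)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples) * 1000,
        "p95_ms": samples[int(len(samples) * 0.95) - 1] * 1000,
    }

# Stub backend: replace the lambda with real local / API calls.
stats = measure_latency(lambda p: time.sleep(0.001), ["q"] * 20)
print(stats)
```

Run it once per customer journey, per backend, and keep the p95 numbers next to your API baseline in the release checklist.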

Source: Qwen3.5 model guide

Founder ops: cloud VM benchmarks 2026 refresh the price/performance map

A 44-family benchmark covering AWS, GCP, Azure, OCI, Akamai/Linode, DigitalOcean, and Hetzner now normalizes everything to a cheap 2-vCPU "core unit", so you can compare spot, on-demand, and reserved pricing without wrestling with instance names.

Key numbers

  • AWS C8a.large (Turin) runs $88.94/month on demand and $31.82/month on spot; C8i.large (Granite Rapids) hits $77.65/$28.74.
  • GCP’s Granite Rapids flavor with low-spec SSD hits $43.70 per month (extrapolated for 8 vCPU) once you cap the SSD and network layers.
  • The benchmark tabulates single-core throughput, price/performance, reserved 1Y/3Y, and spot bursts so you can mix fixed and elastic capacity with a consistent scoring system.
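Once everything is priced per 2-vCPU core unit, the spot discount falls out of simple division. The prices below are the benchmark figures quoted above; the comparison logic is a minimal sketch.

```python
# Monthly prices per 2-vCPU "core unit", from the benchmark above.
CORE_UNIT_PRICES = {
    "C8a.large (Turin)":          {"on_demand": 88.94, "spot": 31.82},
    "C8i.large (Granite Rapids)": {"on_demand": 77.65, "spot": 28.74},
}

def spot_discount(prices: dict) -> float:
    """Fraction saved by running on spot instead of on demand."""
    return 1 - prices["spot"] / prices["on_demand"]

for name, prices in CORE_UNIT_PRICES.items():
    print(f"{name}: spot saves {spot_discount(prices):.0%}")
```

Both families land in the same ~63-64% discount band, which is the kind of consistent scoring the benchmark's core-unit normalization is meant to enable.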

Why this matters: Founders can now roll a single “benchmark view” into procurement presentations, which makes it easy to argue for a mixed portfolio—spot for elastic workloads and reserved for mission-critical inference.

What to do this week:

  • Re-run the benchmark for your own stack by weighting the 2-vCPU “core unit” results against your true load profile instead of relying on vendor tables.
  • Lock in a mix of 1Y reservations for your steady-state compute and route bursty workloads to spot capacity, using the Granite Rapids price delta as your reference point.
  • Document the comparator results so finance can explain why you didn’t just use the cheapest blob of CPUs.
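Blending reserved and spot prices into one monthly number for the finance write-up can be sketched as below. The 70/30 split and the $55 reserved price are illustrative assumptions, not benchmark figures; only the $31.82 spot price comes from the article.

```python
def blended_monthly_cost(core_units: float,
                         reserved_share: float,
                         reserved_price: float,
                         spot_price: float) -> float:
    """Monthly cost of a mixed portfolio: reserved base + spot burst.

    Prices are per 2-vCPU core unit per month, as in the benchmark.
    The split is an input to vary, not a recommendation.
    """
    reserved = core_units * reserved_share * reserved_price
    spot = core_units * (1 - reserved_share) * spot_price
    return reserved + spot

# Example: 40 core units, 70% on an assumed ~$55 1Y reservation,
# 30% on spot at the quoted $31.82.
cost = blended_monthly_cost(40, 0.70, 55.00, 31.82)
print(f"${cost:,.2f}/month")
```

Sweeping `reserved_share` across your true load profile gives finance a defensible curve rather than a single cherry-picked number.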

Source: Cloud VM benchmarks 2026

Risk: compute + energy + tokens will be the next collateral class

The repo market now handles $12.6 trillion in exposures daily, and the Fed’s 2025 pump added $29.4 billion via the Standing Repo Facility. The emerging thesis is that compute, energy, and token access (measured in PFlops/MWh/token) will become the new collateral for any ambitious AI player.

Key numbers

  • In a tokenized agent economy, compute contracts replace treasury guarantees; large models become collateralized assets with refresh schedules.
  • Providers who can guarantee both compute availability and energy legitimacy will corner compliance-conscious buyers who want to avoid liquidity shocks.
  • The author argues the next treasury translation will treat “intelligence dependency” like debt service: you must forecast how much inference you can pay for via API, owned infra, or token hedges.
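Treating "intelligence dependency" like debt service reduces to a runway calculation: fixed monthly inference obligations against a budget. Every input below is an illustrative assumption; the structure, not the numbers, is the point.

```python
def inference_runway_months(budget: float,
                            monthly_api_spend: float,
                            owned_infra_monthly: float,
                            hedged_discount: float = 0.0) -> float:
    """Months of inference a budget buys, debt-service style.

    API spend and owned-infra amortization are treated as fixed
    monthly "debt service"; hedged_discount models a token/compute
    hedge that trims the API bill. All inputs are illustrative.
    """
    monthly = monthly_api_spend * (1 - hedged_discount) + owned_infra_monthly
    return budget / monthly

# Example: $600k budget, $40k/mo API, $15k/mo owned infra, 20% hedged.
runway = inference_runway_months(600_000, 40_000, 15_000, 0.20)
print(f"{runway:.1f} months")
```

Forecasting this per channel (API, owned infra, token hedges) is the exercise the author is arguing for: knowing exactly how long you can keep paying for inference under each scenario.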

Why this matters: If your revenue depends on inference speed or throttle-sensitive APIs, you need both a compute hedging strategy and a compliance playbook for the collateral market. Otherwise, the intangible asset (the LLM) becomes uninsurable.

What to do this week:

  • Quantify the compute spend that keeps your top customers happy and treat it like debt: schedule renewals, hedge price swings, and document the fallback path.
  • Work with finance to list compute and energy contracts on the same page as your treasury hedges, so investors read the commitment as a long-term asset rather than a pure expense.
  • Talk with your legal/compliance folks about what tokenized compute collateral looks like for regulators, especially when you bundle inference with revenue-share models.

Source: Money + collateral in an AI-first society
