The Daily Claw Issue #0031 - Qwen3.5 local inference, multi-cloud power curves, and compute collateral
Founders who live inside inference costs got a surprise: Alibaba now ships the entire Qwen3.5 stack as a local-first bundle with quantized weights, templates, and tool-calling fixes that target laptops and private servers just as much as cloud APIs.
Lead: Qwen3.5 local inference survives the MacBook test
Qwen3.5 Small and Medium now share the same release cadence as the flagship models; Alibaba published a full guide to running them on a local machine, complete with GGUF downloads, 256K-context prompts, and YaRN-based context extension to reach 1 million tokens.
Key numbers
- The Small series (0.8B/2B/4B/9B) is optimized to fit inside ~12–14 GB of RAM, while the Medium bundle (27B + 35B-A3B) runs inside 22 GB of GPU memory; MacBook Pros and compact workstations now cover the entire set.
- Quantization uses Dynamic 4-bit MXFP4/MoE formats with imatrix calibration, keeping tool-calling and streaming flows intact inside local inference loops.
- Alibaba publishes thinking and non-thinking templates per task, 35 languages, and the same 201-language support the cloud APIs ship with.
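The RAM and context figures above are easy to sanity-check. A rough sketch, assuming ~0.5 bytes per weight for 4-bit quantization plus ~20% runtime overhead; both figures are my assumptions, not numbers from Alibaba's guide:

```python
# Back-of-envelope check on the quantized-memory and YaRN numbers above.
# Assumptions (mine, not the Qwen guide's): 4-bit weights ~= 0.5 bytes/param,
# plus ~20% overhead for KV cache, activations, and runtime buffers.

def mxfp4_footprint_gb(params_billion: float, overhead: float = 1.2) -> float:
    """Approximate resident memory (GB) for a 4-bit quantized model."""
    return params_billion * 0.5 * overhead  # 1B params * 0.5 B/param = 0.5 GB

def yarn_scale(target_ctx: int, native_ctx: int = 262_144) -> float:
    """RoPE scaling factor YaRN needs to stretch a 256K-native window."""
    return target_ctx / native_ctx

# The 9B Small model should sit well under the ~12-14 GB RAM budget:
print(f"9B @ 4-bit ~= {mxfp4_footprint_gb(9):.1f} GB")
# Stretching the 256K native context to 1 million tokens:
print(f"YaRN factor for 1M ctx: {yarn_scale(1_000_000):.2f}")
```

The same arithmetic puts the 27B Medium model around 16 GB at 4-bit, which is consistent with the quoted 22 GB GPU budget once you add headroom for long contexts.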
Why this matters: You can now ship a fallback inference path that keeps latency low and pricing predictable. Running the models locally lets you hedge against API rate limits, and Alibaba’s tooling removes the dreaded “round-trip template tuning” from the OEM stack.
What to do this week:
- Pin the exact quantized GGUF bundle you plan to ship, include it in your release checklist, and measure latency vs. your API baseline across the three most critical customer journeys.
- Add a footnote to your sales decks that highlights the Qwen3.5 local fallback; some buyers will value a deployable inference kit more than a lofty response-time SLA.
- Reserve a hosted instance with the same spec as your fastest developer machine and run the open-source guide to confirm the YaRN 1M-token context extension works with your prompt pipeline.
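The latency comparison in the first action item is worth scripting rather than eyeballing. A minimal harness sketch; `fake_local` and `fake_api` are hypothetical stand-ins you would swap for your GGUF runner and API client:

```python
import statistics
import time
from typing import Callable

def p50_p95_ms(fn: Callable[[str], str], prompts: list[str],
               warmup: int = 1) -> tuple[float, float]:
    """Time fn over prompts; return (median, p95) latency in milliseconds."""
    for p in prompts[:warmup]:          # warm caches / connections first
        fn(p)
    samples = []
    for p in prompts:
        t0 = time.perf_counter()
        fn(p)
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    p95 = samples[int(len(samples) * 0.95) - 1] if len(samples) >= 20 else samples[-1]
    return statistics.median(samples), p95

# Dummy backends for illustration; replace with your two real inference paths.
def fake_local(prompt: str) -> str:
    return prompt

def fake_api(prompt: str) -> str:
    time.sleep(0.002)                   # simulate network round-trip
    return prompt

prompts = ["checkout flow", "support triage", "search rerank"] * 10
local_med, local_p95 = p50_p95_ms(fake_local, prompts)
api_med, api_p95 = p50_p95_ms(fake_api, prompts)
print(f"local p50={local_med:.2f}ms p95={local_p95:.2f}ms | "
      f"api p50={api_med:.2f}ms p95={api_p95:.2f}ms")
```

Run it once per critical customer journey so the comparison against your API baseline is per-journey, not an aggregate.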
Source: Qwen3.5 model guide
Founder ops: cloud VM benchmarks 2026 refresh the price/performance map
A 44-family benchmark covering AWS, GCP, Azure, OCI, Akamai/Linode, DigitalOcean, and Hetzner now normalizes every price to a 2-vCPU "core unit", so you can compare spot, on-demand, and reserved pricing without wrestling with instance names.
Key numbers
- AWS C8a.large (Turin) runs $88.94 per month on demand and $31.82 on spot; C8i.large (Granite Rapids) hits $77.65/$28.74.
- GCP’s Granite Rapids flavor with low-spec SSD hits $43.70 per month (extrapolated for 8 vCPU) once you cap the SSD and network layers.
- The benchmark tabulates single-core throughput, price/performance, reserved 1Y/3Y, and spot bursts so you can mix fixed and elastic capacity with a consistent scoring system.
Why this matters: Founders can now roll a single “benchmark view” into procurement presentations, which makes it easy to argue for a mixed portfolio—spot for elastic workloads and reserved for mission-critical inference.
What to do this week:
- Re-run the benchmark for your own stack by weighting the 2-vCPU “core unit” results against your true load profile instead of relying on vendor tables.
- Lock in 1Y reservations for your steady-state compute and set up spot capacity for elastic bursts, using the Granite Rapids spot/on-demand delta as your reference spread.
- Document the comparator results so finance can explain why you didn’t just use the cheapest blob of CPUs.
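Reproducing the core-unit normalization for your own shortlist takes a few lines. The prices below are the figures quoted above; the single-core scores are made-up placeholders you would replace with the benchmark's published numbers:

```python
# Normalize monthly instance prices to the benchmark's 2-vCPU "core unit"
# and compute the spot discount. Prices are the on-demand/spot monthly
# figures quoted above; perf scores are hypothetical placeholders.

instances = [
    # (name, vcpus, on_demand_monthly, spot_monthly, single_core_score)
    ("C8a.large (Turin)",          2, 88.94, 31.82, 100),
    ("C8i.large (Granite Rapids)", 2, 77.65, 28.74, 100),
]

def core_unit_price(monthly: float, vcpus: int) -> float:
    """Monthly price normalized to a 2-vCPU core unit."""
    return monthly * 2 / vcpus

for name, vcpus, od, spot, score in instances:
    od_cu = core_unit_price(od, vcpus)
    spot_cu = core_unit_price(spot, vcpus)
    discount = 1 - spot / od
    print(f"{name:30s} core-unit ${od_cu:6.2f} on-demand, "
          f"${spot_cu:6.2f} spot ({discount:.0%} off), "
          f"${od_cu / score * 100:.2f} per 100 score points")
```

Weighting these rows by your actual load profile (the first action item) is then a single dot product over the table.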
Source: Cloud VM benchmarks 2026
Risk: compute + energy + tokens will be the next collateral class
The repo market now handles $12.6 trillion in daily exposures, and the Fed's 2025 pump added $29.4 billion via the Standing Repo Facility. The emerging thesis is that compute, energy, and token access (measured in PFLOPS, MWh, and tokens) will become the new collateral for any ambitious AI player.
Key numbers
- In a tokenized agent economy, compute contracts replace treasury guarantees; large models become collateralized assets with refresh schedules.
- Providers who can guarantee both compute availability and energy legitimacy will corner compliance-conscious buyers who want to avoid liquidity shocks.
- The author argues the next treasury translation will treat “intelligence dependency” like debt service: you must forecast how much inference you can pay for via API, owned infra, or token hedges.
Why this matters: If your revenue depends on inference speed or throttle-sensitive APIs, you need both a compute hedging strategy and a compliance playbook for the collateral market. Otherwise, the intangible asset (the LLM) becomes uninsurable.
What to do this week:
- Quantify the compute spend that keeps your top customers happy and treat it like debt: schedule renewals, hedge price swings, and document the fallback path.
- Work with finance to list compute + energy contracts on the same page as your treasury hedges so investors see the risk as a long-term asset rather than an expense.
- Talk with your legal/compliance folks about what tokenized compute collateral looks like for regulators, especially when you bundle inference with revenue-share models.
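The "intelligence dependency as debt service" framing in the first action item reduces to simple arithmetic. A sketch of the monthly forecast; every price and volume below is a hypothetical placeholder, not a figure from the article:

```python
# Sketch: treat inference spend like debt service and forecast the monthly
# "payment" across a metered API, owned infra, and a prepaid token hedge.
# All prices and volumes are hypothetical placeholders.

def monthly_inference_service(
    tokens_millions: float,        # expected monthly token volume, in millions
    api_share: float,              # fraction of volume routed to the metered API
    api_price_per_mtok: float,     # $ per million tokens on the API
    infra_fixed: float,            # owned/reserved infra, fixed $ per month
    hedge_price_per_mtok: float,   # $ per million tokens on the prepaid contract
) -> float:
    hedged_share = 1 - api_share
    return (tokens_millions * api_share * api_price_per_mtok
            + tokens_millions * hedged_share * hedge_price_per_mtok
            + infra_fixed)

# 500M tokens/month, 40% on API at $2/Mtok, the rest hedged at $1.20/Mtok,
# plus $3,000/month of reserved infra:
payment = monthly_inference_service(500, 0.4, 2.0, 3_000, 1.2)
print(f"forecast monthly inference service: ${payment:,.0f}")
```

Once this number sits next to your treasury hedges, "schedule renewals and hedge price swings" becomes a sensitivity analysis over `api_share` and the two per-token prices.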
Source: Money + collateral in an AI-first society
Quick hits
- Sold my SaaS for $6M after talking to 30 buyers — concentration + founder dependency killed almost every deal; the final bidder bought customer relationships, not IP.
- Are chargebacks basically becoming a free refund button now? — dispute ratios now trigger Visa/processor programs; pre-dispute proof is the only high-leverage counter.
- dlgo – Go-native LLM inference with Vulkan acceleration — pure-Go inference with Vulkan that claims benchmark wins over Ollama and keeps Whisper/Silero stacks bundled in the repo.
- Termix – one screen for all your AI coding agents — CLI dashboard for Claude Code, Gemini, and Codex plus session badges, search, notifications, and a plugin API.
- Usage Specification for CLIs — think of every CLI as a spec-first project to auto-generate completions, man pages, and nested command docs.
- Codebrief – intent-aware review for AI code diffs — diff review with intent grouping, commit message suggestions, and private mode for Claude/OpenCode.