Category guide

Datasets & LoRA packs — training data and fine-tuned weights

This category covers curated image/text datasets for training or fine-tuning, plus packaged LoRA / adapter weights (e.g. style, character, product line). Buyers search for "buy LoRA pack SDXL," "product photography LoRA download," or "sell AI training dataset"; Truve emphasizes license clarity because rights are the main risk in this niche. The long-form sections below explain how to read caption files, how LoRA rank and training steps affect portability, and how USDC-on-Base checkout on Truve coexists with whatever license you attach — written so Google can index one authoritative page instead of ten conflicting forum posts.

Datasets vs LoRA packs: what you are actually buying

A dataset listing usually ships captioned images, JSONL metadata, or Parquet shards meant for further training or distillation. A LoRA pack listing ships small adapter tensors plus instructions: base model, training resolution, rank, alpha, and recommended inference settings. Buyers who type "buy LoRA pack Stable Diffusion" often want plug-and-play inference; dataset buyers may be researchers assembling multimodal corpora. Never blur the two without labeling — refund disputes come from mismatched expectations, not from model quality alone.

Hybrid bundles (dataset + starter LoRA) are welcome when you clearly separate directories and licenses for each half. Cross-link to ComfyUI workflows if you also sell graphs tuned for your adapters.

Buyer due diligence: consent chains, PII, and poisoned training rows

Inspect whether faces are model-released or synthetic-only. For scraped web data, verify robots.txt compliance claims and whether captions were machine-translated without human spot checks. Watch for NSFW bleed in otherwise SFW packs — a few mis-tagged files can poison workplace policies. If hashes are published, spot-verify random shards against the manifest to detect truncated downloads.

Security-minded teams should scan archives for polyglot files or unexpected executables masquerading as images — rare, but cheap insurance before feeding data into internal trainers.
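Spot-verifying random shards against a published manifest needs nothing beyond the standard library. A minimal sketch, assuming a `SHA256SUMS.txt`-style manifest with `<sha256>  <relative path>` lines (that filename and format are assumptions for illustration, not a Truve requirement):

```python
import hashlib
import random
from pathlib import Path

def spot_verify(manifest_path, data_dir, sample_size=5, seed=None):
    """Spot-check random files against a `<sha256>  <relative path>` manifest.

    Returns the relative paths whose on-disk hash does not match the
    manifest; an empty list means the sampled shards are intact.
    """
    lines = Path(manifest_path).read_text().splitlines()
    entries = [line.split(maxsplit=1) for line in lines if line.strip()]
    rng = random.Random(seed)
    sample = rng.sample(entries, min(sample_size, len(entries)))
    failures = []
    for expected, rel_path in sample:
        digest = hashlib.sha256((Path(data_dir) / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(rel_path)
    return failures
```

Sampling rather than hashing every shard keeps the check cheap on multi-gigabyte packs while still catching truncated downloads.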

Seller documentation that survives procurement and legal review

Ship a DATA_CARD.md-style summary: collection period, geographic bias, demographic skew you know about, and labeling methodology. For LoRAs, publish training loss curve screenshots or final metrics buyers can sanity-check. List every third-party asset touching the pipeline — fonts on reference mood boards, stock photos, music stills — even if not redistributed, because diligence questionnaires ask anyway.

When offering commercial vs research-only tiers, use separate SKUs or clearly separated folders so buyers cannot accidentally violate the cheaper license.

Training stacks: SDXL, Flux, captioning languages, and resolution ladders

State whether images were resized with bicubic vs area downscaling and whether aspect buckets were used. Multilingual captions should note tokenizer quirks for non-Latin scripts. If you trained with EMA weights or special schedulers, say so because reproducing your exact aesthetic may depend on it. For Flux-era adapters, disclose whether you used dev or schnell bases and any guidance distillation steps.

Evaluation: trigger words, overfitting tests, and safety filters

Publish a grid of nine prompts spanning low and high CFG, plus a negative prompt block you recommend. Show failure cases: when the LoRA bleeds background color into skin, tell buyers which slider usually mitigates it. If you ship a bundled safety concept list for internal studio use, mark it experimental and jurisdiction-dependent.
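One way to assemble that nine-prompt grid is a 3×3 sweep of subjects against guidance scales. In this sketch the trigger word, subjects, CFG values, and negative prompt are all hypothetical placeholders to be replaced with your pack's actual recommendations:

```python
from itertools import product

# Hypothetical example values — substitute your pack's real trigger word,
# test subjects, and recommended negative prompt.
TRIGGER = "myStyleV1"
SUBJECTS = ["portrait of a woman", "city street at dusk", "ceramic mug on linen"]
CFG_SCALES = [3.0, 7.0, 12.0]  # low / recommended / high guidance
NEGATIVE = "lowres, watermark, extra fingers, blurry"

def build_eval_grid():
    """Return the nine (prompt, cfg, negative) cells of a 3x3 evaluation grid."""
    return [
        {"prompt": f"{TRIGGER}, {subject}", "cfg": cfg, "negative": NEGATIVE}
        for subject, cfg in product(SUBJECTS, CFG_SCALES)
    ]
```

Publishing the grid generator alongside the rendered images lets buyers rerun the exact same cells on their own hardware.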

Payments, fees, and encrypted delivery

Truve charges sellers a 5% platform fee on successful sales; buyers pay in USDC on Base. Files unlock after on-chain settlement using Lit-backed access control described on the homepage. Large archives may face launch size caps near 50 MB until infrastructure scales — split logically if needed.

Finance teams sometimes ask whether crypto purchases trigger 409A or grant reporting questions for AI startups — Truve cannot advise on tax law, but clear receipts and immutable transaction hashes simplify reconciliation compared to informal PayPal memos.

Export formats: WebDataset shards, Parquet, CSV sidecars, and raw folders

WebDataset tar shards stream well into large-scale trainers but confuse beginners; if you ship them, include a five-line Python snippet that materializes a preview folder. Parquet columns should use documented schemas and avoid embedding huge blobs inline without compression notes. Raw folder + CSV sidecars remain the lowest-friction path for boutique studios — just keep UTF-8 normalization consistent across operating systems.
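That preview snippet can be written with only the stdlib `tarfile` module. A minimal sketch (the `shard-00000.tar` naming convention and 16-file limit are assumptions):

```python
import tarfile
from pathlib import Path

def materialize_preview(shard_path, out_dir, limit=16):
    """Extract the first `limit` members of a WebDataset tar shard for eyeballing."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(shard_path) as tar:
        for i, member in enumerate(tar):
            if i >= limit:
                break
            if member.isfile():
                # Preserves WebDataset key.ext naming (e.g. 000123.jpg / 000123.txt)
                tar.extract(member, out)
    return sorted(p.name for p in out.rglob("*") if p.is_file())
```

Because WebDataset pairs samples by shared key prefix, extracting the first few members gives buyers matched image/caption pairs to inspect without a training framework installed.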

When images live on disk and captions live elsewhere, ship a join key that survives renames (content hash) rather than only sequential integers, which break when curators delete rows.
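A content-hash join key takes only a few lines. This sketch keys captions by a SHA-256 prefix of the file bytes, so renames and row deletions do not break the join (the 16-character prefix length is an arbitrary choice):

```python
import hashlib
from pathlib import Path

def content_key(path, prefix_len=16):
    """Derive a rename-proof join key from file bytes, not the filename."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:prefix_len]

def join_captions(image_dir, captions):
    """Match images to captions keyed by content hash.

    `captions` maps content_key -> caption text; the result maps current
    filenames to captions, surviving any renames since keys were computed.
    """
    joined = {}
    for img in sorted(Path(image_dir).glob("*")):
        key = content_key(img)
        if key in captions:
            joined[img.name] = captions[key]
    return joined
```

Sequential integer IDs would silently mis-align after a curator deletes row 500 of 1,000; content hashes cannot.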

Benchmarks buyers can rerun without your internal cluster

Provide a frozen evaluation prompt set (20–50 prompts) and expected qualitative anchors (“should not add extra fingers at CFG 7”). If you publish CLIP or aesthetic scores, disclose evaluator versions because metrics drift across package releases. For LoRAs, include a tiny Comfy graph or diffusers snippet that loads nothing but public base weights plus your adapter so third parties can replicate claims on consumer GPUs.

Inventory hygiene: deduplication, near-duplicates, and broken symlinks

Run perceptual hash clustering before you ship; near-duplicate frames inflate metrics and waste buyer disk. Validate that every path in your manifest resolves — broken symlinks in tarballs are a top support ticket driver. If you exclude outliers (corrupted downloads during crawl), say how many rows were dropped so buyers understand distribution shift.
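Manifest validation is cheap to automate before shipping. This sketch flags missing paths and broken symlinks, assuming a plain newline-delimited manifest of relative paths:

```python
from pathlib import Path

def validate_manifest(manifest_path, root):
    """Return (relative_path, reason) tuples for manifest rows that do not resolve."""
    root = Path(root)
    problems = []
    for line in Path(manifest_path).read_text().splitlines():
        rel = line.strip()
        if not rel:
            continue
        p = root / rel
        if p.is_symlink() and not p.exists():
            # Symlink exists but its target does not — the classic tarball trap.
            problems.append((rel, "broken symlink"))
        elif not p.exists():
            problems.append((rel, "missing"))
    return problems
```

Run it as the last step before archiving; an empty result means every manifest row resolves on disk.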

What must be in every listing

  • License — commercial vs research, redistribution of images, model outputs.
  • Consent / PII — no real faces without release; no scraped private data.
  • Base model — SDXL 1.0, Flux, etc., and training resolution.
  • Trigger words & usage notes — how to invoke the LoRA safely.

Examples

Editorial portrait style LoRA — 30 training steps, 2k curated CC-licensed references.
SKU flat-lay dataset — 500 masked product shots for e-commerce fine-tunes.
Character pack — consistent outfit + pose grid with artist-signed release.

File size

At launch Truve targets bundles under 50 MB per listing; larger archives may follow when infrastructure expands.

If your LoRA weights alone exceed caps, consider shipping fp16 vs bf16 variants separately or documenting an external checksum-verified mirror while keeping the canonical purchase receipt on Truve. Always keep SHA256 manifests inside the listing archive so buyers can detect partial downloads.
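Generating that SHA256 manifest is a one-function job. This sketch writes `<sha256>  <relative path>` lines for every file under the archive root (the `SHA256SUMS.txt` filename is a common convention, not a requirement):

```python
import hashlib
from pathlib import Path

def write_sha256_manifest(archive_root, manifest_name="SHA256SUMS.txt"):
    """Write `<sha256>  <relative path>` lines for every file under the root."""
    root = Path(archive_root)
    lines = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.name != manifest_name:
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            # POSIX-style paths keep the manifest identical across OSes.
            lines.append(f"{digest}  {p.relative_to(root).as_posix()}")
    (root / manifest_name).write_text("\n".join(lines) + "\n")
    return lines
```

The two-space separator matches the `sha256sum` coreutils format, so buyers can verify with `sha256sum -c SHA256SUMS.txt` and no custom tooling.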

Ethical sourcing and synthetic-only datasets

Synthetic pipelines (rendered 3D scenes, GAN-generated diversity fillers) should be labeled as such so buyers do not assume photographic provenance. When mixing synthetic and real imagery, separate folders and licenses. Sensitive domains such as biometrics (e.g. facial recognition vendors) or climate data may impose extra contractual riders — mention if your pack is not suitable for those markets.

Glossary for dataset and LoRA SEO

A caption is paired text supervision. Rank in a LoRA trades parameter count against expressivity. Alpha scales the adapter's contribution at inference. TI (textual inversion) differs from LoRA but sometimes ships in the same ZIP — call it out. Prior-preservation regularization reduces concept bleeding during fine-tunes.
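The rank/alpha relationship can be made concrete: merging a LoRA applies W' = W + (alpha / rank) · (B·A), where B and A are the low-rank factors. A pure-Python sketch on nested lists (real pipelines would use torch and safetensors; the tiny matrices here are illustrative only):

```python
def apply_lora(W, A, B, alpha, rank):
    """Merge a LoRA update into a weight matrix: W' = W + (alpha / rank) * (B @ A).

    W is m x n, B is m x rank, A is rank x n. The alpha/rank scale is why
    the same weights behave differently if alpha is misreported in a listing.
    """
    scale = alpha / rank
    rows, cols = len(W), len(W[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            delta = sum(B[i][r] * A[r][j] for r in range(rank))
            out[i][j] = W[i][j] + scale * delta
    return out
```

This is why listings should always state both rank and alpha: the effective update strength depends on their ratio, not on either number alone.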

People also search "style LoRA pack download," "character LoRA consistent outfit," and "product photography dataset captions" — mirror that language naturally in your Truve listing body without stuffing keywords into unreadable lists.

Handoff to internal MLOps: what enterprise buyers append

Attach optional Slurm job templates or Kubernetes PVC sizing hints if you have tested them. Mention whether your pack benefits from gradient accumulation tricks on 24 GB cards versus needing multi-GPU Deepspeed. Those breadcrumbs help senior ML engineers greenlight purchases faster than glossy hero images alone. Version those ops snippets like code.

List datasets and LoRAs when seller registration opens — join the waitlist.

Join the waitlist →

FAQ — datasets & LoRA

Can I sell a LoRA trained on copyrighted characters?

Only if you have rights. Fan art without a license is a legal risk — Truve expects honest listings.

Do you host Civitai-style model cards?

Truve provides description, images, and download — rich cards are on the roadmap.

What about NSFW content?

Subject to Truve ToS and card-network rules for payments; adult listings may be restricted at launch.

Can buyers use outputs commercially?

Whatever the seller’s license says. State it explicitly in the listing text.

Do you allow medical imaging datasets?

Only with jurisdiction-appropriate approvals documented; restricted categories may be blocked at launch.

Can I fine-tune on top of a buyer’s private data?

That is a service relationship outside a simple download; the base pack should still be self-contained.

What metadata schema do you recommend?

JSONL with stable keys (path, caption, hash) is widely tool-friendly.
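A minimal round-trip for that schema, keeping only the three stable keys (extra keys are deliberately dropped on write so downstream tools always see a fixed shape):

```python
import json

STABLE_KEYS = ("path", "caption", "hash")

def write_jsonl(path, rows):
    """Write rows with only the stable keys, in fixed order, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps({k: row[k] for k in STABLE_KEYS}, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read JSONL back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

`ensure_ascii=False` keeps multilingual captions human-readable in the file, and one object per line means corrupted rows can be skipped without losing the rest of the dataset.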

How do I prove no copyrighted characters were used?

Publish training source manifest and synthetic provenance statements; legal risk remains with misrepresentation.

Are diffusers-format adapters supported as listings?

Yes, if you document conversion steps to A1111/Comfy native formats when needed.

Can datasets include audio or video?

Yes, if within size limits and properly licensed; tag modalities clearly for search filters.

What about watermarks in training images?

Disclose watermark frequency; buyers may reject packs that teach models to reproduce logos.