<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[abhishek.t’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://dcrey7.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!W8Sz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Feb54c6a3-861d-4cc7-b692-c856d60c9882_723x888.png</url><title>abhishek.t’s Substack</title><link>https://dcrey7.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 16 Jun 2026 20:58:26 GMT</lastBuildDate><atom:link href="https://dcrey7.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[abhishek.t]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dcrey7@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dcrey7@substack.com]]></itunes:email><itunes:name><![CDATA[abhishek.t]]></itunes:name></itunes:owner><itunes:author><![CDATA[abhishek.t]]></itunes:author><googleplay:owner><![CDATA[dcrey7@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dcrey7@substack.com]]></googleplay:email><googleplay:author><![CDATA[abhishek.t]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[KICKY AI]]></title><description><![CDATA[Local AI powered shot analyzer]]></description><link>https://dcrey7.substack.com/p/world-fut-coach</link><guid isPermaLink="false">https://dcrey7.substack.com/p/world-fut-coach</guid><dc:creator><![CDATA[abhishek.t]]></dc:creator><pubDate>Mon, 15 Jun 2026 00:45:32 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/e6ed6c76-e9fb-4c8b-b14d-9ce8d30e0f52_1011x372.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Building KICKY AI: a local, zero-label football shot analyzer</strong></h1><p><em>How I turned my own Saturday-football YouTube footage into a system that tells you: was it a goal, who scored, which foot, how hard, and how to improve, all running on local, open models.</em></p><p><strong>&#9917;&#127758; It&#8217;s World Cup 2026.</strong> The BBC just launched a <strong>3D immersive experience</strong> for it: switch camera angles, follow any player, see bird&#8217;s-eye tactics, all rebuilt live from multi-camera <strong>skeletal tracking</strong> on broadcast matches (<a href="https://www.bbc.co.uk/sport">BBC Sport</a>). It&#8217;s stunning. It&#8217;s also built for the pros, on broadcast-quality feeds, with a stadium full of cameras.</p><p>But most of us play football on a Sunday pitch with <strong>one phone on a tripod</strong>. So I asked the smaller question: <em>what about the rest of us?</em> This is my World Cup build, a personal football analyst that runs on <em>my own</em> footage, fully local.</p><blockquote><p>&#128250; I record my football journey and these builds on YouTube: <strong><a href="https://www.youtube.com/@dcrey7">youtube.com/@dcrey7</a></strong>.</p></blockquote><div><hr></div><h2><strong>A love letter to tracking (and its trade-offs)</strong></h2><p>Object tracking is one of the most beautiful problems in computer vision. Give a model a video and ask it to follow a thing through space and time (players, a ball, a goal), and suddenly you can <em>understand</em> a game instead of just watching it. The last two years have been wild: text-promptable segmentation, grounding VLMs, real-time DETRs. 
The <strong>state of the art is closer than ever to a true real-time segmentation system</strong>.</p><p>But &#8220;closer than ever&#8221; is not &#8220;there yet&#8221;, and the gap is a <strong>trade-off between time and quality</strong>. You can have fast <em>or</em> clean; getting both, on <em>your</em> footage, is still hard. For my use case there is a lot of <strong>cleaning up</strong> to do (smoothing noisy detections, rejecting teleporting balls, recovering frames the segmenter missed), so <strong>World Cup Heros is not real-time; it&#8217;s post-processed</strong>. That&#8217;s an honest design choice, not a bug.</p><p>And sports tracking specifically has some of the nastiest challenges in the field. The three biggest, in my experience:</p><ol><li><p><strong>The use case</strong>: what counts as a &#8220;goal,&#8221; a &#8220;shot,&#8221; a &#8220;possession&#8221; is domain logic no model gives you for free.</p></li><li><p><strong>Perspective</strong>: one fixed, far, monocular camera means no depth and lots of foreshortening.</p></li><li><p><strong>Ball detection</strong>: a small, fast, low-contrast object is the single hardest thing to track.</p></li></ol><p>This whole project is the story of fighting those three.</p><blockquote><p>&#128591; <strong>Huge thanks to <a href="https://roboflow.com/">Roboflow</a> and <a href="https://github.com/SkalskiP">Piotr Skalski (SkalskiP)</a>.</strong> His sports-CV videos and open notebooks (football AI, basketball jump-shot detection, fine-tuning RF-DETR, segmenting video with SAM) gave me the ideas and the scaffolding to build this. Several of those notebooks are literally in <code>notebooks/refernce/</code> of this repo.</p></blockquote><div><hr></div><h2><strong>How this project started</strong></h2><h3><strong>1 &#183; Understanding the football use case</strong></h3><p>I didn&#8217;t start from a model; I started from <strong>my own data</strong>. 
I play football in Paris most Saturdays, and I record my football journey on YouTube. One of those videos was a <strong>shooting practice with my friend Adam Hakeem, a pro player from Singapore &#128016;</strong>: two of us, one goal, one phone on a tripod behind the pitch.</p><p>That defined the use case precisely: a <strong>personal football AI</strong> that, on <em>our</em> amateur footage, can <strong>detect our shots, understand our poses, and help coach us</strong>, and crucially, one that runs <strong>completely locally</strong> on open models I own, not a cloud API. (More on why local matters to me at the end.)</p><h3><strong>2 &#183; The data problem</strong></h3><p>Here&#8217;s the thing about &#8220;football AI&#8221;: almost all of it is built on <strong>broadcast-quality footage</strong> (4K, multi-camera, perfect angles), and even then the public datasets are small. Amateur, single-camera footage is a different, harder world, and there&#8217;s very little of it.</p><p>But I&#8217;ve been recording my games for a while, so I was lucky: <strong>I had the content.</strong> I shot the session on an <strong>iPhone 15 Pro</strong>. The catch (and it matters a lot downstream) is that after uploading and pulling the video back <strong>from YouTube, I only get a re-encoded 1080p / 30 fps version</strong>. So the effective input is far from the original sensor quality: compressed, 30 fps, 1080p. Every limitation below traces back to that.</p><h3><strong>3 &#183; Ball detection: the hardest part</strong></h3><p>If there&#8217;s one villain in this project, it&#8217;s the ball. 
After a lot of trials (multiple <strong>SAM</strong> variants and <strong>LocateAnything</strong> models), the conclusion is blunt: <strong>the ball is brutally hard to capture in every frame.</strong> It&#8217;s:</p><ul><li><p><strong>small</strong> (often 3&#8211;4 pixels at this camera distance),</p></li><li><p><strong>fast</strong> (so at 30 fps it smears across frames),</p></li><li><p><strong>lit inconsistently</strong> (sun, shadow, the players&#8217; own shadows across the pitch),</p></li><li><p>and <strong>prone to blending into the background</strong> (white-ish ball on bright grass/concrete).</p></li></ul><p>And it&#8217;s not just a &#8220;use a better model&#8221; problem. <strong>Even after training on ball datasets it stays hard</strong>, for two reasons rooted in &#167;2:</p><ol><li><p><strong>Resolution isn&#8217;t good enough</strong>: a 3&#8211;4 px ball simply doesn&#8217;t carry enough signal.</p></li><li><p><strong>Low fps gives blurry balls</strong>: at 30 fps a struck ball becomes a motion-blur streak that doesn&#8217;t <em>look</em> like a ball to a detector.</p></li></ol><p>The data backs this up. I logged every clip where <strong>SAM3 returned zero ball detections</strong> (<code>out/noball_clips.json</code>), and it&#8217;s a real chunk of the set, including several test clips. On one test clip, <code>spf_008</code>, <strong>SAM3 found the ball in 0 of 117 frames.</strong> That&#8217;s the problem in one number.</p><div><hr></div><h2><strong>The annotation tools (because someone has to label)</strong></h2><p>To even <em>measure</em> any of this I needed ground truth, and I wasn&#8217;t going to hand-draw thousands of boxes. 
So I built three tiny <strong>browser annotation tools</strong> (in <code>notebooks/</code>), pair-programmed with <strong>OpenAI&#8217;s Codex CLI</strong>, which was great at turning &#8220;spin up a range-streaming video server with a keyboard-driven labelling page&#8221; into working single-cell tools without me yak-shaving the plumbing. Each spins up a local HTTP server (with range-streamed video so you can scrub a big file) and serves a keyboard-driven page. Everything is frame-navigable and persisted to JSON, with no SaaS.</p><p><strong>Video clipper</strong>: scrub the full match, mark <code>IN</code>/<code>OUT</code>, stack cut points, export the 72 shot clips:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g_np!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g_np!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 424w, https://substackcdn.com/image/fetch/$s_!g_np!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!g_np!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 1272w, https://substackcdn.com/image/fetch/$s_!g_np!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g_np!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png" width="1280" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1176498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g_np!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 424w, https://substackcdn.com/image/fetch/$s_!g_np!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!g_np!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g_np!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef6f683-c3bd-4f0d-85de-52c94c1f75d0_1280x820.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Clip rater</strong>: triage each clip 5&#8594;1 stars so I work the good ones first:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!oF2s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oF2s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 424w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 1272w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oF2s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png" width="1280" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1010830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oF2s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 424w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 1272w, https://substackcdn.com/image/fetch/$s_!oF2s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d80bfc-447e-48f5-8e87-ba784ed1b2de_1280x820.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Goal annotator</strong>: the ground-truth <strong>goal / no-goal / unsure</strong> labeller that every result here is scored against:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mvyl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mvyl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 424w, 
https://substackcdn.com/image/fetch/$s_!mvyl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!mvyl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 1272w, https://substackcdn.com/image/fetch/$s_!mvyl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mvyl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png" width="1280" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:868081,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mvyl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 424w, https://substackcdn.com/image/fetch/$s_!mvyl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 848w, https://substackcdn.com/image/fetch/$s_!mvyl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 1272w, https://substackcdn.com/image/fetch/$s_!mvyl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71ee94e3-de3a-4719-a3bf-9d0560a8666e_1280x820.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>Lesson learned: my first idea (single-frame click-to-label) was useless for a fast ball, so I rebuilt everything to be prev/next/jump navigable with persisted state.</p></blockquote><div><hr></div><h2><strong>Auto-labelling with foundation models</strong></h2><p>The core trick for &#8220;zero hand labels&#8221; is to let <strong>foundation models be the labeller</strong>, then distil a fast model from their output.</p><h3><strong>SAM3: the workhorse, and the best single model here</strong></h3><p><strong><a href="https://github.com/facebookresearch/sam3">SAM3</a></strong> is text-promptable video segmentation. I give it <code>"ball"</code>, <code>"person"</code>, <code>"goal post"</code> and it returns tracked <strong>masks</strong>. On clean clips it is genuinely excellent (crisp player and goal masks, the right ball), and honestly it&#8217;s <strong>the best single model in this pipeline</strong>. When SAM3 sees the ball, everything downstream is easy.</p><h3><strong>LocateAnything: the rescue for the tiny / tilted ball</strong></h3><p>When SAM3 comes up empty (the small/blurred/tilted ball from &#167;3), I fall back to <strong>NVIDIA LocateAnything-3B</strong>, a grounding VLM that localizes an object from a text phrase. Because it <em>reasons</em> about the scene, it&#8217;s far more robust to <strong>scale and orientation</strong> than a pure segmenter. 
I run it <strong>only on the frames/clips SAM3 missed</strong>, for ball, goal and person.</p><p><strong>Case A </strong><code>spf_008</code><strong>: SAM3 0 / 117 ball frames &#8594; LocateAnything recovered 42 / 117.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kr_m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kr_m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 424w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 848w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 1272w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kr_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png" width="1280" height="766" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1640759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kr_m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 424w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 848w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 1272w, https://substackcdn.com/image/fetch/$s_!Kr_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d5b699-f1a5-4928-8998-d6ff1e27874d_1280x766.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That circled blob is a ~4-pixel ball in shadow at distance. SAM3 never saw it; the VLM did.</p><p><strong>Case B </strong><code>spf_067</code><strong>, the tilted-camera clip.</strong> SAM3 returned <strong>0 / 108</strong> ball frames. 
LocateAnything recovered the <strong>goal post 108/108</strong> and the <strong>player 108/108</strong>: all the content was there; the camera was just rotated, and the orientation-robust VLM handled it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-eaD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-eaD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 424w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 848w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 1272w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-eaD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png" width="1280" height="766" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1765907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-eaD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 424w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 848w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 1272w, https://substackcdn.com/image/fetch/$s_!-eaD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840d933a-1951-42c0-9c21-7031e43b7850_1280x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>SAM3 does the bulk; LA patches the holes. Together they auto-label the <strong>entire</strong> dataset with <strong>no human boxes</strong>.</p><div><hr></div><h2><strong>Distilling a real-time student RF-DETR-Seg</strong></h2><p>SAM3 + LA is accurate but <strong>slow</strong> (a VLM call per missed frame) you can&#8217;t ship it. So I used their combined output as a <strong>training set</strong> and distilled one fast student: <strong>RF-DETR-Seg</strong> (Roboflow, 384 px).</p><pre><code><code>+--------+--------+-------------+
| split  | images | annotations |
+--------+--------+-------------+
| train  |  1,767 |       6,489 |
| valid  |    338 |       1,233 |
| test   |    636 |       2,401 |
+--------+--------+-------------+
</code></code></pre><h3><strong>Class design: 3 from the teacher &#8594; 5 for the student</strong></h3><p>The teacher (SAM3 + LA) only ever labels <strong>3 classes</strong>: <strong>ball, player, goal</strong>. That&#8217;s all the foundation models know.</p><p>But possession and goal events are exactly the <em>interactions</em> between those things, so before training the student I <strong>derive two extra classes purely from geometry</strong>:</p><ul><li><p><code>player_with_ball</code> = a player whose mask intersects the ball (possession),</p></li><li><p><code>ball_in_goal</code> = a ball whose mask intersects the goal (a goal-contact candidate).</p></li></ul><p>RF-DETR is then trained on these <strong>5 classes</strong> (ball, player, goal, player_with_ball, ball_in_goal). The point: instead of leaving &#8220;who has the ball&#8221; and &#8220;is the ball in the goal&#8221; <em>entirely</em> to inference-time geometry, the detector itself <strong>learns</strong> the interaction context, which makes the <strong>player and goal classification more robust</strong> (e.g. it can distinguish a player on the ball from one standing away, and a ball in the net from one merely near it). Note that the shipped real-time checkpoint uses only the 3 base classes; the 5-class variant is the one that bakes the interactions into the model directly.</p><p>The student is <strong>~50&#215; faster</strong> than the teacher, runs on ZeroGPU in the demo, and, because its player masks are cleaner and gap-free, MediaPipe Pose resolves on <strong>every</strong> test clip. It even <strong>recovers the ball on clips SAM3 missed</strong>, because it learned from the LA-patched labels.</p><h3><strong>Training curves</strong></h3><p>15 epochs at 384 px. 
Both losses are still falling and box mAP@50:95 peaks at <strong>0.548 (epoch 11)</strong>, so the model is <em>slightly undertrained</em>; more epochs or a higher resolution would still help.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!djp_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!djp_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 424w, https://substackcdn.com/image/fetch/$s_!djp_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 848w, https://substackcdn.com/image/fetch/$s_!djp_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 1272w, https://substackcdn.com/image/fetch/$s_!djp_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!djp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png" width="1456" height="447" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:174938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!djp_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 424w, https://substackcdn.com/image/fetch/$s_!djp_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 848w, https://substackcdn.com/image/fetch/$s_!djp_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 1272w, https://substackcdn.com/image/fetch/$s_!djp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ba0515-1f23-4e00-ad53-2c87237aafc3_1998x614.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>But look at the <strong>per-class panel on the right</strong>: it&#8217;s the whole story of this project in one chart:</p><pre><code><code>+--------+--------------+
| class  | val AP@50:95 |
+--------+--------------+
| goal   |    ~0.89     |
| player |    ~0.66     |
| ball   |    ~0.08     |   &lt;- the whole story
+--------+--------------+
</code></code></pre><p>Goal and player are easy; the <strong>ball sits near the floor</strong>. That single curve is the visual proof of &#167;3: a 3&#8211;4 px, motion-blurred ball at 30 fps is brutal for <em>any</em> detector, even one trained directly on it. This is exactly why the SAM3 + LA teacher (which reasons about the scene) still beats the student on goal/leg, and why <strong>resolution is the #1 lever</strong> for the future.</p><blockquote><p>Honest admission: I was <em>lazy</em>. I distilled from SAM3 + LA instead of labelling by hand. It works, but it&#8217;s <strong>not</strong> the ceiling (see What&#8217;s next).</p></blockquote><div><hr></div><h2><strong>From masks to meaning: geometry &amp; physics</strong></h2><p>The model only outputs masks. Everything a fan cares about is derived with <strong>pure geometry and physics</strong>, no extra training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eh6H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eh6H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!Eh6H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!Eh6H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Eh6H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Eh6H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1509556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Eh6H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!Eh6H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 848w, 
https://substackcdn.com/image/fetch/$s_!Eh6H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!Eh6H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7beed6ab-6d69-4e6a-a0e0-9cf94108a9f4_1280x720.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><ul><li><p><strong>Who has the ball &#8594; who shot it:</strong> <code>ball &#8745; player</code> over time gives possession; the 
<strong>last touch before the ball reaches the goal</strong> is the shooter. Players are drawn as masks labelled <code>P1</code>/<code>P2</code>/<code>P3</code>, brightening to <code>P# BALL</code> on possession.</p></li><li><p><strong>Pose &amp; shooting foot:</strong> on the kick frame I crop generously around the shooter (the kicking leg swings <em>outside</em> the player box) and run <strong>MediaPipe Pose</strong>. The <strong>shooting foot</strong> is whichever ankle is nearer the ball at contact.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCI2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCI2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 424w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 848w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZCI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png" width="299" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:299,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dcrey7.substack.com/i/202056622?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCI2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 424w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 848w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1157dde-5f88-4522-b063-74b7e4f616fd_299x360.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div></li></ul><ul><li><p><strong>Did it go in? (the hard part):</strong> in monocular 2D, a ball <em>in front of</em> the goal and a ball <em>in the net</em> are the <strong>same pixels</strong>. So I don&#8217;t just check &#8220;ball inside goal box&#8221;; I look at the <strong>trajectory over time</strong>: a kick spike, the ball <strong>arriving</strong> at the goal (not starting there, or else a kickoff in front of goal reads as a goal), then <strong>rest-in-net vs rebound vs pass-over</strong>. 
Detection noise made the ball &#8220;teleport,&#8221; so a <strong>Hampel filter</strong> (sliding-window median/MAD) removes outliers first. Goal accuracy climbed across these fixes: <strong>58% &#8594; 75% &#8594; 83% &#8594; 92%</strong> on the test set.</p></li><li><p><strong>Shot speed:</strong> the ball is 22 cm across, so its pixel width gives a px&#8594;metre scale: <code>km/h = peak_px &#215; fps &#215; (0.22 / ball_px_width) &#215; 3.6</code>, where <code>peak_px</code> is the peak per-frame ball displacement in pixels. The peak is a robust one (92nd percentile, capped at 140) so one noisy frame doesn&#8217;t report a 900 km/h rocket.</p></li></ul><div><hr></div><h2><strong>Coaching the shot: a vision-LLM that </strong><em><strong>sees</strong></em><strong> it, two ways</strong></h2><p>Numbers and masks tell you <em>what</em> happened; the last step turns that into <strong>what to fix</strong>. A <strong>vision-language model looks at the strike frame(s)</strong> and grades the shot, grounded in the measured facts (goal/no-goal, foot, speed) and the <strong>biomechanics</strong> I derive from MediaPipe Pose: kicking-knee bend, trunk lean and hip drive at contact, each with a &#8220;good range&#8221; and a flag.</p><p>Two design notes that mattered:</p><ul><li><p><strong>The result is already decided</strong> by the geometry/physics above, so the prompt hands the model that verdict <em>as fact</em> and tells it never to contradict it; otherwise a small VLM will happily read &#8220;no goal&#8221; off a blurry frame. The feedback opens by celebrating a goal or encouraging a miss, then gets specific: <strong>Verdict &#8594; Fix &#8594; Fix &#8594; Drill</strong>.</p></li><li><p><strong>The coach is streamed separately</strong> from the analysis, so the segmentation video, pose and stats appear instantly and the written feedback fills in a moment later.</p></li></ul><p>And there are <strong>two coaches behind one toggle</strong>, with the same prompt and the same grounded facts:</p><ul><li><p>&#9729;&#65039; <strong>Online: NVIDIA Nemotron-Nano-12B-v2-VL</strong> (12B). 
I serve the <strong><a href="https://huggingface.co/Vastined/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16-GGUF">GGUF (Q4_K_M) + mmproj</a></strong> on a <strong>llama.cpp</strong> server on <strong>Modal</strong> (A10G GPU). The bigger model gives the richer, more polished feedback.</p></li><li><p>&#9889; <strong>Offline: <a href="https://huggingface.co/openbmb/MiniCPM-V-4.6">MiniCPM-V-4.6</a></strong> (OpenBMB, ~1.3B: SigLIP2 + Qwen3.5-0.8B). It runs <strong>on the Space&#8217;s own ZeroGPU</strong>, with no external API at all. It is a genuinely pocket-sized VLM that still reads the frame and coaches, true to the local-first spirit of the whole project.</p></li></ul><p>One is cloud-grade, one is fully on-device; you pick.</p><div><hr></div><h2><strong>The demo UI: a custom stadium on </strong><code>gradio.Server</code></h2><p>The Space isn&#8217;t the default Gradio Blocks layout. It&#8217;s a <strong>bespoke football-stadium frontend</strong> (hand-written HTML/JS, an SVG top-down pitch as the background) talking to Gradio&#8217;s backend engine via <code>gradio.Server</code>: <code>@app.api()</code> exposes the analysis and coach as queued, ZeroGPU-aware endpoints, and the page is served from the same app and calls them with <code>@gradio/client</code>.</p><p>The details are where the fiddly hours went. Analysis and coaching are <strong>two separate calls</strong>, so the segmentation video, pose card and stats show up in a few seconds and the written feedback streams in afterwards; the shot never waits on the language model. There&#8217;s a gallery of held-out clips so you can try it in one click without uploading anything, and a single toggle flips the coach between the online and offline models. 
And the small stuff that makes it feel alive: a goal fires confetti &#127881;, the loader cycles football one-liners (&#8220;Counting your stepovers&#8230;&#8221;), the rendered clip is written with <code>+faststart</code> so it plays instantly, and the layout is pinned so it never breaks on resize.</p><p>It&#8217;s the same model stack, just wearing a kit instead of a form.</p><div><hr></div><h2><strong>Results &amp; honest evaluation</strong></h2><p>Same downstream pipeline on each detector, <strong>only the detector swapped</strong> (fair comparison).</p><p><strong>Held-out 12-clip test set:</strong></p><pre><code><code>+-----------------------------+------+------+--------------+-------------+
| Detector                    | Goal | Leg  | Pose-capture | Speed       |
+-----------------------------+------+------+--------------+-------------+
| SAM3 + LA (teacher)         | 83 % | 82 % |     92 %     | ~50x slower |
| RF-DETR-Seg-Small (student) | 75 % | 75 % |    100 %     | real-time   |
+-----------------------------+------+------+--------------+-------------+
</code></code></pre><p><strong>Per-split (goal / leg / pose-capture %):</strong></p><pre><code><code>+---------+-------+-----------+-------------------+
| split   | clips | SAM3 + LA | RF-DETR-Seg-Small |
+---------+-------+-----------+-------------------+
| test    |   12  | 83/82/92  |    75/75/100      |
| train   |   49  | 73/71/84  |    59/53/86       |
| val     |   11  | 45/88/73  |        --         |
| overall |   72  | 71/75/84  |        --         |
+---------+-------+-----------+-------------------+
</code></code></pre><p>The val goal number (45%) is low and noisy: it covers only 11 clips, skewed to the hard tilted/ball-in-front cases. I report it rather than hide it. <strong>The ceiling is real:</strong> in my evaluation, monocular goal detection tops out in the low-to-mid 80s (percent) because of the in-front-vs-in-net depth ambiguity.</p><p>There&#8217;s something I actually like about that number, though. The big models here (SAM3, LocateAnything, the 12B Nemotron coach) are <em>teachers</em>, too heavy to ever sit on my footage every weekend. The thing that actually ships is small: a distilled detector that runs real-time on a free GPU and a 1.3B coach that fits on the same little Space. It&#8217;s not as sharp as a stadium of cameras, and it doesn&#8217;t need to be. It just needs to run on <em>my</em> phone clip, on <em>my</em> hardware, without sending my Saturday football to anyone&#8217;s cloud.</p><div><hr></div><h2><strong>The pipeline, end to end</strong></h2><pre><code><code>clip &#9472;&#9472;&#9654; SAM3 ("ball"/"person"/"goal post")  &#9472;&#9472;&#9488;
         &#9492;&#9472; 0 detections? &#9472;&#9654; LocateAnything-3B &#9472;&#9508;  (auto-labels, zero hand annotation)
                                                &#9660;
                                   RF-DETR-Seg-Small  (distilled, real-time)
                                                &#9474; ball / player / goal masks
                &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
                &#9660;                                &#9660;                               &#9660;
        ball &#8745; player                    ball &#8745; goal + physics            ball pixel width
        &#8594; who has it &#8594; shooter           &#8594; goal? + timestamp              &#8594; shot speed (km/h)
                &#9474;
                &#9660;
        MediaPipe Pose on the kick frame &#8594; shooting foot (L/R) + knee/trunk/hip angles
                &#9474;
                &#9660;
        goal &#183; foot &#183; speed &#183; pose-angles &#9472;&#9472;&#9654; vision-LLM coach &#8594; Verdict &#183; Fix &#183; Fix &#183; Drill
            (&#9729;&#65039; Nemotron-Nano-12B-v2-VL on Modal &#183; &#9889; MiniCPM-V-4.6 on the Space's ZeroGPU)
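</code></code></pre><p>One branch of the diagram, <em>ball pixel width &#8594; shot speed</em>, can be sketched in a few lines of Python. Treat it as a minimal illustration under stated assumptions, not the code that ships in the Space: the function names, the 7-frame window, the 3-sigma threshold, the single <code>ball_px_width</code> per clip and the choice to Hampel-filter the ball centres before differencing are all hypothetical. Only the 0.22 m ball width, the 92nd-percentile peak, the cap of 140 and the <code>km/h</code> formula come from the sections above. It uses only the standard library, so it should run as-is.</p><pre><code><code>import math
import statistics

BALL_DIAMETER_M = 0.22  # from the post: the ball is 22 cm across
SPEED_CAP = 140.0       # from the post: "capped 140" (assumed here to be km/h)


def hampel(values, half_window=3, n_sigmas=3.0):
    """Sliding-window median/MAD filter: outliers become the local median.

    half_window and n_sigmas are illustrative defaults, not tuned values.
    """
    cleaned = list(values)
    for i in range(len(values)):
        start = max(0, i - half_window)
        window = values[start:i + half_window + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        # 1.4826 * MAD approximates one standard deviation for Gaussian noise.
        if abs(values[i] - med) > n_sigmas * 1.4826 * mad:
            cleaned[i] = med
    return cleaned


def percentile(values, q):
    """Linearly interpolated percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


def shot_speed_kmh(centres, ball_px_width, fps):
    """km/h = peak_px * fps * (0.22 / ball_px_width) * 3.6, robust and capped.

    centres is a list of (x, y) ball centres in pixels, one per frame.
    """
    xs = hampel([c[0] for c in centres])
    ys = hampel([c[1] for c in centres])
    # Per-frame displacement of the cleaned track, in pixels.
    steps = [math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
             for i in range(len(xs) - 1)]
    if not steps:
        return 0.0
    peak_px = percentile(steps, 92)
    kmh = peak_px * fps * (BALL_DIAMETER_M / ball_px_width) * 3.6
    return min(kmh, SPEED_CAP)


if __name__ == "__main__":
    # Hypothetical track: 10 px per frame, with one "teleporting" detection.
    track = [(10.0 * i, 0.0) for i in range(20)]
    track[10] = (1000.0, 0.0)
    print(round(shot_speed_kmh(track, ball_px_width=11.0, fps=30.0), 1))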
</code></code></pre><div><hr></div><h2><strong>What&#8217;s next (future improvements)</strong></h2><p>I optimised for &#8220;lazy and local,&#8221; not for maximum accuracy. The obvious upgrades:</p><ul><li><p><strong>High-quality manual annotation.</strong> Distilling from SAM3 + LA is convenient, but it caps the ceiling; a few hours of careful hand labels would likely lift goal/ball accuracy significantly. On the data side, this is the single biggest lever.</p></li><li><p><strong>Bigger backbone &amp; higher resolution.</strong> Moving from RF-DETR-Seg-<strong>Small</strong> to a <strong>Large</strong> model at higher input resolution directly attacks the tiny-ball problem.</p></li><li><p><strong>Homography &#8594; top-down pitch map.</strong> Estimate the pitch plane and project players/ball onto a <strong>2D map</strong>. That adds spatial robustness, makes &#8220;in front of vs in the net&#8221; more tractable, and unlocks proper positional analytics.</p></li><li><p><strong>More sessions / players</strong> to generalise beyond one pitch and two players.</p></li></ul><div><hr></div><h2><strong>Why I built this (and why local)</strong></h2><p>I&#8217;m a <strong>big football fan</strong>, and I record my football journey on <strong>YouTube</strong>. I&#8217;d earlier built a <strong><a href="https://github.com/dcrey7/100dayscodepython/tree/main/day35_euroFootballoutcome">Euro football-outcome prediction project</a></strong> (part of my 100-days-of-code) and was <strong>awarded by FIFA</strong> for its unique method, so blending football and AI is a bit of a personal tradition.</p><p>Now it&#8217;s <strong>World Cup 2026</strong>, the whole world is watching football, and I finally set out to build the thing I&#8217;d always wanted: a personal football AI on my <em>own</em> footage. 
I had the <strong>content</strong> (years of recordings, including the shooting sessions with Adam) and, thanks to <strong>Roboflow and SkalskiP</strong>, the <strong>ideas</strong> (player/jump detection, ball physics, RF-DETR, SAM) to actually pull it off.</p><p>And the part I care about most: <strong>I&#8217;m a big fan of local models and owning them locally.</strong> Everything here runs on <strong>open models I control</strong> (SAM3, LocateAnything, RF-DETR, MediaPipe, MiniCPM-V, Nemotron), with no proprietary API in the loop. The offline coach runs on the Space&#8217;s own GPU; even the &#8220;online&#8221; one is an open NVIDIA model I <strong>self-host on Modal</strong>, not a closed endpoint. The <strong>Build-Small Hackathon</strong> was the perfect excuse to finally sit down and build it.</p><p>I still drop my own session clips into it now that it&#8217;s working, which is the real test for me: the BBC&#8217;s 3D experience lights up when a World Cup match is on TV; mine lights up when Adam and I finish shooting on a Saturday. 
Same idea, my pitch.</p><div><hr></div><h2><strong>Links</strong></h2><ul><li><p>&#127916; <strong>Live demo</strong> <a href="https://huggingface.co/spaces/build-small-hackathon/kicky-ai">https://huggingface.co/spaces/build-small-hackathon/kicky-ai</a></p></li><li><p>&#129302; <strong>Model</strong> <a href="https://huggingface.co/build-small-hackathon/kicky-ai-rfdetr-seg">https://huggingface.co/build-small-hackathon/kicky-ai-rfdetr-seg</a></p></li><li><p>&#128230; <strong>Dataset</strong> <a href="https://huggingface.co/datasets/build-small-hackathon/kicky-ai-spf">https://huggingface.co/datasets/build-small-hackathon/kicky-ai-spf</a></p></li><li><p>&#128250; <strong>YouTube <a href="https://www.youtube.com/@dcrey7">@dcrey7</a></strong> (if this was useful, a subscribe means a lot &#128591;)</p></li></ul><p><em>Stack: SAM3 &#183; NVIDIA LocateAnything-3B &#183; RF-DETR-Seg (Roboflow) &#183; MediaPipe Pose &#183; NVIDIA Nemotron-Nano-12B-v2-VL (llama.cpp on Modal) &#183; MiniCPM-V-4.6 (OpenBMB, on-Space ZeroGPU) &#183; OpenAI Codex (annotation tools) &#183; gradio.Server &#183; OpenCV.</em></p>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is abhishek.t&#8217;s Substack.]]></description><link>https://dcrey7.substack.com/p/coming-soon</link><guid isPermaLink="false">https://dcrey7.substack.com/p/coming-soon</guid><dc:creator><![CDATA[abhishek.t]]></dc:creator><pubDate>Sat, 12 Jul 2025 12:40:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!W8Sz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Feb54c6a3-861d-4cc7-b692-c856d60c9882_723x888.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is abhishek.t&#8217;s Substack.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dcrey7.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe 
now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dcrey7.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>