
Markov AI releases Computer Use Large on Hugging Face: 48,478 videos and 12,300 hours

Markov AI released Computer Use Large on Hugging Face with 48,478 screen recordings spanning about 12,300 hours across six professional apps. Use it to train and evaluate GUI agents on real software workflows with a large CC-BY dataset.


TL;DR

  • Markov AI published Computer Use Large on Hugging Face, described in launch posts as the “world’s largest open-source dataset of computer-use recordings,” with 48,478 screen-recording videos totaling about 12,300 hours (launch post; dataset page).
  • The dataset is licensed CC-BY-4.0, which makes it easier to use for research and development on GUI and computer-use agents than many closed or unclearly licensed screen-recording corpora (license note; dataset page).
  • According to the dataset page, the videos cover six professional software categories — AutoCAD, Blender, Excel, Photoshop, Salesforce, and VS Code — and are meant for training and evaluating agents on real software workflows.
  • The Hugging Face page says the curation pipeline trims away non-screen content, removes audio, and filters for actual screen-recording segments, while a supporting repost frames it as a 10,000-plus-hour release across apps including Salesforce (processing details; supporting repost).

What shipped

Computer Use Large is a new Hugging Face dataset for desktop-agent work, built from 48,478 screen recordings of professional software use and released under CC-BY-4.0 (launch post). The Hugging Face listing positions it for “training & evaluating computer use agents,” not just passive video understanding, which matters because the source material is grounded in real GUI workflows rather than synthetic trajectories (dataset page).

The current coverage spans six applications: AutoCAD, Blender, Excel, Photoshop, Salesforce, and VS Code (app coverage). That gives the corpus a mix of office, creative, CAD, CRM, and coding environments, which is broader than single-app desktop datasets and more relevant for benchmarking cross-domain computer-use behavior.

How it was processed and why engineers may care

The dataset page says the videos were sourced from YouTube tutorials and then processed to keep only screen-centric segments: audio was stripped with ffmpeg, intros and outros were removed, frames were sampled every 10 seconds, and a vision-language model, Gemini Flash, was used to classify whether frames were genuine screen-recording content (processing details). Videos with less than 10 seconds of screen activity were discarded, and the metadata includes fields such as original and trimmed duration, upload date, screen-content percentage, and segment counts (metadata fields).
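The filtering step can be sketched in a few lines. This is a hedged illustration, not Markov AI's actual code: the classifier is stubbed out, and the output field names (`trimmed_duration_s`, `screen_content_pct`) are assumptions modeled on the metadata fields the dataset page mentions, not the real schema.

```python
# Sketch of the described curation filter: frames are sampled every
# 10 seconds, a classifier labels each as screen content or not, and
# videos with under 10 s of screen activity are dropped.
SAMPLE_INTERVAL_S = 10   # one classified frame per 10 seconds of video
MIN_SCREEN_S = 10        # minimum screen activity to keep a video

def curate(frame_labels):
    """frame_labels: booleans from a (stubbed) frame classifier,
    True if the sampled frame is genuine screen-recording content.
    Returns per-video metadata, or None if the video is discarded."""
    screen_frames = sum(frame_labels)
    screen_seconds = screen_frames * SAMPLE_INTERVAL_S
    if screen_seconds < MIN_SCREEN_S:
        return None  # discard: too little screen activity
    return {
        "trimmed_duration_s": screen_seconds,
        "screen_content_pct": 100.0 * screen_frames / len(frame_labels),
    }

# A video with 5 of 6 sampled frames classified as screen content
# keeps 50 s of trimmed footage at ~83% screen content.
print(curate([True] * 5 + [False]))
```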

For engineers, the practical value is less about a new model release and more about data availability. A large, openly licensed corpus with per-video metadata and category splits can support pretraining, eval set construction, and comparisons across app domains via Hugging Face’s `load_dataset` flow. The supporting repost reinforces the scale claim, calling it the “world’s largest open-source dataset of computer-use recordings” and highlighting 10,000-plus hours across enterprise and productivity software (supporting repost).
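A cross-domain comparison of the kind the category splits enable might look like the following. This is a minimal sketch: the record fields (`category`, `trimmed_duration_s`) are illustrative assumptions, and in practice the records would come from `datasets.load_dataset(...)` pointed at the dataset's Hugging Face repo (stubbed here with sample rows; check the dataset card for the real schema).

```python
# Tally total hours of footage per application category from
# per-video metadata records, e.g. to balance an eval set across apps.
from collections import defaultdict

def hours_by_category(records):
    totals = defaultdict(float)
    for rec in records:
        totals[rec["category"]] += rec["trimmed_duration_s"] / 3600.0
    return dict(totals)

# Illustrative sample rows standing in for rows loaded via
# datasets.load_dataset; field names are assumed, not the real schema.
records = [
    {"category": "Excel", "trimmed_duration_s": 7200},
    {"category": "Blender", "trimmed_duration_s": 5400},
    {"category": "Excel", "trimmed_duration_s": 1800},
]
print(hours_by_category(records))  # {'Excel': 2.5, 'Blender': 1.5}
```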
