Arrk helped the Customer improve the speed, accuracy and scalability of market data collection through API-driven data extraction.
Customer
The Customer provides market intelligence services to the built environment sector. Its platform is used by organisations across the construction industry, including building material suppliers, contractors, consultants, architects, designers, electricians, plumbers and more. The platform helps users access information that supports business development and decision-making.
The Challenge
Business As-Is
The Customer gathers construction-related market intelligence primarily from publicly available sources. This information was spread across a wide range of websites, portals, folders and document formats, which made consistent extraction difficult. Accessing and processing these sources manually was time-consuming and did not scale.
Existing crawling and scraping approaches delivered inconsistent results, creating a need for a more reliable solution.
Business To-Be
The Customer aimed to become one of the first providers to offer AI-enabled market intelligence, which required near real-time data ingestion with high levels of completeness and accuracy.
Identified Business Use Cases
- GDPR-Compliant Crawling: Replace traditional proxy services with GDPR-compliant crawling and scraping techniques.
- Infrastructure Optimisation: Reduce infrastructure bottlenecks and associated costs.
- Automation of Data Gathering: Reduce manual intervention by automating extraction from diverse sources and formats.
- Intelligent Data Organisation: Ensure data is structured and validated for downstream processing.
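As an illustration of the "Intelligent Data Organisation" use case, the following is a minimal sketch of how an extracted record might be structured and validated before it is passed downstream. The case study does not describe the real schema, so the `ProjectRecord` fields and the `validate_record` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProjectRecord:
    # Hypothetical fields; the case study does not describe the real schema.
    title: str
    source_url: str
    published: date


def validate_record(raw: dict) -> ProjectRecord:
    """Return a structured record, or raise ValueError listing every problem."""
    errors = []

    title = str(raw.get("title") or "").strip()
    if not title:
        errors.append("title is missing")

    source_url = str(raw.get("source_url") or "").strip()
    if not source_url.startswith(("http://", "https://")):
        errors.append("source_url must be an http(s) URL")

    published = None
    try:
        published = date.fromisoformat(str(raw.get("published")))
    except ValueError:
        errors.append("published must be an ISO date (YYYY-MM-DD)")

    if errors:
        # Report all problems together so a failed record can be fixed in one pass.
        raise ValueError("; ".join(errors))
    return ProjectRecord(title, source_url, published)
```

Collecting every error before raising, rather than stopping at the first, is one way to keep manual review of rejected records short.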
Solution
Given the complexity of the existing environment, a single large-scale implementation carried significant delivery risk. Arrk recommended an iterative PoC strategy for rapid evaluation. After multiple iterations, the team selected Zyte API as the core solution, complemented by additional services.
Adopted Tech Stack:
- Python with Poetry
- AWS Services – EventBridge, Step Functions, Lambda, SQS, S3, Batch, ECR
- OpenSearch
- Zyte API
- New Relic
- RDS (Aurora PostgreSQL)
Key Challenges and Resolutions
- Source Diversity & Failures: Multiple iterations were required to identify common failure patterns and address them.
- Configuration Complexities: JSON-based configurations were introduced in place of YAML configurations to improve flexibility during testing and data extraction.
- File Type & Format Issues: Addressed challenges associated with a wide variety of websites, folder structures and document formats.
- Document Access & Reading: Extracting data from documents while handling CAPTCHA controls, disclaimers and varying file formats was a significant challenge. Introducing Mistral OCR improved text extraction accuracy across a wide range of document types.
- Infrastructure Sizing & Cost: Balanced resource allocation, cost and processing time to achieve an efficient operating model.
- Large Document Processing: Resolved gateway timeouts by implementing an asynchronous ingestion architecture using AWS Batch.
- Parallel System Runs: Blue-green deployments were maintained for several weeks to support stability during rollout. In-house subject matter experts were engaged throughout testing to support ongoing refinement and improvement.
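The asynchronous ingestion pattern mentioned under Large Document Processing can be sketched as follows. This is an illustration of the general idea only, not the delivered implementation: the request is acknowledged immediately with a job identifier, and the slow work runs outside the request path, so no gateway has to hold a connection open. The in-memory queue and status table are stand-ins assumed for this example (in the architecture described above, AWS Batch would run the heavy processing), and all function names are hypothetical.

```python
import queue
import uuid

# In-memory stand-ins (assumptions for this sketch): a managed queue and
# AWS Batch would play these roles in a real deployment.
job_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
job_status: dict[str, str] = {}


def submit_document(document_url: str) -> dict:
    """Accept the request and return at once, before any heavy processing."""
    job_id = str(uuid.uuid4())
    job_status[job_id] = "QUEUED"
    job_queue.put((job_id, document_url))
    # 202 Accepted: the caller polls the job status later instead of waiting.
    return {"statusCode": 202, "jobId": job_id}


def process_document(document_url: str) -> None:
    """Hypothetical placeholder for the slow download, OCR and extraction work."""


def run_worker() -> int:
    """Drain the queue outside the request path; returns the number of jobs handled."""
    handled = 0
    while True:
        try:
            job_id, document_url = job_queue.get_nowait()
        except queue.Empty:
            return handled
        job_status[job_id] = "RUNNING"
        try:
            process_document(document_url)
            job_status[job_id] = "SUCCEEDED"
        except Exception:
            job_status[job_id] = "FAILED"
        handled += 1
```

Because `submit_document` only records and enqueues the job, its response time does not depend on document size, which is the property that removes the timeout.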
Key Outcomes
- Reduced manual operational effort by automating repetitive, time-consuming tasks.
- Significantly enhanced user experience and interaction quality.
- Improved consistency, accuracy and reliability of extracted data through workflow automation and validation processes.
- Established a scalable data extraction platform that can support additional business functions in the future.
Conclusion
By working with Arrk, the Customer implemented a scalable, centralised service for acquiring accurate market data. The platform now supports both the Customer’s core business requirements and those of other group companies, helping maximise the value of a single investment.



