AI comprehensive database creator

A bot that creates a database from any website
Industry: Research
Geography: USA
Type: Web
Technologies: Machine Learning
Project overview:

The client is an AI startup that indexes the entire web. It provides Diffbot Knowledge Graph (DKG) query services in Google Sheets and Microsoft Excel, which let users instantly enrich their existing databases with comprehensive, publicly available information about companies and organizations. The main challenge was to build the best extraction API to process pages found by Crawlbot.
The following features were identified for functionality expansion:

  • human-level recognition with no manual exploration or input
  • a diagram of connections with other entities
  • rule-free extraction: unlike traditional web scraping tools, Diffbot doesn't require any rules to read the content on a page
  • a bulk request option, giving the ability to quickly and efficiently obtain data on a large sample of people/companies
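
The bulk request idea in the list above can be illustrated with a minimal Python sketch. It assumes a hypothetical `lookup` callable that stands in for a single bulk request and a hypothetical batch size of 50. It is not Diffbot's actual API, only one way to split a large sample of names into batches and merge the results.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def bulk_enrich(
    names: Iterable[str],
    lookup: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
) -> Dict[str, dict]:
    """Send names to `lookup` in batches and merge the returned records.

    `lookup` is a hypothetical stand-in for one bulk request: it takes a
    list of names and returns a mapping of name -> record.
    """
    results: Dict[str, dict] = {}
    for batch in chunked(names, batch_size):
        results.update(lookup(batch))
    return results


def fake_lookup(batch: List[str]) -> Dict[str, dict]:
    # Hypothetical offline stub; a real lookup would call a remote service.
    return {name: {"name": name, "name_length": len(name)} for name in batch}
```

Calling `bulk_enrich(["Acme", "Globex", "Initech"], fake_lookup, batch_size=2)` would issue two lookups, one with two names and one with the remaining name, and return a single merged dictionary.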
Emphasoft engagement:

Our team joined in 2020 with two full-stack developers.
Our team members mainly worked with Java and cloud frameworks.

Number of our contractors:
1 back-end developer
1 front-end developer
1 QA tester

Usually one or two people worked on the project at the same time. Our key responsibility was to build the supporting infrastructure.

Key results:
  • Distributed, world-class crawling infrastructure processing millions of pages daily.

  • Plug-and-play scraping and Knowledge Graph access

  • Clean structured data (such as JSON or CSV), ready for use

  1. Innovation: Yandex, Microsoft, Amazon and other market titans use Diffbot.
  2. AI, ML: Innovative technologies allow the bot to scale to any amount of content. Content is interpreted by a machine learning model trained to identify the key attributes on a page based on its type.
  3. Data management and storage: MySQL is used for storage, bulk requests are supported, and processing is fast.
  4. Content update: The platform is able to update the information and build a knowledge base.
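
As an illustration of the "clean structured data" result listed above, here is a minimal Python sketch that serializes the same records to both JSON and CSV using only the standard library. The field names and sample records are hypothetical placeholders, not actual Knowledge Graph output.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize records as a JSON array with stable key order."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize records as CSV, leaving missing fields empty."""
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample records; the second one lacks an "industry" field.
records = [
    {"name": "Example Corp", "industry": "Research", "country": "USA"},
    {"name": "Sample LLC", "country": "USA"},
]

print(to_json(records))
print(to_csv(records))
```

Taking the union of keys for the CSV header is a design choice: it keeps every record in one table even when some attributes were not found for a given company.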