AI comprehensive database creator
Diffbot is an AI startup that indexes the entire web. It provides Diffbot Knowledge Graph (DKG) query services in Google Sheets and Microsoft Excel, so users can instantly enrich their existing databases with comprehensive, publicly available information about companies and organizations. The main challenge was to build the best extraction API to process pages found by Crawlbot.
The following features were identified for functionality expansion:
- human-level recognition with no manual exploration or input
- a diagram of connections with other entities
- rule-free extraction: unlike traditional web scraping tools, Diffbot doesn't require any rules to read the content on a page
- a bulk request option, to quickly and efficiently obtain data on a large sample of people/companies
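As a rough illustration of the bulk request idea, a large list of inputs can be split into fixed-size batches before each batch is submitted. This is a minimal sketch only: the `batch` helper, the batch size, and the sample domains are hypothetical, and it shows nothing of Diffbot's actual bulk interface.

```python
from typing import Iterable, Iterator, List


def batch(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunk: List[str] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, batch
        yield chunk


# Hypothetical company domains to enrich in bulk.
domains = ["example.com", "example.org", "example.net", "example.edu", "example.io"]
batches = list(batch(domains, 2))
```

Each batch could then be sent as one request, which keeps the number of round trips low for large samples.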
Our team joined in 2020 with 2 full-stack developers.
Our team members mainly worked with Java and cloud frameworks.
Number of our contractors:
1 Back end
1 Front end
1 QA testing
Usually 1-2 people were attached to the project at the same time. Our key responsibility was to build supporting infrastructure.
- Distributed, world-class crawling infrastructure processing millions of pages daily.
- Plug-and-play scraping and Knowledge Graph access.
- Clean structured data (such as JSON or CSV), ready for use.
- Innovation: Yandex, Microsoft, Amazon and other market titans use Diffbot.
- AI, ML: innovative technologies allow the bot to scale to any amount of content. Content is interpreted by a machine learning model trained to identify the key attributes on a page based on its type.
- Data management and storage: MySQL for storage, support for bulk requests, and high processing speed.
- Content update: the platform can update its information and build a knowledge base.
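To make the "clean structured data (such as JSON or CSV)" point concrete, here is a minimal sketch of turning JSON-like entity records into CSV for use in a spreadsheet. The field names (`name`, `homepage`, `employees`) and the `records_to_csv` helper are assumptions chosen for illustration and do not reflect the real Knowledge Graph schema.

```python
import csv
import io
from typing import Dict, List


def records_to_csv(records: List[Dict[str, object]]) -> str:
    """Serialize records to CSV, leaving missing fields empty."""
    # Collect the union of keys in first-seen order to form the header.
    fieldnames: List[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical entity records, as they might look after JSON decoding.
records = [
    {"name": "Example Corp", "homepage": "example.com", "employees": 120},
    {"name": "Sample LLC", "homepage": "example.org"},
]
csv_text = records_to_csv(records)
```

Records with missing attributes still produce a full row, so the resulting sheet keeps its columns aligned.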