Product-oriented market research database

Problem

Currently, there are no market research tools available that take into account existing products, services, and high-potential research initiatives. This limits the ability of both founders and investors to base their decisions on the real driver of markets: the goods people can pay for. Or, even more importantly, the ones they will be able to pay for once the R&D phase of novel technologies is completed.

Solution

The solution to this issue is conceptually simple: gather all the information available about the products on the market. So how come nobody has tried this before? My guess is that the data is vast and hard to collect efficiently with traditional methods, and that the need for such a product has not been apparent because the existing solutions are semi-satisfactory. With the emergence of AI, however, web scraping has become much more efficient, and LLMs make it possible to extract structured information from company websites. The only issue left is to find the right data sources and a structure that accommodates the needs of the product I'm aiming to develop: a product-oriented market research database.

Workflow

As outlined above, the goal was simple: find sources containing valuable data about available products and services, then extract each item together with its "id" (i.e. its name, so that the same product ends up in a single category even if multiple companies provide it) and any other relevant information (such as its industry and the companies that sell it). Another crucial aspect is mapping the relations between products and their underlying interconnectedness. This is also one of the key differentiators, as it could easily highlight inefficiencies in supply chains and operations. The research side works in much the same way, with the addition that the current state of development of each initiative must also be determined before it can be integrated meaningfully into this structure. Altogether, the following sources were identified:

  • Company websites for products and services extraction;
  • Research papers and patent databases for identifying key emerging technologies;
  • Industry and supply chain reports for obtaining the connections between the given technologies.

At each stage, a "node" or a "dependency" is added to the database if it is not already present; if it is, the existing entry is extended with the new information. This is also why the database will be a graph database, as that model is well suited to storing and querying such data.
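The "add if missing, otherwise append" step described above can be sketched as follows. This is a minimal in-memory stand-in for a graph database, not the prototype's actual code: the class and field names (ProductGraph, Node, upsert_node, upsert_dependency), the name-normalisation rule, and the sample records are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A product, service, or research initiative (hypothetical schema)."""
    name: str
    kind: str  # e.g. "product" or "research"
    companies: set[str] = field(default_factory=set)
    industries: set[str] = field(default_factory=set)
    sources: set[str] = field(default_factory=set)


class ProductGraph:
    """In-memory stand-in for a graph database (illustrative only)."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        # (dependent, dependency) -> set of sources that reported the link
        self.dependencies: dict[tuple[str, str], set[str]] = {}

    @staticmethod
    def _key(name: str) -> str:
        # Naive normalisation (an assumption): lowercase, collapse whitespace,
        # so the same product offered by several companies shares one node.
        return " ".join(name.lower().split())

    def upsert_node(self, name, kind, company=None, industry=None, source=None):
        key = self._key(name)
        node = self.nodes.get(key)
        if node is None:  # not present yet: create the node
            node = Node(name=name, kind=kind)
            self.nodes[key] = node
        # Present or not, append whatever new information arrived.
        if company:
            node.companies.add(company)
        if industry:
            node.industries.add(industry)
        if source:
            node.sources.add(source)
        return node

    def upsert_dependency(self, dependent, dependency, source):
        edge = (self._key(dependent), self._key(dependency))
        for endpoint in edge:
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint}")
        self.dependencies.setdefault(edge, set()).add(source)


# Made-up sample records, one per kind of source listed above.
graph = ProductGraph()
graph.upsert_node("Solar panel", "product", company="Company A",
                  industry="Energy", source="company website")
graph.upsert_node("solar  panel", "product", company="Company B",
                  source="company website")
graph.upsert_node("Perovskite cell", "research", source="research paper")
graph.upsert_dependency("Solar panel", "Perovskite cell", source="industry report")

print(len(graph.nodes), sorted(graph.nodes["solar panel"].companies))
```

In a real graph database the same idea usually maps onto a single "merge" style write per node or edge; the dictionary lookups here only mimic that behaviour.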

Prototype

For a very simplistic prototype, I scraped IBISWorld's company database and arXiv's recent submissions pages, and used an LLM to establish hypothetical connections between nodes that it deemed "dependent on each other". Since roughly 16,000 nodes were retrieved in total, I used a K-means algorithm to cluster them into 300 groups based on their semantic similarity, computed from embeddings obtained through an open-source LLM's embedding API. The result is flawed in many ways, mostly because of improper prompting and the low effort put into the initial scraping algorithm, yet it still gives a good sense of what a proper end product could be. The prototype graph below shows the cluster leader nodes and their connections; clicking a node displays all the data associated with it that was obtained during the scraping procedure.
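To make the clustering step concrete, here is a small, self-contained sketch of K-means (plain Lloyd's algorithm) applied to embedding vectors. It is not the prototype's code. The hand-written 2-D vectors stand in for real embeddings, k is 2 instead of 300, and picking the member closest to each centroid as the "cluster leader" is my assumption about one reasonable way to choose leaders, since the selection rule is not specified above. The function names kmeans and cluster_leaders are hypothetical.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]

    def assign():
        # Label each vector with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                for v in vectors]

    for _ in range(iterations):
        labels = assign()
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign()


def cluster_leaders(names, vectors, centroids, labels):
    """Pick, per cluster, the node whose vector is closest to the centroid."""
    leaders = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            best = min(members, key=lambda i: math.dist(vectors[i], centroid))
            leaders[c] = names[best]
    return leaders


# Toy stand-ins for node names and their embeddings (made up for illustration).
names = ["solar panel", "wind turbine", "battery pack",
         "crm software", "erp software", "payroll software"]
vectors = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
           [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]

centroids, labels = kmeans(vectors, k=2)
leaders = cluster_leaders(names, vectors, centroids, labels)
print(labels, leaders)
```

At the scale mentioned above (about 16,000 nodes, 300 clusters, high-dimensional embeddings), a vectorised library implementation of K-means would be the practical choice; the pure-Python loop is only meant to show the logic.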


Final remarks

Even though the prototype is far from perfect, I find it a good starting point for further development. An actual, market-ready database would require substantial capital for the scraping, but with the right algorithm and directives, a similar graph of higher quality could be obtained. Anyone could then use it as they see fit, but for founders and investors in particular it could provide a perspective different from what they have been used to so far. This product-oriented market research database would address a problem that has been present for a long time, and could be a game changer in the way we perceive the market.