Amazon classifies every product in its catalog into numbered categories commonly known as “nodes.” These nodes are arranged in a hierarchy of “parent nodes” and “leaf nodes.” A leaf node is a more specific sub-category of its parent node. In other words, parent nodes represent the more general classification of products, and each leaf or “child” reflects a specific, relevant subdivision. For example, node 283155 is the parent node for “books,” and node 5 reflects “computer & technology books,” a specific kind of book. In this example, 283155 is the parent and 5 is the child or leaf. At the present time, Amazon has more than 100,000 nodes. However, many of them are either inaccessible through the API or do not contain practical information.
Discovering all of Amazon’s nodes requires repeated API requests. Most associates should allow at least one second between requests. Since Amazon does not make available a master root starting point containing all parents, finding all the nodes can be time-consuming.
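The one-second spacing between requests can be enforced with a small helper. The following is a minimal Python sketch, not the original script (which was written in PHP); the class name `Throttle` and its parameters are hypothetical, chosen only for illustration.

```python
import time


class Throttle:
    """Enforces a minimum gap between successive calls to wait().

    Hypothetical helper: the name and interface are assumptions,
    not part of any Amazon library.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the helper can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each API request keeps requests at least `min_interval` seconds apart without sleeping longer than necessary when the previous request was itself slow.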
Because a master root list containing all parents does not exist within the Amazon API, the first step in creating a database of BrowseNodes is to obtain a list of diverse categories and their associated nodes. The most diverse list of categories found in one place is located on the “Amazon Site Directory” page. Presumably, this page contains links to help search engines discover deeper product classifications and represents the full range of what Amazon has to offer. Most links on this page contain node-specific URLs, from which the node IDs are extracted using PHP. After non-essential HTML and duplicate references have been removed, the condensed list is saved to the MySQL database in the SampleNode_US table, one node per row.
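The link-extraction step might look roughly like the sketch below. It is written in Python for illustration rather than the PHP the original script used, and it assumes that each link carries its node ID in a `node` query parameter (for example `/b?node=283155`); the class and function names are hypothetical. Saving the result to the SampleNode_US table is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class NodeLinkParser(HTMLParser):
    """Collects unique node IDs from <a href="...?node=ID"> links."""

    def __init__(self):
        super().__init__()
        self.node_ids = []   # unique IDs in first-seen order
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Assumption: the node ID is the value of the "node" query parameter.
        for value in parse_qs(urlparse(href).query).get("node", []):
            if value.isascii() and value.isdigit() and value not in self._seen:
                self._seen.add(value)
                self.node_ids.append(int(value))


def extract_node_ids(html):
    """Return the de-duplicated node IDs found in the page's links."""
    parser = NodeLinkParser()
    parser.feed(html)
    return parser.node_ids
```

Each returned ID would then become one row in SampleNode_US.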
Next, every row in the SampleNode_US table is run through the API to determine its ancestor. Duplicate ancestors in the returned API data are removed, and the results are added to their own database table, RootNode_US. In this manner, the root BrowseNode level containing all parents is discovered by structuring the data returned from the API.
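One possible reading of this step is to climb from each sample node through its ancestors until a node reports no ancestor, and to keep each such top-level node once. The Python sketch below illustrates that reading only; `fetch_ancestors` is a hypothetical stand-in for the API call, assumed to return a list of ancestor IDs (empty for a root).

```python
def find_root_nodes(sample_ids, fetch_ancestors):
    """Return the unique top-level nodes reachable from sample_ids.

    fetch_ancestors(node_id) is a hypothetical callable standing in for
    the API request; it is assumed to return a list of ancestor IDs.
    """
    roots = []
    visited = set()                      # each node is queried at most once
    stack = list(reversed(sample_ids))   # process samples in their given order
    while stack:
        node_id = stack.pop()
        if node_id in visited:
            continue
        visited.add(node_id)
        ancestors = fetch_ancestors(node_id)
        if not ancestors:
            roots.append(node_id)        # no ancestor: treat as a root
        else:
            stack.extend(reversed(ancestors))
    return roots
```

Tracking visited nodes both removes duplicate ancestors and avoids spending a rate-limited request on a node that has already been queried.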
Lastly, each row in the RootNode_US table is passed through the API in order to obtain its child BrowseNode IDs. Each child BrowseNode, in turn, is also passed to the API in search of deeper children. When no more children can be found, the next parent or child node is loaded and run through. The process repeats until every node has been explored for all of its children. Results are saved and/or updated in the Node_US table. It takes about 2-3 weeks for the script to parse all nodes after factoring in the required time delay between API requests.
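The child-discovery loop described here is essentially a depth-first traversal. A minimal Python sketch follows, under the assumption that a hypothetical callable `fetch_children` wraps the API request and returns a list of child IDs; the function name, the `delay` parameter, and the returned mapping are illustrative, not the original script's design.

```python
import time


def crawl_children(root_ids, fetch_children, delay=1.0):
    """Return {node_id: parent_id} for every node reachable from root_ids.

    fetch_children(node_id) is a hypothetical stand-in for the API call,
    assumed to return a list of child node IDs (empty for a leaf).
    """
    parent_of = {root: None for root in root_ids}   # roots have no parent
    stack = list(reversed(root_ids))
    while stack:
        node_id = stack.pop()
        children = fetch_children(node_id)          # one request per node
        time.sleep(delay)                           # pause between requests
        for child in reversed(children):
            if child not in parent_of:              # skip nodes already recorded
                parent_of[child] = node_id
                stack.append(child)
    return parent_of
```

Each entry in the returned mapping corresponds to one row that could be saved or updated in the Node_US table. Using an explicit stack instead of recursion keeps the traversal safe for deep hierarchies.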
Source by T. Grijalva Jr.