Academy

Shopping carts are a good example of big data with the need of advanced analytics. Here we present new MDX+ features of icCube – categories, subcubes, and arrays – coupled with the power and flexibility of MongoDB applied to a shopping cart model.

Shopping transactions can be massive. A typical model might have millions, or hundreds of millions, of shopping carts with an average of 10 to 50 products. Doing a bit of math, 10 million shopping carts with an average of 50 products will produce 500 million single-product transactions; 100 million shopping carts will produce 5 billion single-product transactions. That’s starting to be big. To focus on customer behaviour, our analysis will center on shopping carts’ profitability not individual products. Why we’re interested in being able to handle this problem at shopping cart level instead of product transaction is easy to understand: we’re speaking about 50 times less rows. This is important for loading data, saving data onto memory/disks, as well as performing analytical operations. Our model will have a large number of dimensions – e.g., customer information, geo information, product categories etc. – with a high sparsity, so pre-aggregation is not a valid solution.

Let’s start by showing how a typical shopping cart MongoDB document would be like:

{ "_id" : ObjectId("53bfe3f3578ddc736f679423"), "id" : 2, "customer" : { "id" : 46, "name" : "Danielle", "age" : 24, "gender" : "M" }, "date" : ISODate("2014-05-10T14:00:00Z"), "Shopping Cart" : [{ "Group" : "Foodstuff", "Family" : "Fresh fruits", "Product" : "Berries", "Catalogue Price" : 4.0, "Customer Price" : 4.0, "Margin" : 0.61 }, { "Group" : "Foodstuff", "Family" : "Seafood", "Product" : "Lobster", "Catalogue Price" : 12.0, "Customer Price" : 12.0, "Margin" : 1.81 } .. }

To make it simple, we’ve included just a bit of information about the customer and a shopping cart with a list of products. In a relational database, the same structure would require a couple of tables (Shopping_Cart, Customer, Shopping_Cart_Product_Transaction…) with their foreign keys.

How would we map this structure in icCube ?

Thanks to the icCube’s MongoDB driver, we can directly embed JSON arrays that will be transparently transformed in many-to-many relation (each cart has N products). No worries, icCube will handle this for you.

Without the array support, we would have to solve this in MongoDB with a denormalization, in a relational database by performing a ‘cross-product’ between two tables, both resulting in a row for each product transaction. There are two problems with this approach. The first problem is the need for a lot more memory and disk I/O as we store each product shopping transaction (fifteen times more). The second problem is that in order to count the number of shopping carts, we have to perform a distinct count on the shopping cart id, and this is, by far, the slowest aggregation method available in an OLAP Server.

Fortunately, in icCube we can handle directly arrays as returned by MongoDB, so we can use the fastest aggregation method available: ‘count rows’. And this is quicker than distinct count by some order of magnitude with the same amount of fact rows, so it’s a huge difference as we have 15 less fact rows.

There is a bit of work to create the dimensions and facts to build a schema. We did it for you, so you can check the resulting icCube schema here.

Once the schema is created, we can start playing with icCube.

It’s important to pay attention that this schema will have a few dimensions that present a many-to-many relation.

What is the impact of a many-to-many relation?

In the classical relation, each parent is the aggregation of its children. If we use the sum, a parent will be the sum of its children; that’s simple and intuitive. What is also important is that two children do not share any value (or row); their intersection is always empty. Let’s say no shopping carts could have two products ‘Banana’ and ‘TV’.

If the relation is a many-to-many, as in products – shopping carts, this principle is broken. Not only a parent is no longer the sum of its children, but also the intersection of two children might not be empty. There are shopping carts with ‘Bananas’ and ‘TVs’. We will have multiple shopping carts with two different products and with the number of shopping carts no longer the sum of the shopping carts for each product.

If you’re not used to MDX, you might be saying, yes and what’s the problem?

MDX was mainly built to handle simple relations, so set operations between different members are not properly supported. For people used to RDBM, there is no equivalent of SQL join clauses. For example, filtering all shopping carts that have ‘Bananas’ without ‘TVs’ is complicated or just impossible. Bear in mind that we can have an already complex MDX query with calculated members over which we might apply this condition.

How could you solve this with icCube?

In order to support this kind of operation, we introduced in our latest icCube version a set of functions – SubCube ( SubCubeUnion ,SubCubeIntersect … ):

SubCubeMinus( [Products].[Bananas], [Products].[Tvs] )

These will create a SubCube with all shopping carts that have ‘Bananas’ without ‘TVs’.

It’s a lot for a single blog post. Stay tuned for the second and third part.

By David Alvarez