How to design an index with infinite columns support

0

Hi all, I'm looking for some advices about how to design indexing for documents with "unlimited" fields. We have an excel like application, users can create their own table and add unlimited columns and rows. Typical user can have hundreds of columns and tens of thousands of rows per table. If we have an index per user account then the data is indexed with this format { "id": 1, "table1_col1_name": "", "table1_col2_location": "", "table2_col1_name": "", "table2_col2_phone": "", ... }. This results in having hundreds of mapped fields and quickly hit the Limit of total fields [1000] in index [ap-index] has been exceeded error as expected. If we use table level indexing, still we have the same scalability issue.

How can we design an indexing strategy that can support unlimited columns and also has less number of indexes as much as possible? It should also support all feature listed in constraints below.

What'd the best practices in this scenario?

Constraint: We want to use almost all features of OpenSearch because we have a very flexible search tool, i.e. multi match, term, wildcard, bool_prefix, search as you type, geo distance, aggregation, sorting, pagination etc.

Potential solution: I'm thinking to have a field to keep all columns in an array but I'm not sure if this limits querying or aggregation somehow. For example the following document:

{
  "table1_col1_0_address": {
    "addressLine1": "Maecenas tempus mauris"
  },
  "table1_col1_0_autoid": {
    "value": 4028,
    "display": "0004028"
  },
  "table1_col1_0_email": {
    "email": "abdulhalim1@assetpanda.com"
  },
  "table1_col1_0_gps_location": "1,1",
  "table1_col1_0_ip": "1.1.1.1",
  "table1_col1_0_colours": [
    "red"
  ],
  "table1_col1_0_deleted_time": "15:00:00.000+00:00",
  "table1_col1_0_updated_date": "2024-01-04",
  "table1_col1_1_added_datetime": "2024-01-01T00:00:00.000Z",
  "id": "01HPJ56KXAFN7WXF7YGF11STCF"
}

Would be:

{
  "id": "01HPJ56KXAFN7WXF7YGF11STCF",
  "columns": [
    {
      "name": "table1_col1_0_address",
      "value": {
        "address": {
          "addressLine1": "Maecenas tempus mauris"
        }
      }
    },
    {
      "name": "table1_col1_0_autoid",
      "value": {
        "autoid": {
          "value": 4028,
          "display": "0004028"
        }
      }
    },
    {
      "name": "table1_col1_0_email",
      "value": {
        "email": {
          "address": "ap@example.com"
        }
      }
    },
    {
      "name": "table1_col1_0_gps_location",
      "value": {
        "gps": "1,1"
      }
    },
    {
      "name": "table1_col1_0_deleted_time",
      "value": {
        "time": "15:00:00.000+00:00"
      }
    },
    ...
  ]
}

Example query:

GET aps-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "columns",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "columns.name.keyword": "table1_col1_0_email"
                    }
                  },
                  {
                    "term": {
                      "columns.value.email.address": "ap@example.com"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "columns",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "columns.name.keyword": "table1_col1_0_autoid"
                    }
                  },
                  {
                    "range": {
                      "columns.value.autoid.value": {
                        "gte": 10,
                        "lte": 200000
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "nested_aggs": {
      "nested": {
        "path": "columns"
      },
      "aggs": {
        "filter_by_column_name": {
          "filter": {
            "term": {
              "columns.name.keyword": "table1_col1_0_autoid"
            }
          },
          "aggs": {
            "group_by": {
              "terms": {
                "field": "columns.value.autoid.value",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

This is an example index mapping:

{
  "aps-index": {
    "aliases": {},
    "mappings": {
      "properties": {
        "columns": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "value": {
              "properties": {
                "address": {
                  "properties": {
                    "addressLine1": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "autoid": {
                  "properties": {
                    "value": {
                      "type": "long"
                    },
                    "display": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "date": {
                  "type": "date"
                },
                "phone": {
                  "properties": {
                    "countryCode": {
                      "type": "long"
                    },
                    "number": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        },
        "id": {
          "type": "long"
        }
      }
    },
    "settings": {
      "index": {
        "replication": {
          "type": "DOCUMENT"
        },
        "number_of_shards": "5",
        "provided_name": "aps-index",
        "number_of_replicas": "1"
      }
    }
  }
}

This seems working but I wonder what would be the cons of this approach when compared to the previous document schema? One downside of this is that the queries are more complex than the flat structure. Is there any other cons, i.e. performance, scalability etc.?

Thanks in advance!

PS: I posted the same question on OpenSearch forum, posting here for broader reach.

Halim
asked a month ago176 views
1 Answer
1
  • If extreme column scalability is your paramount concern: The nested structure is the clear winner. Be prepared for the trade-off in query complexity.
  • If your queries are predominantly simple, with less frequent use of complex aggregations: The nested structure might still be a good option, as the performance overhead might be acceptable.
  • If query performance and simplicity are top priorities, but column scalability is somewhat flexible: Consider alternatives discussed below.

Alternative Solutions and Optimizations

  1. Index a fixed set of frequently used columns as top-level fields, and store additional, less-queried columns within the nested structure. This strikes a balance between simplicity and scalability.
  2. Maintain a separate store describing column names, data types, etc. This can assist in dynamically constructing queries and interpreting results.
  3. If many columns in a row are often empty, investigate techniques to optimize indexing and reduce index size.
  4. Consider a more radical shift to an EAV model if extreme schema flexibility is required. However, this has significant implications on query complexity and performance.
profile picture
EXPERT
answered a month ago
profile picture
EXPERT
reviewed a month ago
  • Hi Sedat, thank you for the answer. Yes, column scalability is a must for now, we don't want to limit the users adding "unlimited" columns as long as we have a "doable" OpenSearch solution (also open to other databases if any better alternative for this use case). I'm aware of query complexity but I'm not sure if all queries will still be possible in nested structure? Our queries are not too complex, usually we need to search by keywords in all fields or selected fields. We also use aggregation, sorting and pagination but they seem to be "simple" for now. Of course, we might need more complex queries later, that's my concern, if there is any limitations when using nested structure?

    In terms of performance, probably I know it depends on many factors, but for a typical OpenSearch cluster, do you think there'd be a significant difference? If number matters, let's say we have a few tens of millions of records (less than 200M), each ~4KB, with tens of columns, and there are 10K tenants. BTW this is a multi-tenant application.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions