How to Determine the Granularity of Unfamiliar Data by Brute Force

梁竹西

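In practice we sometimes need to determine the granularity of an unfamiliar dataset, that is, which columns jointly identify one row. The R function below solves this by brute-force deduplication (dedup), which makes the dimension variables surface.
The logic is to drop the columns one by one and watch the number of unique rows. If the count is unchanged after dropping a column, that column is not a dimension variable (it is redundant given the remaining columns). If the count falls, the column is a dimension variable and is kept.
In practice, run the function several times. Each call tests the columns in a fresh random order, so repeated runs let equivalent sets of dimension variables surface as well (for example, Policy ID may be equivalently represented by a combination of other IDs).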
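
To make the idea concrete, here is a minimal sketch on a made-up table. The column names `policy_id`, `year` and `region` are hypothetical and chosen only for illustration. In this toy data `region` is fully determined by `policy_id`, so dropping it leaves the number of unique rows unchanged, while dropping `policy_id` or `year` collapses rows.

```r
library(data.table)

# Hypothetical toy data: one row per policy_id and year;
# region depends only on policy_id (A and B share region "N")
toy = data.table(
  policy_id = c("A", "A", "B", "B", "C", "C"),
  year      = rep(c(2020, 2021), times = 3),
  region    = c("N", "N", "N", "N", "S", "S")
)

n_all         = uniqueN(toy)                          # all columns
n_drop_region = uniqueN(toy[, .(policy_id, year)])    # region removed
n_drop_year   = uniqueN(toy[, .(policy_id, region)])  # year removed
n_drop_policy = uniqueN(toy[, .(year, region)])       # policy_id removed

print(c(all = n_all, drop_region = n_drop_region,
        drop_year = n_drop_year, drop_policy = n_drop_policy))
```

Only the count without `region` matches the full count, so `region` is not a dimension variable here, and `policy_id` plus `year` define the granularity.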

granularity_cracker = function(dat) {
  # load required packages (library() stops with an error if one is missing)
  library(data.table)
  library(stringr)
  # coerce to data.table so plain data.frames also work with the subsetting below
  dat = as.data.table(dat)
  # number of unique rows of the full data
  nrows_target = uniqueN(dat)
  print(str_glue('Number of unique rows of the data: ', nrows_target))
  print(str_glue('-----------------------------------'))
  # test the columns in random order, removing one column at a time
  cols = sample(names(dat))
  for (cc in cols) {
    print(str_glue('*** testing ', cc))
    remaining = cols[cols != cc]
    nrows = uniqueN(dat[, remaining, with = FALSE])
    print(str_glue("    ", nrows, " vs target: ", nrows_target))
    # if the count is unchanged without this column, it is redundant: drop it for good
    if (nrows == nrows_target) { cols = remaining }
  }
  print(str_glue('-----------------------------------'))
  # the surviving columns jointly identify a unique row
  return(sort(cols))
}
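
The advice to run the function several times can be automated. The sketch below is one possible way to do it, not part of the original function: `minimal_key` is a hypothetical quiet variant of the same procedure (no printing), and the toy columns `policy_id`, `branch`, `seq_no` and `year` are invented so that `policy_id` is equivalent to the combination of `branch` and `seq_no`.

```r
library(data.table)

# Hypothetical quiet variant of the same brute-force procedure:
# returns the surviving columns for one random test order
minimal_key = function(dat) {
  dat = as.data.table(dat)
  n_target = uniqueN(dat)
  cols = sample(names(dat))
  for (cc in cols) {
    remaining = setdiff(cols, cc)
    # drop cc only if the remaining columns still give the same unique-row count
    if (length(remaining) > 0 &&
        uniqueN(dat[, remaining, with = FALSE]) == n_target) {
      cols = remaining
    }
  }
  sort(cols)
}

# Hypothetical toy data: policy_id maps one-to-one to (branch, seq_no)
toy = data.table(
  policy_id = rep(c("P1", "P2", "P3", "P4"), each = 2),
  branch    = rep(c("X", "X", "Y", "Y"), each = 2),
  seq_no    = rep(c(1, 2, 1, 2), each = 2),
  year      = rep(c(2020, 2021), times = 4)
)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
runs = replicate(20, paste(minimal_key(toy), collapse = " + "))
key_sets = sort(unique(runs))
print(key_sets)
```

Every run ends in one of two column sets, `policy_id + year` or `branch + seq_no + year`, depending on which ID happens to be tested first. A single run would show only one of them, which is why repeating the shuffle matters.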