SAS

A Hash Approach to Lookup Table in SAS

Posted by Qing on May 5, 2018

A follow up for my paper in PharmaSUG China 2017

Table lookup, 又称表查找,是一种常用的数据整合方法。SAS内提供了很多方法来执行表查找操作,例如使用if-then语句,format语句,merge语句,PROC SQL等等。这里介绍一种通过哈希实现的表查询方法,不仅快速,而且简洁方便。

什么是表查找?

熟悉数据分析或者数据库的人应该不会陌生,所谓的表查找通常指的是通过一定的键组合,在另外的数据表里面查找对应的值。
比如说一个表里面只有每个省的简称,例如湘,但是并不知道每个省的全称。省的全称存储在另外一个表中,这时候我们可以通过省的简称当做键值,通过表查找的方法从另外一个表中取得全称
用文字描述可能有点抽象,具体看下图。

常用表查找方法

  1. if-then statement
    最简单的方法莫过于用if-then语句直接指定了。
      data states;
       if state = 'Virginia' then capital = 'Richmond';
       else if state = 'Georgia' then capital = 'Atlanta';
       ...
    

    但是这个方法不是很灵活,一旦数据更新,程序便要修改。

  2. Merge statement
    另外一个常见的操作是通过merge语句,通过键,将数据合并过来。
      proc sort data = StatesCapital; by state;run;
      proc sort data = States; by state;run;
      data new;
       merge StatesCapital States;
       by state;
      run;
    

    这个语句大概是每个SAS程序员最常用的方法了。但是,用merge比较烦人的一点,每次在数据合并之前,都要按键对数据进行排序。

  3. PROC SQL
    高级的SAS程序员可能会用SQL实现,代码如下
      proc sql;
       create table new as
       select a.state, b.capital
       from States as a, StatesCapital as b
       where a.state = b.state;
      quit;
    

    我个人是非常喜欢这种写法的,这样的写法既不用提前排序,也比较直观。

以上是SAS里面比较经典的表查找方法,也是程序里面用得比较多的方法,其他例如format之类的就不一一例举了。


哈希实现地表查找方法

喜欢折腾的人可能会用如下的方法

data new;
    if 0 then set work.StatesCapital;
    if _n_ = 1 then do;
        declare hash hh (dataset: "work.StatesCapital");
        hh.definedata("capital");
        hh.definekey("state");
        hh.definedone();
    end;
    set States;
        rc =hh.find();
        if rc = 0 then capital = capital;
run;

如果不懂怎么使用SAS中的哈希表,可以参见A Hands-on Introduction to SAS® DATA Step Hash Programming Techniques

在data步里面嵌入Hash语句,实现非常高效的表查找。这样的写法对于广大SAS用户来说,确实有点难以接受。一是因为其逻辑跟一般的SAS编程不一样,二是因为需要的语句较多,写起来费劲,维护起来也难。至于对于提高的那一丁点效率。大多数人是不在乎的,毕竟行业的数据量就那么多,少个0.1秒差别不是很大。

那么,有什么改进的办法呢? 在介绍改进方法之前,先简要介绍一下哈希表算法。

哈希表算法

散列表(Hash table,也叫哈希表),是根据关键码值(Key:value)而直接进行访问的数据结构。也就是说,它通过把关键码值映射到表中一个位置来访问记录,以加快查找的速度。通过哈希表,查找效率可以提升至

哈希表类似于字典里面的索引,通过索引我们能快速找到我们要查找的单词。也可以将哈希表理解成文章的摘要,通过摘要我们能快速查找到对应的文章。

哈希表实现的自定义函数

上文中,在DATA步中使用哈希表实现表查找的方法,有点笨拙,恐怕大多数SAS用户无法接受这样的写法。那么,有什么办法可以改进呢?

从SAS9.2开始,自定义函数模块PROC FCMP里可以嵌套哈希表,从而实现自定义的表查找函数,具体的实现如下。

首先我们自定义一个SAS函数,将哈希表查找嵌入函数中。
  proc fcmp outlib=work.functions.samples;
      * 这里定义函数的名字,以及表查找的key(同时需要制定key的类型)。
      注意如果返回值是字符型变量,则需要制定返回值长度以及类型(用$表示);
      function hash_fcmp(state $) $25;
      * 指定查找表的位置;
      declare hash hh(dataset: "work.StatesCapital");
      * 指定查找返回的变量;
      rc=hh.definedata("capital");
      * 指定查找的键
      rc=hh.definekey("state");
      rc=hh.definedone();
      * 如果找到键对应的值,则返回值,否则返回指定的值;
      rc=hh.find();
      if rc eq 0 then return(capital);
      else return('Not Found');
      endsub;
  quit;
  * 指定SAS调用自定义函数的位置;
  options cmplib=work.functions;
然后便可以在DATA步或者SQL中直接调用
  * In Data Step;
  data new;
      set states;
      capital = hash_fcmp(state);
  run;

  * In Proc SQL;
  proc sql;
      create table new as select *,hash_fcmp(state) as capital
      from states;
  quit;

这样一来,复杂的哈希语句就被嵌套在一个SAS自定义函数当中,使用者只需简单的调用这个函数,便可实现表查找操作。

下表给出了在PROC FCMP中调用HAHS的基本语法

Method Syntax Description
DECLARE DECLARE hash object-name Create a new instance of hash object, create at parse time.
DEFINEKEY rc = object.DEFINEKEY(‘key 1’,’key n’) Set up key variables for hash object
DEFINEDATA rc = object.DEFINEDATA(‘dataset 1’,’dataset n’) Define data to be stored in hash object
DEFINEDONE rc = object.DEFINEDONE() Indicate the key and data specification is completed
DELETE rc = object.DELETE() Delete the hash object and free any resources allocated
FIND rc = object.FIND(‘key1’,’keyn’) Search a hash object based on the values of defined key, If look up is successful, defined data variable are updated
CHECK rc = object.CHECK() Search a hash object based on the values of defined key, data will not be updated whether If look up is successful
NUM_ITEMS rc = object. NUM_ITEMS() Return the number of items in hash object
ADD rc = object.ADD(key: value1, key: value n) Add data with associated key to hash object
REMOVE rc = object.REMOVE(key: value1, key: value n) Remove data with associated key to hash object

具体的使用方法可以参阅这篇文章:Hashing in PROC FCMP to Enhance Your Productivity

为什么要使用哈希表实现的自定义函数?

原因很简单,因为这种用法很简单、高效!

实践

前面讲了这么多原理,下面讲一个有意思地应用。通过自定义哈希表函数,实现一些平时在SAS中不太容易实现的操作。

下表(CharNum)存放着26个字母和每个字符对应的数值。

char val
a 1
b 2
c 3
d 4
z 26

给定一个26个字母组成的字符串,如何求字符串对应的数值的总和?例如,字符串’abc’对应的数值为:1(a) + 2(b) + 3(c) = 6

SAS实现

要解决这个问题,普通的SAS实现方法有一定的难度。但是通过哈希表自定义函数,可以很快地解决这个问题。

首先定义一个哈希函数。

proc fcmp outlib=width.functions.GetNum;
    function GetNum(char $);
    * Define Hash Table with Character Width Dataset;
    declare hash Calculate(dataset: "work.CharNum");
    rc = Calculate.defineKey("char");
    rc = Calculate.defineData("val");
    rc = Calculate.defineDone();

    * Retrieve Data from Hash and Sum up;
    rc = Calculate.find();
    if rc eq 0 then return(val);
    endsub;
run;

这个函数用来得到每个字母对应的数值。

在定义好函数之后是,通过直接调用这个函数,便可得到字符串值。

data _null_;
    string = 'abc';
    num = 0;
    do i = 1 to lengthn(string);
        char = substr(string,i,1);
        num = num + GetNum(char);
    end;
    put num =;
run;
>> 6

你甚至可以将循环嵌入自定义函数中,这样在data步内就不需要使用循环便可直接得到结果。

proc fcmp outlib=width.functions.GetNum;
    function GetNum(char $);
    * Define Hash Table with Character Width Dataset;
    declare hash Calculate(dataset: "work.CharNum");
    rc = Calculate.defineKey("char");
    rc = Calculate.defineData("val");
    rc = Calculate.defineDone();
    val_tot = 0;
    * Retrieve Data from Hash and Sum up;
    do i = 1 to lengthn(string);
        char = substr(string,i,1);
        rc = Calculate.find();
        if rc eq 0 then val_tot = val_tot+val;
    end;     
    return(val_tot);
    endsub;
run;

在DATA步中直接调用

data _null_;
    string = 'abc';
    num = GetNum(string);
    put num =;
run;
>> 6

因为SAS中缺少一些高级的数据结构,因此上面借助一个哈希函数来解决这个问题。在Python中字典属于一种自带哈希功能的高级数据结构,通过字典可以快速解决上述问题。
下面给出Python中的实现方式,大家可以体会到Python的强大之处。

HashTable = {'a':1, 'b':2, ..., 'z':26}
num = 0
string = 'abc'
for char in string:
    num += HashTable[char]
print(num)
>> 6

总结

通过在PROC FCMP中嵌入Hash进行表格查找并不是简单地提高查询效率,而是为了简化日常工作,增强程序可读性。另外,这种新的写法一定程度上也扩展了SAS的功能。
有兴趣的可以参阅一下文档,了解更多。
Hashing in PROC FCMP to Enhance Your Productivity
Load a SAS data set into a Hash Object using PROC FCMP
PROC FCMP and DATA Step Component Objects

A follow up for my paper in PharmaSUG China 2017

SAS provides many ways to perform table lookups operation, such as using if-then statement in DATA step, or format statement, or merge statement, and even PROC SQL. Here, I’ll introduce a new approach to look-up table which utilizes the HASH function. It’s not only efficient but also simple and convenient.

What is table lookup?

Table lookup should be familiar to anyone with data analysis or database. The so-called table lookup usually refers to finding corresponding values in another data table through certain key combinations. For example, there is one table called A and there is only state ID in this dataset and there is another table called B which have both state ID and state capital, and I’d like to know the corresponding capital for each state in table A. How can we achieve this? The answer is table look up. The text description may be a little abstract, see the figure below.

Frequently used table lookup methods

  1. if-then statement The simplest way is the use the if-then statement directly.
    data states;
     if state = 'Virginia' then capital = 'Richmond';
     else if state = 'Georgia' then capital = 'Atlanta';
     ...
    

    However, this way is not very flexible as the program must be modified once data is updated.

  2. Merge statement Another common operation is to use the merge statement.
    proc sort data = StatesCapital; by state;run;
    proc sort data = States; by state;run;
    data new;
     merge StatesCapital States;
     by state;
    run;
    

    This statement is probably the most common method in SAS programmers daily life. A annoying part is that dataset should be sorted before merge operation.

  3. PROC SQL Advanced SAS programmers may use SQL
    proc sql;
     create table new as
     select a.state, b.capital
     from States as a, StatesCapital as b
     where a.state = b.state;
    quit;
    

    Personally, I prefer this way. In this way, there is no need to sort the data in advance, and it is also intuitive.

These examples cited above are some classic table lookup table method in SAS, and it is also the most often used in SAS program.


Hash way to table lookup

People who like tossing may use the following method

data new;
    if 0 then set work.StatesCapital;
    if _n_ = 1 then do;
        declare hash hh (dataset: "work.StatesCapital");
        hh.definedata("capital");
        hh.definekey("state");
        hh.definedone();
    end;
    set States;
        rc =hh.find();
        if rc = 0 then capital = capital;
run;

if you do not know the hash method, you can seeA Hands-on Introduction to SAS® DATA Step Hash Programming Techniques

Hash in data step is very efficient. But it’s a bit difficult to adult for the majority of SAS users. The first reason is that the program logic behind hash is not the same as general SAS programming, and the second is that more statements are required compared with merge or SQL and thus the program is harder to maintain.

As for the slightes improvement in efficiency, most people do not care.

So, how can we improve? let’s get familiar with the hash table algorithm before introducing the improved methods.

Hash Table algorithm

A HASH table is a data structure that enables you to accessed value quickly according to a key value. That is, it accesses records by mapping key values to a location in the table to speed up the search. Through the hash table, the search efficiency can be improved to

The Hash table is similar to the index in the dictionary. Through the index we can quickly find the word we are looking for.
You can also compare the hash table to the summary of the article. Through the summary, we can quickly find the corresponding article.

Hash table in FCMP

in the above, we use Hash in DATA step to implement table lookup. it’s a little clumsy, and I am afraid that most SAS users cannot accept such a wording. So, how can we improve it?

Starting from SAS9.2,Hash statement can be embedded in PROC FCMP to define a customed table lookup function. The specific implementation is as follows.

  • Firstly, we define a customized SAS function through FCMP, and then embeded the Hash statement in the function.
    proc fcmp outlib=work.functions.samples;
        * Define function name as well as the lookup key(the type of key needs to be specified at the same time). Notice that if the return value is character type, you need to specify the length and type of the return value(indicated by $);
        function hash_fcmp(state $) $25;
        * specify location of lookup table;
        declare hash hh(dataset: "work.StatesCapital");
        * specify data variable to be returned;
        rc=hh.definedata("capital");
        * specify the key
        rc=hh.definekey("state");
        rc=hh.definedone();
        * if find the corresponding value then return, otherwise return the specified value
        rc=hh.find();
        if rc eq 0 then return(capital);
        else return('Not Found');
        endsub;
    quit;
    
    * Don't forget to specify where SAS to call the function
    options cmplib=work.functions;
    
  • then we can use this function in data step or SQL directly ```sas
  • In Data Step; data new; set states; capital = hash_fcmp(state); run;

  • In Proc SQL; proc sql; create table new as select *,hash_fcmp(state) as capital from states; quit; ``` In this way, complex hash statements are nested within a SAS custom function. Users could simply invoke this function to perform table lookups.

The following table shows the basic syntax for calling HAHS in PROC FCMP

Method Syntax Description
DECLARE DECLARE hash object-name Create a new instance of hash object, create at parse time.
DEFINEKEY rc = object.DEFINEKEY(‘key 1’,’key n’) Set up key variables for hash object
DEFINEDATA rc = object.DEFINEDATA(‘dataset 1’,’dataset n’) Define data to be stored in hash object
DEFINEDONE rc = object.DEFINEDONE() Indicate the key and data specification is completed
DELETE rc = object.DELETE() Delete the hash object and free any resources allocated
FIND rc = object.FIND(‘key1’,’keyn’) Search a hash object based on the values of defined key, If look up is successful, defined data variable are updated
CHECK rc = object.CHECK() Search a hash object based on the values of defined key, data will not be updated whether If look up is successful
NUM_ITEMS rc = object. NUM_ITEMS() Return the number of items in hash object
ADD rc = object.ADD(key: value1, key: value n) Add data with associated key to hash object
REMOVE rc = object.REMOVE(key: value1, key: value n) Remove data with associated key to hash object

Why should we use Hash in FCMP?

The reason is simple, because it’s simplicity and efficiency!

Practice

here’s an straightforward and interesting application. By defining hash in FCMP, we can achieve some operations that are usually not easily implemented in SAS.

Assume that there are corresponding numeric value to each character.

char val
a 1
b 2
c 3
d 4
z 26

Given a string,you are required to sum the numeric value of this string。For example, the corresponding numeric value of ‘abc’ is:1 + 2 + 3 = 6

  • SAS implementation

we can call this function directly to get the numeric value.

data _null_;
    string = 'abc';
    num = 0;
    do i = 1 to lengthn(string);
        char = substr(string,i,1);
        num = num + GetNum(char);
    end;
    put num =;
run;
>> 6

and the customized function is beening defined like this

proc fcmp outlib=width.functions.GetNum;
    function GetNum(char $);
    * Define Hash Table with Character Width Dataset;
    declare hash Calculate(dataset: "work.CharNum");
    rc = Calculate.defineKey("char");
    rc = Calculate.defineData("val");
    rc = Calculate.defineDone();

    * Retrieve Data from Hash and Sum up;
    rc = Calculate.find();
    if rc eq 0 then return(val);
    endsub;
run;

you can even embed loops in the fuction to make it more simple.

proc fcmp outlib=width.functions.GetNum;
    function GetNum(char $);
    * Define Hash Table with Character Width Dataset;
    declare hash Calculate(dataset: "work.CharNum");
    rc = Calculate.defineKey("char");
    rc = Calculate.defineData("val");
    rc = Calculate.defineDone();
    val_tot = 0;
    * Retrieve Data from Hash and Sum up;
    do i = 1 to lengthn(string);
        char = substr(string,i,1);
        rc = Calculate.find();
        if rc eq 0 then val_tot = val_tot+val;
    end;     
    return(val_tot);
    endsub;
run;

and call it directly in data step

data _null_;
    string = 'abc';
    num = GetNum(string);
    put num =;
run;
>> 6

Because of the lack of some advanced data structures in SAS, it is a bit tricky to implement these functions. The following gives the implementation in Python, we can appreciate the power of Python.

HashTable = {'a':1, 'b':2, ..., 'z':26}
num = 0
string = 'abc'
for char in string:
    num += HashTable[char]
print(num)
>> 6

Summary

Table lookup by embedding Hash in PROC FCMP not only simply increase query efficiency, but also simplifies routine work and enhances program readability. In addition, this new style of writing also expands the functionality of the SAS to some extent. Interested people can refer to the documentation to learn more.
Hashing in PROC FCMP to Enhance Your Productivity
Load a SAS data set into a Hash Object using PROC FCMP
PROC FCMP and DATA Step Component Objects