TL;DR

In this post we will go through, in a practical way, everything I have learned over the past few months about CodeQL for Python. I hope you like it as much as I do! :)

Learning resources
Environment SetUp
- Remote queries
- - Automation
- Local queries
Concepts
- Source
- - RemoteFlowSource
- - Source in Regular Expression injection query
- - Source in LDAP Injection query
- - Source in XXE query
- Sink
- - Sink in Regular Expression injection query
- - Sink in LDAP Injection query
- - Sink in XXE query
- Taint tracking configuration predicates
- - Additional taint steps
- - Sanitizers
- - Concepts ¿again?
Query development
- Basic approaches
- Codebase distribution
- Modeling
- - Concepts
- - Frameworks/Libraries
- - Taint tracking configuration and query-specific modeling
- - - Basic example of a taint tracking configuration
- Tests
- - Basic tests
- - Advanced tests
- Documentation (qhelp and qldocs)
- Submission
Advanced query modeling
- Regular Expression injection
- - Concepts
- - re library modeling
- - Taint tracking configuration
- - Complete query
- LDAP Injection
- - Concepts
- - ldap library modeling
- - - ldap2
- - - ldap3
- - - Everything together
- - Taint tracking configuration
- - Complete query
Bonus exercises
The end

Learning resources

CodeQL documentation (link)
GitHub Security Lab (link):
- C/C++: Apple XNU Kernel: Finding a memory exposure vulnerability with CodeQL (CVE-2017-13782)
GitHub Learning Lab:
- JS and C/C++
GitHub YouTube channel (sorted by difficulty and learning quality):
- Java: Finding security vulnerabilities in Java with CodeQL - GitHub Satellite 2020
- C/C++: CodeQL Live Episode 1
- C/C++: Security: Workshop 2 - Finding security vulnerabilities in C/C++ with CodeQL
- JS: Finding security vulnerabilities in JavaScript with CodeQL - GitHub Satellite 2020
- Java: Variant analysis to find SQL injection using CodeQL - CVE-2019-6986
- General: Community-powered security analysis with CodeQL - GitHub Universe 2020
C/C++: Make Memcpy Safe Again: CodeQL
C/C++: CVE-2017-13782: CodeQL Study Note
Tutorial: [Live Stream] CodeQL Code Scanning Language Tutorial
Java: $3,000 CodeQL query for finding LDAP Injection - Github Security Lab - Hackerone

Environment SetUp

So that you can try out the examples in this post, this section explains what LGTM is and how to set up a working CodeQL environment to run the queries on your end.

Remote queries

LGTM.com is a website that hosts the lgtm.com branch of github/codeql and provides an online CodeQL editor, which lets you run any CodeQL snippet against the core CodeQL libraries.

Run an example!

As you can see, it lets you select several projects to run the query on (you can also create custom lists) and it displays the results nicely. The previous example returns just a string, but the output of a query that uses @kind path-problem (query metadata) and DataFlow::PathGraph is much prettier:

Run an example!

This post will link to LGTM whenever there’s a CodeQL snippet whose behaviour is worth seeing in action.

Automation

The existence of a cloud-based CodeQL “instance” opens up a wide range of automation ideas. Aggressive automation clearly goes against the LGTM ToS, so use this information at your own risk.

gagliardetto/lgtm-cli and JLLeitschuh/lgtm_hack_scripts let you follow repos (for them to be built by LGTM) based on GitHub API search or dependency network, create custom lists, and query already-built projects.

This automation helps measure the impact and precision of a query, and lets you provide results for bounty submissions, if any (see #submission).

Local queries

This is the way I’d recommend to run queries and play with them. Let’s start!

Clone jorgectf/codeql inside an empty folder.
Open the empty folder with VSCode.
Install the CodeQL extension.
Check out the Practical-CodeQL-Introduction branch:
- Open a terminal (Terminal > New Terminal) and run (cd codeql/ && git checkout Practical-CodeQL-Introduction).
- OR
- Go to the Source Control pane, click main and choose Practical-CodeQL-Introduction.
Go to the Testing pane, expand codeql > python / ql / test > experimental > query-tests > Security > Practical-CodeQL-Introduction and click the “play”/“run” button.
Once the tests have finished (they will fail on purpose, because the results don’t match those in the .expected file), a CodeQL database should have been created.
Go to the CodeQL pane, click Add a CodeQL database: From a Folder and choose codeql/python/ql/test/experimental/query-tests/Security/Practical-CodeQL-Introduction/Practical-CodeQL-Introduction.testproj.
Find a file called query.ql inside codeql/python/ql/src/experimental/Security/Practical-CodeQL-Introduction/.
You are ready to go! Feel free to run any query inside query.ql by writing the desired code and running it (Right Click > CodeQL: Run Query). You can also run a specific snippet by selecting it and choosing Right Click > CodeQL: Quick Evaluation.

If the CodeQL CLI (the binary that runs everything CodeQL-related) doesn’t get installed, head to Extensions > CodeQL > Extension Settings, find Code QL › Cli: Executable Path, type a random string like “a” into the input field, click outside the field (so that VSCode updates the value) and then remove what you typed. You should then see a VSCode notification saying that the CodeQL CLI is being installed.

Concepts

To fully understand the upcoming points about query development, we need to look at a few concepts (some of which you may already know, but here focused on CodeQL).

Source

We may understand a “source” as the very first appearance in the code of the data whose flow we want to follow. For example, a source could be user input or a hardcoded string (matching a specific form), and we will sometimes refer to it as “tainted” data (the term shows up in names such as TaintTracking::Configuration, a class that lets us specify and customize the source, sink and several other parts of a flow configuration).
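
To make this more concrete, here is a minimal sketch of the kind of Python code a query would analyze. The names find_unsafe, find_safe and user_pattern are hypothetical, and the parameter only stands in for a real source such as user input: in the first function the data reaches re.search() untouched (the kind of place a regular expression injection query would treat as a sink), while in the second one it goes through re.escape() first.

```python
import re


def find_unsafe(user_pattern: str, text: str) -> bool:
    # `user_pattern` stands in for a source: the data whose flow we follow.
    # It reaches re.search() unchanged, so its metacharacters are interpreted.
    return re.search(user_pattern, text) is not None


def find_safe(user_pattern: str, text: str) -> bool:
    # re.escape() neutralizes regex metacharacters before the data is used,
    # so the pattern can only match itself literally.
    return re.search(re.escape(user_pattern), text) is not None


print(find_unsafe(".*", "anything"))  # True: ".*" is interpreted as a regex
print(find_safe(".*", "anything"))    # False: ".*" is matched literally
```

Following the value of user_pattern from the function's entry to the re.search() call is exactly the kind of flow that a taint tracking configuration describes.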

RemoteFlowSource

Since most security-related queries check whether user input flows into a specific part of the code (e.g., a function’s argument), CodeQL introduced a structure (see #concepts-again) that collects every user input so that query developers don’t have to worry about it. (Since CodeQL is under active development, some frameworks may not be covered yet, but the goal of this structure is to hold as many user-input-providing functions as possible.)

import python
import semmle.python.dataflow.new.RemoteFlowSources

from RemoteFlowSource rfs // create a 'rfs' variable of type RemoteFlowSource
select rfs // return all of its appearances
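
As an illustration of the kind of code this class is meant to catch, here is a small sketch that uses only Python's standard library. GreetingHandler and name_from_path are hypothetical names, and whether a given CodeQL version actually reports self.path as a RemoteFlowSource depends on its standard-library modeling, so treat that as an assumption and check it with the query above.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse


def name_from_path(path: str) -> str:
    # Pull the `name` query parameter out of a request path.
    query = parse_qs(urlparse(path).query)
    return query.get("name", ["anonymous"])[0]


class GreetingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # `self.path` is supplied by the remote client, so this is the
        # user-controlled data a RemoteFlowSource is meant to represent
        # (assumption: the library models http.server in your version).
        name = name_from_path(self.path)
        body = f"Hello, {name}!".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(name_from_path("/greet?name=alice"))  # alice
```

The handler is only defined here, not served, so the snippet runs without opening a port.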
