Emacs align-regexp explained in detail with examples

By gniuk, isgniuk@gmail.com. Date: 2020-11-18 . Last updated: 2020-11-20 .

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Emacs comes with powerful and useful functions to align text to a specific column, by regexp.

The align command is a simplified front end that applies predefined rules for the current major mode. It does not always work well, so we focus on align-regexp here.

== Example ==

Take the following badly formatted code as an example.

#define    A    0x41
#define  WHAT_EVER  0x1BADB002
#define     ABCDE     0xabcde

We want it aligned as follows:

#define A         0x41
#define WHAT_EVER 0x1BADB002
#define ABCDE     0xabcde

== Howto ==

0. Recommended settings before you start:
     (setq align-to-tab-stop nil)    Evaluate with C-x C-e so that alignment columns are not forced onto tab stops.
     (defalias 'ar 'align-regexp)    Evaluate with C-x C-e so that M-x ar runs align-regexp.
    Or just put them in your Emacs init file.

1. Select the target lines.

2. C-u M-x align-regexp RET, then answer the four prompts with \(\s-*\)␣[A-Z], 1, 0, n

3. Select the lines again and run C-u M-x align-regexp RET, then answer with \(\s-*\)␣0, 1, 0, n

Or do the previous two steps in a single command:
C-u M-x align-regexp RET, then answer with \(\s-*\)␣[A-Z0], 1, 0, y
An even simpler one:
C-u M-x align-regexp RET, then answer with \(\s-*\)␣, 1, 0, y

Note: ␣ means exactly one space.

== Explain ==

First let's look at the function prototype of align-regexp.

C-h f align-regexp RET

(align-regexp BEG END REGEXP &optional GROUP SPACING REPEAT)

When used interactively with the prefix C-u, this function needs 4 args entered by the user in the minibuffer. For the input \(\s-*\)␣[A-Z], 1, 0, n they are:

REGEXP:  \(\s-*\)␣[A-Z]
GROUP:   1
SPACING: 0
REPEAT:  n

REGEXP describes the field to align on, and it determines which character gets aligned. Read on to find out how it works.

In our example, the pattern \(\s-*\)␣[A-Z] matches as follows (the part matched by \(\s-*\) is marked with - and the part matched by ␣[A-Z] with ^):

#define    A    0x41
       ---^^

#define  WHAT_EVER  0x1BADB002
       -^^

#define     ABCDE     0xabcde
       ----^^

The subexpression \(\s-*\) is entered automatically by Emacs when we invoke align-regexp with the prefix C-u. The text matched inside \(\) is where characters are inserted or deleted to achieve the alignment, and it is usually a run of whitespace. \s- means whitespace (space or tab), the same as \s␣, and the same as ␣ if there are no tabs. \s-* means zero or more whitespace characters.

So \(\s-*\) matches the part marked with - (the space field), and ␣[A-Z] matches the part marked with ^ (a space followed by an uppercase character).

Now let's talk about aligning. To align means to put the first character right after the \(\) subexpression in the same column on every line. In our case, this character is the space before the uppercase character, so these spaces on the different lines are what get aligned.

The second arg GROUP is the parenthesis group to modify. When invoked interactively, there is usually exactly one parenthesis group \(\), so just leave the default value 1. In some cases we need this to be -1; we will talk about that later when we do a right alignment.

The third arg SPACING is the number of spaces we want in the space field at its narrowest, i.e. on the line whose preceding text reaches furthest to the right. In our case we need all the spaces in the parenthesis group to be deleted, so we provide 0.

The fourth arg REPEAT says whether to repeat the match and alignment throughout the line. If we need exactly one match and alignment per line, we provide n. See the next section for more on REPEAT.
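
The four answers map directly onto the Lisp arguments of align-regexp, so the one-command variant from the Howto section can also be run from Lisp. The following is only a minimal sketch under some assumptions: my-align-demo is a hypothetical helper name, the text lives in a temporary buffer, indent-tabs-mode and align-to-tab-stop are bound to nil so that only spaces are inserted, and case-fold-search is bound to nil so that [A-Z0] is matched case-sensitively.

```elisp
(require 'align)

(defun my-align-demo ()
  "Align the example #define lines and return the buffer text.
The name `my-align-demo' is hypothetical, used only for this sketch."
  (with-temp-buffer
    (insert "#define    A    0x41\n"
            "#define  WHAT_EVER  0x1BADB002\n"
            "#define     ABCDE     0xabcde\n")
    (let ((indent-tabs-mode nil)   ; pad with spaces only
          (align-to-tab-stop nil)  ; do not snap columns to tab stops
          (case-fold-search nil))  ; keep [A-Z0] case-sensitive
      ;; Same answers as the interactive call with REGEXP, 1, 0, y.
      ;; REGEXP is a Lisp string here, so every backslash is doubled.
      (align-regexp (point-min) (point-max)
                    "\\(\\s-*\\) [A-Z0]" ; REGEXP
                    1                     ; GROUP
                    0                     ; SPACING
                    t))                   ; REPEAT (y)
    (buffer-string)))
```

Evaluating (my-align-demo) should return the three lines aligned as in the Example section.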

== Get a deep understanding of \(\) ==

Let's get a deeper understanding of \(\). Use the following variant of the REGEXP to get the same alignment done.

C-u M-x align-regexp RET, \(\s-+\)[A-Z0], 1, 1, y

Note that there is no space between ) and [ this time.
\s-+ means one or more whitespace characters, which is equivalent to \s-*␣ if there are no tabs. This REGEXP matches as follows (the \(\s-+\) part is marked with - and the [A-Z0] part with ^):

#define    A    0x41
       ----^----^

#define  WHAT_EVER  0x1BADB002
       --^        --^

#define     ABCDE     0xabcde
       -----^    -----^

The difference is in which space field is matched and which character is chosen for alignment. All spaces are now inside the \(\) group, and we want one space in the final result, so the third arg SPACING needs to be 1 instead of 0. If you need more spaces, just provide the value you want. The REGEXP matches two columns and both need to be aligned, so REPEAT is y.

Once you understand how the alignment works, the REGEXP can be simplified:
C-u M-x align-regexp RET, \(\s-+\), 1, 1, y
This aligns whatever character immediately follows each run of whitespace.
The regexp matches as follows (the matched part is marked with -). Notice the subtle difference in the matched part: the character after the whitespace is no longer included.

#define    A    0x41
       ---- ----

#define  WHAT_EVER  0x1BADB002
       --         --

#define     ABCDE     0xabcde
       -----     -----

== Align to the right ==

Now let's talk about alignment to the right side. Say we need the following alignment.

#define         A    0x41
#define WHAT_EVER    0x1BADB002
#define     ABCDE    0xabcde

We can do it like this: C-u M-x align-regexp RET, \(\s-+[A-Z_]+\), -1, 1, n .

\s-+ means one or more whitespace characters. [A-Z_]+ matches an uppercase word such as A or WHAT_EVER.

So this regexp matches as follows (the matched group is marked with - and the character right after it with ^):

#define    A    0x41
       -----^
#define  WHAT_EVER  0x1BADB002
       -----------^
#define     ABCDE     0xabcde
       ----------^

Now we set the arg GROUP to -1, which means justify. According to the documentation in the source code, justify means that non-whitespace characters in the group are NOT deleted; only the initial whitespace in the group is inserted or deleted. The character after the last uppercase character is chosen as the alignment character (the char right after \(\)), i.e. the space marked with ^. With GROUP = -1, spaces are inserted or deleted to the left of the first uppercase character to achieve the alignment. In this way the words are right-aligned.

We can match more characters before \(\), e.g. .*␣\(\s-*[A-Z_]+\), but this is not necessary in our case. If the REGEXP also matches fields that are not our targets, it should be extended to match more characters so that the target fields can be told apart.

Afterwards, use C-u M-x align-regexp RET, \(\s-*\)0x, 1, 4, n to left-align the 0x part, or C-u M-x align-regexp RET, \(\s-*0x\), -1, 4, n to right-align it.
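
The right alignment and the follow-up left alignment of the 0x part can be chained from Lisp as well. This is a hedged sketch, not a definitive recipe: my-right-align-demo is a hypothetical name, and the bindings of indent-tabs-mode, align-to-tab-stop and case-fold-search are assumptions made so that the result does not depend on the user's configuration.

```elisp
(require 'align)

(defun my-right-align-demo ()
  "Right-align the macro names, then left-align the 0x values.
The name `my-right-align-demo' is hypothetical, used only for this sketch."
  (with-temp-buffer
    (insert "#define    A    0x41\n"
            "#define  WHAT_EVER  0x1BADB002\n"
            "#define     ABCDE     0xabcde\n")
    (let ((indent-tabs-mode nil)   ; pad with spaces only
          (align-to-tab-stop nil)  ; do not snap columns to tab stops
          (case-fold-search nil))  ; keep [A-Z_] case-sensitive
      ;; GROUP = -1 asks for justification: only the leading whitespace
      ;; of the group is adjusted, so the names end in the same column.
      (align-regexp (point-min) (point-max)
                    "\\(\\s-+[A-Z_]+\\)" -1 1 nil)
      ;; Second pass: left-align the 0x part with SPACING = 4.
      (align-regexp (point-min) (point-max)
                    "\\(\\s-*\\)0x" 1 4 nil))
    (buffer-string)))
```

Evaluating (my-right-align-demo) should give the right-aligned layout shown at the start of this section.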

== More practice ==

①

struct stat64 {
        unsigned long long st_dev;      /* Device.  */
        unsigned long long st_ino;      /* File serial number.  */
        unsigned int    st_mode;        /* File mode.  */
        unsigned int    st_nlink;       /* Link count.  */
        unsigned int    st_uid;         /* User ID of the file's owner.  */
        unsigned int    st_gid;         /* Group ID of the file's group. */
        unsigned long long st_rdev;     /* Device number, if device.  */
        unsigned long long __pad1;
        long long       st_size;        /* Size of file, in bytes.  */
        int             st_blksize;     /* Optimal block size for I/O.  */
        int             __pad2;
        long long       st_blocks;      /* Number 512-byte blocks allocated. */
        int             st_atime;       /* Time of last access.  */
        unsigned int    st_atime_nsec;
        int             st_mtime;       /* Time of last modification.  */
        unsigned int    st_mtime_nsec;
        int             st_ctime;       /* Time of last status change.  */
        unsigned int    st_ctime_nsec;
        unsigned int    __unused4;
        unsigned int    __unused5;
};

Mark all the lines inside {} before running align-regexp. (Tip: if you use Evil, vi{; if you use expand-region, M-x er/mark-inside-pairs.)

C-u M-x align-regexp RET, \(\s-*\)␣[s_], 1, 0, n

C-u M-x align-regexp RET, \(\s-*\)␣/, 1, 4, n

struct stat64 {
        unsigned long long st_dev;         /* Device.  */
        unsigned long long st_ino;         /* File serial number.  */
        unsigned int       st_mode;        /* File mode.  */
        unsigned int       st_nlink;       /* Link count.  */
        unsigned int       st_uid;         /* User ID of the file's owner.  */
        unsigned int       st_gid;         /* Group ID of the file's group. */
        unsigned long long st_rdev;        /* Device number, if device.  */
        unsigned long long __pad1;
        long long          st_size;        /* Size of file, in bytes.  */
        int                st_blksize;     /* Optimal block size for I/O.  */
        int                __pad2;
        long long          st_blocks;      /* Number 512-byte blocks allocated. */
        int                st_atime;       /* Time of last access.  */
        unsigned int       st_atime_nsec;
        int                st_mtime;       /* Time of last modification.  */
        unsigned int       st_mtime_nsec;
        int                st_ctime;       /* Time of last status change.  */
        unsigned int       st_ctime_nsec;
        unsigned int       __unused4;
        unsigned int       __unused5;
};

--------------------------------------------------------------------------------

②

my @primes = (
    1,2,3,5,7,
    11,13,17,19,23,
    29,31,37,41,43,
);

C-u M-x align-regexp RET, ,\(\s-*\)[0-9], 1, 1, y

my @primes = (
    1,  2,  3,  5,  7,
    11, 13, 17, 19, 23,
    29, 31, 37, 41, 43,
);

C-u M-x align-regexp RET, \([0-9]+,\), -1, 1, y

my @primes = (
      1,  2,  3,  5,  7,
     11, 13, 17, 19, 23,
     29, 31, 37, 41, 43,
);

--------------------------------------------------------------------------------

③

California 423,970 km²
Taiwan 36,008 km²
Japan 377,944 km²
Germany 357,021 km²
Iraq 438,317 km²
Iran 1,648,195 km²
Korea (North+South) 219,140 km²
Mexico 1,964,375 km²

Any one of the following four commands gives the result below:

C-u M-x align-regexp RET, \(\s-*␣[0-9,]+\), -1, 1, n

C-u M-x align-regexp RET, \(\s-+[[:digit:],]+\), -1, 1, n

C-u M-x align-regexp RET, .*␣\(\s-*[0-9,]+\), -1, 0, n

C-u M-x align-regexp RET, .*\(\s-*␣[0-9,]+\s-*\).*, -1, 1, n

California            423,970 km²
Taiwan                 36,008 km²
Japan                 377,944 km²
Germany               357,021 km²
Iraq                  438,317 km²
Iran                1,648,195 km²
Korea (North+South)   219,140 km²
Mexico              1,964,375 km²
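
The second command of this exercise can be tried from Lisp on a shortened version of the table. This is a minimal sketch under assumptions: my-area-align-demo is a hypothetical name, only three of the rows are used to keep it short, and indent-tabs-mode and align-to-tab-stop are bound to nil so that only spaces are inserted.

```elisp
(require 'align)

(defun my-area-align-demo ()
  "Right-align the area figures in three rows of the table.
The name `my-area-align-demo' is hypothetical, used only for this sketch."
  (with-temp-buffer
    (insert "Taiwan 36,008 km²\n"
            "Iran 1,648,195 km²\n"
            "Korea (North+South) 219,140 km²\n")
    (let ((indent-tabs-mode nil)   ; pad with spaces only
          (align-to-tab-stop nil)) ; do not snap columns to tab stops
      ;; The group holds the whitespace plus the number; GROUP = -1
      ;; justifies it, so the last digits end up in the same column.
      (align-regexp (point-min) (point-max)
                    "\\(\\s-+[[:digit:],]+\\)" -1 1 nil))
    (buffer-string)))
```

The widest name and the widest number set the column, so the Korea and Iran rows keep the same layout as in the full table above.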