Research Log, March 20, 2026

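Today's topic: continuing to organize the bioinformatics concepts, algorithms, and Python functions I have been learning recently, and jotting down a few small tricks along the way.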

Pattern Matching

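Pattern matching means finding where a shorter string occurs within a longer text. In a genome sequence of staggering length, you need a smart way to find the pattern you are after, such as a transcription factor binding site.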

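↓ Pattern matching on a genome inevitably runs into the reverse complement: DNA is double-stranded, so a pattern may sit on either strand. That is why you need this reverse complement function.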

In [ ]:
def reverse_complement_optimized(seq):
    """
    Generate the reverse complement of a DNA sequence.

    Purpose:
        This function efficiently computes the reverse complement of a DNA string
        by first reversing the sequence and then translating each nucleotide
        according to standard base-pairing rules (A↔T, C↔G).

    Input:
        seq (str): A DNA sequence consisting of characters 'A', 'T', 'C', and 'G'.

    Output:
        str: The reverse complement of the input DNA sequence.

    Notes:
        - This implementation uses str.maketrans and translate() for speed.
        - The function assumes the input contains only uppercase DNA bases.
    """
    translation_table = str.maketrans('ATCG', 'TAGC')
    return seq[::-1].translate(translation_table)
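
↓ The function above assumes uppercase A/T/C/G only. A minimal sketch of a more tolerant variant, assuming you also want to keep lowercase (soft-masked) bases and pass the ambiguity code N through unchanged. The name reverse_complement_tolerant and the choice of extra symbols are my own assumptions, not part of the original function.

```python
# Hypothetical variant: the name and the set of extra symbols are assumptions.
# One table covers both cases; N/n maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_tolerant(seq):
    """Reverse complement that preserves case and leaves N/n unchanged."""
    # Characters missing from the table are passed through as-is by translate().
    return seq[::-1].translate(_COMPLEMENT)


print(reverse_complement_tolerant("ATGCnnac"))  # gtnnGCAT
```

Building the table once at module level avoids rebuilding it on every call, which only matters if the function is called many times.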

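↓ find_pattern_positions(pattern, genome) finds the start positions of a pattern. It repeatedly searches forward through genome for pattern and records the start of every match, so the whole sequence is scanned. To handle overlapping matches, after each hit it resumes from the next character instead of skipping the full pattern length. Finally it returns all matched start positions as a space-separated string.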

In [ ]:
def find_pattern_positions(pattern, genome):
    """
    Find all starting positions of a pattern within a genome string.

    Purpose:
        This function performs a straightforward pattern-matching scan to locate
        every occurrence of the substring `pattern` inside the larger string `genome`.
        Overlapping matches are allowed and will be reported.

    Input:
        pattern (str): The substring to search for.
        genome (str): The larger text (e.g., a DNA sequence) in which the pattern
                      will be searched.

    Output:
        str: A space-separated string of all starting indices (0-based) where the
             pattern occurs in the genome.

    """
    positions = []
    start = 0  # index from which the next search begins

    while start < len(genome):
        pos = genome.find(pattern, start)
        if pos == -1:
            break
        positions.append(str(pos))
        start = pos + 1
    
    return " ".join(positions)
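
↓ Since a binding site can lie on either strand, one possible follow-up is to search for the pattern and its reverse complement together. This is only a sketch: find_all and find_on_both_strands are hypothetical helper names, and returning a sorted list of integers (instead of a space-separated string) is my own choice.

```python
def find_all(pattern, genome):
    """Yield every 0-based start index of pattern in genome, overlaps included."""
    start = genome.find(pattern)
    while start != -1:
        yield start
        # Resume one character later so overlapping matches are not skipped.
        start = genome.find(pattern, start + 1)


def find_on_both_strands(pattern, genome):
    """Return sorted start positions of pattern or its reverse complement."""
    # Assumes uppercase A/T/C/G, like the functions above.
    rc = pattern[::-1].translate(str.maketrans("ATCG", "TAGC"))
    hits = set(find_all(pattern, genome)) | set(find_all(rc, genome))
    return sorted(hits)


print(find_on_both_strands("ATA", "GATATATGCATATACTT"))
# [1, 2, 3, 4, 9, 10, 11]
```

Here ATA matches at 1, 3, 9, 11 and its reverse complement TAT at 2, 4, 10. The set also deduplicates positions when the pattern is its own reverse complement.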

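↓ FindClumps finds every k-mer that forms an (L, t)-clump in a genome sequence, that is, a k-mer that occurs at least t times within some window of length L. It can be used to help identify the origin of replication.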

In [ ]:
def FindClumps(Text, k, L, t):
    """
    Identify all k-mers forming (L, t)-clumps within a given genome string.

    Purpose:
        This function scans the genome using a sliding window of length L and
        identifies all k-mers that appear at least t times within any such window.
        A k-mer that satisfies this condition is considered to form an (L, t)-clump.

    Input:
        Text (str): The genome or long DNA string to be analyzed.
        k (int): Length of the k-mer.
        L (int): Length of the sliding window.
        t (int): Minimum number of occurrences required for a k-mer to be
                 considered part of a clump.

    Output:
        list[str]: A list of distinct k-mers that appear at least t times in
                   at least one window of length L.

    Algorithm Overview:
        - Slide a window of length L across the genome.
        - For each window, construct a frequency map of all k-mers within it.
        - Collect any k-mer whose count is ≥ t.
        - Ensure each qualifying k-mer is reported only once.

    Notes:
        - This implementation uses a naive sliding-window approach and may be
          computationally expensive for large genomes.
        - Overlapping windows are fully considered.
    """
    patterns = []
    n = len(Text)
    
    for i in range(n - L + 1):
        window = Text[i:i+L]
        
        freq_map = {}
        for j in range(len(window) - k + 1):
            kmer = window[j:j+k]
            freq_map[kmer] = freq_map.get(kmer, 0) + 1
        
        for kmer, count in freq_map.items():
            if count >= t and kmer not in patterns:
                patterns.append(kmer)
    
    return patterns
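
↓ The naive version rebuilds the frequency map for every window. A possible speed-up, sketched here under my own assumptions, is to update the counts incrementally: when the window slides by one position, exactly one k-mer leaves and one enters. The name find_clumps_incremental is hypothetical, and unlike FindClumps it returns the k-mers sorted alphabetically rather than in order of discovery.

```python
from collections import Counter


def find_clumps_incremental(text, k, L, t):
    """Sketch: k-mers occurring at least t times in some window of length L."""
    n = len(text)
    if k <= 0 or L < k or n < L:
        return []

    # Count the k-mers of the first window text[0:L].
    counts = Counter(text[j:j + k] for j in range(L - k + 1))
    clumps = {kmer for kmer, c in counts.items() if c >= t}

    for i in range(1, n - L + 1):
        # The k-mer starting at i - 1 falls out of the window.
        counts[text[i - 1:i - 1 + k]] -= 1
        # The k-mer ending at the new right edge comes in.
        entering = text[i + L - k:i + L]
        counts[entering] += 1
        # Only the entering k-mer's count can have risen in this step.
        if counts[entering] >= t:
            clumps.add(entering)

    return sorted(clumps)


print(find_clumps_incremental("ATATATCCCCGG", 2, 6, 3))  # ['AT', 'CC']
```

In the toy input, AT occurs three times in the first window ATATAT and CC three times in ATCCCC, so both form (6, 3)-clumps. Each slide costs one decrement and one increment instead of a full recount of the window.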

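That's it for today. I spent half the day wrestling with formula rendering on the blog and still haven't sorted it out, so I'll get back to it when I have time.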