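Research Log, March 20, 2026
Today's topic: continuing to organize the bioinformatics concepts, algorithms, and Python functions I've been learning recently, plus a few small tricks worth writing down.
Pattern Matching
Finding where a shorter string occurs within a longer text. Genome sequences are staggeringly long, so you need a reasonably clever way to locate the pattern you care about, such as a transcription factor binding site.
↓ When pattern matching against a genome, there is no getting around the reverse complement, because a site may lie on either strand, so you need this reverse-complement function.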
In [ ]:
def reverse_complement_optimized(seq):
    """
    Generate the reverse complement of a DNA sequence.

    Purpose:
        This function efficiently computes the reverse complement of a DNA string
        by first reversing the sequence and then translating each nucleotide
        according to standard base-pairing rules (A↔T, C↔G).

    Input:
        seq (str): A DNA sequence consisting of characters 'A', 'T', 'C', and 'G'.

    Output:
        str: The reverse complement of the input DNA sequence.

    Notes:
        - This implementation uses str.maketrans and translate() for speed.
        - The function assumes the input contains only uppercase DNA bases.
    """
    translation_table = str.maketrans('ATCG', 'TAGC')
    return seq[::-1].translate(translation_table)
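A quick sanity check I find useful, written as a minimal sketch. The name reverse_complement_tolerant is made up for this note, and passing lowercase bases and N through is my own assumption about messy input, not something the function above promises:

```python
def reverse_complement_tolerant(seq):
    """Reverse complement that also accepts lowercase bases and N (assumed input)."""
    # Map each base to its partner; N/n has no partner and maps to itself.
    table = str.maketrans("ATCGatcgNn", "TAGCtagcNn")
    return seq[::-1].translate(table)


print(reverse_complement_tolerant("AAAACCCGGT"))  # ACCGGGTTTT
print(reverse_complement_tolerant("acgtN"))       # Nacgt
```

Any character outside the table is left unchanged by translate(), so unexpected symbols survive silently. Whether that is acceptable depends on how clean the input is.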
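↓ The function find_pattern_positions(pattern, genome) finds the starting positions of a pattern: it repeatedly searches forward through genome for pattern and records the start of every match, giving a complete scan of all occurrences. To handle overlapping matches, after each hit it does not skip ahead by the full pattern length but resumes the search from the next character. All starting positions are returned as a single space-separated string.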
In [ ]:
def find_pattern_positions(pattern, genome):
    """
    Find all starting positions of a pattern within a genome string.

    Purpose:
        This function performs a straightforward pattern-matching scan to locate
        every occurrence of the substring `pattern` inside the larger string `genome`.
        Overlapping matches are allowed and will be reported.

    Input:
        pattern (str): The substring to search for.
        genome (str): The larger text (e.g., a DNA sequence) in which the pattern
            will be searched.

    Output:
        str: A space-separated string of all starting indices (0-based) where the
            pattern occurs in the genome.
    """
    positions = []
    start = 0
    while start < len(genome):
        # str.find returns -1 once no further match exists at or after `start`.
        pos = genome.find(pattern, start)
        if pos == -1:
            break
        positions.append(str(pos))
        # Advance by one character only, so overlapping matches are kept.
        start = pos + 1
    return " ".join(positions)
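Putting the two ideas together, here is a rough sketch of searching both strands at once. The helper name find_on_both_strands is hypothetical, and returning a sorted list of integers instead of a space-separated string is my own choice for this example:

```python
def find_on_both_strands(pattern, genome):
    """Return sorted 0-based starts where pattern or its reverse complement occurs."""
    table = str.maketrans("ATCG", "TAGC")
    reverse_comp = pattern[::-1].translate(table)

    hits = set()
    # A set of queries avoids scanning twice when the pattern is its own
    # reverse complement (e.g. "ACGT").
    for query in {pattern, reverse_comp}:
        start = genome.find(query)
        while start != -1:
            hits.add(start)
            # Step forward by one so overlapping matches are still found.
            start = genome.find(query, start + 1)
    return sorted(hits)


print(find_on_both_strands("ATC", "ATGATCAAG"))  # [2, 3]
```

All positions are reported in forward-strand coordinates: in the example, index 3 is ATC itself and index 2 is its reverse complement GAT.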
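↓ Finds all k-mers in a genome sequence that form an (L, t)-clump, i.e. k-mers that occur at least t times within some window of length L. This can help identify the origin of replication.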
In [ ]:
def FindClumps(Text, k, L, t):
    """
    Identify all k-mers forming (L, t)-clumps within a given genome string.

    Purpose:
        This function scans the genome using a sliding window of length L and
        identifies all k-mers that appear at least t times within any such window.
        A k-mer that satisfies this condition is considered to form an (L, t)-clump.

    Input:
        Text (str): The genome or long DNA string to be analyzed.
        k (int): Length of the k-mer.
        L (int): Length of the sliding window.
        t (int): Minimum number of occurrences required for a k-mer to be
            considered part of a clump.

    Output:
        list[str]: A list of distinct k-mers that appear at least t times in
            at least one window of length L.

    Algorithm Overview:
        - Slide a window of length L across the genome.
        - For each window, construct a frequency map of all k-mers within it.
        - Collect any k-mer whose count is ≥ t.
        - Ensure each qualifying k-mer is reported only once.

    Notes:
        - This implementation uses a naive sliding-window approach and may be
          computationally expensive for large genomes.
        - Overlapping windows are fully considered.
    """
    patterns = []
    n = len(Text)
    for i in range(n - L + 1):
        window = Text[i:i+L]
        freq_map = {}
        for j in range(len(window) - k + 1):
            kmer = window[j:j+k]
            freq_map[kmer] = freq_map.get(kmer, 0) + 1
        for kmer, count in freq_map.items():
            if count >= t and kmer not in patterns:
                patterns.append(kmer)
    return patterns
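Since the docstring notes that the naive version gets expensive on large genomes, here is one possible way to cut the work, as a sketch only. Instead of recounting every window, it updates the counts as the window slides: one k-mer leaves on the left and one enters on the right. The function name find_clumps_incremental is hypothetical, and it returns a sorted list, so the order can differ from FindClumps:

```python
from collections import Counter


def find_clumps_incremental(text, k, L, t):
    """Return sorted distinct k-mers occurring >= t times in some length-L window."""
    if k <= 0 or L < k or len(text) < L:
        return []

    # Count every k-mer in the first window text[0:L].
    counts = Counter(text[j:j + k] for j in range(L - k + 1))
    clumps = {kmer for kmer, count in counts.items() if count >= t}

    for i in range(1, len(text) - L + 1):
        # The k-mer that started the previous window drops out.
        counts[text[i - 1:i - 1 + k]] -= 1
        # The k-mer ending at the new right edge comes in.
        entering = text[i + L - k:i + L]
        counts[entering] += 1
        # Only the entering k-mer's count went up, so only it can newly qualify.
        if counts[entering] >= t:
            clumps.add(entering)
    return sorted(clumps)


print(find_clumps_incremental("AAAAA", 2, 4, 3))  # ['AA']
```

Each shift touches only two k-mers, so the per-window cost no longer grows with L. I have only checked this on small strings, so it is worth comparing against FindClumps on real data before trusting it.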
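That's it for today. I spent half the day wrestling with the blog's formula rendering and still haven't sorted it out; I'll get back to it when I have time.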