Tuesday, January 9, 2024

Genomic Range Query: Find the minimal nucleotide from a range of sequence DNA (Medium Complexity)

Problem

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

Write a function:

def solution(S, P, Q)

that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.

Result array should be returned as an array of integers.

Write an efficient algorithm for the following assumptions:

N is an integer within the range [1..100,000];

M is an integer within the range [1..50,000];

each element of arrays P and Q is an integer within the range [0..N - 1];

P[K] ≤ Q[K], where 0 ≤ K < M;

string S consists only of upper-case English letters A, C, G, T.

Example

For example, consider string S = CAGCCTA and arrays P, Q such that:

P[0] = 2 Q[0] = 4

P[1] = 5 Q[1] = 5

P[2] = 0 Q[2] = 6

The answers to these M = 3 queries are as follows:

The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.

The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.

The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

The function should return the values [2, 4, 1], as explained above.

Brute Force Solution

def solution(S, P, Q):
    impact_factors = {
        'A': 1,
        'C': 2,
        'G': 3,
        'T': 4
    }

    X = [impact_factors[i] for i in S]

    # Implement your solution here
    rtn = []
    for i in range(len(P)):
        start = P[i]
        end = Q[i] + 1

        s = X[start:end]

        rtn.append(min(s))

    return rtn 

Detected time complexity:

O(N * M)

Smart Solution Using Prefix Sums (Part 1)

def solution(S, P, Q):
    N = len(S)
    M = len(P)

    impact_factors = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

    # Calculate prefix sums for each nucleotide type
    prefix_sums = [[0] * (N + 1) for _ in range(4)]
    for i in range(1, N + 1):
        for j in range(4):
            prefix_sums[j][i] = prefix_sums[j][i - 1]
        prefix_sums[impact_factors[S[i - 1]] - 1][i] += 1
        print(i, S[i-1])
        print(prefix_sums)

    print(':1 ~ ~ ~ ~ ~ ~ ~ ~ ~')
    S = 'CAGCCTA'
    P = [2, 5, 0]
    Q = [4, 5, 6]
    print(solution(S, P, Q))  # Output: [2, 4, 1]

    print(':2 ~ ~ ~ ~ ~ ~ ~ ~ ~')
    print(solution('AGCT', [], []))

    print(':3 ~ ~ ~ ~ ~ ~ ~ ~ ~')
    print(solution('AACT', [], []))

This is a two part solution.

In the first part, we build prefix sum sequences for each nucleotide to track it’s occurrence.

Output (Part 1)

S = 'CAGCCTA'

:1 ~ ~ ~ ~ ~ ~ ~ ~ ~

1 C

[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]

2 A

[[0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]

3 G

[[0, 0, 1, 1, 0, 0, 0, 0], [0, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]

4 C

[[0, 0, 1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 2, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]

5 C

[[0, 0, 1, 1, 1, 1, 0, 0], [0, 1, 1, 1, 2, 3, 0, 0], [0, 0, 0, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]

6 T

[[0, 0, 1, 1, 1, 1, 1, 0], [0, 1, 1, 1, 2, 3, 3, 0], [0, 0, 0, 1, 1, 1, 1, 0], [0, 0, 0, 0, 0, 0, 1, 0]]

7 A

[[0, 0, 1, 1, 1, 1, 1, 2], [0, 1, 1, 1, 2, 3, 3, 3], [0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1]]

Output (Part 2)

For 'AGCT'

:2 ~ ~ ~ ~ ~ ~ ~ ~ ~

1 A

[[0, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

2 G

[[0, 1, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 0]]

3 C

[[0, 1, 1, 1, 0], [0, 0, 0, 1, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 0]]

4 T

[[0, 1, 1, 1, 1], [0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 1]]

For 'AACT'

:3 ~ ~ ~ ~ ~ ~ ~ ~ ~

1 A

[[0, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

2 A

[[0, 1, 2, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

3 C

[[0, 1, 2, 2, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

4 T

[[0, 1, 2, 2, 2], [0, 0, 0, 1, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1]]

Smart Solution Using Prefix Sums (Part 2)

def solution(S, P, Q):
    N = len(S)
    M = len(P)

    impact_factors = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

    # Calculate prefix sums for each nucleotide type
    prefix_sums = [[0] * (N + 1) for _ in range(4)]
    for i in range(1, N + 1):
        for j in range(4):
            prefix_sums[j][i] = prefix_sums[j][i - 1]
        prefix_sums[impact_factors[S[i - 1]] - 1][i] += 1

    # Process queries
    result = []
    for k in range(M):
        start_position = P[k]
        end_position = Q[k] + 1

        for nucleotide_type in range(4):
            if prefix_sums[nucleotide_type][end_position] - prefix_sums[nucleotide_type][start_position] > 0:
                result.append(nucleotide_type + 1)
                break

    return result

    # Example usage:
    S = 'CAGCCTA'
    P = [2, 5, 0]
    Q = [4, 5, 6]
    print(solution(S, P, Q))  # Output: [2, 4, 1]

In the second part of the solution we determine from start to end if a low factor nucleotide is present or not.

Here, let seq_X be sequence from as in prefix_sums[nucleotide_type]

It takes the form:

seq_A: [0, 0, 1, 1, 1, 1, 1, 2],

seq_C: [0, 1, 1, 1, 2, 3, 3, 3],

seq_G: [0, 0, 0, 1, 1, 1, 1, 1],

seq_T: [0, 0, 0, 0, 0, 0, 1, 1]

For input sequence: CAGCCTA

Then from any of this seq, we determine whether if A is appearing between start and end using seq_A[end] - seq_A[start]

Or if C is appearing between start and end using:

seq_C[end] – seq_C[start]

Or if G is appearing between start and end using:

seq_G[end] – seq_G[start]

Or if T is appearing between start and end using:

seq_T[end] – seq_T[start]

If a nucleotide is found in this second part of the code, we record this nucleotide and break out.

Time Complexity

Detected time complexity:

O(N + M)

Tests

Correctness tests

extreme_sinlge

single character string

extreme_double

double character string

simple

simple tests

small_length_string

small length simple string

small_random

small random string, length = ~300

Performance tests

almost_all_same_letters

GGGGGG..??..GGGGGG..??..GGGGGG

large_random

large random string, length

extreme_large

all max ranges

Explanation of Code Using Bard

I'd be glad to explain the code you provided:

Here's a breakdown of the code's key steps:

  1. Initialization:

    • N stores the length of the DNA sequence S.
    • M stores the number of queries (the length of arrays P and Q).
    • impact_factors maps each nucleotide to its corresponding impact factor.
  2. Prefix Sums Calculation:

    • prefix_sums is a 2D array with 4 rows (one for each nucleotide type) and N+1 columns. It's used to store the counts of each nucleotide type up to each position in the DNA sequence.
    • The nested loops iterate through the DNA sequence and update prefix_sums accordingly:
      • The first loop iterates through each position in the sequence.
      • The second loop iterates through each nucleotide type.
      • For each position and nucleotide type, the code copies the count from the previous position and then increments the count if the current nucleotide matches the type.
  3. Query Processing:

    • result is an empty list to store the answers to the queries.
    • The outer loop iterates through each query, accessing its start and end positions from P and Q.
    • The inner loop iterates through each nucleotide type.
    • It checks if the count of the current nucleotide type within the query range (end_position - start_position) is greater than 0. If so, it means that nucleotide type is present within the range, and its impact factor is added to the result list. The loop then breaks because we've found the minimum impact factor for that query.
  4. Returning Results:

    • The function returns the result list containing the answers to all queries.

Example Usage:

  • The provided example demonstrates how to use the function with a sample DNA sequence S and query arrays P and Q.
  • The output [2, 4, 1] correctly represents the minimum impact factors for the specified query ranges in S.

Key Points:

  • The code efficiently calculates prefix sums to count the occurrences of each nucleotide type up to each position.
  • This precomputation allows for quick calculation of minimum impact factors within any query range, using simple subtractions of prefix sums.
  • The time complexity of the code is O(N + M), making it efficient for large DNA sequences and multiple queries.

No comments:

Post a Comment