Introduction to DSA & Big O
Data Structures are formats for organizing, managing, processing, and storing data. Algorithms are step-by-step procedures for solving problems or accomplishing tasks.
Complexity Analysis (Big O Notation)
Big O notation describes an upper bound on how an algorithm's running time or space requirements grow as the input size grows. It is most commonly quoted for the worst-case scenario (though it can also describe best or average cases) and applies to both execution time and space used (e.g., in memory or on disk).
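For intuition, here is a minimal sketch contrasting two growth classes; the function names are illustrative, not a standard API:
# O(n): in the worst case, every element must be examined
def contains_linear(items, target):
    for x in items:
        if x == target:
            return True
    return False

# O(1) on average: a set lookup uses hashing instead of scanning
def contains_hashed(item_set, target):
    return target in item_set

data = list(range(1_000_000))
data_set = set(data)
print(contains_linear(data, 999_999))    # scans up to a million elements
print(contains_hashed(data_set, 999_999))  # roughly constant time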
Arrays
An array is a collection of items stored at contiguous memory locations. In Python, the `list` type serves as a dynamic array.
Key Characteristics
- Indexed: Elements are accessed via an index, starting from 0.
- Contiguous Memory: Elements are stored next to each other in memory, which allows for fast O(1) access.
- Dynamic (in Python): Python's lists can grow or shrink in size.
Complexity Analysis
Operation | Time Complexity | Space Complexity | Notes |
---|---|---|---|
Access (by index) | O(1) | O(1) | e.g., `my_list[i]` |
Search (by value) | O(n) | O(1) | Must potentially check every element. |
Append (add to end) | O(1) (Amortized) | O(1) | Occasionally requires resizing (O(n)), but on average it's constant. |
Insert (at arbitrary index) | O(n) | O(1) | Requires shifting all subsequent elements. |
Delete (at arbitrary index) | O(n) | O(1) | Requires shifting all subsequent elements. |
Python Code Example
# Python's list is a dynamic array
arr = [10, 20, 30, 40, 50]
# Access (O(1))
print(f"Element at index 2: {arr[2]}") # Output: 30
# Append (Amortized O(1))
arr.append(60)
print(f"After append: {arr}") # Output: [10, 20, 30, 40, 50, 60]
# Insert (O(n))
arr.insert(1, 15) # Insert 15 at index 1
print(f"After insert: {arr}") # Output: [10, 15, 20, 30, 40, 50, 60]
# Delete (O(n))
arr.pop(3) # Deletes element at index 3
print(f"After pop: {arr}") # Output: [10, 15, 20, 40, 50, 60]
# Search (O(n))
try:
    index = arr.index(40)
    print(f"Index of 40: {index}")  # Output: 3
except ValueError:
    print("Element not found")
Strings
A string is a sequence of characters. In Python, strings are immutable, meaning they cannot be changed after they are created. Any operation that appears to modify a string actually creates a new one.
Key Characteristics
- Immutable: Cannot be changed in place. `my_string[0] = 'a'` will raise a `TypeError`.
- Ordered: Characters have a defined sequence.
- Iterable: You can loop through the characters.
Complexity Analysis
Operation | Time Complexity | Space Complexity | Notes |
---|---|---|---|
Access (by index) | O(1) | O(1) | e.g., `my_string[i]` |
Slicing | O(k) | O(k) | `k` is the length of the slice. A new string is created. |
Concatenation | O(n + m) | O(n + m) | `n` and `m` are lengths of strings. A new string is created. |
Search (substring) | O(n * m) | O(1) | Can be faster with algorithms like KMP. `in` operator in Python is highly optimized. |
Python Code Example
s = "hello"
# Immutability
# s[0] = 'H' # This will cause a TypeError
# Slicing (O(k))
new_s = s[1:4] # "ell"
print(f"Slice: {new_s}")
# Concatenation (O(n+m))
s2 = s + " world"
print(f"Concatenated: {s2}")
# Search (Optimized in Python)
if "ell" in s:
print("Substring 'ell' found")
# String methods create new strings
s_upper = s.upper()
print(f"Original: {s}, Uppercase: {s_upper}")
Linked Lists
A linked list is a linear data structure where elements are not stored at contiguous memory locations. The elements (nodes) are linked using pointers.
Conceptually, a list looks like `A -> B -> C -> None`, where each node contains `data` and a `next` pointer.
Types of Linked Lists
- Singly Linked List (SLL): Each node points only to the next node. Traversal is unidirectional.
- Doubly Linked List (DLL): Each node points to both the next and the previous node. Traversal can be bidirectional (a minimal node sketch follows this list).
- Circular Linked List (CLL): The last node's `next` pointer points back to the first node, forming a circle.
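For illustration, a doubly linked list node simply carries one extra pointer; the `DLLNode` class name is an assumption for this sketch:
class DLLNode:
    def __init__(self, data):
        self.data = data
        self.prev = None  # link to the previous node
        self.next = None  # link to the next node
With `prev` links, a node can be unlinked in O(1) once you hold a reference to it, at the cost of extra memory per node.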
Singly Linked List Implementation
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def append(self, data):
        new_node = Node(data)
        if not self.head:
            self.head = new_node
            return
        last_node = self.head
        while last_node.next:
            last_node = last_node.next
        last_node.next = new_node

    def prepend(self, data):
        new_node = Node(data)
        new_node.next = self.head
        self.head = new_node

    def delete(self, key):
        curr = self.head
        # Case 1: the key is in the head node
        if curr and curr.data == key:
            self.head = curr.next
            return
        # Case 2: walk the list, remembering the previous node
        prev = None
        while curr and curr.data != key:
            prev = curr
            curr = curr.next
        if curr is None:
            return  # Key not found
        prev.next = curr.next  # Unlink the node

    def display(self):
        elements = []
        curr = self.head
        while curr:
            elements.append(str(curr.data))
            curr = curr.next
        print(" -> ".join(elements))
# Example usage
sll = SinglyLinkedList()
sll.append(1)
sll.append(2)
sll.append(3)
sll.prepend(0)
sll.display() # Output: 0 -> 1 -> 2 -> 3
sll.delete(2)
sll.display() # Output: 0 -> 1 -> 3
Comparison: Array vs. Linked List
Feature | Array (Python List) | Linked List |
---|---|---|
Access Time | O(1) | O(n) |
Insertion/Deletion (at start) | O(n) | O(1) |
Insertion/Deletion (at end) | O(1) Amortized | O(n) (or O(1) if tail pointer is maintained) |
Insertion/Deletion (in middle) | O(n) | O(n) (search) + O(1) (link change) |
Memory | Contiguous block | Scattered, requires extra memory for pointers |
Stacks
A stack is a linear data structure that follows the LIFO (Last-In, First-Out) principle. The last element added to the stack will be the first one to be removed.
Core Operations
- Push: Add an element to the top of the stack.
- Pop: Remove the top element from the stack.
- Peek/Top: View the top element without removing it.
- isEmpty: Check if the stack is empty.
Implementation using Python List
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        if not self.is_empty():
            return self.items.pop()
        return "Stack is empty"

    def peek(self):
        if not self.is_empty():
            return self.items[-1]
        return "Stack is empty"

    def is_empty(self):
        return len(self.items) == 0

    def size(self):
        return len(self.items)
# Example usage
stack = Stack()
stack.push(10)
stack.push(20)
stack.push(30)
print(f"Top element: {stack.peek()}") # Output: 30
print(f"Popped element: {stack.pop()}") # Output: 30
print(f"Is empty? {stack.is_empty()}") # Output: False
Complexity Analysis
Operation | Time Complexity | Space Complexity |
---|---|---|
Push | O(1) | O(1) |
Pop | O(1) | O(1) |
Peek | O(1) | O(1) |
Search | O(n) | O(1) |
Queues
A queue is a linear data structure that follows the FIFO (First-In, First-Out) principle. The first element added will be the first one to be removed.
Implementation with `collections.deque`
Using a Python `list` for a queue is inefficient for removals from the front (O(n)). The `collections.deque` (double-ended queue) is optimized for appends and pops from both ends (O(1)).
from collections import deque
queue = deque()
# Enqueue (add to the right)
queue.append(10)
queue.append(20)
queue.append(30)
print(f"Queue: {queue}")
# Dequeue (remove from the left)
item = queue.popleft()
print(f"Dequeued item: {item}") # Output: 10
print(f"Queue after dequeue: {queue}")
# Peek
print(f"Front of queue: {queue[0]}") # Output: 20
Variations of Queues
- Circular Queue: A fixed-size queue where the last position is connected back to the first, making it a circle. It efficiently reuses a fixed block of memory (a sketch follows this list).
- Deque (Double-Ended Queue): Elements can be added or removed from either the front or the rear. `collections.deque` is Python's implementation.
- Priority Queue: Each element has a priority associated with it. An element with high priority is dequeued before an element with low priority. In Python, this is implemented using the `heapq` module (which uses a min-heap).
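A minimal circular (ring-buffer) queue sketch, assuming a fixed capacity and raising on overflow/underflow; the class and method names are illustrative:
class CircularQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # index of the front element
        self.size = 0

    def enqueue(self, item):
        if self.size == self.capacity:
            raise OverflowError("queue is full")
        tail = (self.head + self.size) % self.capacity  # wrap around
        self.buf[tail] = item
        self.size += 1

    def dequeue(self):
        if self.size == 0:
            raise IndexError("queue is empty")
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity  # wrap around
        self.size -= 1
        return item

cq = CircularQueue(3)
cq.enqueue(1); cq.enqueue(2); cq.enqueue(3)
print(cq.dequeue())  # Output: 1
cq.enqueue(4)        # reuses the slot freed by the dequeue
print(cq.dequeue())  # Output: 2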
Priority Queue (`heapq`) Example
import heapq
# A min-priority queue
pq = []
# heapq treats the list 'pq' as a min-heap
heapq.heappush(pq, (2, 'Task B')) # (priority, item)
heapq.heappush(pq, (1, 'Task A'))
heapq.heappush(pq, (3, 'Task C'))
print(f"Priority queue: {pq}")
# Pop the item with the smallest priority
highest_priority_task = heapq.heappop(pq)
print(f"Processing: {highest_priority_task}") # Output: (1, 'Task A')
Hashing
Hashing is a technique that maps keys of arbitrary size to fixed-size values; a hash function generates this mapping. Hash tables (or hash maps) are data structures that implement the associative array abstract data type, mapping keys to values.
Key Concepts
- Hash Function: A function that takes a key and computes an index (a hash code) into an array of buckets or slots. A good hash function distributes keys uniformly.
- Hash Table: An array where data is stored according to its computed hash code.
- Collision: Occurs when two different keys hash to the same index.
Collision Resolution Techniques
- Separate Chaining: Each bucket in the hash table points to a linked list of records that have the same hash code. This is the most common method (a sketch follows this list). For example:
  Index 0: -> NULL
  Index 1: -> KeyA -> KeyC -> NULL
  Index 2: -> KeyB -> NULL
- Open Addressing (Probing): All elements are stored in the hash table itself. When a collision occurs, we probe for the next empty slot.
  - Linear Probing: Check the next slot `(hash(key) + i) % size`. Suffers from primary clustering.
  - Quadratic Probing: Check slots `(hash(key) + i^2) % size`.
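To make separate chaining concrete, here is a hedged sketch of a tiny hash table whose buckets are plain Python lists acting as chains; `ChainedHashTable` and its methods are illustrative names, not a library API:
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        # Each bucket is a list of (key, value) pairs sharing the same index
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # collision or empty bucket: chain it

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("apple", 1)
table.put("banana", 2)
print(table.get("banana"))  # Output: 2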
Python's `dict` and `set`
Python's built-in `dict` (dictionary) and `set` are implemented using hash tables. They are highly optimized for performance.
Complexity Analysis (Average Case)
Operation | Time Complexity | Space Complexity |
---|---|---|
Insert | O(1) | O(1) |
Delete | O(1) | O(1) |
Search/Lookup | O(1) | O(1) |
*Worst-case complexity is O(n) if all keys hash to the same bucket, but this is extremely rare with good hash functions.
# Dictionary (Hash Map)
my_dict = {'name': 'Alice', 'age': 30}
my_dict['city'] = 'New York' # Insert/Update - O(1)
print(f"Age: {my_dict['age']}") # Lookup - O(1)
del my_dict['age'] # Delete - O(1)
print(f"Dictionary: {my_dict}")
# Set
my_set = {1, 2, 3, 4}
my_set.add(5) # Insert - O(1)
my_set.remove(2) # Delete - O(1)
print(f"Is 3 in set? {3 in my_set}") # Lookup - O(1)
print(f"Set: {my_set}")
Trees
A tree is a non-linear hierarchical data structure that consists of nodes connected by edges. Each tree has a root node, and all other nodes are descendants of the root.
Tree Traversal Algorithms
- In-order (Left, Root, Right): Visits nodes in ascending order for a BST.
- Pre-order (Root, Left, Right): Useful for creating a copy of the tree.
- Post-order (Left, Right, Root): Useful for deleting nodes.
- Level-order (Breadth-First Search): Visits nodes level by level. Uses a queue (a sketch follows the BST example below).
Binary Search Tree (BST)
A binary tree where for each node, all values in its left subtree are less than its own value, and all values in its right subtree are greater.
class TreeNode:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.val = key

def insert(root, key):
    if root is None:
        return TreeNode(key)
    if root.val < key:
        root.right = insert(root.right, key)
    else:
        root.left = insert(root.left, key)
    return root

def inorder_traversal(root):
    if root:
        inorder_traversal(root.left)
        print(root.val, end=' ')
        inorder_traversal(root.right)
# Example
root = TreeNode(50)
root = insert(root, 30)
root = insert(root, 20)
root = insert(root, 40)
root = insert(root, 70)
root = insert(root, 60)
root = insert(root, 80)
print("In-order traversal of the BST:")
inorder_traversal(root) # Output: 20 30 40 50 60 70 80
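As referenced above, a level-order (BFS) traversal of the same tree can be sketched with a queue, reusing the `TreeNode` structure:
from collections import deque

def level_order(root):
    if root is None:
        return
    queue = deque([root])
    while queue:
        node = queue.popleft()
        print(node.val, end=' ')
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)

print("\nLevel-order traversal of the BST:")
level_order(root)  # Output: 50 30 70 20 40 60 80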
Balanced vs. Unbalanced Trees
A standard BST can become unbalanced (like a linked list) in the worst case, leading to O(n) for search/insert/delete operations. Self-balancing trees like AVL trees and Red-Black trees perform rotations to maintain a balanced height, ensuring O(log n) performance.
Heap
A heap is a specialized tree-based data structure that satisfies the heap property: in a Min-Heap, every parent node is less than or equal to its children; in a Max-Heap, every parent is greater than or equal to its children.
Heaps are commonly used to implement Priority Queues. Python's `heapq` module provides a min-heap implementation.
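For example, `heapq.heapify` rearranges a plain list into a min-heap in place in O(n), after which the smallest element is always at index 0:
import heapq

nums = [9, 4, 7, 1, 3]
heapq.heapify(nums)          # O(n): in-place min-heap
print(nums[0])               # Output: 1 -- the minimum sits at the root
print(heapq.heappop(nums))   # Output: 1 -- pop the minimum in O(log n)
print(heapq.heappop(nums))   # Output: 3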
Trie (Prefix Tree)
A Trie is a tree-like data structure that is efficient for retrieving keys in a dictionary-like dataset. It's commonly used for autocomplete features and spell checkers. Each node represents a character, and paths from the root to a node represent prefixes.
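A minimal Trie sketch with dictionary-based children; the class names and the `_walk` helper are illustrative choices for this example:
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the next TrieNode
        self.is_end = False  # True if a complete word ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def search(self, word):
        node = self._walk(word)
        return node is not None and node.is_end

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node

trie = Trie()
trie.insert("car")
trie.insert("card")
print(trie.search("car"))      # Output: True
print(trie.search("ca"))       # Output: False -- "ca" is only a prefix
print(trie.starts_with("ca"))  # Output: True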
Graphs
A graph is a non-linear data structure consisting of a set of vertices (nodes) and a set of edges connecting these vertices.
Graph Representations
- Adjacency Matrix: A 2D array where `matrix[i][j] = 1` if there is an edge from vertex `i` to `j`. Space complexity is O(V²). Good for dense graphs.
- Adjacency List: An array of lists. `adj[i]` contains a list of all vertices adjacent to vertex `i`. Space complexity is O(V + E). Good for sparse graphs.
# Adjacency List Representation in Python using a dictionary
graph = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [],
    'E': ['F'],
    'F': []
}
Graph Traversal Algorithms
- Breadth-First Search (BFS): Explores neighbor nodes first, level by level. Uses a queue. Useful for finding the shortest path in an unweighted graph.
- Depth-First Search (DFS): Explores as far as possible along each branch before backtracking. Uses a stack (or recursion). Useful for topological sorting and detecting cycles.
# BFS Implementation
from collections import deque
def bfs(graph, start_node):
    visited = set()
    queue = deque([start_node])
    visited.add(start_node)
    while queue:
        vertex = queue.popleft()
        print(vertex, end=" ")
        for neighbor in graph.get(vertex, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
print("BFS traversal:")
bfs(graph, 'A') # Example: A B C D E F
print("\n")
# DFS Implementation (Recursive)
def dfs(graph, start_node, visited=None):
    if visited is None:
        visited = set()
    visited.add(start_node)
    print(start_node, end=" ")
    for neighbor in graph.get(start_node, []):
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
print("DFS traversal:")
dfs(graph, 'A') # Example: A B D E F C
Important Graph Algorithms
- Dijkstra's Algorithm: Finds the shortest path from a source node to all other nodes in a weighted graph (with non-negative weights); a `heapq`-based sketch follows this list.
- Bellman-Ford Algorithm: Finds the shortest path, and can handle negative edge weights.
- Floyd-Warshall Algorithm: Finds the shortest paths between all pairs of nodes.
- Prim's & Kruskal's Algorithms: Finds the Minimum Spanning Tree (MST) of a graph.
- Topological Sort: Linear ordering of vertices in a Directed Acyclic Graph (DAG) such that for every directed edge from vertex `u` to `v`, `u` comes before `v` in the ordering.
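Below is a hedged sketch of Dijkstra's algorithm using `heapq` as the priority queue; it assumes a weighted adjacency list of the form `{node: [(neighbor, weight), ...]}`, which differs from the unweighted `graph` dictionary above:
import heapq

def dijkstra(graph, source):
    dist = {source: 0}
    pq = [(0, source)]  # (distance so far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue  # stale entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd               # found a shorter path to v
                heapq.heappush(pq, (nd, v))
    return dist

weighted = {'A': [('B', 1), ('C', 4)], 'B': [('C', 2), ('D', 5)], 'C': [('D', 1)], 'D': []}
print(dijkstra(weighted, 'A'))  # Output: {'A': 0, 'B': 1, 'C': 3, 'D': 4}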
Searching Algorithms
Linear Search
Iterates through a collection one element at a time. It can be used on any iterable, sorted or unsorted.
Time Complexity: O(n) | Space Complexity: O(1)
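A minimal implementation sketch:
def linear_search(arr, target):
    for i, value in enumerate(arr):
        if value == target:
            return i  # found: return the first matching index
    return -1         # not found

print(linear_search([7, 3, 9, 3], 9))  # Output: 2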
Binary Search
A highly efficient search algorithm that works on sorted collections. It repeatedly divides the search interval in half.
Time Complexity: O(log n) | Space Complexity: O(1) (iterative), O(log n) (recursive)
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1  # Not found
sorted_arr = [10, 20, 30, 40, 50, 60]
index = binary_search(sorted_arr, 40)
print(f"Index of 40 is: {index}") # Output: 3
Sorting Algorithms
Sorting arranges items in a collection into a specific order.
Comparison of Sorting Algorithms
Algorithm | Time Complexity (Best) | Time Complexity (Avg) | Time Complexity (Worst) | Space Complexity | Stable? |
---|---|---|---|---|---|
Bubble Sort | O(n) | O(n²) | O(n²) | O(1) | Yes |
Selection Sort | O(n²) | O(n²) | O(n²) | O(1) | No |
Insertion Sort | O(n) | O(n²) | O(n²) | O(1) | Yes |
Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes |
Quick Sort | O(n log n) | O(n log n) | O(n²) | O(log n) | No |
Heap Sort | O(n log n) | O(n log n) | O(n log n) | O(1) | No |
Merge Sort Implementation
def merge_sort(arr):
    if len(arr) > 1:
        mid = len(arr) // 2
        left_half = arr[:mid]
        right_half = arr[mid:]
        # Recursively sort each half
        merge_sort(left_half)
        merge_sort(right_half)
        # Merge the two sorted halves back into arr
        i = j = k = 0
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1
        # Copy any remaining elements
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1
        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1
my_list = [54, 26, 93, 17, 77, 31, 44, 55, 20]
merge_sort(my_list)
print(f"Sorted list: {my_list}")
Algorithm Paradigms
Divide and Conquer
This paradigm involves three steps:
- Divide: Break the problem into smaller subproblems.
- Conquer: Solve the subproblems recursively.
- Combine: Combine the solutions to solve the original problem.
Greedy Algorithms
A greedy algorithm makes the locally optimal choice at each step in the hope of reaching a global optimum. It does not always find the best solution, but it is typically simple and fast.
Example: Finding minimum number of coins for a given change amount (for standard coin denominations).
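A hedged sketch of that coin-change example; it assumes a canonical coin system (e.g., US denominations), where the greedy choice happens to be optimal. For arbitrary denominations, greedy can return a suboptimal count:
def min_coins_greedy(amount, denominations=(25, 10, 5, 1)):
    coins = []
    for coin in denominations:  # try the largest coin first
        while amount >= coin:
            amount -= coin
            coins.append(coin)
    return coins

print(min_coins_greedy(63))  # Output: [25, 25, 10, 1, 1, 1]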
Dynamic Programming (DP)
Solves complex problems by breaking them down into simpler subproblems, solving each subproblem only once, and storing their solutions. It is used for problems with overlapping subproblems and optimal substructure.
- Memoization (Top-Down): Recursive approach where you store the results of expensive function calls and return the cached result when the same inputs occur again.
- Tabulation (Bottom-Up): Iterative approach where you solve subproblems starting from the smallest ones and build up to the desired solution (a sketch follows the memoization example below).
# Fibonacci with Memoization
memo = {}
def fib_memo(n):
    if n in memo:
        return memo[n]  # Return the cached result
    if n <= 1:
        return n
    result = fib_memo(n - 1) + fib_memo(n - 2)
    memo[n] = result  # Cache before returning
    return result
print(f"Fibonacci (memoization): {fib_memo(10)}")
Backtracking
A general algorithmic technique for finding all (or some) solutions to computational problems, notably constraint satisfaction problems. It incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot possibly be completed to a valid solution.
Example: N-Queens problem, solving Sudoku.
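A compact N-Queens sketch showing the build-and-backtrack pattern; each solution encodes row `r`'s queen as a column index, and the set-based diagonal pruning is one common implementation choice:
def solve_n_queens(n):
    solutions = []
    cols, diag1, diag2 = set(), set(), set()  # attacked columns and diagonals

    def place(row, queens):
        if row == n:                  # all rows filled: record a full solution
            solutions.append(queens[:])
            return
        for col in range(n):
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue              # square is attacked: prune this candidate
            cols.add(col); diag1.add(row - col); diag2.add(row + col)
            queens.append(col)
            place(row + 1, queens)    # extend the partial solution
            queens.pop()              # backtrack: undo the choice
            cols.discard(col); diag1.discard(row - col); diag2.discard(row + col)

    place(0, [])
    return solutions

print(len(solve_n_queens(4)))  # Output: 2 -- the 4-queens puzzle has two solutions
print(solve_n_queens(4))       # Output: [[1, 3, 0, 2], [2, 0, 3, 1]]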