Talently
Talently
DBA

DBA

Ensures that business data is available, consistent, and accessible at the performance level the product demands.

A Database Administrator (DBA) is responsible for designing, implementing, maintaining, and optimizing the database systems that underpin an organization's products and operations. Their work spans data modeling and database engine configuration to backup management, replication, high availability, and disaster recovery. They work closely with back-end developers, architects, and infrastructure teams to ensure that data systems can support business growth without compromising performance or data integrity.

PostgreSQLMySQLMongoDBRedisSQLReplication

Recruit the best DBA here

Start now

Main Responsibilities

  • Design and maintain relational and non-relational database schemas optimized for the product's access patterns.
  • Monitor and optimize query performance, indexes, and engine configuration to keep response times within SLAs.
  • Implement and maintain backup, replication, and high availability strategies to ensure business continuity.
  • Manage schema migrations in production using strategies that minimize downtime and the risk of data loss.
  • Define and enforce data access security policies: roles, permissions, auditing, and encryption.
  • Collaborate with development teams by reviewing queries and schema designs before they reach production.

Key Skills

Technical Skills

  • Advanced SQL: complex joins, window functions, CTEs, subqueries, and execution plan optimization
  • Administration of relational engines (PostgreSQL, MySQL) and non-relational (MongoDB, Redis), including configuration and performance tuning
  • Schema design with normalization, strategic denormalization, and data modeling for both OLTP and OLAP workloads
  • Implementation of replication, clustering, and high availability strategies (streaming replication, read replicas, automatic failover)
  • Backup management with point-in-time recovery strategies and regular, documented restore testing
  • Performance monitoring with tools such as pg_stat_statements, slow query log, EXPLAIN ANALYZE, and APM systems

Soft Skills

  • Methodical rigor when executing production changes: checklists, pre-change validations, and documented rollback plans
  • Technical communication to translate database performance issues into business impact that non-technical stakeholders can understand
  • Preventive thinking to anticipate bottlenecks before data volume makes them unavoidable
  • Ability to challenge schema or query designs in collaboration with developers without creating unnecessary friction
  • Operational discipline to keep documentation current: configurations, design decisions, and emergency procedures
  • Judgment to balance technical schema purity with the speed the business needs to evolve the product

Real use cases

Context

As data volume increases, queries that performed well at thousands of rows begin to degrade. Proactive optimization prevents the problem from surfacing in production.

Real examples

  • Slow query log analysis and identification of queries lacking appropriate indexes
  • Composite index design tailored to the product's specific access patterns
  • Table partitioning for historical data to maintain performance across years of records
  • Query rewriting for inefficient execution plans identified with EXPLAIN ANALYZE

Context

A well-designed schema from the start reduces future technical debt and makes the product easier to evolve. A poorly designed schema is expensive to correct once there is production data in it.

Real examples

  • Entity-relationship modeling with normalization appropriate to the use case
  • Decision between relational and document models based on access patterns
  • Multi-tenant schema design with per-organization data isolation
  • Definition of constraints, foreign keys, and check constraints that enforce integrity at the database level

Context

Databases are the most critical component in most systems. An outage affects the entire application. High availability and DR must be designed before the first incident occurs.

Real examples

  • Streaming replication with automatic failover configured in PostgreSQL
  • Read replica implementation to distribute read-heavy workloads
  • Backup strategy with periodic, documented restore tests
  • RTO and RPO targets defined and validated through scheduled DR drills

Context

Changing the schema of a database with millions of rows and active users is one of the riskiest production operations. It requires careful planning, a sound strategy, and thorough validation.

Real examples

  • Zero-downtime migrations using the expand-contract pattern
  • New column backfills on large tables using batches to avoid lock contention
  • Data integrity validation before and after every migration
  • Rollback procedures documented and tested before applying to production

Context

Data is an organization's most sensitive asset. Databases must have strict access controls, operation auditing, and encryption of sensitive data.

Real examples

  • Role and permission model following least-privilege principles per application and user
  • Access auditing on tables containing PII or financial data
  • Sensitive column encryption with key management integrated into a KMS
  • Dynamic data masking for development and staging environments

Basic questions

Normalize for OLTP: reduces redundancy, guarantees integrity, and simplifies writes. Denormalize strategically for OLAP workloads or when read queries are frequent and the cost of joins is measurable in production. Denormalization is always a deliberate decision with its trade-offs documented: increased write complexity and a risk of inconsistency if not managed carefully.
Index columns that appear frequently in WHERE, JOIN, ORDER BY, or GROUP BY clauses in high-volume queries. Avoid indexing low-cardinality columns (few distinct values) because the index adds no meaningful selectivity. Consider composite indexes when queries filter on multiple columns together. Every index carries a cost in write performance and storage — never create speculative indexes without evidence they will be used.
Run EXPLAIN ANALYZE to see the actual execution plan: check whether it is doing a sequential scan instead of an index scan, whether there are expensive nested loops, or whether the planner's row estimates differ significantly from reality. Verify that table statistics are current with ANALYZE. Check whether query parameters are preventing index use — functions applied to indexed columns, or implicit type conversions. Compare the plan in staging against production if they diverge.
Logical backup (pg_dump, mysqldump): exports data as SQL or CSV. Portable across versions and engines, but slow to restore for large databases. Physical backup (pg_basebackup, volume snapshots): copies the engine's data files directly. Faster to restore and enables point-in-time recovery with WAL archiving. For production databases with low RPO requirements, physical backup with PITR is the standard approach.
A deadlock occurs when two transactions each wait for the other to release a lock, blocking indefinitely. Database engines detect this automatically and abort one of the transactions. Prevention: always acquire resources in the same order across all transactions, keep transactions short, and use the minimum necessary isolation level. Detection: the engine logs deadlocks — PostgreSQL records them when deadlock_timeout is exceeded.
Create a read-only user scoped to the specific tables needed, with a time-limited access window. Log the access in the audit system. Never grant access using the application user account or a superuser. If data modification is required, the change must be reviewed, approved, and executed by the DBA — not the developer directly. Revoke access immediately after the incident is resolved.
Three main strategies: separate database per tenant (maximum isolation, highest cost), separate schema per tenant in the same database (good isolation, moderate scalability), or shared tables with a tenant_id column (greatest scale, requires row-level security to enforce isolation). The choice depends on the number of tenants, compliance and isolation requirements, and infrastructure cost. Row-level security in PostgreSQL is a key tool for the shared-table approach.
Run the migration against a recent copy of production, measuring the actual execution time and the locks it acquires. Verify that a tested rollback plan is in place. Communicate the maintenance window if user impact is expected. Execute during the lowest-traffic window. Actively monitor database metrics during the migration. Have the rollback command ready to execute within seconds if something fails. Never run a production migration that has not been tested beforehand.

Technical questions

A B-tree maintains values in a sorted, self-balancing tree structure that supports searches, inserts, and deletes in O(log n). It is efficient for equality and range comparisons. It is not optimal for: full-text search (use GIN with tsvector), geospatial data (use GiST or SP-GiST), low-cardinality columns where a bitmap scan is more efficient, or arrays and JSONB where GIN indexes allow searching within values.
Streaming replication with WAL shipping to one or more standby replicas. For automatic failover, Patroni is the current standard: it manages consensus between nodes using etcd or ZooKeeper, automatically promotes the most up-to-date replica when the primary fails, and updates the connection endpoint via HAProxy or a service discovery entry. The application connects to a virtual endpoint that always points to the current primary, with no configuration changes needed during failover.
Multi-Version Concurrency Control allows readers and writers to operate without blocking each other: each transaction sees a consistent snapshot of the data as it existed when the transaction began. The cost is that old row versions (dead tuples) accumulate until VACUUM reclaims them. Without sufficient vacuuming, table bloat degrades performance and can eventually cause transaction ID wraparound — a critical failure mode. Monitoring VACUUM lag and dead tuple ratios is part of routine DBA maintenance.
Range partitioning by date (monthly or weekly partitions) is the most appropriate approach for time-series data. It enables: dropping old partitions with DROP PARTITION instead of DELETE — instantaneous and without table bloat — confining recent queries to small partitions to improve performance, and implementing clean data retention policies. Create future partitions in advance. Verify that queries include the partition column in the WHERE clause so the planner can perform partition pruning.
Confirm the bloat by comparing pg_relation_size against pg_total_relation_size, or using the pgstattuple extension. Common causes: autovacuum disabled or configured too conservatively, or long-running transactions preventing VACUUM from reclaiming dead tuples. Zero-downtime solution: VACUUM FULL requires an exclusive lock (not viable on an active production table); use pg_repack or pgcompacttable to reorganize the table online. Tune autovacuum parameters (autovacuum_vacuum_scale_factor, autovacuum_vacuum_cost_delay) to be more aggressive on high-churn tables.
Separate OLAP from OLTP load by adding dedicated read replicas for analytical queries. Route reporting queries to replicas via connection pooling (PgBouncer) or at the application layer. For very heavy queries, consider materializing results in aggregation tables updated on a schedule. If analytical workload continues to grow unboundedly, evaluate offloading it to a dedicated OLAP system (Redshift, BigQuery, ClickHouse) with synchronization from the transactional database.
Apply audit triggers only to tables containing sensitive data, not to the entire database. Write audit records asynchronously to a separate table or an external system (Kafka, centralized log) to avoid adding latency to the write path. In PostgreSQL, pgaudit enables granular auditing by object and operation type with lower overhead than custom triggers. For strict compliance, send audit logs to an immutable external system where they cannot be altered.
PostgreSQL spawns a process per connection with a significant memory cost (5–10 MB per connection). With hundreds of application instances, direct connections will saturate the database. PgBouncer maintains a small pool of real database connections and multiplexes many application connections over them. In transaction pooling mode, the connection is returned to the pool at the end of each transaction. The trade-off: some PostgreSQL features that depend on session state — prepared statements, advisory locks, SET LOCAL — do not work correctly in transaction pooling mode.

Advanced questions

Evaluate active-passive replication (one primary region, read replicas in other regions) versus active-active (writes accepted in multiple regions with conflict resolution). Active-passive is simpler and avoids write conflicts; the trade-off is write latency for users far from the primary. Active-active with CRDTs or timestamp-based resolution is only viable for data models that tolerate conflicts. CockroachDB or Spanner are options for global consistency without manually managing the complexity.
Identify bounded contexts and the data that belongs to each. Start with the most independent domains. The standard strategy is the strangler fig pattern: the new service writes to its own database and to the monolith in parallel during the transition. Once the new system is stable, writes to the monolith are disabled and the monolith reads from the new service via API. The biggest risk is cross-domain consistency: distributed transactions must be replaced with eventual consistency and domain events.
First, contain the damage: disable the processes that are continuing to write corrupt data. Then, assess the exact scope using diagnostic queries. Restore from the most recent backup taken before the corruption, using PITR if available. If a full restore is not feasible, identify the affected records and correct them using scripts validated in staging first. Document the incident fully: root cause, scope, actions taken, and improvements to prevent recurrence. Stakeholder communication must be honest about the real extent of the impact.
Migration is justified when the relational model is a real, measurable obstacle: schemas that change very frequently and generate costly migrations, data with highly variable structure across records, or access patterns that require horizontal scale beyond what PostgreSQL can achieve. Do not migrate for technological trend-following. In most cases, PostgreSQL with JSONB for variable fields and partitioning for volume resolves the problem without the cost of a full migration. If migration is genuinely justified, do it incrementally by business entity — never as a big bang.
Embed the review in the pipeline: tools like Squawk or sqlfluff automatically flag dangerous migrations (ALTER TABLE statements that acquire exclusive locks, adding a NOT NULL column without a default on a large table). The DBA reviews only the changes that the automated tools flag as high risk. Provide developers with a guide to safe migration patterns with concrete examples. The goal is for 80% of migrations to pass without manual review because the team already knows the correct patterns.
Classify data by category: active operational data, historical data required for compliance retention, and data subject to right-to-erasure. Implement date-based partitioning to make archival and deletion of old data straightforward. Compliance-retained data is moved to cheaper storage (S3 Glacier) with metadata that allows retrieval if required. Data subject to erasure must have a verifiable deletion mechanism that includes backup coverage. Document the complete data flow for compliance audits.

Common interview mistakes

Indexes are not always the right answer and carry real costs in write performance and maintenance. A DBA who recommends adding an index without first running EXPLAIN ANALYZE and understanding why the planner is not using existing indexes is diagnosing superficially — and may make the problem worse.
Any change to a production database must have a tested rollback procedure in place before it is executed. A DBA who describes migrations without addressing how to revert if something goes wrong immediately loses credibility with interviewers who have dealt with real database incidents.
Backups address data loss (RPO). High availability addresses downtime (RTO). They are complementary, not interchangeable. A system with excellent backups but no HA can still face hours of downtime to restore. A system with HA but no backups can lose data due to corruption or accidental deletion.
A backup that has never been restored is a hypothesis, not a guarantee. Interviewers at companies with critical data specifically ask when the last restore test was performed and how long it took. Not having an answer to that question is a serious operational red flag.
A schema is not good or bad in the abstract — it is good or bad for a specific set of access patterns. Describing a database design without explaining which queries it was optimized for, what write volume was anticipated, or which relationships were prioritized suggests the design was intuitive rather than evidence-based.
Operations such as ADD COLUMN NOT NULL without a default, CREATE INDEX without CONCURRENTLY, or VACUUM FULL acquire exclusive locks that block both reads and writes. A DBA who proposes these changes without addressing their availability impact — or how to avoid it with online alternatives — demonstrates a lack of experience with production databases under real load.