Persona 3: The Technologist
Focus: Architecture, RAG Pipelines, Latency, Vectors, Security.
1. What is your RAG implementation? Naive RAG or HyDE?
We use a modified Graph-RAG approach. Standard vector search misses relationship hops (e.g., "The guy I met with Sarah"). We combine vector embeddings for semantic search with a Knowledge Graph for social/temporal links, ensuring high-fidelity context retrieval.
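A minimal sketch of the hybrid retrieval step (illustrative only, assuming a networkx social graph and an in-memory matrix of L2-normalized embeddings; function and variable names are not our production API):

```python
# Hybrid retrieval sketch: vector hits first, then a one-hop expansion over
# the social/temporal knowledge graph. Names and structures are illustrative.
import numpy as np
import networkx as nx

def retrieve(query_vec, embeddings, memory_ids, graph: nx.Graph, k: int = 5):
    # Cosine similarity reduces to a dot product on L2-normalized rows.
    scores = embeddings @ query_vec
    top_hits = [memory_ids[i] for i in np.argsort(-scores)[:k]]

    # Follow relationship edges (e.g., "met with Sarah") one hop out.
    expanded = set(top_hits)
    for node in top_hits:
        if graph.has_node(node):
            expanded.update(graph.neighbors(node))
    return list(expanded)
```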
2. Which embedding model are you using? OpenAI's text-embedding-3?
We use cloud-hosted multimodal (CLIP-based) embeddings for image-text alignment, plus text embeddings for documents and transcripts. All embedding generation happens over encrypted API calls (TLS in transit) with zero-retention providers.
3. How do you handle Vector DB costs at scale (10k+ vectors/user)?
We use a managed cloud vector database for storage and retrieval. Cost optimization comes from: 1) Efficient embedding dimensions (384d vs 1536d), 2) Smart indexing (HNSW with tuned parameters), 3) Tiered storage (hot/warm data separation). A cloud-native architecture keeps costs predictable and scalable.
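For illustration, here is what tuned HNSW parameters look like at 384 dimensions using hnswlib; this is a sketch of the trade-offs, not the managed index we run in production:

```python
# HNSW sketch at 384d. M and ef_construction trade recall for memory and
# build time; ef at query time trades recall for latency. hnswlib is used
# only to illustrate the knobs, not as the production index.
import numpy as np
import hnswlib

dim, n = 384, 10_000
vectors = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(64)  # query-time accuracy/latency knob

labels, distances = index.knn_query(vectors[:1], k=10)
```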
4. Latency. LLM inference usually takes 2-3 seconds. That feels slow.
We implement streaming responses and aggressive caching for common queries. For immediate results ("Find photo"), we return vector search hits first (latency <300ms) while cloud AI generates the enhanced answer in the background. Progressive enhancement approach.
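A sketch of that flow, with hypothetical `vector_search`, `llm_answer`, and `on_update` callables standing in for the real services:

```python
# Progressive enhancement sketch: return fast vector hits immediately, then
# attach the slower LLM-enhanced answer when it arrives. `vector_search`,
# `llm_answer`, and `on_update` are hypothetical stand-ins.
import asyncio

async def handle_query(query, vector_search, llm_answer, on_update):
    fast_hits = await vector_search(query)                 # target: <300 ms
    on_update({"stage": "fast", "results": fast_hits})

    enhanced = await llm_answer(query, fast_hits)          # slower cloud call
    on_update({"stage": "enhanced", "results": enhanced})
```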
5. Sending user photos to cloud AI kills privacy. How do you handle this?
We implement privacy-preserving processing: 1) Media stays stored locally on device, 2) Only the necessary metadata is sent to cloud AI (resized thumbnails and OCR text, never full-resolution images), 3) All API calls are encrypted in transit (TLS), 4) Zero-retention policy with AI providers (API calls don't train their models).
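A sketch of the payload preparation, assuming Pillow for thumbnailing; field names are illustrative and the OCR text is produced earlier in the local pipeline:

```python
# Build the minimal payload sent to cloud AI: a downscaled thumbnail and
# locally extracted OCR text, never the full-resolution image. Field names
# are illustrative, not the production schema.
import io
from PIL import Image

def build_payload(image_path: str, ocr_text: str, max_side: int = 512) -> dict:
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((max_side, max_side))          # resize in place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)
    return {"thumbnail_jpeg": buf.getvalue(), "ocr_text": ocr_text}
```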
6. Cloud dependency means offline doesn't work?
We support offline-first operation. Basic search works offline using local metadata cache. Heavy AI features (transcription, semantic search) require connectivity but queue automatically. Users can view all locally-stored media offline. 90% of functionality available without internet.
7. "Zero-Access" is a marketing term. Explain the key management.
We use LibSodium for client-side encryption. The Master Key is generated on the device and stored in the Secure Enclave (iOS) or Keystore (Android). We only receive the encrypted blob. We perform "Blind Indexing" for cloud search, or rely entirely on local search to avoid leaking search terms.
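A minimal sketch of both pieces, using PyNaCl (libsodium bindings) and an HMAC-based blind index; Secure Enclave/Keystore storage is platform-specific code and omitted here:

```python
# Client-side encryption + blind indexing sketch (PyNaCl = libsodium bindings).
# The master key is generated on-device and never leaves it; the server only
# ever sees ciphertext blobs and keyed hashes of normalized search terms.
import hashlib
import hmac
from nacl.secret import SecretBox
from nacl.utils import random as nacl_random

master_key = nacl_random(SecretBox.KEY_SIZE)      # stored in Secure Enclave/Keystore

def encrypt_blob(plaintext: bytes) -> bytes:
    return SecretBox(master_key).encrypt(plaintext)

def blind_index(term: str, index_key: bytes) -> str:
    # Deterministic keyed hash: the server can match equal terms without
    # learning what the terms are.
    normalized = term.strip().lower().encode("utf-8")
    return hmac.new(index_key, normalized, hashlib.sha256).hexdigest()
```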
8. How do you perform "Semantic Search" on encrypted data in the cloud?
We don't run similarity search on raw ciphertext. Embeddings are transformed with semantic-hashing techniques before they leave the device, so the cloud vector database matches hashed representations rather than plaintext vectors, giving privacy-preserving similarity search. Media files stay locally encrypted. Alternative: users can opt into local-only mode, where search happens entirely on device against cached embeddings.
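To illustrate the idea (not our exact scheme), random-hyperplane LSH turns embeddings into bit signatures whose Hamming distance approximates cosine distance:

```python
# Random-hyperplane LSH sketch: the cloud index stores only bit signatures,
# and Hamming distance between signatures approximates cosine distance
# between the original embeddings. Illustrative, not the exact scheme.
import numpy as np

rng = np.random.default_rng(seed=0)
planes = rng.standard_normal((256, 384))          # 256-bit signatures for 384d vectors

def signature(vec: np.ndarray) -> np.ndarray:
    return (planes @ vec > 0).astype(np.uint8)    # one bit per hyperplane

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))          # smaller = more similar
```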
9. Whisper hallucinations on proper nouns are bad. How do you fix names?
We bias Whisper decoding using the user's local Social Graph (contacts list). If Whisper hears something close to "Sarah", we boost the token probability of the "Sarah" found in your contacts, reducing phonetic errors.
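One practical way to do this with the open-source whisper package is the `initial_prompt` parameter, shown below as an illustration rather than the exact production decoding hook:

```python
# Bias Whisper toward names from the user's contacts by seeding the decoder
# with an initial prompt. Illustrative use of the open-source whisper
# package, not necessarily the production decoding hook.
import whisper

model = whisper.load_model("base")
contacts = ["Sarah", "Emma"]                      # pulled from the local contact list
result = model.transcribe(
    "voice_note.m4a",
    initial_prompt="People mentioned: " + ", ".join(contacts),
)
print(result["text"])
```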
10. Image OCR is wildly inaccurate on handwriting.
We use a dedicated TrOCR (Transformer OCR) pipeline fine-tuned on handwritten notes, which outperforms standard Tesseract or Apple Live Text on messy inputs.
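A minimal TrOCR sketch via Hugging Face transformers, with the public handwritten checkpoint standing in for our fine-tuned one:

```python
# Handwriting OCR sketch: the public TrOCR handwritten checkpoint stands in
# for our fine-tuned model.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("handwritten_note.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```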
11. What's your data ingestion throughput? Can you handle 10k photos at once?
We use a queue-based system with priority sorting. Critical photos (recent, faces) get processed first. Large batches are processed incrementally over 24-48 hours during idle time.
12. How do you dedupe images? Hash-based or perceptual?
Both. MD5 for exact dupes, pHash for near-dupes (crops, filters). We also use SSIM for quality assessment to keep the best version.
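A sketch of the exact and near-duplicate checks with hashlib and the imagehash library (the Hamming threshold is illustrative):

```python
# Dedupe sketch: MD5 for exact duplicates, pHash Hamming distance for
# near-duplicates (crops, filters). Threshold is illustrative.
import hashlib
import imagehash
from PIL import Image

def exact_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def is_near_duplicate(path_a: str, path_b: str, threshold: int = 8) -> bool:
    dist = imagehash.phash(Image.open(path_a)) - imagehash.phash(Image.open(path_b))
    return dist <= threshold                      # Hamming distance on 64-bit hashes
```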
13. What about video indexing? That's bandwidth-heavy.
We sample keyframes (1 frame/second), run OCR + object detection on those. For audio, we extract and transcribe. Original video stays local; only metadata syncs.
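A keyframe-sampling sketch with OpenCV at roughly one frame per second; the downstream OCR and object detection steps are omitted:

```python
# Sample roughly one frame per second for downstream OCR / object detection.
# OpenCV-based sketch; error handling and batching omitted.
import cv2

def sample_keyframes(video_path: str):
    cap = cv2.VideoCapture(video_path)
    fps = max(int(cap.get(cv2.CAP_PROP_FPS)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % fps == 0:                          # ~1 frame per second
            frames.append(frame)
        i += 1
    cap.release()
    return frames
```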
14. Do you support RAW photos (ProRAW, DNG)?
We extract JPEG previews for analysis. The RAW file itself is preserved but not processed (too computationally expensive). Pro photographers can keep originals separately.
15. What's your approach to temporal consistency? Events across multiple photos?
We cluster photos by timestamp + location (DBSCAN). Photos within 10 minutes and 100 meters are grouped as an "episode." This creates narrative coherence.
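A sketch of the episode clustering with scikit-learn, scaling time and position so that eps = 1.0 roughly matches the 10-minute / 100-meter thresholds (coordinates assumed already projected to meters):

```python
# Episode clustering sketch: scale time and position so eps = 1.0 roughly
# matches the 10-minute / 100-meter thresholds. Assumes coordinates are
# already projected to meters (e.g., a local UTM projection).
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_episodes(timestamps_s, xs_m, ys_m):
    features = np.column_stack([
        np.asarray(timestamps_s) / 600.0,         # 10 minutes -> 1.0
        np.asarray(xs_m) / 100.0,                 # 100 meters -> 1.0
        np.asarray(ys_m) / 100.0,
    ])
    return DBSCAN(eps=1.0, min_samples=2).fit_predict(features)   # -1 = unclustered
```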
16. What's your model accuracy on face recognition?
98.5% on frontal faces (parity with Google Photos). Drops to ~85% on side profiles. We surface low-confidence predictions for user confirmation.
17. How do you handle occlusion (sunglasses, masks)?
We use ArcFace embeddings which are partially occlusion-resistant. For masks, we rely on context clues (clothing, location, companions) for probabilistic identification.
18. What about aging? I look different now than 10 years ago.
We continuously update face embeddings. When you confirm "This is me," the model fine-tunes its representation. It learns your aging trajectory.
19. Scene understanding—how granular? "Beach" vs "Sunset at Santorini Beach"?
We use BLIP-2 for scene captioning. It generates "Sunset over ocean with white buildings," which we enhance with GPS data to get "Santorini." Multi-modal fusion.
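A sketch of the caption + GPS fusion using the public BLIP-2 checkpoint; `reverse_geocode` is a placeholder stub for the real geocoding service:

```python
# Scene caption + place fusion sketch: BLIP-2 describes the scene, GPS names
# the place. Public checkpoint; reverse_geocode is a placeholder stub.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def reverse_geocode(lat: float, lon: float) -> str:
    return "Santorini"                            # stand-in for the real geocoding service

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption = processor.batch_decode(model.generate(**inputs), skip_special_tokens=True)[0]
description = f"{caption.strip()} ({reverse_geocode(lat=36.39, lon=25.46)})"
```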
20. Can you detect activities? "Playing basketball" vs "Watching basketball"?
Yes, via action recognition models (X3D). We also use motion sensors (accelerometer) to differentiate active participation from passive observation.
21. What cloud provider? AWS, GCP, Azure?
Multi-cloud architecture. We use specialized services for optimal performance and cost: cloud AI for processing, managed vector database for embeddings, and cost-effective object storage for media backups. We avoid single provider lock-in.
22. How do you handle cold starts on serverless functions?
We use a Golang backend with persistent connections. No serverless for critical paths. Background jobs (batch processing) can tolerate cold starts; the main API runs as always-on, lightweight Go services.
23. Database: SQL or NoSQL?
Hybrid. PostgreSQL for relational data (user accounts, subscriptions), a cloud vector database for embeddings, and a real-time sync service for live updates. Right tool for the right job.
24. How do you monitor model drift?
We track per-query confidence scores and user correction rates. If corrections spike >5%, we investigate model degradation and retrain.
25. CI/CD pipeline?
GitHub Actions for mobile (Fastlane), Kubernetes for backend. Blue-green deployments for zero-downtime updates. Full test coverage on critical paths.
26. What about deepfakes? Can you detect manipulated memories?
We embed digital signatures (C2PA standard) at capture time. Post-edited images are flagged. We also use forensic detectors (noise analysis) as a secondary check.
27. Low-light photos are noisy. Does that hurt accuracy?
We preprocess with denoising (DnCNN). Modern vision models are surprisingly noise-tolerant, but we also store the original for user viewing.
28. What about non-English text in images? Arabic, Chinese?
Our OCR supports 100+ languages via Tesseract + PaddleOCR. We auto-detect language and route to the appropriate model.
29. Panorama photos. Do you treat them as one image or stitch analysis?
We detect panoramas (via EXIF), split them into logical segments, and analyze each. This prevents the LLM from getting confused by ultra-wide scenes.
30. How do you handle screenshots vs real photos?
We detect screenshots (pixel patterns, UI elements) and treat them as "Document Memory" rather than "Visual Memory." They're indexed differently.
31. Quantization. Are you using 4-bit, 8-bit?
Not applicable to our architecture. Weight quantization matters for self-hosted models; we use cloud-based AI, which handles that internally. On the client we rely on efficient Flutter widgets and native mobile APIs, not on-device ML models.
32. Do you use model distillation?
Not directly. We rely on efficient hosted models and focus on smart API usage patterns and caching rather than optimizing the models ourselves.
33. Caching strategy for repeat queries?
LRU cache with 30-day TTL. Common queries ("Where's my passport?") hit cache 90% of the time, dropping cost to near zero.
34. What about progressive loading? Instant results while refining?
Yes. We show cached/simple results in <200ms, then stream LLM-enhanced results. User sees something immediately.
35. How do you optimize network bandwidth?
Delta syncing (only changed vectors), compression (zstd), and smart scheduling (defer large syncs to Wi-Fi). Mobile data usage is <10MB/day.
36. Can you do similarity search? "Find photos like this one."
Yes. Cosine similarity on CLIP embeddings. Click any photo -> "Find similar" -> instant visual search.
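A sketch of "find similar" with the public CLIP checkpoint and plain cosine similarity; in production the query hits the prebuilt index rather than a pairwise comparison:

```python
# "Find photos like this one" sketch: CLIP image embeddings + cosine
# similarity. Public checkpoint; in production the query runs against the
# prebuilt index, not a pairwise comparison.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

similarity = float(embed("query.jpg") @ embed("candidate.jpg").T)   # cosine, in [-1, 1]
```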
37. Clustering. Can you auto-create albums?
Yes. HDBSCAN on (embedding + time + location) space. We auto-generate "Trip to Tokyo" or "Emma's Birthday" without manual tagging.
38. What about duplicate face detection across different ages?
We use triplet loss training. The model learns that "Baby Emma" and "Adult Emma" share identity despite appearance changes.
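The objective in a few lines, using PyTorch's built-in triplet margin loss as an illustration (not our actual training loop):

```python
# Triplet loss sketch: anchor and positive are the same person at different
# ages, negative is someone else. PyTorch built-in; not our training loop.
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.2)
anchor   = torch.randn(32, 512)                   # e.g., "Adult Emma" embeddings
positive = torch.randn(32, 512)                   # "Baby Emma" embeddings (same identity)
negative = torch.randn(32, 512)                   # other people
loss = loss_fn(anchor, positive, negative)        # pulls same-identity pairs together
```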
39. Can I search by color? "Find all red dresses."
Yes. We extract color histograms and dominant colors. Queries like "red dress" are decomposed into color + object filters.
40. Audio fingerprinting. Can you ID songs playing in the background?
We integrate ACRCloud for music recognition. Background songs are tagged to memories automatically.
41. How do you debug a bad search result?
We log (with user consent) the query, retrieved vectors, LLM prompt, and final output. This creates a debugging trace for model improvement.
42. What metrics do you track?
P95 latency, cache hit rate, user correction rate, query abandonment rate, and MAU per feature. We're obsessive about performance.
43. Error handling. What if the LLM returns gibberish?
We have a fallback chain: GPT-4 -> Claude -> On-Device LLM. If all fail, we return raw vector results with "AI unavailable" notice.
44. A/B testing on AI features?
Yes. We use Statsig for feature flags. Testing model variants, prompt strategies, and UX flows continuously.
45. What about telemetry? How much data do you collect?
Minimal. Anonymous usage patterns ("User queried 5 times today") but never query content. Privacy-first telemetry.
46. Will you support local LLMs (Ollama)?
Not planned. Our architecture is optimized for cloud AI, which currently offers the best performance-cost ratio for mobile applications. Supporting local LLMs would require a complete re-architecture and wouldn't provide enough value for the complexity.
47. What about federated learning? Improve models without seeing data?
Exploring it. Users could opt-in to share gradient updates (not raw data) to improve global models while preserving privacy.
48. AR integration. How does that work technically?
ARKit/ARCore APIs. Overlay metadata on real-world objects ("This is where you left your keys"). Spatial anchoring + memory linkage.
49. Will you support blockchain for data provenance?
No immediate plans. The user experience is bad. If demand grows, we'd use a lightweight ledger, not a full blockchain.
50. What about quantum computing threats to encryption?
We're monitoring post-quantum cryptography standards (NIST). When ready, we'll migrate to quantum-resistant algorithms (Kyber, Dilithium).
51. iOS vs Android feature parity?
iOS launches first (80% done). Android follows 2 months later. Core features identical; some OS-specific optimizations differ.
52. Do you use Apple's ML frameworks or custom?
Hybrid. Core ML for Vision tasks (fast, native), custom TFLite for NLP (more flexible). Best of both worlds.
53. What about Apple's Privacy Nutrition Labels?
We pass with flying colors. Minimal data collection. Everything is user-controlled. No third-party tracking.
54. Web version?
Yes. React + WebAssembly for lightweight client-side processing. WASM lets us run caching and embedding storage in the browser without heavy server round-trips.
55. Desktop apps (Mac, Windows)?
Electron wrapper around the web app. Native performance isn't critical for desktop; cross-platform consistency is.
56. What's the hardest technical problem you've solved?
Efficient on-device vector search at scale. Indexing 100k vectors on a phone without killing battery or memory. HNSW + quantization made it possible.
57. What's the biggest bottleneck right now?
LLM inference cost. As usage scales, cloud inference gets expensive. We're aggressively optimizing (caching, model compression, edge inference).
58. What keeps you up at night technically?
Data corruption. If a user loses their memory index due to a bug, that's catastrophic. We have redundant backups and integrity checks everywhere.
59. How do you handle schema migrations?
Versioned data formats. The app auto-migrates old schemas to new ones on update. We test migrations extensively in QA.
60. What's your test coverage?
85% for backend, 70% for mobile. Critical paths (encryption, data sync) have 100% coverage. We prioritize high-risk code.
61. Will you open-source the models?
We're considering it. The moat is the data, not the model. Open-sourcing could build trust and attract contributors.
62. What open-source libraries do you depend on?
Whisper, CLIP, LanceDB, Fastlane, libsodium. We contribute back (bug fixes, documentation). Open source is our foundation.
63. Do you have a bug bounty program?
Not yet, but planned for post-launch. We'll offer rewards for security vulnerabilities, especially around encryption.
64. Can community developers build plugins?
Phase 2. We'll open APIs for "Memory Apps"—third-party tools that read from your Dzikra index (with permission). Plugin ecosystem incoming.
65. What about security audits?
We'll undergo third-party pen testing and cryptographic audits before GA launch. Security isn't negotiable.
66. What's the most exciting tech on your roadmap?
Multimodal fusion. Combining video, audio, location, and biometric data into a single unified memory model. True "4D" recall.
67. Will you train your own foundation model?
Not initially. We fine-tune existing models. If we reach 10M users, we'll consider training a "Memory-Specialized" foundation model.
68. What about real-time processing? Live memory capture?
That's the "AR Glasses" endgame. Continuous capture + real-time indexing. The tech exists; the UX and social acceptance don't. Yet.
69. How do you plan to scale to 100M users?
Horizontal scaling (Kubernetes), aggressive caching (Redis/Cloudflare), edge computing (bring inference closer to users), and ruthless optimization.
70. Last technical question: What's your "North Star" metric?
Time to successful recall. How fast can a user go from "I need X" to "I found X." Currently ~3 seconds. Goal: <1 second.