This description was collected and forwarded by Artech, a staffing firm.
- Build the Event Hubs integration with the Service Fabric microservices implementation: stream the processed files from blobs into EH for downstream processing (a sketch of the blob-to-EH flow appears after this list).
- Anonymized files (~1000 of them, ~GB in size) will be provided as input.
- The Service Fabric code portion will be provided.
- Build the Spark processing that reads off Event Hubs; an implementation in either Python or Scala would suffice (see the read sketch after this list).
- Evaluate the caching needs; leverage .cache() so intermediate results computed for one Spark action are retained on the executors and reused by later actions (caching sketch below).
- Our team will evaluate a set of data stores that could serve as the landing spot after Spark, with Blobs being a required one. We will pick 1 or 2 from this list: SQL DW, Azure SQL DB, Cassandra, and DocumentDB are the other candidate stores, and we will have code snippets and/or guidance (a Blob-landing sketch follows the list).
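
The actual blob-to-Event-Hubs forwarding will live in the provided Service Fabric services, so the following is only a minimal Python sketch of that flow; the connection strings, container name, and hub name are placeholders.

```python
# Minimal sketch only: the production path runs inside the provided
# Service Fabric services. All names below are placeholders.
from azure.storage.blob import ContainerClient
from azure.eventhub import EventHubProducerClient, EventData

STORAGE_CONN = "<storage-connection-string>"    # assumption: supplied via config
EVENTHUB_CONN = "<eventhub-connection-string>"  # assumption: supplied via config

container = ContainerClient.from_connection_string(
    STORAGE_CONN, container_name="processed-files")
producer = EventHubProducerClient.from_connection_string(
    EVENTHUB_CONN, eventhub_name="ingest")

with producer:
    for blob in container.list_blobs():
        # Pull down each processed file and forward its records to EH.
        payload = container.download_blob(blob.name).readall()
        lines = payload.splitlines()
        if not lines:
            continue
        batch = producer.create_batch()
        for line in lines:
            try:
                batch.add(EventData(line))
            except ValueError:
                # Batch is full: ship it and start a new one.
                producer.send_batch(batch)
                batch = producer.create_batch()
                batch.add(EventData(line))
        producer.send_batch(batch)
```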
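For the Spark side, a minimal PySpark sketch of reading the Event Hubs stream, assuming the azure-event-hubs-spark connector is on the cluster classpath; the connection string and consumer group are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("eh-spark-ingest").getOrCreate()

conn_str = "<eventhub-connection-string>"  # assumption: injected from a secret store
eh_conf = {
    # The connector expects an encrypted connection string.
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
    "eventhubs.consumerGroup": "spark-processing",
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Event Hubs delivers the payload as binary in the `body` column.
events = raw.select(col("body").cast("string").alias("body"), col("enqueuedTime"))
```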
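For the caching item, a small sketch of the pattern, assuming records have already been parsed and landed somewhere readable; paths and column names are placeholders.

```python
# Caching sketch: persist an intermediate DataFrame that more than one
# action consumes, so executors keep the computed partitions in memory
# instead of recomputing the lineage for each action.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

parsed = spark.read.parquet("/mnt/landing/events").where("body IS NOT NULL")
parsed.cache()  # or .persist(...) for finer control over storage level

total = parsed.count()                                               # action 1: fills the cache
parsed.write.mode("overwrite").parquet("/mnt/reports/events-copy")   # action 2: reuses it
print(f"cached and copied {total} records")

parsed.unpersist()  # release executor memory once the actions are done
```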
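For the required Blob landing spot, a sketch continuing from `events` in the read sketch above, assuming the hadoop-azure (WASB) driver is available; the storage account, key, and paths are placeholders, and the other candidate stores would get their own writers once selected.

```python
# Assumption: the account key would come from a secret store in practice.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<storage-account-key>")

query = (events.writeStream
         .format("parquet")
         .option("path", "wasbs://landing@<storage-account>.blob.core.windows.net/events/")
         .option("checkpointLocation",
                 "wasbs://landing@<storage-account>.blob.core.windows.net/checkpoints/events/")
         .outputMode("append")
         .start())
query.awaitTermination()
```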
Integration & Deployment
- Integrate the items from above with completed items (Azure Data Factory with ARM provisioning, picking up from the ADF pipeline which lands files onto blobs).
- Apply best practices for capacity planning and E2E deployment.
- Integrate the deployment with the existing set of tools and processes.
Testing
- Build a unit test framework that can test each building block in isolation (ADF → Blobs, Blobs → Service Fabric, Service Fabric → EH, EH → Spark, Spark → <Data Store>); a pytest-style sketch for one block follows below.
- Build an E2E test environment with telemetry on latency and throughput, with percentiles (a percentile sketch follows below). Leverage APM tools as appropriate.
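
For the unit-test framework, a hedged pytest-style sketch of testing one building block (EH → Spark) in isolation by running the transformation against an in-memory DataFrame rather than a live stream; `parse_events` is a hypothetical stand-in for whatever the streaming job does to the EH `body` column.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object


def parse_events(df):
    # Placeholder transformation: pull one field out of the JSON body.
    return df.select(get_json_object(col("body"), "$.eventType").alias("eventType"))


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_parse_events_extracts_event_type(spark):
    source = spark.createDataFrame([('{"eventType": "claim"}',)], ["body"])
    rows = parse_events(source).collect()
    assert rows[0]["eventType"] == "claim"
```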
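For the latency percentiles, a small sketch assuming each landed record carries its Event Hubs enqueued time plus a landing timestamp; column names and paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.appName("latency-telemetry").getOrCreate()

landed = spark.read.parquet("/mnt/landing/events")
latency = landed.withColumn(
    "latency_sec",
    unix_timestamp(col("landedTime")) - unix_timestamp(col("enqueuedTime")))

# Approximate p50/p95/p99 end-to-end latency across the landed records.
p50, p95, p99 = latency.approxQuantile("latency_sec", [0.5, 0.95, 0.99], 0.01)
print(f"latency p50={p50}s  p95={p95}s  p99={p99}s")
```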