Distributed Systems: Common Pitfalls and Complexity
Modern API Management
When assessing prominent topics across DZone — and the software engineering space more broadly — it simply felt incomplete to conduct research on the larger impacts of data and the cloud without talking about such a crucial component of modern software architectures: APIs. Communication is key in an era when applications and data capabilities are growing increasingly complex. Therefore, we set our sights on investigating the emerging ways in which data that would otherwise be isolated can better integrate with and work alongside other app components and across systems. For DZone's 2024 Modern API Management Trend Report, we focused our research specifically on APIs' growing influence across domains, prevalent paradigms and implementation techniques, security strategies, AI, and automation. Alongside observations from our original research, practicing tech professionals from the DZone Community contributed articles addressing key topics in the API space, including automated API generation via no and low code; communication architecture design among systems, APIs, and microservices; GraphQL vs. REST; and the role of APIs in the modern cloud-native landscape.
The acronym "Ops" has rapidly increased in IT operations in recent years. IT operations are turning towards the automation process to improve customer delivery. Traditional application development uses DevOps implementation for Continued Integration (CI) and Continued Deployment (CD). The exact delivery and deployment process may not be suitable for data-intensive Machine Learning and Artificial Intelligence (AI) applications. This article will define different "Ops" and explain their work for the following: DevOps, DataOps, MLOps, and AIOps. DevOps This practice automates the collaboration between Development (Dev) and Operations (Ops). The main goal is to deliver the software product more rapidly and reliably and continue delivery with software quality. DevOps complements the agile software development process/agile way of working. DataOps DataOps is a practice or technology that combines integrated and process-oriented data with automation to improve data quality, collaboration, and analytics. It mainly deals with the cooperation between data scientists, data engineers, and other data professionals. MLOps MLOps is a practice or technology that develops and deploys machine learning models reliably and efficiently. MLOps is the set of practices at the intersection of DevOps, ML, and Data Engineering. AIOps AIOps is the process of capabilities to automate and streamline operations workflows for natural language processing and machine learning models. Machine Learning and Big Data are major aspects of AIOps because AI needs data from different systems and processes using ML models. AI is driven by machine learning models to create, deploy, train, and analyze the data to get accurate results. As per the IBM Developer, below are the typical “Ops” work together: Image Source: IBM Collective Comparison The table below describes the comparison between DevOps, DataOps, MLOps, and AIOps: Aspect DevOps DataOps MLOps AIOps Focus on: IT operations and software development with Agile way of working Data quality, collaboration, and analytics Machine Learning models IT operations Key Technologies/Tools: Jenkins, JIRA, Slack, Ansible, Docker, Git, Kubernetes, and Chef Apache Airflow, Databricks, Data Kitchen, High Byte Python, TensorFlow, PyTorch, Jupyter, and Notebooks Machine learning, AI algorithms, Big Data, and monitoring tools Key Principles: IT process automation Team collaboration and communication Continuous integration and continuous delivery (CI/CD) Collaboration between data Data pipeline automation and optimization Version control for data artifacts Data scientists and operations teams collaborate. 
Machine learning models, version control Continuous monitoring and feedback Automated analysis and response to IT incidents Proactive issue resolution using analytics IT management tools integration Continuous improvement using feedback Primary Users Software and DevOps engineers Data and DataOps engineers Data scientists and MLOps engineers Data scientists, Big Data scientists, and AIOps engineers Use Cases Microservices, containerization, CI/CD, and collaborative development Ingestion of data, processing and transforming data, and extraction of data into other platforms Machine learning (ML) and data science projects for predictive analytics and AI IT AI operations to enhance network, system, and infrastructure Summary In summary, managing a system from a single project team is at the end of its life due to business processes becoming more complex and IT systems changing dynamically with new technologies. The detailed implementation involves a combination of collaborative practices, automation, monitoring, and a focus on continuous improvement as part of DevOps, DataOps, MLOps, and AIOps processes. DevOps focuses primarily on IT processes and software development, and the DataOps and MLOps approaches focus on improving IT and business collaborations as well as overall data use in organizations. DataOps workflows leverage DevOps principles to manage the data workflows. MLOps also leverages the DevOps principles to manage applications built-in machine learning.
Lately, I have been working with Polars and PySpark, which brings me back to the days when Spark fever was at its peak and every data processing solution seemed to revolve around it. This prompts me to question: was it really necessary? Let's delve into my experiences with various data processing technologies.

Background

During my final degree project on sentiment analysis, Pandas was just beginning to emerge as the primary tool for feature engineering. It was user-friendly and integrated seamlessly with several machine learning libraries, such as scikit-learn. Then, as I started working, Spark became a part of my daily routine. I used it for ETL processes in a nascent data lake to implement business logic, although I wondered if we were over-engineering the process. Typically, the data volumes we handled were not substantial enough to necessitate Spark, yet it was employed every time new data entered the system: we would set up a cluster and proceed with processing using Spark. Only in a few instances did I genuinely feel that Spark was the right tool for the job. This experience pushed me to develop a lightweight ingestion framework using Pandas. However, this framework did not perform as expected, struggling with medium to large files. Recently, I've started using Polars for some tasks, and I have been impressed by its performance in processing datasets with several million rows. This has led me to set up a benchmark covering all of these tools. Let's dive into it!

A Little Bit of Context

Pandas

We should not forget that Pandas has been the dominant tool for data manipulation, exploration, and analysis. Pandas rose in popularity among data scientists thanks to its similarities with the R grid view. Moreover, it integrates with other Python libraries from the machine learning field:

- NumPy is a mathematical library for implementing linear algebra and standard calculations. Pandas is built on top of NumPy.
- Scikit-learn is the reference library for machine learning applications. Normally, all the data used for a model is loaded, visualized, and analyzed with Pandas or NumPy.

PySpark

Spark is a free, distributed platform that transformed the paradigm of big data processing, with PySpark as its Python library. It offers a unified computing engine with exceptional features:

- In-memory processing: Spark's major feature is its in-memory architecture, which is fast because it keeps data in memory rather than on disk.
- Fault tolerance: Built-in failure tolerance mechanisms ensure dependable data processing. Resilient Distributed Datasets (RDDs) track data lineage and allow automatic recovery from failures.
- Scalability: Spark's horizontally scalable architecture adapts to large datasets by distributing work across the cluster, harnessing the combined power of all its nodes.

Polars

Polars is a Python library built on top of Rust, combining the flexibility and user-friendliness of Python with the speed and scalability of Rust. Rust is a low-level language that prioritizes performance, reliability, and productivity; it is memory efficient and delivers performance on par with C and C++. Polars also uses the Apache Arrow columnar format to execute vectorized queries; Apache Arrow is a cross-language development platform for fast in-memory processing.
Polars executes tabular data manipulation, analysis, and transformation operations almost instantaneously, which favors its use with large datasets. Moreover, its syntax is SQL-like, making complex data processing easy to express. Another capability is its laziness, which defers query evaluation and applies query optimization.

Benchmarking Setup

Here is a link to the GitHub project with all the information. There are four notebooks, one per tool (two for Polars, to test eager and lazy evaluation). The code extracts the execution time for the following tasks (a minimal sketch of the timing approach appears after the conclusion):

- Reading
- Filtering
- Aggregations
- Joining
- Writing

There are five datasets with sizes of 50,000, 250,000, 1,000,000, 5,000,000, and 25,000,000 rows. The idea is to test different scenarios and sizes. The data used for this test is a financial dataset from Kaggle. The tests were executed on:

- macOS Sonoma
- Apple M1 Pro
- 32 GB

Table of Execution Times (Seconds)

| Row Size | Pandas | Polars Eager | Polars Lazy | PySpark |
|---|---|---|---|---|
| 50,000 rows | 0.368 | 0.132 | 0.078 | 1.216 |
| 250,000 rows | 1.249 | 0.096 | 0.156 | 0.917 |
| 1,000,000 rows | 4.899 | 0.302 | 0.300 | 1.850 |
| 5,000,000 rows | 24.320 | 1.605 | 1.484 | 7.372 |
| 25,000,000 rows | 187.383 | 13.001 | 11.662 | 44.724 |

Analysis

Pandas performed poorly, especially as dataset sizes increased, although it handled small datasets with decent execution times. PySpark, while executed on a single machine, shows considerable improvement over Pandas as the dataset size grows. Polars, in both eager and lazy configurations, significantly outperforms the other tools, showing improvements of up to 95-97% compared to Pandas and 70-75% compared to PySpark, confirming its efficiency in handling large datasets on a single machine.

Visual Representations

These visual aids help underline the relative efficiencies of the different tools across various test conditions.

Conclusion

The benchmarking results offer a clear insight into the performance scalability of four widely used data processing tools across varying dataset sizes. From the analysis, several critical conclusions emerge:

- Pandas performance scalability: Popular for data manipulation on smaller datasets, Pandas struggles significantly as data volume increases, indicating it is not the best fit for high-volume data. However, its integration with a wide range of machine learning and statistical libraries makes it indispensable for data science teams.
- Efficiency of Polars: Both Polars configurations (eager and lazy) demonstrate exceptional performance across all tested scales, outperforming both Pandas and PySpark by a wide margin, making Polars an efficient tool capable of processing large datasets. However, Polars has not yet released a major version of its Python package, and until it does, I don't recommend it for production systems.
- Tool selection strategy: The findings underscore the importance of selecting the right tool based on the specific needs of the project and the available resources. For small to medium-sized datasets, Polars offers a significant performance advantage. For large-scale distributed processing, PySpark remains a robust option.
- Future considerations: As dataset sizes continue to grow and processing demands increase, the choice of data processing tools will become more critical. Tools like Polars, built on Rust, are emerging, and their results deserve consideration. The tendency to use Spark as the solution for processing everything is also fading, and these tools are taking its place when there is no need for large-scale distributed systems.
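As promised above, here is a minimal sketch of the kind of timing harness used for one of the benchmark tasks (an aggregation). It is illustrative only: the file name and column names are placeholders rather than the actual Kaggle dataset schema.

```python
import time
import pandas as pd
import polars as pl

def timed(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Placeholder file and columns; the real benchmark uses a financial dataset from Kaggle.
pandas_time = timed(lambda: pd.read_csv("transactions.csv").groupby("category")["amount"].sum())

eager_time = timed(lambda: pl.read_csv("transactions.csv").group_by("category").agg(pl.col("amount").sum()))

# Lazy mode only builds a query plan until .collect() is called,
# which lets Polars optimize the whole plan before executing it.
lazy_time = timed(
    lambda: pl.scan_csv("transactions.csv")
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect()
)

print(f"pandas: {pandas_time:.3f}s | polars eager: {eager_time:.3f}s | polars lazy: {lazy_time:.3f}s")
```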
Use the right tool for the right job!
As developers and engineers, we constantly seek ways to streamline our workflows, increase productivity, and solve complex problems efficiently. With the advent of advanced language models like ChatGPT, we now have powerful tools to assist us in our daily tasks. By leveraging the capabilities of ChatGPT, we can craft prompts that enhance our productivity and creativity, making us more effective problem solvers and innovators. In this article, we'll explore 10 ChatGPT prompts tailored specifically for developers and engineers to boost their productivity and streamline their workflow.

Code Refactoring Suggestions

Here is the sample prompt: "I have a code that needs refactoring. Can you provide suggestions to improve its readability and efficiency? Here is the code: <paste or write code here>"

Use ChatGPT to generate recommendations for refactoring code snippets, such as identifying redundant lines, suggesting better variable names, or proposing alternative algorithms to optimize performance.

Troubleshooting Assistance

Here is the sample prompt: "I'm encountering an error message [insert error message here] in my code. Can you help me troubleshoot and find a solution?"

This prompt will help you troubleshoot bugs or issues in your code. It may take a couple of iterations to really nail down the problem, but this is a good starting prompt.

API Documentation Retrieval

Here is the prompt: "I'm working with the [insert API name] API. Can you provide me with relevant documentation or usage examples?"

This is really helpful when working with new systems or platforms: instead of reading all the documentation, you can ask ChatGPT to retrieve the useful information for you in a summarized way.

Design Pattern Recommendations

Here is the prompt: "I'm designing a new software component. Here is the requirement: [put your requirement here]. What design pattern would you recommend for implementing [insert functionality]?"

This prompt requires a good level of detail, but it can recommend some of the best design patterns for your problem.

Algorithm Optimization Techniques

Here is the prompt: "I'm implementing [insert algorithm name]. Are there any optimization techniques or best practices I should consider?"

This is not limited to algorithms; you can supply code as well. In short, this prompt will help you optimize the algorithm or code.

Code Review Feedback

Here is the prompt: "I've written a new feature. Can you review my code and provide feedback on potential improvements? Here is the code: [insert code here]"

ChatGPT can provide some really good feedback about your code. You may or may not act on all of it, but it can certainly be a good starting point.

Library or Framework Recommendations

Here is the prompt: "I'm starting a new project. Can you recommend a suitable [insert programming language] library or framework for [insert functionality]?"

ChatGPT can suggest popular libraries, frameworks, and tools based on the programming language and desired functionality, enabling you to make informed technology choices.

Technical Documentation Summaries

Here is the prompt: "I need a summary of the [insert technology or concept] technical documentation. Can you provide a concise overview?"

This is my most-used prompt: I summarize the technical documentation and read the gist, which has certainly improved my productivity.
Code Snippet Generation

Here is the prompt: "I need a code snippet for [insert functionality or task]. Can you generate a sample code snippet?"

This is a good prompt for generating a starter code pack. But don't just copy the code and use it; be cautious with code generated by LLMs, as it can contain security flaws and bugs.

Project Planning and Task Prioritization

Here is the prompt: "I'm planning my project roadmap. Can you suggest a prioritized list of tasks based on [insert project requirements or constraints]?"

ChatGPT can analyze project requirements, dependencies, and deadlines to generate a prioritized task list, helping you effectively manage project timelines and deliverables.

Conclusion

Incorporating ChatGPT prompts into your development workflow can significantly enhance productivity, creativity, and problem-solving capabilities. By leveraging ChatGPT's natural language understanding and generation capabilities, developers and engineers can streamline tasks such as code refactoring, troubleshooting, documentation retrieval, and project planning. By integrating ChatGPT into your toolkit, you empower yourself to tackle challenges more effectively and unlock new levels of innovation in your projects.
In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering.

Kernel Debugging

Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks.

Tools and Techniques

- GDB (GNU Debugger): GDB can debug kernel modules and the Linux kernel. It allows setting breakpoints, stepping through the code, and inspecting variables. The official GNU Debugger documentation provides a comprehensive overview of its features.
- KGDB: The kernel debugger allows the kernel to be debugged using GDB over a serial connection or a network. The kernel documentation "Using kgdb, kdb, and the kernel debugger internals" provides a detailed explanation of how kgdb can be enabled and configured.
- Dynamic Debugging (dyndbg): Linux's dynamic debug feature enables real-time debugging messages that help trace kernel operations without rebooting the system. The official Dynamic Debug page describes how to use the dynamic debug (dyndbg) feature.

Tracing System Calls With strace

strace is a powerful diagnostic tool that monitors the system calls a program makes and the signals it receives. It is instrumental in understanding the interaction between applications and the Linux kernel.

Usage

strace can be attached to a running process, or it can start a new process under strace. It logs all system calls, which can be analyzed to find faults in system operations. Example:

```shell
root@ubuntu:~# strace -p 2009
strace: Process 2009 attached
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
```

In the example above, the -p flag attaches strace to a process, and 2009 is the PID. Similarly, you can use the -o flag to log the output to a file instead of dumping everything on the screen. You can review the following article to understand system calls on Linux with strace.

Performance Analysis With perf

perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events.
Key Features

- perf record: Gathers performance data into a file, perf.data, which can be further analyzed using perf report to identify hotspots.
- perf report: Analyzes the data collected by perf record and displays where most of the time was spent, helping identify performance bottlenecks.
- Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately.

Example:

```shell
root@ubuntu:/tmp# perf record
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ]
root@ubuntu:/tmp#
root@ubuntu:/tmp# perf report
Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000
Overhead  Command          Shared Object      Symbol
  17.74%  swapper          [kernel.kallsyms]  [k] cpuidle_idle_call
   8.36%  stress           [kernel.kallsyms]  [k] __do_softirq
   7.17%  stress           [kernel.kallsyms]  [k] finish_task_switch.isra.0
   6.90%  stress           [kernel.kallsyms]  [k] el0_da
   5.73%  stress           libc.so.6          [.] random_r
   3.92%  stress           [kernel.kallsyms]  [k] flush_end_io
   3.87%  stress           libc.so.6          [.] random
   3.71%  stress           libc.so.6          [.] 0x00000000001405bc
   2.71%  kworker/0:2H-kb  [kernel.kallsyms]  [k] ata_scsi_queuecmd
   2.58%  stress           libm.so.6          [.] __sqrt_finite
   2.45%  stress           stress             [.] 0x0000000000000f14
   1.62%  stress           stress             [.] 0x000000000000168c
   1.46%  stress           [kernel.kallsyms]  [k] __pi_clear_page
   1.37%  stress           libc.so.6          [.] rand
   1.34%  stress           libc.so.6          [.] 0x00000000001405c4
   1.22%  stress           stress             [.] 0x0000000000000e94
   1.20%  stress           [kernel.kallsyms]  [k] folio_batch_move_lru
   1.20%  stress           stress             [.] 0x0000000000000f10
   1.16%  stress           libc.so.6          [.] 0x00000000001408d4
   0.84%  stress           [kernel.kallsyms]  [k] handle_mm_fault
   0.77%  stress           [kernel.kallsyms]  [k] release_pages
   0.65%  stress           [kernel.kallsyms]  [k] super_lock
   0.62%  stress           [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   0.61%  stress           [kernel.kallsyms]  [k] blk_done_softirq
   0.61%  stress           [kernel.kallsyms]  [k] _raw_spin_lock
   0.60%  stress           [kernel.kallsyms]  [k] folio_add_lru
   0.58%  kworker/0:2H-kb  [kernel.kallsyms]  [k] finish_task_switch.isra.0
   0.55%  stress           [kernel.kallsyms]  [k] __rcu_read_lock
   0.52%  stress           [kernel.kallsyms]  [k] percpu_ref_put_many.constprop.0
   0.46%  stress           stress             [.] 0x00000000000016e0
   0.45%  stress           [kernel.kallsyms]  [k] __rcu_read_unlock
   0.45%  stress           [kernel.kallsyms]  [k] dynamic_might_resched
   0.42%  stress           [kernel.kallsyms]  [k] _raw_spin_unlock
   0.41%  stress           [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
   0.40%  stress           [kernel.kallsyms]  [k] mas_walk
   0.39%  stress           [kernel.kallsyms]  [k] arch_counter_get_cntvct
   0.39%  stress           [kernel.kallsyms]  [k] rwsem_read_trylock
   0.39%  stress           [kernel.kallsyms]  [k] up_read
   0.38%  stress           [kernel.kallsyms]  [k] down_read
   0.37%  stress           [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
   0.36%  stress           [kernel.kallsyms]  [k] free_unref_page_commit
   0.34%  stress           [kernel.kallsyms]  [k] memset
   0.32%  stress           libc.so.6          [.] 0x00000000001408c8
   0.30%  stress           [kernel.kallsyms]  [k] sync_inodes_sb
   0.29%  stress           [kernel.kallsyms]  [k] iterate_supers
   0.29%  stress           [kernel.kallsyms]  [k] percpu_counter_add_batch
```

Real-Time Data Gathering With eBPF

eBPF allows for creating small programs that run in the Linux kernel in a sandboxed environment. These programs can track system calls and network messages, providing real-time insights into system behavior.

Applications

- Network monitoring: eBPF can monitor network traffic in real time, providing insights into packet flow and protocol usage without significant performance overhead.
- Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities.
- Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance.

Conclusion

Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.
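As a closing illustration of the eBPF approach described above, here is a minimal sketch using the BCC Python bindings. It is illustrative rather than production-ready: it assumes the bcc toolkit and kernel headers are installed, and the choice of syscall to probe is arbitrary.

```python
import time
from bcc import BPF

# Minimal eBPF program (C, compiled by BCC at runtime): count clone()
# syscalls per PID in a kernel-side hash map.
program = r"""
BPF_HASH(counts, u32, u64);

int trace_clone(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

print("Counting clone() calls per PID; press Ctrl-C to stop...")
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    # Read the kernel-side map from user space and print the counters.
    for pid, count in b["counts"].items():
        print(f"pid {pid.value}: {count.value} clone() calls")
```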
In this article, we'll explore how to build intelligent AI agents using Azure OpenAI and Semantic Kernel (Microsoft's C# SDK). You can combine it with OpenAI, Azure OpenAI, Hugging Face, or any other model. We'll cover the fundamentals, dive into implementation details, and provide practical code examples in C#. Whether you're a beginner or an experienced developer, this guide will help you harness the power of AI for your applications.

What Is Semantic Kernel?

In Kevin Scott's talk on "The era of the AI copilot," he showcased how Microsoft's Copilot system uses a mix of AI models and plugins to enhance user experiences. At the core of this setup is an AI orchestration layer, which allows Microsoft to combine these AI components to create innovative features for users. For developers looking to create their own copilot-like experiences using AI plugins, Microsoft has introduced Semantic Kernel.

Semantic Kernel is an open-source framework that enables developers to build intelligent agents by providing a common interface for various AI models and algorithms. The Semantic Kernel SDK lets you integrate the power of large language models (LLMs) into your own applications: developers can send prompts to LLMs, consume their results, and potentially craft their own copilot-like experiences. It allows developers to focus on building intelligent applications without worrying about the underlying complexities of AI models. Semantic Kernel is built on top of the .NET ecosystem and provides a robust and scalable platform for building intelligent apps and agents.

Figure courtesy of Microsoft

Key Features of Semantic Kernel

- Modular architecture: Semantic Kernel has a modular architecture that allows developers to easily integrate new AI models and algorithms.
- Knowledge graph: Semantic Kernel provides a built-in knowledge graph that enables developers to store and query complex relationships between entities.
- Machine learning: Semantic Kernel supports various machine learning algorithms, including classification, regression, and clustering.
- Natural language processing: Semantic Kernel provides natural language processing capabilities, including text analysis and sentiment analysis.
- Integration with external services: Semantic Kernel allows developers to integrate with external services, such as databases and web services.

Let's dive into writing some intelligent code using the Semantic Kernel C# SDK. I will present it in steps so it is easy to follow along.

Step 1: Setting up the Environment

You will need the following to follow along:

- .NET 8 or later
- Semantic Kernel SDK (available on NuGet)
- Your preferred IDE (Visual Studio, Visual Studio Code, etc.)
- Azure OpenAI access

Step 2: Creating a New Project in VS

Open Visual Studio and create a blank, empty .NET 8 console application.

Step 3: Install NuGet References

Right-click on the project and choose Manage NuGet Packages to install the latest versions of the following two packages:

1. Microsoft.SemanticKernel
2. Microsoft.Extensions.Configuration.Json

Note: To avoid hardcoding the Azure OpenAI key and endpoint, I store these as key-value pairs in appsettings.json; using the second package, I can easily retrieve them by key.

Step 4: Create and Deploy an Azure OpenAI Model

Once you have obtained access to the Azure OpenAI service, log in to the Azure portal or Azure OpenAI Studio to create an Azure OpenAI resource.
The screenshots below are from the Azure portal. You can also create an Azure OpenAI service resource using the Azure CLI by running the following command:

```powershell
az cognitiveservices account create -n <nameoftheresource> -g <Resourcegroupname> -l <location> \
    --kind OpenAI --sku S0 --subscription <subscriptionID>
```

You can also see your resource in Azure OpenAI Studio by navigating to the resources page and selecting the resource that was created.

Deploy a Model

Azure OpenAI includes several types of base models, shown in the studio when you navigate to the Deployments tab. You can also create your own custom models from existing base models, as per your requirements. Let's use the deployed GPT-35-turbo model and see how to consume it from Azure OpenAI Studio. Fill in the details and click Create. Once the model is deployed, grab the Azure OpenAI key and endpoint and paste them inside the appsettings.json file.

Step 5: Create Kernel in the Code

Step 6: Create a Plugin to Call the Azure OpenAI Model

Step 7: Use Kernel To Invoke the LLM Models

Once you run the program by pressing F5, you will see the response generated from the Azure OpenAI model.

Complete Code

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.SemanticKernel;

// Load the Azure OpenAI settings from appsettings.json.
var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: true)
    .Build();

// Register the Azure OpenAI chat completion service and build the kernel.
var builder = Kernel.CreateBuilder();
builder.Services.AddAzureOpenAIChatCompletion(
    deploymentName: config["AzureOpenAI:DeploymentModel"] ?? string.Empty,
    endpoint: config["AzureOpenAI:Endpoint"] ?? string.Empty,
    apiKey: config["AzureOpenAI:ApiKey"] ?? string.Empty);
var semanticKernel = builder.Build();

// Invoke the LLM with a prompt and print the response.
Console.WriteLine(await semanticKernel.InvokePromptAsync("Give me shopping list for cooking Sushi"));
```

Conclusion

By combining AI LLM models with Semantic Kernel, you'll create intelligent applications that go beyond simple keyword matching. Experiment, iterate, and keep learning to build remarkable apps that truly understand and serve your needs.
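For reference, a minimal appsettings.json matching the configuration keys read in the code above could look like the following; all values are placeholders to replace with your own deployment name, endpoint, and key:

```json
{
  "AzureOpenAI": {
    "DeploymentModel": "gpt-35-turbo",
    "Endpoint": "https://<your-resource-name>.openai.azure.com/",
    "ApiKey": "<your-azure-openai-key>"
  }
}
```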
SQL Server serves as a robust solution for handling and examining extensive amounts of data. Nevertheless, as databases expand into intricate structures, slow queries may become a notable concern, impacting the effectiveness of your applications and user satisfaction. This piece will delve into effective approaches for pinpointing and enhancing slow queries within SQL Server, guaranteeing optimal operational performance of your database.

Identifying Slow Queries

1. Utilize SQL Server Management Studio (SSMS)

Activity Monitor: Launch SSMS, establish a connection to your server, right-click on the server name, and choose Activity Monitor. Review the Recent Expensive Queries section to pinpoint queries that are utilizing a significant amount of resources.

Data Collection Reports: Configure data collection to gather system data that can help in identifying troublesome queries. Go to Management -> Data Collection and configure the data collection sets. You can access reports later by right-clicking on Data Collection and selecting Reports.

Before proceeding, we will first create the sample database. Then follow the steps below to insert the sample data, explore the views and stored procedures, and optimize the query.

```sql
CREATE DATABASE IFCData;
GO
USE IFCData;
GO
CREATE TABLE Flights (
    FlightID INT PRIMARY KEY,
    FlightNumber VARCHAR(10),
    DepartureAirportCode VARCHAR(3),
    ArrivalAirportCode VARCHAR(3),
    DepartureTime DATETIME,
    ArrivalTime DATETIME
);
GO
CREATE TABLE Passengers (
    PassengerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100)
);
GO
CREATE TABLE ServicesUsed (
    ServiceID INT PRIMARY KEY,
    PassengerID INT,
    FlightID INT,
    ServiceType VARCHAR(50),
    UsageTime DATETIME,
    DurationMinutes INT,
    FOREIGN KEY (PassengerID) REFERENCES Passengers(PassengerID),
    FOREIGN KEY (FlightID) REFERENCES Flights(FlightID)
);
GO
```

Next, insert the sample data that will be used in the examples below. Here is the code to copy, paste, and run:
```sql
-- Inserting data into Flights
INSERT INTO Flights VALUES
(1, 'UA123', 'SFO', 'LAX', '2024-05-01 08:00:00', '2024-05-01 09:30:00'),
(2, 'AA456', 'NYC', 'MIA', '2024-05-01 09:00:00', '2024-05-01 12:00:00'),
(3, 'DL789', 'LAS', 'SEA', '2024-05-02 07:00:00', '2024-05-02 09:00:00'),
(4, 'UA123', 'LAX', 'SFO', '2024-05-02 10:00:00', '2024-05-02 11:30:00'),
(5, 'AA456', 'MIA', 'NYC', '2024-05-02 13:00:00', '2024-05-02 16:00:00'),
(6, 'DL789', 'SEA', 'LAS', '2024-05-03 08:00:00', '2024-05-03 10:00:00'),
(7, 'UA123', 'SFO', 'LAX', '2024-05-03 12:00:00', '2024-05-03 13:30:00'),
(8, 'AA456', 'NYC', 'MIA', '2024-05-03 17:00:00', '2024-05-03 20:00:00'),
(9, 'DL789', 'LAS', 'SEA', '2024-05-04 07:00:00', '2024-05-04 09:00:00'),
(10, 'UA123', 'LAX', 'SFO', '2024-05-04 10:00:00', '2024-05-04 11:30:00'),
(11, 'AA456', 'MIA', 'NYC', '2024-05-04 13:00:00', '2024-05-04 16:00:00'),
(12, 'DL789', 'SEA', 'LAS', '2024-05-05 08:00:00', '2024-05-05 10:00:00');

-- Inserting data into Passengers
INSERT INTO Passengers VALUES
(1, 'Vikay', 'Singh', 'johndoe@example.com'),
(2, 'Mario', 'Smith', 'janesmith@example.com'),
(3, 'Alice', 'Johnson', 'alicejohnson@example.com'),
(4, 'Bob', 'Brown', 'bobbrown@example.com'),
(5, 'Carol', 'Davis', 'caroldavis@example.com'),
(6, 'David', 'Martinez', 'davidmartinez@example.com'),
(7, 'Eve', 'Clark', 'eveclark@example.com'),
(8, 'Frank', 'Lopez', 'franklopez@example.com'),
(9, 'Grace', 'Harris', 'graceharris@example.com'),
(10, 'Harry', 'Lewis', 'harrylewis@example.com'),
(11, 'Ivy', 'Walker', 'ivywalker@example.com'),
(12, 'Jack', 'Hall', 'jackhall@example.com');

-- Inserting data into ServicesUsed
INSERT INTO ServicesUsed VALUES
(1, 1, 1, 'WiFi', '2024-05-01 08:30:00', 60),
(2, 2, 1, 'Streaming', '2024-05-01 08:45:00', 30),
(3, 3, 3, 'WiFi', '2024-05-02 07:30:00', 90),
(4, 4, 4, 'WiFi', '2024-05-02 10:30:00', 60),
(5, 5, 5, 'Streaming', '2024-05-02 13:30:00', 120),
(6, 6, 6, 'Streaming', '2024-05-03 08:30:00', 110),
(7, 7, 7, 'WiFi', '2024-05-03 12:30:00', 90),
(8, 8, 8, 'WiFi', '2024-05-03 17:30:00', 80),
(9, 9, 9, 'Streaming', '2024-05-04 07:30:00', 95),
(10, 10, 10, 'Streaming', '2024-05-04 10:30:00', 85),
(11, 11, 11, 'WiFi', '2024-05-04 13:30:00', 75),
(12, 12, 12, 'WiFi', '2024-05-05 08:30:00', 65);
```

2. Dynamic Management Views (DMVs)

DMVs provide a way to gain insights into the health of a SQL Server instance. To identify slow-running queries that could be affecting your IFCData database performance, you can use the sys.dm_exec_query_stats, sys.dm_exec_sql_text, and sys.dm_exec_query_plan DMVs:

```sql
SELECT TOP 10
    qs.total_elapsed_time / qs.execution_count AS avg_execution_time,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY avg_execution_time DESC;
```

This query provides a snapshot of the most resource-intensive queries by average execution time, helping you pinpoint areas where query optimization could improve performance.

Enhancing Performance

Advanced Query Optimization Techniques: Enhance Join Performance

Join operations play a crucial role in database tasks, particularly when dealing with extensive tables. By optimizing the join conditions and the sequence in which tables are joined, it is possible to greatly reduce query execution time.
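Because the sample schema above declares primary and foreign keys but no supporting indexes, a common first step is to index the join and filter columns. Here is a hedged sketch; the index names are illustrative, and any new index should be validated against your actual workload:

```sql
-- Illustrative supporting indexes for the joins and filter used in the query below.
CREATE NONCLUSTERED INDEX IX_ServicesUsed_PassengerID ON ServicesUsed (PassengerID);
CREATE NONCLUSTERED INDEX IX_ServicesUsed_FlightID ON ServicesUsed (FlightID);
CREATE NONCLUSTERED INDEX IX_Flights_DepartureAirportCode ON Flights (DepartureAirportCode);
```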
In order to derive valuable insights from the various tables in the IFCData database, it is essential to use appropriate SQL joins. By linking passenger details with flights and the services used, a comprehensive picture can be obtained. Here is how to join the Flights, Passengers, and ServicesUsed tables for in-depth analysis:

```sql
SELECT
    p.FirstName,
    p.LastName,
    p.Email,
    f.FlightNumber,
    f.DepartureAirportCode,
    f.ArrivalAirportCode,
    s.ServiceType,
    s.UsageTime,
    s.DurationMinutes
FROM Passengers p
JOIN ServicesUsed s ON p.PassengerID = s.PassengerID
JOIN Flights f ON s.FlightID = f.FlightID
WHERE f.DepartureAirportCode = 'SFO'; -- Example condition to filter by departure airport
```

This query efficiently merges data from the three tables, offering a comprehensive overview of the flight details and services used by each passenger, with a filter applied for a specific departure airport. Such a query is valuable for analyzing passenger behavior, service usage patterns, and operational efficiency.

Performance Tuning Tools

1. SQL Server Profiler

SQL Server Profiler captures and analyzes database events. This tool is essential for identifying slow-running queries and understanding how queries interact with the database. Example: set up a trace to capture query execution times:

- Start SQL Server Profiler.
- Create a new trace and select the events you want to capture, such as SQL:BatchCompleted.
- Add a filter to capture only events where the duration is greater than a specific threshold, e.g., 1,000 milliseconds.
- Run the trace during a period of typical usage to gather data on any queries that exceed your threshold.

2. Database Engine Tuning Advisor (DTA)

Database Engine Tuning Advisor analyzes workloads and recommends changes to indexes, indexed views, and partitioning. Example: to use DTA, you first need to capture a workload in a file or table. Here's how to use it with a file:

- Capture a workload using SQL Server Profiler.
- Save the workload to a file.
- Open DTA, connect to your server, and select the workload file.
- Configure the analysis, specifying the databases to tune and the types of recommendations you're interested in.
- Run the analysis. DTA will propose changes such as creating new indexes or modifying existing ones to optimize performance.

3. Query Store

Query Store collects detailed performance information about queries, making it easier to monitor performance variations and understand the impact of changes. Example: enable and configure Query Store for a database whose queries intermittently perform poorly:

```sql
-- Enable Query Store for the IFCData database
ALTER DATABASE IFCData SET QUERY_STORE = ON;

-- Configure Query Store settings
ALTER DATABASE IFCData SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,                        -- Allows Query Store to capture query information
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), -- Data older than 30 days will be cleaned up
    DATA_FLUSH_INTERVAL_SECONDS = 900,                  -- Data is written to disk every 15 minutes
    INTERVAL_LENGTH_MINUTES = 60,                       -- Aggregated in 60-minute intervals
    MAX_STORAGE_SIZE_MB = 500,                          -- Limits the storage size of Query Store data to 500 MB
    QUERY_CAPTURE_MODE = AUTO                           -- Captures all queries that are significant based on internal algorithms
);
```

Upon activation, Query Store begins collecting data about query execution, which can be examined through a range of reports accessible in SQL Server Management Studio (SSMS).
Below are a few essential queries that can be used to analyze Query Store data for the IFCData database.

1. Queries with high resource consumption: Detect queries that use a significant amount of resources, helping identify where performance enhancements are needed.

```sql
SELECT TOP 10
    qs.query_id,
    qsp.query_sql_text,
    rs.avg_cpu_time,
    rs.avg_logical_io_reads,
    rs.avg_duration,
    rs.count_executions
FROM sys.query_store_plan AS qp
JOIN sys.query_store_query AS qs ON qp.query_id = qs.query_id
JOIN sys.query_store_query_text AS qsp ON qs.query_text_id = qsp.query_text_id
JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id
ORDER BY rs.avg_cpu_time DESC;
```

2. Analyzing query performance decline: Assess the performance of queries across various periods to identify any regressions.

```sql
SELECT
    rs.start_time,
    rs.end_time,
    qp.query_plan,
    rs.avg_duration
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS qp ON rs.plan_id = qp.plan_id
WHERE qp.query_id = YOUR_QUERY_ID -- Specify the query ID you want to analyze
ORDER BY rs.start_time;
```

3. Monitoring changes in query plans: Track alterations in query plans over time for a particular query, which helps explain performance fluctuations.

```sql
SELECT
    qp.plan_id,
    qsp.query_sql_text,
    qp.last_execution_time
FROM sys.query_store_plan AS qp
JOIN sys.query_store_query AS qs ON qp.query_id = qs.query_id
JOIN sys.query_store_query_text AS qsp ON qs.query_text_id = qsp.query_text_id
WHERE qs.query_id = 1 -- Specify the query ID you want to analyze
ORDER BY qp.last_execution_time DESC;
```

I am using query_id = 1 here; in your case, it can be any number.

Conclusion

By systematically identifying slow queries and applying targeted optimization techniques, you can significantly enhance the performance of your SQL Server databases. Regular monitoring and maintenance are key to sustaining these performance gains over time. With the right tools and techniques, you can transform your SQL Server into a high-performing, efficient database management system.

Further Reading

- Learn DMVs
- Best practices to monitor the query load
- Performing DBCC CHECKDB
For a long time, AWS CloudTrail has been the foundational technology that enabled organizations to meet compliance requirements by capturing audit logs for all AWS API invocations. CloudTrail Lake extends CloudTrail's capabilities by adding support for a SQL-like query language to analyze audit events. The audit events are stored in a columnar format called ORC to enable high-performance SQL queries. An important capability of CloudTrail Lake is the ability to ingest audit logs from custom applications or partner SaaS applications. With this capability, an organization can get a single aggregated view of audit events across AWS API invocations and their enterprise applications. As each end-to-end business process can span multiple enterprise applications, an aggregated view of audit events across them becomes a critical need. This article discusses an architectural approach to leveraging CloudTrail Lake for auditing enterprise applications, along with the corresponding design considerations.

Architecture

Let us start by taking a look at the architecture diagram. This architecture uses SQS queues and AWS Lambda functions to provide an asynchronous and highly concurrent model for disseminating audit events from the enterprise application. At important steps in business transactions, the application calls the relevant AWS SDK APIs to send the audit event details as a message to the audit event SQS queue. A Lambda function is associated with the SQS queue so that it is triggered whenever a message is added to the queue. It calls the putAuditEvents() API provided by CloudTrail Lake to ingest audit events into the event data store configured for this enterprise application. Note that the architecture shows two other event data stores to illustrate that events from the enterprise application can be correlated with events in the other data stores.

Required Configuration

Start by creating an event data store that accepts events of category ActivityAuditLog. Note down the ARN of the event data store created; it will be needed for creating an integration channel.

```shell
aws cloudtrail create-event-data-store \
    --name custom-events-datastore \
    --no-multi-region-enabled \
    --retention-period 90 \
    --advanced-event-selectors '[
        {
            "Name": "Select all external events",
            "FieldSelectors": [
                { "Field": "eventCategory", "Equals": ["ActivityAuditLog"] }
            ]
        }
    ]'
```

Create an integration with the source "My Custom Integration" and choose the delivery location as the event data store created in the previous step. Note the ARN of the channel created; it will be needed for coding the Lambda function.

```shell
aws cloudtrail create-channel \
    --region us-east-1 \
    --destinations '[{"Type": "EVENT_DATA_STORE", "Location": "<event data store arn>"}]' \
    --name custom-events-channel \
    --source Custom
```

Create a Lambda function that contains the logic to receive messages from an SQS queue, transform each message into an audit event, and send it to the channel created in the previous step using the putAuditEvents() API. Refer to the next section for the main steps to include in the Lambda function logic. Add permissions through an inline policy so that the Lambda function is authorized to put audit events into the integration channel:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "cloudtrail-data:PutAuditEvents",
            "Resource": "<channel arn>"
        }
    ]
}
```

Create an SQS queue of type "Standard" with an associated dead-letter queue.
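If you prefer the CLI for this step as well, the queue and its dead-letter queue can be created along these lines; the queue names and maxReceiveCount are illustrative:

```shell
# Create the dead-letter queue first so its ARN can be referenced.
aws sqs create-queue --queue-name audit-events-dlq

# Create the standard queue with a redrive policy pointing at the DLQ.
aws sqs create-queue \
    --queue-name audit-events-queue \
    --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"<dead-letter queue arn>\",\"maxReceiveCount\":\"5\"}"}'
```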
Add permissions to the Lambda function using an inline policy to allow receiving messages from the SQS queue:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "sqs:*",
            "Resource": "<SQS Queue arn>"
        }
    ]
}
```

In the Lambda function configuration, add a trigger by choosing "SQS" as the source and specifying the ARN of the SQS queue created in the previous step. Ensure that the "Report batch item failures" option is selected. Finally, ensure that permissions to send messages to this queue are added to the IAM role assigned to your enterprise application.

Lambda Function Code

The code sample focuses on the Lambda function, as it is at the crux of the solution. (The imports, the channelARN field, and the transformToEventData() stub below complete the original fragment so it compiles; the placeholder values must be replaced with your own.)

```java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.cloudtraildata.AWSCloudTrailData;
import com.amazonaws.services.cloudtraildata.AWSCloudTrailDataClientBuilder;
import com.amazonaws.services.cloudtraildata.model.AuditEvent;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsRequest;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsResult;
import com.amazonaws.services.cloudtraildata.model.ResultErrorEntry;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse.BatchItemFailure;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.amazonaws.services.lambda.runtime.events.SQSEvent.SQSMessage;

public class CustomAuditEventHandler implements RequestHandler<SQSEvent, SQSBatchResponse> {

    // ARN of the integration channel created earlier (placeholder).
    private static final String channelARN = "<channel arn>";

    public SQSBatchResponse handleRequest(final SQSEvent event, final Context context) {
        List<SQSMessage> records = event.getRecords();
        AWSCloudTrailData client = AWSCloudTrailDataClientBuilder.defaultClient();
        PutAuditEventsRequest request = new PutAuditEventsRequest();
        List<AuditEvent> auditEvents = new ArrayList<AuditEvent>();
        request.setChannelArn(channelARN);
        for (SQSMessage record : records) {
            AuditEvent auditEvent = new AuditEvent();
            // Add logic in the transformToEventData() operation to transform contents of
            // the message to the event data format needed by CloudTrail Lake.
            String eventData = transformToEventData(record);
            context.getLogger().log("Event Data JSON: " + eventData);
            auditEvent.setEventData(eventData);
            // Set a source event ID. This could be useful to correlate the event
            // data stored in CloudTrail Lake to relevant information in the enterprise
            // application.
            auditEvent.setId(record.getMessageId());
            auditEvents.add(auditEvent);
        }
        request.setAuditEvents(auditEvents);
        PutAuditEventsResult putAuditEvents = client.putAuditEvents(request);
        context.getLogger().log("Put Audit Event Results: " + putAuditEvents.toString());
        SQSBatchResponse response = new SQSBatchResponse();
        List<BatchItemFailure> failures = new ArrayList<SQSBatchResponse.BatchItemFailure>();
        for (ResultErrorEntry result : putAuditEvents.getFailed()) {
            BatchItemFailure batchItemFailure = new BatchItemFailure(result.getId());
            failures.add(batchItemFailure);
            context.getLogger().log("Failed Event ID: " + result.getId());
        }
        response.setBatchItemFailures(failures);
        return response;
    }

    // Placeholder: transform the SQS message body into the event data format
    // expected by putAuditEvents().
    private String transformToEventData(SQSMessage record) {
        return record.getBody();
    }
}
```

The first thing to note is that the type specification for the class uses SQSBatchResponse, as we want the audit event messages to be processed as batches. Each enterprise application will have its own format for representing audit messages. The logic to transform the messages into the format required by the CloudTrail Lake data schema should be part of the Lambda function. This allows the same architecture to be used even if the audit events need to be ingested into a different (SIEM) tool instead of CloudTrail Lake. Apart from the event data itself, the putAuditEvents() API of CloudTrail Lake expects a source event ID to be provided for each event. This can be used to tie the audit event stored in CloudTrail Lake to relevant information in the enterprise application. The messages that failed to be ingested should be added to the list of failed records in the SQSBatchResponse object. This ensures that all successfully processed records are deleted from the SQS queue and failed records are retried at a later time. Note that the code uses the source event ID (result.getId()) as the ID for failed records.
This is because the source event ID was set to the message ID earlier in the code. If a different identifier has to be used as the source event ID, it has to be mapped to the message ID. The mapping will help with finding the message IDs for records that were not successfully ingested while framing the Lambda function response.

Architectural Considerations

This section discusses the choices made for this architecture and the corresponding trade-offs. These need to be considered carefully while designing your solution.

FIFO vs. Standard Queues

Audit events are usually self-contained units of data, so the order in which they are ingested into CloudTrail Lake should not affect the information they convey. Hence, there is no need to use a FIFO queue to maintain the information integrity of audit events. Standard queues provide higher concurrency than FIFO queues with respect to fanning out messages to Lambda function instances. This is because, unlike FIFO queues, they do not have to maintain the order of messages at the queue or message-group level. Achieving a similar level of concurrency with FIFO queues would require increasing the complexity of the source application, as it would have to include logic to fan out messages across message groups. With standard queues, there is a small chance of multiple deliveries of the same message. This should not be a problem, as duplicates can be filtered out as part of the CloudTrail Lake queries.

SNS vs. SQS

This architecture uses SQS instead of SNS for the following reasons:

- SNS does not support triggering Lambda functions from FIFO topics.
- SQS, through its retry logic, provides better reliability with respect to delivering messages to the recipient than SNS. This is a valuable capability, especially for data as important as audit events.
- SQS can be configured to group audit events and send them to Lambda to be processed in batches. This helps with the performance and cost of the Lambda function and avoids overwhelming CloudTrail Lake with a high number of concurrent connection requests.

There are other factors to consider as well, such as the usage of private links, VPC integration, and message encryption in transit, to securely transmit audit events. The concurrency and message delivery settings provided by the SQS-Lambda integration should also be tuned based on the throughput and complexity of the audit events. The approach presented and the architectural considerations discussed provide a good starting point for using CloudTrail Lake with enterprise applications.
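As a concrete illustration of filtering duplicates at query time, a CloudTrail Lake query along the following lines could collapse repeated deliveries. This is only a sketch: the event data store ID is a placeholder, and the fields available under eventData depend on the schema of your custom events.

```sql
-- Placeholder event data store ID; eventData field names depend on your custom event schema.
SELECT eventData.eventName, COUNT(*) AS deliveries, MIN(eventTime) AS firstSeen
FROM a1b2c3d4-example-event-data-store-id
GROUP BY eventData.eventName
```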
Introduction to Secrets Management

In the world of DevSecOps, where speed, agility, and security are paramount, managing secrets effectively is crucial. Secrets, such as passwords, API keys, tokens, and certificates, are sensitive pieces of information that, if exposed, can lead to severe security breaches. To mitigate these risks, organizations are turning to secrets management solutions. These solutions help securely store, access, and manage secrets throughout the software development lifecycle, ensuring they are protected from unauthorized access and misuse. This article aims to provide an in-depth overview of secrets management in DevSecOps, covering key concepts, common challenges, best practices, and available tools.

Security Risks in Secrets Management

The absence of secrets management poses several challenges. Primarily, your organization might already have numerous secrets stored across the codebase. Apart from the ongoing risk of exposure, keeping secrets within your code promotes other insecure practices, such as reusing secrets, employing weak passwords, and neglecting to rotate or revoke secrets because of the extensive code modifications that would be needed. Below are some scenarios highlighting the potential risks of improper secrets management.

Data Breaches

If secrets are not properly managed, they can be exposed, leading to unauthorized access and potential data breaches.

Example Scenario
A Software-as-a-Service (SaaS) company uses a popular CI/CD platform to automate its software development and deployment processes. As part of their DevSecOps practices, they store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue
Unfortunately, the CI/CD platform they use experiences a security vulnerability that allows attackers to gain unauthorized access to the secrets management tool's API. This vulnerability goes undetected by the company's security monitoring systems.

Consequence
Attackers exploit the vulnerability and gain access to the secrets stored in the management tool. With these credentials, they are able to access the company's production systems and databases. They exfiltrate sensitive customer data, including personally identifiable information (PII) and financial records.

Impact
The data breach leads to significant financial losses for the company due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is tarnished, leading to a decrease in customer retention and potential business partnerships.

Preventive Measures
To prevent such data breaches, the company could have implemented the following preventive measures:

- Regularly auditing and monitoring access to the secrets management tool to detect unauthorized access.
- Implementing multi-factor authentication (MFA) for accessing the secrets management tool.
- Ensuring that the secrets management tool is regularly patched and updated to address any security vulnerabilities.
- Limiting access to secrets based on the principle of least privilege, ensuring that only authorized users and systems have access to sensitive credentials.
- Implementing strong encryption for storing secrets to mitigate the impact of unauthorized access.
- Conducting regular security assessments and penetration testing to identify and address potential security vulnerabilities in the CI/CD platform and associated tools.
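Several of the preventive measures above come down to keeping secrets out of the codebase and fetching them at runtime with tightly scoped access. As a minimal, hedged illustration, here is what that looks like with AWS Secrets Manager via boto3; the article does not prescribe a specific tool, and the secret name and region below are placeholders:

```python
import boto3

def get_database_password(secret_id: str = "prod/app/db-password") -> str:
    """Fetch a secret at runtime instead of hardcoding it in the codebase.

    The secret name and region are placeholders; access to this secret
    should be restricted by IAM to the least privilege needed.
    """
    client = boto3.client("secretsmanager", region_name="us-east-1")
    response = client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]
```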
Credential Theft

Attackers may steal secrets, such as API keys or passwords, to gain unauthorized access to systems or resources.

Example Scenario
A fintech startup uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines.

Issue
An attacker gains access to the company's internal network by exploiting a vulnerability in an outdated web server. Once inside the network, the attacker uses a variety of techniques, such as phishing and social engineering, to gain access to a developer's workstation.

Consequence
The attacker discovers that the developer has stored plaintext files containing sensitive credentials, including database passwords and API keys, on their desktop. The developer had mistakenly saved these files for convenience and had not stored them securely in the secrets management tool.

Impact
With access to the sensitive credentials, the attacker gains unauthorized access to the company's databases and other systems. They exfiltrate sensitive customer data, including financial records and personal information, leading to regulatory fines and damage to the company's reputation.

Preventive Measures
To prevent such credential theft incidents, the fintech startup could have implemented the following preventive measures:

- Educating developers and employees about the importance of securely storing credentials and the risks of leaving them in plaintext files.
- Implementing strict access controls and auditing mechanisms for accessing and managing secrets in the secrets management tool.
- Using encryption to store sensitive credentials in the secrets management tool, ensuring that even if credentials are stolen, they cannot be easily used without the decryption keys.
- Regularly rotating credentials and monitoring for unusual or unauthorized access patterns to detect potential credential theft incidents early.

Misconfiguration

Improperly configured secrets management systems can lead to accidental exposure of secrets.

Example Scenario
A healthcare organization uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines.

Issue
A developer inadvertently misconfigures the permissions on the secrets management tool, allowing unintended access to sensitive credentials. The misconfiguration occurs when the developer sets overly permissive access controls, granting access to a broader group of users than intended.

Consequence
An attacker discovers the misconfigured access controls and gains unauthorized access to the secrets management tool. With access to sensitive credentials, the attacker can now access the healthcare organization's databases and other systems, potentially leading to data breaches and privacy violations.

Impact
The healthcare organization suffers reputational damage and financial losses due to the data breach. They may also face regulatory fines for failing to protect sensitive information.

Preventive Measures
To prevent such misconfiguration incidents, the healthcare organization could have implemented the following preventive measures:

- Implementing least-privilege access controls to ensure that only authorized users and systems have access to sensitive credentials.
- Regularly auditing and monitoring access to the secrets management tool to detect and remediate misconfigurations.
- Implementing automated checks and policies to enforce proper access controls and configurations for secrets management.
- Providing training and guidance to developers and administrators on best practices for securely configuring and managing access to secrets.

Compliance Violations

Failure to properly manage secrets can lead to violations of regulations such as GDPR, HIPAA, or PCI DSS.

Example Scenario

A financial services company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines.

Issue

The financial services company fails to adhere to regulatory requirements for managing and protecting sensitive information. Specifically, they do not implement proper encryption for storing sensitive credentials and do not maintain proper access controls for managing secrets.

Consequence

Regulatory authorities conduct an audit of the company's security practices and discover compliance violations related to secrets management. The company is found to be non-compliant with regulations such as PCI DSS (Payment Card Industry Data Security Standard) and GDPR (General Data Protection Regulation).

Impact

The financial services company faces significant financial penalties for non-compliance with regulatory requirements. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences.

Preventive Measures

To prevent such compliance violations, the financial services company could have implemented the following preventive measures:
- Implementing encryption for storing sensitive credentials in the secrets management tool to ensure compliance with data protection regulations.
- Implementing strict access controls and auditing mechanisms for managing and accessing secrets to prevent unauthorized access.
- Conducting regular compliance audits and assessments to identify and address any non-compliance issues related to secrets management.

Lack of Accountability

Without proper auditing and monitoring, it can be difficult to track who accessed or modified secrets, leading to a lack of accountability.

Example Scenario

A technology company uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue

The company does not establish clear ownership and accountability for managing and protecting secrets. There is no designated individual or team responsible for ensuring that proper security practices are followed when storing and accessing secrets.

Consequence

Due to the lack of accountability, there is no oversight or monitoring of access to sensitive credentials. As a result, developers and administrators have unrestricted access to secrets, increasing the risk of unauthorized access and data breaches.

Impact

The lack of accountability leads to a data breach where sensitive credentials are exposed. The company faces financial losses due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is damaged, leading to a decrease in customer retention and potential business partnerships.
Preventive Measures

To prevent such accountability gaps, the technology company could have implemented the following preventive measures:
- Designating a specific individual or team responsible for managing and protecting secrets, including implementing and enforcing security policies and procedures.
- Implementing access controls and auditing mechanisms to monitor and track access to secrets, ensuring that only authorized users have access.
- Providing regular training and awareness programs for employees on the importance of secrets management and security best practices.
- Conducting regular security audits and assessments to identify and address any gaps in secrets management practices.

Operational Disruption

If secrets are not available when needed, the operation of DevSecOps pipelines and applications can be disrupted.

Example Scenario

A financial institution uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines.

Issue

During a routine update to the secrets management tool, a misconfiguration occurs that causes the tool to become unresponsive. As a result, developers are unable to access the sensitive credentials needed to deploy new applications and services.

Consequence

The operational disruption delays the deployment of critical updates and features, impacting the financial institution's ability to serve its customers effectively. The IT team is forced to troubleshoot the issue, leading to downtime and increased operational costs.

Impact

The operational disruption results in financial losses due to lost productivity and potential revenue. Additionally, the financial institution's reputation is damaged, leading to a loss of customer trust and potential business partnerships.

Preventive Measures

To prevent such operational disruptions, the financial institution could have implemented the following preventive measures:
- Implementing automated backups and disaster recovery procedures for the secrets management tool to quickly restore service in case of a failure.
- Conducting regular testing and monitoring of the secrets management tool to identify and address any performance issues or misconfigurations.
- Implementing a rollback plan to quickly revert to a previous version of the secrets management tool in case of a failed update or configuration change.
- Establishing clear communication channels and escalation procedures to quickly notify stakeholders and IT teams in case of an operational disruption.

Dependency on Third-Party Services

Using third-party secrets management services can introduce dependencies and potential risks if the service becomes unavailable or compromised.

Example Scenario

A software development company uses a popular CI/CD platform to automate its software development and deployment processes. They rely on a third-party secrets management tool to store sensitive credentials, such as API keys and database passwords, used in their pipelines.

Issue

The third-party secrets management tool experiences a service outage due to a cyber attack on the service provider's infrastructure. As a result, the software development company is unable to access the sensitive credentials needed to deploy new applications and services.
Consequence

The dependency on the third-party secrets management tool leads to a delay in deploying critical updates and features, impacting the software development company's ability to deliver software on time. The IT team is forced to find alternative ways to manage and store sensitive credentials temporarily.

Impact

The dependency on the third-party secrets management tool results in financial losses due to lost productivity and potential revenue. Additionally, the software development company's reputation is damaged, leading to a loss of customer trust and potential business partnerships.

Preventive Measures

To prevent such dependencies on third-party services, the software development company could have implemented the following preventive measures:
- Implementing a backup plan for storing and managing sensitive credentials locally in case of a service outage or disruption.
- Diversifying the use of secrets management tools by using multiple tools or providers to reduce the impact of a single service outage.
- Conducting regular reviews and assessments of third-party service providers to ensure they meet security and reliability requirements.
- Implementing a contingency plan to quickly switch to an alternative secrets management tool or provider in case of a service outage or disruption.

Insider Threats

Malicious insiders may abuse their access to secrets for personal gain or to harm the organization.

Example Scenario

A technology company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue

An employee with privileged access to the secrets management tool decides to leave the company and maliciously steals sensitive credentials before leaving. The employee had legitimate access to the secrets management tool as part of their job responsibilities but chose to abuse that access for personal gain.

Consequence

The insider threat leads to the theft of sensitive credentials, which are then used by the former employee to gain unauthorized access to the company's systems and data. This unauthorized access can lead to data breaches, financial losses, and damage to the company's reputation.

Impact

The insider threat results in financial losses due to potential data breaches and the need to mitigate the impact of the stolen credentials. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences.

Preventive Measures

To prevent insider threats involving secrets management, the technology company could have implemented the following preventive measures:
- Implementing strict access controls and least privilege principles to limit the access of employees to sensitive credentials based on their job responsibilities.
- Conducting regular audits and monitoring of access to the secrets management tool to detect and prevent unauthorized access.
- Providing regular training and awareness programs for employees on the importance of data security and the risks of insider threats.
- Implementing behavioral analytics and anomaly detection mechanisms to identify and respond to suspicious behavior or activities involving sensitive credentials.

Best Practices for Secrets Management

Here are some best practices for secrets management in DevSecOps pipelines:
- Use a dedicated secrets management tool: Utilize a specialized tool or service designed for securely storing and managing secrets.
- Encrypt secrets at rest and in transit: Ensure that secrets are encrypted both when stored and when transmitted over the network.
- Use strong access controls: Implement strict access controls to limit who can access secrets and what they can do with them.
- Regularly rotate secrets: Regularly rotate secrets (e.g., passwords, API keys) to minimize the impact of potential compromise.
- Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or a secrets management tool instead.
- Use environment-specific secrets: Use different secrets for different environments (e.g., development, staging, production) to minimize the impact of a compromised secret.
- Monitor and audit access: Monitor and audit access to secrets to detect and respond to unauthorized access attempts.
- Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and the risk of exposure.
- Regularly review and update policies: Regularly review and update your secrets management policies and procedures to ensure they are up-to-date and effective.
- Educate and train employees: Educate and train employees on the importance of secrets management and best practices for handling secrets securely.

Use Cases of Secrets Management for Different Tools

Here are the common use cases for different secrets management tools:

IBM Cloud Secrets Manager
- Securely storing and managing API keys
- Managing database credentials
- Storing encryption keys
- Managing certificates
- Integrating with CI/CD pipelines
- Compliance and audit requirements, by providing centralized management and auditing of secrets usage
- Ability to dynamically generate and rotate secrets

HashiCorp Vault
- Centralized secrets management for distributed systems
- Dynamic secrets generation and management
- Encryption and access controls for secrets
- Secrets rotation for various types of secrets

AWS Secrets Manager
- Securely store and manage AWS credentials
- Securely store and manage other types of secrets used in AWS services
- Integration with AWS services for seamless access to secrets
- Automatic secrets rotation for supported AWS services

Azure Key Vault
- Centralized secrets management for Azure applications
- Securely store and manage secrets, keys, and certificates
- Encryption and access policies for secrets
- Automated secrets rotation for keys, secrets, and certificates

CyberArk Conjur
- Secrets management and privileged access management
- Secrets retrieval via REST API for integration with CI/CD pipelines
- Secrets versioning and access controls
- Automated secrets rotation using rotation policies and scheduled tasks

Google Cloud Secret Manager
- Centralized secrets management for Google Cloud applications
- Securely store and manage secrets, API keys, and certificates
- Encryption at rest and in transit for secrets
- Automated and manual secrets rotation with integration with Google Cloud Functions

These tools cater to different cloud environments and offer various features for securely managing and rotating secrets based on specific requirements and use cases.
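As an illustration of how such a tool is consumed from application or pipeline code, here is a minimal sketch using the boto3 SDK for AWS Secrets Manager (the secret name is illustrative, and the snippet assumes AWS credentials and a default region are already configured in the environment):

import boto3

# Fetch a secret from AWS Secrets Manager at runtime.
# "prod/db/password" is an illustrative secret name.
client = boto3.client("secretsmanager")
response = client.get_secret_value(SecretId="prod/db/password")
db_password = response["SecretString"]

The other tools listed above expose equivalent retrieval APIs, so the same pattern applies with different client libraries.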
Implement Secrets Management in DevSecOps Pipelines

Understanding CI/CD in DevSecOps

CI/CD in DevSecOps involves automating the build, test, and deployment processes while integrating security practices throughout the pipeline to deliver secure, high-quality software rapidly.

Continuous Integration (CI)

CI is the practice of automatically building and testing code changes whenever a developer commits code to the version control system (e.g., Git). The goal is to quickly detect and fix integration errors.

Continuous Delivery (CD)

Continuous delivery extends CI by automating the process of deploying code changes to testing, staging, and production environments. With continuous delivery, every code change that passes the automated tests can potentially be deployed to production.

Continuous Deployment (CD)

Continuous deployment goes one step further than continuous delivery by automatically deploying every code change that passes the automated tests to production. This requires a high level of automation and confidence in the automated tests.

Continuous Compliance (CC)

CC refers to the practice of integrating compliance checks and controls into the automated CI/CD pipeline. It ensures that software deployments comply with relevant regulations, standards, and internal policies throughout the development lifecycle.

DevSecOps

DevSecOps integrates security practices into the CI/CD pipeline, ensuring that security is built into the software development process from the beginning. This includes performing security testing (e.g., static code analysis, dynamic application security testing) as part of the pipeline and managing secrets securely.

Implement Secrets Management Into DevSecOps Pipelines

Implementing secrets management into DevSecOps pipelines involves securely handling and storing sensitive information such as API keys, passwords, and certificates. Here's a step-by-step guide to implementing secrets management in DevSecOps pipelines:

Select a Secrets Management Solution

Choose a secrets management tool that aligns with your organization's security requirements and integrates well with your existing DevSecOps tools and workflows.

Identify Secrets

Identify the secrets that need to be managed, such as database credentials, API keys, encryption keys, and certificates.

Store Secrets Securely

Use the selected secrets management tool to securely store secrets. Ensure that secrets are encrypted at rest and in transit and that access controls are in place to restrict who can access them.

Integrate Secrets Management Into CI/CD Pipelines

Update your CI/CD pipeline scripts and configurations to integrate with the secrets management tool. Use the tool's APIs or SDKs to retrieve secrets securely during pipeline execution, as shown in the sketch after these steps.

Implement Access Controls

Implement strict access controls to ensure that only authorized users and systems can access secrets. Use role-based access control (RBAC) to manage permissions.

Rotate Secrets Regularly

Regularly rotate secrets to minimize the impact of potential compromise. Automate the rotation process as much as possible to ensure consistency and security.

Monitor and Audit Access

Monitor and audit access to secrets to detect and respond to unauthorized access attempts. Use logging and monitoring tools to track access and usage.
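To make the integration step concrete, here is a minimal sketch using the hvac client for HashiCorp Vault (the Vault address, token source, and secret path are all assumptions for illustration):

import os

import hvac

# Retrieve a secret from HashiCorp Vault during a pipeline step.
# The URL and path are illustrative; the token should come from the CI
# runner's environment or a short-lived identity, never from source code.
client = hvac.Client(url="https://vault.example.com:8200",
                     token=os.environ["VAULT_TOKEN"])
secret = client.secrets.kv.v2.read_secret_version(path="ci/deploy-key")
deploy_key = secret["data"]["data"]["key"]

The same shape applies to the other tools: authenticate with an identity the pipeline already has, fetch the secret by name, and keep it out of logs and build artifacts.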
Best Practices for Secrets Management Into DevSecOps Pipelines

Implementing secrets management in DevSecOps pipelines requires careful consideration to ensure both security and efficiency. Here are some best practices:
- Use a secrets management tool: Utilize a dedicated tool to store and manage secrets securely.
- Encrypt secrets: Encrypt secrets both at rest and in transit to protect them from unauthorized access.
- Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or secrets management tools to inject secrets into your CI/CD pipelines.
- Rotate secrets: Implement a secrets rotation policy to regularly rotate secrets, such as passwords and API keys. Automate the rotation process wherever possible to reduce the risk of human error.
- Implement access controls: Use role-based access controls (RBAC) to restrict access to secrets based on the principle of least privilege.
- Monitor and audit access: Enable logging and monitoring to track access to secrets and detect any unauthorized access attempts.
- Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and improve security.
- Use secrets injection: Use tools or platforms that support secrets injection (e.g., Kubernetes secrets, Docker secrets) to securely inject secrets into your application during deployment, as sketched below.
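With secrets injection, the platform typically mounts the secret as a file or environment variable that the application reads at startup. A minimal sketch (the secret name is illustrative; /run/secrets is the conventional Docker secrets mount point):

from pathlib import Path

# Read a secret injected as a file by the container platform.
# Docker secrets are conventionally mounted under /run/secrets/<name>;
# Kubernetes secrets can be mounted at a path you choose in the pod spec.
db_password = Path("/run/secrets/db_password").read_text().strip()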
Conclusion

Secrets management is a critical aspect of DevSecOps that cannot be overlooked. By implementing best practices such as using dedicated secrets management tools, encrypting secrets, and implementing access controls, organizations can significantly enhance the security of their software development and deployment pipelines. Effective secrets management not only protects sensitive information but also helps in maintaining compliance with regulatory requirements. As DevSecOps continues to evolve, it is essential for organizations to prioritize secrets management as a fundamental part of their security strategy.

Logging is essential for any software system. Using logs, you can troubleshoot a wide range of issues: debugging an application bug, investigating a security defect, diagnosing system slowness, and more. In this article, we will discuss how to use Python logging effectively with custom attributes.

Python Logging

Before we delve in, I briefly want to explain the basic Python logging module with an example.

#!/opt/bb/bin/python3.7
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)

logging.info("I love Dzone!")

The above example prints the following when executed:

INFO - 2024-03-09 19:49:07,734 I love Dzone!

In the example above, we create the root logger and define the logging format for log messages. The call to logging.getLogger() returns the logger if it has already been created; if not, it goes one level up the hierarchy and returns the parent logger. We define our own StreamHandler to print log messages to the console. Whenever we log messages, it is essential to log the basic attributes of the LogRecord: the Formatter here includes the level name, the time as a string, and the actual message itself. The handler thus created is set on the root logger. We could use any of the pre-defined LogRecord attribute names in the format. However, let's say you want to print an additional attribute such as a contextId; a custom logging adapter comes to the rescue.

Logging Adapter

class MyLoggingAdapter(logging.LoggerAdapter):

    def __init__(self, logger):
        logging.LoggerAdapter.__init__(self, logger=logger, extra={})

    # Pass the message and keyword arguments through unchanged.
    def process(self, msg, kwargs):
        return msg, kwargs

We create our own version of LoggerAdapter and pass the "extra" parameters as a dictionary for the formatter.

ContextId Filter

import contextvars
import uuid

class ContextIdFilter(logging.Filter):

    context_id = contextvars.ContextVar('context_id', default='')

    def filter(self, record):
        # Add a new UUID to the context if one is not already set.
        req_id = str(uuid.uuid4())
        if not self.context_id.get():
            self.context_id.set(req_id)
        record.context_id = self.context_id.get()
        return True

We create our own filter extending logging.Filter, whose filter() method returns True if the specified log record should be logged. We simply add our attribute to the log record and always return True, thus stamping our unique id onto every record. In the example above, a unique id is generated for every new context; for an existing context, we return the already stored contextId from the ContextVar.

Custom Logger

import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s ContextId:%(context_id)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)
root.addFilter(ContextIdFilter())

adapter = MyLoggingAdapter(root)
adapter.info("I love Dzone!")
adapter.info("this is my custom logger")
adapter.info("Exiting the application")

Now let's put it together in our logger file: we add the ContextIdFilter to the root logger. Please note that we use our own adapter in place of the logging module wherever we need to log a message.
Running the code above prints the following messages:

INFO - 2024-04-20 23:54:59,839 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 I love Dzone!
INFO - 2024-04-20 23:54:59,842 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 this is my custom logger
INFO - 2024-04-20 23:54:59,843 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 Exiting the application

By setting root.propagate = False, you prevent events logged to this logger from being passed on to the handlers of higher-level (ancestor) loggers.

Conclusion

Python does not provide a built-in option for adding custom parameters to log records. Instead, we created a wrapper around the Python root logger that prints our custom parameters. This is helpful when debugging request-specific issues.
The monolithic architecture was historically used by developers for a long time, and for a long time, it worked. Unfortunately, these architectures consist of fewer, larger parts, which means they are more likely to fail in their entirety if a single part fails. Often, these applications ran as a single process, which only exacerbated the issue. Microservices solve these specific issues by having each microservice run as a separate process. If one cog goes down, it doesn't necessarily mean the whole machine stops running. Plus, diagnosing and fixing defects in smaller, highly cohesive services is often easier than in larger monolithic ones.

Microservices design patterns provide tried-and-true fundamental building blocks that can help you write code for microservices. By utilizing patterns during the development process, you save time and ensure a higher level of accuracy versus writing code for your microservices app from scratch. In this article, we cover a comprehensive overview of the microservices design patterns you need to know, as well as when to apply them.

Key Benefits of Using Microservices Design Patterns

Microservices design patterns offer several key benefits, including:
- Scalability: Microservices allow applications to be broken down into smaller, independent services, each responsible for a specific function or feature. This modular architecture enables individual services to be scaled independently based on demand, improving overall system scalability and resource utilization.
- Flexibility and agility: Microservices promote flexibility and agility by decoupling different parts of the application. Each service can be developed, deployed, and updated independently, allowing teams to work autonomously and release new features more frequently. This flexibility enables faster time to market and easier adaptation to changing business requirements.
- Resilience and fault isolation: Microservices improve system resilience and fault isolation by isolating failures to specific services. If one service experiences an issue or failure, it does not necessarily impact the entire application. This isolation minimizes downtime and improves system reliability, ensuring that the application remains available and responsive.
- Technology diversity: Microservices enable technology diversity by allowing each service to be built using the most suitable technology stack for its specific requirements. This flexibility enables teams to choose the right tools and technologies for each service, optimizing performance, development speed, and maintenance.
- Improved development and deployment processes: Microservices streamline development and deployment processes by breaking down complex applications into smaller, manageable components. This modular architecture simplifies testing, debugging, and maintenance tasks, making it easier for development teams to collaborate and iterate on software updates.
- Scalability and cost efficiency: Microservices enable organizations to scale their applications more efficiently by allocating resources only to the services that require them. This granular approach to resource allocation helps optimize costs and ensures that resources are used effectively, especially in cloud environments where resources are billed based on usage.
- Enhanced fault tolerance: Microservices architecture allows for better fault tolerance, as services can be designed to gracefully degrade or fail independently without impacting the overall system. This ensures that critical functionalities remain available even in the event of failures or disruptions.
- Easier maintenance and updates: Microservices simplify maintenance and updates by allowing changes to be made to individual services without affecting the entire application. This reduces the risk of unintended side effects and makes it easier to roll back changes if necessary, improving overall system stability and reliability.

Let's go ahead and look at the different microservices design patterns.

Database per Service Pattern

The database is one of the most important components of microservices architecture, but it isn't uncommon for developers to overlook the database per service pattern when building their services. Database organization will affect the efficiency and complexity of the application. The most common options a developer can choose from when determining the organizational architecture of an application are:

Dedicated Database for Each Service

A database dedicated to one service can't be accessed by other services. This is one of the reasons it is much easier to scale and to understand from a whole end-to-end business aspect. Picture a scenario where your databases have different needs or access requirements: the data owned by one service may be largely relational, while a second service might be better served by a NoSQL solution and a third service may require a vector database. In this scenario, using a dedicated database for each service could help you manage them more easily. This structure also reduces coupling, as one service can't tie itself to the tables of another. Services are forced to communicate via published interfaces. The downside is that dedicated databases require a failure protection mechanism for events where communication fails.

Single Database Shared by All Services

A single shared database isn't the standard for microservices architecture but bears mentioning as an alternative nonetheless. Here, the issue is that microservices using a single shared database lose many of the key benefits developers rely on, including scalability, robustness, and independence. Still, sharing a physical database may be appropriate in some situations. When a single database is shared by all services, though, it's very important to enforce logical boundaries within it. For example, each service should own its schema, and read/write access should be restricted to ensure that services can't poke around where they don't belong.

Saga Pattern

A saga is a series of local transactions. In microservices applications, the saga pattern can help maintain data consistency during distributed transactions. The saga pattern is an alternative to other design patterns that allow for multiple transactions by providing rollback opportunities. A common scenario is an e-commerce application that allows customers to purchase products using credit. Data may be stored in two different databases: one for orders and one for customers. The purchase amount can't exceed the credit limit. To implement the saga pattern, developers can choose between two common approaches.

1. Choreography

Using the choreography approach, a service performs a transaction and then publishes an event. In some instances, other services respond to those published events and perform tasks according to their coded instructions. These secondary tasks may or may not also publish events, according to presets. In the example above, you could use a choreography approach so that each local e-commerce transaction publishes an event that triggers a local transaction in the credit service.
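A minimal, in-memory sketch of that choreography (the event names, handlers, and credit limit are all illustrative; real services would communicate through a message broker rather than an in-process registry):

from collections import defaultdict

# Tiny in-process event bus standing in for a message broker.
handlers = defaultdict(list)

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event_type, payload):
    for handler in handlers[event_type]:
        handler(payload)

CREDIT_LIMIT = 1000
reserved = 0

def on_order_created(order):
    # Local transaction in the credit service, triggered by the order event.
    global reserved
    if reserved + order["amount"] <= CREDIT_LIMIT:
        reserved += order["amount"]
        publish("CreditReserved", order)
    else:
        # Compensating path: the order service would cancel the order.
        publish("CreditLimitExceeded", order)

subscribe("OrderCreated", on_order_created)
publish("OrderCreated", {"order_id": 1, "amount": 250})

Note that CreditReserved and CreditLimitExceeded have no subscribers in this sketch; in a real saga, the order service would subscribe to them to complete or compensate the transaction.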
2. Orchestration

An orchestration approach performs transactions and publishes events using an object that orchestrates the events, triggering other services to respond by completing their tasks. The orchestrator tells the participants what local transactions to execute.

Saga is a complex design pattern that requires a high level of skill to implement successfully. However, the benefit of proper implementation is maintained data consistency across multiple services without tight coupling.

API Gateway Pattern

For large applications with multiple clients, implementing an API gateway pattern is a compelling option. One of the largest benefits is that it insulates the client from needing to know how services have been partitioned. However, different teams will value the API gateway pattern for different reasons. One of these possible reasons is that it grants a single entry point for a group of microservices by working as a reverse proxy between client apps and the services. Another is that clients don't need to know how services are partitioned, and service boundaries can evolve independently since the client knows nothing about them. The client also doesn't need to know how to find or communicate with a multitude of ever-changing services. You can also create a gateway for specific types of clients (for example, backends for frontends), which improves ergonomics and reduces the number of round trips needed to fetch data. Plus, an API gateway pattern can take care of crucial tasks like authentication, SSL termination, and caching, which makes your app more secure and user-friendly.

Before moving on to the next pattern, there's one more benefit to cover: security. The primary way the pattern improves security is by reducing the attack surface area. By providing a single entry point, the API endpoints aren't directly exposed to clients, and authorization and SSL can be efficiently implemented. Developers can also use this design pattern to decouple internal microservices from client apps so that a partially failed request can still be served. This ensures a whole request won't fail because a single microservice is unresponsive; to do this, the API gateway uses its cache to provide an empty response or return a valid error code.

Circuit Breaker Design Pattern

This pattern is usually applied between services that communicate synchronously. A developer might decide to utilize the circuit breaker when a service is exhibiting high latency or is completely unresponsive. The utility here is that failure across multiple systems is prevented when a single microservice is unresponsive: calls won't pile up and consume system resources, which could cause significant delays within the app or even a string of service failures. Implementing this pattern requires an object to be called that monitors failure conditions. When a failure condition is detected, the circuit breaker trips. Once it has tripped, all calls to the circuit breaker result in an error and are directed to a different service; alternatively, calls can result in a default error message being retrieved.
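A minimal sketch of that tripping behavior (the threshold, recovery window, and fallback are illustrative choices):

import time

class CircuitBreaker:
    # Trip after `threshold` consecutive failures; retry after `recovery_seconds`.
    def __init__(self, threshold=3, recovery_seconds=30):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            # While open, short-circuit until the recovery window elapses.
            if time.time() - self.opened_at < self.recovery_seconds:
                return fallback
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func()
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()  # trip the breaker open
            return fallback

This sketch passes through the three states described next: closed (calls flow normally), open (calls short-circuit to the fallback), and half-open (a trial call probes whether the downstream service has recovered).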
There are three states of the circuit breaker that developers should be aware of:
- Open: A circuit breaker is open when the number of failures has exceeded the threshold. In this state, the microservice returns errors for calls without executing the desired function.
- Closed: When a circuit breaker is closed, it's in the default state and all calls are responded to normally. This is the ideal state developers want a circuit breaker to remain in, in a perfect world, of course.
- Half-open: When a circuit breaker is checking for underlying problems, it remains in a half-open state. Some calls may be responded to normally, but some may not be; it depends on why the circuit breaker switched to this state initially.

Command Query Responsibility Segregation (CQRS)

A developer might use the command query responsibility segregation (CQRS) design pattern as a solution to traditional database issues like data contention risk. CQRS is also useful in situations where app performance and security requirements are complex and objects are exposed to both reading and writing transactions. With CQRS, each operation either changes the state of an entity (a command) or returns a result (a query). Multiple views can be provided for query purposes, and the read side of the system can be optimized separately from the write side. This separation reduces the complexity of the app by splitting query models and command models so that:
- The write side of the model handles persistence events and acts as a data source for the read side
- The read side of the model generates projections of the data, which are highly denormalized views

Asynchronous Messaging

If a service doesn't need to wait for a response and can continue running its code after a failure, asynchronous messaging can be used. Using this design pattern, microservices can communicate in a way that's fast and responsive; sometimes this pattern is referred to as event-driven communication. To achieve the fastest, most responsive app, developers can use a message queue to maximize efficiency while minimizing response delays. This pattern can help connect multiple microservices without creating dependencies or tightly coupling them. While there are tradeoffs one makes with async communication (such as eventual consistency), it's still a flexible, scalable approach to designing a microservices architecture.

Event Sourcing

The event sourcing design pattern is used in microservices when a developer wants to capture all changes in an entity's state. Using an event store like Kafka or an alternative helps keep track of event changes and can even function as a message broker. A message broker helps with communication between different microservices, monitoring messages and ensuring communication is reliable and stable. To facilitate this, the event sourcing pattern stores a series of state-changing events and can reconstruct the current state by replaying the occurrences of an entity. Using event sourcing is a viable option in microservices when transactions are critical to the application. It also works well when changes to the existing data layer codebase need to be avoided.
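A minimal sketch of reconstructing current state by replaying stored events (the event shapes are illustrative):

# The current balance is never stored directly; it is rebuilt by
# replaying the append-only event log for the entity.
events = [
    {"type": "AccountOpened", "amount": 0},
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
]

def replay(events):
    balance = 0
    for event in events:
        if event["type"] == "Deposited":
            balance += event["amount"]
        elif event["type"] == "Withdrawn":
            balance -= event["amount"]
    return balance

print(replay(events))  # 70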
Strangler-Fig Pattern

Developers mostly use the strangler design pattern to incrementally transform a monolith application into microservices. This is accomplished by replacing old functionality with a new service, which is how the pattern receives its name. Once the new service is ready to be executed, the old service is "strangled" so the new one can take over. To accomplish this transfer from monolith to microservices, developers use a facade interface that allows them to expose individual services and functions. The targeted functions are broken free from the monolith so they can be "strangled" and replaced.

Utilizing Design Patterns To Make Organization More Manageable

Setting up the proper architecture and process tooling will help you create a successful microservice workflow. Use the design patterns described above, and learn more about microservices in my blog, to create a robust, functional app.