Wednesday, December 9, 2015
The OWASP community is working on a new set of secure developer guidelines, called the "OWASP Proactive Controls". The latest draft of these guidelines has been posted in "world edit" mode so that anyone can make direct comments or edits to the document, even anonymously.
You can help make software development safer and more secure by reviewing and contributing to the guidelines at this link:
Thanks for your help!
Thursday, August 20, 2015
Some Rules of Failure in Complex Systems (from Richard Cook’s “How Complex Systems Fail”)
4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.
3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.
14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
The net of this: Complex systems are essentially and unavoidably fragile. We can try, but we can’t stop them from failing – there are too many moving pieces, too many variables and too many combinations to understand and to test. And even the smallest change or mistake can trigger a catastrophic failure.
A New Hope
But new research at the University of Toronto on catastrophic failures in complex distributed systems offers some hope – a potentially simple way to reduce the risk and impact of these failures.
The researchers looked at distributed online systems that had been extensively reviewed and tested, but still failed in spectacular ways.
They found that most catastrophic failures were initially triggered by minor, non-fatal errors: mistakes in configuration, small bugs, hardware failures that should have been tolerated. Then, following rule #3 above, a specific and unusual sequence of events had to occur for the catastrophe to unfold.
The bad news is that this sequence of events can’t be predicted – or tested for – in advance.
The good news is that catastrophic failures in complex, distributed systems may actually be easier to fix than anyone previously thought. Looking closer, the researchers found that almost all (92%) catastrophic failures are the result of incorrect handling of non-fatal errors. These mistakes in error handling caused the system to behave unpredictably, causing other errors, which weren’t always handled correctly or predictably, creating a domino effect.
More than half (58%) of catastrophic failures could be prevented by careful review and testing of error handling code. In 35% of the cases, the faults in error handling code were trivial: the error handler was empty or only logged a failure, or the logic was clearly incomplete. Easy mistakes to find and fix. So easy that the researchers built a freely available static analysis checker for Java byte code, Aspirator, to catch many of these problems.
In another 23% of the cases, the error handling logic of a non-fatal error was so wrong that basic statement coverage testing or careful code reviews would have caught the mistakes.
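These anti-patterns are easy to sketch. Here is a minimal Python illustration (Aspirator itself checks Java bytecode, but the mistake is language-agnostic; the function names here are hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Anti-pattern from the study: the handler swallows a non-fatal error,
# so the program limps on with bad state and fails later, far from the cause.
def load_config_bad(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        pass  # TODO: handle this - "it can never happen"
    return None

# Safer: handle the cases you understand, fail fast on the ones you don't.
def load_config(path, default=None):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        if default is not None:
            log.warning("config %s missing, using default", path)
            return default
        raise  # no sensible default: propagate instead of guessing

print(load_config("/nonexistent.cfg", default="timeout=30"))  # → timeout=30
```

The empty `except` block in the first version is exactly the kind of trivial mistake that a static checker, a code review, or basic statement coverage testing would catch.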
The next challenge that the researchers encountered was convincing developers to take these mistakes seriously. They had to walk developers through understanding why small bugs in error handling – bugs that “would never realistically happen” – needed to be fixed, and why careful error handling is so important.
This is a challenge that we all need to take up – if we hope to prevent catastrophic failure in complex distributed systems.
Tuesday, July 7, 2015
There’s a lot of bad software out there. Unreliable, insecure, unsafe and unusable. It’s become so bad that some people are demanding regulation of software development and licensing software developers as “software engineers” so that they can be held to professional standards, and potentially sued for negligence or malpractice.
Licensing would ensure that everyone who develops software has at least a basic level of knowledge and an acceptable level of competence. But licensing developers won’t ensure good software. Even well-trained, experienced and committed developers can’t always build good software. Because most of the decisions that drive software quality aren’t made by developers – they’re made by somebody else in the organization.
Product managers and Product Owners. Project managers and program managers. Executive sponsors. CIOs and CTOs and VPs of Engineering. The people who decide what’s important to the organization, what gets done and what doesn’t, and who does it – what problems the best people work on, what work gets shipped offshore or outsourced to save costs. The people who do the hiring and firing, who decide how much money is spent on training and tools. The people who decide how people are organized and what processes they follow. And how much time they get to do their work.
Managers – not developers – decide what quality means for the organization. What is good, and what is “good enough”.
As a manager, I’ve made a lot of mistakes and bad decisions over my career. Short-changing quality to cut costs. Signing teams up for deadlines that couldn’t be met. Giving marketing control over schedules and priorities, trying to squeeze in more features to make the customer or a marketing executive happy. Overriding developers and testers who told me that the software wouldn’t be ready, that they didn’t have enough time to do things properly. Letting technical debt add up. Insisting that we had to deliver now or never, and that somehow we would make it all right later.
I’ve learned from these mistakes. I think I know what it takes to build good software now. And I try to hold to it. But I keep seeing other managers make the same mistakes. Even at the world’s biggest and most successful technology companies, at organizations like Microsoft and Apple.
These are organizations that control their own destinies. They get to decide what they will build and when they need to deliver it. They have some of the best engineering talent in the world. They have all the good tools that money can buy – and if they need better tools, they just write their own. They’ve been around long enough to know how to do things right, and they have the money and scale to accomplish it.
They should write beautiful software. Software that is a joy to use, and that the rest of us can follow as examples. But they don’t even come close. And it’s not the fault of the engineers.
Problems with software quality at Microsoft are so long-running that “Microsoft Quality” has become a recognized term for software that is just barely “good enough” to be marginally accepted – and sometimes not even that good.
Even after Microsoft became a dominant, global enterprise vendor, quality has continued to be a problem. A 2014 Computerworld article “At Microsoft, quality seems to be job none” complains about serious quality and reliability problems in early versions of Windows 10. But Windows 10 is supposed to represent a sea change for Microsoft under their new CEO, a chance to make up for past mistakes, to do things right. So what's going wrong?
The culture and legacy of “good enough” software has been in place for so long that Microsoft seems to be trapped, unable to improve even when they have recognized that good enough isn’t good enough anymore. This is a deep-seated organizational and cultural problem. A management problem. Not an engineering problem.
Apple’s Software Quality Problems
Apple sets itself apart from Microsoft and the rest of the technology field, and charges a premium based on its reputation for design and engineering excellence. But when it comes to software, Apple is no better than anyone else.
From the epic public face plant of Apple Maps, to constant problems in iTunes and the App Store, problems with iOS updates that fail to install, data lost somewhere in the iCloud, serious security vulnerabilities, error messages that make no sense, and baffling inconsistencies and restrictions on usability, Apple’s software too often disappoints in fundamental and embarrassing ways.
And like Microsoft, Apple management seems to have lost their way:
I fear that Apple’s leadership doesn’t realize quite how badly and deeply their software flaws have damaged their reputation, because if they realized it, they’d make serious changes that don’t appear to be happening. Instead, the opposite appears to be happening: the pace of rapid updates on multiple product lines seems to be expanding and accelerating.
I suspect the rapid decline of Apple’s software is a sign that marketing is too high a priority at Apple today: having major new releases every year is clearly impossible for the engineering teams to keep up with while maintaining quality. Maybe it’s an engineering problem, but I suspect not — I doubt that any cohesive engineering team could keep up with these demands and maintain significantly higher quality.
Marco Arment, Apple has lost the functional high ground, 2015-01-04
Recent announcements at this year’s WWDC indicate that Apple is taking some extra time to make sure that their software works. More finish, less flash. We’ll have to wait and see whether this is a temporary pause or a sign that management is starting to understand (or remember) how important quality and reliability actually are.
Managers: Stop Making the Same Mistakes
If companies like Microsoft and Apple, with all of their talent and money, can’t build quality software, how are the rest of us supposed to do it? Simple. By not making the same mistakes:
Putting speed-to-market and cost in front of everything else. Pushing people too hard to hit “drop dead dates”. Taking “sprints” literally: going as fast as possible, not giving the team time to do things right or a chance to pause and reflect and improve.
We all have to work within deadlines and budgets, but in most business situations there’s room to make intelligent decisions. Agile methods and incremental delivery provide a way out when you can’t negotiate deadlines or cost, and don’t understand or can’t control the scope. If you can’t say no, you can say “not yet”. Prioritize work ruthlessly and make sure that you deliver the important things as early as you can. And because these things are important, make sure that you do them right.
Leaving testing to the end. Which means leaving bug fixing to after the end. Which means delivering late and with too many bugs.
Disciplined Agile practices all depend on testing – and fixing – as you code. TDD even forces you to write the tests before the code. Continuous Integration makes sure that the code works every time someone checks in. Which means that there is no reason to let bugs build up.
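The rhythm can be sketched in a few lines of Python (a hypothetical example, not from any particular project):

```python
# A sketch of the TDD rhythm: the test below is written first (and
# fails), then just enough code is written to make it pass, and
# Continuous Integration re-runs it on every check-in.

def normalize_email(raw):
    # Written to satisfy test_normalize_email, and no more.
    return raw.strip().lower()

def test_normalize_email():
    # This test existed before normalize_email() did.
    assert normalize_email("  Bob@Example.COM ") == "bob@example.com"
    assert normalize_email("alice@example.com") == "alice@example.com"

test_normalize_email()
print("tests pass")
```

Because the test is checked in with the code, every later change runs against it automatically – which is what keeps bugs from building up.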
Not talking to customers, not testing ideas out early. Not learning why they really need the software, how they actually use it, what they love about it, what they hate about it.
Deliver incrementally and get feedback. Act on this feedback, and improve the software. Rinse and repeat.
Ignoring fundamental good engineering practices. Pretending that your team doesn’t need to do these things, or can’t afford to do them, or doesn’t have time to do them, even though we’ve known for years that doing things right will help to deliver better software faster.
As a Program Manager or Product Owner or a Business Owner you don’t need to be an expert in software engineering. But you can’t make intelligent trade-off decisions without understanding the fundamentals of how the software is built, and how software should be built. There’s a lot of good information out there on how to do software development right. There’s no excuse for not learning it.
Ignoring warning signs.
Listen to developers when they tell you that something can’t be done, or shouldn’t be done, or has to be done. Developers are generally too willing to sign on for too much, to reach too far. So when they tell you that they can’t do something, or shouldn’t do something, pay attention.
And when you make mistakes - which you will - learn from them, don’t waste them. When something goes wrong, get the team to review it in a retrospective or run a blameless post mortem to figure out what happened and why, and how you can get better. Learn from audits and pen tests. Take negative feedback from customers seriously. This is important, valuable information. Treat it accordingly.
As a manager, the most important thing you can do is to not set your team up for failure. That’s not asking for too much.
Wednesday, June 24, 2015
OWASP Top 10
The OWASP Top 10 is a community-built list of the 10 most common and most dangerous security problems in online (especially web) applications. Injection flaws, broken authentication and session management, XSS and other nasty security bugs.
These are problems that you need to be aware of and look for, and that you need to prevent in your design and coding. The Top 10 explains how to test for each kind of problem to see if your app is vulnerable (including common attack scenarios), and basic steps you can take to prevent each problem.
If you’re working on mobile apps, take time to understand the OWASP Top 10 Mobile list.
IEEE Top Design Flaws
The OWASP Top 10 is written more for security testers and auditors than for developers. It’s commonly used to classify vulnerabilities found in security testing and audits, and is referenced in regulations like PCI-DSS.
The IEEE Center for Secure Design, a group of application security experts from industry and academia, has taken a different approach. They have come up with a Top 10 list that focuses on identifying and preventing common security mistakes in architecture and design.
This list includes good design practices such as: earn or give, but never assume, trust; identify sensitive data and how it should be handled; understand how integrating external components changes your attack surface. The IEEE’s list should be incorporated into design patterns and used in design reviews to deal with security issues early.
OWASP Proactive Controls
IEEE’s approach is principle-based – a list of things that you need to think about in design, in the same way that you think about things like simplicity and encapsulation and modularity.
The OWASP Proactive Controls, originally created by security expert Jim Manico, is written at the developer level. It is a list of practical, concrete things that you can do as a developer to prevent security problems in coding and design. How to parameterize queries, and encode or validate data safely and correctly. How to properly store passwords and to implement a forgot password feature. How to implement access control – and how not to do it.
It points you to Cheat Sheets and other resources for more information, and explains how to leverage the security features of common languages and frameworks, and how and when to use popular, proven security libraries like Apache Shiro and the OWASP Java Encoder.
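As an illustration, the first of these controls – query parameterization – can be sketched with Python's built-in sqlite3 module (the table and data are made up for the example):

```python
import sqlite3

# The driver binds the value separately from the SQL text, so hostile
# input can never be interpreted as SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

name = "alice' OR '1'='1"  # classic injection attempt

# Vulnerable version (don't do this): string concatenation lets the
# input rewrite the query:
#   "SELECT * FROM users WHERE name = '" + name + "'"

# Parameterized version:
rows = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
print(rows)  # → [] - the injection attempt matches no rows
```

The same idea applies in any language and database driver; the Proactive Controls point to the appropriate placeholder syntax for each.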
Katy Anton and Jason Coleman have mapped all of these controls together (the OWASP Top 10, OWASP Proactive Controls and the IEEE Security Flaws), showing how the OWASP Proactive Controls implement safe design practices from the IEEE list and how they prevent or mitigate OWASP Top 10 risks.
You can use these maps to look for gaps in your application security practices, in your testing and coding, and in your knowledge, to identify areas where you can learn and improve.
Wednesday, June 17, 2015
DevOps can help reduce technical debt in some fundamental ways.
First, building a Continuous Delivery/Deployment pipeline, automating the work of migration and deployment, will force you to clean up inconsistencies and holes in configuration and code deployment, and inconsistencies between development, test and production environments.
And automated Continuous Delivery and Infrastructure as Code gets rid of dangerous one-of-a-kind snowflakes and configuration drift caused by making configuration changes and applying patches manually over time. Which makes systems easier to set up and manage, and reduces the risk of an un-patched system becoming the target of a security attack or the cause of an operational problem.
A CD pipeline also makes it easier, cheaper and faster to pay down other kinds of technical debt. With Continuous Delivery/Deployment, you can test and push out patches and refactoring changes and platform upgrades faster and with more confidence.
The Lean feedback cycle and Just-in-Time prioritization in DevOps ensure that you’re working on whatever is most important to the business. This means that bugs and usability issues and security vulnerabilities don’t have to wait until after the next feature release to get fixed. Instead, problems that impact operations or the users will get fixed immediately.
But there’s a negative side to DevOps that can add to technical debt costs.
Michael Feathers’ research has shown that constant, iterative change is erosive: the same code gets changed over and over, the same classes and methods become bloated (because it is naturally easier to add code to an existing method or a method to an existing class), structure breaks down and the design is eventually lost.
DevOps can make this even worse.
DevOps and Continuous Delivery/Deployment involves pushing out lots of small changes, running experiments and iteratively tuning features and the user experience based on continuous feedback from production use.
Many DevOps teams work directly on the code mainline, “branching in code” to “dark launch” code changes, while code is still being developed, using conditional logic and flags to skip over sections of code at run-time. This can make the code hard to understand, and potentially dangerous: if a feature toggle is turned on before the code is ready, bad things can happen.
Feature flags are also used to run A/B experiments and control risk on release, by rolling out a change incrementally to a few users to start. But the longer that feature flags are left in, the harder the code becomes to understand and change.
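A minimal sketch of the risk (all names are hypothetical):

```python
# Flags allow dark launches and incremental rollout, but every flag is
# another branch readers must reason about - hence the cleanup advice.
FLAGS = {"new_checkout": False}  # in practice, loaded from a config service

def checkout(cart_total):
    if FLAGS["new_checkout"]:
        return new_checkout(cart_total)  # dark-launched path
    return legacy_checkout(cart_total)

def legacy_checkout(cart_total):
    return round(cart_total * 1.13, 2)   # tax rate hard-coded

def new_checkout(cart_total):
    # Still under development: TAX_RATE doesn't exist yet, so flipping
    # the flag before this code is ready crashes with a NameError.
    return round(cart_total * (1 + TAX_RATE), 2)

print(checkout(100.0))  # → 113.0 (legacy path, flag is off)
```

If the toggle is turned on before `new_checkout` is finished, the "dark" code goes live and fails – exactly the "bad things can happen" case above.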
There is a lot of housekeeping that needs to be done in DevOps: upgrading the CD pipeline and making sure that all of the tests are working; maintaining Puppet or Chef (or whatever configuration management tool you are using) recipes; disciplined, day-to-day refactoring; keeping track of features and options and cleaning them up when they are no longer needed; getting rid of dead code and trying to keep the code as simple as possible.
Microservices and Technology Choices
Loosely-coupled Microservices are a natural fit for DevOps, because they are easier for individual teams to independently deploy, change, refactor or even replace.
And a Microservices-based approach provides developers with more freedom when deciding on language or technology stack: teams don’t necessarily have to work the same way, they can choose the right tool for the job, as long as they support an API contract for the rest of the system.
In the short term there are obvious advantages to giving teams more freedom in making technology choices. They can deliver code faster, quickly try out prototypes, and teams get a chance to experiment and learn about different technologies and languages.
But Microservices “are not a free lunch”. As you add more services, system testing costs and complexity increase. Debugging and problem solving gets harder. And as more teams choose different languages and frameworks, it’s harder to track vulnerabilities, harder to operate, and harder for people to switch between teams. Code gets duplicated because teams want to minimize coupling and it is difficult or impossible to share libraries in a polyglot environment. Data is often duplicated between services for the same reason, and data inconsistencies creep in over time.
There is a potentially negative side to the Lean delivery feedback cycle too.
Constantly responding to production feedback, always working on what’s most immediately important to the organization, doesn’t leave much space or time to consider bigger, longer-term technical issues, and to work on paying off deeper architectural and technical design debt that results from poor early decisions or incorrect assumptions.
Smaller, more immediate problems get fixed fast in DevOps. Bugs that matter to operations and the users can get fixed right away instead of waiting until all the features are done, and patches and upgrades to the run-time can be pushed out more often. Which means that you can pay off a lot of debt before costs start to compound.
But behind-the-scenes, strategic debt will continue to add up. Nothing’s broken, so you don’t have to fix anything right away. And you can’t refactor your way out of it either, at least not easily. So you end up living with a poor design or an aging technology platform, gradually slowing down your ability to respond to changes and to come up with new solutions. Or forcing you to continue filling in security holes as they come up, or scrambling to scale as load increases.
DevOps can reduce technical debt. But only if you work in a highly disciplined way. And only if you raise your head up from tactical optimization to deal with bigger, more strategic issues before they become real problems.
Friday, June 5, 2015
A new book by Len Bass, Ingo Weber and Liming Zhu “DevOps: A Software Architect’s Perspective”, part of the SEI Series in Software Engineering, looks at how DevOps affects architectural decisions, and a software architect’s role in DevOps.
The authors focus on the goals of DevOps: to get working software into production as quickly as possible while minimizing risk, balancing time-to-market against quality.
“DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while insuring high quality”
These fundamental practices are:
- Engaging operations as a customer and partner, “a first-class stakeholder”, in development. Understanding and satisfying requirements for deployment, logging, monitoring and security in development of an application.
- Engaging developers in incident handling. Developers taking responsibility for their code, making sure that it is working correctly, helping (often taking the role of first responders) to investigate and resolve production problems.
This includes the role of a “reliability engineer” on every development team, someone who is responsible for coordinating downstream changes with operations and for ensuring that changes are deployed successfully.
- Ensuring that all changes to code and configuration are done using automated, traceable and repeatable mechanisms – a deployment pipeline.
- Continuous Deployment of changes from check-in to production, to maximize the velocity of delivery, using these pipelines.
- Infrastructure as Code. Operations provisioning and configuration through software, following the same kinds of quality control practices (versioning, reviews, testing) as application software.
Cloud Architecture and Microservices
As a reference for architects, the book focuses on architectural considerations for DevOps. It walks through how Cloud-based systems work, virtualization concepts and especially microservices.
While DevOps does not necessarily require making major architectural changes, the authors argue that most organizations adopting DevOps will find that a microservices-based approach, as pioneered at organizations like Netflix and Amazon, by minimizing dependencies between different parts of the system and between different teams, will also minimize the time required to get changes into production – the first goal of DevOps.
Conway’s Law also comes into play here. DevOps work is usually done by small agile cross-functional teams solving end-to-end problems independently, which means that they will naturally end up building small, independent services:
“Having an architecture composed of small services is a response to having small teams.”
But there are downsides and costs to a microservice-based approach.
As Martin Fowler and James Lewis point out, microservices introduce many more points of failure. Which means that resilience has to be designed and built into each service. Services cannot trust their clients or the other services that they call out to. You need to add defensive checking on data and anticipate failures of other services, implement time-outs and retries, and fallback alternatives or safe default behaviors if another service is unavailable. You also need to design your service to minimize the impact of failure on other services, and to make it easier and faster to recover/restart.
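A minimal sketch of this defensive pattern (names are hypothetical; a real client would also cap total elapsed time and probably use a circuit breaker):

```python
import time

# Bounded retries with backoff, and a safe fallback when a downstream
# service stays unavailable - instead of letting the failure cascade.
def call_with_fallback(remote_call, fallback, retries=2, delay=0.01):
    for attempt in range(retries + 1):
        try:
            return remote_call()   # in practice: an HTTP call with a timeout
        except TimeoutError:
            if attempt < retries:
                time.sleep(delay)  # back off, then retry
    return fallback                # all retries failed: degrade gracefully

def fetch_price():
    # Stands in for a remote pricing service that is down.
    raise TimeoutError("pricing service unavailable")

print(call_with_fallback(fetch_price, fallback=9.99))  # → 9.99
```

Returning a safe default keeps one unavailable service from taking the whole request down – the caller stays up, just with degraded behavior.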
Microservices also increase the cost and complexity of end-to-end system testing. Run-time performance and latency degrade due to the overhead of remote calls. And monitoring and troubleshooting in production can be much more complicated, since a single action often involves many microservices working together (at LinkedIn, for example, a single user request may chain to as many as 70 services).
DevOps in Architecture: Monitoring
In DevOps, monitoring becomes a much more important factor in architecture and design, in order to meet operations requirements.
The chapter on monitoring explains what you need to monitor and why, DevOps metrics, challenges in monitoring systems under continuous change, monitoring microservices and monitoring in the Cloud, and common log management and monitoring tools for online systems.
Monitoring also becomes an important part of live testing in DevOps (Monitoring as Testing), and plays a key role in Continuous Deployment. The authors look at common kinds of live testing, including canaries, A/B testing, and Netflix’s famous Simian Army in terms of passive checking (Security Monkey, Compliance Monkey) and active live testing (Chaos Monkey and Latency Monkey).
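Canaries and A/B tests both depend on stable traffic splitting. A simplified sketch (an illustration, not any particular vendor's implementation) hashes a stable user id so that each user consistently sees the same version:

```python
import hashlib

# Deterministically route a small percentage of users to the new
# version, based on a stable hash of the user id.
def in_canary(user_id, percent=5):
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

users = ["user-%d" % i for i in range(1000)]
canary = sum(in_canary(u) for u in users)
print(canary, "of 1000 users routed to the canary")  # roughly 50
```

Monitoring closes the loop: if error rates climb for the canary population, the rollout percentage goes back to zero before most users are affected.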
DevOps in Architecture: Security
Security is another important cross-cutting concern in software architecture addressed in this book. It looks at security fundamentals including how to identify threats (using Microsoft’s STRIDE model) and the resources that need to be protected, CIA (confidentiality, integrity and availability), identity management, and access controls. It provides an overview of the security controls in NIST 800-53, and common security issues with VMs and in Cloud architectures (specifically AWS).
In DevOps, security needs to be wired into Continuous Deployment:
- Enforcing that all changes to code and configuration are done through the Continuous Deployment pipeline
- Security testing should be included in different stages of the Continuous Deployment pipeline
- Securing the pipeline itself, including the logs and the artifacts
Continuous Deployment Pipeline and Gatekeepers
Developers – and architects – have to take responsibility for building their automated testing and deployment pipelines. The book explains how Continuous Deployment leverages Continuous Integration, and common approaches to code management and test automation. And it emphasizes the role of gatekeepers along the pipeline – manual decisions or automated checks at different points to determine if it is ok to go forward, from development to testing to staging to live production testing and then to production.
DevOps and Modern Software Architecture
“DevOps: A Software Architect’s Perspective” does a good job of explaining common DevOps practices, especially Continuous Deployment, in a development, instead of operations, context. It also looks at contemporary issues in software architecture, including virtualization and microservices.
It is less academic than Bass’s other book “Software Architecture in Practice”, and emphasizes the importance of real-world operations concerns like reliability, security and transparency (monitoring and live checks and testing) in architecture and deployment.
This is a book written mostly for enterprise software architects and managers who want to understand more about DevOps and Continuous Deployment and Cloud services.
If you’re already deep into DevOps and working with microservices in the Cloud, you probably won’t find much new here.
But if you are looking at how to apply DevOps at scale, or how to migrate legacy enterprise systems to microservices and the Cloud, or if you are a developer who wants to understand operations and what DevOps will mean to you and your job, this is worth reading.
Friday, May 22, 2015
This year's SANS Institute State of Application Security Survey, which I worked on with Eric Johnson and Frank Kim, looks at the gaps between Builders (the people who design and develop software) and Defenders (application security and information security professionals and operations).
We found that more developers - and managers - are coming to understand the risks and costs of insecure software, and are taking security more seriously. And defenders are doing a better job of understanding software development and how to work with developers. But there's still a long way to go.
Developers still need better skills in secure software development and a better understanding of application security risks. And time to learn and apply these skills. Defenders are trying to catch up with developers and Lean/Agile development, injecting security earlier into requirements and design, leveraging automated tools and services to accelerate security testing. But they are coming up against organizational and communications silos, and managers who put marketing priorities (features and time-to-market) ahead of everything else.
More than 1/3 of the organizations surveyed are looking at secure DevOps as a way to help bridge these gaps, break down the silos and bring development and security together. This is going to require some serious changes to how application security and development are done, but it offers a new hope for secure software.
You can read the detailed report of the survey results here.
Friday, May 8, 2015
DevOps probably isn't killing developers.
But it is changing how people think about development - from running projects to a focus on building and running services. And more importantly, DevOps is killing maintenance, or sustaining engineering, or whatever managers want to call it. And that’s something that we should all celebrate.
High-bandwidth collaboration and rapid response to change in Agile put a bullet in the head of offshore development done by outsourced CMMI Level 5 certified development factories. DevOps, by extending collaboration between development teams and operations teams and by increasing the velocity of delivery to production (up to hundreds or even thousands of times per day), and by using real feedback from production to drive development priorities and design decisions, has pulled the plug on the sick idea that maintenance should be done by sustaining engineering teams, offshored together with the help desk to somewhere far away in order to get access to cheap talent.
Agile started the job. DevOps can finish it
While large companies were busy finding offshore development and testing partners, Agile changed the rules of the game on them.
Offshoring coding and testing work made sense in large-scale waterfall projects with lots of upfront planning and detailed specs that could be handed off from analysts to farms of programmers and testers.
But the success of Agile adoption in so many organizations, including large enterprises, made outsourcing or offshoring development work less practical and less effective. Instead of detailed analysis and documented hand-offs, Agile teams rely on high-bandwidth face-to-face collaboration with each other and especially with the Customer, and rapid iteration and feedback. Everything happens faster. Customers change priorities and requirements. Developers respond and build and deliver features faster.
Time-intensive and people-intensive work like manual testing and reviews is replaced with automated testing and static analysis in Continuous Integration, pair programming, and continuous review and improvement.
In this dynamic world, it doesn’t make sense to try to shovel work offshore. You have to give up too much in return for saving on staff costs. Teleconferencing across time zones and cultures, “virtual team rooms” using webcams, remote pair programming over Skype… these are all poor compromises that lead to misunderstandings, inefficiencies and mistakes. Sure you can do offshore Agile development, but just because you can do something doesn’t mean that it is a good idea.
DevOps is going to finish the job
In DevOps, with Continuous Delivery and Continuous Deployment, changes happen even faster. Cycle times and response times get shorter, from months or weeks to days or hours. And feedback cycles are extended from development into production, letting the entire IT organization experiment and learn and improve from real customer use.
Developers collaborate even more, not just with each other and with customers, but with operations too, in order to make sure that the system is setup correctly and running optimally. This can’t be done effectively by development and operations teams working in different time zones. And it doesn’t need to be.
We all know how outsourcing has played out. In the name of efficiency we sliced out non-strategic parts of core IT and farmed them out to other companies, whether offshore or domestic. CIOs loved it because of the budgetary benefits. Meanwhile, it sparked a thousand conversations about what outsourcing meant for IT, the US economy, individual careers, and the relationship between people and businesses.
But it turned out that we took outsourcing too far. It makes sense for some functions, but it can also mean losing control over management, quality, and security, among other things. Now we're seeing a lot of those big contracts being pulled back, and the word of the day is insourcing.
InformationWeek, DevOps: The New Outsourcing
DevOps intentionally blurs the lines between developers and operations, between coding and support. Engineering is engineering. Project work gets broken down into piece work: individual features or fixes or upgrades that can be completed quickly and pushed into production as soon as possible. Development work is prioritized together with operations and support tasks. What matters is whatever is important to the business, whatever is needed for the system to run. If the business needs something fixed now, your best people are fixing it, instead of giving it to some kids or shipping it overseas.
In DevOps, developers are accountable for making sure that their code works in production:
Which means making sure that the code gets into production, monitoring it to make sure that it is working correctly, and diagnosing and fixing problems if something breaks.
New features, changes, fixes, upgrades, support work, deployment… everything is done by the same people, working together. Which means that maintenance and support gets the same management focus as new development. Which means that nobody is stuck in a dead-end job sustaining a dead-end system. Which means that customers get better results, when they need them.
Except for enterprise legacy systems on life support, maintenance as most of us think of it today should die soon, thanks to DevOps. That alone makes DevOps worth adopting.
Thursday, April 30, 2015
There was a lot of talk at RSA this year about DevOps and security: DevOpsSec or DevSecOps or Rugged DevOps or whatever people want to call it. This included a full-day seminar on DevOps before the conference opened and several talks and workshops throughout the conference which tried to make the case that DevOps isn’t just about delivering software faster, but making software better and more secure; and that DevOps isn't just for the Cloud, but that it can work in the enterprise.
The Rugged DevOps story is based on a few core ideas:
Delivering smaller changes, more often, reduces complexity. Smaller, less complex changes are easier to code and test and review, and easier to troubleshoot when something goes wrong. And this should result in safer and more secure code: less complex code has fewer bugs, and code that has fewer bugs also has fewer vulnerabilities.
If you’re going to deliver code more often, you need to automate and streamline the work of testing and deployment. A standardized, repeatable and automated build and deployment pipeline, with built-in testing and checks, enables you to push changes out much faster and with much more confidence, which is important when you are trying to patch a critical vulnerability.
And using an automated deployment pipeline for all changes – changes to application code and configuration and changes to infrastructure – provides better change control. You know what was changed, by who and when, on every system, and you can track all changes back to your version control system.
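To make that "who, what, when" trail concrete, here is a minimal sketch of the kind of audit record a deployment pipeline could write on every change. The field names and the in-memory log are illustrative assumptions, not any particular tool:

```python
import datetime

def record_deployment(log, system, change_id, commit, deployed_by):
    """Append an audit record tying a production change back to version control."""
    entry = {
        "system": system,
        "change_id": change_id,   # ticket number in your tracking system
        "commit": commit,         # version control revision that was deployed
        "deployed_by": deployed_by,
        "deployed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry

# Every automated deployment writes one entry, so "what was changed, by who
# and when, on every system" becomes a simple query.
audit_log = []
record_deployment(audit_log, "billing", "JIRA-1234", "a1b2c3d", "alice")
```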
But this means that you need to re-tool and re-think how you do deployment and configuration management, which is why so many vendors – not just Opscode and Puppet Labs, but classic enterprise vendors like IBM – are so excited about DevOps.
The DevOps Security Testing Problem
And you also need to re-tool and re-think how you do testing, especially system testing and security testing.
In DevOps, with Continuous Delivery or especially Continuous Deployment to production, you don’t have a “hardening sprint” where you can schedule a pen test or in-depth scans or an audit or operational reviews before the code gets deployed. Instead, you have to do your security testing and checks in-phase, as changes are checked-in. Static analysis engines that support incremental checking can work here, but most other security scanning and testing tools that we rely on today won’t keep up.
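To illustrate what "in-phase" incremental checking means, here is a toy sketch that examines only the lines added by a check-in, parsed from unified diff text, so the check scales with the size of the change rather than the size of the codebase. The single pattern is a placeholder for a real analysis engine:

```python
import re

# Placeholder rule standing in for a real incremental static analysis engine
SUSPECT = re.compile(r"\beval\(|\bos\.system\(")

def added_lines(diff_text):
    """Yield the lines introduced by a unified diff (skipping the +++ header)."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]

def check_diff(diff_text):
    """Flag suspicious additions only -- untouched code is not re-scanned."""
    return [line for line in added_lines(diff_text) if SUSPECT.search(line)]

diff = """--- a/app.py
+++ b/app.py
+import os
+os.system(cmd)
"""
flagged = check_diff(diff)
```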
Which means that you’ll need to write your own security tests. But this raises a serious question. Who’s going to write these tests?
Infosec? There’s already a global shortage of people who understand application security today. And most of these people – the ones who aren’t working at consultancies or for tool vendors – are busy doing risk assessments and running scans and shepherding the results through development to get vulnerabilities fixed, or maybe doing secure code reviews or helping with threat modeling in a small number of more advanced shops. They don’t have the time or often the skills to write automated security tests in Ruby or whatever automated testing framework that you select.
QA? In more and more shops today, especially where Agile or DevOps methods are followed, there isn't anybody in QA, because manual testers who walk through testing checklists can’t keep up, so developers are responsible for doing their own testing.
When it comes to security testing, this is a problem. Most developers still don’t have the application security knowledge to understand how to write secure code, which means that they also don’t understand enough about security to know what security tests need to be written. And writing an automated attack in Gauntlt (and from what I can tell, more people are talking about Gauntlt than writing tests with it) is a lot different than writing happy path automated unit tests in JUnit or UI-driven functional tests in Selenium or Watir.
So we shouldn’t expect too much from automated security testing in DevOps. There’s not enough time in a Continuous Delivery pipeline to do deep scanning or comprehensive fuzzing especially if you want to deploy each day or multiple times per day, and we won’t get real coverage from some automated security tests written in Gauntlt or Mittn.
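For a sense of what a simple automated security check looks like (in the spirit of Gauntlt or Mittn, but written as a plain test), here is a sketch that verifies security response headers. The header list and the stubbed response are illustrative; in CI this would run against a freshly deployed test instance:

```python
# Assumed baseline of required headers -- adjust to your own policy
REQUIRED_HEADERS = {
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Strict-Transport-Security": "max-age=31536000",
}

def missing_security_headers(response_headers):
    """Return the required security headers that are absent or have the wrong value."""
    return [
        name for name, expected in REQUIRED_HEADERS.items()
        if response_headers.get(name) != expected
    ]

# Stubbed response headers standing in for a real HTTP response
stub = {"X-Content-Type-Options": "nosniff"}
problems = missing_security_headers(stub)
```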
But maybe that’s ok, because DevOps could force us to change the way that we think about and the way that we do application security, just as Agile development changed the way that most of us design and build applications.
DevOpsSec – a Forcing Factor for Change
Agile development pushed developers to work more closely with each other and with the Customer, to understand real requirements and priorities and to respond to changes in requirements and priorities. And it also pushed developers to take more responsibility for code quality and for making sure that their code actually did what it was supposed to, through practices like TDD and relentless automated testing.
DevOps is pushing developers again, this time to work more closely with operations and infosec, to understand what’s required to make their code safe and resilient and performant. And it is pushing developers to take responsibility for making their code run properly in production:
“You build it, you run it”
Werner Vogels, CTO Amazon
When it comes to security, DevOps can force a fundamental change in how application security is done today, from "check-then-fix" to something that will actually work: building security in from the beginning, where it makes the most difference. But a lot of things have to change for this to succeed:
Developers need better appsec skills, and they need to work more closely with ops and with infosec, so that they can understand security and operational risks and understand how to deal with them proactively. Thinking more about security and reliability in requirements and design, understanding the security capabilities of their languages and frameworks and using them properly, writing more careful code and reviewing code more carefully.
Managers and Product Owners need to give developers the time to learn and build these skills, and the time to think through design and to do proper code reviews.
Infosec needs to become more iterative and more agile, to move out front, so that they can understand changing risks and threats as developers adopt new platforms and new technologies (the Cloud, Mobile, IoT, …). So that they can help developers design and write tools and tests and templates instead of preparing checklists – to do what Intuit calls “Security as Code”.
DevOps isn’t making software more secure – not yet. But it could, if it changes the way that developers design and build software and the way that most of us think about security.
Wednesday, April 15, 2015
Someone on your development team, or a contractor or a consultant, or one of your sys admins, or a bad guy who stole one of these people’s credentials, might have put a backdoor, a logic bomb, a Trojan or other “malcode” into your application code. And you don’t know it.
How much of a real problem is this? And how can you realistically protect your organization from this kind of threat?
The bad news is that it can be difficult to find malcode planted by a smart developer, especially in large legacy code bases. And it can be hard to distinguish between intentionally bad code and mistakes.
The good news is that according to research by CERT’s Insider Threat Program, less than 5% of insider attacks involve someone intentionally tampering with software (for a fascinating account of real-world insider software attacks, check out this report from CERT). Which means that most of us are in much greater danger from sloppy design and coding mistakes in our code and in the third party code that we use than we are from intentional fraud or other actions by malicious insiders.
And the better news is that most of the work in catching and containing threats from malicious insiders is the same work that you need to do to catch and prevent security mistakes in coding. Whether it is sloppy/stupid or deliberate/evil, you look for the same things, for what Brenton Kohler at Cigital calls “red flags”:
- Stupid or small accidental or “accidental” mistakes in security code such as authentication and session management, access control, or in crypto or secrets handling
- Hard-coded URLs or IPs or other addresses, hard-coded user-ids and passwords or password hashes or keys in the code or in configuration. Potential backdoors for insiders, whether they were intended for support purposes or not, are also holes that could be exploited by attackers
- Test code or debugging code or diagnostics
- Embedded shell commands
- Hidden commands, hidden parameters and hidden options
- Logic mistakes in handling money (like penny shaving) or risk limits or managing credit card details, or in command or control functions, or critical network-facing code
- Mistakes in error handling or exception handling that could leave the system open
- Missing logging or missing audit functions, and gaps in sequence handling
- Code that is overly tricky or unclear, or that just doesn’t make sense. A smart bad guy will probably take steps to obfuscate what they are trying to do, and anything that doesn’t make sense should raise red flags. Even if this code isn’t intentionally malicious, you don’t want it in your system
- Self-modifying code. See above.
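Some of the red flags above lend themselves to a crude automated first pass. This sketch only marks lines for human review – the patterns are illustrative examples, not a real detection rule set:

```python
import re

# Illustrative patterns for a few of the red flags listed above
RED_FLAGS = {
    "hard-coded IP address": re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"),
    "hard-coded credential": re.compile(r"(password|passwd|secret)\s*=\s*['\"]", re.I),
    "embedded shell command": re.compile(r"\b(system|popen|exec)\s*\("),
}

def flag_for_review(source):
    """Return (line number, red flag label) pairs for a human reviewer to check."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for label, pattern in RED_FLAGS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits

hits = flag_for_review('host = "10.0.0.1"\npassword = "s3cret"\n')
```

A match is a prompt for a reviewer, never a verdict – which is the point of the list above.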
Some of these issues can be found through static analysis. For example, Veracode explains how some common backdoors can be detected by scanning byte code.
But there are limits to what tools can find, as Mary Ann Davidson at Oracle, in a cranky blog post from 2014 points out:
"It is in fact, trivial, to come up with a “backdoor” that, if inserted into code, would not be detected by even the best static analysis tools. There was an experiment at Sandia Labs in which a backdoor was inserted into code and code reviewers told where in code to look for it. They could not find it – even knowing where to look."
If you’re lucky, you might find some of these problems through fuzzing, although it’s hard to fuzz code and interfaces that are intentionally hidden.
The only way that you can have confidence that your system is probably free of malcode – in the same way that you can have confidence that your code is probably free of security vulnerabilities and other bugs – is through disciplined and careful code reviews, by people who know what they are looking for. Which means that you have to review everything, or at least everything important: framework and especially security code, protocol libraries, code that handles confidential data or money, …
And to prevent programmers from colluding, you should rotate reviewers or assign them randomly, and spot check reviews to make sure that they are being done responsibly (that reviews are not just rubber stamps), as outlined in the DevOps Audit Defense Toolkit.
And if the stakes are high enough, you may also need eyes from outside on your code, like the Linux Foundation’s Core Infrastructure Initiative is doing, paying experts to do a detailed audit of OpenSSL, NTP and OpenSSH.
You also need to manage code from check-in through build and test to deployment, to ensure that you are actually deploying what you checked-in and built and tested, and that code has not been tampered with along the way. Carefully manage secrets and keys. Use checksums/signatures and change detection tools like OSSEC to watch out for unexpected or unauthorized changes to important configs and code.
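Here is a minimal sketch of file change detection along the lines of what tools like OSSEC do: record a checksum baseline, then report anything that differs. The in-memory file dicts are stand-ins for real paths on disk:

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def baseline(files):
    """files: dict of path -> contents (bytes). Returns path -> checksum."""
    return {path: checksum(data) for path, data in files.items()}

def detect_changes(base, files):
    """Return the paths whose contents no longer match the baseline."""
    return [path for path, data in files.items()
            if base.get(path) != checksum(data)]

base = baseline({"app.conf": b"debug=false\n"})
changed = detect_changes(base, {"app.conf": b"debug=true\n"})
```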
This will help you to catch malicious insiders as well as honest mistakes, and attackers who have somehow compromised your network. The same goes for monitoring activity inside your network: watching out for suspect traffic to catch lateral movement should catch bad guys regardless of whether they came from the outside or the inside.
If and when you find something, the next problem is deciding if it is stupid/sloppy/irresponsible or malicious/intentional.
Cigital’s Kohler suggests that if you have serious reasons to fear insiders, you should rely on a small number of trusted people to do most of the review work, and that you try to keep what they are doing secret, so that bad developers don’t find out and try to hide their activity.
For the rest of us who are less paranoid, we can be transparent, shine a bright light on the problem from the start.
Make it clear to everyone that your customers, shareholders and regulators require that code must be written responsibly, and that everybody’s work will be checked.
Include strict terms in employment agreements and contracts for everyone who could touch code (including offshore developers and contractors and sys admins) which state that they will not under any circumstances insert any kind of time bomb, backdoor or trap door, Trojan, Easter Egg or any kind of malicious code into the system – and that doing so could result in severe civil penalties as well as possible criminal action.
Make it clear that all code and other changes will be reviewed for anything that could be malcode.
Train developers on secure coding and how to do secure code reviews so that they all know what to look for.
If everyone knows that malcode will not be tolerated, and that there is a serious and disciplined program in place to catch this kind of behavior, it is much less likely that someone will try to get away with it – and even less likely that they will be able to get away with it.
You can do this without destroying a culture of trust and openness. Looking out for malcode, like looking out for mistakes, simply becomes another part of your SDLC. Which is the way it should be.
Tuesday, April 7, 2015
Infrastructure as Code is fundamental to DevOps. Automating the work of setting up and maintaining systems infrastructure. Making it defined, efficient, testable, auditable and standardized.
For the many of us who work in regulated environments, we need more. We need Compliance as Code.
Take regulatory constraints and policies and compliance procedures and the processes and constraints that they drive, and wire as much of this as possible into automated workflows and tests. Making it defined, efficient, testable, auditable and standardized.
DevOps Audit Defense Toolkit
Some big steps towards Compliance as Code are laid out in the DevOps Audit Defense Toolkit, a freely-available document which explains how compliance requirements such as separation of duties between developers and operations, and detecting/preventing unauthorized changes, can be met in a DevOps environment, using some common, basic controls:
- Code Reviews. All code changes must be peer reviewed before check-in. Any changes to high-risk code must be reviewed a second time by an expert. Reviewers check code and tests for functional and operational correctness and consistency. They look for coding and design mistakes and gaps, operational dependencies, for back doors and for security vulnerabilities. Which means that developers must be trained and guided in how to do reviews properly. Peer reviews also ensure that changes can’t be pushed without at least one other person on the team understanding what is going on.
- Static analysis. Static analysis is run on all changes to catch security bugs and other problems. Any violations of coding rules will break the build.
- Automated testing is done in Continuous Integration/Continuous Delivery – unit and integration testing, and security testing. The Audit Toolkit assumes that developers follow TDD to ensure a high level of test coverage. All tests must pass.
- Traceability of all changes back to the original request, using a ticketing system like Jira (you can’t just use index cards on a wall to describe stories and throw them out when you are done).
- Operations checks/asserts after deployment and startup, and feedback from operations monitoring and especially from production failures. Metrics and post mortem review findings are used to drive improvements to testing and instrumentation, as well as deeper changes to policy definition, training and hiring – see John Allspaw’s presentation Ops Meta-Metrics: The Currency you use to pay for Change, from Velocity 2010, on how this can be done.
- All changes to code and infrastructure definitions, including bug fixes and patches, are deployed through the same automated, auditable Continuous Delivery pipeline.
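The "checks/asserts after deployment and startup" control can be sketched as a smoke test the pipeline runs immediately after a deploy, failing loudly if the service isn’t healthy. The named probes here are stand-ins for real checks:

```python
def smoke_test(checks):
    """Run named post-deployment checks; return the names of any that failed."""
    return [name for name, check in checks.items() if not check()]

# The probes are illustrative; real checks would hit health endpoints,
# verify config versions, ping dependencies, and so on.
checks = {
    "config loaded": lambda: True,
    "database reachable": lambda: True,
    "queue depth sane": lambda: False,   # simulated failing probe
}
failures = smoke_test(checks)
```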
A starting point
The DevOps Audit Defense Toolkit provides a starting point, an example to build on. You can add your own rules, checks, reviews, tests, and feedback loops.
It is also a work in progress. There are a few important problems still to be worked out:
The Audit Toolkit describes how standard changes can be handled in Continuous Delivery: small, well-defined, low-impact changes that are effectively pre-approved. Operations and management are notified as these changes are deployed (the changes are logged, information is displayed on screens and included in reports), but there is no upfront communication or coordination of these changes, because it shouldn’t be necessary. Developers can push changes out as soon as they are ready, and they get deployed immediately after all reviews and tests and other checks pass.
But the Audit Toolkit is silent on how to manage larger scale changes, including changes to data and databases, changes to interfaces with other systems, changes required to comply with new laws and regulations, major new customer features and technical upgrades. Changes that are harder to rollout, that have wider impact and higher risk, and require much more coordination. Which is, of course, the stuff that matters most.
You need clear and explicit hand-offs to operations and customer service for larger changes, so that all stakeholders understand the dependencies and risks and impact on how they work so that they can plan ahead. This can still be done in a DevOps way, but it does require meetings and planning, and some project management and paperwork. As an example, see how Etsy manages feature launches.
You also need to ensure that the policies for defining which changes are small enough and simple enough to be pre-approved, and for deciding which code changes are high risk and need additional review, are reasonable, unambiguous and consistent. You need to do frequent reviews to ensure that these policies are rigorously followed and that people don’t misunderstand or try to get away with pushing higher-risk, non-standard changes through without management/CAB oversight and explicit change approval.
Done properly, this means that the full weight of change control is only brought to bear when it is needed – for changes that have real operational or business risk. Then you want to find ways to minimize these risks, to break changes down into smaller pieces, to simplify, streamline and automate as much of the work required as possible, leveraging the same testing and delivery infrastructure.
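One way to keep the pre-approval policy unambiguous and consistent is to encode it, so the pipeline itself decides which changes are standard. The paths and threshold below are invented examples, not a recommendation:

```python
# Invented examples of a change control policy expressed as code
HIGH_RISK_PATHS = ("db/migrations/", "payments/", "auth/")
MAX_STANDARD_LINES = 100

def classify_change(files_touched, lines_changed):
    """Return 'standard' (pre-approved) or 'review' (needs explicit approval)."""
    if lines_changed > MAX_STANDARD_LINES:
        return "review"
    if any(f.startswith(HIGH_RISK_PATHS) for f in files_touched):
        return "review"
    return "standard"

# A small change touching a payment path is not a standard change
kind = classify_change(["payments/refund.py"], 5)
```

Because the policy is a function, it can be reviewed, versioned and tested like any other code.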
There is a lot of attention to responsible security testing in the Audit Toolkit. Because changes are made incrementally and iteratively, and pushed out automatically, you’ll need tools and tests that work automatically, incrementally and iteratively. Which is unfortunately not how most security tools work, and not how most security testing is done today.
There aren’t that many organizations using tools like Gauntlt or BDD-Security to write higher-level automated security tests and checks as part of Continuous Integration or Continuous Delivery. Most of us depend on dynamic and static scanners and fuzzers that can take hours to run and require manual review and attention, or expensive, time-consuming manual pen tests. This clearly can’t be done on every check-in.
But as more teams adopt Agile and now DevOps practices, the way that security testing is done is also changing, in order to keep up. Static analysis tools are getting speedier, and many tools can provide feedback directly to developers in their IDEs, or work against incremental change sets. Dynamic testing tools and services are becoming more scriptable and more scalable and simpler to use, with open APIs.
Interactive security testing tools like Contrast or Quotium Seeker can catch security errors at run-time as the system is being tested in Continuous Integration/Delivery. And companies like Signal Sciences are working on new ways to do agile security for online systems. But this is new ground: there’s still lots of digging and hoeing that needs to be done.
Do developers need access to production?
The Audit Toolkit assumes that developers will have read access to production logs, and that they may also need direct access to production in order to help with troubleshooting and support. Even if you restrict developers to read only access, this raises concerns around data privacy and confidentiality.
And what if read access is not enough? What if developers need to make a hot fix to code or configuration that can’t be done through the automated pipeline, or repair production data? Now you have problems with separation of duties and data integrity.
What should developers be able to do, what should they be able to see? And how can this be controlled and tracked? If you are allowing developers in production, you need to have solid answers for these questions.
Continuous Deployment or Continuous Delivery?
The Audit Toolkit makes the argument that with proper controls in place, developers should be able to push changes directly out to production when they are ready – provided that these changes are low-risk and only if the changes pass through all of the reviews and tests in the automated deployment pipeline.
But this is not something that you have to do or even can do – not because of compliance constraints necessarily, but because your business environment or your architecture won’t support making changes on the fly. Continuous Delivery does not have to mean Continuous Deployment. You can still follow disciplined Continuous Delivery through to pre-production, with all of the reviews and checks in place, and then bundle changes together and release them when it makes sense.
Selling to regulators and auditors
You will need to explain and sell this approach to regulators and auditors – to lawyers or wanna-be lawyers. Convincing them – and helping them – to look at code and logs instead of legal policies and checklists. Convincing them that it’s ok for developers to push low-risk, pre-approved changes to production, if you want to go this far.
Just as beauty is in the eye of the beholder, compliance is in the opinion of the auditor. They may not agree with or understand what you are doing. And even if one auditor does, the next one may not. Be prepared for a hard sell, and for setbacks.
Disciplined, Agile and Lean
The DevOps Audit Defense Toolkit describes a disciplined, but Agile/Lean approach to managing software and system changes in a highly regulated environment.
This is definitely not easy. It’s not lightweight. It takes a lot of engineering discipline. And a lot of investment in automation and in management oversight to make it work.
But it’s still Agile. It supports the rapid pace and iterative, incremental way that development teams want to work today. And Lean. Because all of the work is clearly laid out and automated wherever possible. You can map the value chains and workflows, measure delays and optimize, review and improve.
Instead of detailed policies and procedures and checklists that nobody can be sure are actually being followed, you have automated delivery and deployment processes that you exercise all of the time, so you know they work. Policies and guidelines are used to drive decisions, which means that they can be simpler and clearer and more practical. Procedures and checklists are burned into automated steps and controls.
This could work. It should work. And it’s worth trying to make work. Instead of compliance theater and tedious and expensive overhead, it promises that changes to systems can be made simpler, more predictable, more efficient and safer. That’s something that’s worth doing.
Thursday, March 19, 2015
The researchers found that refactoring didn’t seem to make code measurably easier to understand or change, or even measurably cleaner (measured by cyclomatic complexity, depth of inheritance, class coupling or lines of code).
But as other people have discussed, this study is deeply flawed. It appears to have been designed by people who didn’t understand how to do refactoring properly:
The researchers chose 10 “high impact” refactoring techniques (from a 2011 study by Shatnawi and Li) based on a model of OO code quality which measures reusability, flexibility, extendibility and effectiveness (“the degree to which a design is able to achieve the desired functionality and behavior using OO design concepts and techniques” – whatever that means), but which specifically did not include understandability. And then they found that the refactored code was not measurably easier to understand or fix. Umm, should this have been a surprise?
The refactorings were intended to make the code more extensible and reusable and flexible. In many cases this would have actually made the code less simple and harder to understand. Flexibility and extendibility and reusability often come at the expense of simplicity, requiring additional scaffolding and abstraction. These are long-term investments that are intended to pay back over the life of a system – something that could not be measured in the couple of hours that the study allowed.
The list of techniques did not include common and obviously useful refactorings which would have made the code simpler and easier to understand, such as Extract Class and Extract Method (which are the two most impactful refactorings, according to research by Alshehri and Benedicenti, 2014), Extract Variable, Move Method, Change Method Signature, Rename anything, … [insert your own shortlist of other useful refactorings here].
There is no evidence – and no reason to believe – that the refactoring work that was done was done properly. Presumably somebody entered some refactoring commands in Visual Studio and the code was “refactored properly”.
The study set out to measure whether refactoring made code easier to change. But they attempted to do this by assessing whether students were able to find and fix bugs that had been inserted into the code – which is much more about understanding the code than it is about changing it.
The code base (4500 lines) and the study size (two groups of 10 students) were both too small to be meaningful, and students were not given enough time to do meaningful work: 5 minutes to read the code, 30 minutes to answer some questions about it, 90 minutes to try to find and fix some bugs.
And as the researchers point out, the developers who were trying to understand the code were inexperienced. It’s not clear that they would have been able to understand the code and work with it even if it had been refactored properly.
But the study does point to some important limitations to refactoring and how it needs to be done.
Good Refactoring takes Time
Refactoring code properly takes experience and time. Time to understand the code. Time to understand which refactorings should be used in what context. Time to learn how to use the refactoring tools properly. Time to learn how much refactoring is enough. And of course time to get the job done right.
Someone who isn’t familiar with the language or the design and the problem domain, and who hasn’t worked through refactoring before won’t do a good job of it.
Refactoring is Selfish
When you refactor, it’s all about you. You refactor the code in ways to make it easier for YOU to understand and that should make it easier for YOU to change in the future. But this doesn’t necessarily mean that the code will be easier for someone else to understand and change.
It’s hard to go wrong doing some basic, practical refactoring. But deeper and wider structural changes, like Refactoring to Patterns or other “Big Refactoring” or “Large Scale Refactoring” changes that make some programmers happy, can also make the code much harder for other programmers to understand and work with – especially if the work only gets done part way (which often happens with well-intentioned, ambitious root canal refactoring work).
In the study, the researchers thought that they were making the code better, by trying to make it more extensible, reusable and flexible. But they didn’t take the needs of the students into consideration. And they didn’t follow the prime directive of refactoring:
Always start by refactoring to understand. If you aren’t making the code simpler and easier to understand, you’re doing it wrong.
Ironically, what the students in the study should have done – with the original code, as well as the “refactored code” – was to refactor it on their own first so that they could understand it. That would have made for a more interesting, and much more useful, study.
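As a concrete (and entirely hypothetical) illustration of refactoring to understand: no behavior changes, just renaming and extracting until the intent is obvious to the next reader.

```python
# Hypothetical before-and-after of "refactoring to understand".

# Before: terse names force the reader to reverse-engineer the formula.
def calc(p, r, n):
    return p * (1 + r / 100) ** n

# After: the same computation, with names that say what it means.
def compound_amount(principal, annual_rate_percent, years):
    growth_factor = 1 + annual_rate_percent / 100
    return principal * growth_factor ** years
```

Nothing clever happened here – and that’s the point. The second version is the one an inexperienced developer could find a bug in.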
There’s no doubt that refactoring – done properly – will make code more understandable, more maintainable, and easier to change. But you need to do it right.
Wednesday, March 4, 2015
To build a secure app, you can’t wait until the end and hope to “test security in”. For teams who follow Agile methods like Scrum, this means you have to find a way to add security into Sprints. Here’s how to do it:
A few basic security steps need to be included upfront in Sprint Zero:
- Platform selection – when you are choosing your language and application framework, take some time to understand the security functions they provide. Then look around for security libraries like Apache Shiro (a framework for authentication, session management and access control), Google KeyCzar (crypto), and the OWASP Java Encoder (XSS protection) to fill in any blanks.
- Data privacy and compliance requirements – make sure that you understand what data needs to be protected and audited for compliance purposes (including PII), and what you will need to prove to compliance auditors.
- Secure development training – check the skill level of the team, fill in as needed with training on secure coding. If you can’t afford training, buy a couple of copies of Iron-Clad Java, and check out SAFECode’s free seminars on secure coding.
- Coding guidelines and code review guidelines – consider where security fits in. Take a look at CERT’s Secure Java Coding Guidelines.
- Testing approach – plan for security unit testing in your Continuous Integration pipeline. And choose a static analysis tool and wire it into Continuous Integration too. Plan for pen testing or other security stage gates/reviews later in development.
- Assigning a security lead – someone on the team who has experience and training in secure development (or who will get extra training in secure development), or someone from infosec, who will act as the point person on risk assessments, lead threat modeling sessions, coordinate pen testing and scanning, triage the vulnerabilities found, and bring new developers up to speed.
- Incident Response - think about how the team will help ops respond to outages and to security incidents.
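The “security unit testing in Continuous Integration” item above can start as small as one test that runs on every build. A minimal sketch, using Python’s stdlib html.escape as a stand-in for whatever encoder the application actually uses (for a Java app, the OWASP Java Encoder plays this role):

```python
import html
import unittest

class OutputEncodingTest(unittest.TestCase):
    # Run by the CI pipeline on every build, e.g. python -m unittest.
    def test_script_payload_is_neutralized(self):
        payload = '<script>alert("xss")</script>'
        encoded = html.escape(payload)
        # If any raw angle bracket survives encoding, the build should fail.
        self.assertNotIn("<", encoded)
        self.assertNotIn(">", encoded)
```

Cheap tests like this catch regressions in security plumbing the same way any other unit test catches regressions in business logic.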
The first few Sprints, where you start to work out the design and build out the platform and the first versions of key interfaces and integration points, are when the application’s attack surface expands quickly.
You need to do threat modeling to understand security risks and make sure that you are handling them properly. Threat modeling can be as simple as working through four questions:
- What are you building?
- What can go wrong?
- What are you going to do about it?
- Did you do an acceptable job at 1-3?
Delivering Features (Securely)
A lot of development work is business as usual, delivering features that are a lot like the other features that you’ve already done: another screen, another API call, another report or another table. There are a few basic security concerns that you need to keep in mind when you are doing this work. Make sure that problems caught by your static analysis tool or security tests are reviewed and fixed. Watch out in code reviews for proper use of frameworks and libraries, and for error and exception handling and defensive coding.
Take some extra time when a security story comes up (a new security feature or a change to security or privacy requirements), and think about abuser stories whenever you are working on a feature that deals with something important like money, or confidential data, or secrets, or command-and-control functions.
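Abuser stories work well when they are written down as tests. A hedged sketch for a money feature – the transfer rules and names here are invented for illustration, not taken from any real system:

```python
class TransferError(Exception):
    pass

def withdraw_for_transfer(balance, amount):
    # Abuser story: a negative amount would reverse the flow of money.
    if amount <= 0:
        raise TransferError("amount must be positive")
    # Abuser story: transferring more than the balance overdraws the account.
    if amount > balance:
        raise TransferError("insufficient funds")
    return balance - amount
```

Each abuser story becomes an explicit check in the code and a test case that stays in the suite, so the protection doesn’t quietly disappear in a later change.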
You need to think about security any time you are doing heavy lifting: large-scale refactoring, upgrading framework code or security plumbing or the run-time platform, introducing a new API or integrating with a new system. Just like when you are first building out the app, spend extra time threat modeling, and be more careful in testing and in reviews.
At some point later in development you may need to run a security Sprint or hardening Sprint – to get the app ready for release to production, or to deal with the results of a pen test or vulnerability scan or security audit, or to clean up after a security breach.
This could involve all or only some of the team. It might include reviewing and fixing vulnerabilities found in pen testing or scanning. Checking for vulnerabilities in third party and Open Source components and patching them. Working with ops to review and harden the run-time configuration. Updating and checking your incident response plan, or improving your code review or threat modeling practices, or reviewing and improving your security tests. Or all of the above.
Adding Security into Sprints. Just Do It.
Adding security into Sprints doesn’t have to be hard or cost a lot. A stripped down approach like this will take you a long way to building secure software. And if you want to dig deeper into how security can fit into Sprints, you can try out Microsoft’s SDL for Agile. Just do it.
Wednesday, February 25, 2015
Most of what we read about or hear about in DevOps emphasizes speed. Continuous Deployment. Fast feedback. Fail fast, fail often.
How many times do we have to hear about how many times Amazon or Facebook or Netflix or Etsy deploy changes every day or every hour or every minute?
Software Development at the Speed of DevOps
Security at the Speed of DevOps
DevOps at the Speed of Google
Devops Explained: A Philosophy of Speed, Not Momentum
It’s all about the Speed: DevOps and the Cloud
Even enterprise DevOps conferences are about speed and more speed.
Speed is Sexy, but...
Speed is sexy. Speed sells. But speed isn’t the point.
Go back to John Allspaw’s early work at Flickr, which helped kick off DevOps. Actually, look at all of Allspaw’s work. Most of it is about minimizing the operational and technical risk of change. Minimizing the chance of making mistakes. Minimizing the impact of mistakes. Minimizing the time needed to detect, understand and recover from mistakes. Learning from mistakes when they happen and improving so that you don't make the same kind of mistakes again or so that you can catch them and fix them quicker. Breaking down silos between dev and ops so that they can work together to solve problems.
Checking everything into version control – code, application configuration, server and network configurations… not about maximizing speed.
Breaking releases down into small change sets with fewer moving parts and fewer dependencies makes changes easier to understand, easier to review, easier to test, and simpler to deploy and to roll back or fix. This is not about maximizing speed.
Executing automated tests in Continuous Integration…
Building out test environments to match production so that developers can test and learn how their system will work under real-world conditions…
Building automated integration and deployment pipelines to test and to production so that you can push out a change or a fix immediately…
Change controls based on transparency and peer reviews and repeatable automated controls instead of CCB meetings…
Auditing all of this so that you know what was changed by who and when…
Developers talking to ops and learning and caring about run-time infrastructure and operations procedures….
Ops talking to developers and learning and caring about the application and how it is built and deployed and configured…
Wiring monitoring and metrics and alerting into the system from the beginning…
Running game days and testing your incident response capabilities with developers and ops…
Injecting automated security testing and checks into your build and deployment chain…
None of this is about speed. It is about building better communications paths and feedback loops between the business and developers and operations. About building a safe, open culture where people can confront mistakes and learn from them together. About building a repeatable, reliable deployment capability. Building better, more resilient software and a better, more resilient and responsive IT delivery and support organization.
DevOps is not a Race
Ignore the vendors who tell you that their latest “DevOps solution” will make your enterprise faster.
And unless you actually are an online consumer startup, ignore the hype about the Lean Startup and Continuous Deployment – this has nothing to do with running an enterprise.
DevOps is a lot of work. Don’t go into it thinking that it’s a race.
Tuesday, February 10, 2015
For the last couple of years we’ve been tracking technical debt in our development backlog. Adding debt payments to the backlog – making the cost and risk of technical debt visible to the team and to the Product Owner, and prioritizing payments alongside other work – is supposed to ensure that debt gets paid down.
But I am not convinced that it is worth it. Here’s why:
Debt that’s not worth tracking because it’s not worth paying off
Some debt isn’t worth worrying about.
A little (but not too much) copy-and-paste. Fussy coding-style issues picked up by some static analysis tools (does it really matter where the brackets are?). Poor method and variable naming. Methods which are too big. Code that doesn’t properly follow coding standards or patterns. Other inconsistencies. Hard coding. Magic numbers. Trivial bugs.
This is irritating, but it’s not the kind of debt that you need to track on the backlog. It can be taken care of in day-to-day opportunistic refactoring. The next time you’re in the code, clean it up. If you’re not going to change the code, then who cares? It’s not costing you anything. If you close your eyes and pretend that it’s not there, nothing really bad will happen.
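A hypothetical example of this kind of opportunistic cleanup – trivial debt (magic numbers, a vague name) fixed in passing, with no change in behavior:

```python
# Before: what is 3600 * 24 * 30, and what does chk check?
def chk(t):
    return t > 3600 * 24 * 30

# After: cleaned up the next time someone is in the file anyway.
SECONDS_PER_DAY = 3600 * 24
STALE_AFTER_DAYS = 30

def is_stale(age_in_seconds):
    return age_in_seconds > SECONDS_PER_DAY * STALE_AFTER_DAYS
```

This takes a minute or two in the flow of other work – far less than the cost of writing it up, estimating it, and tracking it on the backlog.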
Somebody else’s debt
A lot of the code in your system is somebody else’s: third party and Open Source libraries and frameworks, carrying somebody else’s debt. Some of this is bad – seriously bad. Exploitable security vulnerabilities. Think Heartbleed. This shouldn’t even make it to the backlog. It should be fixed right away. Make sure that you know that you can build and roll out a patched library quickly and with confidence (as part of your continuous build/integration/delivery pipeline).
Everything else is low priority. If there’s a newer version with some bug fixes, but the code works the way you want it to, does it really matter? Upgrading for the sake of upgrading is a waste of time, and there’s a chance that you could introduce new problems, break something that you depend on now, with little or no return. Remember, you have the source code – if you really need to fix something or add something, you can always do it yourself.
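This policy – fail fast on known-vulnerable versions, ignore upgrades that are merely available – can be encoded as a small CI check. A sketch, where every package name and version number is made up for illustration:

```python
def parse_version(version):
    # Naive dotted-version parser; real tools handle pre-release tags etc.
    return tuple(int(part) for part in version.split("."))

def vulnerable_pins(pinned, min_safe):
    """Return the pinned packages that sit below their known-safe floor."""
    return [name for name, floor in min_safe.items()
            if parse_version(pinned[name]) < parse_version(floor)]

# Hypothetical pins and minimum safe versions (e.g. 1.2.0 patched a CVE).
pins = {"somelib": "1.2.3", "otherlib": "4.0.1"}
floors = {"somelib": "1.2.0"}
```

The build only breaks when a dependency is below a version that fixes a published vulnerability – newer releases with routine bug fixes don’t force an upgrade.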
Debt you don’t know that you have
Some of the scariest debt is the debt that you don’t know you have. Debt that you took on unconsciously because you didn’t know any better… and you still don’t. You made some bad design decisions. You didn’t know how to use your application framework properly. You didn't know about the OWASP Top 10 and how to protect against common security attacks.
This debt can’t be on your backlog, because you don’t know that it’s there. If something changes – a new person with more experience joins the team, or you get audited, or you get hacked – this debt might get exposed suddenly. Otherwise it keeps adding up, silently, behind the scenes.
Debt that is too big to deal with
There’s other debt that’s too big to effectively deal with. Like the US National Debt. Debt that you took on early by making the wrong assumptions or the wrong decisions. Maybe you didn’t know you were wrong then, but now you do. You – or somebody before you – picked the wrong architecture. Or the wrong language, or the wrong framework. Or the wrong technology stack. The system doesn’t scale. Or it is unreliable under load. Or it is full of security holes. Or it’s brittle and difficult to change.
You can’t refactor your way out of this. You either have to put up with it as best as possible, or start all over again. Tracking it on your backlog seems pointless:
As a developer, I want to rewrite the system, so that everything doesn’t suck….
Fix it now, or it won’t get fixed at all
Technical debt that you can do something about is debt that you took on consciously and deliberately – sometimes responsibly, sometimes not.
You took short cuts in order to get the code out for fast feedback (A/B testing, prototyping). There’s a good chance that you’ll have to rewrite it or even throw it out, so why worry about getting the code right the first time? This is strategic debt – debt that you can afford to take on, at least for a while.
Or you were under pressure and couldn’t afford to do it right, right then. You had to get it done fast, and the results aren’t pretty.
The code works, but it is a hack job. You copied and pasted too much. You didn’t follow conventions. You didn’t get the code reviewed. You didn’t write tests, or at least not enough of them. You left in some debugging code. It’s going to be a pain to maintain.
If you don’t get to this soon, if you don’t clean it up or rewrite it in a few weeks or a couple of months, then there is a good chance that this debt will never get paid down. The longer it stays, the harder it is to justify doing anything about it. After all, it’s working fine, and everyone has other things to do.
The priority of doing something about it will continue to fall, until it’s like silt, settling to the bottom. Eventually you’ll forget that it’s there. When you see it, it will make you a little sad, but you’ll get over it. Like the shoppers in New York City, looking up at the US National Debt Clock, on their way to the store to buy a new TV on credit.
And hey, if you’re lucky, this debt might get paid down without you knowing about it. Somebody refactors some code while making a change, or maybe even deletes it because the feature isn’t used any more, and the debt is gone from the code base. Even though it is still on your books.
Don’t track technical debt. Deal with it instead
Tracking technical debt sounds like the responsible thing to do. If you don’t track it, you can’t understand the scope of it. But whatever you record in your backlog will never be an accurate or complete record of how much debt you actually have – because of the hidden debt that you’ve taken on unintentionally, the debt that you don’t understand or haven’t found yet.
More importantly, tracking work that you’re not going to do is a waste of everyone’s time. Only track debt that everyone (the team, the Product Owner) agrees is important enough to pay off. Then make sure to pay it off as quickly as possible. Within 1 or 2 or maybe 3 sprints. Otherwise, you can ignore it. Spend your time refactoring instead of junking up the backlog. This isn’t being irresponsible. It’s being practical.