copy & paste

Virtually all of us are guilty of copy and paste “reuse.”  There are different types of reuse, though, and each comes with its own set of problems and gotchas.

Copy/paste a method or large code snippet

This is the kind of reuse that almost everyone will agree is bad.  When all is said and done there are two copies of the code.  If a defect is found, it is likely to be fixed in one copy and not the other, which leaves an avoidable defect in the code.  This practice is a hint that we should extract a new method and call it from both the original code and our new code.

Copy/paste a code snippet and then change some values

While less blatant than copying and pasting a whole method, it is still a code smell.  This practice is a hint that we should refactor the snippet into a new method that takes some parameters.
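
As a minimal sketch of that refactoring, suppose the same two lines had been pasted twice with only the currency code and tax rate changed.  The class name, method name and values below are hypothetical and only illustrate pulling the changed values out as parameters.

```java
import java.util.Locale;

final class PriceLabels {
    private PriceLabels() {}

    // Before the refactoring, this body existed twice with different literals:
    //   "USD " + String.format("%.2f", amount * 1.08)
    //   "EUR " + String.format("%.2f", amount * 1.20)
    // The values that differed between the copies become parameters.
    static String withTax(String currency, double amount, double taxRate) {
        double total = amount * (1 + taxRate);
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return currency + " " + String.format(Locale.ROOT, "%.2f", total);
    }
}
```

Each former copy turns into a one-line call such as PriceLabels.withTax("USD", amount, 0.08), so a fix to the formatting only has to be made once.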

Copy/paste a code snippet and then change the body

This approach tends to be used for boilerplate code.  One common example is copying the Java code for a regular expression replace loop and then changing the middle, where the replace logic lives.  In languages with closures, the need for refactoring is more glaring.  In Java, it becomes a judgment call on whether the reuse is worth it.  For tiny examples, it tends not to be.  For more complex examples, an inner class or inheritance is often the answer.
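
Here is one way the regular expression case could look, as a sketch rather than the one right answer.  The RegexReplacer class and its Replacer interface are made-up names; the loop itself uses the standard Matcher.find(), appendReplacement() and appendTail() calls.  The boilerplate lives in one place and the caller supplies only the middle.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexReplacer {
    private RegexReplacer() {}

    /** The part that used to be edited by hand after each paste (hypothetical interface). */
    interface Replacer {
        String replace(MatchResult match);
    }

    /** Runs the find/append loop once; callers pass in only the replace logic. */
    static String replaceAll(Pattern pattern, CharSequence input, Replacer replacer) {
        Matcher matcher = pattern.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement = replacer.replace(matcher.toMatchResult());
            // quoteReplacement stops '$' and '\' in the computed text from
            // being interpreted as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

A caller can pass an anonymous inner class that implements Replacer, or a lambda on Java 8 and later.  On Java 9 and later, Matcher.replaceAll(Function) covers the same ground in the standard library.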

Copy/paste an idea

The hardest type of copy/paste reuse to detect is when it is on a conceptual level.  A copy/paste detection tool is unlikely to find it as the code wasn’t copy/pasted.  This kind of copy/paste reuse occurs when someone is trying to figure out how to accomplish a task – say, develop a DAO.  Suppose the developer is new and doesn’t know what to do.  The developer sees some code that loops through a result set and decides to use that idea.  Then the developer sees some code that says getString() and copies that for all the types.  This is all well and good until the data types aren’t varchars.  Now the developer has getString() being used to get numbers and Dates too.  Worse yet, this will not crash and will appear to work.  However, it will create other problems such as figuring out how to parse the date.  If you were to ask the developer why they are using getString() instead of getDate(), the developer won’t know.  This is a fairly low-level example.  The same can happen on a higher level, such as why the developer is using JDBC instead of JPA/entity beans/etc.
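
To make the getString() point concrete, here is a small sketch of a row mapper that uses the typed getters instead.  The OrderRow class and the column names (id, customer_name, placed_on) are invented for illustration; getLong(), getString() and getDate() are the standard java.sql.ResultSet methods.

```java
import java.sql.Date;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical value object for one row of an assumed "orders" table. */
final class OrderRow {
    final long id;
    final String customerName;
    final Date placedOn;

    OrderRow(long id, String customerName, Date placedOn) {
        this.id = id;
        this.customerName = customerName;
        this.placedOn = placedOn;
    }
}

final class OrderRowMapper {
    private OrderRowMapper() {}

    /** Maps the current row, asking JDBC for each column in its real type. */
    static OrderRow mapRow(ResultSet rs) throws SQLException {
        long id = rs.getLong("id");                      // not Long.parseLong(rs.getString("id"))
        String customerName = rs.getString("customer_name");
        Date placedOn = rs.getDate("placed_on");         // no hand-rolled date parsing
        return new OrderRow(id, customerName, placedOn);
    }
}
```

The copied-idea version would call getString() three times and then have to convert the id and the date by hand, which is where the date parsing problems come from.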

JavaRanch’s statement on Not being a code mill gets into this.  Excerpts from the page: “We are not, however, a code mill and make a point not to give out working code to someone who wants to … dump it into their project without knowing what it does. … that will help you to learn how to solve your problem using best practices.”  I think this is an important point in all levels of reuse.  While copy/paste reuse may appear to work at first, it doesn’t help in the long run.

When the copy/paste code causes defects in the future or needs to be modified, it needs to be understood.  “I did it that way because some other code does” doesn’t make for a helpful reason to the team down the road.

Error Checking Across Three-tiered Systems

For today’s post we’re going to delve into one of the least talked about but extremely common tasks a software developer works with: input validation, or error checking as it’s more commonly called. Input validation is defined as taking what a user enters on screen, verifying it meets certain requirements, and returning a message to the user if it fails validation.

Let’s say you have a web page that requires a user to enter their zip code as part of account creation. What are the possible paths the user might take? Let’s list them:

  • Success: The user enters a 5-digit zip code
  • Error #1: The user leaves the zip code blank
  • Error #2: The user enters a non-number, such as “Hello”
  • Error #3: The user enters a zip code that has fewer than 5 digits such as “93”
  • Error #4: The user enters a zip code that has more than 5 digits such as “9319299”
  • Error #5: The user enters a non-existent zip code such as “00000”

Let’s further add the conditions that the application has been developed using the common three-tiered architecture pattern with a web-based HTML UI, a Java-based application server, and a SQL-based database server.

The first question to ask is, what needs to be validated on which levels?

  • Top Tier: User Interface Validation

Stepping away from zip codes for a second, let’s say you want to know the user’s birthday. You could ask them to enter it in a text field, such as “10/11/1970”, or, more commonly, ask them to use drop-down menus to select the month, the day, and the year. The first type of input, where you give the user a lot of control, such as a text field, is referred to as unstructured input. The second type of input, where you bound the user’s choices to a fixed number of inputs, is referred to as structured input. It should go without saying that structured input is far easier for a developer to work with than unstructured input, since the ‘number of places things can go wrong’ is significantly reduced.

Turning back to the zip code example, a structured input version might have a drop-down of every zip code in the country. That would certainly stop users from entering bad data and eliminate most of the errors above, but there are nearly 43,000 zip codes in the US! The drop-down box would be hard to navigate, not to mention the bandwidth costs of sending every user the list.

For zip code input, we are stuck with unstructured input, but there are ways to reduce the chaotic nature of the input. For example, we could set an HTML maxlength of 5 characters on the text field, preventing the user from typing more than 5 digits and thereby preventing Error #4 altogether.

For Errors #1, #2, and #3, we could use JavaScript to validate the input without ever connecting to the application server. If we have extra time we should certainly implement this validation in JavaScript, given the obvious reduction in server load it would bring. Unfortunately, web browsers are not well-controlled parts of software systems, and users have the freedom to turn JavaScript off. Therefore, no matter how good the front-tier validation is, it’s really only icing on the cake to improve performance and usability; the ‘meat’ of the validation belongs to the middle tier.

  • Middle Tier: Server side validation

As discussed, unless you have 100% control of a front-end application, which I can argue you never have, invalid data can always reach the middle-tier application server. For example, a user could be connecting via a web service or by typing URLs into a browser window. In both of these cases there is no front end to validate the user’s input. Therefore, the primary job of the application server is to provide services that handle all data input from clients and properly store this information in the database.

It follows from this logic, then, that the application server needs some way of reporting errors to its user. For example, if the zip code is entered incorrectly and the problem is discovered on the middle tier, the application server should send a nice, clean message to the user reporting the problem. When a developer forgets to handle this properly, you end up with web pages showing ugly stack traces, which I’m sure most of you have seen from time to time. In those instances, the developer forgot to wrap the error in a user-friendly message. It’s a good practice to put a large ‘catch-all’ around each application server entry point, so that in the event the developer missed taking care of an error, the user sees a generic ‘General System Error’ message. While a generic message such as this may not help the user out, it is far better than having them see a huge stack trace on the screen, which may give them private knowledge of the system such as source code paths and method names.
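
A rough sketch of such a catch-all follows.  The EntryPointGuard and ValidationException names are hypothetical, and a real application would hook this into its web framework’s own error handling rather than a bare Supplier.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class EntryPointGuard {
    private static final Logger LOG = Logger.getLogger(EntryPointGuard.class.getName());
    static final String GENERIC_MESSAGE = "General System Error";

    private EntryPointGuard() {}

    /** Hypothetical exception for errors the user can correct, such as a bad zip code. */
    static final class ValidationException extends RuntimeException {
        ValidationException(String userMessage) {
            super(userMessage);
        }
    }

    /** Wraps one entry point; returns the text that is safe to show the user. */
    static String handle(Supplier<String> action) {
        try {
            return action.get();
        } catch (ValidationException e) {
            // Expected failure: the message was written for the user.
            return e.getMessage();
        } catch (RuntimeException e) {
            // Unexpected failure: keep the stack trace in the server log only.
            LOG.log(Level.SEVERE, "Unhandled error at entry point", e);
            return GENERIC_MESSAGE;
        }
    }
}
```

The stack trace still gets recorded for the developers, but the page only ever shows either a message meant for the user or the generic one.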

You may have noticed I skipped validating Error #5 on the UI tier, and with good reason. Although zip codes in the US may be 5 digits long, not all 5-digit numbers are zip codes (logic 101)! For example, ‘00000’ is not a zip code in any state. In order to validate Error #5, you need a database table listing all possible zip codes to check against. Clearly, this is something that should not be done on the UI side since it would require the download of a long list of zip codes. A further validation might be to take the city and state a user enters and verify they belong to a particular zip code. The problem with such excessive validation is that if your database is less than 100% accurate, users may run into cases in which a valid zip code is declared invalid, or a false positive to use testing terminology.
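
Putting the five error cases together, a middle-tier check might look something like the sketch below.  The class and enum names are invented, and the Set of known zip codes stands in for the database table mentioned above.

```java
import java.util.Optional;
import java.util.Set;

final class ZipCodeValidator {
    /** One value per error path from the list at the top of the post. */
    enum ZipError { BLANK, NOT_NUMERIC, TOO_SHORT, TOO_LONG, UNKNOWN_ZIP }

    // Assumption: stands in for a lookup against a zip code table.
    private final Set<String> knownZipCodes;

    ZipCodeValidator(Set<String> knownZipCodes) {
        this.knownZipCodes = knownZipCodes;
    }

    /** Returns the first problem found, or empty if the zip code is acceptable. */
    Optional<ZipError> validate(String input) {
        if (input == null || input.trim().isEmpty()) {
            return Optional.of(ZipError.BLANK);          // Error #1
        }
        String zip = input.trim();
        if (!zip.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return Optional.of(ZipError.NOT_NUMERIC);    // Error #2
        }
        if (zip.length() < 5) {
            return Optional.of(ZipError.TOO_SHORT);      // Error #3
        }
        if (zip.length() > 5) {
            return Optional.of(ZipError.TOO_LONG);       // Error #4
        }
        if (!knownZipCodes.contains(zip)) {
            return Optional.of(ZipError.UNKNOWN_ZIP);    // Error #5
        }
        return Optional.empty();
    }
}
```

The cheap checks run first, so the lookup only happens for input that already has the right shape.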

  • Bottom Tier: Database validation

The final validation happens in the place where the data ultimately ends up: the database. It is most often accomplished in the form of structured fields or uniqueness constraints. Regardless of whether the input is validated on the front or middle tier, the database ultimately owns the data and its rules cannot be violated.

If the database is so powerful, why not just do all input validation within the database? In the past, people have tried, and the short answer is that performance suffers and it is difficult to maintain. For example, you shouldn’t need to go down all 3 tiers to check that a zip code is a number; that sort of thing can be easily validated on the first two tiers. You do, on the other hand, need to go down all three tiers to check if a username is unique, since that requires knowledge from the database. That doesn’t mean you should just insert a user and wait for the database to fail; you should always check for possible errors ahead of time and catch them gracefully before moving on to the next level.

There are times, though, where the database validation is going to throw errors the other two layers cannot possibly check. For example, let’s say two users try to insert a record at the same time with the same username ‘MrWidget’ and this field is declared UNIQUE in the database. Both users checked ahead of time to see if ‘MrWidget’ was available, found that the username was free, and committed to creating accounts with username ‘MrWidget’. Unfortunately, only one of these users will get the name ‘MrWidget’; the other will get an error message. These race conditions are not very common in most systems, but are something your system should be designed to detect and handle when they do happen. A correct solution here would be to allow the user that submitted first to proceed and display a friendly error message to the second user alerting them the name is no longer available. This is also a good example of where a generic system exception is not going to help the user correct their situation, since the user needs to be told that the username is the problem.
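
One possible shape for handling the ‘MrWidget’ race is sketched below.  AccountRegistration and UserInsert are hypothetical names, and the sketch assumes the JDBC driver reports the UNIQUE violation either as a SQLIntegrityConstraintViolationException or with an SQLState in class 23 (integrity constraint violation); drivers differ, so treat the detection as an assumption to check against your own database.

```java
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

final class AccountRegistration {
    private AccountRegistration() {}

    /** Hypothetical hook for the code that runs the INSERT. */
    interface UserInsert {
        void insert(String username) throws SQLException;
    }

    /** Attempts the insert and turns a lost race into a message the user can act on. */
    static String register(String username, UserInsert insert) {
        try {
            insert.insert(username);
            return "Account created for " + username;
        } catch (SQLException e) {
            if (isConstraintViolation(e)) {
                // Another user took the name between our availability check and the insert.
                return "Sorry, the username '" + username + "' is no longer available.";
            }
            return "General System Error";
        }
    }

    // Class 23 covers every integrity constraint, not only UNIQUE; a fuller
    // version would confirm which constraint was violated.
    private static boolean isConstraintViolation(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLIntegrityConstraintViolationException
                || (state != null && state.startsWith("23"));
    }
}
```

The availability check ahead of time still belongs in the middle tier; this only covers the case where that check passed for both users.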

  • Final Thoughts

We’ve talked a lot about ‘where’ validation needs to take place but not necessarily ‘how’ we should implement it. For large enough systems, there is often a common validation package or method for each type of form submission that verifies the data on both the UI and the middle-tier server. Database validation happens automatically, but keep in mind it’s better to detect problems ahead of time, when you can, than to let the database throw SQL exceptions. Some more advanced approaches, such as Struts, allow you to define basic validation rules in an XML file that can then be used to generate Java form submission validation as well as JavaScript validation automatically. Keep in mind, though, that more advanced validation, like checking whether a username already exists, cannot be accomplished with even these advanced validation techniques and always requires a trip down all three tiers. The purpose of validation is to protect the system, but validation should always be implemented in a way that helps and supports the performance of the system.

EJB3 – annotations vs xml

Scott’s recent post on EJB3 got me thinking about annotations as a “replacement” for XML.

By now, we all know why shoving everything in XML isn’t the best of ideas.  I think shoving everything in Java code is bad too.  In particular, deployment-time concerns (like security) shouldn’t require a recompile.

JEE 5 offers the ability to choose whether to use all annotations, all XML or a mix of annotations and a partial deployment descriptor.  Yet most of the articles and books I’ve seen encourage using annotations for everything and treat the deployment descriptor as some kind of legacy practice or anti-pattern.  This reminds me of the situation where most people agree code in a JSP is bad practice and yet it keeps coming up due to the vast quantity of beginner books with code in the JSP.  From reading the books, it sounds like EJB deployment descriptor XML = bad regardless of whether that is the case in practice.

I do think the annotation approach is fine for the mapping – unless you are developing a common component that will be deployed to different schema definitions.  Most of the time, the schema is stable as Scott noted.

Let’s look at some of the things in an EJB deployment descriptor (for a session bean):

  • Bean Type – This is a coding concern and as such fits well in the Java code.  If my bean changes from stateful to stateless, I should be looking at my code.
  • Security Settings – This is a deployment time concern.  There’s no reason a role change should mandate a redeployment, or that the developers should need to know this information.  The application assembler or deployer could add this in.  As such, the security settings are well suited to being in an XML file.  This also has the advantage that a reusable component provider can supply generic information while the integrating applications specify the XML info specific to their application.
  • Transaction Settings – This one could be argued either way.  If certain settings such as “requires new” are needed, it makes sense to specify them in the code to hint at this.  At the same time, the correct transaction setting could depend on the integrating application.  I think transaction settings could be a use case for specifying a Java annotation and allowing/suggesting that the integrating application override it in XML.
  • Resource References – Resources are both a coding concern and a deployment time concern.  The code certainly cares that the correct resources exist.  And the deployer cares that they are linked correctly.  Luckily, this scenario has existed for years and there already exists an approach.  The reference name and link to the JNDI can be specified as Java annotations since they are coding concerns and likely stable.  The deployer has always been responsible for setting up the correct resource in the JNDI.

None of these are hard and fast rules.  They are just meant to get people thinking about when to use Java annotations vs XML for the deployment descriptors.  We don’t want to just use Java annotations blindly because they are there.

As more people migrate to EJB 3, I think we are going to see something of an “I’m not going to have a deployment descriptor at all now that I can do everything in Java” mentality.  We’ll have to see if it is going to take a swing too far in the other direction (no XML) before people realize some things belong in Java while others belong in XML.