The next critical success factor for usability testing is to observe real behavior.
Leading tasks, highly scripted sessions that place page or layout elements out of context, and lots of conversation with the moderator are some of the more common mistakes I see teams making when they think they're evaluating the usability of an interface objectively.
Instead, what they're doing is taking users on a guided tour of an interface, and in the worst cases even trying to convince the user that their design is good. Even when they do a decent job of being objective, those who run highly scripted, interview-based "usability" sessions are at best gathering small-sample qualitative preference data.
There's a big difference between preference data and behavioral data. Some may have heard a good illustration from NN/g's Kara Pernice: the example of a cappuccino machine. Imagine you are standing in front of a cappuccino machine. It is brand new, and the box and manual have been thrown away. If you wanted to know whether the design of the cappuccino machine was usable, you could assign 20-30 people who had never seen or used it before to walk up and make themselves a cappuccino. If you did this in different regions of the country, or even around the world, you would find there is little regional variation in that behavioral data. You could have confidence that despite your small-sample qualitative methods, you had identified any usability challenges.
However, if your goal really were to find out what flavor of cappuccino people liked -- this one-on-one qualitative method would be all wrong for that kind of preference data. For one thing, you would quickly discover that regional variations across the country and around the world become very important. You would also discover that 20-30 people recruited from a recruiting agency's database are not a statistically valid "sample" of the broader population of cappuccino drinkers. You could over-react when 15 out of your 30 people told you they liked peanut-butter-flavored cappuccino, and your boss would be angry when you went to market trumpeting a new Skippy-flavored brew that didn't sell well.
The tricky part of "user testing" is that it is often funded by the marketing group, or by other product managers who aren't exclusively interested in the time-on-task, error rate, or learnability of a cappuccino interface. Empathizing with their need to be reassured about how well the cappuccino machine is going to sell is important for communicating with them effectively. But if the team is really focused on improving the product and repairing any usability flaws -- you'll need to educate them about methods.
In my current role with an e-commerce retail focus, I regularly have clients come to me with "comps" for a new product page layout or homepage. They tell me they want to do a "usability test" of the new page, or see whether their new images work well and contribute to a purchase decision. There are several "usability" consultancies out there that are happy to take the client's money, use the client's very stilted "usability" test script, bring in a mere 5-10 users, and proceed to "lead the witness."
These consultancies will call it a listening session, a usability group, or whatever the moniker, but by plonking users in front of a single page, outside the context of realistic behavior (in this case, making a real purchase on the wider Web), pointing to a new element, and asking users to talk about whether they like it, whether it is helpful, or to describe its usability, they're unfortunately not learning much that is valid or reliable.
The marketing or product leads will sit behind the mirror and furiously scribble down comments both positive and negative, but as with the cappuccino preference example, they're using a flawed methodology that not only has serious potential to miss real behavioral usability problems, but can also mislead researchers and teams into thinking users prefer one interface element over another.
So how do you avoid this problem?
You do it by watching real behavior, in as non-leading a way as possible. There are always going to be test effects and distortions caused by the fact that we most often study users outside the normal environment of their home or office, on a computer or browser they are not familiar with, in a situation where they know they are being watched, and with a sometimes learned motivation to speak in the animated, adjective-laden style they think will get them invited back to another focus group to make another 100 bucks.
The potential sources of variance for lab-based testing are well known and well-documented. So I try not to pretend to myself, or to my clients, that the lab doesn't impact what we see. But by following some simple rules we can limit those effects as much as possible.

For starters, I try to set an overall goal for users, and then leave the room. It's a bit difficult for users to verbally describe behavior (instead of actually doing things and trying out the design) if there is no one else in the room to talk to. Second, if I'm interested in a particular part of an interface, such as a new informational element on a pharmaceutical company's homepage, or a new larger, interactive, zoomable image module on a retailer's product page, I'm much better off if I can observe user interactions with that element that are natural and un-scripted.
As I'm currently in the retail e-commerce space, I insist that users have the broad goal of making a purchase. While I do have to limit them to one particular website (broader studies of them purchasing, say, a pair of pants without any limit on where they can go would have obvious strengths in terms of learning user behavior patterns with search engines, comparison behaviors between sites, etc.), I don't sit next to them and tell them which pages to click, or stop and point out elements of the interface and say, "Ooh, what do you think of that? Do you like it? How much do you like it on a scale of, say, 1-10?"
Whenever possible, I like to see users interact with fully clickable, functional prototypes or live sites. Again, in an e-commerce context I like them to be using their own credit card, making selections and actually purchasing, so that they know the items really will be shipped and arrive on their doorstep.
You'd be amazed at the difference in behavior between a user who is "pretending" to shop and one who knows the item they're evaluating will either have to be used, worn, or shipped back via the hassle of a return.
After only a few minutes, I find that users forget I'm even on the other side of the mirror or watching on a dual screen monitor.
As a result, when 20-30 or so users arrive on the homepage the team wanted tested, or land on a product page with the new image "zoom" functionality, I get to see 1) what other elements of the page or overall site they use to solve their problems, 2) at what point in their process they do interact with the new element, and 3) for how long.
Because we use eyetracking technology, I'm able to watch their eyegaze in real-time from behind the mirror and understand intra-page navigation.
Now, dear reader, I suspect you're going to ask -- what if they don't interact with my new interface element, the new whiz-bang thing that the person paying for the study so desperately wants feedback on?
Well, sometimes that does happen. And that, of course, is instructive in and of itself. If 30 folks come in with the goal of making a real purchase, and zero of them use the new zoom feature (that is supposed to help them choose between products), that should give the team pause. But despite my commitment to natural, non-scripted user behaviors, I'm of course a big fan of the good old-fashioned debrief. After we've observed a natural purchase, we then transition to having users talk us through what they've done -- since we've already seen the natural behavior, we don't risk altering or influencing what they do by asking follow-up or probing questions.
And if, during the natural user-guided portion of the session, the user didn't interact with an important element, I'm happy to assign a moderator-contrived task during the debrief, or "prompt" them to notice something and try interacting with it so I can seek feedback. Although I know I'm leading the user at that point, I'm at least able to place their comments and behaviors in the context of the more natural behavior I have just observed. Again, I am likely to get some preference data, but at least it's not all I'm getting out of the study.
So to sum up, watching real behavior is a critical success factor for effective usability evaluation. Instead of tightly scripted, moderator-contrived tasks (with me sitting close to the user and breathing down their neck), assigning a broad goal and letting the user "do their thing" is more likely to uncover unexpected problems and give us confidence as to whether the design really works. Leaving the room can often help users relax and start "doing," instead of merely talking about doing. As Jakob Nielsen has said, what users do, versus what users say they do, can be very different things.